WP65: A Summary
The WP65 compilation scheme was developed in 1990 to provide the
programmer of Transputer networks with a single address space. It is
based on a shared-memory emulation that uses one half of the target
machine's processors for pure computation, and the other half for data
management. It is equivalent to a full-software cache.
The INMOS T9000 processor and C104 hardware router provided the
necessary hardware support for multithreaded computation and
communication, and for uniform point-to-point communication. The
hardware is used efficiently when the memory emulation is statically
compiled and when computation and communication can overlap.
The project was funded by Esprit project 2701 (PUMA, Work Package 6.5)
and by DRET, and was developed by Corinne ANCOURT and
François IRIGOIN.
Input
- Fortran 77 with one main program.
- Set of loop nests (possibly non-perfectly nested).
- Affine loop bounds.
- Neither guards nor calls.
- Indirections are supported, but affine array subscripts
are needed for efficiency (a sample input nest is sketched below).
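As an illustration, a nest that meets these restrictions could look as
follows (a hypothetical example, not taken from the project's test
suite): affine loop bounds, affine subscripts, no guards, no calls.

      PROGRAM EXMPL
      REAL A(100,100), B(100,100)
      INTEGER I, J
C     Initialize the input array
      DO 5 I = 1, 100
         DO 5 J = 1, 100
            B(I,J) = REAL(I+J)
    5 CONTINUE
C     Candidate nest: affine loop bounds, affine array
C     subscripts, no guards and no calls
      DO 10 I = 2, 99
         DO 10 J = 2, 99
            A(I,J) = 0.25 * (B(I-1,J) + B(I+1,J)
     &                     + B(I,J-1) + B(I,J+1))
   10 CONTINUE
      PRINT *, A(50,50)
      END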
Compilation
- Implicit distribution of data over the software memory banks
(general block-cyclic; see the mapping sketch after this list)
- Task generation based on control partitioning
- Legal tiling of loop nests chosen to parallelize the code
- Use of the dependence graph for copy allocation
- Partial loop distribution...
- Default tile size used.
- Code generation for the tiles over the processors (send, receive
and compute codes).
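To make the block-cyclic distribution concrete, here is a minimal
sketch of the standard mapping for a one-dimensional array, assuming
block size L and NB banks (the function names and the demo driver are
illustrative, not WP65's actual internals):

      PROGRAM BCDEMO
C     Print the bank and local offset of the first ten array
C     elements for block size L = 2 over NB = 4 banks
      INTEGER I, IBANK, ILOC
      DO 10 I = 1, 10
         PRINT *, I, IBANK(I, 2, 4), ILOC(I, 2, 4)
   10 CONTINUE
      END

      INTEGER FUNCTION IBANK(I, L, NB)
C     Bank holding element I (1-based) of a 1-D array under a
C     block-cyclic distribution with block size L over NB banks
      INTEGER I, L, NB
      IBANK = MOD((I - 1) / L, NB)
      END

      INTEGER FUNCTION ILOC(I, L, NB)
C     Local offset (0-based) of element I within its bank
      INTEGER I, L, NB
      ILOC = ((I - 1) / (L * NB)) * L + MOD(I - 1, L)
      END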
Output
- Two complementary SPMD Fortran 77 programs:
  - COMPUTE(PROC_ID) for the compute processors
  - BANK(BANK_ID) for the memory bank processors
- Both programs have a similar per-tile structure (see the skeletons
after this list):
  - Bank-to-compute-processor communications
    (software cache prefetch)
  - Computations on the compute processors
  - Compute-processor-to-bank communications
    (software cache flush)
- Main difficulties:
  - The data needed by one tile are often spread across
    different banks
  - Memory coherence requires that only written data be flushed
  - Hence, complex code with guards may be generated for the
    communications
  - Local allocation and addressing must be selected and
    computed on the processors
- PVM 3.3 was used as the run-time system to test correctness
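A minimal structural sketch of the two generated programs follows. It
shows only the per-tile prefetch/compute/flush phases; the called
routines (RECVIN, TILE, SENDWR, SENDIN, RECVWR) are placeholders for
the generated code and PVM 3.3 message passing, not actual WP65 or
PVM names.

      SUBROUTINE COMPUTE(PROCID, NT)
C     Run by every compute processor; NT is the number of
C     tiles assigned to it
      INTEGER PROCID, NT, T
      DO 10 T = 1, NT
C        Software cache prefetch: receive the tile's input
C        data from the memory banks
         CALL RECVIN(PROCID, T)
C        Execute the tile's computations locally
         CALL TILE(PROCID, T)
C        Software cache flush: send back only the data the
C        tile has written
         CALL SENDWR(PROCID, T)
   10 CONTINUE
      END

      SUBROUTINE BANK(BANKID, NT)
C     Run by every memory bank processor; mirrors COMPUTE
      INTEGER BANKID, NT, T
      DO 10 T = 1, NT
C        Serve the prefetch: send this bank's share of the
C        tile's input data
         CALL SENDIN(BANKID, T)
C        Serve the flush: receive and store the written data
         CALL RECVWR(BANKID, T)
   10 CONTINUE
      END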
Features
- No static or dynamic explicit data partitioning...
- More general than the owner-computes rule
- Memory server and computation processes can each be put on a different processor and/or multiplexed:
- to control communication bandwidth,
- to reduce the number of context switches
- Load balancing: any process can be started on any processor...
- Parallelism granularity is easier to tune
since it is independent of the data distribution.
- Communications and computations may be performed on processors of
different kinds (heterogeneous machines).
- Although a full-software cache cannot be fully statically compiled,
regular loops can exploit the underlying INMOS hardware
very efficiently.
URL: http://www.cri.mines-paristech.fr/pips/wp65.html