WP65: A Summary
The WP65 compilation scheme was developed in 1990 to provide the
programmer of Transputer networks with a single address space. It is
based on a shared-memory emulation that uses one half of the target
machine's processors for pure computation, and the other half for data
management. It is equivalent to a full-software cache.
The INMOS T9000 processor and C104 hardware router provided the
necessary hardware support for multithreaded computation and
communication, and for uniform point-to-point communication. The
hardware is used efficiently when the memory emulation is statically
compiled and when computation and communication can overlap.
The project was funded by Esprit project 2701 (PUMA, Work Package 6.5)
and by DRET, and was developed by Corinne ANCOURT and
François IRIGOIN.
Input
- Fortran 77 with one main program.
- Set of loop nests (possibly non-perfectly nested).
- Affine loop bounds.
- Neither guards nor calls.
- Indirections are supported, but affine array subscripts
are needed for efficiency (a sample input nest is sketched below).
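As an illustration, a nest that meets these restrictions could look as
follows (a hypothetical example, not taken from the project's test
suite): affine loop bounds, affine subscripts, no guards, no calls.

      PROGRAM EXMPL
      REAL A(100,100), B(100,100)
      INTEGER I, J
C     Initialize the input array
      DO 5 I = 1, 100
         DO 5 J = 1, 100
            B(I,J) = REAL(I+J)
    5 CONTINUE
C     Candidate nest: affine loop bounds, affine array
C     subscripts, no guards and no calls
      DO 10 I = 2, 99
         DO 10 J = 2, 99
            A(I,J) = 0.25 * (B(I-1,J) + B(I+1,J)
     &                     + B(I,J-1) + B(I,J+1))
   10 CONTINUE
      PRINT *, A(50,50)
      END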
Compilation
- Implicit distribution of data over the software memory banks
(general block-cyclic; see the mapping sketch after this list)
- Task generation based on control partitioning
- Legal tiling of loop nests chosen to parallelize the code
- Use of the dependence graph for copy allocation
- Partial loop distribution...
- Default tile size used.
- Code generation for the tiles over the processors (send, receive
and compute codes).
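To make the block-cyclic distribution concrete, here is a minimal
sketch of the standard mapping for a one-dimensional array, assuming
block size L and NB banks (the function names and the demo driver are
illustrative, not WP65's actual internals):

      PROGRAM BCDEMO
C     Print the bank and local offset of the first ten array
C     elements for block size L = 2 over NB = 4 banks
      INTEGER I, IBANK, ILOC
      DO 10 I = 1, 10
         PRINT *, I, IBANK(I, 2, 4), ILOC(I, 2, 4)
   10 CONTINUE
      END

      INTEGER FUNCTION IBANK(I, L, NB)
C     Bank holding element I (1-based) of a 1-D array under a
C     block-cyclic distribution with block size L over NB banks
      INTEGER I, L, NB
      IBANK = MOD((I - 1) / L, NB)
      END

      INTEGER FUNCTION ILOC(I, L, NB)
C     Local offset (0-based) of element I within its bank
      INTEGER I, L, NB
      ILOC = ((I - 1) / (L * NB)) * L + MOD(I - 1, L)
      END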
Output
- Two complementary SPMD Fortran 77 programs:
  - COMPUTE(PROC_ID) for the compute processors
  - BANK(BANK_ID) for the memory bank processors
- Both programs have a similar per-tile structure (see the skeletons
after this list):
  - Bank-to-compute-processor communications
    (software cache prefetch)
  - Computations on the compute processors
  - Compute-processor-to-bank communications
    (software cache flush)
- Main difficulties:
  - The data needed by one tile are often spread across
    different banks
  - Memory coherence requires that only written data be flushed
  - Hence, complex code with guards may be generated for the
    communications
  - Local allocation and addressing must be selected and
    computed on the processors
- PVM 3.3 was used as the run-time system to test correctness
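A minimal structural sketch of the two generated programs follows. It
shows only the per-tile prefetch/compute/flush phases; the called
routines (RECVIN, TILE, SENDWR, SENDIN, RECVWR) are placeholders for
the generated code and PVM 3.3 message passing, not actual WP65 or
PVM names.

      SUBROUTINE COMPUTE(PROCID, NT)
C     Run by every compute processor; NT is the number of
C     tiles assigned to it
      INTEGER PROCID, NT, T
      DO 10 T = 1, NT
C        Software cache prefetch: receive the tile's input
C        data from the memory banks
         CALL RECVIN(PROCID, T)
C        Execute the tile's computations locally
         CALL TILE(PROCID, T)
C        Software cache flush: send back only the data the
C        tile has written
         CALL SENDWR(PROCID, T)
   10 CONTINUE
      END

      SUBROUTINE BANK(BANKID, NT)
C     Run by every memory bank processor; mirrors COMPUTE
      INTEGER BANKID, NT, T
      DO 10 T = 1, NT
C        Serve the prefetch: send this bank's share of the
C        tile's input data
         CALL SENDIN(BANKID, T)
C        Serve the flush: receive and store the written data
         CALL RECVWR(BANKID, T)
   10 CONTINUE
      END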
Features
- No static or dynamic explicit data partitioning...
- More general than the owner-computes rule
- Memory server and computation processes can each be put on a different processor and/or multiplexed:
- to control communication bandwidth,
- to reduce the number of context switches
- Load balancing: any process can be started on any processor...
- Parallelism granularity is easier to tune
since it is independent of the data distribution.
- Communications and computations may be performed on processors of
different kinds (heterogeneous machines).
- Although a full-software cache cannot be fully statically compiled,
regular loops can exploit the underlying INMOS hardware
very efficiently.
URL: http://www.cri.mines-paristech.fr/pips/wp65.html