Distributed Code Generation - The WP65 Scheme

Last update: 95/09/1 - Author: Corinne ANCOURT

This phase was designed to automatize the generation of Fortran 77 program onto distributed memory machines using an Emulated Shared Memory scheme and to exploit the universal message passing capability provided by the INMOS T9000 processor and C104 hardware router. One half of the processors perform computations and the other half emulate memory banks providing the compiler with a better understood target machine, a multiprocessor with a fast local memory managed as a software cache and a slow shared memory. The fast context switching times and intelligent on-chip channel processors make possible to overlap computations and communications when T9000 and C104 are used.

This work was partially funded by ESPRIT project 2701 (PUMA - WorkPackage 6.5) and by DRET.

Input

This phase takes as input a sequential Fortran77 program meeting the following conditions:

it is made of a single main program whose body is made of a set of perfectly or non-perfectly nested loop(s) whose upper and lower bounds are affine expressions and assignments ;
it may contain indirections but neither guards nor calls and I/Os.

WP65 Approach

Task generation is based on control partitioning. The data dependence graph between program instructions is used to build parallel tasks. Loop transformations like tiling transformation and distribution are used on nested loops in order to define blocks of loop iterations that can be computed in parallel.

The dependence graph is used to decide if a given tiling is legal. The current implementation does not include an automatic estimation of the tile size and a default size is used.

Each tile is seen as a logically independent task. Each task is made of three parts: a prologue to read the input data from the emulated shared memory, a computational part and a final part to store the results. Ideally several tasks should be executed by the same physical processor to overlap communications and computations.

Output

A 2-PMD distributed Fortran77 program containing calls to the runtime communication library PVM is generated. The input program is transformed into two subroutines:

one subroutine, called COMPUTE(PROC_ID), contains the computational part of the code and receives a (logical) processor number as parameter;
one subroutine, called BANK(BANK_ID), contains the shared memory emulator part of the code and receives a (logical) bank number as parameter.

The general structures of these two subroutines are very close since each send (receive) must be met by a corresponding receive (send). Like the input program they are sequences of nested loop. The outermost loop nest defines which tile is being executed. Each tile body is made of two or three sections:

a sequence of nested loops used to load the different local copies from the emulated shared memory; this sequence may be empty when there is no input, as is the case with a simple array initialization; it can be viewed as a software cache prefetch.
a sequence of nested loops to execute all iterations of the current tile; the loop body is a copy of the initial loop body but references to user variables are replaced by references to local variables; this section is empty in the memory emulator code;
a sequence of nested loops used to store the different local copies in the emulated shared memory, i.e. a cache flush.

Features

The potential advantages of this approach are:

implicit data distribution based on control partitioning, which means there is no need for static or dynamic explicit data partitioning;
efficient handling of more general constructs than with the owner rule (e.g. Fortran HPF compilation);
memory servers do not have to be run on the same processors as computations; this increases the communication bandwidth and decreases the number of context switches on computational processors;
easier load balancing because any process can be started on any processor;
parallelism granularity can be easily tuned because it is not implicitly linked to a data distribution; this also may improve load balancing.

The obvious disavantage is that a full software cache cannot be fully statically compiled. However regular code can exploit the underlying INMOS hardware very efficiently.

More information

A full description of the approach and examples are given in [3].

To run the WP65 phase with wpips, ask the distributed view.

e-mail: ancourt-at-cri.mines-paristech.fr and pips-support-at-cri.mines-paristech.fr

References

1: C. Ancourt, Generation automatique de codes de transfert pour multiprocesseurs a memoires locales, These de doctorat, Universite Paris 6, 1990
2: C. Ancourt, F. Irigoin, Scanning Polyhedra with DO loops, Third ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'91), Williamsburg, 1991
3: C. Ancourt, F. Irigoin, Automatic Code Distribution, : The Third Workshop on Compilers for Parallel Computers (CPC'92), Vienna, Austria, July 6-9, 1992.
4: F. Irigoin, R. Triolet, Supernode Partitioning, ACM Symposium on Principles of Programming Languages, San-Diego, 1988
5: F. Irigoin, P. Jouvelot, R. Triolet, Semantical Interprocedural Parallelization: An Overview of the PIPS Project, ACM International Conference on Supercomputing (ICS'91), Cologne, 1991
6: L. Zhou, Enhanced Static Evaluation of Fortran Programs in the PIPS Environment, Tech. Report E/160/CRI, Ecole des Mines de Paris, 1991