Last update: 95/09/1 - Author: Corinne ANCOURT
This phase was designed to automate the compilation of Fortran 77
programs onto distributed memory machines using an Emulated Shared
Memory scheme and to exploit the universal message passing
capability provided by the INMOS T9000 processor and the C104 hardware
router. One half of the processors perform computations while the other
half emulates memory banks, providing the compiler with a better
understood target machine: a multiprocessor with a fast local memory
managed as a software cache and a slow shared memory. The fast context
switching times and the intelligent on-chip channel processors make it
possible to overlap computations and communications when the T9000 and
C104 are used.
This work was partially funded by ESPRIT project 2701 (PUMA
- WorkPackage 6.5) and by DRET.
This phase takes as input a sequential Fortran 77 program meeting the following conditions:
Task generation is based on control partitioning. The data
dependence graph between program instructions is used to build parallel
tasks. Loop transformations such as tiling and loop
distribution are applied to nested loops in order to define blocks of
loop iterations that can be computed in parallel.
The dependence graph is used to decide whether a given tiling is legal. The current implementation does not include an automatic estimation of the tile size, so a default size is used.
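For instance, here is a minimal hand-written sketch of the effect of tiling on a one-dimensional loop; the names N, TS, A and B are purely illustrative and are not produced by the phase:

      INTEGER N, TS, I, IT
      PARAMETER (N = 1000, TS = 64)
      DOUBLE PRECISION A(N), B(N)

C     Original sequential loop.
      DO 10 I = 1, N
         A(I) = A(I) + B(I)
 10   CONTINUE

C     Tiled loop: the outer loop enumerates tiles of size TS and the
C     inner loop enumerates the iterations of one tile.  No dependence
C     crosses a tile boundary here, so the tiling is legal and
C     distinct tiles can be computed in parallel.
      DO 20 IT = 1, N, TS
         DO 30 I = IT, MIN(IT + TS - 1, N)
            A(I) = A(I) + B(I)
 30      CONTINUE
 20   CONTINUE
      END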
Each tile is seen as a logically independent task. Each task is made of three parts: a prologue to read the input data from the emulated shared memory, a computational part, and a final part to store the results (sketched below). Ideally, several tasks should be executed by the same physical processor to overlap communications and computations.
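A purely illustrative sketch of one such task follows; LOAD_TILE and STORE_TILE are hypothetical placeholders for the copy code actually generated by the phase, not real PIPS or PVM routines:

      SUBROUTINE TASK(IT)
C     One task corresponds to one tile, identified by IT.
      INTEGER IT, I, TS
      PARAMETER (TS = 64)
      DOUBLE PRECISION AL(TS)
C     Prologue: read the tile's input data from the emulated shared
C     memory into the local copy AL, used as a software cache line.
      CALL LOAD_TILE(IT, AL, TS)
C     Computational part: operate only on the local copy.
      DO 10 I = 1, TS
         AL(I) = 2.0D0 * AL(I)
 10   CONTINUE
C     Final part: store the results back to the emulated shared memory.
      CALL STORE_TILE(IT, AL, TS)
      END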
A 2-PMD distributed Fortran 77 program containing calls to the
PVM runtime communication library is generated. The input program is
transformed into two subroutines:
COMPUTE(PROC_ID) contains the computational part of the code and receives a (logical) processor number as a parameter;
BANK(BANK_ID) contains the shared memory emulator part of the code and receives a (logical) bank number as a parameter.
The general structures of these two subroutines are very close, since each send (receive) must be met by a corresponding receive (send). Like the input program, they are sequences of nested loops. The outermost loop nest defines which tile is being executed. Each tile body is made of two or three sections:
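A minimal sketch of this matched send/receive structure for a single data transfer is given below, assuming the PVM 3 Fortran interface (fpvm3.h). The task-id arrays BANKTID and PROCTID, the common blocks and the message tags are hypothetical; the real generated code loops over all tiles and packs the exact array regions needed by each tile.

      SUBROUTINE COMPUTE(PROC_ID)
      INTEGER PROC_ID
      INCLUDE 'fpvm3.h'
C     BANKTID maps a logical bank number to a PVM task id
C     (hypothetical, assumed to be set up at start-up).
      INTEGER BANKTID(8), BUFID, INFO
      COMMON /WP65B/ BANKTID
      DOUBLE PRECISION XLOC(64)
C     Prologue: receive the tile's input data from memory bank 1
C     (message tag 1).
      CALL PVMFRECV(BANKTID(1), 1, BUFID)
      CALL PVMFUNPACK(REAL8, XLOC, 64, 1, INFO)
C     ... computational part on the local copy XLOC ...
C     Final part: send the results back to the emulated shared
C     memory (message tag 2).
      CALL PVMFINITSEND(PVMDEFAULT, BUFID)
      CALL PVMFPACK(REAL8, XLOC, 64, 1, INFO)
      CALL PVMFSEND(BANKTID(1), 2, INFO)
      END

      SUBROUTINE BANK(BANK_ID)
      INTEGER BANK_ID
      INCLUDE 'fpvm3.h'
C     PROCTID maps a logical processor number to a PVM task id
C     (hypothetical, assumed to be set up at start-up).
      INTEGER PROCTID(8), BUFID, INFO
      COMMON /WP65P/ PROCTID
      DOUBLE PRECISION X(64)
C     Matching send: serve the tile's input data to processor 1.
      CALL PVMFINITSEND(PVMDEFAULT, BUFID)
      CALL PVMFPACK(REAL8, X, 64, 1, INFO)
      CALL PVMFSEND(PROCTID(1), 1, INFO)
C     Matching receive: store the results produced by processor 1.
      CALL PVMFRECV(PROCTID(1), 2, BUFID)
      CALL PVMFUNPACK(REAL8, X, 64, 1, INFO)
      END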
The potential advantages of this approach are:
The obvious disadvantage is that a full software cache cannot be fully statically compiled. However, regular code can exploit the underlying INMOS hardware very efficiently.
A full description of the approach and examples are given in [3].
To run the WP65 phase with wpips, ask for the distributed view.
e-mail: ancourt-at-cri.mines-paristech.fr and pips-support-at-cri.mines-paristech.fr