The compilation is a sequence of five phases. The first phase checks that the imput is a static control program, i.e., fulfills some restrictions. The second phase builds an array data flow graph (DFG). The third one constructs a scheduling function compatible with the DFG. This function minimizes the execution time for an unbounded PRAM. The fourth one computes a mapping value for a physical distributed memory parallel machine. Finally, the last phase generates the parallel code.
Although this compilation scheme is implemented in PIPS, it is NOT interprocedural yet.
To each node of the DFG are associated an instruction and an execution domain. The execution domain is a set of constraints specifying the variation domain of the surrounding loop indices of the instruction.
Nodes are connected by arcs which represent true data dependences. To each edge of the DFG are associated the reference of the use dependence, a transformation function in which each index of the definition iteration domain is expressed as a linear function of the indices of the use iteration domain, and a governing predicate that specifies the sub-space of the sink's iteration domain on which the edge exists (it is a system of constraints upon the indices of the sink's iteration domain).
For a given time t, the scheduling function defines a set of independent operations which can be executed simultaneously. This set is called a front. The execution of two successive fronts must be sequential.
Even with the static control restrictions, it is not possible to always have a unidimensional linear schedule. Typically, this happens when there are two or more sequential loops in the same loop nest. In that case, a multidimensional linear schedule is computed [Fea92b].
Each dimension of the placement function defines a placement direction for the instruction and all these directions constitute the distribution space of the target machine. The number of dimensions to compute is arbitrary with a maximum equal to the dimension of iteration space minus the dimension of the scheduling function.
Initially, the goal of the algorithm that computes the placement function was to reduce the number of communications. For an edge of the DFG, there is a potential communication between its source and its destination, which can be represented by a distance. If such a distance is equal to zero (the edge is ``cut'') then the source and the destination will be mapped onto the same processor, and there will be no communication at all.
The principle of the method is to nullify as many distances as possible. To each edge is associated a cutting condition represented by a system of equalities. The length of an edge is nullified if its cutting condition is satisfied, i.e. its equalities are satisfied. These equalities corresponds to the nullification of the factors of the variables (loop indices and structure parameters) appearing in the distance. In most cases, satisfying the cutting conditions of all the edges will lead to the trivial solution: some null dimensions, i.e., all the occurences of the instruction are placed on the same processor. To avoid this, some edges should not be cut. To choose them, a greedy algorithm has been proposed [Fea93], it treats the edges by decreasing importance. The importance of an edge is represented by its weight which is equal to the volume of data which has to be sent if the edge is not cut.
This method does not take into account the type (and hence the temporal cost) of the communications. To that purpose, an extension of the method has been implemented. It is based on a special treatment of potential communications that can be optimized [Pla95b], for instance spread, reduction...
do t = 1, number of fronts execute concurrently all operations in F(t) synchronize end doThe code generation is based upon three transformations:
In PIPS, the generated parallel code can be either CM Fortran (for the CM-5) or CRAFT Fortran (for the Cray-T3D).
[CF93] J.-F. Collard and P. Feautrier. Automatic generation of data parallel code. In Fourth International Workshop on Compilers for Parallel Computers, Delft University of Technology, The Netherlands, December 1993.
[Fea88] P. Feautrier. Array expansion. in ACM Int. Conf. on Supercomputing, Saint-Malo, jul 1988.
[Fea91] P. Feautrier. Dataflow Analysis of Array and Scalar References. Int. Journal of Parallel Programming, 20(1):23-53, February 1991.
[Fea92a] P.Feautrier. Some Efficient Solutions to the Affine Scheduling Problem, Part I : One-dimensional Time. Int. J. of Parallel Programming, 21(5):313-348, October 1992.
[Fea92b] P. Feautrier. Some Efficient Solutions to the Affine Scheduling Problem, Part II : Multidimensional Time. Int. J. of Parallel Programming, 21(6):389-420, December 1992.
[Fea93] P. Feautrier. Toward Automatic Partitioning of Arrays on Distributed Memory Computers. In ACM ICS'93, pages 175-184, Tokyo, July 1993.
[Pla95a] A. Platonoff. Contribution à la distribution Automatique des Données pour Machines Massivement Parallèles. Thèse de Doctorat de l'Université P. et M. Curie, March 9th 1995.
[Pla95b] A. Platonoff. Automatic Data Distribution for Massively Parallel Computers. In 5th International Workshop on Compilers for Parallel Computers, Malaga University, Spain, June 1995.