\chapter{Controlling the code generation}

Several options of the \faust compiler make it possible to control the generated C++ code. By default the computations are done sample by sample in a single loop, but the compiler can also generate \textit{vector} and \textit{parallel} code.

\section{Vector Code generation}
Modern C++ compilers are able to do autovectorization, that is, to use SIMD instructions to speed up the code. These instructions can typically operate in parallel on short vectors of 4 single-precision floating-point numbers, thus leading to a theoretical speedup of $\times4$.
Autovectorization of C/C++ programs is a difficult task. Current compilers are very sensitive to the way the code is arranged. In particular, loops that are too complex can prevent autovectorization. The goal of the vector code generation is to rearrange the C++ code in a way that facilitates the autovectorization job of the C++ compiler. Instead of generating a single sample computation loop, it splits the computation into several simpler loops that communicate through vectors.

The vector code generation is activated by passing the \lstinline!--vectorize! (or \lstinline!-vec!) option to the \faust compiler. Two additional options are available: \lstinline!--vec-size <n>! controls the size of the vector (by default 32 samples) and \lstinline!--loop-variant 0/1! gives some additional control on the loops.

To illustrate the difference between scalar code and vector code, let's take the computation of the RMS (Root Mean Square) value of a signal. Here is the \faust code that computes the Root Mean Square of a sliding window of 1000 samples:
\label{rms}
\begin{lstlisting}
// Root Mean Square of n consecutive samples
RMS(n) = square : mean(n) : sqrt ;

// Square of a signal
square(x) = x * x ;

// Mean of n consecutive samples of a signal
// (uses fixed point to avoid the accumulation of
// rounding errors)
mean(n) = float2fix : integrate(n) :
          fix2float : /(n);

// Sliding sum of n consecutive samples
integrate(n,x) = x - x@n : +~_ ;

// Conversion between float and fixed point
float2fix(x) = int(x*(1<<20));
fix2float(x) = float(x)/(1<<20);

// Root Mean Square of 1000 consecutive samples
process = RMS(1000) ;
\end{lstlisting}

The \lstinline!compute()! method generated in scalar mode is the following:

\begin{lstlisting}
virtual void compute (int count,
                      float** input,
                      float** output)
{
  float* input0 = input[0];
  float* output0 = output[0];
  for (int i=0; i<count; i++) {
    float fTemp0 = input0[i];
    int iTemp1 = int(1048576*fTemp0*fTemp0);
    iVec0[IOTA&1023] = iTemp1;
    iRec0 = ((iVec0[IOTA&1023] + iRec0)
                - iVec0[(IOTA-1000)&1023]);
    output0[i] = sqrtf(9.536744e-10f *
                       float(iRec0));
    // post processing
    iRec0 = iRec0;
    IOTA = IOTA+1;
  }
}
\end{lstlisting}
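
As a quick check of the constants in this listing (this is a reading of the generated code, not a statement taken from the compiler sources): $1048576 = 2^{20}$ is the scaling factor of \lstinline!float2fix!, and the factor applied before \lstinline!sqrtf! appears to combine \lstinline!fix2float! with the division by $n = 1000$:
\[
\frac{1}{2^{20} \times 1000} \approx 9.5367 \times 10^{-10}
\]
Likewise, the mask 1023 corresponds to a circular buffer of $1024 = 2^{10}$ samples, the smallest power of two larger than the 1000-sample delay.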

The \lstinline!-vec! option leads to the following reorganization of the code:
\begin{lstlisting}
virtual void compute (int fullcount,
                      float** input,
                      float** output)
{
  int iRec0_tmp[32+4];
  int* iRec0 = &iRec0_tmp[4];
  for (int index=0; index<fullcount; index+=32)
  {
    int count = min (32, fullcount-index);
    float* input0 = &input[0][index];
    float* output0 = &output[0][index];
    for (int i=0; i<4; i++)
      iRec0_tmp[i]=iRec0_perm[i];
    // SECTION : 1
    for (int i=0; i<count; i++) {
      iYec0[(iYec0_idx+i)&2047] =
               int(1048576*input0[i]*input0[i]);
    }
    // SECTION : 2
    for (int i=0; i<count; i++) {
      iRec0[i] = ((iYec0[(iYec0_idx+i)&2047]
                   + iRec0[i-1]) -
               iYec0[(iYec0_idx+i-1000)&2047]);
    }
    // SECTION : 3
    for (int i=0; i<count; i++) {
      output0[i] = sqrtf((9.536744e-10f *
                 float(iRec0[i])));
    }
    // SECTION : 4
    iYec0_idx = (iYec0_idx+count)&2047;
    for (int i=0; i<4; i++)
      iRec0_perm[i]=iRec0_tmp[count+i];
  }
}
\end{lstlisting}

While the second version of the code is more complex, it turns out to be much easier for the C++ compiler to vectorize efficiently. Using Intel icc 11.0, with the exact same compilation options (\texttt{-O3 -xHost -ftz -fno-alias -fp-model fast=2}), the scalar version reaches a throughput of 129.144 MB/s, while the vector version achieves 359.548 MB/s, a speedup of $\times2.8$.

\begin{figure}[htb]
\centering
\includegraphics[scale=0.75]{images/compiler-stack}
\caption{\faust's stack of code generators}
\label{fig:stack}
\end{figure}

The vector code generation is built on top of the scalar code generation (see Figure \ref{fig:stack}). Every time an expression needs to be compiled, the compiler checks whether it requires a separate loop, using a few simple rules. Expressions that are shared (and are complex enough) are good candidates to be compiled in a separate loop, as are recursive expressions and expressions used in delay lines.

The result is a directed graph in which each node is a computation loop (see Figure \ref{fig:loopgraph}). This graph is stored in the klass object and a topological sort is applied to it before printing the code.

\begin{figure}[htb]
\centering
\includegraphics[scale=0.75]{graphs/loopgraph2}
\caption{The result of the -vec option is a directed acyclic graph (DAG) of small computation loops}
\label{fig:loopgraph}
\end{figure}

\section{Parallel Code generation}

The parallel code generation is activated by passing either the \lstinline!--openMP! (or \lstinline!-omp!) option or the \lstinline!--scheduler! (or \lstinline!-sch!) option. It implies the \lstinline!-vec! option, as the parallel code generation is built on top of the vector code generation.

\subsection{The OpenMP code generator}

\begin{figure}[htb]
\centering
\includegraphics[scale=0.5,angle=-90]{images/openmp-model}
\caption{OpenMP is based on a fork-join model}
\label{fig:openmp}
\end{figure}

The \lstinline!--openMP! (or \lstinline!-omp!) option given to the \faust compiler will insert appropriate OpenMP directives in the C++ code. OpenMP (http://www.openmp.org) is a well-established API that is used to explicitly direct multi-threaded, shared-memory parallelism. It is based on a fork-join model of parallelism (see Figure \ref{fig:openmp}).
Parallel regions are delimited by \lstinline!#pragma omp parallel! constructs. At the entrance of a parallel region a team of parallel threads is activated. The code within a parallel region is executed by each thread of the parallel team until the end of the region.

\begin{lstlisting}
#pragma omp parallel
{
  // the code here is executed simultaneously by
  // every thread of the parallel team
  ...
}
\end{lstlisting}

To avoid having every thread redundantly do the exact same work, OpenMP provides specific \textit{work-sharing} directives. For example, \lstinline!#pragma omp sections! makes it possible to break the work into separate, discrete sections, each section being executed by one thread:

\begin{lstlisting}
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    {
      // job 1
    }
    #pragma omp section
    {
      // job 2
    }
    ...
  }

  ...
}
\end{lstlisting}

As said before, the parallel code generation is built on top of the vector code generation. The graph of loops produced by the vector code generator is topologically sorted in order to detect the loops that can be computed in parallel. The first set $S_0$ (loops $L1$, $L2$ and $L3$ in the DAG of Figure \ref{fig:loopgraph}) contains the loops that don't depend on any other loops, the set $S_1$ contains the loops that only depend on loops of $S_0$ (that is, loops $L4$ and $L5$), etc.
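
The grouping into sets $S_0$, $S_1$, \ldots{} can be sketched as follows. This is only an illustration of the idea and not the code used by the \faust compiler: the type \lstinline!LoopGraph!, the functions \lstinline!levelOf! and \lstinline!parallelSets!, and the encoding of the graph as lists of dependencies are all assumptions made for this example. A loop is placed in $S_0$ if it has no dependencies, and otherwise one set after the deepest loop it depends on.

\begin{lstlisting}
#include <algorithm>
#include <cstddef>
#include <vector>

// deps[i] lists the loops that loop i depends on
// (hypothetical encoding, the graph must be acyclic).
typedef std::vector<std::vector<int> > LoopGraph;

// Index n of the set S_n containing 'loop': 0 without
// dependencies, else 1 + the deepest dependency.
static int levelOf(int loop, const LoopGraph& deps,
                   std::vector<int>& level)
{
  if (level[loop] >= 0) return level[loop]; // known
  int n = 0;
  for (std::size_t k = 0; k < deps[loop].size(); k++) {
    n = std::max(n,
          levelOf(deps[loop][k], deps, level) + 1);
  }
  level[loop] = n;
  return n;
}

// Groups the loops into the sets S_0, S_1, ...
std::vector<std::vector<int> >
parallelSets(const LoopGraph& deps)
{
  std::vector<int> level(deps.size(), -1);
  std::vector<std::vector<int> > sets;
  for (std::size_t i = 0; i < deps.size(); i++) {
    std::size_t n = levelOf((int)i, deps, level);
    if (sets.size() <= n) sets.resize(n + 1);
    sets[n].push_back((int)i);
  }
  return sets;
}
\end{lstlisting}

For a hypothetical graph of six loops in which loop 3 depends on loops 0 and 1, loop 4 on loop 2, and loop 5 on loops 3 and 4, this sketch yields the sets $\{0,1,2\}$, $\{3,4\}$ and $\{5\}$.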

As all the loops of a given set $S_n$ can be computed in parallel, the compiler will generate a \lstinline!sections! construct with a \lstinline!section! for each loop.
\begin{lstlisting}
#pragma omp sections
{
  #pragma omp section
  for (...) {
    // Loop 1
  }
  #pragma omp section
  for (...) {
    // Loop 2
  }
  ...
}
\end{lstlisting}

If a given set contains only one loop, then the compiler checks to see if the loop can be parallelized (no recursive dependencies) or not. If it can be parallelized, it generates:
\begin{lstlisting}
#pragma omp for
for (...) {
  // Loop code
}
\end{lstlisting}
otherwise it generates a \lstinline!single! construct so that only one thread will execute the loop:
\begin{lstlisting}
#pragma omp single
for (...) {
  // Loop code
}
\end{lstlisting}

\subsection{Example of parallel OpenMP code}
To illustrate how \faust uses the OpenMP directives, here is a very simple example: two 1-pole filters in parallel connected to an adder (see Figure \ref{fig:parfilter} for the corresponding block diagram):

\begin{lstlisting}
filter(c) = *(1-c) : + ~ *(c);
process = filter(0.9), filter(0.9) : +;
\end{lstlisting}

\begin{figure}[htb]
\centering
\includegraphics[width=8cm]{images/filter2}
\caption{two filters in parallel connected to an adder}
\label{fig:parfilter}
\end{figure}

The corresponding \lstinline!compute()! method obtained using the \lstinline!-omp! option is the following:
\begin{lstlisting}

virtual void compute (int fullcount,
                      float** input,
                      float** output)
{
  float fRec0_tmp[32+4];
  float fRec1_tmp[32+4];
  float* fRec0 = &fRec0_tmp[4];
  float* fRec1 = &fRec1_tmp[4];
  #pragma omp parallel firstprivate(fRec0,fRec1)
  {
    for (int index = 0; index < fullcount;
                                index += 32)
    {
      int count = min (32, fullcount-index);
      float* input0 = &input[0][index];
      float* input1 = &input[1][index];
      float* output0 = &output[0][index];
      #pragma omp single
      {
        for (int i=0; i<4; i++)
          fRec0_tmp[i]=fRec0_perm[i];
        for (int i=0; i<4; i++)
          fRec1_tmp[i]=fRec1_perm[i];
      }
      // SECTION : 1
      #pragma omp sections
      {
        #pragma omp section
        for (int i=0; i<count; i++) {
          fRec0[i] = ((0.1f * input1[i])
                   + (0.9f * fRec0[i-1]));
        }
        #pragma omp section
        for (int i=0; i<count; i++) {
          fRec1[i] = ((0.1f * input0[i])
                   + (0.9f * fRec1[i-1]));
        }
      }
      // SECTION : 2
      #pragma omp for
      for (int i=0; i<count; i++) {
        output0[i] = (fRec1[i] + fRec0[i]);
      }
      // SECTION : 3
      #pragma omp single
      {
        for (int i=0; i<4; i++)
          fRec0_perm[i]=fRec0_tmp[count+i];
        for (int i=0; i<4; i++)
          fRec1_perm[i]=fRec1_tmp[count+i];
      }
    }
  }
}

\end{lstlisting}

This code requires some comments:

\begin{enumerate}
\item The parallel construct \lstinline!#pragma omp parallel! is the fundamental construct that starts parallel execution. The number of parallel threads is generally the number of CPU cores, but it can be controlled in several ways.

\item Variables external to the parallel region are shared by default. The clause \lstinline!firstprivate(fRec0,fRec1)! indicates that each thread should have its own private copy of \lstinline!fRec0! and \lstinline!fRec1!. The reason is that accessing shared variables requires an indirection and is quite inefficient compared to private copies.

\item The top level loop \lstinline!for (int index = 0;...)...! is executed by all threads simultaneously. The subsequent work-sharing directives inside the loop indicate how the work must be shared between the threads.

\item Note that an implied barrier exists at the end of each work-sharing region. All threads must have reached the barrier before any of them can continue.

\item The work-sharing directive \lstinline!#pragma omp single! indicates that this first section will be executed by only one thread (any of them).

\item The work-sharing directive \lstinline!#pragma omp sections! indicates that each corresponding \lstinline!#pragma omp section!, here our two filters, will be executed in parallel.

\item The loop construct \lstinline!#pragma omp for! specifies that the iterations of the associated loop will be executed in parallel. The iterations of the loop are distributed across the parallel threads. For example, if we have two threads, the first one can compute indices between 0 and count/2 and the other one between count/2 and count.

\item Finally, \lstinline!#pragma omp single! in section 3 indicates that this last section will be executed by only one thread (any of them).

\end{enumerate}
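
The combination of \lstinline!sections!, the implied barrier and \lstinline!omp for! described in these comments can be tried in isolation with the following reduced sketch. It is not generated by \faust: the function name \lstinline!twoFilters! and the use of \lstinline!std::vector! buffers are assumptions made for this example, and the filter state is simply reset to zero on each call. Compiled without OpenMP support, the pragmas are ignored and the function computes the same result sequentially.

\begin{lstlisting}
#include <cstddef>
#include <vector>

// Two 1-pole filters followed by an adder.
// Assumes in1 holds at least as many samples as in0.
std::vector<float> twoFilters(
    const std::vector<float>& in0,
    const std::vector<float>& in1)
{
  const int count = static_cast<int>(in0.size());
  // Index 0 holds the previous state (0 here),
  // sample i is stored at index i+1.
  std::vector<float> r0(count + 1, 0.0f);
  std::vector<float> r1(count + 1, 0.0f);
  std::vector<float> out(count);
  #pragma omp parallel
  {
    // Each recursive filter runs in one thread.
    #pragma omp sections
    {
      #pragma omp section
      for (int i = 0; i < count; i++) {
        r0[i+1] = 0.1f * in0[i] + 0.9f * r0[i];
      }
      #pragma omp section
      for (int i = 0; i < count; i++) {
        r1[i+1] = 0.1f * in1[i] + 0.9f * r1[i];
      }
    }
    // Implied barrier: both filters are complete.
    // The additions are shared between the threads.
    #pragma omp for
    for (int i = 0; i < count; i++) {
      out[i] = r0[i+1] + r1[i+1];
    }
  }
  return out;
}
\end{lstlisting}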

\subsection{The scheduler code generator}
With the \lstinline!--scheduler! (or \lstinline!-sch!) option given to the \faust compiler, the computation graph is cut into separate computation loops (called ``tasks''), and a ``Work Stealing Scheduler'' is used to activate and execute them following their dependencies. A pool of worker threads is created and each thread uses its own local WSQ (Work Stealing Queue) of tasks. A WSQ is a special queue with a Push operation, a ``private'' LIFO Pop operation and a ``public'' FIFO Pop operation.

The local LIFO Pop operation allows better cache locality, and the FIFO steal Pop lets a thief take a ``larger chunk'' of work. The reason is that many work-stealing workloads are divide-and-conquer in nature: stealing one of the oldest tasks implicitly also steals a (potentially) large subtree of computations that will unfold once that piece of work is stolen and run.
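
The three WSQ operations can be illustrated with the following minimal sketch. It is an assumption-laden illustration and not the implementation found in \textit{architecture/scheduler.h}: the class name \lstinline!WorkStealingQueue!, the method names and the use of a mutex-protected \lstinline!std::deque! of task numbers are all chosen for this example, whereas a production queue would typically avoid locking.

\begin{lstlisting}
#include <deque>
#include <mutex>

// Hypothetical mutex-based work stealing queue
// holding task numbers.
class WorkStealingQueue {
 public:
  // Push: the owner adds a task at the back.
  void push(int task) {
    std::lock_guard<std::mutex> lock(fMutex);
    fTasks.push_back(task);
  }
  // "Private" LIFO Pop: the owner takes the most
  // recently pushed task (better cache locality).
  bool pop(int& task) {
    std::lock_guard<std::mutex> lock(fMutex);
    if (fTasks.empty()) return false;
    task = fTasks.back();
    fTasks.pop_back();
    return true;
  }
  // "Public" FIFO Pop: another thread steals the
  // oldest task from the front.
  bool steal(int& task) {
    std::lock_guard<std::mutex> lock(fMutex);
    if (fTasks.empty()) return false;
    task = fTasks.front();
    fTasks.pop_front();
    return true;
  }
 private:
  std::mutex fMutex;
  std::deque<int> fTasks;
};
\end{lstlisting}

After pushing tasks 1, 2 and 3 in that order, the owner's \lstinline!pop! returns task 3 while a thief's \lstinline!steal! returns task 1.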

Compared to the OpenMP model (\lstinline!-omp!), the new model performs worse for simple \faust programs and usually starts to perform comparably, or sometimes better, for ``complex enough'' \faust programs. In any case, since OpenMP does not behave so well with GCC compilers (only quite recent versions like GCC 4.4 start to show some improvements) and is unusable on OSX in real-time contexts, this new scheduler option has its own value. We plan to improve it by adding a ``pipelining'' idea in the future.

\subsection{Example of parallel scheduler code}
To illustrate how \faust generates the scheduler code, here is a very simple example: two 1-pole filters in parallel connected to an adder (see Figure \ref{fig:parfilter} for the corresponding block diagram):

\begin{lstlisting}
filter(c) = *(1-c) : + ~ *(c);
process = filter(0.9), filter(0.9) : +;
\end{lstlisting}

When the \lstinline!-sch! option is used, the content of the additional \textit{architecture/scheduler.h} file is inserted in the generated code. It contains code to deal with WSQ and thread management. The \lstinline'compute()' and \lstinline'computeThread()' methods are the following:
\begin{lstlisting}

virtual void compute (int fullcount,
                      float** input,
                      float** output)
{
  GetRealTime();
  this->input = input;
  this->output = output;
  StartMeasure();
  for (fIndex = 0; fIndex < fullcount; fIndex += 32) {
    fFullCount = min (32, fullcount-fIndex);
    // Only initialize tasks with inputs
    fIsFinished = false;
  }
}
  float* fRec0 = &fRec0_tmp[4];
  float* fRec1 = &fRec1_tmp[4];
  // Init graph state
  {
    int count = fFullCount;
    // Init input and output
    FAUSTFLOAT* input0 = &input[0][fIndex];
    FAUSTFLOAT* input1 = &input[1][fIndex];
    FAUSTFLOAT* output0 = &output[0][fIndex];
    while (!fIsFinished) {
        case WORK_STEALING_INDEX: {
          break;
        }
          fIsFinished = true;
          break;
        }
        // SECTION : 1
        case 2: {
          // LOOP 0x101111680
          // pre processing
          for (int i=0; i<4; i++) fRec0_tmp[i]=fRec0_perm[i];
          // exec code
          for (int i=0; i<count; i++) {
            fRec0[i] = ((1.000000e-01f * (float)input1[i]) + (0.9f * fRec0[i-1]));
          }
          // post processing
          for (int i=0; i<4; i++) fRec0_perm[i]=fRec0_tmp[count+i];
          break;
        }
        case 3: {
          // LOOP 0x1011125e0
          // pre processing
          for (int i=0; i<4; i++) fRec1_tmp[i]=fRec1_perm[i];
          // exec code
          for (int i=0; i<count; i++) {
            fRec1[i] = ((1.000000e-01f * (float)input0[i]) + (0.9f * fRec1[i-1]));
          }
          // post processing
          for (int i=0; i<4; i++) fRec1_perm[i]=fRec1_tmp[count+i];
          break;
        }
        case 4: {
          // LOOP 0x101111580
          // exec code
          for (int i=0; i<count; i++) {
            output0[i] = (FAUSTFLOAT)(fRec1[i] + fRec0[i]);
          }