SIMD Parallelism using Array Notation

Are you a C or C++ programmer who has ever envied APL or Fortran 90's array expressions?   Read on.  If you don't know what array expressions are, then you really should read on, to find out what you should have envied.  In any case, the envy is over, because  Intel Parallel Composer 2011 brings array expressions to C and C++.

Background


A while back I wrote about the Three Layer Cake pattern for parallel programming.  The pattern is a way of organizing programs to fully exploit modern  multi-core chips.   Two of the layers are:

    • fork-join: harnesses multiple hardware threads.

    • SIMD: harnesses SIMD instructions.



The compiler in Intel Parallel Composer 2011 extends C++ to directly support these two layers.  The extensions are called Intel® Cilk Plus.  They are:

    • Cilk notation for specifying fork-join parallelism.

    • Array notation for specifying SIMD parallelism.



This blog introduces the array notation, with a Seismic Duck kernel as the example.  I'll introduce Cilk notation in another blog.  The two notations are independent.  Indeed, the array notation is valuable with other threading packages too, such as Threading Building Blocks, or just for writing faster serial code.

Quick Introduction to Array Notation


The array notation extension is reminiscent of APL and Fortran-90 style array expressions.   The expression:

a[index:count]


denotes an array section starting at index with count elements.  Scalar operations can be used on conformable array sections in an intuitive manner.   Operations between scalars and array sections work too; the scalar is extended in the obvious way (as in APL or Fortran 90).  Examples:

    z[i:n] = x[i:n];      // Copies x[i..i+n-1] to z[i..i+n-1].

    z[i:n] = 2*x[i+1:n];  // Sets z[i..i+n-1] to twice the corresponding elements in x[i+1..i+n].

    u[i:m][j:n] += 1;     // Increments elements of two-dimensional mxn array section with upper left corner [i][j].


Section notation also permits expressions of the form array[index:count:stride], reductions, and shorthands that I will not go into here.   I'm presenting just enough to pique your interest.  To learn more about it, follow this link to the compiler documentation.
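
To give a small taste of those forms anyway, here is a hedged sketch.  The names x, y, n, and sum are made up for illustration; __sec_reduce_add is the compiler's built-in sum reduction over a section, as described in that documentation.

    // Illustration only: x, y, and n are hypothetical arrays/sizes.
    float sum = __sec_reduce_add(y[0:n]);    // reduction: sum of y[0..n-1]
    y[0:n/2:2] = x[0:n/2];                   // strided section: every other element of y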

Example


I've described in other blogs how seismic wave propagation in Seismic Duck depends on updating a "tile", a small subarray that fits in cache.  Here is the scalar code that dominates execution time.  It updates a tile with uniform A and B coefficients:

    float a = 2*A[iFirst][jFirst];

    float b = B[iFirst][jFirst];

    for( int i=iFirst; i<iLast; ++i ) {

        for( int j=jFirst; j<jLast; ++j ) {

            Vx[i][j] += a*(U[i][j+1]-U[i][j]);

            Vy[i][j] += a*(U[i+1][j]-U[i][j]);

            U[i][j] += b*((Vx[i][j]-Vx[i][j-1])+(Vy[i][j]-Vy[i-1][j]));

        }

    }


To improve the speed on compilers that did not automatically generate SIMD code from the scalar loops, I wrote the key loops with SSE intrinsics, so that calculations are done four wide instead of one at a time.  The resulting code looks like this:

    #define CAST(x) (*(__m128*)&(x))        /* for aligned load or store */

    #define LOAD(x) _mm_loadu_ps(&(x))      /* for unaligned load */

    #define ADD _mm_add_ps

    #define MUL _mm_mul_ps

    #define SUB _mm_sub_ps

    ...

    __m128 a = CAST(A[iFirst][jFirst]);

    a = ADD(a,a);

    __m128 b = CAST(B[iFirst][jFirst]);

    for( int i=iFirst; i<iLast; ++i ) {

        for( int j=jFirst; j<jLast; j+=4 ) {

            __m128 u = CAST(U[i][j]);

            CAST(Vx[i][j]) = ADD(CAST(Vx[i][j]),MUL(a,SUB(LOAD(U[i][j+1]),u)));

            CAST(Vy[i][j]) = ADD(CAST(Vy[i][j]),MUL(a,SUB(CAST(U[i+1][j]),u)));

            CAST(U[i][j]) = ADD(u,MUL(b,ADD(SUB(CAST(Vx[i][j]),LOAD(Vx[i][j-1])),SUB(CAST(Vy[i][j]),CAST(Vy[i-1][j])))));

        }

    }


The downside of the change is obvious - it's hard to read. And this was a simple case because logic elsewhere guarantees that jLast-jFirst is a multiple of 4.  Otherwise, dealing with the extra iterations would have further obfuscated the code.
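
To see why, here is a hedged sketch of what handling 1-3 leftover columns might look like.  This is illustrative only, not code from Seismic Duck; a_s and b_s stand for scalar copies of the coefficients.

    for( int i=iFirst; i<iLast; ++i ) {
        int j = jFirst;
        // Four-wide SIMD loop over complete groups of four columns.
        for( ; j+4<=jLast; j+=4 ) {
            /* the same intrinsics-based updates as above */
        }
        // Scalar cleanup loop for the remaining 0-3 columns.
        for( ; j<jLast; ++j ) {
            Vx[i][j] += a_s*(U[i][j+1]-U[i][j]);
            Vy[i][j] += a_s*(U[i+1][j]-U[i][j]);
            U[i][j]  += b_s*((Vx[i][j]-Vx[i][j-1])+(Vy[i][j]-Vy[i-1][j]));
        }
    }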

For this particular example, explicit SSE intrinsics are not actually necessary with a compiler that automatically vectorizes (that is, converts loops to SIMD instructions).  Indeed, recent compilers that I tried seem to be able to do so.  (Though one older compiler from 2008 did not.)   But I was careful to cater to the optimizer.  I declared the arrays Vx, Vy, and U as static file-scope arrays in the source code, not pointers.  That's not trendy OO programming, but it lets the compiler easily prove absence of aliasing, and thus absence of loop-carried dependences that could thwart vectorization.   It's not always practical to cater this way to the optimizer.  Furthermore, array notation has its own elegance.  So I'll use the kernel as a running example anyway.
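
As an illustration of that catering, the declarations look roughly like the sketch below.  The dimensions M and N are hypothetical placeholders, not the actual Seismic Duck values.

    // Hypothetical sketch: M and N stand in for the real grid dimensions.
    enum { M = 1024, N = 1024 };
    static float U [M][N];   // File-scope arrays, not pointers: the compiler
    static float Vx[M][N];   // can see that distinct arrays never alias one
    static float Vy[M][N];   // another, so vectorization is not blocked by
    static float A [M][N];   // possible loop-carried dependences through
    static float B [M][N];   // aliased pointers.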

The array notation in Intel® Cilk Plus lets me state my intent ("SIMD parallelism!") to the compiler more bluntly.  Below is an array notation version of the example:

    int i = iFirst;

    int j = jFirst;

    size_t m = iLast-iFirst;

    size_t n = jLast-jFirst;

    float a = 2*A[i][j];

    float b = B[i][j];

    Vx[i:m][j:n] += a*(U[i:m][j+1:n]-U[i:m][j:n]);

    Vy[i:m][j:n] += a*(U[i+1:m][j:n]-U[i:m][j:n]);

    U[i:m][j:n] += b*((Vx[i:m][j:n]-Vx[i:m][j-1:n])+(Vy[i:m][j:n]-Vy[i-1:m][j:n]));


Compare the last three lines with the forall loops from which the code was derived in another blog:

    forall i, j {

        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);

        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);

    }

    forall i, j {

        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));

    }


The array notation has let me clearly convey the parallel nature of the updates.  I had to add the setup of i, j, m, n.  But that's an accident of history.  Elsewhere the code computes {iFirst, iLast, jFirst, jLast} from the equivalent of {i, j, m, n} because the former simplified writing the C++ for loops.  If I adapt the rest of the code to use array notation, then {iFirst, iLast, jFirst, jLast} will disappear and {i, j, m, n} will be set up in their place.

Of course I'm still depending on a clever optimizer to eliminate the temporary subarrays.  For example, the subexpression:

    a*(U[i:m][j+1:n]-U[i:m][j:n])


conceptually generates two temporary array sections, one for the result of the subtraction and one for the result of the multiplication.   In practice, the Intel compiler is good at eliminating those temporaries.   (It's had years of practice doing so for Fortran 90.)  But even if some temporary sections remained, the parallelism would still be clear.  The compiler does not have to deduce parallelism from dependence analysis of serial for loops.
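
To make that concrete, here is a hedged sketch of what the statement would mean if the temporaries were actually materialized.  This is purely conceptual; in practice the compiler fuses everything into one pass over the section with no temporary storage.

    // Conceptual expansion of  Vx[i:m][j:n] += a*(U[i:m][j+1:n]-U[i:m][j:n])
    // t1 holds the result of -, t2 holds the result of *.  (Requires <vector>.)
    std::vector<float> t1(m*n), t2(m*n);
    for( size_t p=0; p<m; ++p )          // Each of these three loop nests is
        for( size_t q=0; q<n; ++q )      // trivially parallel on its own.
            t1[p*n+q] = U[i+p][j+1+q] - U[i+p][j+q];
    for( size_t p=0; p<m; ++p )
        for( size_t q=0; q<n; ++q )
            t2[p*n+q] = a * t1[p*n+q];
    for( size_t p=0; p<m; ++p )
        for( size_t q=0; q<n; ++q )
            Vx[i+p][j+q] += t2[p*n+q];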

Concise notation is nice, but how about the performance?  When compiled by the Intel compiler and run on the Core 2 Quad system in my office, the array notation variant performed faster than my hand-coded SSE.   [Your mileage may vary.  See optimization disclaimer here.]  I dug through the object code to figure out why.  It turns out that updating Vx, Vy, and U in separate loops does better than updating them in a single loop.   I found that the hand-coded SSE does just as well if changed to use separate loops to update the three arrays.  Anyway, I'm happy that the array notation matches the best that I can do by hand for this example.
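
In case that restructuring is not obvious, the separate-loop form of the hand-coded SSE looks roughly like the sketch below, reusing the macros and the __m128 values a and b from above.  It is an illustration of the idea, not the exact code I measured.

    // One loop nest per array instead of one combined loop.
    for( int i=iFirst; i<iLast; ++i )
        for( int j=jFirst; j<jLast; j+=4 )
            CAST(Vx[i][j]) = ADD(CAST(Vx[i][j]),MUL(a,SUB(LOAD(U[i][j+1]),CAST(U[i][j]))));
    for( int i=iFirst; i<iLast; ++i )
        for( int j=jFirst; j<jLast; j+=4 )
            CAST(Vy[i][j]) = ADD(CAST(Vy[i][j]),MUL(a,SUB(CAST(U[i+1][j]),CAST(U[i][j]))));
    for( int i=iFirst; i<iLast; ++i )
        for( int j=jFirst; j<jLast; j+=4 )
            CAST(U[i][j]) = ADD(CAST(U[i][j]),MUL(b,ADD(SUB(CAST(Vx[i][j]),LOAD(Vx[i][j-1])),SUB(CAST(Vy[i][j]),CAST(Vy[i-1][j])))));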

Summary


The array notation is a concise way to express SIMD parallelism.  I'm hoping it catches on with other compilers.  In another blog I'll introduce the application of Cilk fork-join parallelism to Seismic Duck.

For more complete information about compiler optimizations, see our Optimization Notice.