Assessing the accelerator buzz: Vectorization of Monte Carlo algorithms

Now we’ll take a look at optimizing something more interesting and complex. Since we can’t show much of the customer source we work on, we’ll look at some public domain code from the internet, specifically this Box-Muller random number transformation from http://www.taygeta.com/random/gaussian.html:

 
    for (int i = 0; i < LENGTH; i++)
      {
        double w, x1, x2;

        do {
          x1 = 2.0 * random()/RAND_MAX - 1.0;
          x2 = 2.0 * random()/RAND_MAX - 1.0;
          w = x1 * x1 + x2 * x2;
        } while ( w >= 1.0 );

        w = sqrt( (-2.0 * log( w ) ) / w );
        y1[i] = x1 * w;
        y2[i] = x2 * w;
      }

 
This produces a stream of Gaussian pseudo-random numbers stored in the arrays y1 and y2.

Going in, we know that the Intel Compiler will not vectorize the do/while loop because it has an unpredictable trip count that depends on the results of each pass through the code.  The outer for loop won’t vectorize either, as the compiler is not yet capable of such magic.

Between the square root, log, and divides we have a lot of expensive stuff going on here.  It would pay to restructure the code into something that the compiler can recognize as a vectorization candidate.  This can be done by splitting the outer loop into two for loops and saving intermediate values into arrays, as such:

double x1, x2, w;
__declspec(align(16)) double _w[LENGTH], _x1[LENGTH], _x2[LENGTH];

for (int i = 0; i < LENGTH; i++)
  {
    do {
      x1 = 2.0 * random()/RAND_MAX - 1.0;
      x2 = 2.0 * random()/RAND_MAX - 1.0;
      w = x1 * x1 + x2 * x2;
    } while ( w >= 1.0 );

    _x1[i] = x1;
    _x2[i] = x2;
    _w[i] = w;
  }

#pragma ivdep
#pragma vector aligned
for (int i = 0; i < LENGTH; i++)
  {
    w = _w[i];
    w = sqrt( (-2.0 * log( w ) ) / w );
    y1[i] = _x1[i] * w;
    y2[i] = _x2[i] * w;
  }

 
With this restructuring we’ve isolated the sqrt/log/divide sequence into a loop simple enough for the compiler to vectorize.  The #pragmas instruct it to ignore potential vector dependencies, i.e. pointer aliasing, and to assume that the arrays exhibit 16-byte alignment (guaranteed by adding “__declspec(align(16))” to the declarations).

 
$ icc -xP -vec_report3  GaussianRand.C
...
GaussianRand.C(85): (col. 21) remark: loop was not vectorized: unsupported loop structure.
GaussianRand.C(95): (col. 2) remark: LOOP WAS VECTORIZED.

 

This generated a healthy 1.5x speedup on the algorithm.

Note that the compiler employs a “short vector math library” which contains SIMD versions of math calls such as log, exp, sin, cos, etc.  This enables vectorization on loops such as the one above.  You can see how SVML is used in the assembly listing:

..B3.9:                        
        movaps    16032(%rsp,%rbp), %xmm8                       #97.10
        movaps    16032(%rsp,%rbp), %xmm0                       #98.29
        call      __svml_log2                                   #98.29
        mulpd     %xmm9, %xmm0                                  #98.24
        divpd     %xmm8, %xmm0                                  #98.37
        sqrtpd    %xmm0, %xmm2                                  #98.10
        movaps    32(%rsp,%rbp), %xmm1                          #99.24
...

For more complete information about compiler optimizations, see our Optimization Notice.