Assessing the accelerator buzz: Another tip for faster Monte Carlo computing

Continuing with the GaussianRand example, a 1.5x gain is nice but were there additional opportunities for performance gains?  Of course there were! (That was a rhetorical question…)  Seeing as floating point divides are among the longer latency operations, we should look at the two that are coded into the do/while loop to normalize the random numbers:

        do {
          x1 = 2.0 * random()/RAND_MAX - 1.0;
          x2 = 2.0 * random()/RAND_MAX - 1.0;
          w = x1 * x1 + x2 * x2;
        } while ( w >= 1.0 );

The x1 and x2 computations take a random integer from 0 to RAND_MAX and normalize it into the range -1.0 to 1.0.  While we might expect the compiler to reduce this to a single multiply by the constant (2.0/RAND_MAX) and then subtract 1.0, we can’t assume anything.  Take a look at the assembly listing:

        call      random                                        #55.15
..B2.4:
        cvtsi2sdq %rax, %xmm0                                   #55.15
        addsd     %xmm0, %xmm0                                  #55.15
        divsd     _2il0floatpacket.1(%rip), %xmm0               #55.24
        subsd     _2il0floatpacket.3(%rip), %xmm0               #55.35
        movsd     %xmm0, 24(%rsp)                               #55.35
        call      random                                        #56.15
..B2.5:
        cvtsi2sdq %rax, %xmm4                                   #56.15
        movsd     _2il0floatpacket.3(%rip), %xmm2               #56.35
        addsd     %xmm4, %xmm4                                  #56.15
        divsd     _2il0floatpacket.1(%rip), %xmm4               #56.24
        movsd     24(%rsp), %xmm0                               #57.13
        subsd     %xmm2, %xmm4                                  #56.35

Even without being an assembly wizard, you might detect that the two calls to random() are soon followed with divides by some constant value.  Those are going to chew up a lot of clock cycles.  So we should get a nice gain by explicitly folding this into a multiply:

    const double RMrcp = 2.0/RAND_MAX;

        for (int i = 0; i < LENGTH; i++)
          {
            do
            {
              x1 = random()*RMrcp - 1.0;
              x2 = random()*RMrcp - 1.0;
              w = x1 * x1 + x2 * x2;
            } while ( w >= 1.0 );
            
            _x1[i] = x1;
            _x2[i] = x2;
            _w[i] = w;

          }

This quick code mod pushed the speedup to 1.9x.

For more complete information about compiler optimizations, see our Optimization Notice.