# Assessing the accelerator buzz: Another tip for faster Monte Carlo computing

Continuing with the GaussianRand example, a 1.5x gain is nice, but are there additional opportunities for performance gains?  Of course there are! (That was a rhetorical question…)  Since floating-point divides are among the longest-latency arithmetic operations, we should look at the two that are coded into the do/while loop to normalize the random numbers:

    do {
        x1 = 2.0 * random()/RAND_MAX - 1.0;
        x2 = 2.0 * random()/RAND_MAX - 1.0;
        w = x1 * x1 + x2 * x2;
    } while ( w >= 1.0 );

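The x1 and x2 computations take a random integer from 0 to RAND_MAX and normalize it into the range -1.0 to 1.0.  While we might expect the compiler to reduce this to a single multiply by the constant (2.0/RAND_MAX) and then subtract 1.0, we can’t assume anything.  Multiplying by a reciprocal can round differently from dividing, so in general a compiler that honors strict floating-point semantics is not free to make that substitution for us.  Take a look at the assembly listing: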

            call      random                                        #55.15
    ..B2.4:
            cvtsi2sdq %rax, %xmm0                                   #55.15
            divsd     _2il0floatpacket.1(%rip), %xmm0               #55.24
            subsd     _2il0floatpacket.3(%rip), %xmm0               #55.35
            movsd     %xmm0, 24(%rsp)                               #55.35
            call      random                                        #56.15
    ..B2.5:
            cvtsi2sdq %rax, %xmm4                                   #56.15
            movsd     _2il0floatpacket.3(%rip), %xmm2               #56.35
            divsd     _2il0floatpacket.1(%rip), %xmm4               #56.24
            movsd     24(%rsp), %xmm0                               #57.13
            subsd     %xmm2, %xmm4                                  #56.35
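One likely reason divsd instructions like these survive optimization is that dividing by a constant and multiplying by its reciprocal are not guaranteed to round identically.  The sketch below is only an illustration: the helper names normalize_div, normalize_mul and max_abs_diff are hypothetical (they are not part of the GaussianRand code), and it samples evenly spaced inputs rather than calling the random number generator.

```c
#include <stdlib.h>   /* RAND_MAX */

/* Normalize with a divide, written the same way as in the original loop. */
double normalize_div(long r)
{
    return 2.0 * r / RAND_MAX - 1.0;
}

/* Normalize with a precomputed reciprocal (hypothetical helper). */
double normalize_mul(long r)
{
    const double rcp = 2.0 / RAND_MAX;   /* computed once, reused for every r */
    return r * rcp - 1.0;
}

/* Largest absolute difference between the two forms, measured over
 * samples + 1 evenly spaced inputs in [0, RAND_MAX]. Requires samples > 0. */
double max_abs_diff(long samples)
{
    double worst = 0.0;
    for (long i = 0; i <= samples; i++) {
        long r = (long)((double)RAND_MAX * i / samples);
        double d = normalize_div(r) - normalize_mul(r);
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Printing max_abs_diff(1000000) with "%.3g" shows how far apart the two forms get on a given compiler and machine.  Any difference should be on the order of the last bit of a double, which is harmless for this Monte Carlo loop but is still enough to stop a compiler from silently swapping one form for the other.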

Even without being an assembly wizard, you can spot that each of the two calls to random() is soon followed by a divide (divsd) by a constant value.  Those are going to chew up a lot of clock cycles.  So we should get a nice gain by explicitly folding this into a multiply:

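    /* Precompute 2/RAND_MAX once so the loop multiplies instead of divides. */
    const double RMrcp = 2.0/RAND_MAX;

    for (int i = 0; i < LENGTH; i++)
    {
        /* Retry until the point (x1, x2) falls inside the unit circle. */
        do
        {
            x1 = random()*RMrcp - 1.0;
            x2 = random()*RMrcp - 1.0;
            w = x1 * x1 + x2 * x2;
        } while ( w >= 1.0 );

        _x1[i] = x1;
        _x2[i] = x2;
        _w[i] = w;
    }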

This quick code mod pushed the speedup to 1.9x.
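To check the effect on your own machine, a rough timing harness along these lines might help.  It is a sketch under stated assumptions, not the benchmark behind the numbers in this post: the function names time_divide and time_multiply are hypothetical, the standard rand() stands in for random() so the code builds with any C compiler, and clock() measures CPU time only coarsely.

```c
#include <stdlib.h>   /* rand, srand, RAND_MAX */
#include <time.h>     /* clock, CLOCKS_PER_SEC */

/* Fill out[0..n) with accepted w values using the divide form.
 * Returns the CPU seconds spent in the loop. */
double time_divide(double *out, int n, unsigned seed)
{
    srand(seed);
    clock_t t0 = clock();
    for (int i = 0; i < n; i++) {
        double x1, x2, w;
        do {
            x1 = 2.0 * rand() / RAND_MAX - 1.0;
            x2 = 2.0 * rand() / RAND_MAX - 1.0;
            w = x1 * x1 + x2 * x2;
        } while (w >= 1.0);
        out[i] = w;
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Same loop, but with the divide folded into a precomputed reciprocal. */
double time_multiply(double *out, int n, unsigned seed)
{
    const double RMrcp = 2.0 / RAND_MAX;
    srand(seed);
    clock_t t0 = clock();
    for (int i = 0; i < n; i++) {
        double x1, x2, w;
        do {
            x1 = rand() * RMrcp - 1.0;
            x2 = rand() * RMrcp - 1.0;
            w = x1 * x1 + x2 * x2;
        } while (w >= 1.0);
        out[i] = w;
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Call both functions from main with the same seed and a large n (several million) and compare the returned times.  The ratio will depend on the compiler, its flags, the CPU, and how much of the loop time goes to the random number generator itself, so it may not match the figures quoted here.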
