Non-Obvious Implementation Choices for the Monte Carlo Simulation

This article was written for the Monte Carlo Simulation of European Swaptions sample.

Visit the Wikipedia pages for the Monte Carlo algorithm and swaptions to learn more about what this code is doing.
There are a few parts of the Monte Carlo code that serve a non-obvious purpose and benefit from an expanded explanation here.

Separated "for" loops for simulations and additions to achieve consistent results

The astute programmer might notice that the second “for” loop in the Monte Carlo loop (in Simulations.cpp) could be rolled into the first, thus eliminating the redundancy of iterating to c_num_simulations twice, like so
for (int path=0; path<c_num_simulations; ++path) {
    calculate_path_for_swaption_kernel_scalar(initial_LIBOR_rate, volatility, 
    normal_distribution_rand+(path*c_time_steps), discounted_swaption_payoffs+path);
    total_payoff += discounted_swaption_payoffs[i];
The astute parallel programmer might even consider using an Intel® Cilk reducer so that a correct answer can be maintained even in the cilk_for version of the loop, like so
cilk::reducer_opadd<float_point_precision> total_payoff;
cilk_for (int path=0; path<c_num_simulations; ++path) {
    calculate_path_for_swaption_kernel_scalar(initial_LIBOR_rate, volatility, 
    normal_distribution_rand+(path*c_time_steps), discounted_swaption_payoffs+path);
    total_payoff += discounted_swaption_payoffs[i];
This is a valid and efficient way to derive the final reduction value, but this Monte Carlo simulation involves floating-point operations, which muddies the waters a little bit. Due to the approximate nature of floats, mathematical operations can result in a loss of accuracy in the final answer. The problem is only magnified if the operation is between a very large and very small number. It just so happens that the random numbers generated from the normal distribution from the Intel® Math Kernel Library (Intel® MKL) has such large and small numbers. This means that calculations normally considered associative, where the order does not matter, such as addition, may vary based on the order they are conducted. While the nondeterministic nature of parallel code would not normally affect the answer, if many float additions are conducted at relative random, the final answer will change between runs. This is not to say that the answer being calculated is incorrect—the calculation itself is the same every time—but the way it is rounded will change the final answer. (For a more in-depth explanation, see our “Consistency of Floating-Point Results using Intel Compiler” white paper.)
Thus it is up to the programmer to decide if bit-for-bit reproducibility is important. For this particular sample, in order to maintain a sense of consistency, the averaging of all simulations is done outside of the parallel loop; this makes the floating point rounding exactly the same for each “for” loop.
for/cilk_for (int path=0; path<c_num_simulations; ++path) {
    calculate_path_for_swaption_kernel_scalar(initial_LIBOR_rate, volatility, 
    normal_distribution_rand+(path*c_time_steps), discounted_swaption_payoffs+path);
for (int path=0; path<c_num_simulations; ++path) {
    total_payoff += discounted_swaption_payoffs[i];

Accessing the normal distribution in a uniform fashion for both scalar and Array Notation

The array notation kernel (calculate_path_for_swaption_kernel_array()) is almost identical to its scalar counterpart (calculate_path_for_swaption_kernel_scalar()). At first glance, the only difference between the two is the support of vectorization in the form of appropriately placed [:]. There is, however, one other major difference: accessing the normal distribution.
This normal distribution is used as a way to grow the evolving swap rate of each swaption portfolio in a pseudo organic way. While the ordering is random, to maintain a sense of consistency, it should be the same random numbers for every test. Simply accessing the array in-order for the scalar and array notation does not keep this order. This simplified example explains why:
Let’s say the random (or in this case, not very random) normal distribution array looks like
RND: [1, 2, 3, 4, 5, 6]
Let’s say we want to complete two simulations; that is,
c_num_simulations = 2
This would mean that each simulation will use half of the numbers in the normal distribution, or 3 numbers each.
c_time_steps = 3
In the scalar kernel, each simulation draws all of its numbers at once, such that
simulation 1: [1, 2, 3]
simulation 2: [4, 5, 6]
Now think about accessing the normal distribution array in an in-order fashion for the array notation kernel. This would look like:
sqiz[:] = sqrt(c_reset_interval) * normal_distribution_rand[j*c_simd_vector_length: c_simd_vector_length];
It draws two numbers at a time, where the first one is assigned to simulation 1 and the second one is assigned to simulation 2. Simulation 1 would get [1], and simulation 2 would get [2], and so on. Thus,
simulation 1 w/ Array Notation: [1, 3, 5]
simulation 2 w/ Array Notation: [2, 4, 6]
Each time this array notation kernel is called, it actually follows a different route than its scalar counterpart. This results in a different, but computationally similar final answer. Similar to the previous modification, this answer is not wrong, but simply different due to ordering. To fix the ordering, so that the scalar and vector kernels are bit-for-bit consistent, the access pattern for the normal distribution should be
sqiz[:] = sqrt(c_reset_interval) * normal_distribution_rand[j:c_simd_vector_length:c_time_steps];

fp:source and Qfast-transcendentals

fp:source specifies to the compiler that all intermediate calculations should be completed with the source floating-point accuracy. In other words, if two floats are multiplied, floating point values are used for all steps of the multiplication. This is an important setting because the Intel compiler defaults to fp:double(Windows*) or fp:extended(Linux*/OS X*) when compiling for 32 bit applications, which forces all floating-point calculations with floats to first up-convert the number to a double/extended. While this can increase final value accuracy, it may come at the cost of excessive casting, which can slow down the calculation. (Note: fp:source additionally implies fp:precise)
Qfast-transcendentals allows the Intel compiler to make fast calls for transcendental functions even under fp:precise. Normally, fp:precise will enable a slow but very accurate implementation of transcendental functions, but enabling Qfast-transcendentals allows the compiler to use faster, SSE2 calls where there is hardware support. By default, these fast transcendental functions introduce variability in the calculations, but the compiler option Qfimf-precision=high tells the compiler to use more accurate (and only slightly slower) SSE2 function calls that, at least for this sample, maintain bit-for-bit consistency.
For more complete information about compiler optimizations, see our Optimization Notice.