Finance: Monte Carlo

Monte Carlo algorithms solve deterministic problems by using a probabilistic analogue. The algorithm requires repeated simulations of chance that lend themselves well to parallel processing and vectorization. The simulations in this example are run serially, with Intel® Cilk™ Plus Array Notation (AN) for vectorization, with Intel Cilk Plus cilk_for for parallelization, and with both vectorization and cilk_for. In this example, the Monte Carlo algorithm is utilized to estimate the valuation of a European swaption, which is fundamentally calculated by the difference between the strike price and the future estimated value, or forward swap rate. The Monte Carlo algorithm estimates the valuation by applying the initial conditions to a normal distribution over many simulations to calculate a normal valuation.

Click here for a more in-depth discussion about the sample as well as more information about the design choices of this sample.

Based on an original code by Paul Glasserman and Xiaoliang Zhao (Columbia University, 1999-2000) with subsequent modifications by Mike Giles (Oxford University, 2005-8)

 

Code Change Highlights:

This code is optimized on two levels: calling simulations, which utilizes the cilk_for Keyword, and calculating discounted swaption payoffs, which utilizes Array Notation.
  • cilk_for
  • linear version:
    for (int path=0; path<c_num_simulations; ++path) { Pathcalc_Portfolio_Scalar_Kernel(initial_LIBOR_rate, volatility, normal_distribution_rand+(path*c_earliest_maturity), discounted_swaption_payoffs+path); }
    cilk_for version:
    cilk_for (int path=0; path<c_num_simulations; ++path) { Pathcalc_Portfolio_Scalar_Kernel(initial_LIBOR_rate, volatility, normal_distribution_rand+(path*c_earliest_maturity), discounted_swaption_payoffs+path); }
    This simple change creates code that ran 4x faster on our machine.
  • Array Notation
  • The vectorization itself happens within kernels.cpp. The math is identical, but is conducted over multiple simulations at once. An example occurs here:
    scalar version:
    for (int i=j+1; i<c_maturity; ++i) { float_point_precision volatility_local = volatility[i-j-1]; spot_LIBOR_drift += c_time_step*volatility_local*forward_LIBOR_rates[i]/ (1.0+c_time_step*forward_LIBOR_rates[i]); forward_LIBOR_rates[i] *= exp(volatility_local*(spot_LIBOR_drift - 0.5*volatility_local)*c_time_step + volatility_local*sqiz); }
    array notation version:
    for (int i=j+1; i<c_maturity; ++i) { float_point_precision volatility_local = volatility[i-j-1]; spot_LIBOR_drift[:] += c_time_step*volatility_local*forward_LIBOR_rates[i][:]/ (1.0+c_time_step*forward_LIBOR_rates[i][:]); forward_LIBOR_rates[i][:] *= exp(volatility_local*(spot_LIBOR_drift[:] - 0.5*volatility_local)*c_time_step + volatility_local*sqiz[:]); }
  • cilk_for + Array Notation
  • Combining cilk_for and array notation reduces the number of loops necessary because vectorized code will traverse the for loop in fewer steps.
    cilk_for version:
    cilk_for (int path=0; path<c_num_simulations; ++path) { Pathcalc_Portfolio_Scalar_Kernel(initial_LIBOR_rate, volatility, normal_distribution_rand+(path*c_earliest_maturity), discounted_swaption_payoffs+path); }
    cilk_for + Array Notation version:
    cilk_for (int path=0; path<c_num_simulations; path+=c_simd_vector_length) { calculate_path_for_swaption_kernel_array(initial_LIBOR_rate, volatility, normal_distribution_rand+(path*c_earliest_maturity), discounted_swaption_payoffs+path); }

Performance Data:

Note: Modified Speedup shows performance speedup with respect to serial implementation.

Modified Speedup Compiler
(Intel® 64)
Compiler options System specifications
AN: 6.3x
cilk_for: 4.5x
Both: 28.2x
Intel® C++ Compiler 15.0 for Windows /O2 /Oi /fp:source1 /Qfast-transcendentals /Qimf-precision=high /QxHost /Qipo Microsoft Windows Server* 2012 (x64)
Intel® Xeon® E3 1280 CPU @ 3.50GHz
8GB memory
AN: 8.8x
cilk_for: 4.2x
Both: 38.3x
Intel® C++ Compiler 15.0 for Linux -O2 -fp-model source1 -fast-transcendentals -fimf-precision=high -xHost -ipo Red Hat* Enterprise Linux 7 (x64)
Intel® Core™ i7-4790 CPU @3.60GHz
32GB memory
1/fp:source(-fp-model source) is used for bit-for-bit reproducability. If your program does not require this, you can enable /fp:fast(-fp-model fast) for better performance

Build Instructions:

  • For Microsoft Visual Studio* 2010 and 2012 users:
  • Open the solution .sln file
    [Optional] To use Intel® Math Kernel Library (Intel® MKL) for Gaussian distribution (a different result will be generated):
    • Project Properties -> Configuration Properties -> Intel® Performance Libraries -> set Use MKL : Parallel
    • Project Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions -> add IS_USING_MKL
    Choose a configuration (for best performance, choose a release configuration):
    • Intel-debug and Intel-release: uses Intel® C++ compiler
    • VSC-debug and VSC-release: uses Visual C++ compiler (only linear/scalar will run)
  • For Windows Command Line users:
  • Enable your particular compiler environment
    For Intel® C++ Compiler:
    • Open the appropriate Intel® C++ compiler command prompt and navigate to project folder
    • To compile: Build.bat [MKL] [perf_num]
      • MKL: compile with Intel® Math Kernel Library (Intel® MKL)
      • perf_num: collect performance numbers (will run example 5 times and take average time)
    • To run: Build.bat run [help|0|1|2|3|4]
    For Visual C++ Compiler (only linear/scalar will run):
    • Open the appropriate Microsoft Visual Studio* 2010 or 2012 command prompt and navigate to project folder
    • To compile: Build.bat [perf_num]
      • perf_num: collect performance numbers (will run example 5 times and take average time)
    • to run: Build.bat run
  • For Linux* or OS X* users:
  • From a terminal window, navigate to the project folder
    Using Intel® C++ compiler:
    • Set the icc environment: source <icc-install-dir>/bin/compilervars.sh {ia32 | intel64}
    • To compile: make [icpc] [MKL=1] [perf_num=1]
      • MKL=1: compile with Intel Math Kernel Library (Intel MKL)
      • perf_num=1: collect performance numbers (will run example 5 times and take average time)
    • To run: make run [option=help|0|1|2|3|4]
    Using gcc (only linear/scalar will run):
    • To compile: make gcc [perf_num=1]
      • perf_num=1: collect performance numbers (will run example 5 times and take average time)
    • To run: make run
For more complete information about compiler optimizations, see our Optimization Notice.