Download Available under the Intel Sample Source Code License Agreement license.

### Introduction

This is the second of a two document series. The first article, Monte Carlo European Option Pricing with RNG Interface for Intel® Xeon Phi™ Coprocessor, covers the background of Monte Carlo methods, and details the random number generation interface with Intel® MKL. This article covers the Monte Carlo Methods using a simple quasi random number generator.

When the Monte Carlo method was first introduced, it was generally considered an inefficient algorithm because the results converge at the slow rate of the reciprocal of the square root of its problem size, . In other words, to achieve 10 times improvement in accuracy, you have to process 100 times more samples. However, in the last 60 years, research has developed more efficient algorithms with faster converge rates and better variance control techniques. Moreover, the Monte Carlo method has become increasingly attractive, especially in quantitative finance, as computational power has improved at the rate of Moore’s law.

The first paper showing the use of Monte Carlo in quantitative finance was published in 1977. Since then, researchers and practitioners have continually worked to improve Monte Carlo methods in quantitative finance. Quasi-random numbers are used to achieve faster (1/N) converge rates; antithetic sampling, control variate, and importance sampling have widely been used as variance reduction techniques so that Monte Carlo can converge faster with higher-quality results using limited samples. In recent years, as financial products have become increasingly complex, alternative methods have either become intractable or simply do not exist. With the widely used parallel computing techniques, Monte Carlo has become the dominant numerical method in the quantitative finance industry. At the same time, many new and innovative ways of using the method have been developed. In our example, we show a variety of Monte Carlo methods using pre-generated quasi-random number sequence with all option data sharing the same random number sequence.

In this document, we will cover how to download, build, and run the code. Since the algorithm is the same as the one shown in the Monte Carlo with RNG interface document, we are not going to repeat the same technical detail here. However, we do include a section covering the computational consequence of sharing a large portion of memory among a large number of worker threads.

### Code Access

Monte Carlo European Option with Pre-generated Random numbers is maintained by Shuo Li and is available under the BSD 3-Clause Licensing Agreement. The code runs natively on Intel® Xeon Phi™ coprocessor (referred to as “coprocessor” in this document) in a single node environment.

To access the code and test workloads:

Go to the source location to download the `MonteCarlorsrc.tar`

file.

### Build Directions

Perform these steps to rebuild the program:

- Install Intel® Composer XE 2013 SP 2 on your system
- Source the environment variable script file
`compilervars.csh`

under`/pkg_bin`

- Untar the montecarlo.tar and type make to build the binary
- Run the make command as follows:

icpc -DFPFLOAT -O3 -ipo -mmic -fno-alias -opt-threads-per-core=4 -openmp -restrict -vec-report2 -fimf-precision=low -fimf-domain-exclusion=31 -no-prec-div -no-prec-sqrt -DCOMPILER_VERSION=""icpc-20140120"" -ltbbmalloc -o MonteCarloSP.knc MonteCarlo.cpp icpc -O3 -ipo -mmic -fno-alias -opt-threads-per-core=4 -openmp -restrict -vec-report2 -fimf-precision=low -fimf-domain-exclusion=31 -no-prec-div -no-prec-sqrt -DCOMPILER_VERSION=""icpc-20140120"" -ltbbmalloc -o MonteCarloDP.knc MonteCarlo.cpp

This process will make two executable files:

- For Single Precision processing: MonteCarloSP.knc
- For Double Precision processing: MonteCarloDP.knc

### Run Directions

**Copy the following files to the coprocessor card:**

[prompt]$ scp MonteCarloSP.knc yourhost-mic0: [prompt]$ scp MonteCarloDP.knc yourhost-mic0: [prompt]$ scp /opt/intel/composerxe/lib/mic/libiomp5.so yourhost-mic0: [prompt]$ scp /opt/intel/composerxe/tbb/lib/mic/libtbbmalloc.so yourhost-mic0: [prompt]$ scp /opt/intel/composerxe/tbb/lib/mic/libtbbmalloc.so.2 yourhost-mic0:

**Enable the turbo on the coprocessor:**

[prompt]$ sudo /opt/intel/mic/bin/micsmc --turbo status mic0 (Turbo Mode Status): Turbo Mode is DISABLED mic1 (Turbo Mode Status): Turbo Mode is DISABLED [prompt]$ sudo /opt/intel/mic/bin/micsmc --turbo enable Information: mic0: Turbo Mode Enable succeeded. Information: mic1: Turbo Mode Enable succeeded.

**Make sure you have the Intel® Xeon Phi™ coprocessor C0-7120P/7120X/7120:**

Device No: 0, Device Name: mic0 Version Flash Version : 2.1.03.0386 SMC Firmware Version : 1.15.4830 SMC Boot Loader Version : 1.8.4326 uOS Version : 2.6.38.8-g2593b11 Device Serial Number : ADKC31303815 Board Board SKU : C0-7120P/7120X/7120 ECC Mode; : Enabled SMC HW Revision : Product 300W Active CS Cores Total No of Active Cores : 61 Voltage : 1010000 uV Frequency : 1333333 kHz

**Set the environmental variables and invoke the executable files from the host OS environment:**

[sli@cthor-knc1 mc_native]$ ssh yourhost-mic0 "export LD_LIBRARY_PATH=.;export OMP_NUM_THREADS=244;export KMP_AFFINITY='compact,granularity=fine';./MonteCarloSP.knc" Monte Carlo European Option Pricing in Single Precision Compiler Version = 14 Release Update = 2 Build Time = Jun 2 2014 16:34:27 Path Length = 262144 Number of Options = 15998592 Block Size = 16384 Worker Threads = 244

Initializing data... Starting options pricing. Parallel simulation completed in 27.931418 seconds. Validating the result... L1_NORM = 2.934E-06 AVERAGE RESERVE = 372.881 Max Error = 1.265E-04 Max Error Index = 4275990 ========================================== Total Cycles = 37241891023 Cyc/opt = 2327.823 Time Elapsed = 27.931 Options/sec = 572781.226 ========================================== [sli@cthor-knc1 mc_native]$ ssh yourhost-mic0 "export LD_LIBRARY_PATH=.;export OMP_NUM_THREADS=244;export KMP_AFFINITY='compact,granularity=fine';./MonteCarloDP.knc" Monte Carlo European Option Pricing in Double Precision Compiler Version = 14 Release Update = 2 Build Time = Jun 2 2014 16:34:28 Path Length = 262144 Number of Options = 15998592 Block Size = 8192 Worker Threads = 244 Initializing data... Starting options pricing. Parallel simulation completed in 132.224465 seconds. Validating the result... L1_NORM = 2.919E-06 AVERAGE RESERVE = 374.393 Max Error = 1.215E-04 Max Error Index = 4513563 ========================================== Total Cycles = 176299286899 Cyc/opt = 11019.675 Time Elapsed = 132.224 Options/sec = 120995.702 ==========================================

The program processes 15M sets of option data. There is an average of 64K data sets on each of 244 threads. Each thread prices one option. Each thread first draws a random number from the pregenerated random number sequence, then it uses it as a sample of stock movement, based on many samples and the European payoff formula, it calculates the stock value and confidence intervals.

The program was built on the host and executes on the coprocessor. For each option data set, it calculates the option value and the confidence interval. Result validation is part of the benchmark. It measures the average error between the calculation result and the result from the Black-Scholes-Merton [2] formula.

This benchmark runs on a single node of the coprocessor. It can also be modified to run in a cluster environment. The program reports a latency measure in seconds spent in pricing activities and also a throughput measure using 15M options. The last number, Options/sec, gives an indication of how fast the benchmark runs.

### Quasirandom Number Generator

The algorithm for the Monte Carlo European Option with Pre-generated Random number program is the same as the one used for the Monte Carlo European Option with RNG interface. The only difference between the two workloads is how the random numbers are generated and whether they are used in pricing a different option or not.

In this implementation of Monte Carlo, random numbers are generated by using the inverse of the cumulative normal distribution function. Given the function y= Ø(x), where Ø(x) is the normal cumulative distribution function. The transformation method Ø^{-1}(y) takes y uniformly distributed on (0, 1) and maps a normal distributed sequence. Intel® C/C++ Composer provide an optimized implementation of Ø^{-1}(y) called cdfnorminv(). The whole random number could be as simple as this. It’s both parallelized and vectorized by the Intel® Compiler using OpenMP* directives.

#pragma omp parallel for for (int k = 0; k < RAND_N; k++) samples[k] = cdfnorminv((k+1.0)/(RAND_N+1.0));

### Memory Access Optimization – Cache Blocking

In this implementation of Monte Carlo European Options, all threads share the pregenerated 256K random numbers, which have a footprint of 1 MB for single precision, 2 MB for double precision. These random numbers in their entirety do not fit in the worker thread’s L2 cache. Each thread can only bring a portion of the data into the cache. When it finishes the current portion, it generates an L2 miss event that triggers the ring interconnect to bring more data into L2 cache. If threads are not coordinated in accessing these data, each thread may access any portion of this 1 MB of data, and the ring interconnect would be too busy to satisfy L2 cache misses from all 244 threads. If this happens, the ring interconnect’s bandwidth becomes the bottleneck even though only 1 MB of data is involved. The uncoordinated memory access makes the 1 MB data get accessed 244 times, creating a bandwidth demand equivalent of 244 MB.

Coordinated access ensures the data is fetched once from memory, when the first thread requests the data. The data then crosses the ring interconnect to land in the first thread’s L2 cache. When other threads need the data, they copy the data from the first thread’s L2 cache into their own L2 caches. Each thread makes all the accesses to the data block while it remains in every thread’s L2 cache. Since memory access is coordinated, 1 MB of data crosses the ring interconnect once, and thus memory bandwidth pressure is relieved and reduced.

To maximize of the capacity of L2 cache, we need to calculate how many datasets can fit into each worker thread’s L2 cache. Each core has 512 KB of physical cache shared by 4 threads. Each thread can have 128 K for all its needs. Subtracting the runtime library and compiler spill and fill’s need, only 64 K is under the application’s control. 64 KB is 16K single precision numbers and 8K double precision floating point numbers.

We can optimize the memory accesses for this Monte Carlo European options to process a 16K SP block of random numbers at a time. In other words, we divide 256K random numbers into 8 16K SP data blocks and 16 8K DP data blocks.

We also have to save the intermediate, partial results in memory. When all data blocks are processed, we then process the partial results and turn them into the final results.

### Other Implementation Notes

In our implementation, memory access optimization happens on the outer-most loop. It is a serial loop by design and processes the 8 or 16 blocks of memory. The middle loop traverses all the option data and it is parallelized with OpenMP using a work-sharing construct. The innermost loop processes all the random data on the current memory blocks that fits on every thread’s cache. At the end of middle loop, the intermediate result is saved into per option storage so that intermediate data from different memory blocks are combined together.

After the triple-nested loop, a separate loop, which was parallelized with OpenMP and vectorized at same time, fetches the intermediate results and turns them into the workload output.

Aligned memory allocation interface from Intel® Threading Building Blocks is used to allocate the aligned memory with cache-friendly behavior.

### Source Code for the Monte Carlo Core Simulation

The following is a core part of Monte Carlo in double and single precision. The full source code and build scripts are also available through the link in the Code Access section.

#pragma omp parallel for for (int k = 0; k < RAND_N; k++) samples[k] = cdfnorminv((k+1.0)/(RAND_N+1.0)); start_cyc = _rdtsc(); for(int block = 0; block < nblocks; ++block) { const FP_TYPE *start = samples + block * RAND_BLOCK_LENGTH; const FP_TYPE *end = samples + (block + 1) * RAND_BLOCK_LENGTH; #pragma omp parallel for for(int opt = 0; opt < OPT_N; opt++) { const FP_TYPE VBySqrtT = VLog2E * SQRT(OptionYearsList[opt]); const FP_TYPE MuByT = MuLog2E * OptionYearsList[opt]; const FP_TYPE Y = StockPriceList[opt]; const FP_TYPE Z = OptionStrikeList[opt]; FP_TYPE v0 = 0.0; FP_TYPE v1 = 0.0; #pragma vector aligned #pragma loop count (RAND_N) #pragma simd reduction(+:v0) reduction(+:v1) #pragma unroll(UNROLLCOUNT) for(const FP_TYPE *hrnd = start; hrnd < end; ++hrnd) { const FP_TYPE result = std::max(TYPED_ZERO, Y*EXP2(VBySqrtT*(*hrnd) + MuByT) - Z); v0 += result; v1 += result*result; } CallResultList[opt] += v0; CallConfidenceList[opt] += v1; } } #pragma omp parallel for for(int opt = 0; opt < OPT_N; opt++) { const FP_TYPE val = CallResultList[opt]; const FP_TYPE val2 = CallConfidenceList[opt]; const FP_TYPE exprt = EXP2(RLog2E*OptionYearsList[opt]); CallResultList[opt] = exprt * val * INV_RAND_N; const FP_TYPE stdDev = sqrtf((F_RAND_N * val2 - val * val) * STDDEV_DENOM); CallConfidenceList[opt] = (FP_TYPE)(exprt * stdDev * CONFIDENCE_DENOM); } end_cyc = _rdtsc(); double elapsed_time = (end_cyc-start_cyc)/FREQ;

### About the Author

Shuo Li works for the Intel Software and Service Group. His main interests are parallel programming and application software performance. In his recent role as a staff software performance engineer covering the financial service industry, Shuo works closely with software developers and modelers and helps them achieve high performance with their software solutions. Shuo holds a Master's degree in Computer Science from the University of Oregon and an MBA degree from Duke University.

### References and Resources

^{[1]}Intel® Xeon® processor: http://www.intel.com/content/www/us/en/processors/xeon/xeon-processor-e7-family.html

^{[2]}Intel® Xeon Phi™ coprocessor: https://software.intel.com/en-us/articles/quick-start-guide-for-the-intel-xeon-phi-coprocessor-developer