Download Available under the Intel Sample Source Code License Agreement license.
Background
Monte Carlo is a numerical method that uses statistical sampling techniques to approximate solutions to quantitative problems. The name comes from the famous casino in the principality of Monaco, where a roulette table provides uncertainty outcomes just like series of random numbers. The contemporary version of the Monte Carlo algorithm was first used by Stanislaw Ulam, while he was working on the Manhattan project in the mid1940s. Nicholas Metropolis was first to make the connection between the casino and the algorithm and coined the term Monte Carlo to refer to any numerical simulation algorithm that involves a random number generator. John von Neumann was the first to implement Monte Carlo on the ENIAC Computer in the late 1940s. Since then, Monte Carlo has been widely used in engineering physics, molecular dynamics, and in calculating integrals with complicated boundary conditions.
In 1973, Fisher Black and Myron Scholes published their historical paper and introduced what later became known as the BlackScholes Option Pricing model for financial derivatives. As the rest of the world was still trying to digest the BlackScholes Model, an actuarial professor from the University of British Columbia, Phelim Boyle introduced the Monte Carlo method to Finance and successfully used it as an alternate way to get the same result as the BlackScholes Model. In his article, he takes the example of a European call option and calculates its price using the Monte Carlo method.
In this paper, we use the same numerical problem as an example to highlight various techniques and practices to achieve high performance computing on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors.
Code Access
The Monte Carlo European Option with RNG interface is maintained by Shuo Li and is available under the BSD 3Clause Licensing Agreement. The code supports the asynchronous offload of the Intel Xeon processor (referred to as “host” in this document) with the Intel Xeon Phi coprocessor (referred to as “coprocessor” in this document) in a single node environment.
To access the code and test workloads:
Go to source location to download the MonteCarloRNGsrc.tar file.
Build Directions
Here are the steps you need to follow in order to rebuild the program:
 Install the Intel® Composer XE 2013 SP 2 on your system
 Source the environment variable script file compilervars.csh under /pkg_bin
 Untar the montecarlorng.tar file, type make to build the binary
 Issue the make command, be unconditional, and be silent using the –Bs option
[prompt]$ make –Bs
Run Directions
Copy the following files to the Intel Xeon Phi coprocessor card.
[prompt]$ scp MonteCarloRNGSP.knc yourhostmic0: [prompt]$ scp MonteCarloRNGDP.knc yourhostmic0: [prompt]$ scp /opt/intel/composerxe/lib/mic/libiomp5.so yourhostmic0: [prompt]$ scp /opt/intel/composerxe/tbb/lib/mic/libtbbmalloc.so yourhostmic0: [prompt]$ scp /opt/intel/composerxe/tbb/lib/mic/libtbbmalloc.so.2 yourhostmic0: [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_intel_lp64.so yourhostmic0: [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_sequential.so yourhostmic0: [prompt]$ scp /opt/intel/composerxe/mkl/lib/mic/libmkl_core.so yourhostmic0:
Turn on the turbo mode on your Intel Xeon Phi coprocessor card.
[prompt]$ sudo /opt/intel/mic/bin/micsmc turbo enable Information: mic0: Turbo Mode Enable succeeded.
Invoke the binary and set the environmental variable for the execution from the host.
[prompt]$ ssh yourhostmic0 "export LD_LIBRARY_PATH=.;export OMP_NUM_THREADS=244;export KMP_AFFINITY='compact,granularity=fine';./MonteCarloRNGSP.knc" Monte Carlo European Option Pricing Single Precision Compiler Version = 14 Release Update = 2 Build Time = Jun 2 2014 12:22:43 Path Length = 262144 Number of Options = 999912 Block Size = 16384 Worker Threads = 244 Starting options pricing... Parallel simulation completed in 21.439754 seconds. Validating the result... L1_Norm = 4.812E04 Average RESERVE = 12.872 Max Error = 8.035E02 ========================================== Total Cycles = 28586338291 Cyc/opt = 28588.854 Time Elapsed = 21.440 Options/sec = 46638.222 ========================================== [prompt]$ ssh yourhostmic0 "export LD_LIBRARY_PATH=.;export OMP_NUM_THREADS=244;export KMP_AFFINITY='compact,granularity=fine';./MonteCarloRNGDP.knc" Monte Carlo European Option Pricing Double Precision Compiler Version = 14 Release Update = 2 Build Time = Jun 2 2014 12:22:44 Path Length = 262144 Number of Options = 999912 Block Size = 8192 Worker Threads = 244 Starting options pricing... Parallel simulation completed in 47.075885 seconds. Validating the result... L1_Norm = 4.812E04 Average RESERVE = 12.920 Max Error = 8.034E02 ========================================== Total Cycles = 62767847297 Cyc/opt = 62773.371 Time Elapsed = 47.076 Options/sec = 21240.429 ==========================================
The program priced about a million sets of option input data. If you divide 1 million among 244 threads, you will get 4098.36. Without losing generality, let’s simplify the number of options each thread runs to a round number of 4096, or 4k, and the total number of options the program will price is 999,912. For each option, the program first generates a random number sequence that is independently and identically distributed or i.i.d. for short. We cover details of generating random number sequences in a later section. We then use these random numbers as samples of stock movement with the European payoff formula and calculate the stock value and confidence interval in a formula shown in the next section.
The program was built on the host and executes on the Intel Xeon Phi coprocessor. For each option data set, it calculates the option values and the confidence intervals. Result validation is part of the benchmark. It measures the average error between the calculated result and the result from BlackScholesMerton ^{[2]} formula.
This benchmark runs on a single node of an Intel Xeon Phi coprocessor. It can also be modified to run in a cluster environment. The program reports a latency measure in seconds spent in pricing activities and also a throughput measure using total number of options priced over the elapsed time, which was printed out as the last performance number in Options/sec.
Generating and Using Random Numbers in Monte Carlo Methods
Since Monte Carlo is a numerical method based on the simulation of random variables, the implementation of this algorithm starts with identifying a random number generator. VSL, the vector statistical library component of the Intel® Math Kernel Library (Intel® MKL), provides a variety of random number generators for different distributions. Our implementation uses the Mersenne Twister random number generator using the normal distribution. VSL is part of the Intel® C++ Composer XE 2013 that we use to build applications for Intel Xeon processor and Intel Xeon Phi coprocessors.
Using Random Number Generators
Inside VSL a random number sequence is identified as a stream. Each stream delivers random numbers in a given distribution in vector interface. To manage the complexity, VSL uses two implementation layers to support different RNGs and different distributions. At the lower level, all core random number generation routines are implemented to deliver random numbers in uniform distribution. At the higher level, transformation functions are applied to turn uniform distribution to the distribution the user desires.
To use VSL, follow this typical 5step process:
 Specify RNG streams
 Initialize and create the random number streams
 Request a vector of random numbers in a specific distribution
 Consume the random number sequences in the simulation
 Destroy the RNG streams
Here is how the process works for calculating our European Call options:
 Specify a random number stream. In our benchmark we are going to use all 61 cores and create 4 threads per core. In total, we can have 244 threads. To allow each thread to price an option independently, we should give each thread an independent random number stream. We need to declare an RNG state descriptor for each thread. We can declare these data structures before we create any threads.
#include <mkl_vsl.h> // Declare random number buffer and random number sequence descriptors float *samples[MAX_THREADS]; VSLStreamStatePtr Streams[MAX_THREADS];
VSLStreamStatePtr
is a C/C++ opaque data structure and Streams is an array of still uninitialized opaque data. 
Initialize and create the random stream and set up the stream with a basic random number generator and an integer seed.
Once we have created worker threads, each thread will allocate its own buffer to receive the RNG sequences and initialize the stream descriptor so that it knows what basic random number generator to use and whether the threads need to work together to ensure mutual independency.
samples[threadID] = (float *)scalable_aligned_malloc(RAND_BLOCK_LENGTH * sizeof(float), SIMDALIGN); vslNewStream(&(Streams[threadID]), VSL_BRNG_MT2203 + threadID, RANDSEED);
Intel MKL provides the following routine to create and initialize the stream you declared:
vslNewStream (VSLStreamStatePtr &Randomstream, int brng, int seed )
Randomstream
is a reference to the uninitialized random stream you just declared. It takes a reference because the routine passes back an initialized stream state descriptor.brng
is an enumeration parameter specifying which basic random number to use. VSL_BRNG_MT2203 specifies a family of modified Mersenne Twister[^{10]} pseudo generators, each of which is i.d.d. In our problem, each thread uses its own random number generator identified by VSL_BRNG_MT2203 plus its thread ID. You can find more information on the basic random number generator here BRNG parameter definitions.seed
is an integer for the random stream to ensure the reproducibility for debugging purposes.  Request a vector of random numbers of a specific distribution. Using the random stream descriptor, we can call one of the distribution generators to produce a sequence of random numbers with a certain probability distribution and a specific data type. The result will be placed in a userprovided buffer in the form of a C array.
float *rand = samples[threadID]; vsRngGaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF, Streams[threadID], RAND_BLOCK_LENGTH, rand, MuByT, VBySqrtT);
Intel MKL uses the following routine to generate normal distributed random numbers.
vsRngGaussian(method, stream, n, r, a, sigma)
where:method
can beVSL_RNG_METHOD_GAUSSIAN_ICDF
, and other values are listed herestream
initialized random streamsn
the number of random numbers to be requestedr
address of the receiving buffer, usually a C array declared to hold the random numbera
first parameter to the distribution. For normal distribution, it’s the mean.sigma
second parameter to the distribution. For normal distribution, it’s the std deviation. 
Consume the sequence of random numbers.
for(int i=0; i < RAND_BLOCK_LENGTH; i++) { float callValue = Y * exp2f(rand[i])  Z; callValue = (callValue > 0) ? callValue : 0; v0 += callValue; v1 += callValue * callValue; }

Delete the random stream.
Use
vslDeleteStream
(VSLStreamStatePtr &stream
) to delete the stream declared in step 1. Since we created the stream in the worker threads, it’s customary to destroy the stream in the worker thread.
Other Implementation Notes
In our implementation, worker threads are created using OpenMP* parallel directives. Each thread creates its own random number streams, generates the unique option input data, prices the option, and then validates the result. Each thread’s input data is generated by calling C runtime library rand_r in a unique sequence identified by its thread ID, which guarantees each thread will produce a unique and reproducible sequence.
The aligned memory allocation interface from Intel® Threading Building Blocks (Intel® TBB) are used to allocate the aligned memory blocks that are also cachefriendly to the worker threads. This means, these memory blocks have to be disposed of using the corresponding API. Intel TBB is part of our minimum build requirement.
OpenMP reduction operations are used to calculate the statistical properties of all the options. It’s also used to find the maximum error from the threads.
Source Code for MonteCarloRNG Core
The following is a core part of MonteCarloRNG using single precision data types. Double precision is almost identical.
// Declare random number buffer and random number sequence descriptors float *samples[MAX_THREADS]; VSLStreamStatePtr Streams[MAX_THREADS]; // calculate the block number based on block size const int nblocks = RAND_N/RAND_BLOCK_LENGTH; #pragma omp parallel reduction(+ : sum_delta) reduction(+ : sum_ref) reduction(+ : sumReserve) reduction(max : max_delta) { #ifdef _OPENMP int threadID = omp_get_thread_num(); #else int threadID = 0; #endif unsigned int randseed = RANDSEED + threadID; srand(randseed); float *CallResultList = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN); float *CallConfidenceList = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN); float *StockPriceList = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN); float *OptionStrikeList = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN); float *OptionYearsList = (float *)scalable_aligned_malloc(mem_size, SIMDALIGN); for(int i = 0; i < OPT_PER_THREAD; i++) { CallResultList[i] = 0.0f; CallConfidenceList[i] = 0.0f; StockPriceList[i] = RandFloat_T(5.0f, 50.0f, &randseed); OptionStrikeList[i] = RandFloat_T(10.0f, 25.0f, &randseed); OptionYearsList[i] = RandFloat_T(1.0f, 5.0f, &randseed); } samples[threadID] = (float *)scalable_aligned_malloc(RAND_BLOCK_LENGTH * sizeof(float), SIMDALIGN); vslNewStream(&(Streams[threadID]), VSL_BRNG_MT2203 + threadID, RANDSEED); #pragma omp barrier if (threadID == 0) { printf("Starting options pricing...n"); sTime = second(); start_cyc = _rdtsc(); } for(int opt = 0; opt < OPT_PER_THREAD; opt++) { const float VBySqrtT = VLog2E * sqrtf(OptionYearsList[opt]); const float MuByT = MuLog2E * OptionYearsList[opt]; const float Y = StockPriceList[opt]; const float Z = OptionStrikeList[opt]; float v0 = 0.0f; float v1 = 0.0f; for(int block = 0; block < nblocks; ++block) { float *rand = samples[threadID]; vsRngGaussian (VSL_RNG_METHOD_GAUSSIAN_ICDF, Streams[threadID], RAND_BLOCK_LENGTH, rand, MuByT, VBySqrtT); #pragma vector aligned #pragma simd reduction(+:v0) reduction(+:v1) #pragma unroll(4) for(int i=0; i < RAND_BLOCK_LENGTH; i++) { float callValue = Y * exp2f(rand[i])  Z; callValue = (callValue > 0) ? callValue : 0; v0 += callValue; v1 += callValue * callValue; } } const float exprt = exp2f(RLog2E*OptionYearsList[opt]); CallResultList[opt] = exprt * v0 * INV_RAND_N; const float stdDev = sqrtf((F_RAND_N * v1  v0 * v0) * STDDEV_DENOM); CallConfidenceList[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM); } //end of opt #pragma omp barrier if (threadID == 0) { end_cyc = _rdtsc(); eTime = second(); printf("Parallel simulation completed in %f seconds.n", eTimesTime); printf("Validating the result...n"); } double delta = 0.0, ref = 0.0, L1norm = 0.0; int max_index = 0; double max_local = 0.0; for(int i = 0; i < OPT_PER_THREAD; i++) { double callReference, putReference; BlackScholesBodyCPU( callReference, putReference, StockPriceList[i], OptionStrikeList[i], OptionYearsList[i], RISKFREE, VOLATILITY ); ref = callReference; delta = fabs(callReference  CallResultList[i]); sum_delta += delta; sum_ref += fabs(ref); if(delta > 1e6) sumReserve += CallConfidenceList[i] / delta; max_local = delta>max_local? delta: max_local; } max_delta = max_local>max_delta? max_local: max_delta; vslDeleteStream(&(Streams[threadID])); scalable_aligned_free(CallResultList); scalable_aligned_free(CallConfidenceList); scalable_aligned_free(StockPriceList); scalable_aligned_free(OptionStrikeList); scalable_aligned_free(OptionYearsList); }//end of parallel block
Appendix
About the Author
Shuo Li works for the Intel Software and Service Group. His main interests are parallel programming and application software performance. In his recent role as a staff software performance engineer covering the financial service industry, Shuo works closely with software developers and modelers and helps them achieve high performance with their software solutions.
Shuo holds a Master's degree in Computer Science from the University of Oregon and an MBA degree from Duke University.
References and Resources
^{[1]}Option Pricing: A Simplified Approach (1979) by John C. Cox, Stephen A. Ross, and Mark Rubinstein:
^{[2]}Theorie de la Speculation, Annales Scientifiques de l´ Ecole Normale Sup´erieure, 21–86. Bachelier, L. (1900). reprinted 1995 Editions Jacques Gabay
^{[3]}Hull, John C, Options, Futures, and other Derivatives, 7th Edition PrenticeHull, 2009
^{[4]}Wilmott, P., Derivatives: The Theory and Practice of Financial Engineering. Chichester: Wiley, 1998
^{[5]}Cox, J. C. Ross, S. A. and Rubinstein, M. Option Pricing: A simplified Approach Journal of Financial Economics 7 (October 1979): 22964
^{[6]}Black, F., and M. Scholes, The Pricing of Options and Corporate Liabilities Journal of Political Economy, 81(May/June 1973): 63759
^{[7]}Merton, R. C. Theory of Rational Option Pricing, Bell Journal of Economics and Management Science, 4(Spring 1973): 14183
^{[8]}Boyle, P. P., Options: A Monte Carlo Approach Journal of Financial Economics, 4 (1977) 32338
^{[9]}Black, Fischer and Scholes, Myron The Pricing of Options and Corporate Liabilities (MayJun 1973)
^{[10]}Matsumoto, M., and Nishumira T. Mersenne Twister: A 623Dimensionally Equidistributed Uniform PseudoRandom Number Generator, ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, Pages 330, January 1998
^{[11]}Intel Xeon processor: http://www.intel.com/content/www/us/en/processors/xeon/xeonprocessore7family.html
^{[12]}Intel Xeon Phi coprocessor: https://software.intel.com/enus/articles/quickstartguidefortheintelxeonphicoprocessordeveloper
License
Intel sample source is provided under the Intel Sample Source License Agreement.
Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 18005484725, or go to: http://www.intel.com/design/literature.htm
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
Intel, the Intel logo, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.