Case Study: Achieving High Performance on Monte Carlo European Option on Intel® Xeon Phi™ Coprocessors

1. Introduction

Monte Carlo uses statistical computing method to solve complex scientific computing problems. It innovatively uses random numbers to simulate the uncertainty of inputs to a problem and makes use of computer to process the repeated sampling of the parameter to solve the problem that otherwise impossible to obtain a deterministic result. This method was originally pioneered by nuclear physicists involved in Manhattan projects in late 40s. It is named after the biggest casino in the principality of Monaco.

Phelim Boyle, a Canadian Financial mathematician, was credited as the first one to use the Monte Carlo method in Quantitative finance. In his historical paper Options: a Monte Carlo Approach on May, 1977 issues of Journal of Financial Economics, he developed simulation method to solve European option value problem and independently verified the Black-Scholes formula. Since then the Monte Carlo method has become widespread among Wall Street derivative pricing and risk management professionals. In a recent report, up to 40 percent of quantitative finance problem involves Monte Carlo method. Monte Carlo simulation is still an area that attracts researchers from academia and investment banks alike. 1996, Mark Broadie and Paul Glasserman applied Monte Carlo to Asian options. In 2001, Francis Longstaff, and Eduardo Schwartz developed the first Monte Carlo algorithm for pricing American options. In 2006, Mike Giles from Oxford University created multilevel Monte Carlo as a cost effective way to solve high dimensional problem to the user specified accuracy.

While Monte Carlo make many high dimensional problem tractable with simple elegant mathematical expression with perhaps much less memory requirement than direct PDE and the lattice method, it demand in raw computation power is equally alarming, especially areas where deterministic and accurate results are mandatory. Most of the time, it takes hundreds of thousands sometimes millions of sampling for a satisfactory result. It takes 100 times more samples, simply to get accuracy at the next decimal digit level. With so many problems that are computationally challenging, the Finance industry looks for an innovative architecture to satisfy its need for raw computation power.

1.1 Intel® Many Integrated Core Architecture (Intel® MIC Architecture)

Intel’s many integrated core architecture provided an ideal execution vehicle for Monte Carlo related applications. Building around the same Instruction set architecture as IA32/Intel 64 and using the very same programming model as Intel® Xeon® Multicore processor, Intel® Xeon Phi™ coprocessor product, based on Intel® Many Integrated Core Architecture further extended the parallel execution infrastructure and allows highly parallel applications to reach the level of performance far exceed that of a repurposed graphic card (sometimes referred to as a GPGPU) with little or no modification of source code. End user benefits from acceleration of software development and deployment as well as the acceleration of software performance.

Intel® Xeon Phi™ coprocessor (the coprocessor) is the first product based on the Intel many integrated core architecture. It derives its instruction set architecture from IA32-/Intel 64 and introduces an entirely new set of instructions, registers, and programming paradigms targeted at delivering high levels of performance for highly parallel applications.

1.1.1 Processor Architecture

At the microprocessor architecture level, the Intel® Xeon Phi™ coprocessor is an SMP processor comprised of more than 50 cores organized around two uni-directional rings. There are 8 on-die memory controllers supporting 16 GDDR5 channels and expected to deliver up to 5.5 GT/sec. At the core level, each coprocessor core is a fully functional, in-order core in its own right, capable of running IA instructions independently of the other cores. Each core is also given hardware multi-threading support and supports four hardware contexts or threads. At any given clock cycle each core can issue up to two instructions from any single context.

1.1.2 Core and Vector Processor Unit

The coprocessor core implements full sixteen general purpose 64-bit registers as well as most of the new instructions associated with 64-bit extension. However, the cores only vector instruction set supported are the Intel® Initial Many Core Instructions (Intel® IMCI). There is no support for Intel® MMX™, Intel® SSE or Intel® AVX in the cores, although the math co-processor (x87) remains integrated and fully functional.

Designed from the scratch, is an all-new Vector Processing Unit or VPU. It is a 512-bit SIMD engine, capable of executing 16-wide single precision, or 8-wide double precision floating point SIMD operations with full support of all four rounding modes as articulated by IEEE 754R. Inside the Vector Processing Unit, a new extended math unit provides the fast implementation of the single precision transcendental functions: reciprocal, reciprocal square root, base 2 logarithm, and base 2 exponential functions using lookup tables. These hardware implemented function can achieve high throughput of 1-cycle or 2-cycle other transcendental functions can be derived from elementary functions.

1.1.3 Cache Architecture

Intel® Xeon Phi™ coprocessor’s Level One (L1) cache was designed to accommodate higher working set requirements for four hardware contexts per core, it has a 32 KB L1 instruction cache and 32 KB L1 data cache. Associativity is 8-way, with a 64 byte cache line. Bank width is 8 bytes. Data return can now be out-of-order. The access time has 3-cycle latency.

The 512 KB unified Level Two (L2) cache comprises 64 bytes per way with 8-way associativity, 1024 sets, 2 banks, 32GB (35 bits) of cacheable address range. The expected idle access time is approximately 80 cycles.

The L2 cache has a streaming hardware prefetcher that can selectively prefetch code, read, and RFO (Read-For-Ownership) cache lines into the L2 cache. It supports 16 streams that can bring in up to a 4-KB page of data. Once a stream direction is detected, the prefetcher can issue up to 4 multiple prefetch requests. The L2 in the Intel® Xeon Phi™ coprocessor now has ECC support. The replacement algorithm for both the L1 and L2 caches is based on a pseudo-LRU implementation.

1.2 Software Development Environment for Intel® Xeon™ and Intel® Xeon Phi™ Coprocessor

1.2.1 Intel® Parallel Composer XE 2013

In September 2012, Intel publically announced Intel® Parallel Composer XE 2013. This is the software development toolkit for Intel multicore and Intel® MIC Architecture product line. In this software package, you can find Intel® C/C++ and FORTRAN compilers along with Intel® Application Debugger, Intel® Performance Libraries such Intel TBB and Intel MKL that developers can readily use to build applications and libraries on Intel® Multicore and Intel® MIC Architecture product families. Integrated Intel Multicore and Intel® MIC Architecture support is one of the major features of the current release.

1.2.2 Intel® Xeon™ host processor system

On the system with Intel Xeon Phi Coprocessors, the application developers can create offload applications, which makes the coprocessor capability available to the Intel Multicore hosted applications, or created native application, which run in its entirety on Intel® Xeon Phi™ Coprocessor card. In this paper, we use a dual-socket Xeon platform as the host system and Intel® Xeon Phi™ Coprocessor card is connected to this host system via PCIe interface. Each socket of the host is populated with an Intel® Xeon™ E5-2670 8-core CPU running 2.6Ghz. Intel® Xeon™ E5-2670 supports Intel Advanced vector extension for SIMD parallelism.

1.2.3 Intel® Xeon Phi™ Coprocessors device

The coprocessor is a computing device running its own modified version of Linux operating system known as MPSS. The host system can boot or shutdown the coprocessor through standard Linux device operation commands. The current MPSS version for the beta product is 2.1.4346-16. In this paper, we use model 7110 with 61 cores. Each core operates at 1.1 GHz. It has 8 GB memory and the core memory speed is 5.5 GHz.

2. Monte Carlo European Option Pricing Algorithm

In this section, we start with a little basic background on financial derivation and the stock option pricing in specific. We show how Monte Carlo method can be used to model the uncertain of the input, how each sample was taken and payoff function was determined based on the option specification. In the end, we arrive at an implementation of Monte Carle using Standard C/C++. We briefly mention the performance bottleneck for this implementation as we transition to the next section.

2.1 Option Pricing Background

In the Financial world, a derivative is a financial instrument, whose value depends on the value of other, more basic, underlying variables. Very often the variables underlying derivatives are the prices of traded assets. A stock option, for example, is a derivative whose value is dependent on the price of a stock. Not all variables derivatives dependent on are traded assets. Some of these variables can be snows falls at a certain resort, other may be the average weather temperatures in a specific time intervals, etc.

An option is a derivative that specifies a contract between two parties for a future transaction, known as exercise, on an asset at a reference price. The buyer of the option gains the right, but not the obligation, to engage in that transaction, while the seller incurs the corresponding obligation to fulfill the transaction. There are two types of option. A call option gives the holder the right to buy the underlying asset by a certain date for a certain price. A put option gives the holder the right to sell the underlying asset by a certain date for a certain price. The price in the contract is known as the strike price. The date in the contract is known as the expiration date. European option can be exercised only on the expiration date. American options can be exercised at any time up to the expiration date.

2.2 Pricing European Option using Monte Carlo Method

Monte Carlo simulation uses the risk-neutral valuation method to value an option. It samples a path to obtain the expected payoff in a risk-neutral world and then discount the payoff to current value using risk-free interest rate. Let’s consider a stock option for a stock with current price S and provides a payoff at time T. Assuming the interest are constant, we can value the derivatives as follow:

  1. Sample a random path for S in a risk-neutral world.
  2. Calculate the payoff from the derivative.
  3. Repeat step 1 and 2 to get many sample values of the payoff from the derivative in risk-neutral world.
  4. Calculate the mean of the sample payoffs to get an estimate of the expected payoff in a risk-neutral world.
  5. Discount the expected payoff at the risk-free rate to get an estimate of the value of the derivative.

Suppose the process followed by the underlying market variable in the Risk-neutral world is

where dWt is a Wiener process, µ is the expected return in a risk neutral work, and is the volatility. To simulate the path followed by lnS, we can divide the life of the derivative into N short interval of length and approximate equation 3.1 as

or equivalently

where is a random sample from (0,1), This enables the value of S at time to be calculated from the initial value of S, the value of 2 to be calculated from the value at time , and so on.

for each k between 1 and M. Here each is a draw from a standard normal distribution.

We know that the value of European options at the time of expiration, the call and put options

Using Monte Carlo, we can generate M numeric samples with the required (0, 1) distribution, corresponding to the underlying Wiener process, then average the possible end period stock profits, corresponding to each of the sample values:

Similar function can be obtained for European Put option.

The result is still the future value of the option at time T. Discount this value by a factor of e-rT, we get the present value of European Call options.

It follows from the central limit theorem to reduce the error by half, the number of sampling path required needs to be quadrupled. In another words, standard error for Monte Carlo converges at the rate of

The advantage of Monte Carlo simulation is that it can be used when the payoff depends on the path followed by the underlying variable S leading toward the expiration and the situation when payoffs take place multiple times during the life of the option. The strength of Monte Carlo method also lies in the situation when the payoff function involves multiple random variables, and all other analytical method failed, Monte Carlo then becomes the only choice.

2.3 Implementation of Monte Carlo European Option Pricing Algorithm

Implementation of Monte Carlo European Option pricing can be relatively easy once we have a payoff function from the previous section. First and foremost, we need to find a random number from the (0, 1), then we can use the payoff function to get the expected value and confidence interval for the underlying options. A C/C++ implementation may look like this.


float expectedCall(
    float S,
    float X,
    float MuByT,
    float VBySqrtT,
    float r)
{
    float eCall = S * exp(MuByT + VBySqrtT * r) - X;
    return (eCall > 0) ? eCall : 0;
}

float val = 0.0, val2 = 0.0;
for(int pos = 0; pos < RAND_N; pos++)
{
      float callValue =  expectedCall(Sval, MuByT, VBySqrtT, gen() );
      val  += callValue;
      val2 += callValue * callValue;
}


Let’s look at a hypothetic situation in which a firm has to calculate European options for millions of financial instruments. For each instrument, it has current price, strike price and option expiration time. For each set of data, calculate the European call options using Monte Carlo simulation. In the following case, we demonstrated the code fragments that price European call values for 32K options data sets using path length of 265K.


#include <stdio.h>
#include <sys/time.h>
#include <tr1/random>

#ifndef max
  #define max(a,b) (((a) > (b)) ? (a) : (b))
#endif

const int OPT_N = 512*64;
const int RAND_N = 1 << 18;
static const float      RISKFREE = 0.06;
static const float    VOLATILITY = 0.10;

double second()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}

float expectedCall(
        float S,
        float X,
        float MuByT,
        float VBySqrtT,
        float r)
{
        float eCall = S * exp(MuByT + VBySqrtT * r) - X;
        return (eCall > 0) ? eCall : 0;

}


void MonteCarlo(
        float *h_CallResult,
        float *h_CallConfidence,
        float *S,
        float *X,
        float *T
)
{
        typedef std::tr1::mt19937                     ENG;   // Mersenne Twister
        typedef std::tr1::normal_distribution<float> DIST;   // Normal Distribution
        typedef std::tr1::variate_generator<ENG,DIST> GEN;   // Variate generator

        ENG  eng;
        DIST dist(0,1);
        GEN  gen(eng,dist);

        double sTime = second();

        for(int opt = 0; opt < OPT_N; opt++)
        {
                float VBySqrtT = VOLATILITY * sqrt(T[opt]);
                float MuByT = (RISKFREE - 0.5 * VOLATILITY * VOLATILITY) * T[opt];
                float Sval = S[opt];
                float Xval = X[opt];
                float val = 0.0, val2 = 0.0;

                for(int pos = 0; pos < RAND_N; pos++){
                        float callValue =  expectedCall(Sval, Xval, MuByT, VBySqrtT, gen());
                        val  += callValue;
                        val2 += callValue * callValue;
                }

                float exprt = expf(-RISKFREE *T[opt]);
                h_CallResult[opt] = exprt * val / (float)RAND_N;
                float stdDev = sqrtf(((float)RAND_N * val2 - val * val)/ ((float)RAND_N * (float)(RAND_N - 1)));
                h_CallConfidence[opt] = (float)(exprt * 1.96 * stdDev / sqrtf((float)RAND_N));
        } //end of for
        double eTime = second();
        printf("\nComputation rate - %8.4f.\n", OPT_N/(eTime-sTime));
}


It’s worth mentioning that the above code sequence uses the random number generation facilities in C++ using the TR1 (C++ Standards Committee Technical Report 1) extensions. The C++ random number generation classes and functions are defined in the random header and contained in the namespace std::tr1.

At the core of any pseudorandom number generation software is a routine for generating uniformly distributed random integers. These are then used in a bootstrap process to generate uniformly distributed real numbers. The uniform real numbers are used to generate from other distributions through transformations, acceptance-rejection algorithms, etc.

C++ TR1 gives you several choices of core generators that it calls “engines.” The following four engine classes are supported in GCC 4.3.x and Visual Studio 2008 feature pack.

  • linear_congruential
  • mersenne_twister
  • subtract_with_carry
  • subract_with_carry_01

The C++ TR1 library supports non-uniform random number generation through distribution classes. These classes return random samples via the operator() method.

The template class variate_generator describes an object that holds an engine and a distribution and produces values by passing the wrapped engine object to the distribution object's operator().

Finally, compiled with GCC 4.4.5, the initial implementation of Monte Carlo for European options can price at a little more than 68 opt/sec on the dual-socket host system based on 8-core Intel®Xeon™ E5-2670 running at 2.6 GHz.

[sli@IDF-SLI-01 step1]$ ./MonteCarlo
Benchmarking the Parallel Computing Result...
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 Option with path length of 262144.

Computation rate -  68.7731.
Completed in 476.46543 seconds. Options per second:    68.77

Obviously this implementation has a lot of room for improvement. In the next section, we are going to take this implementation through a structured frame work and attempt to improve the performance along the way.

3. Stepwise Optimization Framework and Monte Carlo Optimization

Just like any scientific process, optimization requires a systematic and structured approach. In this section, we highlight an optimization framework that allows the practitioner to take holistic approach to application performance improvement. The objective of this framework is to achieve the highest performance with the best tools and library on hand.

This framework takes an application into four optimization stages. Each stage attempts to improve the application performance in one orthogonal direction by applying various techniques. Following this methodology, an application can achieve the highest performance possible on the Intel Architecture, at the same time avoid reinventing the wheels and also utilize all the possible parallel execution resources in a way that it is designed to.

3.1 Stepwise Optimization Framework

Stepwise Optimization Framework is a four-stage process. It helps the performance engineer achieve the best application performance in a shortest possible time. In another words, it reaches all kinds of parallelization in a shortest possible time, leaving nothing on the table.

The first stage is to take a complete implementation of the application to the optimization development environment. This environment should be able to create the optimizing code and should allow you to use the existing optimizing library. Second stage to optimize the operations you are performing. It ensures things are getting done, nothing else, and they are done optimally. Stage three is about Vectorization. In another words, how can we take advantage of SIMD instructions and explore synchronous parallelism in your application and express in a way the compiler can understand it. We try to maximize the performance of a single core. Fourth stage is threading parallelization. It’s a stage where we bring parallelism to problem with very little synchronous operations. The end objective is to max out potential of all cores. Fifth stage is scale your application from Multicore to the Intel® Many Integrated Core Architecture. It’s a stage applicable to highly parallel applications. The purpose is to evaluate the changes at the processor level and then perform additional optimization based on the differences such as memory bandwidth vs. processing capability, SIMD alignment requirement changes, Cache per core, etc. The purpose is to minimize the changes and maximize the performance as the execution target changes from one flavor of the Intel architecture (Intel® Xeon® processor) to another flavor of the Intel Architecture (Intel® Many Integrated Core architecture).

3.2 Stage 1: Leverage the Optimized Tools and Library

At the beginning of your optimization project, you need to select an optimizing development environment. The decision you make at this stage is going to have a profound influence in the later stage. Not only will you find the quality of your work is higher; you will also find you may have substantially less work to do. This is all because a good optimizing development environment can provide you with good compiler tools, optimized, ready-to-use libraries and also the debugging and profiling tools to pin-point exactly what the code is doing at the runtime. Sometimes, it’s difficult to find a single environment to get all you want. In that case, you have to stitch a solution together based on different tools. For example, if someone wants to optimize a java program with performance critical portion of the code in C/C++. He has to find a Java SDK, a C/C++ Compiler, a JVM environment and perhaps a system wide profiler. Sometime the solution is obvious; Intel® Parallel Composer XE integrates multicore and many core development environment. If you project involves scaling highly parallel application from Intel Xeon processor to Intel® Xeon Phi™ coprocessor, the Intel software tool kit is the only choice now. You want to choose it because the integrated performance sampling tools deliver performance monitor events that tell you exactly where you need to spend your optimization effort. This is really a night vision goggle that illuminates your targets.

Intel® Parallel Composer XE 2013 on Linux* is compatible with GCC 4.4.5 which is released with RHEL 6.1 Server edition. What you can compile with g++ can also be compiled with icpc, the g++ equivalent from Intel® Parallel Studio. For example, instead of g++ -o MonteCarlo -O2 MonteCarloStage00.cpp, you can issue icpc -o MonteCarlo -O2 MonteCarloStage00.cpp You will get instant boost of 1.43X. That’s all we can get if we don’t touch the source code. Our objective is to improve the performance, make the change necessary and to change the source code but still keep the calculation correct.

[sli@IDF-SLI-01 step1]$ icpc -o MonteCarlo -xAVX -O2 MonteCarloStage00.cpp
[sli@IDF-SLI-01 step1]$ ./MonteCarlo
Benchmarking the Parallel Computing Result...
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 Option with path length of 262144.
Computation rate -  98.1659.
Completed in 333.80235 seconds. Options per second:    98.17

Intel® Math Kernel Library (Intel® MKL) is precompiled library consist of basic math routines such as BLAS, LAPACK, and random number generation functions. MKL provides C and FORTRAN interface for these functions. You call these functions from C/C++ or from Fortran.

Efficiency of random number generation (RNG) is an indicator of the performance aspect of any Monte Carlo simulation. Like C++ TR1, Intel MKL also provides the random number generation routines, they also independent choices of different basic RNG engines and choice of distributions multiplied by the data types of 64-bit, 32-bit float, and 32-bit integer. Unlike C++ TR1, Intel MKL’s RNG interface comes in the traditional C and FORTRAN interface, so that you can use in both C, C++ and Fortran. In C/C++ you would add interface declarations in a .h file, and make MKL API calls and then link with precompiled MKL library. A major difference is that in C++ TR1, random number is delivered one at a time with a gen() method call from variate_generator class. While in Intel MKL, the process of creating a RNG engine object is being replaced by creating a RNG stream, the process of creating a distribution object is being replaced by selecting the right RNG interface call. One RNG interface call can deliver any number of random numbers. In our problem, 256K random numbers are delivered in one RNG interface call.


#include <stdio.h>
#include <sys/time.h>
#include "mkl_vsl.h"
#ifndef max
  #define max(a,b) (((a) > (b)) ? (a) : (b))
#endif

const int OPT_N = 512*64;
const int RAND_N = 1 << 18;
static const float      RISKFREE = 0.06;
static const float    VOLATILITY = 0.10;

#define RANDSEED 123

double second()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}

float expectedCall(
        float S,
        float X,
        float MuByT,
        float VBySqrtT,
        float r)
{
        float eCall = S * exp(MuByT + VBySqrtT * r) - X;
        return (eCall > 0) ? eCall : 0;

}

void MonteCarlo(
        float *h_CallResult,
        float *h_CallConfidence,
        float *S,
        float *X,
        float *T
)
{
float l_Random [RAND_N];
VSLStreamStatePtr Randomstream;
vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

double sTime = second();
vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, l_Random, 0.0, 1.0);

for(int opt = 0; opt < OPT_N; opt++)
{
float VBySqrtT = VOLATILITY * sqrt(T[opt]);
float MuByT = (RISKFREE - 0.5 * VOLATILITY * VOLATILITY) * T[opt];
float Sval = S[opt];
float Xval = X[opt];
float val = 0.0, val2 = 0.0;

for(int pos = 0; pos < RAND_N; pos++)
{
float callValue =  expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
val  += callValue;
val2 += callValue * callValue;
}

float exprt = expf(-RISKFREE *T[opt]);
h_CallResult[opt] = exprt * val / (float)RAND_N;
float stdDev = sqrtf(((float)RAND_N * val2 - val * val)/ ((float)RAND_N * (float)(RAND_N - 1)));
h_CallConfidence[opt] = (float)(exprt * 1.96 * stdDev / sqrtf((float)RAND_N));
} //end of for
double eTime = second();
printf("\nComputation rate - %8.4f.\n", OPT_N/(eTime-sTime));
}


We actually reduce the number of lines of code by changing Intel MKL RNG facility. The performance difference is stunning. More than 3X improvement on the original code.

[sli@IDF-SLI-01 step2]$ icpc -o MonteCarlo -xAVX -O2 -mkl MonteCarloStage1.cpp
[sli@IDF-SLI-01 step2]$ ./MonteCarlo
Benchmarking the Parallel Computing Result...
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 Option with path length of 262144.

Computation rate - 328.5590.
Completed in 99.73325 seconds. Options per second:   328.56

Changing RNG from C++ TR1 to MKL not only delivered 3.35X improvement on the original code, it also enabled the compiler to do inline function calls and vectorized the code in the inner loop which is what we are going to talk to in stage 3.

So far, we should have managed to make full use of optimized software offered to us as an optimized software development environment. Avoiding reinventing wheel can save us a lot of time and hassle. However, rarely could we stitch our project to high performance. Most likely we have to make some source code changes. Before we apply any parallel mechanism, let’s first make sure our code is lean and clean.

3.3 Stage 2: Scalar Serial optimization

Scalar Serial optimization makes sure the code we are going to parallelize most efficient. In other words, there is no redundancy in the code and it’s lean and clean. Also at this stage, a performance engineer would ensure that the select of precision of your data structure and accuracy of your numerical method is indeed dictated by the problems at hand and not overkill, where less expensive operations such as single precision are replaced by more expensive operations such as double precision.

C/C++ still inherited a strong weak typing culture from Programming Language C, where auto promotion is believed to achieve high precision. While this practice may or may not improve accuracy marginally, it for sure kills any chances of high performance. Here is a list other things to watch out for at scalar and serial optimization stage.

In Monte Carlo application, we have one transcendental function calls, natural-based exponential function inside the inner loop. Any improvement we can make to this area can be amplified into a big number.

It turns out that most of Intel Architecture can implement 2-based exponential call very fast. The throughput difference between natured-based and 2-based exponential function calls is 50% -80%. In the Intel Xeon Phi coprocessor, this number is much higher for single precision because a dedicated extended math unit implements 4 transcendental functions, reciprocal, reciprocal square root, base 2 log, and base 2 exponential in hardware instruction. This creates some opportunity where we can make some adjustment to the code so that base 2 versions of these functions are called instead of nature based.

The mathematical equivalence is simple

where LOG2E is E’s 2 base logarithm number and is one of the constant defined in math.h as M_LOG2E.

Fact does not give you quick and easy massive replace of exp(x) and log(x) with exp2(x*M_LOG2E) and LOG2(x)*M_LN2 (or LOG2(x)/M_LOG2E). This is because the speed differences between these two versions of function are not big enough for another multiply or divide latency. Only in the case, when the additional multiple or divide are free, does it make sense to make these replacement. This usually happens when the call takes place in a loop and the calling parameter has a multiply by constant already, we can pre-adjust the constant outside the loop and replace the calls inside the loop body.


static const float  F_RAND_N = static_cast<float>(RAND_N);
static const float  INV_RAND_N = 1/F_RAND_N;
static const float STDDEV_DENOM = 1/(F_RAND_N * (F_RAND_N - 1.0f));
static const float CONFIDENCE_DENOM = 1/sqrtf(F_RAND_N);

float expectedCall(
        float S,
        float X,
        float MuByT,
        float VBySqrtT,
        float r)
{
        float eCall = S * exp2(MuByT + VBySqrtT * r) - X;
        return (eCall > 0) ? eCall : 0;

}


void MonteCarlo(
        float *h_CallResult,
        float *h_CallConfidence,
        float *S,
        float *X,
        float *T
)
{
float l_Random [RAND_N];
VSLStreamStatePtr Randomstream;
vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

double sTime = second();
vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, l_Random, 0.0, 1.0);

for(int opt = 0; opt < OPT_N; opt++)
{
float VBySqrtT = VOLATILITY * sqrt(T[opt])*M_LOG2E;
float MuByT = (RISKFREE - 0.5 * VOLATILITY * VOLATILITY) * T[opt]*M_LOG2E;
float Sval = S[opt];
                    float Xval = X[opt];
                    float val = 0.0, val2 = 0.0;

                    for(int pos = 0; pos < RAND_N; pos++)
{
                        float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
val  += callValue;
val2 += callValue * callValue;
}

float exprt = expf(-RISKFREE *T[opt]);
h_CallResult[opt] = exprt * val * INV_RAND_N;
float stdDev = sqrtf((F_RAND_N * val2 - val * val)* STDDEV_DENOM);
h_CallConfidence[opt] = (exprt * 1.96f * stdDev * CONFIDENCE_DENOM);
} //end of for
double eTime = second();
printf("\nComputation rate - %8.4f.\n", OPT_N/(eTime-sTime));
}


Notice that we also have done some hoist of divide-by-constant cooperation. These are the operations compiler may miss. These operations are outside the inner loop. The performance improvement is a more modest amount of 7.4%.

[sli@IDF-SLI-01 step3]$ vi MonteCarloStage2.h
[sli@IDF-SLI-01 step3]$ icpc -o MonteCarlo -xAVX -O2 -no-vec -mkl MonteCarloStage2.cpp
[sli@IDF-SLI-01 step3]$ ./MonteCarlo
Benchmarking the Parallel Computing Result...
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 Option with path length of 262144.

Computation rate - 352.8825.
Completed in 92.85883 seconds. Options per second:   352.88

This is an indication that no matter how much effort you put to the code, if the code is running in scalar and serial mode, there is little room you can improve your code. The answer to that is parallelism.

3.4 Stage 3: Vectorization

Vectorization may mean different things in different contexts but here vectorization means using SIMD instruction and registers. It means taking advantage of synchronous execution of same function to multiple data elements. There are many way you can introduce Vectorization to your program, ranging from using processor intrinsic functions to using Intel® Cilk Plus Array Notation. These compiler-based vectorization techniques vary in a terms of the amount of control the programmer has on generated code, and the expressiveness of the syntax, and the amount of changes required to the serial program.

Before we expect compiler to vectorize the serial code, and generate SIMD instruction, the programmer has to ensure proper memory alignment. Misaligned memory access can, in serious cases, generate processor faults and in benign cases, cause cache line split, redundant object code, all impacting performance. One way to ensure memory alignment is to always request and work with strictly alignment memory. Using Intel Parallel Composer XE 2011, user can request statically allocated memory by prefix the memory definition with __attribute__(align(64)) 64-byte boundary is the minimum alignment requirement for memory designated for Intel® Xeon Phi™ coprocessor vector registers. You can also use _mm_malloc and _mm_free to request and release dynamically allocated memory. There are memory allocation intrinsics supported by Intel and GCC on Linux operating system.

Beside memory alignment, Intel Compiler based auto-vectorization works best when function call in the target loop are kept minimum. Simply because function calls have overhead and the ability to continue to execute in SIMD mode may in doubt. You can inline function as much as possible because you can avoid function call overhead and also allow autovectorizer to analyze and vectorize the callee code and caller code together.


void MonteCarlo(
        float *h_CallResult,
        float *h_CallConfidence,
        float *S,
        float *X,
        float *T
)
{
        __attribute__((align(64))) float l_Random [RAND_N];
        VSLStreamStatePtr Randomstream;
        vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

        double sTime = second();
        vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, l_Random, 0.0f, 1.0f);

        for(int opt = 0; opt < OPT_N; opt++)
        {
                float VBySqrtT = VOLATILITY * sqrt(T[opt])*M_LOG2E;
                float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt]*M_LOG2E;
                float Sval = S[opt];
                float Xval = X[opt];
                float val = 0.0, val2 = 0.0;
#pragma vector aligned
#pragma simd reduction(+:val) reduction(+:val2)
#pragma unroll(4)
                for(int pos = 0; pos < RAND_N; pos++){
                        float callValue =  max(0.0, Sval *exp2(MuByT + VBySqrtT * l_Random[pos]) - Xval);
                        val  += callValue;
                        val2 += callValue * callValue;
                }

                float exprt = expf(-RISKFREE *T[opt]);
                h_CallResult[opt] = exprt * val * INV_RAND_N;
                float stdDev = sqrtf((F_RAND_N * val2 - val * val)* STDDEV_DENOM);
                h_CallConfidence[opt] = (exprt * 1.96f * stdDev * CONFIDENCE_DENOM);
        } //end of for
        double eTime = second();
        printf("\nComputation rate - %8.4f.\n", OPT_N/(eTime-sTime));
}


[sli@IDF-SLI-01 step4]$ ./MonteCarlo
Benchmarking the Parallel Computing Result...
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 Option with path length of 262144.

Computation rate - 1377.2942.
Completed in 23.79245 seconds. Options per second:  1377.24

Vectorization achieved 4.9X performance improvement. The inner loop might have achieved more than 4.9X but random number generation also takes non-trivial amount of time and it has improved by 3.35X already. Together with Compiler changes, MKL, Serial and vectorization, we have achieved close to 18X performance improvement. What’s more we did all this with only one core.

3.5 Stage 4: Parallelization

Modern microprocessors have all become multicore. They have evolved from hyper threads, dual core quad core not to 8 core even 10 cores and 12 cores. The first product from Intel® Xeon Phi™ coprocessor family starts with 50+ cores. Designing a program that scales with the number of cores in a program has become a big challenge. The common practice is to break the total job into a number of smaller tasks, implicitly or explicitly create small light weight process or threads to execute these smaller tasks and bind each thread to a hardware thread or OS scheduler entity. This process is also known as threading or multithreading and the program is called multithreaded program.

With Intel Parallel Composer XE 2013, application developer can choose variety of ways to create multithread program depending on how much control he desires or the problem at hand dictates.

In threading our Monte Carlo implementation, we are looking for a method with little or close-to-zero changes to the vectorized serial code we derived from step 3. Our Monte Carlo European options is also relative easy to break into sub task. Since there are close to 32K data sets, on a Dual Socket Intel® Xeon, we can break it into 16 subtasks and each work on 2K data sets. OpenMP fits our criteria very well.


#include <stdio.h>
#include <sys/time.h>
#include "mkl_vsl.h"
#include <omp.h>
#ifndef max
  #define max(a,b) (((a) > (b)) ? (a) : (b))
#endif

const int OPT_N = 512*64;
const int RAND_N = 1 << 18;
static const float      RISKFREE = 0.06;
static const float    VOLATILITY = 0.10;
static const float  INV_RAND_N = 1.0f/RAND_N;
static const float  F_RAND_N = static_cast<float>(RAND_N);
static const float STDDEV_DENOM = 1 / (F_RAND_N * (F_RAND_N - 1.0f));
static const float CONFIDENCE_DENOM = 1 / sqrtf(F_RAND_N);
#define RANDSEED 123

double second()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}

void MonteCarlo(
        float *h_CallResult,
        float *h_CallConfidence,
        float *S,
        float *X,
        float *T
)
{
        __attribute__((align(64))) float l_Random [RAND_N];
        VSLStreamStatePtr Randomstream;
        vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

        double sTime = second();
        vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, l_Random, 0.0f, 1.0f);
#pragma omp parallel for
        for(int opt = 0; opt < OPT_N; opt++)
        {
                float VBySqrtT = VOLATILITY * sqrt(T[opt])*M_LOG2E;
                float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt]*M_LOG2E;
                float Sval = S[opt];
                float Xval = X[opt];
                float val = 0.0, val2 = 0.0;
#pragma vector aligned
#pragma simd reduction(+:val) reduction(+:val2)
#pragma unroll(4)
                for(int pos = 0; pos < RAND_N; pos++){
                        float callValue  = Sval * exp2(MuByT + VBySqrtT * l_Random[pos]) - Xval;
                        callValue = (callValue > 0) ?   callValue : 0;
                        val  += callValue;
                        val2 += callValue * callValue;
                }

                float exprt = expf(-RISKFREE *T[opt]);
                h_CallResult[opt] = exprt * val * INV_RAND_N;
                float stdDev = sqrtf((F_RAND_N * val2 - val * val)* STDDEV_DENOM);
                h_CallConfidence[opt] = (exprt * 1.96f * stdDev * CONFIDENCE_DENOM);
        } //end of for
        double eTime = second();
        printf("\nComputation rate - %8.4f.\n", OPT_N/(eTime-sTime));
}


[sli@IDF-SLI-01 step5]$ ./MonteCarlo
Benchmarking the Parallel Computing Result...
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 Option with path length of 262144.

Computation rate - 5561.2050.
Completed in 5.89369 seconds. Options per second:  5559.85

In the end, on a single-socket Intel® microarchitecture code name Sandy Bridge system, we have improved the performance 4.04X over vectorized serial implementation. In the end, we achieved 80.86X improvement over the baseline we establish with GCC.

Here is the summary of performance gain as each technology introduce. It calculated as the ratio before and after using applying these technologies. For example, using Intel Compiler as baseline, adding MKL increase performance by 3.33X.

3.6 Stage 5: From Multicore to Many Integrated Core

Highly parallel application such as Monte Carlo can reap immediate benefit from Intel® Xeon Phi™ coprocessor based on many integrated core architecture.

One of the major features of Intel Parallel Composer XE 2013 is the integration of multicore and the Intel® MIC Architecture development environment. The developer has one SKU to install. He can compile and run natively multicore applications; or he can compile the same source code to build the executable for Intel® Xeon Phi™ Coprocessor.

The same source, without changing a single line of code, can compile and run natively on the Intel® Xeon Phi™ coprocessor. The following demonstrate that the same code can complete a little more than 1.6 seconds, we can reach 20K options/sec.

[sli@cthor-knc2 mc_tutorial]$ icpc -mmic -o MonteCarlo -openmp -mkl MonteCarloStage4.cpp

[root@cthor-knc2-mic0 /tmp]# ./MonteCarlo
Benchmarking the Parallel Computing Result...
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 Option with path length of 262144.

Computation rate - 20001.0963.
Completed in  1.63868 seconds. Options per second: 19996.56

We can further optimize the code runs on Intel® Xeon Phi™ Coprocessor. The area to focus would be Cache blocking.

4. Conclusions

In this paper, we provide a brief introduction to Monte Carlo Method and Monte Carlo in derivative pricing application. We also looked at the latest release of Intel® Xeon Phi coprocessor and the integrated software development environment Intel® Parallel Composer XE 2013. We derived our own C/C++ implementation of Monte Carlo European call option pricing, which we use as a case study and demonstrated a stepwise optimization framework.

Following this optimization framework, Monte Carlo European Option Pricing achieved more than 80 times as it goes through 4 unique stages of performance optimization using the combination of Intel® Parallel Composer XE 2013. In the end the same code, with a simply change of compiler invocation can produce the code that execute on newly announced Intel® Xeon Phi™ Coprocessor natively.

Stepwise optimization framework has proved to be effective not only in financial numerical application, but also in general scientific computation as well. Intel Parallel Composer XE 2013 will make it even easier for scientific programmers to approach performance optimization as a structured activity and also allow them to quickly understand the limitation of the program and achieve every kind of parallelism available.

Additional Resources

  • Intel® Composer XE 2013 for Linux* including Intel® MIC Architecture**
  • Intel® C++ Compiler XE 12.0 User and Reference Guides**
  • C++ Technical Report 1

** This documentation is all installed with Intel® Composer.

About the Author

Shuo Li works at Software & Service Group, Intel Corporation. He has 24 years of experience in software development. His main interest is parallel programming, computational finance and application performance optimization. In his recent role as a software performance engineer covering financial service industry, Shuo works closely with software developers and modelers and help them achieve best possible performance on Intel platform. Shuo holds a Master's degree in Computer Science from University of Oregon and an MBA degree from the Duke University.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Cilk, and Intel® Xeon Phi™ are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Copyright© 2012 Intel Corporation. All rights reserved.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

This sample source code is released under the Intel Sample Source Code License Agreement

Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.