Case Study: Achieving High Performance on Monte Carlo European Option Using Stepwise Optimization Framework

1. Introduction

Monte Carlo is a statistical computing method for solving complex scientific computing problems. It uses random numbers to simulate the uncertainty of a problem's inputs and obtains a deterministic result through repeated sampling of the parameters, solving problems that would otherwise be intractable. The method was pioneered by nuclear physicists on the Manhattan Project in the late 1940s and is named after the biggest casino in the Principality of Monaco.

Phelim Boyle, a Canadian financial mathematician, is credited as the first person to use the Monte Carlo method in quantitative finance. In his seminal paper "Options: A Monte Carlo Approach," published in the May 1977 issue of the Journal of Financial Economics, he developed a simulation method to solve the European option valuation problem and independently verified the Black-Scholes formula. Since then, the Monte Carlo method has become a very popular numerical method among derivative pricing and risk management professionals on Wall Street. By one recent estimate, up to 40 percent of quantitative finance problems involve the Monte Carlo method. Large clustered computer server installations known as Monte Carlo farms have become an almost must-have configuration for any firm that practices high performance financial computing. In 1996, Mark Broadie and Paul Glasserman applied Monte Carlo to Asian options. In 2001, Francis Longstaff and Eduardo Schwartz developed the least-squares Monte Carlo algorithm for pricing American options. In 2006, Mike Giles of Oxford University created multilevel Monte Carlo as a cost-effective way to solve high dimensional problems with user-specified accuracy.

While Monte Carlo makes many high dimensional problems tractable with simple, elegant mathematical expressions, and often requires much less memory than direct PDE and lattice methods, it demands extensive raw computation power, especially in areas where deterministic and accurate results are mandatory. Most of the time, it takes hundreds of thousands, sometimes millions, of samples to reach satisfactory results, and 100 times more samples simply to extend the accuracy by one decimal digit. With so many problems that are computationally challenging, the financial industry looks for innovative architectures to satisfy its insatiable demand for raw computation power.

1.1 Intel® Many Integrated Core Architecture (Intel® MIC)

The Intel® Many Integrated Core architecture (Intel® MIC) provides an ideal execution vehicle for Monte Carlo-related applications. Built around the same instruction set architecture as IA32/Intel 64 and using the very same programming model as the Intel® Xeon® multicore processor, the Intel® Xeon Phi™ coprocessor product further extends the parallel execution infrastructure and allows highly parallel applications to reach levels of performance far exceeding that of a repurposed graphics card (sometimes referred to as a GPGPU) with little or no modification of source code. Users benefit from the ease of software development and deployment as well as the acceleration of software performance.

The Intel Xeon Phi coprocessor (the coprocessor) is the first product based on the Intel Many Integrated Core architecture. It introduces and implements an entirely new instruction set known as the Intel® Initial Many Core Instructions, or Intel® IMCI. Intel IMCI inherits the scalar instructions, including x87 instructions, of IA32/Intel 64 and introduces 512-bit vector instructions that operate on a newly designed vector processing unit, or VPU. However, Intel IMCI does not support Intel SSE, SSE2, or Intel AVX. For highly parallel applications, the Intel compiler generates vectorized code that executes on the VPU.

1.1.1 Processor Architecture

At the microprocessor architecture level, the coprocessor is an SMP processor comprising up to 61 cores organized around two unidirectional rings. Eight on-die memory controllers support 16 GDDR5 channels and are expected to deliver up to 5.5 GT/sec. At the core level, each coprocessor core is a fully functional, in-order core in its own right, capable of running IA instructions independently of the other cores. Each core also has hardware multithreading support and can run four hardware contexts, or threads. At any given clock cycle, each core can issue up to two instructions from any single context.

1.1.2 Core and Vector Processor Units

The coprocessor core implements sixteen general purpose 64-bit registers as well as most of the new instructions associated with 64-bit extension. However, the only vector instructions supported are the initial Intel® Many Core Instruction set, or Intel® IMCI. There is no support for Intel® MMX™ technology, Intel® SSE, or Intel® AVX in the coprocessor cores, although the scalar math unit (x87) remains integrated and fully functional.

The all-new vector processing unit (VPU) was designed from scratch. It is a 512-bit SIMD engine capable of executing 16-wide single precision or 8-wide double precision floating point SIMD operations, with full support for all four rounding modes articulated by IEEE 754R. The VPU can also process 32-bit integer and 64-bit long integer data, similar to how it processes floating point data of the equivalent width.
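To make the SIMD width concrete, here is a minimal illustration using the 512-bit vector intrinsics the Intel compiler exposes for the coprocessor when compiling with -mmic; the function name is ours. One vector addition processes 16 single precision lanes at once:


#include <immintrin.h>

// one 512-bit IMCI vector addition: 16 single precision lanes per instruction
__m512 add16(__m512 a, __m512 b)
{
    return _mm512_add_ps(a, b);
}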

Inside the VPU, a new extended math unit provides fast implementations of four single precision transcendental functions: reciprocal, reciprocal square root, base 2 logarithm, and base 2 exponential, using minimax quadratic polynomial approximation. These four hardware-implemented elementary functions have a latency of 4 to 8 cycles and a throughput of 1 to 2 cycles. Other transcendental functions can be derived from these elementary functions, as summarized below:

Function    Math        Latency (cycles)    Throughput (cycles)
RECIP       1/x         4                   1
RSQRT       1/√x        4                   1
EXP2        2^x         8                   2
LOG2        log2(x)     4                   1

1.1.3 Cache Architecture

The Intel Xeon Phi coprocessor's level one (L1) cache was designed to accommodate the higher working set requirements of four hardware contexts per core. It consists of a 32-KB L1 instruction cache and a 32-KB L1 data cache. Associativity is 8-way, with a 64-byte cache line and an 8-byte bank width. Data return can now be out of order. Access has a 3-cycle latency.

The 512-KB unified level two (L2) cache is 8-way associative with 64-byte lines, 1024 sets, 2 banks, and a cacheable address range of 32 GB (35 bits). The expected idle access time is approximately 80 cycles.

The L2 cache has a streaming hardware prefetcher that can selectively prefetch code, read, and RFO (read-for-ownership) cache lines into the L2 cache. It supports 16 streams, each of which can bring in up to a 4-KB page of data. Once a stream direction is detected, the prefetcher can issue up to 4 prefetch requests. The L2 cache also has ECC support. The replacement algorithm for both the L1 and L2 caches is based on a pseudo-LRU implementation.

Major parameters of the Coprocessor’s cache architecture are summarized in the following table:

Parameters                       L1                       L2
Coherence                        MESI                     MESI
Size                             32 KB (I) + 32 KB (D)    512 KB (combined)
Associativity/Line Size/Banks    8-way/64 bytes/8         8-way/64 bytes/8
Access time (clocks)             1                        11
Policy                           Pseudo LRU               Pseudo LRU
Duty Cycle                       1 per clock              1 per clock
Port                             Read or Write            Read or Write

1.2 Software Development Environment for the Intel® Xeon® Processor and Intel® Xeon Phi™ Coprocessor

1.2.1 Intel® Parallel Studio XE 2013

In September 2012, Intel publicly announced Intel® Parallel Studio XE 2013, the integrated software development toolkit for the Intel multicore and Intel MIC architecture product lines. In this software package, you can find Intel® C++ and Fortran Composer XE along with the Intel® VTune™ Amplifier XE application performance profiler. Also included are Intel® performance libraries, such as Intel® Threading Building Blocks (Intel® TBB) and Intel® Math Kernel Library (Intel® MKL), that developers can use to build applications and libraries. Integrated Intel multicore and Intel MIC support is one of the major features of the current release. Since January 2013, five updates have been released. The program mentioned in this article was built with update 5. The performance mentioned in this article was recorded by running the executable file built by 13.1 update 3 of Intel® C++ Intel® 64 Composer XE.

1.2.2 Intel Xeon host processor system

The Intel Xeon-based, dual-socket platform is the host system, with an Intel Xeon Phi coprocessor connected to it via the PCI Express* (PCIe) interface. Each socket of the host is populated with an 8-core Intel® Xeon® E5-2670 CPU running at 2.6 GHz. The Intel Xeon E5-2670 supports Intel® Advanced Vector Extensions (Intel® AVX) for SIMD parallelism.

The integrated software development tool stack runs on the Intel Xeon-based host system. Application developers can create either offload applications or native applications for the Intel Xeon Phi coprocessor.

An offload application starts and initially runs on the host system. Part of the program can run on the coprocessor at the application developer's directive or instruction. The compiler creates a single executable file containing the binary for the host and the coprocessors. The integrated tool's runtime library is responsible for extracting and copying the binary file meant for the coprocessor, setting up the input and output data, and invoking the coprocessor program.
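As a hedged illustration of the offload model (it is not used further in this paper, and the function and buffer names here are ours, not from the case study), an offload region might look like this:


__attribute__((target(mic)))
void scale(float *a, int n)            // compiled for both host and coprocessor
{
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}

void host_side(float *a, int n)
{
    // 'a' is copied to the coprocessor, scale() runs there,
    // and the results are copied back when the offload completes
    #pragma offload target(mic) inout(a : length(n))
    scale(a, n);
}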

A native application starts and runs only on the coprocessor card. The programmer informs the compiler that the whole program is intended to run on the coprocessor by using a special switch (-mmic) on the compiler invocation line. The programmer is responsible for copying the executable files, together with shared libraries, from the host to the coprocessor, preparing the input and output data, and invoking the program on the coprocessor.

This paper only covers creating native applications for the coprocessor; offload applications are not addressed further.

1.2.3 Intel Xeon Phi Coprocessor

The coprocessor is a computing device running its own operating system, a modified version of the Linux* operating system, as part of the Intel® Manycore Platform Software Stack (MPSS). The coprocessor cannot boot on its own. Instead, the host boots or shuts down the coprocessor through standard Linux device operation commands. The MPSS version that works in conjunction with update 5 of Intel Parallel Studio XE 2013 is 2.1.6720-13. In this paper, we use the coprocessor model 7120P with 61 cores, each running at 1.238 GHz, and 16 GB of GDDR5 memory. The memory speed is 5.5 GT/sec.

2. Monte Carlo European Option Pricing Algorithm

In this section, we give some basic background on financial derivatives, specifically stock option pricing. We show how the Monte Carlo method can be used to model the uncertainty of the input, how each sample is taken, and how the payoff function is determined based on the option specification. In the end, we arrive at an implementation of Monte Carlo in standard C++ and highlight the performance bottleneck of this implementation.

2.1 Option Pricing Background

In the financial world, a derivative is a financial instrument whose value depends on the values of other, more basic, underlying variables. Very often, the variables underlying derivatives are the prices of traded assets. A stock option, for example, is a derivative whose value depends on the price of a stock. Not all underlying variables are traded assets, however. One such variable could be, for example, the snowfall at a certain resort; another might be the average temperature over a specific time interval.

An option is a derivative that specifies a contract between two parties for a future transaction on an asset at a reference price, known as the exercise price. The buyer of the option gains the right, but not the obligation, to engage in that transaction, while the seller incurs the corresponding obligation to fulfill it. There are two types of options. A call option gives the holder the right to buy the underlying asset by a certain date for a certain price. A put option gives the holder the right to sell the underlying asset by a certain date for a certain price. The price in the contract is known as the strike price, and the date is known as the expiration date. European options can be exercised only on the expiration date; American options can be exercised at any time up to the expiration date.

2.2 Pricing European Stock Options using the Monte Carlo Method

Monte Carlo simulation uses the risk-neutral valuation method to value an option. It samples paths to obtain the expected payoff in a risk-neutral world and then discounts that payoff to present value using a risk-free interest rate. Let's consider an option on a stock whose current price is S and which provides a payoff at time T. Assuming the interest rate is constant, we can value the derivative as follows:

  1. Sample a random path for S in a risk-neutral world.
  2. Calculate the payoff from the derivative.
  3. Repeat steps 1 and 2 to get many sample values of the payoff from the derivative in a risk-neutral world.
  4. Calculate the mean of the sample payoffs to get an estimate of the expected payoff in a risk-neutral world.
  5. Discount the expected payoff at the risk-free rate to get an estimate of the value of the derivative.
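As a minimal sketch (the function names and the gauss() sampler are illustrative, not part of the implementation developed later in this paper), these five steps map to code as follows:


#include <math.h>

// Steps 1-5 above for a European call; gauss() is an assumed source
// of standard normal draws.
float mc_call(float S, float X, float T, float r, float sigma,
              int M, float (*gauss)(void))
{
    float sum = 0.0f;
    for (int k = 0; k < M; k++)
    {
        // step 1: sample one risk-neutral path (a single step suffices here)
        float ST = S * expf((r - 0.5f*sigma*sigma)*T + sigma*sqrtf(T)*gauss());
        // step 2: payoff of the call at expiration
        float payoff = (ST > X) ? (ST - X) : 0.0f;
        // step 3: accumulate over many samples
        sum += payoff;
    }
    // step 4: mean payoff; step 5: discount at the risk-free rate
    return expf(-r*T) * (sum / (float)M);
}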

Suppose the process followed by the underlying market variable S in the risk-neutral world is:

    dS = µ S dt + σ S dWt          (3.1)

where dWt is a Wiener process, µ is the expected return in a risk-neutral world, and σ is the volatility. To simulate the path followed by lnS, we can divide the life of the derivative into N short intervals of length Δt and approximate equation 3.1 as:

    ln S(t + Δt) − ln S(t) = (µ − σ²/2) Δt + σ ε √Δt

or, equivalently:

    S(t + Δt) = S(t) exp[(µ − σ²/2) Δt + σ ε √Δt]

where ε is a random sample from φ(0, 1). This enables the value of S at time Δt to be calculated from the initial value of S, the value at time 2Δt to be calculated from the value at time Δt, and so on. For a European option, only the terminal price matters, so one step spanning the whole life T suffices:

    ST(k) = S exp[(µ − σ²/2) T + σ √T εk]

for each k between 1 and M. Here each εk is a draw from a standard normal distribution.

Since we know the values of European options at the time of expiration, the payoff equations for the call and put options are:

    CT = max(ST − X, 0)
    PT = max(X − ST, 0)

where X is the strike price. Using Monte Carlo, we can generate M numeric samples with the required φ(0, 1) distribution, corresponding to the underlying Wiener process, and then average the possible end-period payoffs corresponding to each of the sample values:

    ĈT = (1/M) Σk=1..M max(ST(k) − X, 0)

Similar functions can be obtained for European put options.

The result is still the future value of the option at time T. Discounting this value by a factor of e^(−rT), where r is the risk-free rate, we get the present value of the European call option:

    Ĉ = e^(−rT) ĈT

It follows from the central limit theorem that, to reduce the standard deviation of the estimate by half, the number of sampling paths must be quadrupled. In other words, the standard error of Monte Carlo converges at the rate of O(1/√M).

The advantage of Monte Carlo simulation is that it can be used when the payoff depends on the path followed by the underlying variable S leading up to expiration, as well as in situations where payoffs occur multiple times during the life of the option. It is particularly useful when the payoff function involves multiple independent variables. And when all other analytical methods fail, Monte Carlo becomes the only choice.

2.3 Implementation of Monte Carlo European Option Pricing Algorithm

Implementing Monte Carlo European option pricing is relatively easy once we have the payoff function from the previous section. First and foremost, we need to draw random numbers from the φ(0, 1) distribution; then we can use the payoff function to compute the expected value and confidence interval for the underlying options. A C/C++ implementation may look like this:


        for (int pos = 0; pos < RAND_N; pos++)
        {
            float callValue =  max(0.0, Sval *exp(MuByT + VBySqrtT *  gen()) - Xval);
            val  += callValue;
            val2 += callValue * callValue;
        }


Let's take a look at a hypothetical situation in which a firm wants to calculate European options for millions of financial instruments. Each instrument has one set of parameters: current price, strike price, and option expiration time. The firm wants to calculate the European call option for each data set using Monte Carlo simulation. In the following sections, we demonstrate code that prices European call values for a large number of option data sets using a path length of 256 K, or 262,144.


#include "MonteCarlo.h"
#include <math.h>
#include <tr1/random>

#ifndef max
  #define max(a,b) (((a) > (b)) ? (a) : (b))
#endif
  
void MonteCarlo(
    float *h_CallResult,
    float *h_CallConfidence,
    float *S,
    float *X,
    float *T
)
{
    typedef std::tr1::mt19937                     ENG;    // Mersenne Twister
    typedef std::tr1::normal_distribution<float> DIST;    // Normal Distribution
    typedef std::tr1::variate_generator<ENG,DIST> GEN;    // Variate generator

    ENG  eng;
    DIST dist(0,1);
    GEN  gen(eng,dist);


    for(int opt = 0; opt < OPT_N; opt++) 
    {
        float VBySqrtT = VOLATILITY * sqrt(T[opt]);
        float MuByT = (RISKFREE - 0.5 * VOLATILITY * VOLATILITY) * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0, val2 = 0.0;

        for(int pos = 0; pos < RAND_N; pos++)
        {
            float callValue =  max(0.0, Sval *exp(MuByT + VBySqrtT *  gen()) - Xval);
            val  += callValue;
            val2 += callValue * callValue;
        }

        float exprt = exp(-RISKFREE *T[opt]);
        h_CallResult[opt] = exprt * val / (float)RAND_N;
        float stdDev = sqrt(((float)RAND_N * val2 - val * val)/ ((float)RAND_N * (float)(RAND_N - 1)));
        h_CallConfidence[opt] = (float)(exprt * 1.96 * stdDev / sqrtf((float)RAND_N));
    }
}


Note that the above code uses the random number generation facilities of C++ with the TR1 (C++ Standards Committee Technical Report 1) extensions. These random number generation classes and functions are defined in the <tr1/random> header and contained in the namespace std::tr1.

At the core of any pseudo-random number generation software is a routine for generating uniformly distributed random integers. These are then used in a bootstrap process to generate uniformly distributed floating point numbers. The uniformly distributed floating point number sequences are used to generate other distributions through transformations, acceptance-rejection algorithms, etc.
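As a toy illustration of this bootstrap pipeline (the generator constants and function names below are illustrative, not what C++ TR1 or Intel MKL actually use internally), uniform integers become uniform floats, which a transformation such as Box-Muller turns into normal deviates:


#include <stdint.h>
#include <math.h>

static uint32_t state = 12345u;

static uint32_t next_uint(void)      // toy linear congruential generator
{
    return state = state * 1664525u + 1013904223u;
}

static float next_unif(void)         // uniform float in (0, 1]
{
    return (next_uint() + 1.0f) / 4294967296.0f;
}

static float next_gaussian(void)     // standard normal via the Box-Muller transform
{
    float u1 = next_unif(), u2 = next_unif();
    return sqrtf(-2.0f * logf(u1)) * cosf(2.0f * (float)M_PI * u2);
}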

C++ TR1 gives you several choices of core generators that it calls “engines.” The following four engine classes are supported in GCC 4.3.x and Visual Studio* 2008 feature pack.

  • linear_congruential
  • mersenne_twister
  • subtract_with_carry
  • subtract_with_carry_01

The C++ TR1 library supports non-uniform random number generation through distribution classes. These classes return random samples via the operator() method.

The template class variate_generator describes an object that holds an engine and a distribution and produces values by passing the wrapped engine object to the distribution object's operator().

Finally, compiled with GCC 4.4.6, the initial implementation of Monte Carlo for European options prices a little more than 34 options per second on the dual-socket host system based on the 8-core Intel Xeon E5-2670 running at 2.6 GHz.


[sli@localhost step1]$ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 options with path length of 262144.
Completed in 955.2525 seconds.
Computation rate -   34.303 options per second.


Obviously, this implementation has a lot of room for improvement. In the next section, we are going to take this program through a structured framework and attempt to improve the performance along the way.

3. Stepwise Optimization Framework and Performance Optimization

Just like any scientific process, performance optimization requires a systematic and structured approach. In this section, we highlight an optimization framework that allows you to take a holistic approach to application performance improvement. The objective of this framework is to achieve the highest performance with the best tools and library available. This framework has five optimization steps. Each step attempts to improve the application performance in one orthogonal direction by applying various techniques.

The goal of this methodology is to address all issues and hurdles related to the application’s performance and achieve the highest possible performance on Intel architecture, while utilizing all the possible parallel execution resources.

3.1 Stepwise Optimization Framework

The stepwise optimization framework is a five-step parallelism-enabling process that helps the performance engineer achieve the best application performance in the shortest possible time. In other words, it allows the program to maximize its use of all the parallel hardware resources in the execution environment.

The first step is to select the optimizing development environment. This environment should be able to create optimized code and let you use existing optimized libraries. The second step is to optimize the operations you are performing; it ensures the intended operations, and nothing else, get done optimally. Step three is all about vectorization: taking advantage of SIMD instructions by exploring the synchronous data parallelism in your application and expressing it in a way the compiler can understand. The objective is to maximize the performance of a single core using one thread. Step four is threading parallelization, where we bring parallelism to the problem with as little synchronization as possible; the end objective is to take full advantage of all cores. Step five is to scale your application from Intel multicore to Intel MIC. This step is especially important for highly parallel applications. The purpose is to target the unique features at the microarchitecture level and perform additional optimization based on them, for example, the difference between memory bandwidth and processing capability, the changed SIMD alignment requirements, the cache available per thread, and so on, minimizing the changes and maximizing the performance as the execution target moves from one flavor of the Intel architecture (the Intel Xeon processor) to another (the Intel Xeon Phi coprocessor).

3.2 Step 1: Leverage the Optimized Tools and Library

At the beginning of your optimization project, select an optimizing development environment. The decision you make at this step has a profound influence on the later steps: not only will it affect the results you get, it could substantially reduce the amount of work to do. The right optimizing development environment can provide you with good compiler tools; optimized, ready-to-use libraries; and debugging and profiling tools that pinpoint exactly what the code is doing at runtime.

Sometimes, it's difficult to find a single environment that gives you everything. In that case, you have to put a solution together from different tools. For example, if you want to optimize a Java* program with the performance-critical portion of the code in C/C++, you must find a Java SDK, a C/C++ compiler, a JVM environment, and perhaps a system-wide profiler. Other times, the choice is obvious. Intel C++ Composer XE integrates the multicore and many-core development environments and provides strong automatic vectorization. If your project involves scaling a highly parallel application from the Intel Xeon processor to the Intel Xeon Phi coprocessor, Intel C++ Composer XE 2013 is currently the natural choice of development tool, and the integrated performance profiling tools deliver performance monitor events that tell you exactly where to spend your optimization effort. They are night vision goggles, if you will, that illuminate the targets for you.

Intel C++ Composer XE 2013 on Linux is compatible with GCC 4.4.6, which is released with the RHEL 6.3 Server edition. What you compile with g++ can also be compiled with icpc, the g++ equivalent from Intel C++ Composer. For example, instead of g++ -o MonteCarlo -O2 MonteCarloStage1.cpp, you can issue icpc -o MonteCarlo -O2 MonteCarloStage1.cpp. When you run the binary created by Intel C++ Composer, you get an instant 1.37X performance improvement without even touching the source code. Our objective is to improve the performance while keeping the calculation correct.


[sli@localhost step1]$ make -B CXX=icpc
icpc -c -O2  -o Driver.o Driver.cpp
icpc -c -O2  -o MonteCarloStep1.o MonteCarloStep1.cpp
icpc Driver.o MonteCarloStep1.o -o MonteCarlo
[sli@cthor-knc14 step1]$ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 options with path length of 262144.
Completed in 694.6562 seconds.
Computation rate -   47.172 options per second.


Intel MKL is a precompiled library of basic math routines such as BLAS, LAPACK, and random number generation functions. Intel MKL provides C and Fortran interfaces for these functions, which you can call from C/C++ or from Fortran.

The efficiency of random number generation (RNG) is an indicator of the performance of any Monte Carlo simulation. Like C++ TR1, Intel MKL provides random number generation routines and allows you to independently choose among different basic RNG engines and different distributions, each available in single and double precision floating point and, where applicable, in 32-bit and 64-bit integer variants. Unlike C++ TR1, Intel MKL's RNG interface can be used from C, C++, and Fortran. As with other compiled libraries, you first include the interface declaration (.h) file, then make Intel MKL API calls in your C/C++ implementation files, and finally link with the precompiled Intel MKL libraries. In our example, we need three Intel MKL libraries: mkl_intel_lp64, mkl_sequential, and mkl_core. You can enumerate these libraries, together with -lpthread, as -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread, or you can pass -mkl on the command line and the linker will handle it for you.

A major difference is that in C++ TR1, random numbers are delivered one at a time by a gen() call on the variate_generator object. In Intel MKL, creating an RNG engine object is replaced by creating an RNG stream, and creating a distribution object is replaced by selecting the right RNG interface API call. One Intel MKL RNG call can deliver any number of random numbers; in our problem, 256 K random numbers are delivered in a single call.


#include "MonteCarlo.h"
#include "math.h"
#include "mkl_vsl.h"
#define RANDSEED 123

#ifndef max
  #define max(a,b) (((a) > (b)) ? (a) : (b))
#endif
  
void MonteCarlo(
    float *h_CallResult,
    float *h_CallConfidence,
    float *S,
    float *X,
    float *T
)
{
    float random [RAND_N];
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

    vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, random, 0.0, 1.0);

    for(int opt = 0; opt < OPT_N; opt++) 
    {
        float VBySqrtT = VOLATILITY * sqrt(T[opt]);
        float MuByT = (RISKFREE - 0.5*VOLATILITY*VOLATILITY)*T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0, val2 = 0.0;

        for(int pos = 0; pos < RAND_N; pos++)
        {
            float callValue =  max(0.0, Sval*exp(MuByT+VBySqrtT*random[pos])-Xval);
            val  += callValue;
            val2 += callValue * callValue;
        }

        float exprt = exp(-RISKFREE *T[opt]);
        h_CallResult[opt] = exprt * val / (float)RAND_N;
        float stdDev = sqrt(((float)RAND_N*val2-val*val)/((float)RAND_N*(float)(RAND_N-1)));
        h_CallConfidence[opt] = (float)(exprt * 1.96 * stdDev / sqrtf((float)RAND_N));
    }
    vslDeleteStream(&Randomstream);
}


We actually reduced the number of lines of code by using the Intel MKL RNG facility, and the performance difference is stunning: more than a 5.53X improvement over the previous version.


[sli@cthor-knc14 .solution]$ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 options with path length of 262144.
Completed in 125.5210 seconds.
Computation rate -  261.056 options per second.


Changing the RNG from C++ TR1 to Intel MKL not only delivered a 5.53X improvement, it also enabled the compiler to inline function calls and vectorize the code in the inner loop, which is discussed in step 3.

So far, we have made full use of the optimized software available to us as an optimized software development environment. This saves a lot of time and allows us to focus on developing our own IP and working on new problems. Most likely, if we expect a big performance gain, we will have to make some source code changes. Before we apply any parallel mechanism, let's first make sure our serial code is lean and clean.

3.3 Step 2: Scalar Serial Optimization

Use scalar serial optimization to make sure your code runs at top efficiency and contains no redundancies. Also at this step, ensure that the precision of your data structures and the accuracy of your numerical methods are dictated by the problem at hand and are not overkill; where they are, less expensive operations, such as single precision, can replace more expensive double precision operations.

C/C++ still inherits a weak typing culture from the early days of C. Automatic promotion to double is believed to improve accuracy in certain cases; whether or not it does so marginally, it definitely kills any chance of high performance. Other things to watch out for at the scalar and serial optimization stage, each of which shows up in the step 2 code below, include:

  • Use single precision math functions (sqrtf, expf) and float literals (0.5f, 1.96f) so that float operands are not silently promoted to double.
  • Hoist loop-invariant computations out of the loops and precompute them as constants.
  • Replace division by a constant with multiplication by its precomputed reciprocal.

The promotion pitfall is illustrated by the following sketch.
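In this minimal sketch (the function names are ours), the only difference between the two versions is the type of the literal:


float half_slow(float x)
{
    return 0.5 * x;     // 0.5 is a double literal: x is promoted, the multiply
                        // runs in double, and the result converts back to float
}

float half_fast(float x)
{
    return 0.5f * x;    // stays entirely in single precision
}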

In the Monte Carlo application, we have one transcendental function call, a natural (base e) exponential, inside the inner loop. Any improvement we make to this function call is amplified into a big performance improvement.


#include "MonteCarlo.h"
#include "math.h"
#include "mkl_vsl.h"
#define RANDSEED 123

#ifndef max
  #define max(a,b) (((a) > (b)) ? (a) : (b))
#endif
  
static const float  RVV = RISKFREE-0.5f*VOLATILITY*VOLATILITY;
static const float  INV_RAND_N = 1.0f/RAND_N;
static const float  F_RAND_N = static_cast<float>(RAND_N);
static const float STDDEV_DENOM = 1 / (F_RAND_N * (F_RAND_N - 1.0f));
static const float CONFIDENCE_DENOM = 1 / sqrtf(F_RAND_N);

void MonteCarlo(
    float *h_CallResult,
    float *h_CallConfidence,
    float *S,
    float *X,
    float *T
)
{
    float random [RAND_N];
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

    vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, random, 0.0f, 1.0f);

    for(int opt = 0; opt < OPT_N; opt++) 
    {
        float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        float MuByT = RVV * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0, val2 = 0.0;

        for(int pos = 0; pos < RAND_N; pos++)
        {
            float callValue =  max(0.0, Sval *expf(MuByT + VBySqrtT *  random[pos]) - Xval);
            val  += callValue;
            val2 += callValue * callValue;
        }

        float exprt = expf(-RISKFREE *T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        float stdDev = sqrtf((F_RAND_N * val2 - val * val)* STDDEV_DENOM);
        h_CallConfidence[opt] = (exprt * 1.96f * stdDev * CONFIDENCE_DENOM);
    }
    vslDeleteStream(&Randomstream);
}


Notice that we have also hoisted the divide-by-constant operations out of the loop; these divides sit outside the inner loop and are often missed by the compiler. The transformation is sketched below. The performance improvement is a more modest 41.11%.
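Extracted from the step 1 and step 2 listings, the transformation is simply:


/* before (step 1): divide by the constant path count for every option */
h_CallResult[opt] = exprt * val / (float)RAND_N;

/* after (step 2): multiply by a reciprocal precomputed once, outside all loops */
static const float INV_RAND_N = 1.0f / RAND_N;
/* ... */
h_CallResult[opt] = exprt * val * INV_RAND_N;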


[sli@localhost step3]$ make -B
icpc -c -O3 -ipo -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -o Driver.o Driver.cpp
icpc -c -O3 -ipo -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -o MonteCarloStep2.o MonteCarloStep2.cpp
icpc Driver.o MonteCarloStep2.o -o MonteCarlo -mkl
[sli@cthor-knc14 .solution]$ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 options with path length of 262144.
Completed in  88.9513 seconds.
Computation rate -  368.381 options per second.


This is an indication that, no matter how hard you try, as long as the code runs in scalar, serial mode there is little room left for improvement. So we must look to parallelism.

3.4 Step 3: Vectorization

Vectorization may mean different things in different contexts. In this paper, vectorization means compiler-generated SIMD instructions that use the vector registers inside the CPU. It means taking advantage of the synchronous execution of one instruction on multiple data elements.

There are many ways you can introduce vectorization into your program, ranging from processor intrinsic functions to Intel® Cilk™ Plus array notation. The compiler-based vectorization techniques vary in the amount of control the programmer has over the generated code, the expressiveness of the syntax, and the number of changes required to the serial program.
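To make that range concrete, here is a hedged sketch of one loop expressed at two points on the spectrum: a compiler-directed SIMD pragma (the style used later in this paper) and Intel Cilk Plus array notation. The saxpy names are ours:


void saxpy_pragma(float *y, const float *x, float a, int n)
{
#pragma simd
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

void saxpy_array_notation(float *y, const float *x, float a, int n)
{
    y[0:n] += a * x[0:n];    // Intel Cilk Plus array section notation
}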

Before we can expect a compiler to vectorize serial code and generate SIMD instructions, we have to ensure proper memory alignment. Misaligned memory access can, in serious cases, generate processor faults and, in benign cases, cause cache line splits and redundant object code, all of which have a serious impact on application performance. One way to ensure memory alignment is to always request and work with explicitly aligned memory. Using Intel C++ Composer XE, you can request that statically allocated memory be aligned by prefixing the definition with __attribute__((align(64))). A 64-byte boundary is the minimum alignment requirement for memory destined for the Intel Xeon Phi coprocessor's vector registers. You can also use _mm_malloc and _mm_free to request and release dynamically allocated aligned memory. These memory allocation intrinsics are supported by the Intel compiler and GCC on the Linux operating system.
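A minimal sketch of both alignment styles, using the same attribute form that appears in the listings later in this paper (the <mm_malloc.h> header is an assumption for GCC; the Intel compiler declares _mm_malloc through its own headers):


#include <mm_malloc.h>

// statically allocated buffer, aligned to a 64-byte boundary
__attribute__((align(64))) static float static_buf[1024];

void aligned_demo(int n)
{
    // dynamically allocated buffer, aligned to a 64-byte boundary
    float *dyn_buf = (float *)_mm_malloc(n * sizeof(float), 64);
    /* ... use dyn_buf as a vectorization-friendly buffer ... */
    _mm_free(dyn_buf);
}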


Besides memory alignment, Intel compiler-based vectorization works best when function calls in the hot loop are kept to a minimum, simply because function calls have overhead and not all function calls allow the CPU's vector engine to continue executing in SIMD mode. Inline functions as much as possible: inlining avoids the function call overhead and allows the compiler-based vectorizer to analyze and vectorize the callee code and caller code together.


#include "MonteCarlo.h"
#include "math.h"
#include "mkl_vsl.h"
#define RANDSEED 123

static const float  RVV = RISKFREE-0.5f*VOLATILITY*VOLATILITY;
static const float  INV_RAND_N = 1.0f/RAND_N;
static const float  F_RAND_N = static_cast<float>(RAND_N);
static const float STDDEV_DENOM = 1 / (F_RAND_N * (F_RAND_N - 1.0f));
static const float CONFIDENCE_DENOM = 1 / sqrtf(F_RAND_N);

void MonteCarlo(
    float *h_CallResult,
    float *h_CallConfidence,
    float *S,
    float *X,
    float *T
)
{
    __attribute__((align(4096))) float random [RAND_N];
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

    vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, random, 0.0f, 1.0f);
    for(int opt = 0; opt < OPT_N; opt++) 
    {
        float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        float MuByT = RVV * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0, val2 = 0.0;

#pragma vector aligned
#pragma simd reduction(+:val) reduction(+:val2)
#pragma unroll(4)
        for(int pos = 0; pos < RAND_N; pos++)
        {
            float callValue  = Sval * expf(MuByT + VBySqrtT * random[pos]) - Xval;
            callValue = (callValue > 0) ? callValue : 0;
            val  += callValue;
            val2 += callValue * callValue;
        }

        float exprt = expf(-RISKFREE *T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        float stdDev = sqrtf((F_RAND_N * val2 - val * val)* STDDEV_DENOM);
        h_CallConfidence[opt] = (exprt * 1.96f * stdDev * CONFIDENCE_DENOM);
    }
    vslDeleteStream(&Randomstream);
}


#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <math.h>
#include "MonteCarlo.h"
#include <iostream>
using namespace std;

#define SIMDALIGN 64

double second()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}

inline float RandFloat(float low, float high){
    float t = (float)rand() / (float)RAND_MAX;
    return (1.0f - t) * low + t * high;
}

///////////////////////////////////////////////////////////////////////////////
// Polynomial approximation of cumulative normal distribution function
///////////////////////////////////////////////////////////////////////////////

double CND(double d){
    const double       A1 = 0.31938153;
    const double       A2 = -0.356563782;
    const double       A3 = 1.781477937;
    const double       A4 = -1.821255978;
    const double       A5 = 1.330274429;
    const double RSQRT2PI = 0.39894228040143267793994605993438;

    double
        K = 1.0 / (1.0 + 0.2316419 * fabs(d));

    double
        cnd = RSQRT2PI * exp(- 0.5 * d * d) *
        (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));

    if(d > 0)
        cnd = 1.0 - cnd;

    return cnd;
}

void BlackScholesFormula(
    double& callResult,
    double Sf, //Stock price
    double Xf, //Option strike
    double Tf, //Option years
    double Rf, //Riskless rate
    double Vf  //Volatility rate
){
    double S = Sf, X = Xf, T = Tf, R = Rf, V = Vf;

    double sqrtT = sqrt(T);
    double    d1 = (log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT);
    double    d2 = d1 - V * sqrtT;
    double CNDD1 = CND(d1);
    double CNDD2 = CND(d2);

    double expRT = exp(- R * T);
    callResult   = (S * CNDD1 - X * expRT * CNDD2);
}

int main(int argc, char* argv[])
{
    float 
        *CallResultParallel,
        *CallConfidence,
        *StockPrice,
        *OptionStrike,
        *OptionYears;
    double
	sTime, eTime;
    int 
        mem_size, rand_size, verbose = 0; 
  
    const int RAND_N = 1 << 18;

    if (argc > 2)
    {
      	printf("usage: MonteCarlo <verbose>  where verbose = 1 for validating the result; the default is not to validate. \n");
      	exit(1);
    }
    if (argc == 1)
	verbose = 0;
    if (argc == 2)
	verbose = atoi(argv[1]);

    printf("Monte Carlo European Option Pricing in Single Precision\n");	
    mem_size = sizeof(float)*OPT_N;
    rand_size = sizeof(float)*RAND_N;

    CallResultParallel = (float *)_mm_malloc(mem_size, SIMDALIGN);
    CallConfidence     = (float *)_mm_malloc(mem_size, SIMDALIGN);
    StockPrice         = (float *)_mm_malloc(mem_size, SIMDALIGN);
    OptionStrike       = (float *)_mm_malloc(mem_size, SIMDALIGN);
    OptionYears        = (float *)_mm_malloc(mem_size, SIMDALIGN);

    if (verbose)
    {
        printf("...generating the input data.\n");
    }
    for(int i = 0; i < OPT_N; i++)
    {
        CallResultParallel[i] = 0.0;
        CallConfidence[i]= -1.0;
        StockPrice[i]    = RandFloat(5.0f, 50.0f);
        OptionStrike[i]  = RandFloat(10.0f, 25.0f);
        OptionYears[i]   = RandFloat(1.0f, 5.0f);
    }

    printf("Pricing %d options with path length of %d.\n", OPT_N, RAND_N);
    sTime = second();
    MonteCarlo(
        CallResultParallel,
        CallConfidence,
        StockPrice,
        OptionStrike,
        OptionYears);
    eTime = second();
    printf("Completed in %8.4f seconds.\n",eTime-sTime );
    printf("Computation rate - %8.3f options per second.\n", OPT_N/(eTime-sTime));

    if (verbose)
    {
        double
           delta, sum_delta, sum_ref, L1norm, sumReserve;
        double CallMaster;

        sum_delta = 0;
        sum_ref   = 0;
        sumReserve = 0;

        for(int i = 0; i < OPT_N; i++)
        {
            BlackScholesFormula(CallMaster,
                (double) StockPrice[i], 
                (double) OptionStrike[i],
                (double) OptionYears[i],
                (double) RISKFREE,
                (double) VOLATILITY);
            delta = fabs(CallMaster - CallResultParallel[i]);
            sum_delta += delta;
            sum_ref   += fabs(CallMaster);
            if(delta > 1e-6) 
                sumReserve += CallConfidence[i] / delta;
        }
        sumReserve /= (double)OPT_N;
        L1norm = sum_delta / sum_ref;
        printf("L1 norm: %E\n", L1norm);
        printf("Average reserve: %f\n", sumReserve);

        printf("...freeing CPU memory.\n");
        printf((sumReserve > 1.0f) ? "PASSED\n" : "FAILED\n");
    }
    _mm_free(CallResultParallel);
    _mm_free(CallConfidence);
    _mm_free(StockPrice);
    _mm_free(OptionStrike);
    _mm_free(OptionYears);
    return 0;
}


icpc -c -O3 -ipo -xAVX -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -fno-alias -vec-report2 -o Driver.o Driver.cpp
icpc: remark #10346: optimization reporting will be enabled at link time when performing interprocedural optimizations
icpc -c -O3 -ipo -xAVX -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -fno-alias -vec-report2 -o MonteCarloStep4.o MonteCarloStep4.cpp
icpc: remark #10346: optimization reporting will be enabled at link time when performing interprocedural optimizations
icpc Driver.o MonteCarloStep4.o -o MonteCarlo -mkl
Driver.cpp(130): (col. 5) remark: loop was not vectorized: existence of vector dependence.
Driver.cpp(141): (col. 5) remark: SIMD LOOP WAS VECTORIZED.
Driver.cpp(141): (col. 5) remark: loop was not vectorized: not inner loop.
Driver.cpp(161): (col. 9) remark: LOOP WAS VECTORIZED.
[sli@localhost step4]$ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 32768 options with path length of 262144.
Completed in  10.9624 seconds.
Computation rate - 2989.117 options per second.


Vectorization delivered an 8.11X performance improvement, a little more than the expected value of 8X. Together with the compiler switch, Intel MKL, and the scalar serial optimization, we have achieved close to an 87.14X performance improvement. What's more, we did all this with only one thread running on one core.

3.5 Step 4: Parallelization

Modern microprocessors are all multicore. They have evolved from hyper-threading and dual- and quad-core designs to 8, 10, and 12 cores, and the first product in the Intel Xeon Phi coprocessor family starts with up to 61 cores. Designing a program that scales with the number of cores has become a basic requirement of high performance software development. The common practice is to break the total job into a number of smaller tasks, implicitly or explicitly create lightweight processes or threads to execute those tasks, and bind each thread to a hardware thread or OS-schedulable entity. This process is known as threading or multithreading, and a program written this way is called a multithreaded program.

With Intel C++ Composer XE 2013, you can choose from a variety of ways to create multithreaded programs, depending on how much control you want to exert and the requirements of the problem at hand.

In threading our Monte Carlo implementation, we are looking for a method with few or close-to-zero changes to the vectorized serial code we derived in step 3. Our Monte Carlo European options program is also relatively easy to break into subtasks. Since there are close to 32 K data sets, on a dual-socket Intel Xeon processor system we can break the work into 16 subtasks, each working on 2 K data sets. OpenMP* fits our criteria very well.


#include "MonteCarlo.h"
#include "math.h"
#include "mkl_vsl.h"
#define RANDSEED 123

static const float  RVV = RISKFREE-0.5f*VOLATILITY*VOLATILITY;
static const float  INV_RAND_N = 1.0f/RAND_N;
static const float  F_RAND_N = static_cast<float>(RAND_N);
static const float STDDEV_DENOM = 1 / (F_RAND_N * (F_RAND_N - 1.0f));
static const float CONFIDENCE_DENOM = 1 / sqrtf(F_RAND_N);

void MonteCarlo(
    float *h_CallResult,
    float *h_CallConfidence,
    float *S,
    float *X,
    float *T
)
{
    __attribute__((align(4096))) float random [RAND_N];
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);

    vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, random, 0.0f, 1.0f);
#pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++) 
    {
        float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        float MuByT = RVV * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0, val2 = 0.0;

#pragma vector aligned
#pragma simd reduction(+:val) reduction(+:val2)
#pragma unroll(4) 
        for(int pos = 0; pos < RAND_N; pos++)
        {
            float callValue  = Sval * expf(MuByT + VBySqrtT * random[pos]) - Xval;
            callValue = (callValue > 0) ? callValue : 0;
            val  += callValue;
            val2 += callValue * callValue;
        }

        float exprt = expf(-RISKFREE *T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        float stdDev = sqrtf((F_RAND_N * val2 - val * val)* STDDEV_DENOM);
        h_CallConfidence[opt] = (exprt * 1.96f * stdDev * CONFIDENCE_DENOM);
    }
    vslDeleteStream(&Randomstream);
}


[sli@localhost step5]$ make -B
icpc -c -g -O3 -ipo -openmp -xAVX -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -fno-alias -vec-report2 -o Driver.o Driver.cpp
icpc: remark #10346: optimization reporting will be enabled at link time when performing interprocedural optimizations
icpc -c -g -O3 -ipo -openmp -xAVX -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -fno-alias -vec-report2 -o MonteCarloStep5.o MonteCarloStep5.cpp
icpc: remark #10346: optimization reporting will be enabled at link time when performing interprocedural optimizations
icpc Driver.o MonteCarloStep5.o -o MonteCarlo -L /opt/intel/composerxe/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -openmp
Driver.cpp(130): (col. 5) remark: loop was not vectorized: existence of vector dependence.
Driver.cpp(161): (col. 9) remark: loop was not vectorized: existence of vector dependence.
MonteCarloStep5.cpp(69): (col. 9) remark: SIMD LOOP WAS VECTORIZED.
MonteCarloStep5.cpp(58): (col. 5) remark: loop was not vectorized: not inner loop.
[sli@localhost step5]$ setenv KMP_AFFINITY "compact,granularity=fine" 
[sli@localhost step5]$ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 1998848 options with path length of 262144.
Completed in  43.9896 seconds.
Computation rate - 45439.090 options per second.


On a 2-socket Sandy Bridge system running at 2.6 GHz, we improved the performance by 15.20X. In an execution environment with 16 cores and 32 threads, a 15.20X gain indicates that the majority of the code each thread runs is in the parallel portion. Notice that we set the thread affinity to compact mode, with granularity set to fine, to maximize the shared cache effect. In the end, we achieve a 1324.6X improvement over the baseline we established with GCC 4.4.6.

Here is a summary of the performance gains as we introduced each technology. Each gain is calculated as the ratio of performance after to performance before applying that technology. For example, using the Intel compiler build as the baseline, adding Intel MKL increased performance by 5.53X.

Technology introduced                   Gain over previous step
Intel C++ Composer XE (icpc)            1.37X
Intel MKL RNG                           5.53X
Scalar serial optimization              1.41X
Vectorization                           8.11X
Parallelization (OpenMP)                15.20X
Cumulative, vs. the GCC 4.4.6 baseline  1324.6X

3.6 Step 5: Scale from Intel® Multicore to Intel® Many Integrated Core Architecture

Highly parallel applications such as Monte Carlo can reap immediate benefits from the Intel Xeon Phi coprocessor and its many integrated core architecture. The first product based on the Intel MIC architecture, the coprocessor has up to 61 cores and 244 threads that can take the performance of scalable software to new heights. Moreover, each of the 61 cores comes with a 512-bit wide SIMD engine that can handle 8 lanes of 64-bit data or 16 lanes of 32-bit data, integer or floating point.

3.6.1 Rebuild for Intel MIC architecture

One of the most important features of the Intel C++ Composer XE 2013 is the integration of Intel Multicore and Intel MIC development environments. You have one SKU to install. You can compile and run native multicore applications or compile the same source code to build the executable for the Intel Xeon Phi coprocessor.

You can just recompile the software obtained by following the stepwise optimization framework and create a binary file meant to run natively on the coprocessor.


icpc -c -g -O3 -ipo -openmp -mmic -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -fno-alias -vec-report2 -opt-threads-per-core=4 -o Driver.o Driver.cpp
icpc: remark #10346: optimization reporting will be enabled at link time when performing interprocedural optimizations
icpc -c -g -O3 -ipo -openmp -mmic -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -fno-alias -vec-report2 -opt-threads-per-core=4 -o MonteCarloStep5.o MonteCarloStep5.cpp
icpc: remark #10346: optimization reporting will be enabled at link time when performing interprocedural optimizations
icpc -mmic Driver.o MonteCarloStep5.o -o MonteCarlo -L /opt/intel/composerxe/mkl/lib/mic -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -openmp
Driver.cpp(130): (col. 5) remark: loop was not vectorized: existence of vector dependence.
Driver.cpp(161): (col. 9) remark: loop was not vectorized: existence of vector dependence.
MonteCarloStep5.cpp(69): (col. 9) remark: SIMD LOOP WAS VECTORIZED.
MonteCarloStep5.cpp(58): (col. 5) remark: loop was not vectorized: not inner loop.


To run the MonteCarlo binary created by the previous makefile, we have to copy it from the host system to the card using scp. The three Intel MKL shared libraries have to be copied as well. After that, we open the device OS command shell using ssh. Some environment variables have to be set to ensure the correct OpenMP runtime behavior.


[sli@cthor-knc14 step5]$ scp MonteCarlo cthor-knc14-mic1:
MonteCarlo                                                  100%  493KB 492.8KB/s   00:00
[sli@cthor-knc14 step5]$ ssh cthor-knc14-mic1
~ $ export LD_LIBRARY_PATH=.
~ $ export KMP_AFFINITY="compact,granularity=fine"
~ $ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 1998848 options with path length of 262144.
Completed in   4.9851 seconds.
Computation rate - 400964.798 options per second.


The same program that took 44 seconds to complete on the host system now completes in just under 5 seconds, an instant 8.82X improvement. In the following sections, we explore ways to further improve the application's performance.

3.6.2 Target Architecture Optimization

Different implementations of the Intel architecture (IA) have unique performance profiles, which we can take advantage of as we move software from one implementation to another. On the coprocessor, Intel implemented a hardware function approximation unit called the extended math unit (EMU). Four elementary functions, log2(), exp2(), rsqrt(), and rcp(), are implemented directly in hardware with a 1 to 2 cycle throughput and a 4 to 8 cycle latency. These fast elementary functions can in turn accelerate four other functions that can be composed from them with simple transformations.

For example, the base 2 exponential and the base e exponential are related by the following mathematical identity, known as the change-of-base formula:


e^x = 2^(x * log2(e))
ln(x) = log2(x) / log2(e) = log2(x) * ln(2) = log2(x) * M_LN2
exp(x) = exp2(x * log2(e)) = exp2(x * M_LOG2E)


Notice that log2(e), the base 2 logarithm of e, is a constant defined in math.h as M_LOG2E.

Elementary Function    Latency    Throughput
exp2()                 8          2
log2()                 4          1
rcp()                  4          1
rsqrt()                4          1

Derived Function       Latency    Throughput
pow()                  16         4
sqrt()                 8          2
div()                  8          2
ln()                   8          2

Similarly, pow() can be composed from exp2(), log2(), and a multiplication; sqrt() can be composed from rsqrt() and rcp(); div() can be composed from rcp() and a multiplication; and ln() can be composed from log2() and a constant multiplication.
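A hedged C sketch of these compositions follows. Whether the reciprocal and square root forms actually map to the hardware RCP and RSQRT instructions depends on compiler flags such as -no-prec-div and -no-prec-sqrt, and the function names are ours:


#include <math.h>

// pow via exp2/log2: x^y = 2^(y * log2(x)), valid for x > 0
static inline float pow_via_emu(float x, float y) { return exp2f(y * log2f(x)); }

// ln via log2: ln(x) = log2(x) * ln(2)
static inline float ln_via_emu(float x) { return log2f(x) * (float)M_LN2; }

// division via reciprocal: a/b = a * (1/b)
static inline float div_via_emu(float a, float b) { return a * (1.0f / b); }

// sqrt via rsqrt: sqrt(x) = x * (1/sqrt(x)), valid for x > 0
static inline float sqrt_via_emu(float x) { return x * (1.0f / sqrtf(x)); }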

This creates an opportunity in our Monte Carlo code: we can adjust the code so that the base 2 versions of these functions are called instead of the natural (base e) versions.


        float VBySqrtT = VLOG2E * sqrtf(T[opt]);
        float MuByT = RVVLOG2E * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0, val2 = 0.0;

#pragma vector aligned
#pragma simd reduction(+:val) reduction(+:val2)
#pragma unroll(4) 
        for(int pos = 0; pos < BLOCKSIZE; pos++)
        {
            float callValue  = Sval * exp2f(MuByT + VBySqrtT * random[pos]) - Xval;
            callValue = (callValue > 0) ? callValue : 0;
            val  += callValue;
            val2 += callValue * callValue;
        }


In this case, there are two terms inside exp2f(), and we have to adjust them both and hoist the adjustment outside the inner loop.

You may think all we have to do is replace exp(x) and log(x) with exp2(x*M_LOG2E) and log2(x)*M_LN2 (or log2(x)/M_LOG2E). However, this alone is not enough to reap the full benefit of the EMU, because of the additional multiplication per call. In most cases, when you can precompute the multiplier outside the inner loop, the cost of the multiplication goes away and the performance benefit of the EMU functions can really shine.
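A before-and-after sketch of that hoisting, mirroring the RVVLOG2E and VLOG2E constants in the listing that follows in the next section:


/* before: natural exponential in the hot loop; the naive rewrite
   exp2f((MuByT + VBySqrtT * random[pos]) * M_LOG2E)
   still pays one extra multiply per iteration */

/* after: fold log2(e) into both loop-invariant coefficients, once per option */
float MuByT    = RVV * (float)M_LOG2E * T[opt];
float VBySqrtT = VOLATILITY * (float)M_LOG2E * sqrtf(T[opt]);
/* ... inner loop ... */
float callValue = Sval * exp2f(MuByT + VBySqrtT * random[pos]) - Xval;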

3.6.3 Memory Access Pattern Optimization

In our current implementation of Monte Carlo, all threads share the pregenerated 256 K random numbers, which have a footprint of 1 MB in single precision floating point. These random numbers do not fit in a worker thread's L2 cache. Each thread can only bring a portion of the data into the cache; when it finishes the current portion, it generates an L2 miss event, which triggers the ring interconnect to bring more data into the L2 cache. If the threads are not coordinated in accessing these data, each thread may access any portion of this 1 MB at any time, and the ring interconnect becomes too busy satisfying L2 cache misses from all 244 threads. Essentially, the ring interconnect's bandwidth becomes the bottleneck even though only 1 MB of data is involved: the uncoordinated memory access makes the 1 MB of data cross the interconnect up to 244 times, creating a bandwidth demand equivalent to 244 MB.

Coordinated access ensures the data is fetched from memory once, when the first thread requests it. The data then crosses the ring interconnect to land in the first thread's L2 cache. When other threads need the data, they get it from that L2 cache. Each thread makes all its accesses to the data while it remains in L2. Since memory access is coordinated, the 1 MB of data crosses the ring interconnect once, and the memory bandwidth pressure is relieved.

To maximize the use of L2 capacity, we need to calculate how much data can fit in each worker thread's share of the L2 cache. Each core has 512 KB of cache shared by 4 threads, so each thread can use 128 KB for all its needs. Subtracting the runtime's needs, only 64 KB is under the application's control. 64 KB holds 16 K single precision or 8 K double precision floating point numbers.

We can restructure our Monte Carlo code to process a 16 K block of random numbers at a time and move on to the next block when all threads are done with the current one. We have to save the intermediate, partial results in memory; when all data blocks are processed, we then turn the partial results into the final results.


#include "MonteCarlo.h"
#include "math.h"
#include "mkl_vsl.h"
#include "omp.h"
#define RANDSEED 123

static const float  RVVLOG2E = (RISKFREE-0.5f*VOLATILITY*VOLATILITY)*M_LOG2E;
static const float  INV_RAND_N = 1.0f/RAND_N;
static const float  F_RAND_N = static_cast<float>(RAND_N);
static const float STDDEV_DENOM = 1 / (F_RAND_N * (F_RAND_N - 1.0f));
static const float CONFIDENCE_DENOM = 1 / sqrtf(F_RAND_N);
static const int BLOCKSIZE = 32*1024;
static const float  RLOG2E = RISKFREE*M_LOG2E;
static const float  VLOG2E = VOLATILITY*M_LOG2E;

void MonteCarlo(
    float *h_CallResult,
    float *h_CallConfidence,
    float *S,
    float *X,
    float *T
)
{
    __attribute__((align(4096))) float random [BLOCKSIZE];
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);
#ifdef _OPENMP
    kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
#endif
#pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        h_CallResult[opt]     = 0.0f;
        h_CallConfidence[opt] = 0.0f;
    }

    const int nblocks = RAND_N/BLOCKSIZE;
    for(int block = 0; block < nblocks; ++block)
    {
        vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, BLOCKSIZE, random, 0.0f, 1.0f);
#pragma omp parallel for
        for(int opt = 0; opt < OPT_N; opt++) 
        {
           float VBySqrtT = VLOG2E * sqrtf(T[opt]);
           float MuByT = RVVLOG2E * T[opt];
           float Sval = S[opt];
           float Xval = X[opt];
           float val = 0.0, val2 = 0.0;

#pragma vector aligned
#pragma simd reduction(+:val) reduction(+:val2)
#pragma unroll(4) 
           for(int pos = 0; pos < BLOCKSIZE; pos++)
           {
              float callValue  = Sval * exp2f(MuByT + VBySqrtT * random[pos]) - Xval;
              callValue = (callValue > 0) ? callValue : 0;
              val  += callValue;
              val2 += callValue * callValue;
           }

           h_CallResult[opt]     +=  val;
           h_CallConfidence[opt] +=  val2;
        }
    }
#pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        const float val      = h_CallResult[opt];
        const float val2     = h_CallConfidence[opt];
        const float  exprt    = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt]     = exprt * val * INV_RAND_N;
        const float  stdDev   = sqrtf((F_RAND_N * val2 - val * val) * STDDEV_DENOM);
        h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev * CONFIDENCE_DENOM);  // 1.96f: 95% confidence factor, as in the earlier steps
    }
    vslDeleteStream(&Randomstream);
}


Relieving the bandwidth pressure makes Monte Carlo a truly compute-bound workload. It shaves almost a second off the total runtime and improves the performance by 1.24X.


~ $ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 1998848 options with path length of 262144.
Completed in   4.0256 seconds.
Computation rate - 496530.481 options per second.


The memory access pattern optimization works on both Intel Xeon processors and Intel Xeon Phi coprocessors. Our calculation of usable L2 capacity per thread happens to give the same answer on both flavors of IA. Therefore, we can compile the Intel MIC optimized program back for Intel multicore processors. As you can see, it compiles and runs 1.11X faster than before.


[sli@localhost .solution]$ make -B
icpc -c -O3 -openmp -xAVX -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -fno-alias -vec-report2  -o Driver.o Driver.cpp
Driver.cpp(130): (col. 5) remark: loop was not vectorized: existence of vector dependence.
Driver.cpp(161): (col. 9) remark: LOOP WAS VECTORIZED.
icpc -c -O3 -openmp -xAVX -fimf-precision=low -fimf-domain-exclusion=31 -fimf-accuracy-bits=11 -no-prec-div -no-prec-sqrt -fno-alias -vec-report2  -o MonteCarloStep5.o MonteCarloStep5.cpp
MonteCarloStep5.cpp(60): (col. 5) remark: loop was not vectorized: existence of vector dependence.
MonteCarloStep5.cpp(53): (col. 5) remark: LOOP WAS VECTORIZED.
MonteCarloStep5.cpp(75): (col. 12) remark: SIMD LOOP WAS VECTORIZED.
MonteCarloStep5.cpp(64): (col. 5) remark: loop was not vectorized: not inner loop.
MonteCarloStep5.cpp(88): (col. 5) remark: LOOP WAS VECTORIZED.
icpc -xAVX -openmp Driver.o MonteCarloStep5.o -o MonteCarlo -L /opt/intel/composerxe/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
[sli@localhost .solution]$ ./MonteCarlo
Monte Carlo European Option Pricing in Single Precision
Pricing 1998848 options with path length of 262144.
Completed in  39.6060 seconds.
Computation rate - 50468.304 options per second.


4. Conclusions

In this paper, we provided a brief introduction to the Monte Carlo method and its application to derivative pricing. We also looked at the latest release of the Intel Xeon Phi coprocessor and the integrated software development environment of Intel C++ Composer XE 2013. We derived our own C/C++ implementation of Monte Carlo European call option pricing, which we used as a case study to demonstrate a stepwise optimization framework.

Following the five steps of the performance optimization framework with Intel® C++ Composer XE 2013, the Monte Carlo European option pricing code achieved more than a 1471X performance improvement. We also demonstrated that you can recompile and rebuild an Intel Xeon processor-optimized application to run natively on Intel Xeon Phi coprocessors, further optimize the application there and address the scalability issues, and then recompile the resulting code back to run on Intel Xeon processors at even higher performance.

The stepwise optimization framework has proved to be effective not only in financial numerical applications, but also in general scientific computations. Intel C++ Composer XE 2013 makes it even easier for scientific programmers to approach performance optimization as a structured activity, quickly understand the limitations of a program, and exploit every kind of parallelism available.

Additional Resources

  • Intel® Composer XE 2013 for Linux* including Intel® MIC Architecture**
  • Intel® C++ Compiler XE 12.0 User and Reference Guides**
  • Intel® MIC Architecture Optimization Guide***
  • C++ Technical Report 1

** This documentation is all installed with Intel® Composer.

*** This documentation is available on the MIC developers’ portal: https://mic-dev.intel.com.

About the Author

Shuo Li works in the Software & Service Group at Intel Corporation. He has 24 years of experience in software development. His main interests are parallel programming, computational finance, and application performance optimization. In his recent role as a software performance engineer covering the financial service industry, Shuo works closely with software developers and modelers and helps them achieve the best possible performance on Intel platforms. Shuo holds a Master's degree in Computer Science from University of Oregon and an MBA degree from Duke University.

For complete information about compiler optimizations, see the Optimization Notice.