By Artem Gruzdev, Published: 10/27/2015, Last Updated: 10/27/2015

The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors.

The benchmark suite with all its applications and input sets is available as open source free of charge. Some of the benchmark programs have their own licensing terms, which might limit their use in some cases.

The Black-Scholes benchmark is a one of the 13 benchmarks in the PARSEC. This benchmark does option pricing with Black-Scholes Partial Differential Equation (PDE). The Black-Scholes equation is a differential equation that describes how, under a certain set of assumptions, the value of an option changes as the price of the underlying asset changes.

The formula for a put option is similar. The cumulative normal distribution function, CND(x), gives the probability that a normally distributed random variable will have a value less than x. There is no closed form expression for this function, and as such it must be evaluated numerically. Alternatively, the values of this function may be pre-computed and hard-coded in the table; in this case, they can be obtained at runtime using table lookup.

Based on this formula, one can compute the option price analytically based on the five input parameters. Using this analytical approach to price option, the limiting factor lies with the amount of floating-point calculation a processor can perform.

If we look at the results of Intel® VTune™ Amplifier XE profiling, we see two major hotspots.

- Read from input file and write to output file.
- Black-Scholes calculations.

Let’s look at each one in more detail.

The problem is that the input file contains 10 million rows of data. Every row is an element of struct **OptionData** which contains nine parameters. That’s why before we start the calculations we have to spend a lot of time reading and parsing the data from the input file to an array of **OptionData** struct. This array is called **data**.

Obviously the same problem occurs after all the calculations, and you have to write the results in an output file.

These problems can be solved by using parallel read and write by pointers. For more details you can read about it here: Optimization of Data Read/Write in a Parallel Application | Intel® Developer Zone.

We will consider this case in more detail. All calculations contain two functions: **CNDF** and **BlkSchlsEqEuroNoDiv**.

This function realizes the cumulative distribution function of the standard normal distribution. For more details, refer to any book about the theory of probability.

In our case, this function is as follows:

```
#define inv_sqrt_2xPI 0.39894228040143270286
fptype CNDF ( fptype InputX )
{
int sign;
fptype OutputX;
fptype xInput;
fptype xNPrimeofX;
fptype expValues;
fptype xK2;
fptype xK2_2, xK2_3;
fptype xK2_4, xK2_5;
fptype xLocal, xLocal_1;
fptype xLocal_2, xLocal_3;
// Check for negative value of InputX
if (InputX < 0.0) {
InputX = -InputX;
sign = 1;
} else
sign = 0;
xInput = InputX;
// Compute NPrimeX term common to both four & six decimal accuracy calcs
expValues = exp(-0.5f * InputX * InputX);
xNPrimeofX = expValues;
xNPrimeofX = xNPrimeofX * inv_sqrt_2xPI;
xK2 = 0.2316419 * xInput;
xK2 = 1.0 + xK2;
xK2 = 1.0 / xK2;
xK2_2 = xK2 * xK2;
xK2_3 = xK2_2 * xK2;
xK2_4 = xK2_3 * xK2;
xK2_5 = xK2_4 * xK2;
xLocal_1 = xK2 * 0.319381530;
xLocal_2 = xK2_2 * (-0.356563782);
xLocal_3 = xK2_3 * 1.781477937;
xLocal_2 = xLocal_2 + xLocal_3;
xLocal_3 = xK2_4 * (-1.821255978);
xLocal_2 = xLocal_2 + xLocal_3;
xLocal_3 = xK2_5 * 1.330274429;
xLocal_2 = xLocal_2 + xLocal_3;
xLocal_1 = xLocal_2 + xLocal_1;
xLocal = xLocal_1 * xNPrimeofX;
xLocal = 1.0 - xLocal;
OutputX = xLocal;
if (sign) {
OutputX = 1.0 - OutputX;
}
return OutputX;
}
```

This is known as the Black-Scholes model, which gives a theoretical estimate of the price of European-style options. All we need to know for our purposes is a Black-Scholes formula, which will be used for calculations. The formula is as follows:

`C(S,t)=N(d`

- The value of a call option for a non-dividend-paying underlying stock._{1})S-N(d_{2})Ke^{-r(T-t)}

`P(S,t)=Ke`

- The price of a corresponding put option.^{-r(T-t)}N(-d_{2})-N(d_{1})S

`d`

_{1}= 1/(σ√(T-t))(ln(S/K)+(r+σ^{2}/2)(T-t))

`d`

_{2}= 1/(σ√(T-t))(ln(S/K)+(r-σ^{2}/2)(T-t))

For both, as above:

- N(∙) is the
**CNDF**function. - T - t is the time to maturity.
- S is the spot price of the underlying asset.
- K is the strike price.
- r is the risk.
- σ is the volatility of returns of the underlying asset.

In our case, this function is as follows:

```
fptype BlkSchlsEqEuroNoDiv( fptype sptprice,
fptype strike, fptype rate, fptype volatility,
fptype time, int otype, float timet )
{
fptype OptionPrice;
// local private working variables for the calculation
fptype xStockPrice;
fptype xStrikePrice;
fptype xRiskFreeRate;
fptype xVolatility;
fptype xTime;
fptype xSqrtTime;
fptype logValues;
fptype xLogTerm;
fptype xD1;
fptype xD2;
fptype xPowerTerm;
fptype xDen;
fptype d1;
fptype d2;
fptype FutureValueX;
fptype NofXd1;
fptype NofXd2;
fptype NegNofXd1;
fptype NegNofXd2;
xStockPrice = sptprice;
xStrikePrice = strike;
xRiskFreeRate = rate;
xVolatility = volatility;
xTime = time;
xSqrtTime = sqrt(xTime);
logValues = log( sptprice / strike );
xLogTerm = logValues;
xPowerTerm = xVolatility * xVolatility;
xPowerTerm = xPowerTerm * 0.5;
xD1 = xRiskFreeRate + xPowerTerm;
xD1 = xD1 * xTime;
xD1 = xD1 + xLogTerm;
xDen = xVolatility * xSqrtTime;
xD1 = xD1 / xDen;
xD2 = xD1 - xDen;
d1 = xD1;
d2 = xD2;
NofXd1 = CNDF( d1 );
NofXd2 = CNDF( d2 );
FutureValueX = strike * ( exp( -(rate)*(time) ) );
if (otype == 0) {
OptionPrice = (sptprice * NofXd1) - (FutureValueX * NofXd2);
} else {
NegNofXd1 = (1.0 - NofXd1);
NegNofXd2 = (1.0 - NofXd2);
OptionPrice = (FutureValueX * NegNofXd2) - (sptprice * NegNofXd1);
}
return OptionPrice;
}
```

The only function that is in main() is the **bs_thread(), t**his function is as follows:

```
#ifdef WIN32
DWORD WINAPI bs_thread(LPVOID tid_ptr){
#else
int bs_thread(void *tid_ptr) {
#endif
int i, j;
fptype price;
fptype priceDelta;
int tid = *(int *)tid_ptr;
int start = tid * (numOptions / nThreads);
int end = start + (numOptions / nThreads);
for (j=0; j<NUM_RUNS; j++) {
#ifdef ENABLE_OPENMP
#pragma omp parallel for private(i, price, priceDelta)
for (i=0; i<numOptions; i++) {
#else //ENABLE_OPENMP
for (i=start; i<end; i++) {
#endif //ENABLE_OPENMP
/* Calling main function to calculate option value based on
* Black & Scholes's equation.
*/
price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i],
rate[i], volatility[i], otime[i],
otype[i], 0);
prices[i] = price;
#ifdef ERR_CHK
priceDelta = data[i].DGrefval - price;
if( fabs(priceDelta) >= 1e-4 ){
printf("Error on %d. Computed=%.5f, Ref=%.5f, Delta=%.5f\n",
i, price, data[i].DGrefval, priceDelta);
numError ++;
}
#endif
}
}
return 0;
}
#endif //ENABLE_TBB
```

Let’s take a loop from 0 to numOptions and place it to **BlkSchlsEqEuroNoDiv** and add one more parameter to this function. The **prices** array will be this parameter. Of course our function **BlkSchlsEqEuroNoDiv** will take *sptprice*, *strike*, *rate*, *volatility*, *time*, *otype* arrays entirely. We also need to add some OpenMP* pragmas to parallelize this loop:

```
#pragma omp parallel for private(i)
#pragma simd
#pragma vector aligned
```

Thus we will get a performance gain by vectorization and parallelization.

If we read more about the **CDF** function of the standard normal distribution we will see that this function can be expressed by using the **error function**:

`N(x)= 1/2(1+erf(x/√2))`

Let’s create this implementation of the **CDF** function and try to use it:

```
# define ERF(x) erff(x)
# define INVSQRT(x) 1.0f/sqrtf(x)
# define HALF 0.5f
fptype cdfnorm (fptype input)
{
fptype buf;
fptype output;
buf = input*INVSQRT(2);
output = HALF + HALF*ERF(buf);
return output;
}
```

Before we use it in a calculation, I want to mention that I used the **erff(x)** function, which is contained in the Intel® Math Kernel Library. I used **#include <mathimf.h>** at the top of source file.

Now we are ready to conduct the experiments!

The system I used to test my modification was the dual Intel® Xeon® E5-2697 v3 server. The Intel Xeon E5-2697 v3 processor contains 14 cores with 2.6 GHz core frequency. Also it has Intel® Hyper-Threading Technology on. Thus we have 28 threads and 35 MB of L3 cache per package. In sum we have 56 available threads.

I used the Intel® C++ Compilers 2013.

The compiler keys that I used are as follows:

```
export CFLAGS="-O2 -xCORE-AVX2 -funroll-loops -opt-prefetch -g -traceback"
```

These keys may be changed in the configure file «icc.bldconf».

I launched the blackscholes benchmark, and the results were as follows:

Thus, we made some performance gains due to the modifications of the computational functions.

With the I/O parallelization using the method specified earlier in this article, we can achieve a big performance gain.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804