Разместил shuo-li (Intel)

## 1. Introduction

In its relentless effort to lead innovation and deliver greater compute capacity and lower power consumption to satisfy the growing demands across the industry segments and the evolving usage models, Intel introduced a new set of instructions called the Intel® Advanced Vector Extensions (Intel® AVX) implemented on a wide range of Intel platforms.

Based on top of a rich legacy of Intel SSE, Intel AVX introduces another instruction set architecture extension to enable a flexible programming environment, succinct coding with vector and scalar datasets, and more power-efficient performance across data processing. Intel AVX provides the infrastructure and building blocks for the growing needs of applications such as financial analysis, and other High Performance Computing (HPC) segments.

This paper provides a brief introduction to the Intel Instruction Set Architecture (Intel ISA) and gives an overview of the new instructions and capabilities in the context of how they can be used to accelerate computation using one of the most popular models in the financial service industry - the Black-Scholes valuation.

**1.1 ISA and ISA Extensions, What are they and why you should care**

ISA is the part of a microprocessor architecture related to programming. It includes native data types, instructions, registers, addressing modes, memory architecture, interrupt and I/O, etc. The ISA defines a specification of microprocessor functionality implemented by a family of microprocessors. ISA usually evolves over time with new and enhanced functionality to deliver better performance and expose new features while at the same time maintaining a compatible programming model.

Developers benefit because the ISA extensions bring higher application performance, lower power consumption and software backward compatibility. For example, floating-point (FP) arithmetic functions collectively known as 8087 were defined and added as an extension to 8086 microprocessors. The same application software which used to call floating point library routines, can execute 8087 instructions. While earlier extensions such as x87 focused on adding instructions with single data, more recent extensions added instructions that execute a single instruction with multiple data, hence with Single Instruction Multiple Data or SIMD instructions.

Beginning with the Intel® Pentium® II and Intel® Pentium® with Intel® MMX technology processor families, six extensions have been introduced into the Intel 64 and IA-32 architectures to perform SIMD operations. These extensions include the MMX technology, Streaming SIMD extension, or SSE, SSE2, SSE3, Supplemental SSE3, and SSE4. Each of these extensions provides a group of instructions that perform SIMD operations on packed integer and/or packed floating-point data elements.

**1.2 Intel Advanced Vector Extensions **

Intel AVX is a new 256-bit SIMD FP vector extension of Intel Architecture. It accelerates the trend towards using SIMD technology for floating point computations in general-purpose scientific calculations such as financial derivative pricing. Intel AVX is a comprehensive ISA extension of the Intel 64 Architecture. The main elements of Intel AVX are:

- Support for wider vector data (up to 256-bit).
- Efficient instruction encoding scheme with 3 and 4 operand instruction syntax.
- Flexibility in the programming environment, ranging from branch handling to relaxed memory alignment requirements.
- New data manipulation and arithmetic compute primitives, including broadcast, permute, fused-multiply-add, etc.

While any application making heavy use of floating-point or integer SIMD can use Intel AVX, the applications that show the best benefit are those that are floating point compute intensive and can be vectorized. The traditional examples of such applications are audio/video processing and codecs. In this paper, we demonstrate that financial services analysis and modeling is another type of application that can also benefit from Intel AVX extensions.

## 2. Implementation of Black-Scholes Formula on Intel Processors with AVX

**2.1 Black-Scholes and the Black-Scholes Formula**

The Black Scholes Model is one of the most important concepts in modern quantitative finance theory. It was developed in 1973 by Fisher Black, Robert Merton and Myron Scholes and is still widely used today, and regarded as one of the best ways of determining fair prices of financial derivatives.

Robert Merton was the first one to publish a closed-form solution to the Black-Scholes Equation and for European call options *c*, and European put options *p*, obtained a solution known as the Black-Scholes-Merton Formula.

where,

The function cnd(x) is the cumulative normal distribution function. It calculates the probability that a variable with a standard normal distribution of ? (0, 1) be less than x. cnd(x) is approximated using a polynomial function defined as:

with

**2.2 Implementation of Black-Scholes Formula**

The Black-Scholes Formula is used widely in almost every aspect of quantitative finance. The Black-Scholes calculation has essentially permeated in every quantitative finance library by traders and quantitative analysts alike. The Black-Scholes calculation has become the hallmark of any computer architecture for the global FSI segment. In this paper, we look at how Black-Scholes calculations are performing on Intel microprocessors with AVX technology.

Let’s look at a hypothetic situation in which a firm has to calculate European options for millions of financial instruments. For each instrument, it has current price, strike price and option expiration time. For each set of these data, it makes several thousand Black-Scholes calculations, much like the way options of neighboring stock prices, strike prices and different option expiration times were calculated.

Parallel version runs at 0.00589 Billion option per second:

Completed pricing 1152 million options in 234.880353 seconds:

**Figure 1: Code and output of Black-Scholes when compiled with gcc -O2**

## 3. Optimization of Black-Scholes Implementation

Straightforward implementation of the Black-Scholes formula does not guarantee high performance. Using GNU Compiler Collection version 4.4.6 with the `-O2`

optimization switch on an Intel® Xeon® EP 3.1 Ghz system took 234.88 seconds to price 1.152 million options. The throughput rate is merely 5.89 million options per second.
In this section, we are going to improve the performance of our Black-Scholes implementation to achieve the lowest possible elapsed time and highest possible throughput.

**3.1 Stepwise Optimization Framework**

In this section, we highlight an optimization framework that takes a systematic approach to application performance improvement. This framework takes an application into 4 optimization stages. Each stage attempts to improve the application performance using one orthogonal direction by applying a single technique. Following this methodology, an application can achieve the highest performance possible on Intel Microprocessor Architecture.

**Figure 2: Optimization framework**

**3.2 Stage 1: Leverage Optimized Tools and Library**

In the initial stage, before you even start any optimized effort, ask yourself: Are you trying to reinvent the wheel? If the problem has been solved by someone else, the best strategy is to leverage the existing work and spend your effort on the problem that is yet to be solved.

In the Black-Scholes implementation, we should ask if we have access to a better compiler and C/C++ runtime library. This question leads us to Intel® Parallel Composer XE 2011 SP1 update2. While GCC may target general purpose application development, Intel Parallel Composer XE targets high performance application development. If your target execution environment is based on a more recent release of an Intel microprocessor product and you allow the compiler to perform mathematical transformation based on the law of association, distribution etc., you definitely should use Intel Compilation tools. Replacing g++ with icpc using the same compiler switches, the program ran in 46 seconds, a quick, easy 5X improvement without any code changes:

Parallel version runs at 0.02986 Billion option per second:

Completed pricing 1152 million options in 46.295983 seconds:

Intel Parallel Composer XE 2011 also includes Intel’s enhancement to C/C++ runtime library. One of the more popular transcendental functions, error function `erf()`

, is included in Parallel Composer XE’s runtime library libm, but not in GCC’s. Since we have decided to use Intel Compilation tools, we can further explore the inherent connection between Intel libm’s error function and cumulative normal distribution function in the Black-Scholes Formula.

const float INV_SQRT2 = 1.0/sqrt(2.0); const float HALF = 0.5; float CND(float d) { return HALF + HALF*erf( INV_SQRT2*d); }

We can achieve another 13% improvement when we make this connection.

Parallel version runs at 0.03382 Billion option per second:

Completed pricing 1152 million options in 40.875621 seconds:

In summary, maximize use of existing higher performance solutions so that you can increase the amount of time you have to focus on your own application. With a simple compiler change and use of transcendental functions, it is possible to achieve significant speedup, as we have been able to show by increasing performance by 5.75x for the Black-Scholes formula.

**3.3 Stage 2: Scalar and Serial Optimization **

Once you have exhausted the available optimization solutions, in order to extract greater performance from your application, you will need to obtain the application’s source code and begin the optimization process. Before you begin active parallel programming, you need to make sure your application delivers the right results before you vectorize and parallelize it. Equally important, you need to make sure it does the minimum number of operations to get that correct result. Normally, you look at the data and algorithm related issues such as:

- Choosing the right floating point precision
- Choosing the right approximation method accuracy: polynomial vs. rational
- Avoiding jump algorithms
- Reducing the loop operation strength by using iteration calculations
- Avoiding or minimizing conditional branches in your algorithm
- Avoiding repetitive calculations, using previously calculated results.

You also have to deal with language-related performance issues. Since we have chosen C/C++, the C/C++ related issues are:

- Use explicit typing for all constants to avoid auto-promotion
- Choose the right types of C runtime function
`exp()`

vs.`expf()`

;`abs()`

vs.`fabs()`

- Explicitly tell compiler about point aliases
- Explicitly Inline function calls to avoid overhead

Like hundreds of routines in Numerical Recipes in C, the Black-Scholes Formula was written in float. The dynamic range of a 32-bit floating point is good enough for most quantitative finance applications. Input data, output data and arithmetic are all in 32 bits. Any accidental promotion from 32 to 64 bit would result in performance loss and zero accuracy gain. All the constants and library function calls should be explicitly typed to float.

Mathematicians like perfect symmetry. Sometimes their formula reflects this kind of preference. However, this tendency can show up as a performance penalty when their formulae are implemented in verbose. Call and Put options calculations are two such cases.

Merton’s formula is very symmetric; however, once you have spent cycles on the call option, you don’t have to spend the same amount of cycles on the put option; even Merton Formula might have suggested so. The reason is that call and put options satisfy *put-call* parity.

*put-call* parity can also be derived mathematically from the fact that

In summary, once you have obtained the call option, the put option is just two additions away. Here is the modified code followed by command lines:

Parallel version runs at 0.03909 Billion option per second:

Completed pricing 1152 million options in 35.364703 seconds:

**Figure 3: Modified code to leverage the put-call parity**

Putting all things together, compiling with `-O2 -xAVX`

to request AVX extensions and without introducing either vectorization (using `-no-vec`

if necessary) or parallelization, we achieved 6.64X performance over GCC on the same hardware.

**3.4 Stage 3: Vectorization **

Optimized scalar code paved a solid foundation for vectorization. In this section, we introduce vectorization to the Black-Scholes source code. Vectorization may mean different things to different people; in this context, it means we are taking advantage of SIMD registers and SIMD instructions at the processor level. There are many ways you can introduce vectorization to your program, ranging from using processor intrinsic functions to using Intel® Cilk™ Plus Array Notation. These compiler-based vectorization techniques vary in terms of the amount of control the programmer has on generated code, the expressiveness of the syntax, and the amount of changes required to the serial program.

**Figure 4: Vectorization techniques**

Before we expect the compiler to vectorize serial code and generate SIMD instructions, the programmer has to ensure proper memory alignment. Misaligned memory access can, in serious cases, generate processor faults and in benign cases, cause cache line splits and redundant object code - all impacting performance. One way to ensure memory alignment is to always request and work with strictly aligned memory. Using Intel Parallel Composer XE 2011, you can request statically allocated memory by prefixing the memory definition with `__attribute__(align(32))`

. A 32-byte boundary is the minimum alignment requirement for memory designated for YMMX registers. You can also use `_mm_malloc`

and `_mm_free`

to request and release dynamically allocated memory.

CallResult = (float *)_mm_malloc(mem_size, 32); PutResult = (float *)_mm_malloc(mem_size, 32); StockPrice = (float *)_mm_malloc(mem_size, 32); OptionStrike = (float *)_mm_malloc(mem_size, 32); OptionYears = (float *)_mm_malloc(mem_size, 32); ... _mm_free(CallResult); _mm_free(PutResult); _mm_free(StockPrice); _mm_free(OptionStrike); _mm_free(OptionYears);

**Figure 5: Examples of calls requesting and releasing memory**

After taking care of memory alignment, we are ready to choose a vectorization method for our Black-Scholes implementation. To minimize the amount of work and maintain portability, we take the user-initiated semiautomatic vectorization approach to vectorization: the user informs the compiler that a loop should be vectorized by using `#pragma SIMD`

. The compiler does everything it can to vectorize the loop and generates an error message if it cannot vectorize the loop marked `#pragma SIMD`

. This behavior is different from the previous model in which the user simply provides suggestions in `#pragma IVDEP`

and the compiler evaluates if the vectorized code can be executed faster than serial code. Vectorized code will be generated when the compiler thinks it can outperform the serial version. In this model, it’s the programmer’s responsibility to ensure that vectorization overhead does not exceed any speedup gain.

for(int i = 0; i < NUM_ITERATIONS; i++) #pragma simd for(int opt = 0; opt < OPT_N; opt++) { float CallVal =0.0f, PutVal = 0.0f; float T = OptionYears[opt]; float X = OptionStrike[opt]; float S = StockPrice[opt]; float sqrtT = sqrtf(T); float d1 = (logf(S / X) + (RISKFREE + 0.5f * VOLATILITY * VOLATILITY) * T) / (VOLATILITY * sqrtT); float d2 = d1 - VOLATILITY * sqrtT; float CNDD1 = HALF + HALF*erff(INV_SQRT2*d1); float CNDD2 = HALF + HALF*erff(INV_SQRT2*d2); float expRT = expf(-RISKFREE * T); CallVal += S * CNDD1 - X * expRT * CNDD2; PutVal += CallVal + expRT - S; CallResult[opt] = CallVal ; PutResult[opt] = PutVal ; }

Parallel version runs at 0.27955 Billion option per second:

Completed pricing 1152 million options in 4.945128 seconds:

**Figure 6: Example of user-initiated vectorization**

Vectorized code takes advantage of SIMD instructions in modern microprocessors and delivers application performance without higher frequency or higher core count. In our case, vectorization delivered a 7.15X performance increase out of an 8X maximum.

**3.5 Stage 4: Parallelization**

The number of CPU cores in a microprocessor package has increased over the past 10 years. So far our vectorized Black-Scholes Formula makes full utilization of one processor core. To make use of additional cores, we need to introduce multithreading to our code. We want to make all the cores work concurrently with each completing the same amount of work as the core our vectorized Black-Scholes Formula is running on.

Like vectorization, multithreading also presents various threading choices for the programmer. Each choice varies in terms of explicit programmer control, composability and code maintainability.

**Figure 7: Multithreading choices**

Our Black-Scholes Formula has two loops. The inner loop is vectorized already. The easiest way to achieve multithreading is to have each thread run the entire outer loop on different data. OpenMP* is a good choice for this exercise.

float RLOG2E =-RISKFREE*M_LOG2E; kmp_set_defaults("KMP_AFFINITY=scatter,granularity=thread"); #pragma omp parallel for(int i = 0; i < NUM_ITERATIONS; i++) #pragma omp for #pragma simd #pragma vector aligned for(int opt = 0; opt < OPT_N; opt++) { float CallVal =0.0f, PutVal = 0.0f; float T = OptionYears[opt]; float X = OptionStrike[opt]; float S = StockPrice[opt]; float sqrtT = sqrtf(T); float d1 = (logf(S / X) + (RISKFREE + 0.5f * VOLATILITY * VOLATILITY) * T) / (VOLATILITY * sqrtT); float d2 = d1 - VOLATILITY * sqrtT; float CNDD1 = HALF + HALF*erff(INV_SQRT2*d1); float CNDD2 = HALF + HALF*erff(INV_SQRT2*d2); float expRT = exp2f(RLOG2E * T); CallVal += S * CNDD1 - X * expRT * CNDD2; PutVal += CallVal + expRT - S; CallResult[opt] = CallVal ; PutResult[opt] = PutVal ; }

Parallel version runs at 4.07992 Billion option per second:

Completed pricing 1152 million options in 0.338830 seconds:

**Figure 8: Threaded code**

In this example, `#pragma omp parallel`

creates or forks the threads. Each thread runs its own outer loop on its own subset of data. #pragma omp for binds the threads to the inner loop.

Running this program gives 14.21X performance improvement over the vectorized scalar implementation. Given the fact that there are 16 cores on the system and OpenMP thread creation can create some overhead, the performance improvement is impressive.

Notice that this vectorized, parallel implementation of the Black-Scholes Formula runs on systems with any number of threads. Each thread simply runs NUM_ITERATION/NumThreads of iteration. This per-thread iteration may not be a multiplier’s width. So the compiler will generate two versions of the loop body, one for the vectorized loop and one for data less than SIMD length. As a result, the object code size for the vectorized parallel program could be bigger than the vectorized serial code.

## 4. Conclusions

In this paper, we provide a brief introduction to the latest Intel ISA extension and its application to the numerically intensive financial application and scientific computing world. We use the Black-Scholes formula as a showcase, and demonstrate a stepwise application optimization framework. Following this optimization framework, the Black-Scholes calculation achieved a mind-boggling 637 times performance improvement using the combination of Intel Parallel Composer XE 2013, Intel Advanced Vector Extensions and OpenMP. At the center of this optimization effort is a stepwise optimization framework which proved to be effective not only in financial numerical applications, but also to general scientific computations as well. The next version of the Intel Parallel Composer XE 2013 should make it even easier for scientific programmers to approach performance optimization as a structured activity and quickly understand the limitations of their programs and achieve as much parallelism as possible.

## Additional Resources

- Intel® Composer XE 2011 for Linux* including Intel MIC Architecture
- Intel® C++ Compiler XE 12.0 User and Reference Guides

## About the Author

Shuo Li works in the Software & Service Group at Intel Corporation. He has 24 years of experience in software development. His main interest is parallel programming, computational finance and application performance optimization. In his recent role as a software performance engineer covering the financial service industry, Shuo works closely with software developers and modelers to help them achieve the best possible performance on Intel platforms. Shuo holds a Master's degree in Computer Science from University of Oregon and an MBA degree from Duke University.

**Notices**

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

Intel, Cilk, Pentium, Xeon, and the Intel logo are trademarks of Intel Corporation in the US and/or other countries.

Copyright © 2012 Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.