## 1. Introduction

At the International Supercomputing Conference (ISC'12) in Hamburg, Germany, Intel announced that Intel® Xeon Phi™ is the new brand name for all future Intel® Many Integrated Core Architecture (Intel® MIC Architecture) based products. Targeted at supercomputers, Intel® Xeon Phi™ Coprocessor will have up to 61 cores and deliver power consumption breakthroughs while scaling performance when conducting complex scientific calculations.

This paper provides a brief introduction to the Intel® Xeon Phi™ coprocessor and gives an overview of the processor microarchitecture in the context of accelerating one of the most popular models in the financial services industry: Black-Scholes valuation.

**1.1 Intel® Xeon Phi™ Coprocessor extends the execution resources of Intel® Xeon® Processor Family**

Intel® Many Integrated Core Architecture (Intel® MIC Architecture) provides an ideal execution vehicle for quantitative financial applications built on the Black-Scholes formula. Intel MIC Architecture is built upon the same IA-32/Intel-64 Instruction Set Architecture (ISA) and maintains the same programming model as Intel® Xeon® processors. It further extends the parallel execution infrastructure and allows highly parallel applications to reach a level of performance far exceeding anything offered by general-purpose graphics processing unit (GPGPU) devices, with little or no modification of source code. End users benefit from faster software development and deployment as well as faster software performance.

**1.2 Processor Architecture**

At the microprocessor architecture level, the coprocessor is an SMP processor comprising up to 61 cores organized around two unidirectional rings. There are 8 on-die memory controllers supporting 16 GDDR5 channels, which are expected to deliver up to 5.5 GT/sec. At the core level, each coprocessor core is a fully functional, in-order core in its own right, capable of running Intel® architecture instructions independently of the other cores. Each core also has hardware multithreading support with four hardware contexts, or threads. In any given clock cycle, each core can issue up to two instructions from any single context.

**1.3 Core and Vector Processor Unit**

The coprocessor core implements all sixteen general-purpose 64-bit registers as well as most of the new instructions associated with the 64-bit extensions. However, the only vector instruction set supported is the Intel® Initial Many Core Instructions (Intel® IMCI). There is no support for Intel® MMX™, Intel® Streaming SIMD Extensions (Intel® SSE) or Intel® Advanced Vector Extensions (Intel® AVX) in the cores, although the math coprocessor (x87) remains integrated and fully functional.

The Vector Processing Unit (VPU) is all new and designed from scratch. It is a 512-bit SIMD engine capable of executing 16-wide single precision or 8-wide double precision floating point SIMD operations, with full support for all four rounding modes specified by IEEE* 754R. Inside the VPU, a new extended math unit provides fast implementations of the single precision transcendental functions reciprocal, reciprocal square root, base 2 logarithm, and base 2 exponential, using lookup tables. These hardware-implemented functions achieve a high throughput of one or two cycles. Other transcendental functions can be derived from these elementary functions.

**1.4 Cache Architecture**

The Level One (L1) cache of the Intel® Xeon Phi™ coprocessor was designed to accommodate the higher working set requirements of four hardware contexts per core. It has a 32 KB L1 instruction cache and a 32 KB L1 data cache. Associativity was increased to 8-way, with a 64-byte cache line. Bank width is 8 bytes. Data return can now be out of order. The access time is 3 cycles.

The completely redesigned 512 KB unified Level Two (L2) cache comprises 64 bytes per way with 8-way associativity, 1024 sets, 2 banks, and 32 GB (35 bits) of cacheable address range. The expected idle access time is approximately 80 cycles.
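As a quick consistency check on these figures, here is the plain arithmetic behind them, assuming the usual relation capacity = line size × ways × sets. The L1 set count is derived here and is not stated in the text.

$$
\begin{aligned}
\text{L2 capacity} &= 64\ \text{B} \times 8\ \text{ways} \times 1024\ \text{sets} = 524{,}288\ \text{B} = 512\ \text{KB} \\
\text{L1 sets} &= \frac{32 \times 1024\ \text{B}}{64\ \text{B} \times 8\ \text{ways}} = 64 \\
2^{35}\ \text{B} &= 32\ \text{GB}
\end{aligned}
$$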

The L2 cache has a streaming hardware prefetcher that can selectively prefetch code, read, and RFO (Read-For-Ownership) cache lines into the L2 cache. It supports 16 streams, each of which can bring in up to a 4 KB page of data. Once a stream direction is detected, the prefetcher can issue up to 4 prefetch requests. The L2 cache in the Intel® Xeon Phi™ coprocessor now has ECC support. The replacement algorithm for both the L1 and L2 caches is based on a pseudo-LRU implementation.

Developers benefit because ISA extensions bring higher application performance, lower power consumption, and software backward compatibility. For example, the floating-point (FP) arithmetic functions collectively known as 8087 were defined and added as an extension to 8086 microprocessors. The same application software that used to call floating point library routines can execute 8087 instructions instead. While earlier extensions such as x87 focused on adding instructions that operate on single data elements, more recent extensions add instructions that operate on multiple data elements, known as Single Instruction, Multiple Data (SIMD) instructions.

Beginning with the second generation Intel® Pentium® processor family, Intel® Pentium® processor family with Intel® MMX™ technology, six extensions have been introduced into the IA-32 and Intel-64 architectures to perform SIMD operations. These extensions include the MMX technology and Intel Streaming SIMD extension, which includes Intel SSE, Intel SSE2, Intel SSE3, Supplemental Intel SSE3, and Intel SSE4. Each of these extensions provides a group of instructions that perform SIMD operations on packed integer and/or packed floating-point data elements.

## 2. Implementation of Black-Scholes Formula on Intel® Xeon Phi™ Coprocessor

**2.1 Black-Scholes and Black-Scholes Formula**

In the financial world, a derivative is a financial instrument whose value depends on the value of other, more basic, underlying variables. Very often the variables underlying derivatives are the prices of traded assets. A stock option, for example, is a derivative whose value depends on the price of a stock. Not all variables that derivatives depend on are traded assets. Such a variable can be the snowfall at a certain resort, the average temperature over a specific time interval, and so on.

An option is a derivative that specifies a contract between two parties for a future transaction, known as an exercise, on an asset at a reference price. The buyer of the option gains the right, but not the obligation, to engage in that transaction, while the seller incurs the corresponding obligation to fulfill the transaction. There are two types of option. A call option gives the holder the right to buy the underlying asset by a certain date for a certain price. A put option gives the holder the right to sell the underlying asset by a certain date for a certain price. The price in the contract is known as the strike price. The date in the contract is known as the expiration date. European options can be exercised only on the expiration date. American options can be exercised at any time up to the expiration date.

The Black-Scholes model is one of the most important concepts in modern quantitative finance theory. It was developed in 1973 by Fischer Black, Robert Merton, and Myron Scholes, is still widely used today, and is regarded as one of the best ways of determining fair prices of financial derivatives.

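Robert Merton was the first to publish a closed-form solution to the Black-Scholes equation. For European call options *c* and European put options *p*, the solution is known as the Black-Scholes-Merton formula:

$$
c = S\,\mathrm{cnd}(d_1) - K e^{-rT}\,\mathrm{cnd}(d_2), \qquad p = K e^{-rT}\,\mathrm{cnd}(-d_2) - S\,\mathrm{cnd}(-d_1)
$$

where

$$
d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}
$$

Here S is the current price of the underlying asset, K is the strike price, T is the time to expiration, r is the risk-free interest rate, and σ is the volatility of the underlying asset.

The function cnd(x) is the cumulative normal distribution function. It gives the probability that a variable with a standard normal distribution, N(0, 1), is less than x. cnd(x) is approximated using a polynomial function.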
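A minimal sketch of the formula in single precision, assuming cnd(x) is approximated with the widely used five-term Abramowitz and Stegun polynomial (formula 26.2.17) and its commonly published coefficients. Whether this is the exact polynomial the text refers to is an assumption, and the names `cnd_poly`, `black_scholes`, `OptionPrices`, and `sigma` are illustrative only.

```cpp
#include <cmath>

// Five-term polynomial approximation of the cumulative normal distribution
// (Abramowitz and Stegun 26.2.17). Assumption: this is the kind of polynomial
// the text has in mind; the coefficients are the commonly published ones.
float cnd_poly(float x)
{
    const float A1 = 0.31938153f;
    const float A2 = -0.356563782f;
    const float A3 = 1.781477937f;
    const float A4 = -1.821255978f;
    const float A5 = 1.330274429f;
    const float RSQRT2PI = 0.39894228040143267794f; // 1 / sqrt(2 * pi)

    const float k = 1.0f / (1.0f + 0.2316419f * std::fabs(x));
    // Horner evaluation of A1*k + A2*k^2 + A3*k^3 + A4*k^4 + A5*k^5
    const float poly = k * (A1 + k * (A2 + k * (A3 + k * (A4 + k * A5))));
    const float tail = RSQRT2PI * std::exp(-0.5f * x * x) * poly;
    // The polynomial is defined for x >= 0; negative x uses symmetry.
    return (x >= 0.0f) ? 1.0f - tail : tail;
}

struct OptionPrices {
    float call;
    float put;
};

// European call and put prices, written exactly as the symmetric formula.
// S: asset price, K: strike, T: years to expiration, r: risk-free rate,
// sigma: volatility (all hypothetical parameter names).
OptionPrices black_scholes(float S, float K, float T, float r, float sigma)
{
    const float sqrtT = std::sqrt(T);
    const float d1 = (std::log(S / K) + (r + 0.5f * sigma * sigma) * T) / (sigma * sqrtT);
    const float d2 = d1 - sigma * sqrtT;
    const float discountedK = K * std::exp(-r * T);

    OptionPrices out;
    out.call = S * cnd_poly(d1) - discountedK * cnd_poly(d2);
    out.put = discountedK * cnd_poly(-d2) - S * cnd_poly(-d1);
    return out;
}
```

For S = K = 100, T = 1, r = 0.05, and sigma = 0.2, this sketch gives a call price of about 10.45 and a put price of about 5.57.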

**2.2 Implementation of Black-Scholes Formula**

The Black-Scholes formula is used in almost every aspect of quantitative finance, and Black-Scholes calculations have found their way into virtually every quantitative finance library used by traders and quantitative analysts alike. As a result, the Black-Scholes calculation has become a hallmark workload for any computer architecture serving the global financial services industry. In this paper, we look at how Black-Scholes calculations perform on the Intel® Xeon Phi™ coprocessor.

Let’s look at a hypothetical situation in which a firm has to price European options for millions of financial instruments. For each instrument, it has the current price, the strike price, and the option expiration time. For each set of these data, it makes several thousand Black-Scholes calculations, much like the way options for neighboring stock prices, strike prices, and different expiration times are calculated.

[me@host BlackScholes]$ g++ -o BlackScholesStep0 -O2 BlackScholesStep0.cpp

[me@host BlackScholes]$ ./BlackScholesStep0

Black-Scholes valuation priced 1024 million options in 263.918349 seconds.

Program performs at the rate of 0.00466 Billion options per second.

## 3. Optimization of Black-Scholes Implementation

A straightforward implementation of the Black-Scholes formula does not guarantee high performance. Compiled with GNU* Compiler Collection version 4.4.6 and the -O2 optimization switch, and even running on a system with a 2.6 GHz Intel® Xeon® processor E5-2670, it took 234.88 seconds to price 1.152 Million Options.^{2} The throughput rate is merely 5.89 million options per second.

In this section, we are going to improve the performance of our Black-Scholes implementation to achieve the lowest possible elapsed time and highest possible throughput.

**3.1 Stepwise Optimization Framework**

In this section, we highlight an optimization framework that takes a systematic approach to application performance improvement. The framework takes an application through five optimization stages. Each stage attempts to improve application performance along one orthogonal direction by applying a single technique. Following this methodology, an application can achieve the highest performance possible on Intel® Xeon Phi™ coprocessor family products.

**3.2 Stage 1: Leverage Optimized Tools and Library**

Why reinvent the wheel? If the problem has already been solved by someone else, the best strategy is to leverage the existing work and spend your effort on the problems that have yet to be solved.

For the Black-Scholes implementation, we should ask whether a better compiler and C/C++ runtime library are available. This question leads us to Intel® Composer XE 2013. While GCC* targets general-purpose application development, Intel Composer XE 2013 targets high performance application development. If your target execution environment is based on a recently released Intel microprocessor and you allow the compiler to perform mathematical transformations based on the laws of association, distribution, and so on, you definitely should use an Intel® compiler. Replacing g++ with icpc, the g++ equivalent from Intel® Parallel Studio, and using the same compilation switch, the program now runs in 46 seconds, a quick and easy 4X improvement without any code changes.

[sli@cthor-knc2 bs_distributed]$ icpc -o BlackScholes -O2 -inline-level=0 BlackScholesStep0.cpp

[sli@cthor-knc2 bs_distributed]$ ./BlackScholes

Black-Scholes valuation priced 1024 million options in 63.698990 seconds.

Program performs at the rate of 0.01929 Billion options per second.

Intel® Composer XE 2013 also includes Intel’s enhancements to the C/C++ runtime library. The error function erf(), a popular transcendental function, is part of libm, the runtime library of Intel Composer, but it is absent from GCC’s library. Since we have decided to use Intel Composer XE 2013, we can exploit the inherent connection between the error function that Intel’s libm provides and the cumulative normal distribution function in the Black-Scholes formula.

const float HALF = 0.5;

float CND(float d)

{

return HALF + HALF*erf(M_SQRT1_2*d);

}

We can achieve another 17% improvement, when we take advantage of the inherent connection between these two functions.
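The connection between the two functions can be made explicit with a short derivation from their standard definitions, substituting $t = \sqrt{2}\,u$ in the integral:

$$
\mathrm{cnd}(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt
= \frac{1}{2} + \frac{1}{\sqrt{\pi}} \int_{0}^{x/\sqrt{2}} e^{-u^2}\,du
= \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)
$$

This is the expression the CND function above evaluates, with M_SQRT1_2 standing for $1/\sqrt{2}$.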

[sli@cthor-knc2 bs_distributed]$ icpc -o BlackScholes -O2 -inline-level=0 BlackScholesStep1.cpp

[sli@cthor-knc2 bs_distributed]$ ./BlackScholes

Black-Scholes valuation priced 1024 million options in 54.376429 seconds.

Program performs at the rate of 0.02260 Billion options per second.^{4}

In summary, leverage existing high-performance solutions as much as possible so that you can focus on the problems in your own application. If we do it right, we can achieve 5.75X for the Black-Scholes formula.^{5}

**3.3 Stage 2: Scalar and Serial Optimization**

Now that you have exhausted the optimized solutions available to you and your application still falls short of its performance requirements, you have to get hold of the application source code and start the optimization process. Before you plunge into active parallel programming, you need to make sure your application delivers the right result before you vectorize and parallelize it. Equally important, you need to make sure it performs the minimum number of operations to get that result. Normally you look at data and algorithm related issues such as:

- Choose the right floating point precision
- Choose the right approximation method accuracy: polynomial vs. rational
- Avoid jump algorithm
- Reduce the loop operation strength by using iteration calculation
- Avoid or minimize conditional branches in your algorithm
- Avoid repetitive calculations, use the previous result.

You also have to deal with language related performance issues. Since we have chosen C/C++, here is a list of C/C++ related issues:

- Use explicit typing for all constants to avoid auto-promotion
- Choose the right types of C runtime functions: exp() vs. expf(); abs() vs. fabs()
- Explicitly tell the compiler about pointer aliases
- Explicitly inline function calls to avoid overhead

Like hundreds of routines in Numerical Recipes in C, the Black-Scholes formula was written in float. The dynamic range of 32-bit floating point is good enough for most quantitative finance applications. Input data, output data, and arithmetic are all in 32 bits. Any accidental promotion from 32 to 64 bits results in a performance loss with zero accuracy gain. All constants and library function calls should be explicitly typed to float.
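The promotion rules can be checked at compile time. The following sketch uses `static_assert` to show which expressions stay in float and which silently become double. The helper `discount_factor` is a hypothetical name used only for illustration.

```cpp
#include <cmath>
#include <type_traits>

// A double literal silently promotes a float expression to double.
static_assert(std::is_same<decltype(0.5 * 1.0f), double>::value,
              "0.5 is a double literal, so the product is double");

// The f suffix keeps the arithmetic in single precision.
static_assert(std::is_same<decltype(0.5f * 1.0f), float>::value,
              "0.5f keeps the product in float");

// The std:: overloads select the float version for float arguments (the
// counterpart of C's expf) and the double version for double arguments
// (the counterpart of C's exp).
static_assert(std::is_same<decltype(std::exp(1.0f)), float>::value,
              "float argument selects the float overload");
static_assert(std::is_same<decltype(std::exp(1.0)), double>::value,
              "double argument selects the double overload");

// Hypothetical helper: every constant and call stays in 32-bit float.
inline float discount_factor(float r, float T)
{
    return std::exp(-1.0f * r * T);
}
```

In plain C, where there is no overloading, the same effect requires calling expf(), logf(), sqrtf(), and fabsf() by name.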

Mathematicians like perfect symmetry, and sometimes their formulae reflect this preference. However, this tendency can show up as a performance penalty when the formulae are implemented verbatim. The call and put option calculations are two such cases.

$$
c = S\,\mathrm{cnd}(d_1) - K e^{-rT}\,\mathrm{cnd}(d_2)
$$

$$
p = K e^{-rT}\,\mathrm{cnd}(-d_2) - S\,\mathrm{cnd}(-d_1)
$$

Merton’s formula is very symmetric. However, once you have spent the cycles on the call option, you don’t have to spend the same number of cycles on the put option, even though Merton’s formula might suggest so. The reason is that call and put options satisfy *put-call* parity:

$$
p - c = K e^{-rT} - S
$$

*Put-call* parity can also be derived mathematically from the fact that

$$
\mathrm{cnd}(-d_2) = 1 - \mathrm{cnd}(d_2)
$$

and likewise for $d_1$. In summary, once you have obtained the call option, the put option is just two additions away.

Here is the modified code, followed by the command lines.
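A minimal sketch of what such a modification could look like, assuming the erf-based CND from Stage 1 and explicit float typing throughout. It is not the original BlackScholesStep2.cpp listing, and the names `cnd_erf` and `price_with_parity` are hypothetical.

```cpp
#include <cmath>

// CND through the error function, with every constant typed as float.
inline float cnd_erf(float d)
{
    const float HALF = 0.5f;
    const float INV_SQRT2 = 0.70710678118654752440f; // 1 / sqrt(2)
    return HALF + HALF * std::erf(INV_SQRT2 * d);
}

// Prices one European call with the full formula, then derives the put
// from put-call parity instead of evaluating cnd() two more times.
// S, K, T, r and sigma are hypothetical parameter names.
inline void price_with_parity(float S, float K, float T, float r, float sigma,
                              float &call, float &put)
{
    const float sqrtT = std::sqrt(T);
    const float d1 = (std::log(S / K) + (r + 0.5f * sigma * sigma) * T) / (sigma * sqrtT);
    const float d2 = d1 - sigma * sqrtT;
    const float discountedK = K * std::exp(-r * T);

    call = S * cnd_erf(d1) - discountedK * cnd_erf(d2);
    put = call + discountedK - S; // p = c + K e^{-rT} - S: two additions
}
```

The put price costs one addition and one subtraction, since the discounted strike is already needed for the call.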

[sli@cthor-knc2 bs_distributed]$ icpc -o BlackScholes -O2 -inline-level=0 BlackScholesStep2.cpp

[sli@cthor-knc2 bs_distributed]$ ./BlackScholes

Black-Scholes valuation priced 1024 million options in 45.104644 seconds.

Program performs at the rate of 0.02724 Billion options per second.^{6}

Putting it all together, compiling with -O2 and without introducing either vectorization or parallelization, we achieved an additional 20% of performance. That is 5.85X the performance of GCC on the same hardware.^{7}

**3.4 Stage 3: Vectorization**

Optimized scalar code paves a solid foundation for vectorization. In this section, we introduce vectorization to the Black-Scholes source code. Vectorization may mean different things to different people, but in this context it means taking advantage of SIMD registers and SIMD instructions at the processor level. There are many ways to introduce vectorization to your program, ranging from processor intrinsic functions to Intel® Cilk™ Plus Array Notation. These compiler-based vectorization techniques vary in terms of the amount of control the programmer has over the generated code, the expressiveness of the syntax, and the amount of change required to the serial program.

Before we can expect the compiler to vectorize the serial code and generate SIMD instructions, the programmer has to ensure proper memory alignment. Misaligned memory access can, in serious cases, generate processor faults and, in benign cases, cause cache line splits and/or redundant object code, all of which impact performance. One way to ensure memory alignment is to always request and work with strictly aligned memory. Using Intel Composer XE 2013, the user can request aligned, statically allocated memory by prefixing the memory definition with `__attribute__((aligned(32)))`.

A 32-byte boundary is the minimum alignment requirement for memory designated for YMMx registers. You can also use `_mm_malloc` and `_mm_free` to request and release dynamically allocated memory.

CallResult = (float *)_mm_malloc(mem_size, 64);

PutResult = (float *)_mm_malloc(mem_size, 64);

StockPrice = (float *)_mm_malloc(mem_size, 64);

OptionStrike = (float *)_mm_malloc(mem_size, 64);

OptionYears = (float *)_mm_malloc(mem_size, 64);

...

_mm_free(CallResult);

_mm_free(PutResult);

_mm_free(StockPrice);

_mm_free(OptionStrike);

_mm_free(OptionYears);

After taking care of memory alignment, we are ready to choose a vectorization method for our Black-Scholes implementation. To minimize the amount of work and retain portability, we take the user-initiated, semiautomatic vectorization approach to SIMD parallelism. The user informs the compiler that a loop needs vectorization by using `#pragma simd`. This behavior is different from the previous model, in which the user simply provides a suggestion with `#pragma ivdep` and the compiler goes through its cost model to determine whether the vectorized code can execute faster than the serial code; vectorized code is generated only when the compiler thinks it can outperform the serial version. With `#pragma simd`, it is the programmer’s responsibility to ensure that the vectorization overhead does not exceed any speedup gain.
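A minimal sketch of a user-directed SIMD loop. It uses a simplified kernel (discounting strikes) rather than the full pricing loop, and it uses `#pragma omp simd`, the OpenMP 4.0 spelling of the same idea, as a stand-in for the Intel-specific `#pragma simd`. That substitution and the function name `discount_strikes` are assumptions made for portability of the example.

```cpp
#include <cmath>

// Computes out[i] = strike[i] * exp(-r * years[i]) for n options.
// The pragma asserts that the iterations are independent, so the compiler
// may vectorize without consulting its own cost model. A compiler without
// OpenMP support ignores the pragma and runs the loop as scalar code.
void discount_strikes(const float *strike, const float *years, float *out,
                      int n, float r)
{
#pragma omp simd
    for (int i = 0; i < n; ++i)
        out[i] = strike[i] * std::exp(-r * years[i]);
}
```

The arrays passed in would normally come from the aligned allocations shown earlier, so that each vector load and store starts on an aligned boundary.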

[sli@cthor-knc2 bs_distributed]$ icpc -o BlackScholes -O2 -xAVX -fimf-precision=low -fimf-domain-exclusion=15 -vec-report1 BlackScholesStep3.cpp

BlackScholesStep3.cpp(136): (col. 2) remark: SIMD LOOP WAS VECTORIZED.

BlackScholesStep3.cpp(88): (col. 2) remark: SIMD LOOP WAS VECTORIZED.

[sli@cthor-knc2 bs_distributed]$ ./BlackScholes

Black-Scholes valuation priced 1024 million options in 6.395131 seconds.

Program performs at the rate of 0.19215 Billion options per second.

Vectorized code takes advantage of the SIMD instructions in modern microprocessors and delivers application performance without higher frequency or higher core count. In our case, vectorization delivered a 7.05X performance increase out of a maximum of 8X.^{9}
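Both numbers can be reproduced from figures already given: the speedup from the two elapsed times reported by the Stage 2 and Stage 3 runs, and the ceiling from the register width, assuming the 256-bit registers selected by `-xAVX` and 32-bit floats.

$$
\frac{45.104644\ \text{s}}{6.395131\ \text{s}} \approx 7.05, \qquad \frac{256\ \text{bits}}{32\ \text{bits per float}} = 8\ \text{lanes}
$$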

**3.5 Stage 4: Parallelization**

The number of CPU cores in a microprocessor package has kept increasing over the past 10 years. So far, our vectorized Black-Scholes implementation makes full use of only one processor core. To make use of the additional cores, we need to introduce multithreading to our code. We want all the cores to work concurrently, each of them completing the same amount of work as the core our vectorized Black-Scholes implementation runs on.

Like vectorization, multithreading presents the programmer with various choices. The choices vary in terms of explicit programmer control, composability, and code maintainability.

Our Black-Scholes implementation has two loops. The inner loop is already vectorized. The easiest way to achieve multithreading is to have each thread run the entire outer loop on different data. OpenMP, for all its problems, appears to be the obvious choice.

`#pragma omp parallel` creates, or forks, the threads. Each thread runs its own outer loop on its own subset of data. `#pragma omp for` binds the threads to the inner loop.
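The two directives can be sketched as follows. This is an illustration of the loop structure described above, not the original BlackScholesStep4.cpp. The names `call_price` and `price_all`, and the fixed `r` and `sigma` parameters, are hypothetical.

```cpp
#include <cmath>

// Hypothetical scalar kernel: European call price with the erf-based CND.
inline float call_price(float S, float K, float T, float r, float sigma)
{
    const float INV_SQRT2 = 0.70710678118654752440f; // 1 / sqrt(2)
    const float sqrtT = std::sqrt(T);
    const float d1 = (std::log(S / K) + (r + 0.5f * sigma * sigma) * T) / (sigma * sqrtT);
    const float d2 = d1 - sigma * sqrtT;
    const float cnd1 = 0.5f + 0.5f * std::erf(INV_SQRT2 * d1);
    const float cnd2 = 0.5f + 0.5f * std::erf(INV_SQRT2 * d2);
    return S * cnd1 - K * std::exp(-r * T) * cnd2;
}

void price_all(const float *S, const float *K, const float *T, float *call,
               int n, int repetitions, float r, float sigma)
{
    // Fork the team of threads; every thread executes the outer loop.
#pragma omp parallel
    for (int rep = 0; rep < repetitions; ++rep) {
        // Divide the inner loop's iterations among the threads of the team,
        // so each thread prices its own slice of the arrays.
#pragma omp for
        for (int i = 0; i < n; ++i)
            call[i] = call_price(S[i], K[i], T[i], r, sigma);
    }
}
```

Built without an OpenMP switch, the pragmas are ignored and the same code runs serially, which makes it easy to compare results between the serial and threaded builds.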

[sli@cthor-knc2 bs_distributed]$ icpc -o BlackScholes -O2 -openmp -xAVX -fimf-precision=low -fimf-domain-exclusion=15 -vec-report1 BlackScholesStep4.cpp

BlackScholesStep4.cpp(118): (col. 2) remark: SIMD LOOP WAS VECTORIZED.

BlackScholesStep4.cpp(118): (col. 2) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

BlackScholesStep4.cpp(70): (col. 2) remark: SIMD LOOP WAS VECTORIZED.

BlackScholesStep4.cpp(67): (col. 2) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

[sli@cthor-knc2 bs_distributed]$ ./BlackScholes

Black-Scholes valuation priced 1024 million options in 0.333349 seconds.

Program performs at the rate of 3.68623 Billion options per second.

Running this program gives a 19.18X performance improvement over the vectorized serial implementation.^{11} Given that there are 16 cores on the system and that OpenMP thread creation adds some overhead, this performance improvement is impressive.

Notice that this vectorized, parallel implementation of the Black-Scholes formula runs on systems with any number of threads. Each thread simply runs REPETITION iterations over an input array of size DATASIZE/NumOfThreads. The per-thread iteration count may not be a multiple of the SIMD width, so the compiler generates two versions of the loop body: one vectorized loop, and one for the remaining data shorter than the SIMD length. As a result, the object code for the vectorized parallel program can be bigger than that of the vectorized serial code.

**3.6 Stage 5: Scale from Multicore to Many Core**

The vectorized parallel code we obtained at the end of stage 4 can be compiled by Intel® Composer XE 2013 for the Intel® Xeon Phi™ coprocessor as is, with no further source code modification. On the compiler invocation line, you just have to add `-mmic`; everything else stays the same. Once you transfer the executable file to the coprocessor, it runs right away.

Now that you have been assured Black-Scholes valuation has been ported to the coprocessor, you can concentrate 100% of your effort on optimization.

One area to look at for additional performance is the extended math unit, or EMU. The coprocessor’s EMU is designed from scratch to support fast single precision transcendental functions: reciprocal, reciprocal square root, exponential, and logarithm, at least 3 of which our version of Black-Scholes valuation can potentially benefit from. Using EMU-implemented transcendental functions results in a 50-80% performance gain.^{12} However, the coprocessor’s EMU supports the base 2 versions of the exponential and logarithmic functions, while Black-Scholes valuation requires the natural base versions. This means that neither you nor the compiler can use the EMU transcendental functions directly.

To resolve this problem, we have to make changes to the source code. Our strategy is to either adjust the parameter before calling the base 2 version or adjust the result after calling the base 2 version, so that the final result is the same.

Let’s say we want to calculate exp(x), or $e^x$, by calling exp2(y), or $2^y$. The results have to be equal, so $e^x = 2^y$. Taking $\log_2$ on both sides, we get $\log_2 e^x = \log_2 2^y$, which gives us $x \log_2 e = y$, or $y = x \log_2 e$. $\log_2 e$ is a constant defined in <math.h> as M_LOG2E, so exp(x) = exp2(x*M_LOG2E). Similarly, from the change of base rule, we have $\ln(x) = \log_2(x)/\log_2 e$ = log2(x)/M_LOG2E. This means that we can call the base 2 exponential function by adjusting the parameter before the call, and we can call the base 2 logarithmic function instead of the natural base one by adjusting the result after the call.
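The two identities can be written down directly. This is a minimal sketch in single precision. The wrapper names `exp_via_exp2` and `log_via_log2` are hypothetical, and the constant is spelled out as a float rather than taken from M_LOG2E so that no double-precision value enters the expression.

```cpp
#include <cmath>

// log2(e), the value <math.h> exposes as M_LOG2E, typed as float.
const float LOG2E_F = 1.44269504088896340736f;

// exp(x) = exp2(x * log2(e)): the parameter is adjusted before the call.
inline float exp_via_exp2(float x)
{
    return std::exp2(x * LOG2E_F);
}

// ln(x) = log2(x) / log2(e): the result is adjusted after the call.
inline float log_via_log2(float x)
{
    return std::log2(x) / LOG2E_F;
}
```

Each wrapper costs one extra multiply or divide on top of the base 2 call, which is the overhead the following paragraphs weigh against the faster hardware function.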

Now that there is no mathematical obstacle to using exp2(x) and log2(x) for any other base, does it make sense to always use them when exp(x) and log(x) are what we want? The answer is no. The speed difference between exp(x) and exp2(x) is not big enough to pay for another full multiply operation. In a way, we can consider exp(x) an optimized form of a full exp2(x) plus a multiply.

In cases where the parameter of exp(x) is already multiplied by a constant, we can adjust that constant so that no additional multiply is needed when exp2(x) is called. The same rule applies to log(x): replacement makes sense when the adjustment of the function result comes for free. This optimization makes the most sense when the exp(x) calls happen inside a loop and the adjustment to the constant can happen before entering the loop.

Generally speaking, when you are trying to calculate a base a exponential $a^x$, always convert it to its base 2 equivalent, not its natural base equivalent. Remember the mathematical equivalence $a^x = 2^{x \log_2 a}$. Likewise, $\log_a(x) = \log_2(x)/\log_2 a = \log_2(x) \cdot \log_a 2$.
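A minimal sketch of both rules, assuming a discount factor exp(-r*T) evaluated for many expiration times with one fixed rate. The function names `discount_factors` and `pow_via_exp2` are hypothetical, and this is not the original BlackScholesStep5.cpp.

```cpp
#include <cmath>

// log2(e), the value <math.h> exposes as M_LOG2E, typed as float.
const float LOG2E_F = 1.44269504088896340736f;

// Computes out[i] = exp(-r * T[i]). The base conversion is folded into the
// loop-invariant constant, so the loop body keeps a single multiply and a
// single exp2 call instead of gaining an extra multiply.
void discount_factors(const float *T, float *out, int n, float r)
{
    const float neg_r_log2e = -r * LOG2E_F; // computed once, before the loop
    for (int i = 0; i < n; ++i)
        out[i] = std::exp2(T[i] * neg_r_log2e);
}

// a^x through base 2: a^x = 2^(x * log2(a)), valid for a > 0.
inline float pow_via_exp2(float a, float x)
{
    return std::exp2(x * std::log2(a));
}
```

When `a` is a constant, `std::log2(a)` can be hoisted in the same way as `neg_r_log2e`, leaving only the multiply and the base 2 exponential per element.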

Here is the modified code for Intel® Xeon Phi™ Coprocessors

**3.6.1 Building the Native Intel® Xeon Phi™ Application**

Besides `-mmic`, we also want to tell the compiler that we are going to use all 4 threads per core on the Intel Xeon Phi coprocessor. We also want to tell the compiler to avoid function call overhead by inlining the transcendental function calls.

icpc -o BlackScholes.mic -O3 -openmp -mmic -fimf-precision=low -fimf-domain-exclusion=15 -mGLOB_default_function_attrs=use_fast_math=on -mCG_lrb_num_threads=4 -vec-report6 -no-prec-div -no-prec-sqrt -mP2OPT_hlo_report -O3 BlackScholesStep5.cpp

**3.6.2 Running the native Black-Scholes valuation program on Intel® Xeon Phi™ Coprocessor**

[root@cthor-knc2-mic0 /tmp]# export LD_LIBRARY_PATH=.

[root@cthor-knc2-mic0 /tmp]# export OMP_NUM_THREADS=240

[root@cthor-knc2-mic0 /tmp]# export KMP_AFFINITY="granularity=fine,proclist=[1-240:1],explicit"

[root@cthor-knc2-mic0 /tmp]# ./BlackScholes.mic

Black-Scholes valuation priced 245760 million options in 20.014538 seconds.

Program performs at the rate of 12.27907 Billion options per second.

## 4. Conclusions

In this paper, we provided a brief introduction to the Intel® Xeon Phi™ coprocessor and its instruction set architecture from the perspective of the scientific programmer. We also looked at one implementation of one of the most popular quantitative finance applications, Black-Scholes valuation. We used this implementation of the Black-Scholes formula as a showcase and demonstrated a stepwise application optimization framework. Following this framework, the Black-Scholes calculation achieved a 791X performance improvement using Intel® Composer XE 2013. With the same program and the same tools, the Intel® Xeon Phi™ coprocessor delivers another 3.33X in performance on top of the fully parallelized Intel® Xeon® program. The following is the speedup from each tool compared to the original baseline performance from GCC.

At the center of this optimization effort is a stepwise optimization framework that proved effective not only for financial numerical applications but also for general scientific computation. The soon-to-be-released Intel® Composer XE 2013 will make it even easier for scientific programmers to approach performance optimization as a structured activity, to quickly understand the limitations of a program, and to achieve every kind of parallelism available.

## Additional Resources

- Intel® Composer XE 2013 for Linux* including Intel® MIC Architecture
- Intel® C++ Compiler XE 12.0 User and Reference Guides

## About the Author

Shuo Li works at the Software & Service Group, Intel Corporation. He has 24 years of experience in software development. His main interests are parallel programming, computational finance, and application performance optimization. In his recent role as a software performance engineer covering the financial services industry, Shuo works closely with software developers and modelers and helps them achieve the best possible performance on Intel platforms. Shuo holds a Master's degree in Computer Science from the University of Oregon and an MBA degree from Duke University.

## Notices

"Intel, Xeon, Cilk, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries."

*Other names and brands may be claimed as the property of others.

## Performance Notice

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

^{1, 2} Configurations: Dual socket server system with two 2.6 GHz Intel® Xeon™ processor E5-2670 32GB, 8 x 4GB DDR3-1600MHz. GCC version 4.4.6.

^{3,4,5,6,7,8,9,10,11} Configurations: Dual socket server system with two 2.6 GHz Intel® Xeon™ processor E5-2670 32GB, 8 x 4GB DDR3-1600MHz. Intel Composer XE 2011 SP2.

^{12,13} Configurations: Dual socket server system with two 2.6 GHz Intel® Xeon™ processor E5-2670 32GB, 8 x 4GB DDR3-1600MHz. with pre-release version of Intel® Xeon Phi™ Coprocessor: 61 Cores at 1091MHz. Intel Composer XE 2013 Release.

Sample source code found in this document is released under the Intel Sample Source Code License Agreement

## Intel Sample Source Code License Agreement

The example code which links to this license agreement (hereafter “Example Code”) is subject to all of the following terms and conditions:

1. Intel Corporation ("Intel") grants to you a non-exclusive, non-assignable copyright license to make only the minimum number of copies of the Example Code reasonably necessary for your internal use, and to modify the Example Code that are provided in source code (human readable) form.

2. You may not reverse-assemble, reverse-compile, or otherwise reverse-engineer any software provided solely in binary form.

3. You may not distribute to any third party any portion of the Example Code in any form.

4. Title to the Example Code and all copies thereof remain with Intel or its suppliers. The Example Code are copyrighted and are protected by United States copyright laws and international treaty provisions. You will not remove any copyright notice from the Example Code. You agree to prevent any unauthorized copying of the Example Code. Except as expressly provided herein, Intel does not grant any express or implied right to you under Intel patents, copyrights, trademarks, or trade secret information. Subject to Intel’s ownership of the Example Code, all right, title and interest in and to your modifications shall belong to you.

5. The Example Code are provided "AS IS" without warranty of any kind. INTEL OFFERS NO WARRANTY EITHER EXPRESS OR IMPLIED INCLUDING THOSE OF MERCHANTABILITY, NONINFRINGEMENT OF THIRD- PARTY INTELLECTUAL PROPERTY OR FITNESS FOR A PARTICULAR PURPOSE. NEITHER INTEL NOR ITS SUPPLIERS SHALL BE LIABLE FOR ANY DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, OR OTHER LOSS) ARISING OUT OF THE USE OF OR INABILITY TO USE THE EXAMPLE CODE, EVEN IF INTEL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. BECAUSE SOME JURISDICTIONS PROHIBIT THE EXCLUSION OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES, THE ABOVE LIMITATION MAY NOT APPLY TO YOU.

6. THE EXAMPLE CODE ARE NOT DESIGNED, INTENDED, OR AUTHORIZED FOR USE IN ANY TYPE OF SYSTEM OR APPLICATION IN WHICH THE FAILURE OF THE EXAMPLE CODE COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR (E.G MEDICAL SYSTEMS, LIFE SUSTAINING OR LIFE SAVING SYSTEMS). Should you use the Example Code for any such unintended or unauthorized use, you shall indemnify and hold Intel and its officers, subsidiaries and affiliates harmless against all claims, costs, damages, and expenses, and reasonable attorney fees arising out of, directly or indirectly, any claim of product liability, personal injury or death associated with such unintended or unauthorized use, even if such claim alleges that Intel was negligent regarding the design or manufacture of the part.

7. You agree that any material, information or other communication you transmit or post to an Intel website or provide to Intel under this Agreement will be considered non-confidential and non-proprietary ("Communications"). Intel will have no obligations with respect to the Communications. You agree that Intel and its designees will be free to copy, modify, create derivative works, publicly display, disclose, distribute, license and sublicense through multiple tiers of distribution and licensees, incorporate and otherwise use the Communications and all data, images, sounds, text, and other things embodied therein, including derivative works thereto, for any and all commercial or non-commercial purposes. You are prohibited from posting or transmitting to or from an Intel website or provide to Intel any unlawful, threatening, libelous, defamatory, obscene, pornographic, or other material that would violate any law. If you wish to provide Intel with your confidential information, Intel requires a non-disclosure agreement (“NDA”) to receive such confidential information, so please contact your Intel representative to ensure the proper NDA is in place.

8. The term of this Agreement will commence on the date this Agreement is accepted by you and will continue until terminated. You may terminate this Agreement at any time. Intel may terminate this Agreement at any time with written notice to you should you breach any provision in this Agreement. Upon termination, you will immediately destroy the Example Code along with any copies you have made.

9. U.S. GOVERNMENT RESTRICTED RIGHTS: The Example Code are provided with "RESTRICTED RIGHTS". Use, duplication or disclosure by the Government is subject to restrictions set forth in FAR52.227-14 and DFAR252.227-7013 et seq. or its successor. Use of the Materials by the Government constitutes acknowledgment of Intel's rights in them.

10. APPLICABLE LAWS: Any claim arising under or relating to this Agreement shall be governed by the internal substantive laws of the State of Delaware or federal courts located in Delaware, without regard to principles of conflict of laws. You may not export the Example Code in violation of applicable export laws.