Last Modified On: March 28, 2008 12:17 PM PDT
Financial applications make heavy use of math functions such as exp(), pow(), log(), and sqrt(). Intel provides at least three different ways to optimize the performance of these functions. This tuning note demonstrates the simplest of them, linking against the Intel compiler’s math library, as a way to pick the low-hanging fruit of performance. Intel Math Kernel Library (MKL) and Intel Short Vector Math Library (SVML) are not discussed here; they are covered in other tuning notes.
The Intel compiler’s math library provides optimized math functions whose implementations are tuned to exploit the capabilities of the processor and platform on which the application runs. The library ships as part of the Intel compiler, but it can be used with or without the Intel compiler. Note, however, that the Intel compiler itself generates the best-performing binaries; that process is discussed in a separate tuning note. Simply linking with the Intel compiler’s math library is a technique that can be employed where using the Intel compiler is not feasible for some (usually logistical or non-technical) reason.
The best part of using the Intel compiler’s math library is that no source code change is necessary, because it exposes the same interfaces as the standard math library. Math function calls are diverted to the Intel compiler’s math library when it is linked in as explained here. Your application is still built the same way, with only a few project setting changes. The Intel compiler’s math library is binary compatible with other standard compilers such as the Microsoft Visual Studio compiler and the GNU compiler.
This tuning note demonstrates how to use the Intel compiler’s math library to extract higher application performance with minimal build environment change and no source-code change. The example here details an exp() optimization that can be achieved in a matter of minutes. Similar analysis can then be applied to other math functions. This tuning note covers the Windows environment; another tuning note will cover the Linux environment.
The following simple code snippet is used as an example throughout this tuning note:
for (e = 0; e < elements; e++) {
    d_exp[e] = exp(d_in[e]);
}
where d_in[] and d_exp[] are arrays of doubles. The loop is iterated many times to get meaningful performance numbers. The performance numbers quoted here are clock-ticks per exp() call on the system on which these tests were run: an Intel Core 2 micro-architecture based system running 32-bit Windows Server 2003.
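A timing harness along the following lines could produce per-call figures of this kind. It is only a sketch under stated assumptions: the names math_fn, identity, apply and ticks_per_call are hypothetical, and the standard clock() function stands in for the processor cycle counter, so the units differ from the clock-ticks reported in this note. Because the math function is reached through a pointer, the sketch measures the routine the pointer resolves to and may not reproduce code that a compiler would otherwise inline.

```c
#include <assert.h>
#include <time.h>

/* Signature shared by exp(), log(), sqrt() and other one-argument functions. */
typedef double (*math_fn)(double);

/* Hypothetical baseline: returns its input, so timing it gives loop overhead. */
double identity(double x)
{
    return x;
}

/* Apply fn to every element, mirroring the loop used in this note. */
void apply(math_fn fn, const double *d_in, double *d_out, int elements)
{
    int e;
    for (e = 0; e < elements; e++) {
        d_out[e] = fn(d_in[e]);
    }
}

/* Average clock() ticks per call over `repeats` passes of the loop.
   Assumption: clock() replaces the cycle counter used for the note's numbers. */
double ticks_per_call(math_fn fn, const double *d_in, double *d_out,
                      int elements, int repeats)
{
    clock_t start, stop;
    int r;

    assert(elements > 0 && repeats > 0);

    start = clock();
    for (r = 0; r < repeats; r++) {
        apply(fn, d_in, d_out, elements);
    }
    stop = clock();

    return (double)(stop - start) / ((double)elements * (double)repeats);
}
```

With math.h included, a call such as ticks_per_call(exp, d_in, d_exp, elements, repeats) times the exp() loop, and subtracting the result for identity gives a rough correction for loop and call overhead.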
The optimization process below is presented as a series of incremental steps, with observations and recommendations at each step.
1. Visual Studio .NET 2003 with default project settings delivers the slowest performance, at 141 clock-ticks per exp() call. A careful look at the generated assembly code reveals the reason for the poor performance:
; 98 : d_exp[e] = exp(d_in[e]);
fldl2e
inc eax
cmp eax, esi
fmul QWORD PTR _d_in$[esp+eax*8+160064]
fld ST(0)
frndint
fxch ST(1)
fsub ST(0), ST(1)
f2xm1
fld1
faddp ST(1), ST(0)
fscale
fstp ST(1)
fstp QWORD PTR _d_exp$[esp+eax*8+160064]
jl SHORT $L56189
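For reference, the inline sequence above appears to evaluate exp(x) through a base-2 identity. The symbols y, n and f are introduced here for illustration only and do not appear in the listing; the instruction named beside each step is the one that seems to carry it out.

```latex
\begin{aligned}
y   &= x \cdot \log_2 e        && \text{(fldl2e, fmul)} \\
n   &= \operatorname{round}(y) && \text{(frndint)} \\
f   &= y - n                   && \text{(fsub)} \\
e^x &= 2^{y} = \bigl((2^{f} - 1) + 1\bigr) \cdot 2^{n} && \text{(f2xm1, fld1/faddp, fscale)}
\end{aligned}
```

Each step depends on the result of the one before it, so the sequence runs as a single dependent chain on the x87 unit.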
The underperforming x87 code inlined in place of a function call to exp() seems to be the culprit here. One would expect an inlined intrinsic to perform better than an explicit function call, so the next step is to see how disabling “Generate Intrinsic Function” performs. Before doing that, performance on Visual Studio 2005 is explored below.
2. Visual Studio 2005 with default project settings delivered roughly 1.5x better performance than Visual Studio .NET 2003 (96 clock-ticks per exp() call vs. 141). The key difference is the absence of inline x87 code and a call to __CIexp() instead:
; 98 : d_exp[e] = exp(d_in[e]);
fld QWORD PTR _d_in$[esp+edi*8+160128]
call __CIexp
fstp QWORD PTR _d_exp$[esp+edi*8+160128]
add edi, 1
cmp edi, esi
jl SHORT $LL17@main
3. To force the compiler to insert an explicit function call to exp() instead of generating intrinsic functions, the compiler switch /Oi- is used. Refer to http://msdn2.microsoft.com/en-us/library/f99tchzc.aspx for more details.
The explicit call to _exp() is now made as can be seen in the assembly code below:
; 98 : d_exp[e] = exp(d_in[e]);
fld QWORD PTR _d_in$[esp+edi*8+160128]
sub esp, 8
fstp QWORD PTR [esp]
call _exp
fstp QWORD PTR _d_exp$[esp+edi*8+160136]
add edi, 1
add esp, 8
cmp edi, esi
jl SHORT $LL17@main
/Oi- delivers a 1.58x speedup under Visual Studio .NET 2003 (89 clock-ticks per exp() call vs. 141) and a 1.13x speedup under Visual Studio 2005 (85 clock-ticks per exp() call vs. 96). The key difference here is the explicit call to _exp().
4. The Intel compiler’s math library is now linked in, while keeping the compiler switch /Oi- from the previous step. In this example, libmmd.lib is used with libmmd.dll for multi-threaded dynamic linking, because the application is built with /MD. Refer to the Intel compiler documentation for other library options such as static linking.
As expected, the assembly code remains the same as in the previous step:
; 98 : d_exp[e] = exp(d_in[e]);
fld QWORD PTR _d_in$[esp+edi*8+160128]
sub esp, 8
fstp QWORD PTR [esp]
call _exp
fstp QWORD PTR _d_exp$[esp+edi*8+160136]
add edi, 1
add esp, 8
cmp edi, esi
jl SHORT $LL17@main
When re-running the application, a pop-up reporting a missing libmmd.dll may be encountered. To resolve it, set the path to point to the Intel compiler math library location, something like [C:\Program Files\Intel\Compiler\C++\10.1.020\IA32\lib], or simply copy libmmd.dll to the application's directory.
Performance more than doubles compared to the previous step (41 clock-ticks per exp() call vs. 85). The key difference here is that the explicit call to _exp() now resolves to the Intel compiler’s math library.
Note: The optimization process can be taken to the next level using the Intel compiler, MKL, etc., but those approaches are discussed in other tuning notes.
1. A solid performance improvement can be achieved for math functions by forcing the Visual Studio compilers, both Visual Studio .NET 2003 and 2005, to make explicit calls to those functions instead of generating intrinsics.
2. The Intel compiler’s math library delivers significant performance gains that can be realized with minimal build environment change and no source-code change.
In the case of exp(), the overall speedup is 3.44x using Visual Studio .NET 2003, where 141 clock-ticks per exp() call drop to just 41. In the case of Visual Studio 2005, the speedup is 2.34x, where 96 clock-ticks per exp() call drop to just 41.
Visual Studio 2008 delivered the same performance as Visual Studio 2005 in this example.
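The speedup figures quoted in this note are ratios of clock-ticks per exp() call before and after a change. A minimal sketch of that arithmetic follows; the function name speedup is hypothetical.

```c
#include <assert.h>

/* Speedup factor: ticks per call before a change divided by ticks after it. */
double speedup(double ticks_before, double ticks_after)
{
    assert(ticks_after > 0.0);
    return ticks_before / ticks_after;
}
```

Using the measurements from this note, speedup(141, 89) is roughly 1.6 and speedup(85, 41) is roughly 2.1 for the individual steps, while speedup(141, 41) is roughly 3.4 and speedup(96, 41) is roughly 2.3 overall.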
Visual Studio Compiler with Intel Compiler Math Library (clock-ticks per exp() call)
| Configuration | Visual Studio .NET 2003 | Visual Studio 2005 | Visual Studio 2008 |
| --- | --- | --- | --- |
| VS Compiler Default | 141 | 96 | 96 |
| VS Compiler with /Oi- | 89 | 85 | 85 |
| VS Compiler w/Intel Compiler Math Library | 41 | 41 | 41 |
Integrating fast math libraries for the Intel Pentium 4 processor
Intel C++ Compiler for Windows

J.D. Patel (Intel)