Introduction
Simple math functions are known to be a bottleneck in a wide variety of floating-point applications, ranging from financial analytics to 2D image manipulation to 3D physics engines. This has become even more evident on the Pentium® 4 Processor, where the latency of trigonometric, logarithmic, and exponential floating-point instructions can approach 200 clock cycles.
This paper will present four different optimized math libraries from Intel, each providing a speedy alternative to the slow hardware-based operations. These are the Approximate Math, Vector Math, Short Vector Math, and LibM libraries. These libraries trade off different levels of performance, parallelism, and accuracy, so you can choose the one most appropriate for your application. With the information contained here, you will learn how to extract maximum performance out of math function calls in any context.TESTING
Which Library Should I Use?
To effectively choose one of the four libraries discussed here, you will need to know the precision requirements of your algorithm as well as the amount of parallelism available around each math function call. The latter determines whether SIMD instructions can be used to compute multiple results in parallel. For Windows* developers, your choice of compiler should not be an issue, as all of the libraries are fully compatible with both the Intel® C++ Compiler and the Microsoft* Visual C++ Compiler. Consideration of math library optimizations on other operating systems is beyond the scope of this paper.
Approximate Math LibraryThe Approximate Math Library is a set of open source math functions that operate on both scalar and packed data using SSE (Streaming SIMD Extensions) and SSE2 instructions. Of all the options, AM library provides the best performance, but it comes at the cost of low precision. Specifically, the precision is comparable to that of the SSE reciprocal approximation instructions (which have an absolute relative error less than or equal to 1.5 * 2-12). As such, the library only operates on single precision data. (Approximating results at double-precision is not especially practical). In the scalar form, the argument must be passed in the lowest field of an XMM register, while in the packed form the argument consists of four numbers arranged adjacently in one XMM register. The level of parallelism available in your algorithm will determine whether the packed version can be used.
As the "Open Source" tag implies, the AM Library is completely free and does not require a license to use. The library package is available by clicking the link for the Approximate Math (AM) library
here.
Vector Math LibraryThe Vector Math Library (VML) is a subset of the Intel® Math Kernel Library, intended for parallel math computations on large vectors of data. Unlike the AM library, VML does not compromise precision, so each math function is available in both single and double precision form. The limiting requirement for VML is that you must organize your input data into a single array to be passed as an argument into a VML call.
For example, VML would be appropriate for the loop shown in Figure 1, but only if the loop bound (iArrayLength) is a large value. If the
src array contained double-precision data, this whole loop could be replaced by a single call to the
vdsin() function. Likewise,
vssin() is the counterpart for single-precision.
Figure 1. A potential VML opportunityThe
Intel Math Kernel Library (including VML) is available at Intel's web site. If you decide to use this is in a retail product, you must request a product license.
LibM Math LibraryThe LibM library provides highly optimized scalar math functions that serve as direct replacements for the standard C calls. The LibM versions are fully accurate and do not attempt to extract algorithm-level parallelism, so they can be applied in even the most rigid coding situations.
LibM is packaged with Intel compilers and thus requires the purchase of an Intel Compiler license if you choose to build it into a retail product (even if you don't use the compiler itself). For more info on Intel Compilers, including a free 30-day trial version,
click here. Exact implementation steps for LibM will be discussed in Section 3.
Short Vector Math LibraryThe Short Vector Math Library is perhaps the trickiest of the four libraries to apply, but is worth the extra pain for the large performance gains. SVML leverages SSE & SSE2 instructions to process either four packed single-precision numbers or two packed double-precision numbers in one call. Results are fully accurate. As with VML, the programmer is responsible for lining up the data in parallel fashion – something not always possible depending on the nature of the algorithm.
Like LibM, SVML comes with the Intel Compiler. In fact, SVML was developed solely as an enabler for the Intel Compiler's automatic SIMD vectorization capability. Using SVML, the compiler can vectorize simple loops containing calls to math functions, but still there are many more complex opportunities where programmer intervention (i.e. a human brain) is needed to fully extract the parallelism. With that in mind, Section 4 will divulge methods for coding SVML calls by hand.
LibM Tips & Tricks
LibM is a snap to implement because it usually does not require any changes to your code. The trick is simply to turn off the generation of intrinsic calls in the compiler. (Many of the math functions are generated inline as an optimization). If you are using the Microsoft* Visual C++ compiler, this can be done in one of two ways:
- Apply the /Oi- compiler switch (note the minus sign). This turns off the "Generate Intrinsic Functions" optimization, forcing the compiler to insert explicit calls to the C-runtime library or a substitute math library.
- Insert "#pragma function(function name)" in the files containing your math calls, where function name is the operation you are using (e.g sin, cos, exp, log, etc.). This is a more controlled way of forcing explicit calls only for the functions you specify.
A note to Intel Compiler users: "#pragma function" is accepted by the compiler, but has no effect, so you must use the
/Oi- switch. The "#pragma optimize" sequence also cannot be used to toggle specific optimizations, so localization of
/Oi- can only be achieved at a function-level granularity by separating functions into different files.
Once you've got the necessary intrinsic calls disabled, just add "libm.lib" at the head of your linker command line and you should be in business using LibM. If you are using the Microsoft* Visual C++ compiler and haven't fully installed the Intel Compiler, you'll also need to list "libirc.lib" (the Intel C-runtime library), and be sure that Microsoft* Visual Studio can find the Intel libraries in your LIB path setting (under the Tools->Options->Directories pulldown).
Each LibM call begins with a series of SSE/2 capability checks before branching to the appropriate code path, so you automatically maintain backward compatibility with earlier architectures.
A Common Pitfall with Visual C++If you initially don't see a performance boost from LibM, you may be running into an Intel® NetBurst™ Architecture stall known as a store-forwarding violation. This can happen with the Microsoft Visual C++ 6.0 compiler if your argument is a double precision value coming directly out of memory, much like the
sin() call back in Figure 1. When compiled with LibM optimizations, this will lead to the assembly code in Figure 2.
Figure 2. LibM disassemblyHere, the compiler causes a store-forwarding violation by loading the double precision operand in two 4-byte chunks, pushing them consecutively on to the local memory stack, and then reloading the full 8-bytes into
XMM0 inside the LibM
sin() call. (The top of the stack is referenced through the
esp register, so the argument is at
[esp+4] after the function return pointer is pushed on to the stack.) This code causes a stall in the Intel NetBurst™ Architecture, because the data from the two 4-byte stores cannot be forwarded to the subsequent 8-byte load. For more on store-forwarding issues, see the
Intel® Pentium® 4 Processor Optimization Reference ManualThe store-forwarding penalty is enough to offset the gains from employing LibM, but fortunately it can be avoided with some clever coding. The goal is to trick the compiler into using x87 floating-point loads and stores to set up the argument on the local stack. The
FLD and
FST instructions can move data 8-bytes at a time, so they will not cause a violation of the store-forwarding rules.
Figures 3 and 4 show two methods for accomplishing the desired assembly code. In Figure 3, we've assigned the array value to a temporary variable that gets passed into the call. Even though this code is functionally equivalent, the compiler now uses an
FLD,
FSTP instruction sequence to move the 8-byte array value to the top of the local stack.
Figure 3. Solution using 'temp' variableThe solution in Figure 4 is a bit more convoluted but still functionally equivalent. By casting the array value to a
float data type, we again influence the compiler to move the data via x87 instructions. However this does not affect the precision of the computation because all standard math functions input and return double-precision arguments by default. So, the compiler-generated assembly code is exactly the same as that shown in Figure 3.
Figure 4. Solution using (float) cast
Exposing SVML
Since SVML is only intended for use by the Intel Compiler vectorizer, the compiler does not provide a header file exposing the entry points into the library. However, it's easy enough to determine them by looking at a binary dump of the library (svml_disp.lib) and the assembly code generated when the compiler employs SVML. This has already been done for you in Appendix A (below). Just cut and paste these declarations into a header file that you can #include wherever you need SVML.
The best way to set up the SVML calls is via SSE/2 instrinsics. (Note, you'll need to install the Visual Studio Processor Pack to enable intrinsics in the Microsoft compiler). Figure 5 shows how to apply SVML to the original sine example. The variable '
temp' is declared as an __
m128d data type, meaning that it will represent two double-precision values packed into one
XMM register. The appropriate SVML call for this code is
vmldSin2(). Picking apart the function naming convention, 'vml' obviously stands for 'vector math library'. The 'd' indicates double-precision FP data. Next is the math function name, and finally a number indicating the width of data being processed. This number will always be '2' for double-precision and '4' for single-precision, in direct relation to the width of SIMD operations allowed by SSE/2 for the respective data types. Note that the loop counter 'i' is being incremented by two instead of one because the loop is producing two results per iteration.
Figure 5. Solution using SVML & intrinsics In the disassembly you'll see the temp value being passed into and returned from the vmldSin4 call via the
XMM0 register. This is the convention you must follow to code SVML calls in assembly. Also be aware that the SVML call is going to blow away any data you were keeping in the other
XMM registers. Since the use of SSE/2 instructions cannot b e contained within the SVML call, you must implement code to check for SSE/2 capabilities and branch to different code paths as appropriate.
Library Feature Summary
Table 1 presents a summary of the capabilities of each math library.
Table 1. Summary of Library Features
Performance
The following charts compare performance of each library for a select set of math functions. Each test case was similar in nature to the sine() examples laid out in this paper- a
for loop processing a long array of input values. The binaries were built using Microsoft Visual C++ 6.0 service pack 5 (incl. the Visual Studio Processor pack) with compiler options '-G6 –O2'. The measurements were taken on a 2.2 Ghz Intel Pentium® 4 Processor with 256 MB RAM running Windows XP Professional.

Some overall notes on the libraries:
- The Approximate Math library does not support square root, cube root, or hyperbolic functions.
- The Microsoft LIBC run-time library does not support cube root, or either of the inverse root functions.
- For square root, the SVML calls are essentially equivalent to their respective SSE/2 intrinsics (_mm_sqrt_ps & _mm_sqrt_pd), but with higher overhead. Use the intrinsics for superior performance.
Conclusion
This paper gave an overview of four different math libraries that are highly optimized for the Pentium® 4 Processor. Each library has unique characteristics that make it suitable for specific algorithm contexts. With the information contained here, hopefully you can choose the library most appropriate for your application and easily integrate it into your code.
About the Author
Mike Stoner is a Senior Applications Engineer with Intel's Software Solutions Group. He has been with Intel since 1996, mainly in the role of helping independent software vendors develop optimized code for Intel platforms. Prior to joining Intel Mike received Bachelor's and Master's degrees in Electrical Engineering from The Ohio State University.
Appendix A - Short Vector Math Library Function Declarations
This appendix lists the entry points for SVML. Please keep in mind that SVML hand-coding is not a supported usage of the library, and the information below could change without notice in later revisions of the Intel Compiler.