Free Speedup with Compiler Switches for Fast Math and Intel® Streaming SIMD Extensions



The intention of this introductory article is to make developers aware of a simple optimization step that requires little effort and no specialized knowledge.


Compilation that can utilize Intel® Streaming SIMD Extensions (Intel® SSE) instructions, available on most x86 CPUs, can improve floating point performance even if the source code is not set up for single instruction multiple data (SIMD) processing. However, the Microsoft Visual Studio* compiler option to enable Intel SSE instructions is not enabled by default. This mini-article describes the simple steps to enable these instructions, as well as showing how to quickly recognize if your code is being properly optimized or not. In a sample floating-point-intensive serial loop, the performance improves 2X with just a recompilation.


Each generation of Intel® architecture brings hardware improvements and new assembly instructions. The original floating point math instructions are the x87 instructions, first implemented in a separate math coprocessor and later integrated into the CPU itself. In 1999, Intel introduced the first generation of Intel SSE with the Intel® Pentium® III processor, and Intel® Streaming SIMD Extensions 2 (Intel® SSE2) with the Intel® Pentium® 4 processor in 2001. Clearly, these instructions have been around for a long time and are available on most CPUs in use today. In addition to the single instruction multiple data (SIMD) parallel instructions, Intel SSE includes a set of serial (scalar) instructions that are faster than their x87 counterparts.

A Windows* application developer using the Microsoft Visual Studio* C++ compiler will get their code debugged and working in Debug mode. With features implemented and correctness ensured, the next step is usually to make the application faster by building the program in Release mode. Unlike the Debug configuration, Release settings are generally intended for faster execution. Release mode is not a specific compiler option, but rather a "configuration", or set of compiler options. The specific compiler setting that enables Intel SSE instructions is turned off by default. For the compiler to generate the fastest possible program, it needs to be allowed to emit the best instructions for the job. This is easy to override by using the /arch compiler option, either on the command line or from within Microsoft Visual Studio* by changing the Enable Enhanced Instruction Set option in the project's C/C++ Code Generation properties. The downside is that very old CPUs from the 1990s will not be able to run a binary executable that includes Intel SSE instructions. In practice, many applications target a minimum hardware specification that is already at or above Intel SSE2, and any application that requires a dual-core processor can assume Intel® Streaming SIMD Extensions 3 (Intel® SSE3) is available.

The remainder of this article walks through the steps to enable Intel SSE instructions, demonstrates how to check what the compiler is generating, and finally shows the performance improvements that are possible with just serial Intel SSE operations.


Getting the compiler to use Intel SSE, and verifying that the program (.exe) is using these instructions, is easy. Checking the output of the compiler does not require experience with (or even willingness to deal with) assembly programming. If you can tell the letter 'f' from the letter 's', you have the skills to recognize the patterns that indicate which architecture features are being used. Navigating to the right place takes nothing more than a breakpoint and a right-click menu option.

For this walkthrough, we consider a simple example loop that normalizes an array of 3D vectors.

for(int i=0; i!=n ;i++)
   N[i] = V[i] / magnitude(V[i]);
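For reference, the loop can be fleshed out into a self-contained sketch. The article does not show the surrounding code, so the Vec3 type, the magnitude() function, and the division operator below are assumptions:

```cpp
#include <cmath>
#include <vector>

// Hypothetical minimal 3D vector type; the article does not show its definition.
struct Vec3 { float x, y, z; };

// Assumed magnitude(): Euclidean length of a 3D vector.
inline float magnitude(const Vec3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Assumed component-wise division by a scalar.
inline Vec3 operator/(const Vec3& v, float s) {
    return Vec3{v.x / s, v.y / s, v.z / s};
}

// The article's loop: normalize each vector in V, writing the result into N.
inline void normalize_all(const std::vector<Vec3>& V, std::vector<Vec3>& N) {
    const int n = static_cast<int>(V.size());
    for (int i = 0; i != n; i++)
        N[i] = V[i] / magnitude(V[i]);
}
```
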

Obviously the first thing that needs to be done is to ensure the code is working properly in Debug mode. Assuming this is so, the next step is to compile in Release mode. Next put a breakpoint at this loop, and then run the program by hitting F5. When the running program hits the breakpoint, right-click to bring up the menu and select Go To Disassembly.


If the compiler generated code using x87 instructions, then the disassembly view will appear similar to the following:

for(int i=0; i!=n ;i++)
   003238C8   mov         esi,dword ptr [ebp+8]
   003238CB   push        edi
   003238CC   mov         edi,dword ptr [dest]
   003238CF   add         esi,8
   003238D2   mov         ebx,4000h
            N[i] = V[i] / magnitude(V[i]);
   003238D7   fld         dword ptr [esi-4]
   003238DA   fld         dword ptr [esi-8]
   003238DD   fld         dword ptr [esi]
   003238DF   fld         st(1)
   003238E1   fmulp       st(2),st
   003238E3   fld         st(2)
   003238E5   fmulp       st(3),st
   003238E7   fxch        st(1)
   003238E9   faddp       st(2),st
   003238EB   fmul        st(0),st
   003238ED   faddp       st(1),st
   003238EF   fstp        dword ptr [ebp-4]
   003238F2   fld         dword ptr [ebp-4]
   003238F5   call        _CIsqrt (3255B0h)
   003238FA   fstp        dword ptr [ebp-4]

The assembly instructions beginning with the letter 'f', including fmul, fld, faddp, and fmulp, are legacy x87 math coprocessor instructions that predate the Intel Pentium III processor's SSE support. Furthermore, the instruction call _CIsqrt near the end invokes a library routine to compute the square root rather than computing it inline. This sort of assembly is not ideal for high-performance code.

This can be changed in the project properties: right-click the project file and select Properties. In the Properties dialog, expand the C/C++ group and click Code Generation. Among the options on the right is Enable Enhanced Instruction Set; change it to Streaming SIMD Extensions 2 (the command-line equivalent is /arch:SSE2). Also change the Floating Point Model to Fast (/fp:fast), which lets the compiler relax strict floating point semantics, for example keeping single precision calculations in 32 bits rather than promoting them to double. Changing these options requires recompiling the code to take effect.
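One way to sanity-check the build configuration from code is to inspect the compiler's predefined macros. The helper below is a hypothetical sketch (the function name is ours) built on the documented _M_IX86_FP macro for Visual C++ and the __SSE__/__SSE2__/__AVX__ macros for GCC and Clang:

```cpp
#include <string>

// Hypothetical helper (not from the article): report the floating point
// instruction set the compiler was told to target, using predefined macros.
inline std::string fp_target() {
#if defined(__AVX__)
    return "AVX or later";
#elif defined(__SSE2__) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2) || \
      defined(_M_X64) || defined(__x86_64__)
    return "SSE2";  // /arch:SSE2, or implied by a 64-bit x86 target
#elif defined(__SSE__) || (defined(_M_IX86_FP) && _M_IX86_FP == 1)
    return "SSE";
#else
    return "x87 or unknown";
#endif
}
```
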

To inspect the new assembly, rerun the program and hopefully it will stop at the same breakpoint. Note that sometimes the optimizer makes it hard for the debugger to line up the source code with the assembly. If the compiler is aggressive with putting code inline, it might never hit the breakpoint. To work around this, move the breakpoint to where the function gets called. However you manage to view the disassembly, the guts of the loop should now look like:

       01385E80   movss          xmm0,dword ptr [eax-4]
       01385E85   movss          xmm1,dword ptr [eax-8]
       01385E8A   movss          xmm2,dword ptr [eax]
       01385E8E   movaps         xmm4,xmm0
       01385E91   mulss          xmm4,xmm0
       01385E95   movaps         xmm3,xmm1
       01385E98   mulss          xmm3,xmm1
       01385E9C   addss          xmm3,xmm4
       01385EA0   movaps         xmm4,xmm2
       01385EA3   mulss          xmm4,xmm2
       01385EA7   addss          xmm3,xmm4
       01385EAB   sqrtss         xmm4,xmm3
       01385EAF   movaps         xmm3,xmm5
       01385EB2   divss          xmm3,xmm4
       01385EB6   mulss          xmm0,xmm3
       01385EBA   mulss          xmm1,xmm3
       01385EBE   mulss          xmm2,xmm3
       01385EC2   movss          dword ptr [p],xmm1
       01385EC7   movss          dword ptr [ebp-28h],xmm0
       01385ECC   movq           xmm0,mmword ptr [p]
       01385ED1   movss          dword ptr [ebp-24h],xmm2

The code here uses Intel SSE serial instructions. The ss suffix on addss, mulss, and sqrtss indicates an add, multiply, or square root on one (serial) 32-bit single precision floating point number. The xmm registers are 128 bits wide, but only the first 32 bits are actually used.
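For illustration, the same magnitude computation can be written directly with scalar SSE intrinsics from <xmmintrin.h>. This sketch (the function name is ours, not the article's) mirrors the mulss/addss/sqrtss sequence in the disassembly above:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Sketch: magnitude of (x, y, z) using scalar SSE operations; each _ss
// intrinsic maps to one of the serial instructions shown in the disassembly.
inline float magnitude_sse(float x, float y, float z) {
    __m128 vx = _mm_set_ss(x);                                       // movss
    __m128 vy = _mm_set_ss(y);
    __m128 vz = _mm_set_ss(z);
    __m128 sum = _mm_add_ss(_mm_mul_ss(vx, vx), _mm_mul_ss(vy, vy)); // mulss, addss
    sum = _mm_add_ss(sum, _mm_mul_ss(vz, vz));
    return _mm_cvtss_f32(_mm_sqrt_ss(sum));                          // sqrtss
}
```
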

Performance Results

Although the assembly code isn't taking advantage of the full 128 bits that Intel SSE offers, it is still faster than the previous x87 code. Using x87, the runtime is 45 cycles per loop, whereas it only takes about 23 cycles per loop after flipping the fast math and Intel SSE switches on. These results were generated on an Intel® Core™ i7 processor and may differ on other x86 processors. Furthermore, any such results are dependent on the compiler and on how the source code is written. Note that this example was an ideal case showing only the timing difference of this particular loop - not the overall results of the application. Furthermore, not all floating point sections of code will be able to demonstrate this amount of speedup.
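A rough way to reproduce such measurements, using wall-clock time rather than the cycle counts quoted above, is a micro-benchmark along these lines (a sketch; the helper name and test values are ours):

```cpp
#include <chrono>
#include <cmath>
#include <vector>

struct V3 { float x, y, z; };

// Sketch micro-benchmark: average nanoseconds per normalized vector.
// Wall-clock timing is noisy; the article's per-loop cycle counts would
// come from a cycle counter instead.
inline double ns_per_vector(int n) {
    std::vector<V3> V(n, V3{1.0f, 2.0f, 2.0f}), N(n);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i != n; i++) {
        float m = std::sqrt(V[i].x * V[i].x + V[i].y * V[i].y + V[i].z * V[i].z);
        N[i] = V3{V[i].x / m, V[i].y / m, V[i].z / m};
    }
    auto t1 = std::chrono::steady_clock::now();
    volatile float sink = N[n - 1].x;  // keep the optimizer from removing the loop
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}
```
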

Conclusion and Further Performance Improvements

In this introductory article we showed how Release code can be made even faster simply by flipping a switch that allows the compiler to use Intel SSE serial instructions and fast math. The performance benefit comes without touching the source code. It is such a simple thing to do that you can tap a coworker on the shoulder and pass on this tidbit of knowledge.

This loop could actually be sped up even more. Rather than using the costly sqrt and div instructions, Intel SSE has a much faster approximate inverse square root instruction that may be sufficient on its own, or can be refined with a Newton-Raphson iteration, depending on how much accuracy the normalized result requires. Further benefits come from using the parallel rather than the serial instructions; in other words, it is possible to normalize more than one vector at a time. However, these next steps require some extra programming effort, use of libraries or header files that have already been SIMD optimized, or a compiler that can automatically vectorize the code. SIMD with Intel SSE is a widely covered topic with many articles and examples available on the web, including the Intel® Developer Zone. Follow-up articles will dive deeper into how to use SIMD effectively and discuss the Intel® Advanced Vector Extensions (Intel® AVX) to x86 that are now available in hardware (/en-us/avx/). In particular, using 256-bit Intel AVX it is possible to rearrange the data on the fly and normalize 8 vectors at a time, bringing the average runtime cost down to 2.7 cycles per vector.

For more complete information about compiler optimizations, see our Optimization Notice.