The intention of this introductory article is to make developers aware of a simple optimization step they can do that doesn't require much effort or specialized knowledge.
Compilation that can utilize Intel® Streaming SIMD Extensions (Intel® SSE) instructions, available on most x86 CPUs, can improve floating point performance even if the source code is not set up for single instruction multiple data (SIMD) processing. However, the Microsoft Visual Studio* compiler option to enable Intel SSE instructions is not enabled by default. This mini-article describes the simple steps to enable these instructions, as well as showing how to quickly recognize if your code is being properly optimized or not. In a sample floating-point-intensive serial loop, the performance improves 2X with just a recompilation.
Each generation of Intel® architecture brings hardware improvements and new assembly instructions. The original Intel® Pentium® processor included the x87 math coprocessor, which ran the original floating point math instructions. In 1999, Intel introduced the first generation of Intel SSE with the Intel® Pentium® III processor, and Intel® Streaming SIMD Extensions 2 (Intel® SSE2) with the Intel® Pentium® 4 processor in 2001. Clearly, these instructions have been around for a long time and are available on most CPUs being used today. In addition to the single instruction multiple data (SIMD) parallel instructions, Intel SSE includes a set of serial instructions that are faster than their x87 counterparts.
A Windows* application developer using the Microsoft Visual Studio* C++ compiler typically gets their code debugged and working in Debug mode. With features implemented and correctness ensured, the next step is usually to make the application faster by building the program in Release mode. Unlike the Debug configuration, Release settings are generally intended for faster execution. Release mode is not a specific compiler option, but rather a "configuration": a set of compiler options. The specific compiler setting that enables Intel SSE instructions is turned off by default. For the compiler to generate the fastest possible program, it needs to be allowed to emit the best instructions for the job. This is easy to override by using the /arch compiler option, either on the compile command line or from within Microsoft Visual Studio* by changing the Enable Enhanced Instruction Set option in the project's C/C++ Code Generation properties dialog. The downside is that older CPUs from the 1990s will not be able to run a binary executable that includes Intel SSE instructions. In practice, many applications target a minimum hardware specification that is already at or above Intel SSE2; for example, any application that requires a dual-core processor can assume Intel® Streaming SIMD Extensions 3 (Intel® SSE3) is available (http://en.wikipedia.org/wiki/SSE3).
The remainder of this article walks through the steps to enable Intel SSE instructions, demonstrates how to check what the compiler is generating, and finally shows the performance improvements that are possible with just serial Intel SSE operations.
Getting the compiler to use Intel SSE and verifying that the program (.exe) is using these instructions is easy. Checking the output of the compiler does not require experience with (or even a willingness to deal with) assembly programming. If you can distinguish between the letters 'f' and 's', then you have the skills to recognize the patterns that indicate which architecture features are being used. Navigating to the disassembly takes nothing more than a breakpoint and a right-click menu option.
For this walkthrough, we consider a simple example loop that normalizes an array of 3D vectors.
for (int i = 0; i != n; i++) N[i] = V[i] / magnitude(V[i]);
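The article does not show the surrounding declarations, so as a minimal sketch of the context this loop assumes (Vec3, magnitude, and normalize here are hypothetical stand-ins, not names from the original code):

```cpp
#include <cmath>

// Hypothetical 3D vector type; the article only shows the loop body.
struct Vec3 {
    float x, y, z;
    Vec3 operator/(float s) const { return {x / s, y / s, z / s}; }
};

// Euclidean length of a 3D vector: sqrt(x*x + y*y + z*z).
float magnitude(const Vec3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// The loop from the article: normalize each vector in V into N.
void normalize(const Vec3* V, Vec3* N, int n) {
    for (int i = 0; i != n; i++)
        N[i] = V[i] / magnitude(V[i]);
}
```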
Obviously the first thing that needs to be done is to ensure the code is working properly in Debug mode. Assuming this is so, the next step is to compile in Release mode. Next put a breakpoint at this loop, and then run the program by hitting F5. When the running program hits the breakpoint, right-click to bring up the menu and select Go To Disassembly.
If the compiler generated code using x87 instructions, then the disassembly view will appear similar to the following:
for (int i = 0; i != n; i++)
003238C8  mov   esi,dword ptr [ebp+8]
003238CB  push  edi
003238CC  mov   edi,dword ptr [dest]
003238CF  add   esi,8
003238D2  mov   ebx,4000h
    N[i] = V[i] / magnitude(V[i]);
003238D7  fld   dword ptr [esi-4]
003238DA  fld   dword ptr [esi-8]
003238DD  fld   dword ptr [esi]
003238DF  fld   st(1)
003238E1  fmulp st(2),st
003238E3  fld   st(2)
003238E5  fmulp st(3),st
003238E7  fxch  st(1)
003238E9  faddp st(2),st
003238EB  fmul  st(0),st
003238ED  faddp st(1),st
003238EF  fstp  dword ptr [ebp-4]
003238F2  fld   dword ptr [ebp-4]
003238F5  call  _CIsqrt (3255B0h)
003238FA  fstp  dword ptr [ebp-4]
...
The assembly instructions beginning with the letter 'f', including fmul, fld, faddp, and fmulp, are legacy x87 math coprocessor instructions that predate the Intel Pentium processor. Furthermore, in the second-to-last line, call _CIsqrt invokes a function call to compute the square root rather than computing it inline. This sort of assembly is not ideal for high-performance code.
This can be changed by going to the project properties: right-click on the project file and select Properties. In the Properties dialog, expand the C/C++ group and click on Code Generation. Among the options that appear on the right is Enable Enhanced Instruction Set; change this setting to Streaming SIMD Extensions 2 (the /arch:SSE2 option). Also change the Floating Point Model to Fast (/fp:fast) to tell the compiler that using only 32 bits (instead of double) will suffice for floating point calculations. Changing these options requires recompiling all the code to take effect.
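For reference, a hypothetical command-line equivalent of these project settings, assuming a single source file named normalize.cpp, would be:

```shell
:: /O2 enables Release-style optimization, /arch:SSE2 enables Intel SSE/SSE2
:: code generation, and /fp:fast selects the Fast floating point model.
cl /O2 /arch:SSE2 /fp:fast normalize.cpp
```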
To inspect the new assembly, rerun the program and hopefully it will stop at the same breakpoint. Note that sometimes the optimizer makes it hard for the debugger to line up the source code with the assembly. If the compiler is aggressive with putting code inline, it might never hit the breakpoint. To work around this, move the breakpoint to where the function gets called. However you manage to view the disassembly, the guts of the loop should now look like:
01385E80  movss  xmm0,dword ptr [eax-4]
01385E85  movss  xmm1,dword ptr [eax-8]
01385E8A  movss  xmm2,dword ptr [eax]
01385E8E  movaps xmm4,xmm0
01385E91  mulss  xmm4,xmm0
01385E95  movaps xmm3,xmm1
01385E98  mulss  xmm3,xmm1
01385E9C  addss  xmm3,xmm4
01385EA0  movaps xmm4,xmm2
01385EA3  mulss  xmm4,xmm2
01385EA7  addss  xmm3,xmm4
01385EAB  sqrtss xmm4,xmm3
01385EAF  movaps xmm3,xmm5
01385EB2  divss  xmm3,xmm4
01385EB6  mulss  xmm0,xmm3
01385EBA  mulss  xmm1,xmm3
01385EBE  mulss  xmm2,xmm3
01385EC2  movss  dword ptr [p],xmm1
01385EC7  movss  dword ptr [ebp-28h],xmm0
01385ECC  movq   xmm0,mmword ptr [p]
01385ED1  movss  dword ptr [ebp-24h],xmm2
The code here is using Intel SSE serial instructions. The ss suffix on addss, mulss, and sqrtss indicates add, multiply, and square root for one (serial) 32-bit single precision floating point number. The xmm registers are 128 bits wide, but only the first 32 bits are actually used.
Although the assembly code isn't taking advantage of the full 128 bits that Intel SSE offers, it is still faster than the previous x87 code. Using x87, the runtime is 45 cycles per loop, whereas it only takes about 23 cycles per loop after flipping the fast math and Intel SSE switches on. These results were generated on an Intel® Core™ i7 processor and may differ on other x86 processors. Furthermore, any such results are dependent on the compiler and on how the source code is written. Note that this example was an ideal case showing only the timing difference of this particular loop - not the overall results of the application. Furthermore, not all floating point sections of code will be able to demonstrate this amount of speedup.
Conclusion and Further Performance Improvements
In this introductory article we showed how Release code can be made even faster simply by flipping a switch that allows the compiler to use Intel SSE serial instructions and fast math. The performance benefits come without touching the source code. It is such a simple change that you should feel free to tap your coworker on the shoulder and pass on this tidbit of knowledge.
This loop could actually be sped up even more. Rather than using costly sqrt and div instructions, Intel SSE has a much faster approximate inverse square root instruction that may be sufficient on its own, or can be refined with a Newton-Raphson step, depending on how much accuracy is required in the normalized result. Further benefits come from utilizing the parallel rather than the serial instructions; in other words, it is possible to normalize more than one vector at a time. However, these next steps require some extra programming effort, the use of libraries or header files that have already been SIMD optimized, or a compiler that can automatically vectorize the code. SIMD with Intel SSE is a widely covered topic with many articles and examples available on the web, including on the Intel® Developer Zone. Follow-up articles will dive deeper into how to use SIMD effectively and discuss the Intel® Advanced Vector Extensions (Intel® AVX) to x86 that are now available in hardware (/en-us/avx/). In particular, using 256-bit Intel AVX, it is possible to rearrange the data on the fly and normalize 8 vectors at a time, bringing the average runtime cost down to 2.7 cycles per vector.