Don't Spill That Register - Ensuring Optimal Performance From Intrinsics

Download Article

Download Don't Spill That Register - Ensuring Optimal Performance From Intrinsics [PDF 79KB]


The goal of this article is to help developers ensure their C/C++ code with intrinsics produces the optimal assembly, and to show how to spot unnecessary register spilling.


Programming with intrinsics can be as optimal as implementing code directly in assembly. Compared to assembly code, C/C++ code using intrinsics is subject to more compilation steps to generate the final code. Compilation in Debug mode, and possibly in Release mode with improperly set compilation flags, may generate code with seemingly unnecessary instructions for copying registers to and from the stack. Register copying can also result from the source code using more __m256 or __m128 variables than the number of corresponding registers available in hardware. From a simple example using intrinsics, this short article shows good and bad assembly produced and then explains what happened and how to avoid it.


The topics of x86 intrinsics, Intel® Streaming SIMD Extensions (Intel® SSE), and the Intel® Advanced Vector Extensions (Intel® AVX) are discussed in detail online on the Intel® Developer Zone site ( An intrinsic looks just like a function call in C/C++ code, but the compiler sees it and turns that intrinsic into a single line of assembly. For example, consider the following code:

         __m128 a = _mm_rsqrt_ss(b);  // a = 1.0f/sqrt(b) approx

This line of code will cause the compiler to emit an RSQRTSS instruction at this spot.

Intrinsics let the programmer do instruction level optimization directly, but without the burden of dealing with register allocation, loop syntax, etc. Developers sometimes ask, "Are intrinsics as optimal as assembly?" The answer is usually yes, or at least close to optimal. Furthermore, code with intrinsics is more future-proof, since code initially written for Intel SSE can be recompiled using Intel AVX. Intel AVX versions normally run faster than their Intel SSE counterparts on the same hardware. Given the ease of use and forward compatibility, intrinsics are the logical choice for optimizing to the hardware.

To use intrinsics with the confidence that the program is optimal, it is worthwhile knowing how code gets compiled and be aware of what to look out for. We look at a short intrinsic example and see the corresponding assembly that should result, as well as compilation that generated suboptimal results.

Generated Assembly

For this example, we use an intrinsic implementation of the loop s[i]=a*x[i]+y[i], commonly known as "saxpy", and show the code generated.

  inline void saxpy_simd4(float* S,float _a,const float* X,const float *Y,int n)
     __m128 a = _mm_set1_ps(_a); 
     for(int i=0; i!=n ;i+=4)  // process 4 elements at a time
         __m128 x = _mm_load_ps(X+i);
         __m128 y = _mm_load_ps(Y+i);  
         __m128 s = _mm_add_ps(_mm_mul_ps(a,x),y);  // a*x + y 
         _mm_store_ps(S+i, s );

When this x86 code gets compiled, it should look something like:

    (AVX assembly instructions from loop listed only):
 001B4FF0  vmulps      xmm1,xmm0,xmmword ptr xpoints (1B97A0h)[eax]  
 001B4FF8  vaddps      xmm1,xmm1,xmmword ptr ypoints (1B9390h)[eax]  
 001B5000  vmovaps     xmmword ptr dest (1C0440h)[eax],xmm1  
 001B5008  add         eax,10h  
 001B500B  cmp         eax,400h  
 001B5010  jl          saxpy_128+20h (1B4FF0h)  

This assembly sequence was generated by Microsoft Visual Studio* C++ Compiler 2010 with default release mode settings and /arch:AVX added to the command line. Only the loop instructions within the loop that are repeated many times are shown. Variable xmm0 is initially loaded with the constant a. Clearly the first 3 assembly instructions directly map to the intrinsics in the C++ code. The assembly is actually shorter than the corresponding C code, since the register loading intrinsics have been combined with the vmulps and vaddps instructions. The last 3 instructions correspond to the for i loop.

Compiling this small function in another project without optimization flag set resulted in the following assembly:

 00B5F7C2  mov         eax,dword ptr [i]  
 00B5F7C5  add         eax,4  
 00B5F7C8  mov         dword ptr [i],eax  
 00B5F7CB  cmp         dword ptr [i],100h  
 00B5F7D2  jge         saxpy_128+122h (0B5F872h)  
 00B5F7D8  mov         eax,dword ptr [i]  
 00B5F7DB  vmovaps     xmm0,xmmword ptr xpoints (0B715A0h)[eax*4]  
 00B5F7E4  vmovaps     xmmword ptr [ebp-1D0h],xmm0  
 00B5F7EC  vmovaps     xmm0,xmmword ptr [ebp-1D0h]  
 00B5F7F4  vmovaps     xmmword ptr [x],xmm0  
 00B5F7F9  mov         eax,dword ptr [i]  
 00B5F7FC  vmovaps     xmm0,xmmword ptr ypoints (0B71180h)[eax*4]  
 00B5F805  vmovaps     xmmword ptr [ebp-1B0h],xmm0  
 00B5F80D  vmovaps     xmm0,xmmword ptr [ebp-1B0h]  
 00B5F815  vmovaps     xmmword ptr [y],xmm0  
 00B5F81A  vmovaps     xmm0,xmmword ptr [x]  
 00B5F81F  vmovaps     xmm1,xmmword ptr [a]  
 00B5F824  vmulps      xmm0,xmm1,xmm0  
 00B5F828  vmovaps     xmmword ptr [ebp-190h],xmm0  
 00B5F830  vmovaps     xmm0,xmmword ptr [y]  
 00B5F835  vmovaps     xmm1,xmmword ptr [ebp-190h]  
 00B5F83D  vaddps      xmm0,xmm1,xmm0  
 00B5F841  vmovaps     xmmword ptr [ebp-170h],xmm0  
 00B5F849  vmovaps     xmm0,xmmword ptr [ebp-170h]  
 00B5F851  vmovaps     xmmword ptr [s],xmm0  
 00B5F859  vmovaps     xmm0,xmmword ptr [s]  
 00B5F861  mov         eax,dword ptr [i]  
 00B5F864  vmovaps     xmmword ptr dest (0B78240h)[eax*4],xmm0  
 00B5F86D  jmp         saxpy_128+72h (0B5F7C2h)  

Here we see that the compiler issued additional instructions that do not correspond to loop management or intrinsics in the original source code. What is happening here is that after registers are loaded from memory, they are copied to and from the stack. The explanation is that, from the C language perspective, the __m128 variables reside on the stack, and the compiler is just putting the data to the place where it was declared. It is the O2 optimization step, not the fact that we used intrinsics, that is normally responsible for removing such unnecessary copying. The extra copying will likely happen in Debug, but may also happen in Release mode if the project's Optimization setting is not Maximum Speed or /O2. The example shown here compiles to Intel AVX, but the same thing happens with Intel SSE as well.

Register Shortage

When using temporary __m128 or __m256 variables for single instruction multiple data (SIMD) programming, the optimizing compiler usually does a good job of keeping these as registers. Even with optimizations, the compiler may still sometimes generate assembly code that copies temporary values to the stack. Consider for example a 3D spring (distance constraint) update written using hybrid structure of arrays (SOA) style programming. The following example is based on code from the AVX cloth sample available on Intel's website:

  void springupdate(__m256 A[][3], __m256 B[][3],__m256 &restlen)
    __m256 half = _mm256_set1_ps(0.5f)
    for(int i=0;i != N ; i++)  // 8*N constraints in total
      // each a and b contain the xyz endpoints for 8 pseudo-springs
      __m256 *a=A[i];
      __m256 *b=B[i];
      __m256 vx  = _mm256_sub_ps(b[0],a[0]); // v.x=b.x-a.x
      __m256 vy  = _mm256_sub_ps(b[1],a[1]); // v.x=b.x-a.x
      __m256 vz  = _mm256_sub_ps(b[2],a[2]); // v.x=b.x-a.x
      __m256 dp  = vx*vx+vy*vy+vz*vz;        // assume operator overloads for add and mul 
      __m256 imag= _mm256_rsqrt_ps(dp);      // inverse magnitude
      // normalize v
      vx = _mm256_mul_ps(vx,imag); // vx *= inverse magnitude 
      vy = _mm256_mul_ps(vy,imag); // vy *= imag
      vz = _mm256_mul_ps(vz,imag); // vz *= imag 
      __m256 half_stretch = ( dp*imag - restlen) * half;
      // move endpoints a and b together 
      a[0]=a[0]+ vx * half_stretch;    
      a[1]=a[1]+ vy * half_stretch;    
      a[2]=a[2]+ vz * half_stretch;    
      b[0]=b[0]- vx * half_stretch;    
      b[1]=b[1]- vy * half_stretch;    
      b[2]=b[2]- vz * half_stretch;   

For brevity, the above code assumes the obvious operator overloads are implemented and inlined. Even if all intrinsic calls were written out by hand, there is a good chance that compiling a routine like this will produce assembly code that will copy data to and from the stack and/or load the same data from arrays A and B multiple times. While there is no limit to how many variables of this type a programmer uses, there are a limited number of hardware registers available. When Intel AVX code is compiled into a 32-bit executable, the compiler has only 8 YMM registers available. Even with this oversimplified distance constraint equation, the code uses 6 registers for the endpoints, another 3 for the vector v between them, as well as registers for the inverse magnitude, rest length, magnitude minus rest length, half constant, and half stretch amount. Clearly, the compiler must use the same register for more than one variable in this loop. Therefore, it will have to reload values from (and possibly copy values to) the stack. In this situation, the solution to avoid register spilling is to compile the code for 64-bit. Then, instead of just 8, the compiler has 16 YMM (256-bit) registers at its disposal, which is more than enough for this particular simulation.


Starting from a C/C++ intrinsics sample, we've shown the good and the bad of what sort of assembly code can be generated. The suboptimal extra register copying code generation can result from compiler settings such as not using fast-code optimization (O2), or from not using 64-bit when more than 8 registers are needed. There may be other reasons why a compiler might not generate the assembly the programmer expects. Therefore, while intrinsics are often the preferred choice for code optimization, it is still a good idea to inspect the generated assembly to ensure the compiled result is as expected.