Vectorized Reduction 2: Let the Compiler do that Voodoo that it do so well

By Clay B.,

Published:04/15/2016   Last Updated:04/15/2016

As I mentioned in my previous post about writing a vectorized reduction code from Intel vector intrinsics, that part of the code was just the finishing touch on a loop computing squared difference of complex values. When I wrote the code in C++, the Intel compiler was able to vectorize the computations I needed. Part of that must be taking the partial values stored in a vector register and summing them up to the final answer (a reduction operation). I decided to see if I could discover how that was done by looking through the generated assembly language code.

I took a course in Assembly language programming in my second year at college. Just to date myself a bit, this was done on the 6502 processor, which you found in the Apple II desktop system. After that I have been aware that subsequent processor architectures have used different and much expanded instruction sets in the assembly language that can be executed on that architecture. I can still tell assembly language code from statements of a high-level programming language. I can even figure out what some of the more basic instructions within the assembly code are supposed to do. But that’s about all I can do without some kind of help. (It’s like my ability to pronounce words in German from my two years of language study in high school; I likely don’t know what those words mean, but I can speak them.)

Thus, poring through several hundred lines of generated assembly code seemed a fairly daunting task. Fortunately, if you compile your code with the -S flag, the Intel compiler will generate an assembly code version of the source, but not link, and that assembly code file is annotated with source code line numbers. This made the process of finding the point in the code that implements the reduction operation a little easier. There was no explicit reduction in the original C++ code; it is just the end of the computation loop where all partial solutions need to be summed into the single variable that was designated to contain that final computed value.

Not to bore you with the details of my hunt, here is the code I located that performs the reduction of four double-precision floating point values, using AVX, into one value as it was given in the generated assembly code file:


   .loc    1  54  is_stmt 1

        vextractf128 $1, %ymm0, %xmm1                           #54.17


        vaddpd    %xmm1, %xmm0, %xmm2                           #54.17


        vunpckhpd %xmm2, %xmm2, %xmm3                           #54.17


        vaddsd    %xmm3, %xmm2, %xmm0                           #54.17

I recognize two of the operations as being addition, but the other two are a little trickier. Certainly, they don’t appear to be anything that is obviously permuting data, which is how I wrote my reduction operation. A little sensible use of Google and the purpose of these unknown instructions became clear. Oddly, though, the site I used had all the parameters reversed from the order they are shown in the generated assembly code file. From the context and what I knew needed to happen, I simply overlooked that and decoded exactly what is going on.

As I did with the version I wrote in my previous blog, I will look at each of the four instructions and describe how it is being used for this reduction computation. Also, as before, I’ve added a visual device to demonstrate what values are being held and manipulated within the processor registers (and these will be the vector registers, not just my stand-in names for registers). Rather than Pai Gow tiles, though, I’ve used pictures of billiard balls and their printed value to represent a double-precision value. (I hope my readers appreciate this since I’m now banned for life from Ginger’s Billiard Emporium and Tattoo Parlor.)

Before getting into the reduction I must tell you, dear reader, that there is an instruction performing a move of some value into the %ymm0 AVX register. The move is taking data from somewhere using some form of register and addressing offset mode that I didn’t bother to look into. Let’s just assume that the final four partial sums have ended up in the %ymm0 register. For the example here, I’m going to start with the same four values that I used in the write-up of my hand-written version. Thus, the values of 2.0, 3.0, 2.0, and 5.0 are stored in the %ymm0 register. In the visualization it would look like:

YMM0 register contents after computation

Once the final computation of partial squared differences is done, the first instruction to execute is:

        vextractf128 $1, %ymm0, %xmm1

This instruction extracts two of the double-precision values (128 bits) from the 256-bit AVX register %ymm0 and puts them into the 128-bit SES vector register %xmm1. The $1 parameter is the bitmask to controls which values are extracted. (I did know that the “$” notation is used to signify a constant value.) A value of 0 in the bitmask will take the lower half (bits 127:0) while a value of 1 copies out the upper half (bits 255:128). Knowing this we can determine that the two upper values (2.0 and 3.0) are moved into the %xmm1 register:

XMM1 register contents after extraction

The second instruction to execute is simply an addition.

      vaddpd    %xmm1, %xmm0, %xmm2

One of the addends is the %xmm1 register that has just been loaded with half of the partial solution values and the destination register for the sum is %xmm2. The other addend is another SES register: %xmm0. Where did that come from? I read back through the assembly language code to find where this register was loaded and how it might be involved. Nothing made sense.

As I looked at the final two statements, I surmised that perhaps there was some relationship between the %ymm0 and %xmm0 registers. I posed the question to a colleague within the compiler support team here at Intel. She wrote back to tell me that, yes, the %xmm0 was simply the lower half of the %ymm0 register. She pointed me to the online Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture. There at the top of page 326 (Chapter 14 - Programming with AVX, FMA and AVX2) is Figure 14-1 that shows how the %xmm0 register is simply the lower half of the %ymm0 register.

With that relationship made clear, it is easy to see that this addition will sum up two of the original partial sums with the other two and yield two further partial solutions. In the illustration below, I’ve extended the %xmm1 and %xmm2 registers to the full AVX register that they are a part of to line this up with the %xmm0 portion of the %ymm0 register (note the background color difference between the full ymm register and xmm register portion). Those parts that aren’t involved with the computation have been grayed out or replace by the cue ball (as a kind of “doesn’t matter” value representation).

Register contents after vector addition

Now, halfway to the final result, the next instruction is another data movement instruction.

        vunpckhpd %xmm2, %xmm2, %xmm3

This is not a permutation or a simple move of the contents from one register to another. It is an “unpack” instruction that interleaves parts of two registers into the destination register. In this case, the single value in the upper half of two xmm registers are placed into the upper and lower parts of the destination xmm register. The 'h' in the instruction name stands for "high" bits (versus 'l' for "low" bits), but I had to look that up, too. 

The two source registers in the above instruction are both %xmm2 and the destination register is %xmm3. Essentially, the value stored in the upper half of the %xmm2 is duplicated into the two values of %xmm3. The contents of the source and destination registers, after executing this instruction, are shown below. (Since this instruction only deals with the 128-bit xmm registers I’ve now confined the illustration to just that part of the register.)

XMM2 and XMM3 register contents after unpack instruction

The last instruction is to add the two partial solutions together to compute the final value. The instruction that does this is:

        vaddsd    %xmm3, %xmm2, %xmm0

At first glance you might expect that this vector addition will make two additions. One that will give me the expected final value I am hoping for and another that does a superfluous addition with the other values in the vector registers because there is no penalty to doing that, as long as that value is ignored later. I first thought this. However, there is a subtle difference between this addition operation and the previous one.

Notice the difference in “spelling” of the two addition operation within the four lines of assembly code. The first instruction ends with “pd”. The ‘d’ denotes that the operands are double-precision floats; the ‘p’ denotes that the data is packed, or that there are multiple values packed into the register that are to be used. Contrast that with the final instruction of the reduction where the second-to-last character in the instruction name is ‘s’. This signifies a scalar operation. Hence, only the lower 64 bits, or one double-precision value, will be used in the addition out of the source registers (here %xmm2 and %xmm3). No wasted effort, no extraneous operations.

This final computation then would be illustrated as I’ve shown here. The upper 64 bits from each of the registers is ignored or “don’t care”.

Register contents after final (scalar) addition

And that is how the Intel compiler “writes” an AVX vector reduction of double-precision float values. It’s not magic or voodoo or some sleight of hand. It is simply knowing everything there is to know about your tools and devices, then being able to take advantage of all the little tricks while avoiding the traps.

You might be tempted to ask if one version of this operation is somehow “better” than the other. One measure would be execution time. To figure that out you would need to find documentation that gives the clock ticks for each of the individual assembly instructions used within both versions. Even if you were willing to go through all that research (I’m not), it really is only four assembly code instructions. To make any tangible difference in your own code, that application would need to compute thousands or hundreds of thousands or millions of reductions. It would seem to me that to compute thousands or hundreds of thousands or millions of loops and computations required to set up the data for all those reductions would overshadow the slim margin one version might have over the other.

A better metric is ease of use. The above version wins that measure hands down. To generate the code above, I simply coded a for-loop and made sure that it vectorized. If the compiler doesn’t recognize that the loops you've written will vectorize, you can add some pragmas or try the OpenMP 4.0 SIMD pragmas to force that. (It took me more time to write that last sentence than it would take me to insert one line to my loop of interest.) Like threading, the compiler with OpenMP will assume that you, the programmer, knows what you're doing and vectorize as requested, even if it will yield incorrect answers.

The added benefit of allowing the compiler to do the vectorization--with hints from you for when you know things are safe--is that this will “future-proof” your code for future architectures and bigger and better vectorization instruction sets. There is plenty of advice and examples over at the Intel® Modern Code Developer Community to help you help the compiler vectorize your applications to the fullest extent possible. If you code your own intrinsics to implement some vectorization scheme, there will come a time when all of that will need to be recoded if you want to port your code to the next generation of vector hardware. Personally, I’d rather be in the situation where I simply call in from some beach in Bali to instruct my colleagues chained to their desks that a simple recompile of the code with the new vector flags will keep the application relevant. (And that, in my broken German, would be Ausgezeichnet!)

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at