By Clay B., Published: 04/12/2016, Last Updated: 04/12/2016

To prove to a client that changing their indexed collection of complex numbers from an array of structures (with a real part and an imaginary part) to a structure of arrays would improve performance, I created a simplified version of their hotspot code for computing the sum of squared differences. At its most basic, this was a for-loop that takes two equal-sized vectors of complex values, computes the difference of corresponding real values and the difference of corresponding imaginary values, squares the two computed differences, adds the two squared differences together, and then adds that sum to a running total.
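To make the starting point concrete, here is a sketch of what that hotspot loop might look like in the original array-of-structures layout. The `Complex` struct and the function name `ssd_aos` are my own illustrative inventions, not the client's actual code:

```cpp
#include <cstddef>

// Hypothetical array-of-structures layout, as in the client's original code.
struct Complex { double re, im; };

// Sum of squared differences over two equal-sized arrays of complex values.
double ssd_aos(const Complex* a, const Complex* b, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double dre = a[i].re - b[i].re;   // difference of real parts
        double dim = a[i].im - b[i].im;   // difference of imaginary parts
        total += dre * dre + dim * dim;   // square, add, accumulate
    }
    return total;
}
```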

To illustrate the point I was trying to make, I implemented the for-loop computation with the client’s original array of complex **struct**s and then added another version of the computational loop, but with a structure that held an array of real values and an array of imaginary values. In both cases I was using double-precision floating point values in my C++ code. As expected, due to the better layout of data, my structure of arrays method did measurably better than the original data structure. Wanting to go a bit further, I compiled my example with AVX vectorization enabled. Those results were an even better argument for the change of data structure.
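The structure-of-arrays version of the same loop might look like the sketch below. Again, the `ComplexVec` struct and the name `ssd_soa` are my own; the point is that the two unit-stride arrays give the compiler a much easier target for AVX vectorization than interleaved real/imaginary pairs:

```cpp
#include <cstddef>

// Hypothetical structure-of-arrays layout: separate real and imaginary arrays.
struct ComplexVec {
    const double* re;
    const double* im;
};

// Same sum of squared differences, but each array is accessed contiguously.
double ssd_soa(const ComplexVec& a, const ComplexVec& b, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double dre = a.re[i] - b.re[i];
        double dim = a.im[i] - b.im[i];
        total += dre * dre + dim * dim;
    }
    return total;
}
```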

Thinking to push myself, I decided to write another version of the two tests using AVX vector intrinsics in my source code. I realized that the heart of the computation was pretty simple: copy data to vector registers, duplicate and permute vector register contents, do a couple multiplications and some additions. I knew all these operations were supported, so I went to the online Intel Intrinsics Guide page to figure out exactly what intrinsics would be needed. Choosing the technology supported by my target system and selecting categories of operations, I quickly scoped out all the intrinsic statements I needed. That is, until I came up against the realization that I would need a reduction operation to make the final addition of the separate summations within the final vector register.

What could be easier? I mean, reduction is a basic operation in parallel computation, right? OpenMP and Intel Threading Building Blocks both have built-in reduction operations and half of Chapter 6 in The Art of Concurrency describes how to implement parallel sum (reduction). Heeding the advice in the latter (and being the lazy programmer that I am), I went in search of an online example code. I presumed someone somewhere must have already coded up a vector intrinsic version of reduction. After 20 minutes with Google, I hadn’t found anything even close. Thus, I turned back to the Intel Intrinsics Guide and knuckled down to write one myself.

In order to prevent any stress to my reader due to prolonged anticipation, here is my stand-alone implementation of the algorithm I devised to do a reduction of 4 double-precision values within an AVX vector register and print out the final answer (just to check results and show it was done correctly). There are four steps in the algorithm that will take the values in the **ymm0** vector variable, sum up the four double-precision values, and leave that sum in each of the four elements of the final vector variable (**ymm4**). After the full code I will describe each of the four algorithmic steps and intrinsics used in more detail.

```
#include <iostream>
// get AVX intrinsics
#include <immintrin.h>

double indata[4]  = {2.0, 3.0, 2.0, 5.0};
double outdata[4] = {0.0, 0.0, 0.0, 0.0};
__m256d ymm0, ymm1, ymm2, ymm3, ymm4;

int main()
{
    ymm0 = _mm256_loadu_pd(indata);                   // load the four values
    ymm1 = _mm256_permute_pd(ymm0, 0x05);             // swap within 128-bit lanes
    ymm2 = _mm256_add_pd(ymm0, ymm1);                 // pairwise partial sums
    ymm3 = _mm256_permute2f128_pd(ymm2, ymm2, 0x01);  // swap the 128-bit lanes
    ymm4 = _mm256_add_pd(ymm2, ymm3);                 // full sum in every element
    _mm256_storeu_pd(outdata, ymm4);
    std::cout << "Out vector: " << outdata[0] << " " << outdata[1] << " "
              << outdata[2] << " " << outdata[3] << std::endl;
}
```

To help in the description, since I myself am a visual learner, I have put together an illustration of the contents of the vector variables at each phase of the reduction algorithm that will assist in showing the operation of each of the AVX intrinsics. Just for the sake of doing something different and interesting I will be using Pai Gow dominoes to represent the four values within the AVX vector registers. For my purposes I will ignore the configuration of dots, the color of the dots, the name of the tiles, and I hope you can, too. Simply count the number of spots on each of the tiles to determine the value within each part of the vector. (My code is using double-precision floating point values while the illustrative dominoes have an integral number of dots.)

In the example code, I have loaded the data on which to perform the reduction into the **ymm0** variable. (Though I’ve named these variables after the vector registers within the processor, there is no implied or explicit mapping of my names to the actual registers used within the compiled code. Also, rather than saying “vector variable” for these, I will call them “registers” just to be less verbose.) To get started, the program simply loads the values of **indata** into **ymm0**. If this algorithm were incorporated in the sum of squared differences code that was driving this whole thing, this initial data would be the results of those vectorized computations. In any event, the data to be reduced starts in the **ymm0** register. Here’s my illustration of initial data of 4 values:

The first instruction for the reduction is to copy the data within the **ymm0** register, permute the order of the vector elements and store this permutation in the **ymm1** register.

```
ymm1 = _mm256_permute_pd(ymm0, 0x05);
```

The first parameter is the source of the data, **ymm0**, and the second is an integer bitmask that controls how the elements from the source are permuted. Each bit of the mask selects the source for one destination element. The least significant bit (index [0]) selects the source of the least significant destination element (bits 63:0): a ‘0’ copies the least significant source element (bits 63:0), while a ‘1’ copies the second least significant source element (bits 127:64). The second bit in the mask (index [1]) selects the source of the second destination element (bits 127:64) the same way: a ‘1’ copies the second least significant source element (the same position), while a ‘0’ copies the least significant source element. This instruction works within 128-bit lanes, so the next two mask bits permute the upper half of the vector register in exactly the same fashion (just add 128 to all the bit indices).

(Now, if you followed my marginally tortuous explanation of the operation of this intrinsic, then you’re a better reader than I am. For the rest of us, there is a picture below illustrating what will be stored in **ymm1**, which will be worth at least 1000 less convoluted words. Either way, I suggest that you consult the Intel Intrinsics Guide for the official explanation of the instructions, the parameters, and the operation of all the intrinsics I’m using.)

For my reduction operation, the bit mask I used is ‘0101’ (or 0x05 in hexadecimal). This will exchange the position of the values in the lower pair of vector elements and exchange the position of the values in the upper pair of vector elements. Compare the picture below with the picture in the figure above to see exactly how the elements end up in the **ymm1** register.

The next instruction is an addition of the original data and the permuted copy.

```
ymm2 = _mm256_add_pd(ymm0, ymm1);
```

This is a pretty basic instruction taking two source registers (**ymm0**, **ymm1**), adding the corresponding elements, and storing the sums into the corresponding elements of the destination register. I’ve illustrated this below with the two example source registers and what would be stored in the **ymm2** destination.

After the second operation, the reduction is halfway complete. From the illustration we can see that the sum of the two upper and the two lower halves of the original vector register have been summed and are duplicated in the two upper elements and two lower elements, respectively.

To finish the reduction, I simply need to execute two similar operations: permute and add. The permutation, however, will need to copy 128-bits of vector data. This is done with the following intrinsic:

```
ymm3 = _mm256_permute2f128_pd(ymm2, ymm2, 0x01);
```

If you look up this instruction in the Intel Intrinsics Guide, you will notice that the first two parameters are (potentially) two different vector source registers. I’ve used the same register here since I really only need to permute (or copy) data from a single source. As with the previous permutation instruction I’ve used, the last parameter is a bitmask to control what gets copied to the destination register.

From the documentation I was using, the bitmask uses eight bits with the lower nibble dictating what is copied to the lower 128 bits of the destination and the upper nibble controlling the value copied into the upper 128 bits of the 256-bit register. With two input sources, there are 4 possibilities from where to copy 128-bit register contents: lower half (bits 127:0) of first source (‘0’), upper half (bits 255:128) of first source (‘1’), lower half of second source (‘2’), and upper half of second source (‘3’).

One thing I found confusing when researching this intrinsic in the Intrinsics Guide is that while the bitmask is consumed 4 bits at a time, only 2 bits are needed to select among the four possible sources of data. From the operation pseudocode shown on the site, it would appear that a binary value of ‘1101’ will be treated the same as ‘0001’, with the two uppermost bits of each nibble simply thrown out. (It isn’t: setting the high bit of a nibble zeroes the corresponding half of the destination. If you’re curious, I urge you to play around with the program above and the bitmask in the second permutation intrinsic.) For my purposes, I didn’t need to worry about this since I was simply swapping the upper and lower 128-bit sections of the same source register, **ymm2**, into the destination register.

The example illustration below shows the source register, **ymm2**, and the contents of the destination register, **ymm3**, after this permutation intrinsic is executed. (Can you see that a bitmask of ‘0x21’ in the above instruction would yield the same results?)

Finally, now that everything is lined up, a simple vector addition of the **ymm2** and **ymm3** registers will complete the reduction of the values from the original vector register.

```
ymm4 = _mm256_add_pd(ymm2, ymm3);
```

And that would look like this in my Pai Gow tile illustration:

And that’s all there is for doing a sum reduction on the four double-precision elements from an AVX 256-bit vector register. The last couple of instructions from the full code example copy the four result values from the final vector register and then print them out to confirm the correct answer.

Can this algorithm be done with different data types and under different vector technologies? I haven’t looked at the general problem in too much detail, but if an appropriate set of permutation intrinsics is available, it should be possible to construct a similar reduction algorithm for different sizes and types of data.

Now you might be wondering whether or not the effort I put into researching intrinsics, decoding the operation details, and trying different combinations of parameters and bit masks until I figured out the right formula was worth it all. Sadly, not really. Comparing my hand-coded vectorization of the sum of squared differences, including the four-step reduction, to the compiler generated vectorization of the simple for-loop code version, there was a very slim margin of better performance from my hand-coded vectorization. This tells me that someone working on the compiler has probably already figured out something similar to what I have described above. The lesson here just reinforces the adage about *letting the compiler find and transform your source code into vectorized code with the programmer giving hints or small rewrites as needed*.

All in all, the whole exercise does have a silver lining. I’m thinking that, with this publication, it might be the first (and only) place on the web where a Google search for “vectorized reduction” will yield a tangible algorithm. I doubt anyone will be beating down my door to get this, though. Perhaps I should have kept installing microwave ovens, custom kitchen deliveries, and moving those refrigerators and color TVs.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804