By Thomas Willhalm, Published: 05/17/2013, Last Updated: 05/17/2013

It is only a few weeks until you will get a chance to get your hands on the 4th Generation Intel® Core&tm; Processor Family
formerly code-named Haswell. This architecture will come with some very nice features including Intel® Advanced Vector Extensions 2 (Intel® AVX2). Most notably, Intel®
AVX2 is extending the vector length from 128 to 256 bit for most of the vector instructions. Apart from this, some completely new instructions are introduced. My previous blog was about writing your first program using some of those instructions. I now would like to use the opportunity to show you an example of how
two of my favorite new instructions can be used.
The example that I came up with consists in checking if a bit is set in an array of bits. So, given an array of bits `B`

and a list of integers `i`

, check if the bits _{0},
…,i_{n}`B[i`

are set. The result is again stored as an array of bits. For example, the following sequence of
input integers could be processed:
_{0}],…,B[i_{n}]

In C++, a bit can be conveniently described as a `bool`

and the natural representation of an array of bits is therefore a vector of Boolean values vector. This allows
an easy description of the problem as follows:

```
void check_bits(vector<bool> const& B,
vector<int> const& Input,
vector<bool> & Output)
{
for (int i=0; i<Input.size(); ++i)
if (B[Input[i]])
Output[i] = true;
}
```

In plain C, such abstract data structures are not part of the language. On the other hand, the syntactic sugar in C++ is also hiding a lot of the computations that are executed underneath. The following C version therefore exposes much clearer the instructions that need to be performed:

```
void check_bits(unsigned B[],
unsigned Input[],
unsigned char Output[],
int Length)
{
for (int i=0; i<Length/8; ++i) {
unsigned char Result = 0;
for (int j=0; j<8; ++j) {
int Pos = Input[i*8+j];
unsigned int Bits = B[Pos >> 5]; // extract one double-word
unsigned int SingleBit = (Bits >> (Pos & 31)) & 1; // extract one bit
Result |= SingleBit << j; // accumulate result in double-word
}
Output[i] = Result;
}
}
```

The loop has been turned into a nested loop so that always 8 bits can be extracted in the inner loop and written as a single write of one byte. The variable `Pos`

therefore
holds the overall index. We first read the double-word in `B`

that contains bit number `Pos`

. In other words, we read 32 bits where one of them is the bit we are
interested in. This double-word named `Bits`

has the index `Pos/32 = Pos>>5`

. The subsequent line extracts the bit of interest. Inside the double-word
`Bits`

, this bit is located at position
`Pos%32=Pos&31`

. Our bit is therefore shifted to the right by `Pos&31`

and then masked out with 1. We collect the bits of the inner loop in the Result
variable, which is finally written at the end of the inner loop.
Let’s now convert this code to SIMD instructions. There are two roadblocks that prevent an efficient implementation using Intel® Streaming SIMD Extensions (Intel® SSE) or Intel® Advanced Vector Extensions (Intel® AVX):

- There is an array lookup where the index is not known at compile time:

`unsigned int bits = B[ij >> 5];`

- One of the shift operations has a variable as the shift operand:

`unsigned int SingleBit = (bits >> (ij & 31)) & 1;`

- Intel® AVX2 introduces an instruction that gathers values from memory into a vector register. The base address is given as an argument and the offsets of the values are
provided in another vector register. Additionally, a scaling factor is given for a greater flexibly in the size of the array elements. The instruction that gathers double-words using
double-words as indexes is called VPGATHERDD. The corresponding intrinsic is:

`__m256i _mm256_i32gather_epi32 ( int const * base, __m256i index, const int scale);`

Please refer to the Intel® Advanced Vector Extensions Programming Reference for a more precise definition. The programming reference also describes gather for other data types as well as the usage of gather with masks. - Vector shifts have been available since Intel® Streaming SIMD Extensions 2 (Intel® SSE2). However, all data elements in the vector register were always shifted by the same
number of bits. What is new with Intel® AVX2 is the ability to provide a vector register for a variable shift-count per data element. For logical right shifts, the instruction is
called VPSRLVD and the corresponding intrinsic is:

`__m256i _mm256_srlv_epi32 (__m256i m, __m256i count);`

Other variants in data type and shifts are again listed in the Intel® Advanced Vector Extensions Programming Reference.

Equipped with these powerful instructions, we can now use Intel® AVX2 to implement a SIMD version of the check_bits function:

```
void check_bits(unsigned B[],
unsigned Input[],
unsigned char Output[],
const int Length)
{
__m256i* Input256 = (__m256i*) Input;
for (int i=0; i<Length/8; ++i)
{
__m256i Offsets = Input256[i];
__m256i OffsetsGather = _mm256_srli_epi32(Offsets, 5);
__m256i BGathered = _mm256_i32gather_epi32(B, OffsetsGather, 4);
__m256i OffsetsShift = _mm256_and_si256(Offsets, _mm256_set1_epi32(31));
__m256i BitPos = _mm256_sub_epi32(_mm256_set1_epi32(31), OffsetsShift);
__m256i Bits = _mm256_sllv_epi32(BGathered, BitPos);
Output[i] = _mm256_movemask_ps((__m256)Bits);
}
}
```

The Intel® AVX2 version of check_bits compares eight values in one loop trip. First, the offsets for the array lookup with the gather instruction are computed and the double-word
elements are read from the array of bits B. Then, the bit counts for the shift instruction are determined. These offsets are subtracted from 31 to compute how many bits we need to shift
left such that the bits that we need to extract are in the upmost bit of each double-word. After shifting, these upmost bits need to be converted to a sequence of single bits, each for
one data element. The instruction `movmsk`

performs exactly this operation as it extract the sign bit of each data element.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804