Processing Arrays of Bits with Intel® Advanced Vector Extensions 2 (Intel® AVX2)

It is only a few weeks until you will get a chance to get your hands on the 4th Generation Intel® Core&tm; Processor Family formerly code-named Haswell. This architecture will come with some very nice features including Intel® Advanced Vector Extensions 2 (Intel® AVX2). Most notably, Intel® AVX2 is extending the vector length from 128 to 256 bit for most of the vector instructions. Apart from this, some completely new instructions are introduced. My previous blog was about writing your first program using some of those instructions. I now would like to use the opportunity to show you an example of how two of my favorite new instructions can be used. The example that I came up with consists in checking if a bit is set in an array of bits. So, given an array of bits B and a list of integers i0, …,in, check if the bits B[i0],…,B[in] are set. The result is again stored as an array of bits. For example, the following sequence of input integers could be processed:

In C++, a bit can be conveniently described as a bool and the natural representation of an array of bits is therefore a vector of Boolean values vector. This allows an easy description of the problem as follows:

void check_bits(vector<bool> const& B, 
                vector<int> const& Input, 
                vector<bool> & Output) 
     for (int i=0; i<Input.size(); ++i)    
		if (B[Input[i]])
	      		Output[i] = true;

In plain C, such abstract data structures are not part of the language. On the other hand, the syntactic sugar in C++ is also hiding a lot of the computations that are executed underneath. The following C version therefore exposes much clearer the instructions that need to be performed:

void check_bits(unsigned B[],
	        unsigned Input[],
	        unsigned char Output[],
	        int Length)
    for (int i=0; i<Length/8; ++i) {
        unsigned char Result = 0;
        for (int j=0; j<8; ++j)  {
            int Pos = Input[i*8+j]; 
            unsigned int Bits = B[Pos >> 5]; // extract one double-word
            unsigned int SingleBit = (Bits >> (Pos & 31)) & 1; // extract one bit
            Result |= SingleBit << j; // accumulate result in double-word
        Output[i] = Result;
The loop has been turned into a nested loop so that always 8 bits can be extracted in the inner loop and written as a single write of one byte. The variable Pos therefore holds the overall index. We first read the double-word in B that contains bit number Pos. In other words, we read 32 bits where one of them is the bit we are interested in. This double-word named Bits has the index Pos/32 = Pos>>5. The subsequent line extracts the bit of interest. Inside the double-word Bits, this bit is located at position Pos%32=Pos&31. Our bit is therefore shifted to the right by Pos&31 and then masked out with 1. We collect the bits of the inner loop in the Result variable, which is finally written at the end of the inner loop.

Let’s now convert this code to SIMD instructions. There are two roadblocks that prevent an efficient implementation using Intel® Streaming SIMD Extensions (Intel® SSE) or Intel® Advanced Vector Extensions (Intel® AVX):

  1. There is an array lookup where the index is not known at compile time:
    unsigned int bits = B[ij >> 5];
  2. One of the shift operations has a variable as the shift operand:
    unsigned int SingleBit = (bits >> (ij & 31)) & 1;
With Intel® AVX2 there are two new instructions that address exactly these problems:
  1. Intel® AVX2 introduces an instruction that gathers values from memory into a vector register. The base address is given as an argument and the offsets of the values are provided in another vector register. Additionally, a scaling factor is given for a greater flexibly in the size of the array elements. The instruction that gathers double-words using double-words as indexes is called VPGATHERDD. The corresponding intrinsic is:
    __m256i _mm256_i32gather_epi32 ( int const * base, __m256i index, const int scale);
    Please refer to the Intel® Advanced Vector Extensions Programming Reference for a more precise definition. The programming reference also describes gather for other data types as well as the usage of gather with masks.
  2. Vector shifts have been available since Intel® Streaming SIMD Extensions 2 (Intel® SSE2). However, all data elements in the vector register were always shifted by the same number of bits. What is new with Intel® AVX2 is the ability to provide a vector register for a variable shift-count per data element. For logical right shifts, the instruction is called VPSRLVD and the corresponding intrinsic is:
    __m256i _mm256_srlv_epi32 (__m256i m, __m256i count);
    Other variants in data type and shifts are again listed in the Intel® Advanced Vector Extensions Programming Reference.

Equipped with these powerful instructions, we can now use Intel® AVX2 to implement a SIMD version of the check_bits function:

void check_bits(unsigned B[],
	 unsigned Input[],
	 unsigned char Output[],
	 const int Length)
          __m256i* Input256 = (__m256i*) Input;

          for (int i=0; i<Length/8; ++i)
              __m256i Offsets = Input256[i];
              __m256i OffsetsGather = _mm256_srli_epi32(Offsets, 5);
              __m256i BGathered = _mm256_i32gather_epi32(B, OffsetsGather, 4);
              __m256i OffsetsShift = _mm256_and_si256(Offsets, _mm256_set1_epi32(31));
              __m256i BitPos = _mm256_sub_epi32(_mm256_set1_epi32(31), OffsetsShift);
              __m256i Bits = _mm256_sllv_epi32(BGathered, BitPos);
              Output[i] = _mm256_movemask_ps((__m256)Bits);
The Intel® AVX2 version of check_bits compares eight values in one loop trip. First, the offsets for the array lookup with the gather instruction are computed and the double-word elements are read from the array of bits B. Then, the bit counts for the shift instruction are determined. These offsets are subtracted from 31 to compute how many bits we need to shift left such that the bits that we need to extract are in the upmost bit of each double-word. After shifting, these upmost bits need to be converted to a sequence of single bits, each for one data element. The instruction movmsk performs exactly this operation as it extract the sign bit of each data element.

Pour de plus amples informations sur les optimisations de compilation, consultez notre Avertissement concernant les optimisations.