Processing Arrays of Bits with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

As announced last week by James, future Intel Xeon processors will add support for byte and word processing in AVX-512. It is therefore time to revisit my blog from last year, where I showed how to use Intel AVX2 for checking if a bit is set in an array of bits. This time however, I will assume that the input consists of bytes, which allows the really nice trick to replace the gather instruction by a permutation.

Recall that, given an array of bits B and a list of integers i0,…,in, we want to check if the bits B[i0],…,B[in] are set. The result is an array of bits where each bit represent the result of the look-up. For example, the following sequence of input integers could be processed:

The C++ code stays the same as before except for the type of the input array, which is now vector<unsigned char>:

void check_bits(vector<bool> const& B, 
          vector<unsigned char> const& Input, 
          vector<bool> & Output) 
{
    for (int i=0; i<Input.size(); ++i)    
		        if (B[Input[i]])
	      		       Output[i] = true;
}

In the C version of the code, I’ll do some more modification.  We use the 16-bit type unsigned short, which will make it easier for us to convert the code to use words in AVX-512. The inner loop therefore processes 32 entries. Consequently, the result is stored as an array of unsigned ints:

void check_bits(unsigned B[],
	 unsigned char Input[],
	 unsigned int Output[],
	 const int Length)
{
	  for (int i=0; i<Length/32; ++i) {
	      unsigned int Result = 0;
	      for (int j=0; j<32; ++j) {
		         int Pos = Input[i*32+j];
		         unsigned short Bits = B[Pos >> 4]; // extract one word
		         unsigned int SingleBit = ((Bits >> (Pos & 15))) & 1; // extract one bit
		         Result |= SingleBit << j; // accumulate result in double-word
		      }
	      
	      Output[i] = Result;
	  }
}

As previously, the variable Pos holds the index. We first read the word in B that contains bit number Pos. In other words, we read 16 bits where one of them is the bit we are interested in. This word named Bits has the index Pos/16 = Pos>>4. The subsequent line extracts the bit of interest. Inside the word Bits, this bit is located at position Pos%16=Pos&15. The bit is therefore shifted to the right by Pos&15 and then masked out with 1. As in the previous version, we collect the bits of the inner loop in the Result variable, which is finally written at the end of the inner loop.

This code can be translated almost directly to Intel AVX-512 instructions. The only roadblock is the array lookup, which normally would be implemented by a gather instruction:

unsigned short Bits = B[Pos >> 4]; // extract one word

Now recall that our input is a list of bytes. Therefore, the bit-vector B is at more 28 = 256 bits long, which nicely fits in a vector register. Therefore, the gather instructions can be replaces by the permute instruction for words that is part of AVX-512BW. This is very attractive as the latency of a permutation is much lower than for a gather instruction. The intrinsic for VPERMV is:

__m512i _mm512_permutexvar_epi16( __m512i idx, __m512i a);

A precise definition as well as further permutation instructions including variations for byte and masks can be found in the Intel® Architecture Instruction Set Extensions Programming Reference.

void check_bits(unsigned B[],
	 unsigned char Input[],
	 unsigned int Output[],
	 const int Length)
{
       	__m256i* Input256 = (__m256i*) Input;
	__m256i Bitvector256 = *((__m256i*)pBitvector);
	__m512i Bitvector512 = _mm512_castsi256_si512(Bitvector256);

 	  for (int i=0; i<Length/32; ++i) {
	      __m512i Offsets = _mm512_cvtepu8_epi16(Input256[i]);
	      __m512i OffsetsGather = _mm512_srli_epi16(Offsets, 4);
	      __m512i OffsetsShifts = _mm512_and_epi32(Offsets, _mm512_set1_epi16(15));
	      __m512i BGathered = _mm512_permutexvar_epi16(OffsetsGather, Bitvector512);
	      __m512i BitPos = _mm512_sllv_epi16(_mm512_set1_epi16(1), OffsetsShifts);
	      __m512i Masked = _mm512_and_epi32(BGathered, BitPos);
	      Output[i] = _mm512_cmpeq_epi16_mask(Masked, BitPos);
	    }
}

The input values are first converted to 16-bit values, then the position of the words in the bit-vector and their shift offset are computed. As discussed above, the whole bit-vector is stored in Bitvector512 outside of the loop and serves as the input to the permutation.

void check_bits(unsigned B[],
	 unsigned char Input[],
	 unsigned int Output[],
	 const int Length)
{
   __m256i* Input256 = (__m256i*) Input;
	   __m256i Bitvector256 = *((__m256i*)pBitvector);
	   __m512i Bitvector512 = _mm512_castsi256_si512(Bitvector256);

 	  for (int i=0; i<Length/32; ++i) {
	      __m512i Offsets = _mm512_cvtepu8_epi16(Input256[i]);
	      __m512i OffsetsGather = _mm512_srli_epi16(Offsets, 4);
	      __m512i OffsetsShifts = _mm512_and_epi32(Offsets, _mm512_set1_epi16(15));
	      __m512i BGathered = _mm512_permutexvar_epi16(OffsetsGather, Bitvector512);
	      __m512i BitPos = _mm512_sllv_epi16(_mm512_set1_epi16(1), OffsetsShifts);
	      __m512i Masked = _mm512_and_epi32(BGathered, BitPos);
	      Output[i] = _mm512_cmpeq_epi16_mask(Masked, BitPos);
	    }
}

The comparison at the end reveals another novelty of Intel AVX-512. Comparison can now produce their result in the new mask registers:

__mmask32 _mm512_cmpeq_epi16_mask(__m512i a, __m512i b);

The result can therefore be directly written to memory as a bit-vector and there is no need for a movmsk instruction anymore as it was the case in the AVX2 version.

As a last optimization, we take advantage of another new instruction. VPTESTM{B,W,D,Q} ands the content of two vector registers <a1,…,an> and <b1,…,bn>. The instruction then returns for each element of the vector, if this operation results in a non-zero value: <(a1 & b1)!=0,…, (an & bn)!=0>. The return type is also a mask register:

__mmask32 _mm512_test_epi16_mask( __m512i a, __m512i b);

This VPTESTMW instruction can therefore replace the and and comparison in our example:

 void check_bits(unsigned B[],
	 unsigned char Input[],
	 unsigned int Output[],
	 const int Length)
{
       	__m256i* Input256 = (__m256i*) Input;
	__m256i Bitvector256 = *((__m256i*)pBitvector);
	__m512i Bitvector512 = _mm512_castsi256_si512(Bitvector256);

 	  for (int i=0; i<Length/32; ++i) {
	      __m512i Offsets = _mm512_cvtepu8_epi16(Input256[i]);
	      __m512i OffsetsGather = _mm512_srli_epi16(Offsets, 4);
	      __m512i OffsetsShifts = _mm512_and_epi32(Offsets, _mm512_set1_epi16(15));
	      __m512i BGathered = _mm512_permutexvar_epi16(OffsetsGather, Bitvector512);
	      __m512i BitPos = _mm512_sllv_epi16(_mm512_set1_epi16(1), OffsetsShifts);
	      Output[i] = _mm512_test_epi16_mask(BGathered, BitPos);
	    }
}

This finishes our little journey into the Intel AVX-512bw instruction set. There are many more new instructions, which open new opportunities for faster code, and I hope that you enjoy exploring them as much as I do.

 

For more complete information about compiler optimizations, see our Optimization Notice.
Tags: