AVX2 extends the integer vector length to 256 bits, compared with 128 bits in SSE4.2. For basic instructions such as packed add, sub and mul, this gives up to a ~2x performance advantage (the vector is twice as wide), but for some instructions the gain is less obvious. This blog is about such instructions: the permutations.
In brief, a number of AVX2 permutation instructions operate on the high and low 128-bit halves of a register separately. These instructions are vpalignr, all vpack instructions, all vpunpck instructions, and vpshufb.
For example (ymm registers are 256 bits wide), the instruction
vpalignr ymm0, ymm2, ymm1, $12
behaves like the pair
palignr ymm0.lo, ymm2.lo, ymm1.lo, $12
palignr ymm0.hi, ymm2.hi, ymm1.hi, $12
where ymm.lo is the low 128 bits of ymm and ymm.hi is the high 128 bits of ymm.
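The lane-wise behaviour can be checked with a small scalar model in C. This is only a sketch: `palignr_model` and `vpalignr_model` are names made up for illustration, and the code models the byte movement of the instruction rather than executing it. In real code the corresponding intrinsic is `_mm256_alignr_epi8`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of 128-bit palignr: concatenate hi:lo (32 bytes), shift right by
 * imm bytes, keep the low 16 bytes. Bytes shifted in from beyond are zero. */
static void palignr_model(uint8_t dst[16], const uint8_t hi[16],
                          const uint8_t lo[16], unsigned imm)
{
    uint8_t cat[32];
    memcpy(cat, lo, 16);        /* low half of the concatenation  */
    memcpy(cat + 16, hi, 16);   /* high half of the concatenation */
    for (unsigned i = 0; i < 16; ++i) {
        unsigned k = i + imm;
        dst[i] = k < 32 ? cat[k] : 0;
    }
}

/* Model of 256-bit vpalignr: the 128-bit operation is applied to the low
 * lanes and to the high lanes independently; nothing crosses between them. */
static void vpalignr_model(uint8_t dst[32], const uint8_t a[32],
                           const uint8_t b[32], unsigned imm)
{
    assert(imm < 256);                              /* 8-bit immediate */
    palignr_model(dst,      a,      b,      imm);   /* ymm.lo */
    palignr_model(dst + 16, a + 16, b + 16, imm);   /* ymm.hi */
}
```

With `b` holding bytes 0..31, `a` holding bytes 32..63 and a shift of 12, the model gives 12..15, 32..43 in the low half and 28..31, 48..59 in the high half, whereas a true 256-bit shift would give the contiguous run 12..43.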
So if you just replace “palignr” with “vpalignr” and “xmm” with “ymm”, you will not get the correct result, because bytes never cross between the two 128-bit halves (darker blue identifies the expected result):
However, if you first prepare the 1st source operand using “vperm2i128”, you get the expected result:
Therefore we need 2 instructions, “vperm2i128” and “vpalignr”, to extend “palignr” to 256 bits.
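A scalar sketch of this two-instruction sequence is given here. The immediate 0x21 for “vperm2i128” is an assumption about what the scheme uses (it is the usual choice for shift counts up to 16 bytes), and the type `v256` and all function names are hypothetical. With intrinsics, the same sequence would be `_mm256_permute2x128_si256` followed by `_mm256_alignr_epi8`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[32]; } v256;  /* b[0..15] low lane, b[16..31] high lane */

/* Model of vperm2i128 (zeroing bits 3 and 7 of imm are not modelled).
 * Lane selector: 0 = a.lo, 1 = a.hi, 2 = b.lo, 3 = b.hi. */
static v256 perm2i128(v256 a, v256 b, unsigned imm)
{
    const uint8_t *src[4] = { a.b, a.b + 16, b.b, b.b + 16 };
    v256 r;
    memcpy(r.b,      src[imm & 3],        16);  /* result low lane  */
    memcpy(r.b + 16, src[(imm >> 4) & 3], 16);  /* result high lane */
    return r;
}

/* Model of vpalignr: per lane, (hi:lo) shifted right by imm bytes. */
static v256 alignr_lanes(v256 hi, v256 lo, unsigned imm)
{
    v256 r;
    for (int lane = 0; lane < 32; lane += 16)
        for (unsigned i = 0; i < 16; ++i) {
            unsigned k = i + imm;
            r.b[lane + i] = k < 16 ? lo.b[lane + k]
                          : k < 32 ? hi.b[lane + k - 16] : 0;
        }
    return r;
}

/* Full 256-bit alignr of hi:lo for shifts of at most 16 bytes. */
static v256 alignr256(v256 hi, v256 lo, unsigned imm)
{
    assert(imm <= 16);
    v256 t = perm2i128(lo, hi, 0x21);  /* t.lo = lo.hi, t.hi = hi.lo */
    return alignr_lanes(t, lo, imm);   /* the prepared t replaces hi */
}
```

The prepared operand supplies each lane with the 128 bits that sit directly above it in the 512-bit concatenation, which is what lets bytes move from the high half of `lo` into the low half of the result.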
Below are schemes for “vpack”, “vpunpck” and “vpshufb”.
Treat "*" as zero. The example operates on quad words to keep the vector pictures short; the vpack instructions themselves operate only on words and double words, but the scheme is the same for them.
Here “vpermq” is used to produce the final vector, so again we need 2 instructions.
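As a sketch of the pack scheme, the scalar model below uses signed double-word to word packing (the behaviour of vpackssdw) followed by a qword permutation. The “vpermq” immediate 0xD8 (qword order 0, 2, 1, 3) is an assumption about what the scheme uses, and the function names are hypothetical. The intrinsics for the real sequence would be `_mm256_packs_epi32` and `_mm256_permute4x64_epi64`.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturation of a double word to a word. */
static int16_t sat16(int32_t x)
{
    return (int16_t)(x > INT16_MAX ? INT16_MAX : x < INT16_MIN ? INT16_MIN : x);
}

/* Model of vpackssdw ymm: in each lane, 4 packed words from a are followed
 * by 4 packed words from b, so the result is a.lo, b.lo, a.hi, b.hi. */
static void packssdw_lanes(int16_t r[16], const int32_t a[8], const int32_t b[8])
{
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 4; ++i) {
            r[8 * lane + i]     = sat16(a[4 * lane + i]);
            r[8 * lane + 4 + i] = sat16(b[4 * lane + i]);
        }
}

/* Model of vpermq: result qword q (4 words here) is the source qword
 * selected by bits 2q+1..2q of imm. r and s must not overlap. */
static void permq(int16_t r[16], const int16_t s[16], unsigned imm)
{
    assert(imm < 256);
    for (int q = 0; q < 4; ++q) {
        unsigned sel = (imm >> (2 * q)) & 3;
        for (int i = 0; i < 4; ++i)
            r[4 * q + i] = s[4 * sel + i];
    }
}

/* Full 256-bit pack: all 8 words from a, then all 8 words from b. */
static void pack256(int16_t r[16], const int32_t a[8], const int32_t b[8])
{
    int16_t t[16];
    packssdw_lanes(t, a, b);
    permq(r, t, 0xD8);  /* swap the two middle qwords */
}
```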
The good news here is that we can get both the low and the high part from ymm0 and ymm1 using “vperm2i128”. To get the low part:
We need 3 instructions to extend either punpckl or punpckh, and only 4 to extend both.
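The instruction count can be illustrated with a scalar model on quad words. It is a sketch under assumptions: `t0` and `t1` below stand for the lane-wise “vpunpckl” and “vpunpckh” results, the “vperm2i128” immediates 0x20 and 0x31 are the ones assumed to select their low and high lanes, and the type and function names are hypothetical. The real intrinsics would be `_mm256_unpacklo_epi64`, `_mm256_unpackhi_epi64` and `_mm256_permute2x128_si256`.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t q[4]; } v256;  /* q[0..1] low lane, q[2..3] high lane */

/* Models of vpunpcklqdq / vpunpckhqdq: interleave within each lane only. */
static v256 unpcklqdq_lanes(v256 a, v256 b)
{
    v256 r = {{ a.q[0], b.q[0], a.q[2], b.q[2] }};
    return r;
}

static v256 unpckhqdq_lanes(v256 a, v256 b)
{
    v256 r = {{ a.q[1], b.q[1], a.q[3], b.q[3] }};
    return r;
}

/* Model of vperm2i128; selector 0 = a.lo, 1 = a.hi, 2 = b.lo, 3 = b.hi. */
static v256 perm2i128(v256 a, v256 b, unsigned imm)
{
    assert((imm & 0x88) == 0);  /* zeroing bits are not modelled */
    const uint64_t *src[4] = { a.q, a.q + 2, b.q, b.q + 2 };
    const uint64_t *lo = src[imm & 3], *hi = src[(imm >> 4) & 3];
    v256 r = {{ lo[0], lo[1], hi[0], hi[1] }};
    return r;
}

/* Both full-width results from 4 modelled instructions. */
static void unpck256(v256 *low, v256 *high, v256 a, v256 b)
{
    v256 t0 = unpcklqdq_lanes(a, b);  /* a0 b0 | a2 b2 */
    v256 t1 = unpckhqdq_lanes(a, b);  /* a1 b1 | a3 b3 */
    *low  = perm2i128(t0, t1, 0x20);  /* a0 b0 a1 b1: the two low lanes  */
    *high = perm2i128(t0, t1, 0x31);  /* a2 b2 a3 b3: the two high lanes */
}
```

Either result alone costs both unpacks plus one “vperm2i128”, and the second result reuses `t0` and `t1`, so it costs only one more instruction.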
Say we have the following 256-bit vector of 16-bit words:
and we want to end up with the following vector:
Then we’ll need 2 “vpshufb”, a “vpermq” and a “vpor” to complete the permutation. First we shuffle the elements that move to the other half of the vector (from high to low and from low to high). Here they are “F” and “0”. The other positions are filled with zeroes “*” (vpshufb writes zero to a position when the corresponding mask byte is -128, that is, when its high bit is set). Remember that we can operate only within the 128-bit halves of a vector.
Next, shuffle the remaining elements. Positions that are already done are filled with zeroes “*”:
Finally, combine the 2 vectors into 1:
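The whole sequence can be sketched as a scalar model for an arbitrary byte permutation. Several things here are assumptions rather than part of the scheme above: the “vpermq” is taken to swap the two halves (immediate 0x4E) right after the first shuffle, the masks are computed at run time for clarity although real code would normally use precomputed constants, and `v256`, `permute_bytes` and the other names are hypothetical. The real intrinsics would be `_mm256_shuffle_epi8`, `_mm256_permute4x64_epi64` and `_mm256_or_si256`.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t b[32]; } v256;  /* b[0..15] low lane, b[16..31] high lane */

/* Model of vpshufb: per lane, a mask byte with the high bit set gives zero,
 * otherwise its low 4 bits index a byte of the same lane. */
static v256 shufb_lanes(v256 s, v256 m)
{
    v256 r;
    for (int i = 0; i < 32; ++i)
        r.b[i] = (m.b[i] & 0x80) ? 0 : s.b[(i & 16) + (m.b[i] & 15)];
    return r;
}

/* Model of vpermq with imm 0x4E (qword order 2, 3, 0, 1): swap the lanes. */
static v256 swap_lanes(v256 s)
{
    v256 r;
    for (int i = 0; i < 32; ++i) r.b[i] = s.b[i ^ 16];
    return r;
}

/* Model of vpor. */
static v256 or256(v256 x, v256 y)
{
    v256 r;
    for (int i = 0; i < 32; ++i) r.b[i] = x.b[i] | y.b[i];
    return r;
}

/* result.b[i] = s.b[idx[i]] for any idx with entries in 0..31. */
static v256 permute_bytes(v256 s, const uint8_t idx[32])
{
    v256 cross, same;
    for (int i = 0; i < 32; ++i) {
        assert(idx[i] < 32);
        int crosses = (idx[i] ^ i) & 16;  /* source and target lanes differ */
        same.b[i]       = (uint8_t)(crosses ? 0x80 : (idx[i] & 15));
        cross.b[i ^ 16] = (uint8_t)(crosses ? (idx[i] & 15) : 0x80);
    }
    v256 moved = swap_lanes(shufb_lanes(s, cross));  /* lane-crossing bytes */
    v256 kept  = shufb_lanes(s, same);               /* bytes staying put   */
    return or256(moved, kept);                       /* zeroes make OR safe */
}
```

For example, setting `idx` to the identity except `idx[0] = 30`, `idx[1] = 31`, `idx[30] = 0`, `idx[31] = 1` exchanges the first and last 16-bit words, an illustrative permutation in which both elements cross lanes. Because each position is zero in exactly one of the two intermediate vectors, the OR merges them without interference.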