# Get _mm_alignr_epi8 functionality on 256-bit vector registers (AVX2)

Hello,

I'm porting an application from SSE to AVX2 and KNC.

I have some _mm_alignr_epi8 intrinsics. For KNC I just had to replace this intrinsic with _mm512_alignr_epi32 (by the way, this intrinsic is missing for KNC in http://software.intel.com/sites/landingpage/IntrinsicsGuide/), but the 256-bit version, _mm256_alignr_epi8, does something unexpected. It is not an extension of the previous 128-bit instruction to 256 bits: it performs a 2x128-bit alignr on 256-bit vectors, which is not the expected behaviour if we look at its counterparts in AVX-512 and KNC.

Does someone know the most efficient way of implementing the extension of _mm_alignr_epi8 to 256-bit vectors using AVX2 intrinsics?

I.e., being V1={7, 6, 5, 4, 3, 2, 1, 0} and V2={15, 14, 13, 12, 11, 10, 9, 8}, the output of this operation should be V3={8, 7, 6, 5, 4, 3, 2, 1} and not V3={12, 7, 6, 5, 8, 3, 2, 1}, which is what I get using _mm256_alignr_epi8.
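To make the two behaviours concrete, here is a scalar sketch (plain C++, no intrinsics, function names are mine) of both semantics for 32-bit elements. Note the arrays list the low element first (setr order), the reverse of the listing above:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Eight 32-bit elements, low element first (setr order).
using V8 = std::array<uint32_t, 8>;

// What a full-width 256-bit alignr would do: shift the 16-element
// concatenation (hi:lo) right by n elements and keep the low 8.
V8 full_alignr(const V8& hi, const V8& lo, int n) {
    V8 r{};
    for (int i = 0; i < 8; ++i) {
        int src = i + n;
        r[i] = src < 8 ? lo[src] : hi[src - 8];
    }
    return r;
}

// What _mm256_alignr_epi8 actually does: the same shift, but
// performed independently within each 128-bit (4-element) lane.
V8 lane_alignr(const V8& hi, const V8& lo, int n) {
    V8 r{};
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 4; ++i) {
            int src = i + n;
            r[lane * 4 + i] = src < 4 ? lo[lane * 4 + src]
                                      : hi[lane * 4 + src - 4];
        }
    return r;
}
```

With lo = {0..7} and hi = {8..15} and n = 1, full_alignr yields {1..8} (the desired V3) while lane_alignr yields {1, 2, 3, 8, 5, 6, 7, 12} (the per-lane result observed above).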


Hello Diego,

```
__m256i v1 = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
__m256i v2 = _mm256_setr_epi32(15, 14, 13, 12, 11, 10, 9, 8);

__m256i v = _mm256_blend_epi32(v1, v2, 0x80);
v = _mm256_permutevar8x32_epi32(v, _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6));
```
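For completeness, a self-contained check of this sequence (the wrapper name is mine; the target pragma is only there so GCC/Clang accept AVX2 intrinsics without extra compile flags, and it assumes an AVX2-capable CPU at runtime):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#pragma GCC target("avx2")  // assumes an AVX2-capable CPU

// Blend element 7 of v2 into v1, then rotate that element to position 0:
// this reproduces the full-width one-element shift the OP asked for.
static __m256i align_one_epi32(__m256i v1, __m256i v2) {
    __m256i v = _mm256_blend_epi32(v1, v2, 0x80);
    return _mm256_permutevar8x32_epi32(
        v, _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6));
}

static bool check_align_one() {
    __m256i v1 = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i v2 = _mm256_setr_epi32(15, 14, 13, 12, 11, 10, 9, 8);
    int32_t out[8];
    _mm256_storeu_si256((__m256i*)out, align_one_epi32(v1, v2));
    const int32_t want[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    for (int i = 0; i < 8; ++i)
        if (out[i] != want[i]) return false;
    return true;
}
```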

Hope, Intel knows how to do it using just one instruction ;)

Thank you vvsed.

Very useful, though permutevar is an expensive instruction (in addition to the blend).

Let's see if someone else knows another approach, but I'm afraid it won't be much more efficient.

Cheers.

```
__m256i hi = _mm256_permute2x128_si256(a, b, 0x21); // [a.hi | b.lo], a = low vector, b = high vector
__m256i r  = _mm256_alignr_epi8(hi, a, offset);     // offset < 16 bytes
```

This only gives you access to offsets of up to 16 bytes, but it can be useful.

Diego,

I've compared both approaches.
"blend + permutevar" turns out to be the same speed (VC) or even faster (GC) than
Christopher's "permute2x128 + alignr" in a tight loop.
permutevar of course needs an additional constant register for the indexes.

Most AVX/AVX2 instructions are designed to operate independently on the two 128-bit lanes of the 256-bit registers, and vpalignr is no exception. Because of this design it is often more efficient to process data in 2 parallel streams. In order to optimize data loads and stores, one would load 256 bits of data from each of the two streams, then perform a butterfly transform, and then perform the calculations on the 128-bit lanes.

```
// Butterfly transform
__m256i mm_lo = _mm256_permute2x128_si256(mm1, mm2, 0x20);
__m256i mm_hi = _mm256_permute2x128_si256(mm1, mm2, 0x31);

// Process the two streams
__m256i mm_aligned = _mm256_alignr_epi8(mm_hi, mm_lo, 1);
```

If your data processing pattern also produces 256 bits of output data per stream, you can perform a second butterfly and then store 256-bit results for each stream.
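A hedged round-trip sketch of this pattern over two byte streams (the function name and the one-byte offset are illustrative; the target pragma is only there so GCC/Clang compile AVX2 intrinsics without extra flags):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#pragma GCC target("avx2")  // assumes an AVX2-capable CPU

// Load 32 bytes from each stream, butterfly so lane 0 holds stream 1
// and lane 1 holds stream 2, then one vpalignr advances both streams
// by one byte at the same time.
void two_stream_shift1(const uint8_t* s1, const uint8_t* s2, uint8_t* out) {
    __m256i mm1 = _mm256_loadu_si256((const __m256i*)s1);
    __m256i mm2 = _mm256_loadu_si256((const __m256i*)s2);

    __m256i mm_lo = _mm256_permute2x128_si256(mm1, mm2, 0x20); // [s1 lo | s2 lo]
    __m256i mm_hi = _mm256_permute2x128_si256(mm1, mm2, 0x31); // [s1 hi | s2 hi]

    // Lane 0 = bytes 1..16 of s1, lane 1 = bytes 1..16 of s2.
    __m256i aligned = _mm256_alignr_epi8(mm_hi, mm_lo, 1);
    _mm256_storeu_si256((__m256i*)out, aligned);
}
```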

That said, the 256-bit wide align instruction is indeed missing. The above approach doesn't work for the original motivating example of palignr - to align memory accesses while processing unaligned data. _mm256_alignr_epi8 cannot be used to align memory accesses to 32-byte boundaries.

Very interesting! Thank you.

Sorry vvsed, could you please tell me what GC and VC stand for?

Andysem, let me continue the discussion regarding your example. If it is intended to mitigate the (nowadays smaller) penalty of unaligned load accesses, is it worth it? Two aligned loads + two permutations (3 cycles latency each) + an alignr, instead of just one unaligned load instruction?

If it were, this transformation would be useful when you only operate on these accesses among themselves, but not when you operate on them together with other, aligned loads. In that case you would have to apply the same butterfly transform to every load involved, even those that are properly aligned or already in registers.

Quote:

Diego Caballero wrote:

Andysem, let me continue the discussion regarding your example. If it is intended to mitigate the (nowadays smaller) penalty of unaligned load accesses, is it worth it? Two aligned loads + two permutations (3 cycles latency each) + an alignr, instead of just one unaligned load instruction?

Newer CPUs perform better with unaligned memory accesses, but there is still a significant penalty when the access spans multiple cache lines. Also, on Sandy/Ivy Bridge an unaligned 256-bit access is slower than 2 unaligned 128-bit accesses. I don't remember the exact numbers now, but I think this was discussed earlier on this forum. The point is that aligning memory accesses may still be beneficial for memory-bound algorithms.
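As an illustration of that point, a hedged sketch of replacing one unaligned 256-bit load with two unaligned 128-bit halves (the helper name is mine; the pragma only lets GCC/Clang compile the intrinsics without extra flags):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#pragma GCC target("avx2")  // assumes an AVX2-capable CPU

// Emulate _mm256_loadu_si256 with two unaligned 128-bit loads,
// the pattern reported faster on Sandy/Ivy Bridge for unaligned data.
static __m256i loadu2_si256(const void* p) {
    __m128i lo = _mm_loadu_si128((const __m128i*)p);
    __m128i hi = _mm_loadu_si128((const __m128i*)p + 1);
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}
```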

Quote:

Diego Caballero wrote:

If it were, this transformation would be useful when you only operate on these accesses among themselves, but not when you operate on them together with other, aligned loads. In that case you would have to apply the same butterfly transform to every load involved, even those that are properly aligned or already in registers.

Not sure I understand you. Of course, if your algorithm permits, you can omit both butterfly and align stages. However, that is probably the case for only the simplest algorithms, where you don't perform any horizontal operations, including shuffles. Most often though, you'll need something like butterfly, even if memory alignment is not an issue. As for palignr, it has uses other than avoiding unaligned memory accesses, so it's really case-specific.

Diego,

VC is Visual C++ (MSVC), GC is GNU C (gcc).

The ALIGNR_256 macro works with an arbitrary offset, e.g. ALIGNR_256(ret, a, b, 1, 4).

r: result.
(v0, v1): input pair; v0 contains the low-order elements.
offs: offset of the first element.
size: element size in bytes.

```
#define ALIGNR_256(r, v1, v0, offs, size) \
    if (offs == 0) \
        r = v0; \
    else if (offs == 32 / size) \
        r = v1; \
    else \
    { \
        r = _mm256_permute2x128_si256(v0, v1, 0x21); \
        \
        if (offs > 16 / size) \
            r = _mm256_alignr_epi8(v1, r, offs * size & ~16); \
        else if (offs < 16 / size) \
            r = _mm256_alignr_epi8(r, v0, offs * size); \
    }
```
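A quick usage sketch of the macro (Vladimir's macro is repeated verbatim so the snippet stands alone; the pragma only lets GCC/Clang compile AVX2 intrinsics without extra flags):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#pragma GCC target("avx2")  // assumes an AVX2-capable CPU

// Vladimir's macro, repeated so this snippet is self-contained.
#define ALIGNR_256(r, v1, v0, offs, size) \
    if (offs == 0) \
        r = v0; \
    else if (offs == 32 / size) \
        r = v1; \
    else \
    { \
        r = _mm256_permute2x128_si256(v0, v1, 0x21); \
        if (offs > 16 / size) \
            r = _mm256_alignr_epi8(v1, r, offs * size & ~16); \
        else if (offs < 16 / size) \
            r = _mm256_alignr_epi8(r, v0, offs * size); \
    }

static bool check_alignr256() {
    __m256i v0 = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i v1 = _mm256_setr_epi32(8, 9, 10, 11, 12, 13, 14, 15);
    __m256i r;
    ALIGNR_256(r, v1, v0, 1, 4);  // shift (v1:v0) right by one epi32 element
    int32_t out[8];
    _mm256_storeu_si256((__m256i*)out, r);
    for (int i = 0; i < 8; ++i)
        if (out[i] != i + 1) return false;  // expect elements 1..8
    return true;
}
```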

Thank you.

Andysem, very useful information.

Vladimir, thank you very much for the macro. It seems the most efficient way of implementing the full functionality for 256-bit registers.

I think these two load instructions could be issued in parallel on ports 2 and 3, thus speeding up execution.

I have posted a solution for that using only AVX; it might be worth a try:

http://software.intel.com/en-us/forums/topic/283576#comment-1755317

Maybe the function name is not ideal (it does not reflect the fact that a true _mm256_alignr_ps would actually be a _mm2x128_alignr_ps), but the rest works fine.

Hello,

I did some tests (a few months ago) on Sandy Bridge regarding unaligned loads. It was an FIR filter (convolution) with a kind of ring buffer for the last values. Here, the ring buffer could not be accessed aligned on every iteration, but that did not hurt performance much. In fact, adding code to access everything aligned caused more overhead and only performed at about the same speed.

Has anyone tested the ALIGNR_256 macro against two unaligned 128-bit loads? The macro has some if statements, and these need branches, which is not that good.

Quote:

Christian M. wrote:

The macro has some if statements, and these need branches, which is not that good.

Offset and element size are expected to be compile time constants, so the compiler will remove all conditions and only one branch will remain.

Quote:

andysem wrote:

Quote:

Christian M. wrote:

The macro has some if statements, and these need branches, which is not that good.

Offset and element size are expected to be compile time constants, so the compiler will remove all conditions and only one branch will remain.

Sorry, that is something I missed; I only looked at the code and did not think about how the macro is used. This improves things a lot, but one branch can still make a difference.

There are NO branches in the generated machine code (alignr only takes compile-time constants), unless you forget to enable optimizations.

Here is a version of alignr for AVX2 that works across lanes (in at most 2 instructions).

```
// Unlike _mm256_alignr_epi8, this one works across lanes
template <int N>
__m256i _mm256_alignr_ex_epi8_emul(__m256i const & high, __m256i const & low)
{
    // [low.hi | high.lo] -- the middle 16 bytes of the (high:low) concatenation
    __m256i high0low1 = _mm256_permute2f128_si256(low, high, _MM_SHUFFLE(0, 2, 0, 1));

    if (N == 0)       return low;
    else if (N == 32) return high;
    else if (N == 16) return high0low1;
    else if (N < 16)
        return _mm256_alignr_epi8(high0low1, low, N & 15);

    return _mm256_alignr_epi8(high, high0low1, N & 15);
}

#define _mm256_alignr_ex_epi8(x, y, n) _mm256_alignr_ex_epi8_emul<(n)>((x), (y))
```
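If it helps, a usage sketch of the template above (repeated in a slightly compacted form so the snippet compiles on its own; the pragma only lets GCC/Clang build AVX2 code without extra flags):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#pragma GCC target("avx2")  // assumes an AVX2-capable CPU

// The cross-lane alignr from the post, repeated for a self-contained check.
template <int N>
__m256i alignr_ex_epi8(__m256i high, __m256i low) {
    __m256i mid = _mm256_permute2f128_si256(low, high, 0x21); // [low.hi | high.lo]
    if (N == 0)  return low;
    if (N == 32) return high;
    if (N == 16) return mid;
    if (N < 16)  return _mm256_alignr_epi8(mid, low, N & 15);
    return _mm256_alignr_epi8(high, mid, N & 15);
}

// The result should be bytes N..N+31 of the 64-byte (high:low) concatenation.
static bool check_alignr_ex() {
    uint8_t bytes[64], out[32];
    for (int i = 0; i < 64; ++i) bytes[i] = (uint8_t)i;
    __m256i low  = _mm256_loadu_si256((const __m256i*)bytes);
    __m256i high = _mm256_loadu_si256((const __m256i*)(bytes + 32));
    _mm256_storeu_si256((__m256i*)out, alignr_ex_epi8<5>(high, low));
    for (int i = 0; i < 32; ++i)
        if (out[i] != bytes[5 + i]) return false;
    return true;
}
```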