Get _mm_alignr_epi8 functionality on 256-bit vector registers (AVX2)

Diego Caballero

Hello,

I'm porting an application from SSE to AVX2 and KNC.

I have a number of _mm_alignr_epi8 intrinsics. For KNC I just had to replace them with the _mm512_alignr_epi32 intrinsic (by the way, that intrinsic is missing from http://software.intel.com/sites/landingpage/IntrinsicsGuide/ for KNC), but the 256-bit version, _mm256_alignr_epi8, does something unexpected: it is not an extension of the 128-bit instruction to 256 bits. Instead, it performs two independent 128-bit alignr operations on the two lanes of the 256-bit vectors, which is not the expected behaviour if we look at its counterparts in AVX512 and KNC.

Does anyone know the most efficient way of implementing the extension of _mm_alignr_epi8 to 256-bit vectors using AVX2 intrinsics?

I.e., with V1={7, 6, 5, 4, 3, 2, 1, 0} and V2={15, 14, 13, 12, 11, 10, 9, 8}, the output of this operation should be V3={8, 7, 6, 5, 4, 3, 2, 1} and not V3={12, 7, 6, 5, 8, 3, 2, 1}, which is what I get with _mm256_alignr_epi8.
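
To make the semantics precise, here is a plain-C sketch of what I expect (my own illustration, not an intrinsic; element order as in _mm256_setr_epi32, mirroring what _mm512_alignr_epi32 does):

#include <stdint.h>

/* Hypothetical scalar reference: treat hi:lo as one 512-bit value of
   32-bit elements, shift right by offs elements, keep the low 8. */
static void alignr256_ref(int32_t r[8], const int32_t hi[8],
                          const int32_t lo[8], unsigned offs)
{
    for (unsigned i = 0; i < 8; ++i) {
        unsigned j = i + offs;               /* index into the 16-element pair */
        r[i] = (j < 8) ? lo[j] : hi[j - 8];  /* lo holds the low-order elements */
    }
}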

Thank you in advance

 

 

Barcelona Supercomputing Center
Vladimir Sedach

Hello Diego,

__m256i    v1 = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
__m256i    v2 = _mm256_setr_epi32(15, 14, 13, 12, 11, 10, 9, 8);

/* take element 7 from v2 and elements 0-6 from v1, then rotate it by one element */
__m256i    v = _mm256_blend_epi32(v1, v2, 0x80);
v = _mm256_permutevar8x32_epi32(v, _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6));

Hope Intel knows how to do it using just one instruction ;)

Diego Caballero

Thank you vvsed.

Very useful, though permutevar is an expensive instruction (in addition to the blend).

Let's see if someone else knows of another approach, but I'm afraid it won't be much more efficient.

 

Cheers.

Barcelona Supercomputing Center
Christopher H.

__m256i hi = _mm256_permute2x128_si256(a, b, 0x21); /* [a high 128 : b low 128], a = low vector, b = high vector */
__m256i r  = _mm256_alignr_epi8(hi, a, offset);

This only gives you access to offsets of up to 16 bytes, but it can be useful.
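
For example, for a fixed 4-byte shift (a sketch; the names are mine, and a is assumed to hold the lower 256 bits, b the upper ones):

#include <immintrin.h>

/* Shift the 512-bit concatenation b:a right by 4 bytes and keep the low
   256 bits. The alignr offset must be a compile-time constant. */
static inline __m256i alignr256_by4(__m256i b, __m256i a)
{
    __m256i mid = _mm256_permute2x128_si256(a, b, 0x21); /* [a high : b low] */
    return _mm256_alignr_epi8(mid, a, 4);
}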

Vladimir Sedach

Diego,

I've compared both approaches.
"blend + permutevar" turns out to be the same speed (VC) or even faster (GC) than
Christopher's "permute2x128 + alignr" in a short loop.
permutevar of course needs an additional register constant for the indexes.

 

andysem

Most AVX/AVX2 instructions are designed to operate independently on the two 128-bit lanes of the 256-bit registers, and vpalignr is no exception. Because of this design it is often more efficient to process data in two parallel streams. To optimize loads and stores, one would load 256 bits of data from each of the two streams, perform a butterfly transform, and then do the computation on the 128-bit lanes.

__m256i mm1 = _mm256_load_si256(stream1);
__m256i mm2 = _mm256_load_si256(stream2);

// Butterfly transform
__m256i mm_lo = _mm256_permute2x128_si256(mm1, mm2, 0x20); // low 128 bits of both streams
__m256i mm_hi = _mm256_permute2x128_si256(mm1, mm2, 0x31); // high 128 bits of both streams

// Process the two streams: each lane now holds one stream's aligned data
__m256i mm_aligned = _mm256_alignr_epi8(mm_hi, mm_lo, 1);

If your data processing pattern also produces 256 bits of output per stream, you can perform a second butterfly and then store the 256-bit result of each stream.
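
Something like this (a sketch; res_lo/res_hi stand for hypothetical per-lane results, and the 0x20/0x31 permute pair is its own inverse):

// Second butterfly: gather each stream's two lanes back into one register
__m256i out1 = _mm256_permute2x128_si256(res_lo, res_hi, 0x20);
__m256i out2 = _mm256_permute2x128_si256(res_lo, res_hi, 0x31);

_mm256_store_si256(out_stream1, out1);
_mm256_store_si256(out_stream2, out2);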

That said, a 256-bit wide align instruction is indeed missing. The approach above doesn't work for the original motivating use of palignr - aligning memory accesses while processing unaligned data: _mm256_alignr_epi8 cannot be used to align memory accesses to 32-byte boundaries.

 

Diego Caballero

Very interesting! Thank you.

Sorry vvsed, could you please tell me what GC and VC stand for?

Andysem, let me continue the discussion regarding your example. If it is meant to mitigate unaligned load accesses, which are becoming less and less expensive... is it worth it? Two aligned loads + two permutations (3 cycles latency each) + an alignr, instead of just one unaligned load instruction?

If it were, the transformation would be useful when these loads are combined only with each other, but not when they are combined with other, aligned loads. In that case you would have to apply the same butterfly transform to every load involved, even those that are properly aligned or already in registers.

What do you think about this?

Barcelona Supercomputing Center
andysem

Quote:

Diego Caballero wrote:

Andysem, let me continue the discussion regarding your example. If it is meant to mitigate unaligned load accesses, which are becoming less and less expensive... is it worth it? Two aligned loads + two permutations (3 cycles latency each) + an alignr, instead of just one unaligned load instruction?

Newer CPUs handle unaligned memory accesses better, but there is still a significant penalty when an access spans multiple cache lines. Also, on Sandy/Ivy Bridge an unaligned 256-bit access is slower than two unaligned 128-bit accesses. I don't remember the exact numbers, but I think this was discussed earlier on this forum. The point is that aligning memory accesses may still be beneficial for memory-bound algorithms.

Quote:

Diego Caballero wrote:

If it were, the transformation would be useful when these loads are combined only with each other, but not when they are combined with other, aligned loads. In that case you would have to apply the same butterfly transform to every load involved, even those that are properly aligned or already in registers.

Not sure I understand you. Of course, if your algorithm permits, you can omit both the butterfly and the align stages. But that is probably only the case for the simplest algorithms, where you don't perform any horizontal operations, including shuffles. More often you'll need something like the butterfly even if memory alignment is not an issue. As for palignr, it has uses other than avoiding unaligned memory accesses, so it's really case-specific.

 

Vladimir Sedach

Diego,

VC is Visual C (MSVC), GC is GNU C (gcc).

The ALIGNR_256 macro works with an arbitrary offset, e.g. ALIGNR_256(ret, a, b, 1, 4).

r: result.
(v1, v0): the input pair; v0 contains the low-order elements.
offs: offset of the 1st element, in elements.
size: element size in bytes.

#define ALIGNR_256(r, v1, v0, offs, size) \
    if (offs == 0) \
        r = v0; \
    else if (offs == 32 / size) \
        r = v1; \
    else \
    { \
        /* the 16 bytes crossing the lane boundary: [v0 high : v1 low] */ \
        r = _mm256_permute2x128_si256(v0, v1, 0x21); \
\
        if (offs > 16 / size) \
            r = _mm256_alignr_epi8(v1, r, offs * size & ~16); \
        else if (offs < 16 / size) \
            r = _mm256_alignr_epi8(r, v0, offs * size); \
    }
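
Usage example (with made-up values), extracting 8 ints starting at element 3 of the pair:

__m256i v0 = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
__m256i v1 = _mm256_setr_epi32(8, 9, 10, 11, 12, 13, 14, 15);
__m256i r;

ALIGNR_256(r, v1, v0, 3, 4); /* r = {3, 4, 5, 6, 7, 8, 9, 10} */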

 

 

Diego Caballero

Thank you.

Andysem, very useful information.

Vladimir, thank you very much for the macro. It seems to be the most efficient way of implementing the full functionality on 256-bit registers.

 

Barcelona Supercomputing Center
iliyapolak

I think these two load instructions could be issued in parallel on ports 2 and 3, thus speeding up execution:

__m256i mm1 = _mm256_load_si256(stream1);
__m256i mm2 = _mm256_load_si256(stream2);

 

emmanuel.attia

I have posted a solution for that using only AVX; it might be worth a try:

http://software.intel.com/en-us/forums/topic/283576#comment-1755317

Maybe the function name is not right (it does not reflect the fact that a _mm256_alignr_ps would actually be a _mm2x128_alignr_ps), but the rest works fine.

Christian M.

Hello,

I did some tests a few months ago on Sandy Bridge about unaligned loads. It was an FIR filter (convolution) with a kind of ring buffer for the last values. The ring buffer could not be accessed aligned on every iteration, but that did not hurt performance much. If anything, adding code to make every access aligned caused more overhead and performed at about the same speed.

Has anyone tested the ALIGNR_256 macro against two unaligned 128-bit loads? The macro has some if statements, and these need branches, which is not good.

 

andysem

Quote:

Christian M. wrote:

The macro has some if statements, and these need branches, which is not good.

Offset and element size are expected to be compile-time constants, so the compiler will remove all the conditions and only one branch will remain.

Christian M.

Quote:

andysem wrote:

Quote:

Christian M. wrote:

The macro has some if statements, and these need branches, which is not good.

 

Offset and element size are expected to be compile-time constants, so the compiler will remove all the conditions and only one branch will remain.

Sorry, this is something I missed, as I only looked at the code and did not think about how the macro is used. That improves things a lot, but one branch can still make a difference.

Ritratto di emmanuel.attia

There are NO branches in the generated machine code (alignr only takes compile-time constants) - unless you forget to enable optimizations.
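
For instance, with the constants folded, ALIGNR_256(r, v1, v0, 1, 4) should reduce to just this pair of intrinsics (my reading of the macro, not compiler output):

r = _mm256_permute2x128_si256(v0, v1, 0x21);
r = _mm256_alignr_epi8(r, v0, 4);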
