I'm playing around with the new AVX512 instruction sets and I try to understand how they work and how one can use them.
What I try is to interleave specific data, selected by a mask. My little benchmark loads x*32 byte of aligned data from memory into two vector registers and compresses them using a dynamic mask (fig. 1). The resulting vector registers are scattered into the memory, so that the two vector registers are interleaved (fig. 2).
Figure 1: Compressing the two data vector registers using the same dynamically created mask.
Figure 2: Scatter store to interleave the compressed data.
My code looks like the following:
void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) { /* Initialize a vector register containing zeroes to get the store mask */ __m512i zeroVec = _mm512_setzero_epi32(); /* Load data */ __m512i dataVec_1 = _mm512_conflict_epi32( data ); __m512i dataVec_2 = _mm512_conflict_epi32( data + 16 ); /* Compress the data */ __m512i compVec_1 = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 ); __m512i compVec_2 = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 ); /* Get the store mask by compare the compressed register with the zeroregister (4 means !=) */ __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, 4 ); /* Interleave the selected data */ _mm512_mask_i32scatter_epi32( result, maskStore, vindex, compVec_1, 1 ); _mm512_mask_i32scatter_epi32( result + 1, maskStore, vindex, compVec_2, 1 ); }
I compiled everything with
O3 march=knl lmemkind mavx512f mavx512pf
I call the method for 100'000'000 elements. To actually get an overview of the behaviour of the scatter store I repeated this measurement with different values for maskCompress. I expected some kind of dependence between the time needed for execution and the number of set bits within the maskCompress. But I observed, that the tests needed roughly the same time for execution.
I did a little bit of research and came up to this: Instruction latency of avx512. Following the given link, the latency of the used instructions are constant. But to be honest, I am a little bit confused about this behaviour.
I know that this is not the usual way, but I have 3 questions, related to this topic and I am hopefull that one can help me out.

Why should a masked store with only one set bit needs the same time as a masked store where all bits are set?

Does anyone has some experience or is there a good documentation to understand the behaviour of the AVX512 scatter store?

Is there a more easy or more performant way to interleave two vector registers?
Thanks for your help!
Sincerely