Using scatterstore to interleave two (sparse) 512-bit vector registers

Hi @ all,

I have two 512 bit vector registers and one mask16. The two registers contain (sparse) data:

idx:      15   14   13   12   11   10   09   08   07   06   05   04   03   02   01   00
zmmVidx: 120  112  104  096  088  080  072  064  056  048  040  032  024  016  008  000
zmm0:     00   11   00   00   10   00   00   00   00   00   01   00   00   00   00   00
zmm1:     00   00   A0   00   00   B0   00   00   00   00   00   F0   00   00   00   00
mask1:     0    1    0    0    1    0    0    0    0    0    1    0    0    0    0    0
mask2:     0    0    1    0    0    1    0    0    0    0    0    1    0    0    0    0

I want to interleave the selected elements of the two vector registers and store them contiguously into 32 bytes of memory, starting with zmm0:

idx: ... 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
mem: ... 00 00 00 00 00 00 00 00 00 00 A0 11 B0 10 F0 01
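
To pin down the intended result independently of any intrinsics, here is a minimal scalar sketch of the same operation. The function name interleave_sparse_ref is hypothetical, elements are treated as 32-bit values because the intrinsics in this post are the epi32 variants, and padding the shorter side with zeros when the two masks select different counts is an assumption, not something stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper (not from the post): packs the lanes
 * selected by mask_a and mask_b, then interleaves them as a0 b0 a1 b1 ...
 * Returns the number of 32-bit elements written; out needs room for 32. */
static size_t interleave_sparse_ref(const uint32_t *a, uint16_t mask_a,
                                    const uint32_t *b, uint16_t mask_b,
                                    uint32_t *out)
{
    uint32_t ca[16] = {0}, cb[16] = {0};
    size_t na = 0, nb = 0;

    assert(a && b && out);

    /* Step 1: pack the selected lanes to the front, lowest index first. */
    for (int i = 0; i < 16; ++i) {
        if ((mask_a >> i) & 1u) ca[na++] = a[i];
        if ((mask_b >> i) & 1u) cb[nb++] = b[i];
    }

    /* Step 2: interleave pairwise, starting with the first input.
     * Assumption: if the masks select different counts, the shorter
     * side is padded with zeros. */
    size_t pairs = na > nb ? na : nb;
    for (size_t i = 0; i < pairs; ++i) {
        out[2 * i]     = ca[i];
        out[2 * i + 1] = cb[i];
    }
    return 2 * pairs;
}
```

With the register contents from the table (mask1 = 0x4820, mask2 = 0x2410) this writes the six elements 01 F0 10 B0 11 A0, lowest index first, which matches the mem row.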

I implemented that behaviour with the following code:

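/* ... */
/* Pack the selected elements of each register into the low lanes. */
__m512i zmm2 = _mm512_maskz_compress_epi32( mask1, zmm0 );
__m512i zmm3 = _mm512_maskz_compress_epi32( mask2, zmm1 );
/* Lanes of zmm3 that are non-zero after compression (_MM_CMPINT_NE == 4). */
__mmask16 mask3 = _mm512_cmp_epi32_mask( zmm3, _mm512_setzero_epi32(), _MM_CMPINT_NE );

/* zmmVidx holds the byte offsets 0, 8, 16, ... (scale 1): zmm2 fills the even 32-bit slots. */
_mm512_mask_i32scatter_epi32( mem, mask3, zmmVidx, zmm2, 1 );
/* zmm3 fills the odd slots, 4 bytes further on (assuming mem is a byte pointer). */
_mm512_mask_i32scatter_epi32( mem + sizeof( uint32_t ), mask3, zmmVidx, zmm3, 1 );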

I first pack the sparse data contiguously in the vector registers and store it afterwards. Is it possible to perform a masked interleave to memory directly, so that one can avoid the scatter store and use a contiguous store operation instead?

 

Sincerely yours


Why not .OR. the two (or add, or masked add, or masked or), then store or masked-store the result?

Scatter should be reserved for cases where the data to be written will .NOT. reside within 64 bytes of each other (or within one cache line, depending on the instruction).

Jim Dempsey

Well, thanks for your reply, but I think it is not that easy, because the data in the two vectors can (!) overlap. It is possible that all values of the two vectors (32 byte resulting data), or only a subset of them, have to be interleaved, and the set bits of the two masks can (!) fall on the same indices.

Sincerely
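
One possible way to keep the compress step but replace the two scatters with contiguous stores is to interleave the compressed registers in-register with _mm512_permutex2var_epi32. Because both inputs are compressed first, it does not matter whether the two masks have set bits at the same indices. The following is only a sketch under several assumptions: GCC or Clang on x86-64 (it relies on the target attribute and __builtin_cpu_supports), hypothetical function names, and a scalar fallback so that it also runs on CPUs without AVX-512F.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* AVX-512F path: compress, interleave in registers, two contiguous stores. */
__attribute__((target("avx512f")))
static void interleave_store_avx512(const uint32_t *a, uint16_t mask_a,
                                    const uint32_t *b, uint16_t mask_b,
                                    uint32_t *out)
{
    /* Index values 0..15 pick from the first source, 16..31 from the second. */
    const __m512i idx_lo = _mm512_setr_epi32(0, 16, 1, 17, 2, 18, 3, 19,
                                             4, 20, 5, 21, 6, 22, 7, 23);
    const __m512i idx_hi = _mm512_setr_epi32(8, 24, 9, 25, 10, 26, 11, 27,
                                             12, 28, 13, 29, 14, 30, 15, 31);

    /* Pack the selected lanes of each input into the low lanes. */
    __m512i ca = _mm512_maskz_compress_epi32(mask_a, _mm512_loadu_si512(a));
    __m512i cb = _mm512_maskz_compress_epi32(mask_b, _mm512_loadu_si512(b));

    /* a0 b0 a1 b1 ... a7 b7 goes to out[0..15], the rest to out[16..31]. */
    _mm512_storeu_si512(out,      _mm512_permutex2var_epi32(ca, idx_lo, cb));
    _mm512_storeu_si512(out + 16, _mm512_permutex2var_epi32(ca, idx_hi, cb));
}

/* Hypothetical entry point: a and b hold 16 elements, out has room for 32. */
static void interleave_store(const uint32_t *a, uint16_t mask_a,
                             const uint32_t *b, uint16_t mask_b,
                             uint32_t *out)
{
    assert(a && b && out);

    if (__builtin_cpu_supports("avx512f")) {
        interleave_store_avx512(a, mask_a, b, mask_b, out);
        return;
    }

    /* Portable fallback producing the same 32 output elements. */
    uint32_t ca[16] = {0}, cb[16] = {0};
    int na = 0, nb = 0;
    for (int i = 0; i < 16; ++i) {
        if ((mask_a >> i) & 1u) ca[na++] = a[i];
        if ((mask_b >> i) & 1u) cb[nb++] = b[i];
    }
    for (int i = 0; i < 16; ++i) {
        out[2 * i]     = ca[i];
        out[2 * i + 1] = cb[i];
    }
}
```

Note one difference from the masked scatters: this sketch always writes all 128 bytes, with zeros in the unused slots, whereas the scatters leave unselected memory untouched. If that matters, a masked store such as _mm512_mask_storeu_epi32 with a mask derived from the number of selected elements could be used instead of the plain stores.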
