# Interleaving values of 2 vector registers

## Interleaving values of 2 vector registers

Using Xeon Phi intrinsics in C++, I would like to interleave the float values of 2 registers. It is basically a vector of structures (vector<complex<float> >) to structure of vectors thing. I guess it is somehow related to swizzle and shuffle, but looking at the compiler and instruction set manuals I dont see how to do it. Here are the "formal" specs.

1. Given v1, v2 of type __m512, both containing 16 floats,  transform v1=(x7 y7 ... x0 y0), v2=(x15 y15 ... x8 y8) into  v1=(y15 y14 ... y1 y0) v2=(x15 x14 ... x1 x0)
2. Given v1, v2 of type __m512, both containing 16 floats, transform v1=(y15 y14 ... y1 y0), v2=(x15 x14... x1 x0) into v1=(x7 y7... x0 y0), v2=(x15 y15... x8 y8) (basically the reverse operation of the first)

With SSE, I do it with _mm_shuffle_ps() and _mm_unpackhi/lo_ps(), but how to (efficiently) do it for Xeon Phi?

Georg

7 posts / 0 nouveau(x)
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.

No answer yet. So it is either a difficult or a stupid question ;-)

Hi georgv,

You may do it as follows.

static void
interleave(__m512 re, __m512 im, __m512 *u, __m512 *v)
{
__m512i interleave_lo_hi = _mm512_set_16to16_epi32(15,7,14,6,13,5,12,4,11,3,10,2,9,1,8,0);
__m512 tmp_im = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)im);
__m512 tmp_re = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)re);
*u = _mm512_mask_blend_ps(0xAAAA, tmp_re, _mm512_swizzle_ps(tmp_im, _MM_SWIZ_REG_CDAB));
*v = _mm512_mask_blend_ps(0x5555, tmp_im, _mm512_swizzle_ps(tmp_re, _MM_SWIZ_REG_CDAB));
}

Hello,

thanks for providing this piece of code.  It covers case 2. I guess that finding suitable code sequences for re-ordering register content is one of the more advanced jobs in Xeon Phi programming ;-) . Or is there some guidance on how to do i?

After learning about _mm512_mask_blend_ps() (it is not in the compiler documentation, but in the header files), I also found a solution for case 1. Before this I had a solution using _m512_i32_scatter_ps() and _m512_i32_gather_ps(), but in both cases the new implementation is about twice as fast (code see below).

I was wondering if _m512_i32_gather_ps() is so much slower because the compiler fails to generate vprefetch* instructions for scatter/gather. However,  an experiment (commented out section in code below) did not show the performance boost I hoped for. Is there any guideline on the distances for prefetch commands, or is there any way to convince the compiler to generate suitable prefetches? Or is it just that _m512_i32_scatter_ps() and _m512_i32_gather_ps() are slow compared to the register fiddling I am doing now?

Georg

const size_type nFloats=16;
typedef _m512 vec_t;
/// load 32 float values from float aligned adress into v1,v2 and interleave values from v1 and v2 such
/// that every second value is place in the other one. p_p is 4 byte aligned.
///
/// v1=(x7 y7 ... x0 y0), v2=(x15 y15 ... x8 y8) becomes v1=(y15 y14 ... y1 y0) v2=(x15 x14 ... x1 x0)
inline void vec_gather2_ps(const float *p_p, vec_t &rv1_p,vec_t &rv2_p)
{
#if 0
// version with gather
// get every second
__m512i index=_mm512_set_epi32(30,28,26,24,22,20,18,16,14,12,10,8,6,4,2,0); // step 2
rv1_p=_mm512_i32gather_ps(index,p_p,4);
rv2_p=_mm512_i32gather_ps(index,p_p+1,4);
// prefetch values as generated by code in other alternative. However,
// no significant impact and still considerably slower
//_mm_prefetch(reinterpret_cast<const char *>(p_p+63*nFloats),_MM_HINT_T0);
//_mm_prefetch(reinterpret_cast<const char *>(p_p+64*nFloats),_MM_HINT_T0);
//_mm_prefetch(reinterpret_cast<const char *>(p_p+495*nFloats),_MM_HINT_T2);
//_mm_prefetch(reinterpret_cast<const char *>(p_p+496*nFloats),_MM_HINT_T2);
#else
__m512 v1=_mm512_setzero_ps();
__m512 v2=_mm512_setzero_ps();
// interleave such that odd elements are collected in high word
const __m512i gather_lo_hi = _mm512_set_16to16_epi32(15,13,11,9,7,5,3,1,14,12,10,8,6,4,2,0);
const __m512i gather_hi_lo = _mm512_set_16to16_epi32(14,12,10,8,6,4,2,0,15,13,11,9,7,5,3,1);
const __m512 split_v1 = (__m512)_mm512_permutevar_epi32(gather_lo_hi, (__m512i)v1); //even elements are now in lower half
const __m512 split_v2 = (__m512)_mm512_permutevar_epi32(gather_hi_lo, (__m512i)v2); //even elements are in upper half
#endif
}
/// distribute values for v1 and v2 such that every second value is placed in the other one, and store them at p_p.
///  p_p is 4 byte aligned.
///
/// v1=(y15 y14 ... y1 y0), v2=(x15 x14 ... x1 x0) becomes v1=(x7 y7 ... x0 y0), v2=(x15 y15 ... x7 y7)
inline void vec_scatter2_ps(float *p_p, vec_t const &rv1_p,vec_t const &rv2_p)
{
#if 0
// version with scatter
const __m512i index=_mm512_set_epi32(30,28,26,24,22,20,18,16,14,12,10,8,6,4,2,0); //step 2
_mm512_i32scatter_ps(p_p,index,rv1_p,4);
_mm512_i32scatter_ps(p_p+1,index,rv2_p,4);
#else
// version with code from forum
const __m512i interleave_lo_hi = _mm512_set_16to16_epi32(15,7,14,6,13,5,12,4,11,3,10,2,9,1,8,0);
const __m512 tmp_v1 = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)rv1_p);
const __m512 tmp_v2 = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)rv2_p);
const __m512 v1_new=_mm512_mask_blend_ps(0xAAAA, tmp_v1, _mm512_swizzle_ps(tmp_v2, _MM_SWIZ_REG_CDAB));
const __m512 v2_new=_mm512_mask_blend_ps(0x5555, tmp_v2, _mm512_swizzle_ps(tmp_v1, _MM_SWIZ_REG_CDAB));
_mm512_packstorelo_ps(p_p,v1_new);
_mm512_packstorehi_ps(p_p+nFloats,v1_new);
_mm512_packstorelo_ps(p_p+nFloats,v2_new);
_mm512_packstorehi_ps(p_p+2*nFloats,v2_new);
#endif
}

Georg,

Do you still want an answer to the last few questions you posed?

Regards
--
Taylor

PS We're going through and catching any accidentally dropped posts.

Tayler,

thanks for coming back to this topic. I am currently busy with different projects, but medium term the answer to the questions below would still be interesting. I also noticed that there are other posts in the forum about scatter/gather/prefetch and its performance, So mayb this is of general interest.

The remaining questions are:

1. I was wondering if _m512_i32_gather_ps() is so much slower because the compiler fails to generate vprefetch* instructions for scatter/gather. However,  an experiment (commented out section in code below) did not show the performance boost I hoped for.
2. Is there any guideline on the distances for prefetch commands, or is there any way to convince the compiler to generate suitable prefetches?
3. Or is it just that _m512_i32_scatter_ps() and _m512_i32_gather_ps() are slow compared to the register fiddling I am doing now?

Thanks,

Georg

Dear georgv,

The gather/scatter instructions are used for indirect access to arrays when the access pattern is unknown. In your case it is better to use blend/permute.

Regarding vec_gather2_ps, it is possible to use 2 blend's followed by 2 permutevar's -- the idea is shown below.

static void
split(__m512 tmp_re, __m512 tmp_im, __m512 *re, __m512 *im)
{
__m512i interleave_lo_hi = _mm512_set_16to16_epi32(15,7,14,6,13,5,12,4,11,3,10,2,9,1,8,0);
__m512 u = _mm512_mask_blend_ps(0xAAAA, tmp_re, _mm512_swizzle_ps(tmp_im, _MM_SWIZ_REG_CDAB));
__m512 v = _mm512_mask_blend_ps(0x5555, tmp_im, _mm512_swizzle_ps(tmp_re, _MM_SWIZ_REG_CDAB));
*im = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)tmp_im);
*re = (__m512)_mm512_permutevar_epi32(interleave_lo_hi, (__m512i)tmp_re);
}