Reversing (V)MOVMSKPS (restoring masks from GPR to XMM / YMM)


Consider SSE/AVX code that operates on XMM/YMM registers 99% of the time but needs to keep track of masks (control-flow to data-flow conversion using BLENDVPS operations).

Example in C (intrinsics for better understanding):

__m256 x = ...

__m256 y = ...

__m256 mask0 = ...

__m256 mask1 = _mm256_cmp_ps(x, y, _CMP_LT_OS);

__m256 mask2 = _mm256_and_ps(mask0, mask1);

__m256 res = _mm256_blendv_ps(x, y, mask2);

Because both of the disjoint control-flow paths are executed, the number of live variables required for the blending operations increases, and so does register pressure.

The idea is now to store masks in GPRs instead of vector registers in order to free some of the vector registers (operations like and/or/xor can just as well be executed in the scalar unit).

This would result in code like this:

__m256 x = ...

__m256 y = ...

unsigned mask0 = ...

__m256 mask1v = _mm256_cmp_ps(x, y, _CMP_LT_OS);

unsigned mask1 = _mm256_movemask_ps(mask1v);

unsigned mask2 = mask0 & mask1;

__m256 mask2v = ?

__m256 res = _mm256_blendv_ps(x, y, mask2v);

Now the question is: can anybody help me out on the question mark? :)

However, I could imagine that people more experienced with such code would advise against attempting this because of other performance issues. Is that the case?

Kind regards,


P.S. I saw postings in a different thread ( ) that went in a similar direction, but I felt my question was a little bit off-topic there.


__m256 LUT[0x100]; // look up table for converting 8-bit masks to 256-bit masks, 256*32bytes (8 KB), aligned 32, *must be initialized properly*


const __m256 mask2v = LUT[mask2];
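For illustration, here is a minimal sketch of how such a table could be initialized. The names lut_bits, init_mask_lut and expand_mask_lut are hypothetical and do not come from the code above. The table is kept as raw 32-bit words here so that it can be filled with plain scalar code, and the target attribute is a GCC/Clang extension (with other compilers, drop it and enable AVX through compiler flags).

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical table: 256 entries x 8 lanes x 4 bytes = 8 KB, 32-byte aligned. */
static _Alignas(32) uint32_t lut_bits[256][8];

/* Fill the table once at startup: bit i of the 8-bit mask selects lane i,
   matching the bit order produced by (V)MOVMSKPS. */
static void init_mask_lut(void)
{
    for (unsigned m = 0; m < 256; ++m)
        for (unsigned lane = 0; lane < 8; ++lane)
            lut_bits[m][lane] = ((m >> lane) & 1u) ? 0xFFFFFFFFu : 0u;
}

/* Expand an 8-bit mask to a 256-bit mask with one aligned load. */
__attribute__((target("avx")))
static inline __m256 expand_mask_lut(unsigned mask)
{
    return _mm256_load_ps((const float *)lut_bits[mask & 0xFFu]);
}
```

With this in place, the question mark in the original code would become something like mask2v = expand_mask_lut(mask2), after init_mask_lut() has been called once.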

Now, I don't think it will be beneficial in your example: the extra latency from the vmovmskps + LUT access will be worse than that of the spills/fills caused by your lack of registers, and it will also be on the critical path, unlike the spills/fills most probably.

NB: I personally use it to expand masks stored in packed form (8-bit) for multi-pass algorithms: the easy 32-to-1 compression obviously minimizes cache misses, and I have measured actual speedups vs. storing the 256-bit masks (for datasets bigger than the L2$ capacity).

Hi bronxzv,

thanks for your answer.

That solution obviously works, and I will give it a try. However, I was hoping that there is a way to compute the 256-bit mask on the fly with some clever instructions. So far, I did not come up with anything useful.



>However, I was hoping that there is a way to compute the 256-bit mask on the fly with some clever instructions.

Only a long instruction sequence will fit the bill, and it will clearly be slower than the single vmovaps (with a very low cache miss ratio) of the LUT solution.
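To give an idea of what such a sequence might look like, here is one possible sketch for plain AVX (no AVX2), which has to work on two 128-bit halves with SSE2 integer operations. The function names expand_mask_compute and roundtrip_mask are hypothetical, the target attribute is a GCC/Clang extension, and this is only one of several possible sequences, not a tuned implementation.

```c
#include <immintrin.h>

/* Expand an 8-bit mask to a 256-bit mask without a table:
   broadcast the mask, isolate one bit per lane, compare against that bit. */
__attribute__((target("avx")))
static inline __m256 expand_mask_compute(unsigned mask)
{
    const __m128i bits_lo = _mm_setr_epi32(0x01, 0x02, 0x04, 0x08); /* lanes 0..3 */
    const __m128i bits_hi = _mm_setr_epi32(0x10, 0x20, 0x40, 0x80); /* lanes 4..7 */
    const __m128i m  = _mm_set1_epi32((int)mask);
    /* A lane becomes all ones exactly when its bit is set in the mask. */
    const __m128i lo = _mm_cmpeq_epi32(_mm_and_si128(m, bits_lo), bits_lo);
    const __m128i hi = _mm_cmpeq_epi32(_mm_and_si128(m, bits_hi), bits_hi);
    /* Glue the two 128-bit halves back into one YMM register. */
    return _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_castsi128_ps(lo)),
                                _mm_castsi128_ps(hi), 1);
}

/* Sanity helper: expanding and then applying VMOVMSKPS should give the mask back. */
__attribute__((target("avx")))
static int roundtrip_mask(unsigned mask)
{
    return _mm256_movemask_ps(expand_mask_compute(mask));
}
```

Counting the broadcast, two ANDs, two compares and the insert, this is already several dependent instructions compared with the single load of the table approach.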

Anyway, even if there were such an instruction, it would not speed up your solution much, since the 8-bit to 256-bit conversion occurs only once but the vmovmskps is required after each packed compare. All this added latency on the critical path, just for freeing a single YMM register, is most probably a bad tradeoff.