Hi,
consider SSE/AVX code that operates on XMM/YMM registers 99% of the time but needs to keep track of masks (control-flow to data-flow conversion using BLENDVPS operations).
Example in C (intrinsics for better understanding):
__m256 x = ...
__m256 y = ...
__m256 mask0 = ...
__m256 mask1 = _mm256_cmp_ps(x, y, _CMP_LT_OS); /* AVX has no _mm256_cmplt_ps; the predicate is an immediate */
__m256 mask2 = _mm256_and_ps(mask0, mask1);
__m256 res = _mm256_blendv_ps(x, y, mask2);
Because both of the disjoint control-flow paths are executed, the number of live variables required for the blending operations increases, and so does register pressure.
The idea is to store the masks in general-purpose registers (GPRs) instead of vector registers in order to free some of the vector registers (operations like and/or/xor can just as well be executed in the scalar unit).
This would result in code like this:
__m256 x = ...
__m256 y = ...
unsigned mask0 = ...
__m256 mask1v = _mm256_cmp_ps(x, y, _CMP_LT_OS); /* AVX has no _mm256_cmplt_ps; the predicate is an immediate */
unsigned mask1 = _mm256_movemask_ps(mask1v);
unsigned mask2 = mask0 & mask1;
__m256 mask2v = ?
__m256 res = _mm256_blendv_ps(x, y, mask2v);
Now the question is: can anybody help me out on the question mark? :)
However, I could imagine that people who are more experienced with such code would advise against attempting this because of other performance issues. Is that the case?
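To make the question concrete, here is a minimal sketch of one possible way to fill in the question mark using only AVX and SSE2 instructions. The helper names expand_mask8 and blend_lt, the AVX_TARGET macro, and the load/store wrapper are hypothetical and only for illustration. The expansion broadcasts the scalar mask, isolates one bit per lane, and compares with integer instructions on the two 128-bit halves (AVX1 has no 256-bit integer compare). Whether the extra movemask and expansion instructions are cheaper than the register spills they avoid is not something this sketch answers; that would have to be measured.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. Enables AVX per function when the file is not
   compiled with -mavx; expands to nothing otherwise. */
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: expand the low 8 bits of m into a per-lane mask
   (all ones where the bit is set, all zeros otherwise). */
AVX_TARGET static inline __m256 expand_mask8(unsigned m)
{
    const __m128i bits_lo = _mm_setr_epi32(1, 2, 4, 8);
    const __m128i bits_hi = _mm_setr_epi32(16, 32, 64, 128);
    __m128i vm = _mm_set1_epi32((int)m);            /* broadcast the mask */
    /* lane i becomes all ones iff bit i of m is set */
    __m128i lo = _mm_cmpeq_epi32(_mm_and_si128(vm, bits_lo), bits_lo);
    __m128i hi = _mm_cmpeq_epi32(_mm_and_si128(vm, bits_hi), bits_hi);
    /* glue the two 128-bit halves into one 256-bit register */
    __m256i v = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    return _mm256_castsi256_ps(v);
}

/* The second example from the post, wrapped so it can be called with
   plain arrays of 8 floats (the wrapper itself is an assumption). */
AVX_TARGET static void blend_lt(const float *x, const float *y,
                                unsigned mask0, float *out)
{
    __m256 xv = _mm256_loadu_ps(x);
    __m256 yv = _mm256_loadu_ps(y);
    __m256 mask1v = _mm256_cmp_ps(xv, yv, _CMP_LT_OS);
    unsigned mask1 = (unsigned)_mm256_movemask_ps(mask1v);
    unsigned mask2 = mask0 & mask1;                 /* scalar mask logic */
    __m256 mask2v = expand_mask8(mask2);            /* the "?" */
    /* blendv takes y where the mask sign bit is set, x elsewhere */
    _mm256_storeu_ps(out, _mm256_blendv_ps(xv, yv, mask2v));
}
```

Since BLENDVPS only inspects the sign bit of each lane, any expansion that gets bit i of the scalar mask into the sign bit of lane i would also do; the compare-based version above additionally yields a full all-ones/all-zeros mask that can be reused with and/andnot.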
Kind regards,
Ralf
P.S. I saw postings in a different thread (http://software.intel.com/en-us/forums/showthread.php?t=80452&o=a&s=lr) that went in a similar direction, but I felt my question was a little bit off-topic there.


