How to clear the upper 128 bits of __m256 value?


Vladimir Sedach

How can I clear the upper 128 bits of m2:
__m256i    m2 = _mm256_set1_epi32(2);
__m128i    m1 = _mm_set1_epi32(1);

m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);
Neither works -- Intel’s documentation for the _mm256_castsi128_si256 intrinsic says that “the upper bits of the resulting vector are undefined”.
At the same time I can easily do it in assembly:
VMOVDQA xmm2, xmm2
VMOVDQA xmm2, xmm1

Of course, I'd rather not use _mm256_insertf128_si256().
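
For reference, that well-defined workaround looks roughly like this (a sketch reusing the declarations above; it costs a real VINSERTF128 instead of a plain register move):

m2 = _mm256_insertf128_si256(_mm256_setzero_si256(), m1, 0); // upper 128 bits guaranteed zero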
 

Tim Prince

Do you mean like

#ifdef __AVX__
_mm256_zeroupper();
#endif

That is what I use to enable gcc to compile SSE intrinsics. icc translates SSE intrinsics to AVX-128 and suppresses the zeroupper(), so maybe this is confusing. I haven't checked what MSVC does.

Vladimir Sedach

Tim,

I want to zero the upper 128-bit part of ONE variable only.
In other words, I want to mimic VMOVDQA xmm2, xmm1 (which clears the upper half of ymm2), preferably with a single intrinsic.

 

andysem

m2 = _mm256_permute2x128_si256(m2, m2, 0x80); // keep the low 128 bits of m2, zero the high 128 bits
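
A note on the immediate: imm8[1:0] selects the source for the low lane, imm8[5:4] for the high lane, and imm8[3] / imm8[7] zero the low / high lane respectively. The AVX (non-AVX2) counterpart should behave the same way -- a sketch, in case only AVX is available:

m2 = _mm256_permute2f128_si256(m2, m2, 0x80); // VPERM2F128 form: same effect, AVX only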

 

Vladimir Sedach

andysem:
_mm256_permute2x128_si256 is even slower than _mm256_insertf128_si256.
Actually, I want the compiler to use VMOVDQA.

m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.

iliyapolak

Tim

Sorry for the off-topic question.

Do you have any problems with conditional compilation of AVX intrinsics when the predefined __AVX__ macro is used?

andysem

Quote:

Vladimir Sedach wrote:

Actually, I want the compiler to use VMOVDQA.

m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.

_mm256_castsi128_si256 is basically a no-op and is only there to perform type casts. In conjunction with other intrinsics it can be completely elided from the resulting code.

What you request is actually a new intrinsic, which might not be a bad idea.
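
Something like a zero-extending cast, say (the name is hypothetical, only to illustrate the intended semantics):

__m256i _mm256_zextsi128_si256(__m128i a); // dst[127:0] = a, dst[255:128] = 0; ideally a single VMOVDQA xmm, xmm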

Vladimir Sedach

andysem,

Yes, it could be an intrinsic.
Though I would rather allow implicit conversions (with zero extension if needed) between vector types that differ only in the number of elements:

__m128d md1;
__m256d md2 = md1; //zero extension
md1 = md2; //truncation 

after all, it is already being done with byte "arrays":

unsigned char c;
unsigned int i = c; //zero extension to 4-byte array
c = i; //truncation 

andysem

GCC 4.8 recognizes this pattern:

__m256i ymm = _mm256_inserti128_si256(_mm256_setzero_si256(), xmm, 0);

and emits a "vmovdqa xmmN, xmmN" instruction that clears the upper lane. It could potentially optimize away this instruction as well if it is known that the original xmm was filled by a VEX-encoded instruction (which is almost always the case), but it doesn't do that. I think this is as close as you can get to hand-written assembly.
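
Wrapped up as a self-contained function it would look something like this (a sketch; worth checking the generated code with gcc -S -mavx2 on your own setup):

#include <immintrin.h>

/* Zero-extend a 128-bit vector to 256 bits. GCC 4.8 reportedly lowers the
   insert-into-zero pattern to a single "vmovdqa xmmN, xmmN", which clears bits 255:128. */
static inline __m256i zext128_si256(__m128i x)
{
    return _mm256_inserti128_si256(_mm256_setzero_si256(), x, 0);
}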

 
