How to clear the upper 128 bits of a __m256i value?

How can I clear the upper 128 bits of m2:
__m256i    m2 = _mm256_set1_epi32(2);
__m128i    m1 = _mm_set1_epi32(1);

m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);
Neither of these is guaranteed to work: Intel's documentation for the _mm256_castsi128_si256 intrinsic says that “the upper bits of the resulting vector are undefined”.
At the same time, I can easily do it in assembly:
VMOVDQA xmm2, xmm2
VMOVDQA xmm2, xmm1

Of course, I would rather not use _mm256_insertf128_si256().
 

Do you mean something like

#ifdef __AVX__
_mm256_zeroupper();
#endif

That is what is used to let gcc compile SSE intrinsics in an AVX build. icc translates SSE intrinsics to AVX-128 and suppresses the zeroupper(), so maybe this is confusing. I haven't checked what MSVC does.

Tim,

I want to zero the higher 128-bit part of ONE variable only.
In other words, to figure out how to mimic VMOVDQA xmm2, xmm1 (clears upper half of ymm2) with preferably one intrinsic.

 

m2 = _mm256_permute2x128_si256(m2, m2, 0x80); // bit 7 of the control byte zeroes the upper lane
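
To make the control byte concrete, here is a minimal sketch of the permute approach. In the VPERM2I128 immediate, bits 1:0 select the source of the low lane and bit 7 forces the upper lane to zero, so 0x80 keeps the low lane and clears the upper one, whereas a value with bit 7 clear (such as 0x40) would copy the low lane into both halves. The helper name clear_upper_permute and the TARGET_AVX2 macro are made up for illustration; the macro assumes GCC or Clang and only serves to build the function without -mavx2. An AVX2-capable CPU is assumed at run time.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for one function only. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: copies src[0..3] to dst[0..3] and zeroes dst[4..7]. */
TARGET_AVX2
void clear_upper_permute(const int32_t src[8], int32_t dst[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    /* imm8[1:0] = 0: low half comes from the low lane of the first operand.
       imm8[7]   = 1: upper half is forced to zero. */
    v = _mm256_permute2x128_si256(v, v, 0x80);
    _mm256_storeu_si256((__m256i *)dst, v);
}
```

The arrays are passed through memory only so that the caller does not itself need to be compiled with AVX enabled.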

 

andysem:
_mm256_permute2x128_si256 is even slower than _mm256_insertf128_si256.
Actually, I want the compiler to use VMOVDQA.

m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.

Tim

Sorry for the off-topic question.

Do you have any problems with conditional compilation of AVX intrinsics when the predefined __AVX__ macro is used?

Quote:

Vladimir Sedach wrote:

Actually, I want the compiler to use VMOVDQA.

m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.

_mm256_castsi128_si256 is basically a no-op and is only there to perform type casts. In conjunction with other intrinsics it can be completely elided from the resulting code.

What you request is actually a new intrinsic, which might not be a bad idea.
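
For readers on newer toolchains: an intrinsic of exactly this kind, _mm256_zextsi128_si256, is listed in the Intel Intrinsics Guide and is documented to zero the upper 128 bits of the result. Whether your compiler ships it is an assumption you need to check (it is not available in compilers as old as the ones discussed in this thread). A minimal sketch, with the helper name widen_zero_extend and the TARGET_AVX macro invented for illustration (the macro assumes GCC or Clang):

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX for one function only. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: widens four ints to eight, upper four are zero. */
TARGET_AVX
void widen_zero_extend(const int32_t src[4], int32_t dst[8])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    /* Unlike _mm256_castsi128_si256, the upper 128 bits of the result
       are defined to be zero. */
    __m256i v = _mm256_zextsi128_si256(lo);
    _mm256_storeu_si256((__m256i *)dst, v);
}
```

Which instruction the compiler emits for this is up to the compiler; only the zeroed upper half is guaranteed.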

andysem,

Yes, it could be an intrinsic.
Though I would rather allow implicit conversions (with zero extension if needed) between vector types that differ only in the number of elements:

__m128d md1;
__m256d md2 = md1; //zero extension
md1 = md2; //truncation 

after all, it is already being done with byte "arrays":

unsigned char c;
unsigned int i = c; //zero extension to 4-byte array
c = i; //truncation 

GCC 4.8 recognizes this pattern:

__m256i ymm = _mm256_inserti128_si256(_mm256_setzero_si256(), xmm, 0);

and emits a "vmovdqa xmmN, xmmN" instruction that clears the upper lane. It could potentially optimize away this instruction as well if it is known that the original xmm was filled with a VEX-encoded instruction (which is almost always the case), but it doesn't do that. I think this is as close as you can get to hand-written assembly.
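
A self-contained version of that pattern, as a rough sketch: the result is well defined regardless of how the compiler lowers it, because the upper lane comes from _mm256_setzero_si256(). Whether it collapses to a single vmovdqa depends on the compiler and version, so check the generated assembly rather than assuming it. The helper name widen_insert_into_zero and the TARGET_AVX2 macro are invented for illustration; the macro assumes GCC or Clang, and _mm256_inserti128_si256 needs AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 for one function only. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: places four ints in the low lane of an all-zero
   256-bit vector and stores the result. */
TARGET_AVX2
void widen_insert_into_zero(const int32_t src[4], int32_t dst[8])
{
    __m128i xmm = _mm_loadu_si128((const __m128i *)src);
    /* Lane index 0 replaces the low 128 bits; the upper lane stays zero. */
    __m256i ymm = _mm256_inserti128_si256(_mm256_setzero_si256(), xmm, 0);
    _mm256_storeu_si256((__m256i *)dst, ymm);
}
```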
 
