I'm looking for a way to do what _mm256_cast??128_??256 does, just without the part where it says "the upper 128 bits are undefined". I execute some VEX-coded SSE instruction, which leaves the result in the lower 128 bits of the register and zeroes the upper 128 bits. Now I want to keep using this register with an AVX intrinsic, or just store the whole 256 bits to memory. With the currently available intrinsics, I see no safe way other than to use
_mm256_insertf128_??(_mm256_cast??128_??256(x), _mm_setzero_??(), 1), which is major overkill for something that, in reality, doesn't need any extra instructions.
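To make the workaround concrete, here is a minimal sketch of it for the float case (the ?? filled in with ps). The helper names zero_extend_ps and store_zero_extended_ps are made up for illustration, and the AVX_TARGET macro is only my assumption for letting GCC/clang accept the AVX intrinsics without -mavx; it is not part of any intrinsics API.

```c
#include <immintrin.h>

/* Assumption: on GCC/clang, enable AVX per function so no -mavx flag is
   needed; on other compilers (e.g. MSVC) the macro expands to nothing. */
#if defined(__GNUC__) || defined(__clang__)
#  define AVX_TARGET __attribute__((target("avx")))
#else
#  define AVX_TARGET
#endif

/* Hypothetical helper: widen a __m128 to a __m256 with a defined upper half.
   The cast alone leaves the upper 128 bits undefined; the insert then
   overwrites lane 1 with an explicitly zeroed vector. */
AVX_TARGET static inline __m256 zero_extend_ps(__m128 x)
{
    return _mm256_insertf128_ps(_mm256_castps128_ps256(x),
                                _mm_setzero_ps(), 1);
}

/* Hypothetical helper: load 4 floats, zero-extend, store all 8 floats. */
AVX_TARGET void store_zero_extended_ps(const float in[4], float out[8])
{
    __m128 lo = _mm_loadu_ps(in);              /* lower 128 bits: the result */
    _mm256_storeu_ps(out, zero_extend_ps(lo)); /* upper 128 bits: zeros      */
}
```

Calling store_zero_extended_ps with four input floats should write those four values followed by four zeros. The point of the question stands, though: the insert is a separate operation in the source even when the upper half is already known to be zero.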
From my tests, the cast intrinsic does what I want when I use clang, GCC, or ICC. MSVC, however, prefers to do the cast via a 128-bit store plus a 256-bit load (stupid compiler). And even if I got lucky with MSVC too, I'd rather not depend on undefined behavior. Do you have any idea how to do this? A compiler-specific solution would also be interesting, e.g. an inline asm statement that does what I need.