Looking for smartest way to insert a DWORD into AVX register

Elmar

Hi all,

I'm looking for the smartest (= fastest) way to insert a DWORD into an AVX register.

Here is what I found so far:

AVX vinsertps doesn't work because it clears the upper 128 bits, and the immediate value can't address the upper 128 bits anyway.

AVX vpinsrd doesn't work for the same reason and, truly sad unless the docs are wrong, hasn't been promoted to 256 bits in AVX2, even though the immediate value has room to encode an insert position within a 256-bit vector.

There are lots of multi-instruction workarounds I could think of, but I hoped the Intel engineers have a smart trick for this basic operation that I overlooked.

Thanks,

Elmar

Sergey Kostrov

Did you consider

/*
* Scalar to 128/256-bit vector broadcast operations.
*/
extern __m256i __ICL_INTRINCC _mm256_broadcastd_epi32( __m128i );

intrinsic function?

Elmar

Hi Sergey,

Thanks, but vpbroadcastd fills the entire vector. I want to insert a single DWORD at a given location (like vpinsrd), and I want to do that fast, without consuming an extra temporary register (e.g. combining vpbroadcastd with vpblendd is a workaround that needs an extra register).

CU,

Elmar
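
For reference, the broadcast-plus-blend workaround mentioned above can be sketched in C intrinsics roughly as follows. This is only an illustration, not a recommendation from the thread: the function name `insert_dword6_array`, the fixed target position (element 6), and the use of GCC/Clang `target` attributes are assumptions made for the example. Note that it needs the extra temporary register the poster wants to avoid.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: overwrite element 6 of an 8-dword vector with x.
 * Assumes GCC or Clang (target attribute) and a CPU with AVX2. */
__attribute__((target("avx2")))
void insert_dword6_array(int32_t data[8], int32_t x)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)data);

    /* vmovd + vpbroadcastd: x replicated into all eight dwords (temporary). */
    __m256i b = _mm256_broadcastd_epi32(_mm_cvtsi32_si128(x));

    /* vpblendd: immediate bit i set means "take dword i from b".
     * 0x40 has only bit 6 set, so only element 6 is replaced. */
    v = _mm256_blend_epi32(v, b, 0x40);

    _mm256_storeu_si256((__m256i *)data, v);
}
```

The blend position has to be a compile-time immediate, so a runtime index would need a switch or a variable-mask approach instead.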

Sergey Kostrov

What about these two intrinsic functions:
...
extern __m256i __ICL_INTRINCC _mm256_set_epi32( int, int, int, int, int, int, int, int );
...
and
...
extern __m256i __ICL_INTRINCC _mm256_setr_epi32( int, int, int, int, int, int, int, int );
...
Examples of application for _mm256_set_epi32 could look like:
...
__m256i v1 = _mm256_set_epi32( 0, 77, 0, 0, 0, 0, 0, 0 );
or
__m256i v2 = _mm256_set_epi32( 0, 0, 0, 0, 0, 0, 77, 0 );
...
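
A small sketch of where the value 77 ends up with the two intrinsics above, since their argument order differs: `_mm256_set_epi32` lists elements from highest to lowest, `_mm256_setr_epi32` from lowest to highest. The function names and the store to a plain array are assumptions for illustration, as is the use of GCC/Clang `target` attributes. Both calls build a complete vector from scratch rather than inserting into an existing one.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: build a vector with _mm256_set_epi32 and store it.
 * Arguments run from element 7 down to element 0, so 77 lands in element 6. */
__attribute__((target("avx")))
void store_set_example(int32_t out[8])
{
    __m256i v = _mm256_set_epi32(0, 77, 0, 0, 0, 0, 0, 0);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Hypothetical helper: same idea with _mm256_setr_epi32.
 * Arguments run from element 0 up to element 7, so 77 lands in element 6. */
__attribute__((target("avx")))
void store_setr_example(int32_t out[8])
{
    __m256i v = _mm256_setr_epi32(0, 0, 0, 0, 0, 0, 77, 0);
    _mm256_storeu_si256((__m256i *)out, v);
}
```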

andysem

@Sergey Kostrov: These are multi-instruction constructs, which basically come down to broadcasts or moves+shuffles. And the OP seems to want to inject a single dword into an existing register filled with data.

I don't think there is a single instruction for this. But depending on the surrounding code and actual requirements you could use vshufps on the two registers (the one with the old contents and the other with the loaded dword). The downside is that the lower and upper halves of ymm are shuffled the same way, so you'll have to insert two values this way. This can be mitigated by copying the half to be preserved from the original data register to the dword register first (see vperm2f128). But you save a mask register that would be needed in case of blend.

bronxzv

Quote:

Elmar wrote:AVX vinsertps doesn't work because it clears the upper 128bits and the immediate value can't address the upper 128bits anyway

Simply use vinsertps followed by vinsertf128; this is the fastest available option AFAIK. I use it for my AVX legacy generic gather path, detailed here for example: http://software.intel.com/en-us/comment/reply/285867/1740679
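
A minimal sketch of the vinsertps + vinsertf128 idea in C intrinsics, for a value that goes into the upper 128-bit lane. The helper names, the fixed position (element 5), and the GCC/Clang `target` attributes are assumptions for illustration. The leading vextractf128 is also an assumption: it is needed here only because the other three elements of the upper lane should be preserved, whereas a gather path that builds each half from scratch would not need it.

```c
#include <immintrin.h>

/* Hypothetical helper: replace element 5 (upper lane, slot 1) of v with x.
 * Assumes GCC or Clang (target attribute) and a CPU with AVX. */
__attribute__((target("avx")))
static inline __m256 insert_elem5(__m256 v, float x)
{
    /* vextractf128: fetch the upper lane so its other elements survive. */
    __m128 hi = _mm256_extractf128_ps(v, 1);

    /* vinsertps: imm8 = 0x10 means source element 0 -> destination
     * element 1, with no elements zeroed. */
    hi = _mm_insert_ps(hi, _mm_set_ss(x), 0x10);

    /* vinsertf128: write the modified lane back into the upper half. */
    return _mm256_insertf128_ps(v, hi, 1);
}

/* Array wrapper so callers compiled without AVX flags can use the helper. */
__attribute__((target("avx")))
void insert_elem5_array(float data[8], float x)
{
    __m256 v = _mm256_loadu_ps(data);
    v = insert_elem5(v, x);
    _mm256_storeu_ps(data, v);
}
```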

 

Elmar

Quote:

andysem wrote:

I don't think there is a single instruction for this. But depending on the surrounding code and actual requirements you could use vshufps on the two registers (the one with the old contents and the other with the loaded dword). The downside is that the lower and upper halves of ymm are shuffled the same way, so you'll have to insert two values this way. This can be mitigated by copying the half to be preserved from the original data register to the dword register first (see vperm2f128). But you save a mask register that would be needed in case of blend.

But vshufps can only insert a QWORD, since two adjacent DWORDs must come from the same operand, no?

For inserting a DWORD, I currently use vinsertps or vpermilps to place the DWORD at the right spot in an unused register, and then vblendps to move the DWORD into the target register (note that vblendps takes an immediate blend factor, not a mask register). If the DWORD crosses a lane, I need a third instruction for the cross-lane shuffle.

I had hoped that Intel engineers would immediately fire the optimal solution at me (in terms of false dependencies, latency etc.), but it seems that they are busy (hopefully cleaning up the AVX2 manual #319433-014, because that's full of bugs ;-))...

Thanks,

Elmar
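
The three-step sequence described in this post (position the DWORD, move it across the lane boundary, blend with an immediate) might look roughly like this in C intrinsics. It is a sketch under assumptions: the function name, the fixed target position (element 5), the choice of vinsertf128 as the cross-lane step, and the GCC/Clang `target` attributes are all illustrative and not taken from the poster's code.

```c
#include <immintrin.h>

/* Hypothetical helper: replace element 5 of an 8-float vector with x
 * using permute + cross-lane copy + immediate blend.
 * Assumes GCC or Clang (target attribute) and a CPU with AVX. */
__attribute__((target("avx")))
void blend_elem5_array(float data[8], float x)
{
    __m256 v = _mm256_loadu_ps(data);

    /* Step 1, vpermilps (128-bit form): copy x from slot 0 into every
     * slot of a scratch register, so slot 1 holds x. */
    __m128 t = _mm_permute_ps(_mm_set_ss(x), 0x00);

    /* Step 2, cross-lane move: vinsertf128 puts the scratch lane into the
     * upper half, so element 5 of t2 holds x. */
    __m256 t2 = _mm256_insertf128_ps(_mm256_castps128_ps256(t), t, 1);

    /* Step 3, vblendps: immediate bit i set means "take element i from
     * t2". 0x20 has only bit 5 set. No mask register is involved. */
    v = _mm256_blend_ps(v, t2, 0x20);

    _mm256_storeu_ps(data, v);
}
```

For a target in the lower lane, step 2 can be dropped and the scratch register blended directly, which matches the two-instruction case described above.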

Sergey Kostrov

Elmar, I did a verification and with these intrinsics:

>>...
>>__m256i v1 = _mm256_set_epi32( 0, 77, 0, 0, 0, 0, 0, 0 );
>>or
>>__m256i v2 = _mm256_setr_epi32( 0, 0, 0, 0, 0, 0, 77, 0 );
>>...

a performance impact is possible, and implementing similar functionality with native instructions could be faster. Please do a performance evaluation if you decide to use these two intrinsic functions.

andysem

> But vshufps can only insert a QWORD, since two adjacent DWORDS must come from the same operand,no?

You're right, sorry for the confusion. It seems inserts and blends are the way to go.
