Q&A: SSE2 Optimizations

Here is a question received by Intel Developer Services Support, along with the response provided by our Application Engineers:

Q. I'm using the SSE2 extensions quite heavily, and there's one step of my calculations that I can't find a good way to do, but it's simple enough that I think there's a "right" way to do it. I've got 8 signed 16-bit words in an SSE2 register, and I want to obtain the sum of all of the words held in the register. That's it. My current method uses the shift and add instructions until I get the whole sum, which takes about 9 instructions. These 8 values are the result of a series of parallel SSE2 operations, so I'm pretty much stuck with that layout. If you could give me suggestions, I would very much appreciate it.

A. I would recommend using the PSHUFD instruction to copy the four high words into the lower 64 bits of a second XMM register, and then using the PSHUFLW and PADDW instructions to combine the results. Something like this (assume the eight 16-bit words are located in the xmm0 register):

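PSHUFD  xmm1, xmm0, 0xEE   ; low 64 bits of xmm1 = words 4..7 of xmm0
PADDW   xmm0, xmm1         ; words 0..3 = four pairwise sums
PSHUFLW xmm1, xmm0, 0xEE   ; words 0,1 of xmm1 = words 2,3 of xmm0
PADDW   xmm0, xmm1         ; words 0,1 = two partial sums
PSHUFLW xmm1, xmm0, 0xE5   ; word 0 of xmm1 = word 1 of xmm0
PADDW   xmm0, xmm1         ; word 0 = sum of all eight words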

At the end, the sum of the eight 16-bit values should be located in the lower 16 bits of the xmm0 register; the other words hold leftover partial sums and should be ignored. Because PADDW wraps on overflow, the result is the sum modulo 2^16. This uses six instructions (including the initial PSHUFD instruction), and the PSHUFLW instructions are more efficient than the byte-wise shift instructions. This should be more efficient than your current implementation.
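
For anyone working in C rather than assembly, the same reduction can be sketched with the SSE2 intrinsics from emmintrin.h. This is only an illustration: the function name hsum_epi16 is made up for this example, and the code assumes a compiler targeting x86 with SSE2 enabled. The final shuffle uses the immediate 0xE5 so that word 1 is moved into word 0 before the last add.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (name chosen for this sketch): returns the sum of
 * the eight signed 16-bit words in v, wrapping modulo 2^16 like PADDW. */
static int16_t hsum_epi16(__m128i v)
{
    /* PSHUFD 0xEE: copy the high 64 bits (words 4..7) into the low 64 bits. */
    __m128i t = _mm_shuffle_epi32(v, 0xEE);
    v = _mm_add_epi16(v, t);            /* words 0..3 = four pairwise sums */

    /* PSHUFLW 0xEE: move words 2,3 into words 0,1. */
    t = _mm_shufflelo_epi16(v, 0xEE);
    v = _mm_add_epi16(v, t);            /* words 0,1 = two partial sums */

    /* PSHUFLW 0xE5: move word 1 into word 0. */
    t = _mm_shufflelo_epi16(v, 0xE5);
    v = _mm_add_epi16(v, t);            /* word 0 = total */

    /* MOVD the low 32 bits out and keep only the low 16-bit word. */
    return (int16_t)_mm_cvtsi128_si32(v);
}
```

If the inputs can be large enough for the 16-bit sum to overflow, the words would need to be widened to 32 bits before adding; the sketch above does not attempt that.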

==
Lexi S.
Intel Developer Services Support
http://www.intel.com/IDS
email: IDS.support@intel.com
