I have an integer-only application that I want to speed up a bit with SSE2. The tight loop uses add/sub/shift on int32_t, so I could easily convert it with intrinsics.
Before the tight loop, the program performs an expensive setup step, which also operates on vectors but involves some multiplication and division. Vectorizing the multiplication with SSE could gain some more performance (I'm leaving the division as it is, since there's no integer division in SSE).
I looked up the intrinsics and found that the PMULUDQ instruction is used for two things:
- multiply a signed int by a signed int (32-bit x 32-bit -> 64-bit): _mm_mul_su32()
- multiply 2 unsigned ints by 2 unsigned ints (32-bit x 32-bit -> 64-bit again): _mm_mul_epu32()
What I need is a signed 32-bit x 32-bit -> 32-bit multiply (4-int vector), or at least a signed 32-bit x 32-bit -> 64-bit multiply (2-int vector), but the former is only available in SSE4.1. At first glance this didn't seem possible with SSE2, but after Googling around I found that people do use _mm_mul_epu32() for _signed_ integer multiplication. I've written a small function with intrinsics (based on asm I found via Google) and it really works, but I don't know why:
static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b); /* mul 2,0 */
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* mul 3,1 */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0,0,2,0))); /* shuffle results to [63..0] and pack */
}
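For reference, this is roughly how the "it really works" claim can be checked. It is only a sketch: muly() is repeated so the snippet compiles on its own, and muly_matches_scalar() is a hypothetical helper name, not something from a library. It compares each lane against the scalar product truncated to 32 bits.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Same shuffle-and-unpack approach as muly() above, repeated to keep the sketch self-contained.
static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b);  // 64-bit products of lanes 0 and 2
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));  // 64-bit products of lanes 1 and 3
    // Keep only the low 32 bits of each product and interleave them back into lane order.
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
}

// Hypothetical helper: true if every lane of muly() equals the scalar
// product of the same lane, truncated to 32 bits.
static bool muly_matches_scalar(const int32_t (&a)[4], const int32_t (&b)[4])
{
    int32_t out[4];
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), muly(va, vb));

    for (int i = 0; i < 4; ++i) {
        // Multiply as uint32_t so that wraparound is well defined in C++.
        uint32_t expected = static_cast<uint32_t>(a[i]) * static_cast<uint32_t>(b[i]);
        if (static_cast<uint32_t>(out[i]) != expected)
            return false;
    }
    return true;
}
```

Calling muly_matches_scalar() with inputs such as {-3, 7, -100000, 2147483647} and {5, -9, 30000, -2} exercises both negative operands and products that overflow 32 bits.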
Can someone please explain why this works?
Negative integers in 2's complement have their most significant bits set to 1, and this is true for the lower 4 bytes of the result (that's why I get a proper 32-bit signed int), but in the upper 4 bytes the bit pattern is 1011 (decimal 11).
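To make that observation concrete, here is a scalar sketch of what I mean, using -3 * 5 as an arbitrary example. The names Halves, unsigned_product and signed_product are made up for illustration. It computes the 64-bit product once with both inputs treated as unsigned (as _mm_mul_epu32() does for each lane it multiplies) and once with them sign-extended, then splits each result into its two 32-bit halves.

```cpp
#include <cstdint>

// Hypothetical helper type: the two 32-bit halves of a 64-bit product.
struct Halves {
    uint32_t high;
    uint32_t low;
};

// 32 x 32 -> 64 product with both inputs reinterpreted as unsigned.
static Halves unsigned_product(int32_t a, int32_t b)
{
    uint64_t p = static_cast<uint64_t>(static_cast<uint32_t>(a)) * static_cast<uint32_t>(b);
    return { static_cast<uint32_t>(p >> 32), static_cast<uint32_t>(p) };
}

// 32 x 32 -> 64 product with both inputs sign-extended first.
static Halves signed_product(int32_t a, int32_t b)
{
    int64_t p = static_cast<int64_t>(a) * static_cast<int64_t>(b);
    uint64_t bits = static_cast<uint64_t>(p);  // same bit pattern, unsigned view
    return { static_cast<uint32_t>(bits >> 32), static_cast<uint32_t>(bits) };
}
```

For -3 * 5, both functions give 0xFFFFFFF1 in the low half, but the high half is 0x00000004 for the unsigned product and 0xFFFFFFFF for the signed one.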
How do the sign bits in the upper 4 bytes get magically fixed when I use _mm_mul_su32()? It maps to the same instruction, doesn't it?