In the documentation for the intrinsic _mm_mulhrs_epi16, the right shift should be 15 and not 14.
14 bits is correct. See the Instruction Set Reference in the Software Developer's Manual: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
PMULHRSW (with 128-bit operand)
temp0[31:0] = INT32 ((DEST[15:0] * SRC[15:0]) >>14) + 1;
temp1[31:0] = INT32 ((DEST[31:16] * SRC[31:16]) >>14) + 1;
temp2[31:0] = INT32 ((DEST[47:32] * SRC[47:32]) >>14) + 1;
temp3[31:0] = INT32 ((DEST[63:48] * SRC[63:48]) >>14) + 1;
temp4[31:0] = INT32 ((DEST[79:64] * SRC[79:64]) >>14) + 1;
temp5[31:0] = INT32 ((DEST[95:80] * SRC[95:80]) >>14) + 1;
temp6[31:0] = INT32 ((DEST[111:96] * SRC[111:96]) >>14) + 1;
temp7[31:0] = INT32 ((DEST[127:112] * SRC[127:112]) >>14) + 1;
DEST[15:0] = temp0[16:1];
DEST[31:16] = temp1[16:1];
DEST[47:32] = temp2[16:1];
DEST[63:48] = temp3[16:1];
DEST[79:64] = temp4[16:1];
DEST[95:80] = temp5[16:1];
DEST[111:96] = temp6[16:1];
DEST[127:112] = temp7[16:1];
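For anyone who wants to step through this by hand, here is a rough scalar sketch of a single 16-bit lane, written directly from the pseudocode above. The function name mulhrs_lane is made up for illustration and is not part of any Intel header. It assumes the compiler does an arithmetic right shift on negative signed values and wraps on the final narrowing cast, which is what mainstream x86 compilers do but is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 16-bit lane of PMULHRSW.
   Follows the manual's pseudocode:
     temp   = ((a * b) >> 14) + 1
     result = temp[16:1]                                         */
static int16_t mulhrs_lane(int16_t a, int16_t b)
{
    /* Full 32-bit signed product of the two lanes. */
    int32_t product = (int32_t)a * (int32_t)b;

    /* Shift right by 14, then add 1 (the rounding bit).
       Assumes an arithmetic shift for negative products. */
    int32_t temp = (product >> 14) + 1;

    /* Taking bits [16:1] is one more shift right by 1,
       keeping the low 16 bits of what remains. */
    return (int16_t)(temp >> 1);
}
```

With this model, mulhrs_lane(36, 1 << 14) gives 18, so a multiplier of 1<<14 acts as 0.5 rather than 1.0.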
I still do not understand...
I tried the following piece of code:
float factor = 1.f;
__m128i vFactor = _mm_set1_epi16(factor*(1<<14)); // Using fixed point..
__m128i inputVec = _mm_set_epi16(32,54,124,75,35,235,244,36);
__m128i resultVec = _mm_mulhrs_epi16(inputVec,vFactor);
By your explanation I should get resultVec = inputVec, but the result elements are actually half the original values.
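If you read the documentation carefully, you will notice an additional hidden shift by 1. The temp*[16:1] can be read as (temp*[31:0]>>1)[15:0]. So the net shift is 15 bits, which means a factor scaled by 1<<14 represents 0.5 and halves the values.
It might make sense to make the documentation more explicit about this.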
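To make the net 15-bit shift concrete, here is a small scalar sketch that treats the multiplier as a Q15 fixed-point value (scaled by 1<<15). The helpers to_q15 and mulhrs_q15 are hypothetical names of my own, not Intel APIs. Note that 1.0 would need 32768, which does not fit in a signed 16-bit lane, so the sketch saturates to 32767 as the closest representable value. As before, it assumes an arithmetic right shift for negative values.

```c
#include <stdint.h>

/* Hypothetical helper: convert a factor to Q15 (scaled by 1 << 15),
   saturating to the signed 16-bit range. 1.0 becomes 32767. */
static int16_t to_q15(float factor)
{
    float scaled = factor * 32768.0f;
    if (scaled >= 32767.0f)  return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    return (int16_t)scaled;              /* truncates toward zero */
}

/* Hypothetical one-lane equivalent of the instruction, written as a
   single rounded shift by 15: adding 1 << 14 before the shift gives
   the same result as ((product >> 14) + 1) >> 1. */
static int16_t mulhrs_q15(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)((product + (1 << 14)) >> 15);
}
```

With the factor built this way, mulhrs_q15(244, to_q15(1.0f)) returns 244 and mulhrs_q15(36, to_q15(0.5f)) returns 18. In the vector code from the question, the corresponding change would be to scale by 1<<15 and clamp to 32767 before calling _mm_set1_epi16.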
I agree the documentation for this function is not the best one.