I wrote a C version and an AVX version of a routine that converts an aligned buffer of size 1920*1280*3 from float to short.
The AVX implementation is 3 times slower than the C code.
Here is the AVX code for the float-to-short conversion:
    for (int i = numOfElems; i; --i, pOut += 3, pIn1 += 24, pIn2 += 24, pIn3 += 24)
    {
        // Load 3 x 8 floats and convert them to 32-bit integers
        __m256i intVec1 = _mm256_cvtps_epi32(_mm256_load_ps(pIn1));
        __m256i intVec2 = _mm256_cvtps_epi32(_mm256_load_ps(pIn2));
        __m256i intVec3 = _mm256_cvtps_epi32(_mm256_load_ps(pIn3));

        // Split each 256-bit vector into halves and pack them into 8 saturated shorts
        __m128i intVec1L = _mm256_extractf128_si256(intVec1, 0);
        __m128i intVec1H = _mm256_extractf128_si256(intVec1, 1);
        pOut[0] = _mm_packs_epi32(intVec1L, intVec1H);

        __m128i intVec2L = _mm256_extractf128_si256(intVec2, 0);
        __m128i intVec2H = _mm256_extractf128_si256(intVec2, 1);
        pOut[1] = _mm_packs_epi32(intVec2L, intVec2H);

        __m128i intVec3L = _mm256_extractf128_si256(intVec3, 0);
        __m128i intVec3H = _mm256_extractf128_si256(intVec3, 1);
        pOut[2] = _mm_packs_epi32(intVec3L, intVec3H);
    }
As you can see, the main loop is unrolled three times, which gives a factor-of-3 speedup (without the unrolling, the C code is 9 times faster than the AVX version!).
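The C version is not shown here. For comparison, a minimal sketch of what an equivalent scalar loop might look like is given below. It assumes the baseline should saturate to the short range the way `_mm_packs_epi32` does. The names `float_to_short_sat` and `float2short_scalar` are hypothetical and are not taken from the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar baseline, NOT the original C code from the question.
   Rounds to the nearest integer and saturates to the int16_t range.
   Note: ties round away from zero here, whereas _mm256_cvtps_epi32 in the
   default rounding mode rounds ties to even, so inputs ending in exactly .5
   can differ by one between the two versions. */
static int16_t float_to_short_sat(float x)
{
    if (!(x > -32768.0f)) return INT16_MIN;  /* low saturation, also catches NaN */
    if (x >= 32767.0f)    return INT16_MAX;  /* high saturation */
    double d = x;                            /* adding 0.5 is exact in double */
    return (int16_t)(d >= 0.0 ? d + 0.5 : d - 0.5);
}

/* Convert n floats to shorts, one element per iteration. */
void float2short_scalar(const float *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = float_to_short_sat(in[i]);
}
```

When timing the two versions against each other, it may be worth checking that both really do the same work (rounding and saturation), since a plain `(short)` cast in the C loop would be cheaper and would not produce the same results for out-of-range values.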