Looking for efficient way to convert float (32 bit) aligned buffer to short (16 bit) aligned buffer

Looking for efficient way to convert float (32 bit) aligned buffer to short (16 bit) aligned buffer

I wrote a c-code and an AVX code to convert an alignedbuffer of size 1920*1280*3 from float to short.
The AVX implementation is 3 times slower than the c-code.

Here is the AVX code for the float2short:

for (int i = numOfElems;i;--i,pOut+=3,pIn1+=24,pIn2+=24,pIn3+=24)

{

__m256i intVec1 = _mm256_cvtps_epi32(_mm256_load_ps(pIn1));

__m256i intVec2 = _mm256_cvtps_epi32(_mm256_load_ps(pIn2));

__m256i intVec3 = _mm256_cvtps_epi32(_mm256_load_ps(pIn3));

__m128i intVec1L = _mm256_extractf128_si256(intVec1,0);

__m128i intVec1H = _mm256_extractf128_si256(intVec1,1);

pOut[0] = _mm_packs_epi32(intVec1L,intVec1H);

__m128i intVec2L = _mm256_extractf128_si256(intVec2,0);

__m128i intVec2H = _mm256_extractf128_si256(intVec2,1);

pOut[1] = _mm_packs_epi32(intVec2L,intVec2H);

__m128i intVec3L = _mm256_extractf128_si256(intVec3,0);

__m128i intVec3H = _mm256_extractf128_si256(intVec3,1);

pOut[2] = _mm_packs_epi32(intVec3L,intVec3H);

}

As you can notice the main loop is unrolled - so I get factor 3 acceleration (without it the c-code is 9 times faster than the AVX !!!).

2 Beiträge / 0 neu
Letzter Beitrag
Nähere Informationen zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.

This looks like an algorithm a decent compiler can auto-vectorize. Have you checked whether it did?
I can't see a problem from the code you posted. I recommend you take a look at the generated asm. Probably something stupid is happening somewhere.
Finally I can recommend you use IACA (Intel Architecture Code Analyzer) to let it annotate the asm a bit.

Vc: SIMD Vector Classes for C++ http://code.compeng.uni-frankfurt.de/projects/vc

Kommentar hinterlassen

Bitte anmelden, um einen Kommentar hinzuzufügen. Sie sind noch nicht Mitglied? Jetzt teilnehmen