I'm comparing two programs, one written using SSE and the other using AVX. My aim is to show that the AVX version runs twice as fast, but I'm losing something like 20% to some shift operations.
I quite often need to rotate an AVX vector left by 1 byte. It seems that all the instructions I need are only available with AVX2.
Currently I split the source __m256i vector into two __m128i halves, but this costs performance. Is there any other way to perform this operation? Why were shift operations not included in the AVX instruction set?
Thanks in advance for your help. Here's the current version of my code:
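    /* split the 256-bit source into its low and high 128-bit halves */
    a1 = _mm256_castsi256_si128(_source);
    a2 = _mm256_extractf128_si256(_source, 1);
    /* shift each half left by one byte */
    b1 = _mm_slli_si128(a1, 1);
    b2 = _mm_slli_si128(a2, 1);
    /* isolate the top byte of each half in byte position 0 */
    a1 = _mm_srli_si128(a1, 15);
    a2 = _mm_srli_si128(a2, 15);
    /* the top byte of each half wraps into the bottom of the other half */
    _dest = _mm256_castsi128_si256(_mm_or_si128(b1, a2));
    _dest = _mm256_insertf128_si256(_dest, _mm_or_si128(b2, a1), 1);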
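To pin down exactly what the operation should do, here is a minimal self-contained sketch of the same split-and-recombine approach, together with a scalar reference it can be checked against. The names `rotl1_bytes_avx`, `rotl1_bytes_ref` and the `TARGET_AVX` macro are hypothetical and only for illustration. The `target("avx")` attribute assumes GCC or Clang; with other compilers the file would have to be built with AVX enabled instead.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang-style per-function target attribute. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Scalar reference: rotate 32 bytes left by one byte position,
   i.e. byte i moves to byte i + 1 and byte 31 wraps around to byte 0. */
void rotl1_bytes_ref(const uint8_t *src, uint8_t *dst)
{
    dst[0] = src[31];
    for (int i = 1; i < 32; ++i)
        dst[i] = src[i - 1];
}

/* AVX (no AVX2) version: split into two 128-bit halves, shift each half,
   and OR in the byte carried over from the other half. */
TARGET_AVX
void rotl1_bytes_avx(const uint8_t *src, uint8_t *dst)
{
    __m256i v  = _mm256_loadu_si256((const __m256i *)src);
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extractf128_si256(v, 1);

    /* new low half:  (lo << 1 byte) | top byte of hi */
    __m128i new_lo = _mm_or_si128(_mm_slli_si128(lo, 1),
                                  _mm_srli_si128(hi, 15));
    /* new high half: (hi << 1 byte) | top byte of lo */
    __m128i new_hi = _mm_or_si128(_mm_slli_si128(hi, 1),
                                  _mm_srli_si128(lo, 15));

    __m256i r = _mm256_castsi128_si256(new_lo);
    r = _mm256_insertf128_si256(r, new_hi, 1);
    _mm256_storeu_si256((__m256i *)dst, r);
}
```

Running both functions on the same 32-byte buffer and comparing the outputs byte by byte gives a quick check that any faster replacement still computes the same rotation.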