AVX sometimes slower than SSE
Has anyone experienced a slowdown by a factor of around 2 for certain functions when they are converted from SSE to AVX-128?
My setup:
Compiler: Intel Compiler icc V12.0.0.20101116
Linux kernel: 2.6.32-71.el6.x86_64
Processor: Intel Core i7-2600K CPU @ 3.4GHz
Intel SpeedStep: disabled
Affinity: locked to 1 core
Memory: allocated 32-byte aligned
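For reference, one way the 32-byte-aligned allocation could be done on Linux is sketched below. It assumes posix_memalign is available; the helper names alloc_floats_aligned32 and is_aligned32 are hypothetical and not taken from the actual test code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign under strict -std= modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n floats on a 32-byte boundary.
   Returns NULL on failure; release the buffer with free(). */
static float *alloc_floats_aligned32(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Hypothetical helper: returns 1 if p sits on a 32-byte boundary, else 0. */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```

Checking is_aligned32 on both vectors before timing is a cheap way to rule out an allocation problem when the compiler still emits unaligned loads.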
My compiler flags:
SSE: -m64 -msse3 -axSSE3 -align
AVX: -m64 -xavx -align
I have compiled the following function:
inline void vec_vec_add_overwrite( float *vec1, float *vec2, int n )
{
    long ii;
    for( ii = 0; ii < n; ii++ )
    {
        vec1[ii] += vec2[ii];
    }
}
My tests run as follows:
SetAffinity( core 0 )
overhead = GetClockOverhead( NUMTESTS )
memset( clocks, 0, NUMTESTS * sizeof(clocks[0]) )
n = 5123 /* vector lengths */
for( i = 0; i < NUMTESTS; i++ )
{
    vec1 = malloc( aligned 32, n length )
    vec2 = malloc( aligned 32, n length )
    fill_with_random( vec1 )
    fill_with_random( vec2 )
    _mm_clflush( vec1 );
    _mm_clflush( vec2 );
    _mm_mfence();
    before = ReadTSC() /* uses assembly CPUID call */
    vec_vec_add_overwrite( vec1, vec2, n );
    clocks[i] = ReadTSC() - before;
}
RemoveOverhead( clocks, NUMTESTS, overhead )
print average( clocks[IGNORED_START_INDEX : END] ) /* I throw out a handful of the first results to remove initial transients */
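The last two steps of the procedure (removing the timer overhead and averaging while skipping the first few samples) could look roughly like the sketch below. The function names remove_overhead and average_from are hypothetical stand-ins for RemoveOverhead and the averaging step, and clamping at zero is an assumption, since the post does not say how a sample smaller than the overhead is handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Subtract the measured timer overhead from every sample.
   Assumption: a sample smaller than the overhead is clamped to zero
   so the unsigned subtraction cannot wrap around. */
static void remove_overhead(uint64_t *clocks, size_t count, uint64_t overhead)
{
    for (size_t i = 0; i < count; i++)
        clocks[i] = (clocks[i] > overhead) ? clocks[i] - overhead : 0;
}

/* Average clocks[first .. count-1], skipping the warm-up samples before
   index `first`. Returns 0.0 if that range is empty. */
static double average_from(const uint64_t *clocks, size_t first, size_t count)
{
    if (first >= count)
        return 0.0;
    double sum = 0.0;
    for (size_t i = first; i < count; i++)
        sum += (double)clocks[i];
    return sum / (double)(count - first);
}
```

Here `first` plays the role of IGNORED_START_INDEX. Reporting the minimum or median next to the average can also help when a few samples are disturbed by interrupts.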
The SSE version looks roughly like this (AT&T-style assembly, destination on the right):
movss vec1, xmm1
addss vec2, xmm1
movss xmm1, vec1
...
...
.L_aligned:
movaps vec1, xmm1
addps vec2, xmm1
movaps xmm1, vec1 /*this block unrolled twice*/
...
...
.L_unaligned:
movups vec1, xmm1
movups vec2, xmm2 /*this block unrolled twice*/
addps xmm2, xmm1
movups xmm1, vec1
.L_finishup:
...
movss vec1, xmm1
addss vec2, xmm1
movss xmm1, vec1
...
ret
The AVX version looks roughly like this (AT&T-style assembly, destination on the right):
vmovss vec1, xmm1
vaddss vec2, xmm1, xmm2
vmovss xmm2, vec1
...
...
.L_partially_aligned:
...
vmovups vec1[0:3], xmm0
vinsertf128 $1, vec1[4:7], ymm0, ymm1 /* this block unrolled twice*/
vaddps vec2[0:7], ymm1, ymm2
vmovups ymm2, vec1[0:7]
....
.L_finishup:
...
vmovss vec1, xmm1
vaddss vec2, xmm1, xmm2
vmovss xmm2, vec1
...
vzeroupper
ret
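Both listings have the same three phases: a scalar loop at the start, a vector body, and a scalar finish-up loop. The sketch below shows how the iteration counts of those phases could be worked out for a given vector width. It assumes the leading scalar loop runs until vec1 reaches a vector-width boundary, which is a guess about the compiler's strategy rather than something read from the listings; the struct loop_split and the function split_loop are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* How one call splits into scalar peel, vector body, and scalar remainder. */
struct loop_split {
    size_t peel;      /* scalar iterations before the vector body */
    size_t body;      /* full-width vector iterations */
    size_t remainder; /* scalar iterations after the vector body */
};

/* Assumptions: the peel loop runs until `p` reaches a `vec_bytes` boundary
   (16 for SSE, 32 for 256-bit AVX), `p` is at least float-aligned, and
   `vec_bytes` is a nonzero multiple of sizeof(float). The pointer is only
   inspected, never dereferenced. */
static struct loop_split split_loop(const float *p, size_t n, size_t vec_bytes)
{
    size_t lanes = vec_bytes / sizeof(float);
    size_t misalign = (size_t)((uintptr_t)p % vec_bytes);
    size_t peel = misalign ? (vec_bytes - misalign) / sizeof(float) : 0;
    if (peel > n)
        peel = n;

    struct loop_split s;
    s.peel = peel;
    s.body = (n - peel) / lanes;
    s.remainder = (n - peel) % lanes;
    return s;
}
```

With n = 5123 and a 32-byte-aligned vec1, this gives no peel iterations, 640 eight-float iterations and 3 scalar iterations for the 256-bit case, against 1280 four-float iterations and 3 scalar iterations for the 128-bit case.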





