I'm trying to write a geometric mean sqrt(a * b) using AVX intrinsics, but it runs slower than molasses!

    #include <immintrin.h>

    // Declaration assumed; g_data was not shown in the original snippet.
    __m128i g_data[8];

    int main()
    {
        int count = 0;
        for (int i = 0; i < 100000000; ++i)
        {
            // Two vectors of eight 16-bit values, each in the range 0..15.
            __m128i v8n_a = _mm_set1_epi16((++count) % 16),
                    v8n_b = _mm_set1_epi16((++count) % 16);
            __m128i v8n_0 = _mm_set1_epi16(0);
            __m256i temp1, temp2;

            // Zero-extend the 16-bit lanes to 32 bits, then convert to float.
            __m256 v8f_a = _mm256_cvtepi32_ps(temp1 = _mm256_insertf128_si256(
                       _mm256_castsi128_si256(_mm_unpacklo_epi16(v8n_a, v8n_0)),
                       _mm_unpackhi_epi16(v8n_a, v8n_0), 1)),
                   v8f_b = _mm256_cvtepi32_ps(temp2 = _mm256_insertf128_si256(
                       _mm256_castsi128_si256(_mm_unpacklo_epi16(v8n_b, v8n_0)),
                       _mm_unpackhi_epi16(v8n_b, v8n_0), 1));

            // sqrt(a * b), converted back to 32-bit integers.
            __m256i v8n_meanInt32 = _mm256_cvtps_epi32(_mm256_sqrt_ps(_mm256_mul_ps(v8f_a, v8f_b)));

            // Split the 256-bit result into its low and high 128-bit halves.
            __m128i v4n_meanLo = _mm256_castsi256_si128(v8n_meanInt32),
                    v4n_meanHi = _mm256_extractf128_si256(v8n_meanInt32, 1);

            g_data[i % 8] = v4n_meanLo;
            g_data[(i + 1) % 8] = v4n_meanHi;
        }
        return 0;
    }
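
For reference, here is a minimal scalar sketch of what each of the eight lanes in the loop above is meant to compute. The names geometric_mean_lane and check_geometric_mean_lane are hypothetical helpers, not part of the original program, and the sketch assumes the default round-to-nearest mode for the final float-to-integer conversion.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper (not in the original code): scalar equivalent of one
// lane of the AVX loop. The 16-bit inputs are widened to float, multiplied,
// square-rooted, and rounded back to a 32-bit integer.
inline std::int32_t geometric_mean_lane(std::uint16_t a, std::uint16_t b)
{
    const float product = static_cast<float>(a) * static_cast<float>(b);
    // std::lrint rounds using the current rounding mode; this sketch assumes
    // the default round-to-nearest mode.
    return static_cast<std::int32_t>(std::lrint(std::sqrt(product)));
}

// Small self-check with hand-computed values.
inline void check_geometric_mean_lane()
{
    assert(geometric_mean_lane(4, 9) == 6);   // sqrt(36) is exactly 6
    assert(geometric_mean_lane(2, 4) == 3);   // sqrt(8) is about 2.83
    assert(geometric_mean_lane(0, 15) == 0);  // a zero input gives 0
}
```

Comparing a few lanes of g_data against this function is one way to confirm that both builds produce the same results and differ only in speed.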

The key to this mystery is that I'm using Intel ICC 11, and the program is only slow when compiled with icc -O3 sqrt.cpp. If I compile with icc -O3 -xavx sqrt.cpp, it runs roughly 13x faster (about 1.3 s instead of 16.9 s).

But it's not obvious whether any emulation is happening: I used performance counters, and the number of instructions executed is roughly 4G for both versions:

    Performance counter stats for 'a.out':

          16867.119538 task-clock                #    0.999 CPUs utilized
                    37 context-switches          #    0.000 M/sec
                     8 CPU-migrations            #    0.000 M/sec
                   281 page-faults               #    0.000 M/sec
        35,463,758,996 cycles                    #    2.103 GHz
        23,690,669,417 stalled-cycles-frontend   #   66.80% frontend cycles idle
        20,846,452,415 stalled-cycles-backend    #   58.78% backend cycles idle
         4,023,012,964 instructions              #    0.11  insns per cycle
                                                 #    5.89  stalled cycles per insn
           304,385,109 branches                  #   18.046 M/sec
                42,636 branch-misses             #    0.01% of all branches

          16.891160582 seconds time elapsed

-----------------------------------with -xavx----------------------------------------

    Performance counter stats for 'a.out':

           1288.423505 task-clock                #    0.996 CPUs utilized
                     3 context-switches          #    0.000 M/sec
                     2 CPU-migrations            #    0.000 M/sec
                   279 page-faults               #    0.000 M/sec
         2,708,906,702 cycles                    #    2.102 GHz
         1,608,134,568 stalled-cycles-frontend   #   59.36% frontend cycles idle
           798,177,722 stalled-cycles-backend    #   29.46% backend cycles idle
         3,803,270,546 instructions              #    1.40  insns per cycle
                                                 #    0.42  stalled cycles per insn
           300,601,809 branches                  #  233.310 M/sec
                15,167 branch-misses             #    0.01% of all branches

           1.293986790 seconds time elapsed

Is there some kind of processor-internal emulation going on? I know that for denormal numbers, adds end up being 64 times slower than for normal numbers.
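
To check whether denormals could even occur here, a rough sketch like the following classifies every value the loop can touch. It assumes the inputs are the integers 0..15 produced by (++count) % 16; the helper names is_subnormal and count_subnormals_in_loop_range are hypothetical and not part of the original program.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not in the original code): true if x is a
// subnormal (denormal) float. Exact zero is classified as FP_ZERO instead.
inline bool is_subnormal(float x)
{
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Counts subnormal values among the inputs, products, and square roots
// for every input pair in 0..15, the range assumed from (++count) % 16.
inline int count_subnormals_in_loop_range()
{
    int count = 0;
    for (int a = 0; a < 16; ++a) {
        for (int b = 0; b < 16; ++b) {
            const float fa = static_cast<float>(a);
            const float fb = static_cast<float>(b);
            const float product = fa * fb;
            const float root = std::sqrt(product);
            count += is_subnormal(fa) + is_subnormal(fb)
                   + is_subnormal(product) + is_subnormal(root);
        }
    }
    return count;
}

// Small self-check: the smallest positive float is subnormal, while
// zero and one are not.
inline void check_is_subnormal()
{
    assert(is_subnormal(std::numeric_limits<float>::denorm_min()));
    assert(!is_subnormal(0.0f));
    assert(!is_subnormal(1.0f));
}
```

Under that assumption the count comes out as zero, because small integers, their products, and their square roots are all either exactly zero or normal floats, so denormal handling is probably not what separates the two builds.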