To get a better idea of the MIC's single-core, single-threaded performance, I tried the following simple experiment:

The following is a simple, unvectorized kernel: I take two vectors, arr1 and arr2, each of length LENGTH, and multiply their corresponding elements, repeating this LOOP times. I have kept LENGTH small enough that both vectors fit in the L1 cache, so the loop should not be memory bound. For example: LOOP = 1000000 and LENGTH < 256 (256 doubles is 2 KB per array, which should fit within the L1 cache).

I compiled without using any optimization flags.
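For reference, the build and launch commands I used look roughly like this (file and binary names here are hypothetical). One caveat worth noting: icc defaults to -O2 when no -O flag is given, so -O0 must be passed explicitly to truly disable optimization:

```shell
# Cross-compile for the coprocessor (-mmic targets KNC);
# -O0 explicitly disables optimization.
icc -mmic -O0 kernel.c -o kernel.mic

# Run natively on the coprocessor:
micnativeloadex ./kernel.mic
```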

for(size_t j = 0; j < LOOP; j++){
    for(int i = 0; i < LENGTH; i += 2){
        real[i/2] = arr1[i]   * arr2[i];      /* even elements */
        im[i/2]   = arr1[i+1] * arr2[i+1];    /* odd elements  */
    }
}

I get 0.28 Gflops when this runs on the MIC. I count the number of floating-point operations as LENGTH*LOOP, since the inner loop runs LENGTH/2 times and performs two multiplications per iteration.

Now I try vectorizing it with the following code:

__m512d t0;
for(size_t j = 0; j < LOOP; j++){
    for(int i = 0; i < LENGTH; i += 8){
        __m512d m  = _mm512_load_pd(&arr1[i]);
        __m512d in = _mm512_load_pd(&arr2[i]);
        t0 = _mm512_mul_pd(m, in);
    }
}

I get 0.6 Gflops when this runs on the MIC. The number of floating-point operations executed is the same.

I also tried a less trivial vectorized kernel: the element-wise (Hadamard) product of two complex vectors, with real and imaginary parts interleaved, as follows:

__m512d t0;
const __m512d zero = _mm512_setzero_pd();
for(size_t j = 0; j < LOOP; j++){
    for(int i = 0; i < LENGTH; i += 8){
        __m512d m    = _mm512_load_pd(&arr1[i]);
        __m512d m_r  = _mm512_swizzle_pd(m, _MM_SWIZ_REG_CDAB);
        __m512d in   = _mm512_load_pd(&arr2[i]);
        __m512d in_r = _mm512_swizzle_pd(in, _MM_SWIZ_REG_CDAB);
        __m512d reals = _mm512_mask_swizzle_pd(m, 0xAA, m, _MM_SWIZ_REG_CDAB);
        __m512d imags = _mm512_mask_sub_pd(m, 0x55, zero, m_r);
        t0 = _mm512_mul_pd(reals, in);
        t0 = _mm512_fmadd_pd(imags, in_r, t0);
    }
}

I get ~1.2 Gflops. I did account for the different (larger) number of floating-point operations in this kernel.

Shouldn't I be getting roughly 1 Gflops for the unvectorized case and ~8 Gflops for the vectorized case?

Thanks,

Bharat.