Hi,

while optimizing vector-vector operations (such as a dot product) I ran into problems reaching the full cache bandwidth. According to the documentation (e.g., the system software developer's guide), L1 should be able to deliver one cache line per cycle and L2 one cache line per two cycles. The former, of course, only with two or more threads per core.

As an example, take a double-precision dot product. I have attached a plot of the measured read bandwidth.

- single thread: I expect 32 B/cycle from both L1 and L2
- two threads: I expect 64 B/cycle from L1 and 32 B/cycle from L2
- the measured results are almost a factor of two worse than expected
- the MKL BLAS routine performs even worse (probably because it is more general?)

Optimizations:

- tuned the L2->L1 prefetch distance
- removed dependencies between consecutive instructions (two independent accumulators)
- unrolled the loop
- see the example below
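The dependency removal, shown in plain C for clarity (my own illustrative sketch, not the measured kernel): with a single accumulator every fused multiply-add waits for the previous one, whereas independent partial sums let consecutive iterations overlap.

```c
#include <stddef.h>

/* Two independent accumulators break the FMA latency chain.
   The intrinsic version below makes the same structure explicit
   with the vector registers sum and sum2. */
double dot2(const double *x, const double *y, size_t n) {
    double s0 = 0.0, s1 = 0.0;       /* independent partial sums */
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i]     * y[i];       /* chain 0 */
        s1 += x[i + 1] * y[i + 1];   /* chain 1 */
    }
    if (n & 1)                       /* odd tail element */
        s0 += x[n - 1] * y[n - 1];
    return s0 + s1;                  /* combine once at the end */
}
```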

Question: What am I doing wrong?

Thanks for your help, Simon

Example for the case "L1 size < vector length < L2 size" (I also tried many variations of this; the code below is among the best-performing):

/* x, y: double arrays, n a multiple of 64; zmm1..zmm4, sum, sum2 are __m512d,
   with sum and sum2 zero-initialized before the loop. */
#pragma noprefetch              /* manual prefetching below, so disable the compiler's */
for (long i = 0; i < n; i += 64) {
    /* elements i .. i+15; prefetch hint 1 is _MM_HINT_T0 with icc */
    _mm_prefetch((char *)(x + i)     + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(x + i + 8) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(y + i)     + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(y + i + 8) + prefetch_distance_L1, 1);
    zmm1 = _mm512_load_pd(x + i);
    zmm3 = _mm512_load_pd(x + i + 8);
    zmm2 = _mm512_load_pd(y + i);
    zmm4 = _mm512_load_pd(y + i + 8);
    sum  = _mm512_fmadd_pd(zmm1, zmm2, sum);
    sum2 = _mm512_fmadd_pd(zmm3, zmm4, sum2);

    /* elements i+16 .. i+31 */
    _mm_prefetch((char *)(x + i + 16) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(x + i + 24) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(y + i + 16) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(y + i + 24) + prefetch_distance_L1, 1);
    zmm1 = _mm512_load_pd(x + i + 16);
    zmm3 = _mm512_load_pd(x + i + 24);
    zmm2 = _mm512_load_pd(y + i + 16);
    zmm4 = _mm512_load_pd(y + i + 24);
    sum  = _mm512_fmadd_pd(zmm1, zmm2, sum);
    sum2 = _mm512_fmadd_pd(zmm3, zmm4, sum2);

    /* elements i+32 .. i+47 */
    _mm_prefetch((char *)(x + i + 32) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(x + i + 40) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(y + i + 32) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(y + i + 40) + prefetch_distance_L1, 1);
    zmm1 = _mm512_load_pd(x + i + 32);
    zmm3 = _mm512_load_pd(x + i + 40);
    zmm2 = _mm512_load_pd(y + i + 32);
    zmm4 = _mm512_load_pd(y + i + 40);
    sum  = _mm512_fmadd_pd(zmm1, zmm2, sum);
    sum2 = _mm512_fmadd_pd(zmm3, zmm4, sum2);

    /* elements i+48 .. i+63 */
    _mm_prefetch((char *)(x + i + 48) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(x + i + 56) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(y + i + 48) + prefetch_distance_L1, 1);
    _mm_prefetch((char *)(y + i + 56) + prefetch_distance_L1, 1);
    zmm1 = _mm512_load_pd(x + i + 48);
    zmm3 = _mm512_load_pd(x + i + 56);
    zmm2 = _mm512_load_pd(y + i + 48);
    zmm4 = _mm512_load_pd(y + i + 56);
    sum  = _mm512_fmadd_pd(zmm1, zmm2, sum);
    sum2 = _mm512_fmadd_pd(zmm3, zmm4, sum2);
}
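After the loop, sum and sum2 still need a horizontal reduction (e.g., _mm512_reduce_add_pd(sum) + _mm512_reduce_add_pd(sum2) with icc). For correctness checking I compare against a plain scalar version; a minimal sketch (names are illustrative, not from my tuned code):

```c
#include <stddef.h>

/* Scalar reference dot product used to validate the vectorized result. */
double dot_ref(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
```

Note that the summation order differs from the vectorized kernel, so I only expect agreement up to rounding.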