L1/L2 cache (read) bandwidth

Hi,

In the process of optimizing vector-vector operations (such as a dot product) I ran into problems reaching the full cache bandwidth. According to the documentation (e.g., the System Software Developer's Guide), L1 should be able to deliver one cache line per cycle and L2 one cache line every two cycles; the former, of course, only with two or more threads per core.

As an example I use a double-precision dot product; a plot of the measured read performance is attached.

  1. single thread: I expect 32 B/cycle from L1 and from L2
  2. two threads: I expect 64 B/cycle from L1 and 32 B/cycle from L2
  3. the measured results are almost a factor of two worse than expected
  4. the performance of the MKL BLAS routine is even worse (probably because it is more general?)

Optimizations:

  • tuned the L2->L1 prefetch distance
  • removed dependencies between consecutive instructions
  • unrolled the loops
  • see example below

Question: What am I doing wrong?

Thanks for your help, Simon

Example for "L1 size < vector length < L2 size" (I also tried many variations of this; the code below is among the best-performing ones):

/* Dot-product kernel: each iteration processes 8 cache lines (64 doubles)
   from each of x and y, with two independent partial sums (sum, sum2) to
   break the dependency between consecutive FMAs. Compiler prefetching is
   disabled; prefetch_distance_L1 is the tuned L2->L1 software prefetch
   distance in bytes. */
#pragma noprefetch
        for(long i=0; i<n; i+=64)
        {
            /* software prefetch the next cache lines of x and y, then load and accumulate */
            _mm_prefetch((char *)(x+i)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(x+i+8)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(y+i)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(y+i+8)+prefetch_distance_L1,1);
            zmm1 = _mm512_load_pd(x+i);
            zmm3 = _mm512_load_pd(x+i+8);
            zmm2 = _mm512_load_pd(y+i);
            zmm4 = _mm512_load_pd(y+i+8);
            sum = _mm512_fmadd_pd(zmm1, zmm2, sum);
            sum2 = _mm512_fmadd_pd(zmm3, zmm4, sum2);
            _mm_prefetch((char *)(x+i+16)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(x+i+24)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(y+i+16)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(y+i+24)+prefetch_distance_L1,1);
            zmm1 = _mm512_load_pd(x+i+16);
            zmm3 = _mm512_load_pd(x+i+24);
            zmm2 = _mm512_load_pd(y+i+16);
            zmm4 = _mm512_load_pd(y+i+24);
            sum = _mm512_fmadd_pd(zmm1, zmm2, sum);
            sum2 = _mm512_fmadd_pd(zmm3, zmm4, sum2);
            _mm_prefetch((char *)(x+i+32)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(x+i+40)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(y+i+32)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(y+i+40)+prefetch_distance_L1,1);
            zmm1 = _mm512_load_pd(x+i+32);
            zmm3 = _mm512_load_pd(x+i+40);
            zmm2 = _mm512_load_pd(y+i+32);
            zmm4 = _mm512_load_pd(y+i+40);
            sum = _mm512_fmadd_pd(zmm1, zmm2, sum);
            sum2 = _mm512_fmadd_pd(zmm3, zmm4, sum2);
            _mm_prefetch((char *)(x+i+48)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(x+i+56)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(y+i+48)+prefetch_distance_L1,1);
            _mm_prefetch((char *)(y+i+56)+prefetch_distance_L1,1);
            zmm1 = _mm512_load_pd(x+i+48);
            zmm3 = _mm512_load_pd(x+i+56);
            zmm2 = _mm512_load_pd(y+i+48);
            zmm4 = _mm512_load_pd(y+i+56);
            sum = _mm512_fmadd_pd(zmm1, zmm2, sum);
            sum2 = _mm512_fmadd_pd(zmm3, zmm4, sum2);
        }

Attachment: ddot-bandwidth.pdf (76.84 KB)

Please read this post: http://software.intel.com/en-us/forums/topic/387129

I had already read it, but I cannot see where this helps me. Could you be more specific?

I am not an expert on MIC optimization :) I thought that you had a general question about MIC cache performance.

I think that Jim could provide more help on your issue.

Hi Simon,

With your code, two OMP threads, and 512 cache lines, I see 620-630 cycles/thread, or 50 bytes/cycle -- higher than the 32 bytes/cycle in your charts. Did you compile with -O3? Did you exclude OMP overhead? Did you pin your threads? Did you align x and y to 64 bytes? Are zmm1, etc. thread-local? I look at the best time out of 100 runs...

Thanks, Evgueni.

For the related case of a single vector summation, I was able to achieve up to ~94% of 32 Bytes/cycle using a fairly similar approach to yours.  The best results used 2 (vector) partial sum variables, and loop unrolling  (up to 16x = 32 cache lines processed in each iteration) to reduce the control overhead.  Obviously there is no room for software prefetching in this case.  These cases show that unrolling is critical to performance, which makes the code much less general.  Without unrolling (just 2 vector sum variables), performance was down around 50% of the expected 32 Bytes/cycle -- two issue slots for the vector instructions and two more for the non-overlapped scalar loop control instructions.
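
In outline, the kernel looks something like this (a sketch rather than my exact code: the 4x unroll is shortened for readability, and the _mm512_reduce_add_pd reduction intrinsic is just one way to collapse the partial sums):

#include <immintrin.h>

/* L1-resident summation: two vector partial sums, no software prefetching.
   Assumes x is 64-byte aligned and n is a multiple of 32. */
double sum_l1(const double *x, long n)
{
    __m512d s0 = _mm512_setzero_pd();
    __m512d s1 = _mm512_setzero_pd();
    for (long i = 0; i < n; i += 32)    /* 4 cache lines per iteration */
    {
        s0 = _mm512_add_pd(s0, _mm512_load_pd(x + i));
        s1 = _mm512_add_pd(s1, _mm512_load_pd(x + i + 8));
        s0 = _mm512_add_pd(s0, _mm512_load_pd(x + i + 16));
        s1 = _mm512_add_pd(s1, _mm512_load_pd(x + i + 24));
    }
    /* reduce the two vector partial sums to a single scalar */
    return _mm512_reduce_add_pd(_mm512_add_pd(s0, s1));
}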

My code includes the reduction of the partial sums in the %zmm registers to produce a single scalar sum, but I noticed that the compiler often hoists my final rdtsc call into the middle of that code, so my timings only include a portion of the overhead of that final summation.

For L2-resident data I have had a lot more trouble.  I did get one case up to 63% of 32 Bytes/cycle using a single thread with four summation variables and fiddling with the vprefetch0 distance to get the best performance.  My 63% corresponds to 20.2 Bytes/cycle -- almost exactly the same as your ~19.4 Bytes/cycle.   Like you I was also unable to get improved performance using more than one thread.

It is important to be aware of the relatively high overheads in OpenMP, even when running 2 threads on one physical core.  By reading the TSC just before the OpenMP Parallel For loop and then having each thread read its TSC as soon as it starts, it appears that the overhead of the Parallel For is several thousand cycles and the skew across the two threads is also large.

To get the threads to start at closer to the same time, I used a sneaky trick --- before the OpenMP Parallel For loop, I read the TSC into a global variable.  Then I added 100,000 to that value and had the threads spin on an rdtsc loop until they reached this delayed starting time.  Using this trick, I was able to get the threads to report an initial starting time within 50-80 cycles of each other. 
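
In outline, the trick looks roughly like this (a sketch, not my exact code; the helper names are illustrative and I assume __rdtsc() is available from immintrin.h):

#include <omp.h>
#include <immintrin.h>   /* __rdtsc() */

#define NTHREADS 2

unsigned long long start_deadline;
unsigned long long thread_start[NTHREADS];   /* ignoring false sharing for brevity */

/* Read the TSC before the parallel region, add a generous delay, and let
   every thread spin until that shared deadline so they all start within a
   few tens of cycles of each other. */
void timed_region(void)
{
    start_deadline = __rdtsc() + 100000;     /* deadline well past the fork overhead */
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        while (__rdtsc() < start_deadline)
            ;                                /* spin until the common start time */
        thread_start[tid] = __rdtsc();       /* per-thread start time; compare for skew */
        /* ... per-thread kernel and final rdtsc go here ... */
    }
}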

For the case with 2,3,4 threads on a single physical core, it should be possible to build an extremely efficient set of synchronization primitives for threaded code -- that would take some of the confusion out of this testing.    Any volunteers?

John D. McCalpin, PhD
"Dr. Bandwidth"

One of my customers adds up the performance of multiple threads even though they don't finish at the same time.  I don't know how to judge if that's sneakier than what John describes.

Thanks to all who replied so far. Some comments:

Quote:

Evgueni Petrov aka espetrov (Intel) wrote:

With your code, two OMP threads, and 512 cache lines, I see 620-630 cycles/thread, or 50 bytes/cycle -- higher than the 32 bytes/cycle in your charts. Did you compile with -O3? Did you exclude OMP overhead? Did you pin your threads? Did you align x and y to 64 bytes? Are zmm1, etc. thread-local? I look at the best time out of 100 runs...

  • yes, with -O3
  • OMP overhead should be excluded. My test looks like this: #pragma omp barrier; read rdtsc; ddot; read rdtsc; (see the sketch after this list)
  • ... this implies that in my plot the 2-thread case is really two independent ddots.
  • ... however, this might lead to the case TimP described: the threads could start/finish at different times
  • I look at the average time of many runs
  • everything is 64B aligned (I also tried 4kB alignment, with the same result)
  • each ddot is completely thread local
  • threads were not pinned manually; in this case I simply used KMP_AFFINITY
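
For concreteness, the timing structure looks roughly like this (a sketch rather than my exact harness; ddot stands for the kernel from my first post and the helper arrays are illustrative):

#include <omp.h>
#include <immintrin.h>   /* __rdtsc() */

double ddot(const double *x, const double *y, long n);   /* kernel from the first post */

/* Each thread times an independent ddot on its own 64-byte-aligned buffers;
   the barrier directly before the timed region keeps the OpenMP fork
   overhead out of the measurement. */
void time_ddot(double *const *x_t, double *const *y_t, long n,
               unsigned long long *cycles, double *result)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp barrier                          /* start the kernels together */
        unsigned long long t0 = __rdtsc();
        result[tid] = ddot(x_t[tid], y_t[tid], n);   /* thread-local data, independent ddot */
        unsigned long long t1 = __rdtsc();
        cycles[tid] = t1 - t0;                       /* per-thread cycle count */
    }
}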

Quote:

John D. McCalpin wrote:

For the related case of a single vector summation, I was able to achieve up to ~94% of 32 Bytes/cycle using a fairly similar approach to yours.  The best results used 2 (vector) partial sum variables, and loop unrolling  (up to 16x = 32 cache lines processed in each iteration) to reduce the control overhead.  Obviously there is no room for software prefetching in this case.  These cases show that unrolling is critical to performance, which makes the code much less general.  Without unrolling (just 2 vector sum variables), performance was down around 50% of the expected 32 Bytes/cycle -- two issue slots for the vector instructions and two more for the non-overlapped scalar loop control instructions.

Thanks for these pointers, I think you are right --- my L1 code still needs work (this also shows in the rising performance in my plot from 2 kB to 32 kB, which is cut off once the L1 size is reached).

Quote:

John D. McCalpin wrote:

For L2-resident data I have had a lot more trouble.  I did get one case up to 63% of 32 Bytes/cycle using a single thread with four summation variables and fiddling with the vprefetch0 distance to get the best performance.  My 63% corresponds to 20.2 Bytes/cycle -- almost exactly the same as your ~19.4 Bytes/cycle.   Like you I was also unable to get improved performance using more than one thread.

Have you tried another scenario with combinations of reads and writes? In that case I seem to get a bit closer to the 32 Bytes/cycle figure. Maybe the quoted 32 Bytes/cycle is for reads and writes combined and cannot be achieved with reads only? Could someone from Intel comment on this? A kernel of the kind I mean is sketched below.
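
What I have in mind is an update kernel along these lines (a sketch only; it assumes 64-byte-aligned data and n a multiple of 16):

#include <immintrin.h>

/* daxpy-style update: per iteration, two cache lines are read from each of
   x and y and two lines of y are written back, so reads and writes share
   the cache bandwidth. */
void daxpy_update(double a, const double *x, double *y, long n)
{
    __m512d va = _mm512_set1_pd(a);
    for (long i = 0; i < n; i += 16)
    {
        __m512d y0 = _mm512_fmadd_pd(va, _mm512_load_pd(x + i),     _mm512_load_pd(y + i));
        __m512d y1 = _mm512_fmadd_pd(va, _mm512_load_pd(x + i + 8), _mm512_load_pd(y + i + 8));
        _mm512_store_pd(y + i,     y0);
        _mm512_store_pd(y + i + 8, y1);
    }
}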

Quote:

John D. McCalpin wrote:

To get the threads to start at closer to the same time, I used a sneaky trick --- before the OpenMP Parallel For loop, I read the TSC into a global variable.  Then I added 100,000 to that value and had the threads spin on an rdtsc loop until they reached this delayed starting time.  Using this trick, I was able to get the threads to report an initial starting time within 50-80 cycles of each other. 

For the case with 2,3,4 threads on a single physical core, it should be possible to build an extremely efficient set of synchronization primitives for threaded code -- that would take some of the confusion out of this testing.    Any volunteers?

I would be interested, though right now I cannot see how this would help in production code: if I understand correctly, we would get extremely good synchronization at the cost of a large overhead (waiting and spinning)?

Thanks, Simon

Hi Simon,

You are correct -- my "sneaky trick" of synchronizing the threads using the TSC is no good for performance -- it just adds extra overhead.

The trick is useful for analysis, since I really do want to be sure that multiple threads are running concurrently over most of the measurement interval.

By the way, I discovered that when I add a "#pragma unroll ()" directive, the compiler quits issuing software prefetches.  Has anyone else seen this?
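
To make the question concrete, this is the kind of loop I mean; the effect can be checked by comparing the generated assembly (look for vprefetch instructions) with and without the pragma, and the unroll factor here is just an example:

double sum_unrolled(const double *x, long n)
{
    double s = 0.0;
    #pragma unroll(8)
    for (long i = 0; i < n; i++)
        s += x[i];
    return s;
}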

John D. McCalpin, PhD
"Dr. Bandwidth"

Quote:

John D. McCalpin wrote:

 I discovered that when I add a "#pragma unroll ()" directive, the compiler quits issuing software prefetches.  Has anyone else seen this?

Unrolling is sometimes used to fit in software prefetches without redundancy.  I don't think it's common on KNC for such changes to interfere with software prefetch generation, unless you count the situation where vectorization with vgather ends up performing worse than non-vectorized code, which is covered better by software prefetch.
