basic cache parameters in table, discrepancies in descriptions

I would like to clarify the basic cache parameters given in the "Intel Xeon Phi Coprocessor System Software Developers Guide".

In Table 2.4 "Cache Hierarchy", the duty cycle of the L2 is listed as "1 per clock", but the body text says "The L1 cache can be accessed each clock, whereas the L2 can only be accessed every other clock". That suggests the L2 duty cycle is actually 2 clocks.

"Intel Xeon Phi Core Micro-architecture" also disagrees with the table. The document says "The data cache allows simultaneous read and write allowing cache line replacement to happen in a single cycle", which suggests the L1 has both a read port and a write port. It also says "The cache is divided into two logical banks", which suggests the number of banks is two. Finally, it says "L2 cache can deliver 64 bytes of read data to corresponding cores every two cycles and 64 bytes of write data every cycle", which suggests the L2 read duty cycle is two clocks.

I am referring to the "Intel Xeon Phi Coprocessor System Software Developers Guide" (328207-002EN, June 2013) and to "Intel Xeon Phi Core Micro-architecture", copyrighted by Apress, 2012.

I am tuning my code for the L2 cache bandwidth, so I care about the peak performance very much.


From my testing it looks like L2 read bandwidth is concurrency-limited. 

I did a fair amount of work with a sum-reduction kernel last April, and after lots of experimentation I was able to achieve a maximum rate of just under 23 GB/s for L2-contained data using a single core.  At the ~1.1 GHz operating frequency of my Xeon Phi SE10P, this corresponds to 3.062 cycles per cache line.
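
The conversion from bandwidth to cycles per cache line can be reproduced with a short calculation. This is a minimal sketch, assuming 64-byte cache lines, 1 GB = 10^9 bytes, and a clock of exactly 1.1 GHz; the function name is illustrative and does not come from the post.

```python
def cycles_per_line(bandwidth_bytes_per_s, clock_hz, line_bytes=64):
    """Core cycles spent per cache line at a given sustained bandwidth."""
    lines_per_second = bandwidth_bytes_per_s / line_bytes
    return clock_hz / lines_per_second


# Rounded inputs from the post: just under 23 GB/s at about 1.1 GHz.
print(f"{cycles_per_line(23e9, 1.1e9):.3f} cycles per cache line")
```

With these rounded inputs the result is about 3.06 cycles per line, in line with the 3.062 quoted above, which presumably comes from the unrounded measurements.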

Given a measured L2 latency of about 23.4 cycles (for scalar pointer-chasing code -- see additional notes below) and a maximum load throughput of one line every 3.062 cycles, the implied concurrency can be computed using a version of Little's Law:

                   Latency * Bandwidth = Concurrency
           23.4 cycles * (1 cache line / 3.062 cycles) = 7.64 cache lines

This suggests that the core only supports 8 outstanding L1 Data Cache misses.  (This could be tested more directly, but at the time I was focused on maximizing the performance of my sum reduction kernel, not just testing how fast I could get bits to move around.)
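
The Little's Law estimate above can be checked in a few lines. This is only a sketch of the arithmetic: the function name is hypothetical, and rounding up to a whole number of miss buffers reflects the inference made in the post, not a documented hardware parameter.

```python
import math


def implied_concurrency(latency_cycles, cycles_per_line):
    """Little's Law: concurrency = latency * throughput (lines per cycle)."""
    return latency_cycles * (1.0 / cycles_per_line)


lines_in_flight = implied_concurrency(23.4, 3.062)
# A fractional result implies at least the next whole number of buffers.
min_buffers = math.ceil(lines_in_flight)
print(f"{lines_in_flight:.2f} lines in flight, at least {min_buffers} buffers")
```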

Additional Notes on Latency:
1. The Xeon Phi System Software Developer's Guide notes that vector latencies may be different than scalar latencies. Some versions of the guide give specific values for the number of extra cycles and some are more vague.
2. But the scalar code also pays a penalty -- the same document describes a 3-cycle "address generation interlock".  Specifically, "A GPR register must be produced three or more clocks prior to being used as a base or index register in an address computation."   It is not clear if this penalty applies to GPRs used as "raw" pointers.  A pointer-chasing code can be compiled in either way -- I just did an "objdump" on the code that I used to get the 23.4 cycle latency value and I see that it uses pointers directly and not as bases or indices.  The inner loop contains 100 loads, in blocks like this:

mov    (%rcx),%rsi
mov    (%rsi),%rdi
mov    (%rdi),%r8
mov    (%r8),%r9
mov    (%r9),%r10
mov    (%r10),%r11
mov    (%r11),%r14
mov    (%r14),%r15
mov    (%r15),%rdx
mov    (%rdx),%rcx
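
The dependent-load structure of the loop above can be illustrated in a higher-level form. This sketch is not the benchmark from the post: it assumes a randomly ordered chain forming a single cycle (built here with Sattolo's algorithm), which is one common way to set up pointer chasing, and all names are hypothetical. Python timings say nothing about hardware latency; the point is only that each load's address comes from the previous load, so the loads cannot overlap.

```python
import random


def single_cycle_chain(n, seed=0):
    """Return a list `nxt` where following nxt[i] visits all n slots once.

    Sattolo's algorithm yields a permutation consisting of one n-cycle.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, start, steps):
    """Follow the chain; each index depends on the previous lookup."""
    p = start
    for _ in range(steps):
        p = nxt[p]  # analogous to: mov (%reg_a),%reg_b
    return p


chain = single_cycle_chain(1024)
# After exactly n steps the walk returns to its starting slot.
end = chase(chain, 0, len(chain))
```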

Given an observed performance of one cache line every 3.062 cycles and the assumption that the core can only support 8 concurrent L1 Data Cache misses, the maximum L2 latency that is consistent would be 8 * 3.062 = 24.5 cycles.   Given that the vector code pays a slightly higher load-to-use latency than the scalar code, but that it does not pay the "address generation interlock" penalty (since the load address registers are computed in advance in the vector version of the code), this all seems consistent.

Summary & Conclusion:  L2 read bandwidth on Xeon Phi is limited by the maximum number of L1 Data Cache misses to 8 cache lines per latency, or about one cache line every ~3 cycles.  This is ~2/3 of the peak bandwidth of one cache line every 2 cycles.
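
The summary figures can be put in GB/s terms with a short calculation. This is a sketch under stated assumptions: 64-byte cache lines, a clock of exactly 1.1 GHz, and 1 GB = 10^9 bytes; the names are illustrative.

```python
LINE_BYTES = 64    # assumed cache line size
CLOCK_HZ = 1.1e9   # approximate SE10P clock quoted in the post


def bandwidth_gb_per_s(cycles_per_line):
    """Single-core bandwidth when one line is delivered every N cycles."""
    return LINE_BYTES * CLOCK_HZ / cycles_per_line / 1e9


peak = bandwidth_gb_per_s(2.0)        # one line every 2 cycles
achieved = bandwidth_gb_per_s(3.062)  # one line every 3.062 cycles
fraction = achieved / peak
print(f"peak {peak:.1f} GB/s, achieved {achieved:.1f} GB/s, "
      f"ratio {fraction:.2f}")
```

The ratio works out to roughly 0.65, which matches the "~2/3 of peak" statement.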

Not Covered Here:  Normal cached stores that miss in L1 and hit in L2 should be similar, since L1 store misses are almost certain to use the same set of L1 Data Cache miss buffers.  (It may be easier to get full performance on stores to L2, since the in-order core does not stall on store misses.)
Streaming stores bypass the L2, so there is no concept of "L2 bandwidth" that can be applied.

"Dr. Bandwidth"

Thank you, Dr. Bandwidth, for the practical information. 

There is a newer book out (Sep 2013) - Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers - which is an Apress Open book. The information on cache is mostly the same as the previous book but there is additional software and performance information that makes it worthwhile downloading a copy (or you can purchase a paperback copy.)

Thank you. It is very helpful to know that the instruction issue rate already limits the achievable bandwidth, independent of the true hardware bandwidth of the L2. It confirms my understanding.

The L2 latency of 20+ clocks likely includes the cost of flushing and restarting the pipeline (to the E-stage), and it might be reduced by prefetching from the L2 into the L1. However, I had little success adding prefetches, because prefetch instructions with offsets do not pair with anything and likely leave a bubble in the pipeline. In addition, threading makes things worse when prefetching is added.

It took some work to generate good code, but the best L2 bandwidth results I obtained were always with a single thread per core, and did use software prefetches.   I was able to get fairly close to the best results using simpler code with multiple threads, but it was always a bit slower.

"Dr. Bandwidth"

I have had a similar experience measuring the cache bandwidth. The detailed information can be found here:

