Core-to-Core Communication Latency in Skylake/Kaby Lake

Hello,

The Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 Processors[1] claims that the latency for reading a cache line modified on another core is ~75 cycles. This document was written in the Nehalem era (2008). In Skylake, is the core-to-core communication latency still roughly twice that of a regular L3 hit, or has Intel added more sophisticated cache communication infrastructure?

Core-to-core communication latency determines how efficiently threads can communicate; e.g., in a producer-consumer scenario a naive consumer might miss on every unit of work, wasting ~75 cycles each time. Is reading a modified cache line still a ~75 cycle penalty on the newest architectures?

Thanks,
Brian Hempel

[1] https://software.intel.com/sites/products/collateral/hpc/vtune/performan...


On-chip cache-to-cache intervention latency for modified data is generally going up as core counts increase and frequencies remain flat.  

I don't see any reason for this to change in the next generation Xeon processors -- the number of cores is expected to increase and the uncore frequency is expected to stay flat or decrease.   

In general, implementation changes that allow further increases in throughput (e.g., Cluster on Die) can give slightly better intervention latencies within the "cluster" (i.e., half of the cores in a package), but at the cost of significantly higher cache-to-cache intervention latencies outside of the "cluster".   On a Xeon E5-2660 v4 (Broadwell EP, 10-core, 2.0 GHz nominal) the Intel Memory Latency Checker tool reports L2-to-L2 modified intervention latency (same socket) of 40.3-42.3 ns, depending on the snoop mode of the processor and the frequency of the uncore.  Since only 2 threads are needed to run this test, the cores are probably running at their max Turbo frequency of 3.0 GHz, making this correspond to 121-127 core cycles.   If I switch to "Cluster on Die" mode, the value is reduced slightly to 37.6 ns, or about 113 core cycles.
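
For a rough cross-check of numbers like these without the Memory Latency Checker, a hand-rolled ping-pong test works reasonably well. Below is a minimal C11 sketch (all names are illustrative; pin the two threads to specific cores, e.g. with taskset, for stable numbers, and expect the results to move with core/uncore frequency and snoop mode):

    /* ping-pong sketch: measures the round-trip time for a cache line
     * bounced between two cores.  Compile: gcc -O2 -pthread pingpong.c
     * One round trip = two cache-to-cache transfers, so the one-way
     * intervention latency is roughly half the printed number.        */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000

    /* one flag, alone in its own cache line */
    static _Alignas(64) atomic_int flag = 0;

    static void *bouncer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
                ;                                 /* spin until our turn */
            atomic_store_explicit(&flag, 0, memory_order_release);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct timespec t0, t1;

        pthread_create(&t, NULL, bouncer, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            atomic_store_explicit(&flag, 1, memory_order_release);
            while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
                ;                                 /* wait for the reply  */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("round trip: %.1f ns\n", ns / ITERS);
        return 0;
    }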

The producer/consumer use case is one that can be helped by forcing the uncore frequency to stay at the maximum value.   For high-bandwidth workloads (like STREAM), the "energy efficient Turbo" mechanisms will quickly ramp the uncore frequency to the maximum, but a producer/consumer code may not generate enough traffic to cause the hardware to boost the uncore frequency.
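
On the Xeon E5 v3/v4 generation, the uncore frequency limits are (to my knowledge) exposed in MSR 0x620 (UNCORE_RATIO_LIMIT): bits 6:0 hold the max ratio and bits 14:8 the min ratio, in units of 100 MHz. A sketch of pinning min = max via the Linux msr driver follows; treat the MSR number and field layout as assumptions to be verified against the SDM for your specific part, and note that this requires root and the msr kernel module:

    /* sketch: pin the uncore frequency by setting min == max ratio in
     * MSR 0x620 (UNCORE_RATIO_LIMIT -- assumed layout, check your SDM).
     * Compile: gcc -O2 uncore_pin.c ; run as root with msr loaded.    */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <ratio>  (e.g. 27 = 2.7 GHz)\n", argv[0]);
            return 1;
        }
        uint64_t ratio = strtoull(argv[1], NULL, 0) & 0x7f;

        int fd = open("/dev/cpu/0/msr", O_RDWR);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        uint64_t val;
        if (pread(fd, &val, sizeof(val), 0x620) != sizeof(val)) {
            perror("pread"); return 1;
        }
        printf("old: min ratio=%llu max ratio=%llu\n",
               (unsigned long long)((val >> 8) & 0x7f),
               (unsigned long long)(val & 0x7f));

        val = (ratio << 8) | ratio;              /* min = max = ratio  */
        if (pwrite(fd, &val, sizeof(val), 0x620) != sizeof(val)) {
            perror("pwrite"); return 1;
        }
        close(fd);
        return 0;
    }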

There are lots of ways to make producer/consumer latency better, but to get the best results this needs to be associated with a new, explicitly visible hardware mechanism for low-latency interprocessor communication.  I personally think it is inevitable that such mechanisms will be developed, but it is not yet clear whether this will ever happen in the x86 architecture.

"Dr. Bandwidth"

John,

This is a rather esoteric question relating to your last post.

I am assuming that the latencies you reported were for data written into one cache and then read from the other cache. The write-to-read is conceptually one operation, but in fact it is several: the write to one L2 posts an invalidate to the other L2(s). In the single-producer, single-consumer case, suppose the producer is bursting in a batch of entries (push interval less than the L2-to-L2 modified intervention latency). Does each subsequent push prevent the other core from completing its read (e.g., of the fill pointer) until the burst is complete? If so, the latency could be much worse.

Jim Dempsey

I usually implement a producer/consumer code using "data" and "flag" in separate cache lines.  This enables the consumer to spin on the flag while the producer updates the data.   When the data is ready, the producer writes the flag variable.  (A minimal code sketch of this pattern appears after the numbered list below.)   At a low level, the steps are:

  1. The producer executes a store instruction, which misses in its L1 Data Cache and L2 cache, thus generating an RFO transaction on the ring.

    1. The data for the store is held in a store buffer at least until the producer core has write permission on the cache line in its own L1 Data Cache.
    2. The data for the store may be held in the store buffer for longer periods, awaiting other potential stores to the same cache line.
  2. The RFO traverses the ring to find the L3 slice that owns the physical address being used, and the RFO hits in the L3.
  3. The L3 has the data, but also sees that it is marked as being in a clean state in one or more processor private caches, so it issues an invalidate on the cache line containing the flag variable.
    1. The L3 may or may not have a way of tracking whether the cache line containing the flag variable is shared in another chip.
    2. The L3 may have directories that track which cores may have a copy of the cache line containing the flag variable.  If there are directories, the invalidates may be targeted at the specific caches that may hold the cache line, rather than being broadcast to all cores.
    3. The L3 may send the data for the cache line containing the flag to the producer's cache at this time, but that data cannot be used until the coherence transactions are complete.
  4. The consumer receives the invalidate, invalidates its copy of the flag line, and responds to the L3 that the line has been invalidated.
    1. Immediately after responding to the L3 that the line has been invalidated, the spin loop in the consumer tries to load the flag variable again.
  5. The L3 receives the invalidation acknowledgements from all the cores that may have had a shared copy of the line, and notifies the producer core that it now has write permission on the cache line containing the flag data.
    1. If you are lucky, the producer core will write the flag variable from the store buffer to the L1 Data Cache immediately on receipt of permission.
    2. If you are unlucky, the producing core may lose the line containing the flag variable before the store buffer is written to the L1 Data Cache.  I don't know if Intel processors can do this, but I know some processors that can lose the line before dumping the store buffer.
  6. Very shortly after sending the write permission notification to the producer core, the L3 will receive a read request from the consumer core for the same cache line containing the flag.
    1. Depending on the implementation, several different things might happen.
    2. One option is for the L3 to hold the line containing the flag variable in a "transient" state while it waits for an acknowledgement from the Producer core that it has received the write permission message.  In this case the L3 will either:
      1. Stall the read request from the consumer core, or
      2. NACK the read request from the consumer core (i.e., tell it to try again).
    3. Another option is for the L3 to immediately process the read request and send an intervention/downgrade request for the cache line containing the flag variable to the producer core's cache.
  7. In the "lucky" case, the intervention/downgrade request generated by the read from the consumer core will get the new value of the cache line containing the flag variable and return it to the consumer core and to the L3 slice that owns that physical address.
    1. Various implementations have specific ordering requirements here that determine whether the cache line must be sent to the L3 first, then to the consumer core, or whether it can be sent to both at the same time.
    2. Some implementations require an extra handshaking step after the consumer core receives the data, before the L3 will give it permission to use the data.  (This is more common in the case of a store than a read.)
  8. Finally the consumer core gets the new value of the flag variable and sees that it has changed!  The data is now ready!
  9. The spin loop on the consumer core now exits, which incurs a 20-cycle mispredicted branch delay.
  10. The consumer core now executes a load instruction to get the data.  This misses in the consumer's L1 and L2 caches and generates a read request on the ring.
  11. The read request traverses the ring to the slice that owns the physical address of the cache line containing the data (which may be a different slice than the one controlling the cache line containing the flag), and the read request hits in the L3.
  12. The data in the L3 is stale, but the L3 knows exactly which core has write permission on the cache line containing the data, so it issues an intervention/downgrade on the cache line containing the data and targeting the cache of the producer core.
  13. The cache(s) of the producer core receive the intervention/downgrade request and return the new value of the cache line containing the data variable to the L3, simultaneously downgrading the permissions on the line so that it is now "read-only" in the producer's caches.
  14. As was the case for the cache line containing the flag variable, the cache line containing the data variable makes its way to both the L3 slice that owns the physical address and the consumer core that requested the data.
  15. The cache line containing the data arrives in the consumer core's cache, and the load instruction is allowed to complete.
  16. Once the consumer core has gotten the data safely into a register, it typically has to re-write the flag variable to let the producer core know that the value has been consumed and that the producer core is free to write to the cache line containing the data variable again.
    1. This requires the consumer to make an "upgrade" request on the cache line containing the flag, so it can write to it.   This is similar to the sequence above, but since the consumer already has the data, it does not need the L3 to send it again -- it only needs to wait until all other cores have acknowledged the invalidate before it can write to the flag line.
    2. Double-buffering can be used to avoid this extra transaction -- if the consumer uses a different set of addresses to send data back to the producer, then the fact that the producer has received another message from the consumer means that the consumer must have finished using the original data buffer, so it is safe for the producer to use it again.
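
To make the two-cache-line pattern concrete, here is the minimal C11 sketch promised above (a single-slot exchange; the names, payload size, and spin-without-backoff are illustrative only, not a tuned queue):

    /* sketch of the two-cache-line handoff: the consumer spins on "flag"
     * while the producer fills "data", then the producer publishes with
     * a release store.  Compile: gcc -O2 -pthread handoff.c             */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NITEMS 64

    static _Alignas(64) long data[8];          /* payload, its own line  */
    static _Alignas(64) atomic_int flag = 0;   /* flag, a separate line  */

    static void *producer(void *arg)
    {
        (void)arg;
        for (long i = 1; i <= NITEMS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
                ;                              /* wait until consumed    */
            data[0] = i;                       /* fill the data line     */
            /* steps 1-5 above: RFO on the flag line, then the store     */
            atomic_store_explicit(&flag, 1, memory_order_release);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        long sum = 0;

        pthread_create(&t, NULL, producer, NULL);
        for (int i = 0; i < NITEMS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
                ;                              /* steps 4-9: spin on flag */
            sum += data[0];                    /* steps 10-15: fetch data */
            /* step 16: re-write the flag to hand the slot back          */
            atomic_store_explicit(&flag, 0, memory_order_release);
        }
        pthread_join(t, NULL);
        printf("sum = %ld\n", sum);            /* expect 2080 for NITEMS=64 */
        return 0;
    }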

There are many variants and many details that can be non-intuitive in an actual implementation.  These often involve extra round trips required to ensure ordering in ugly corner cases.  A common example is maintaining global ordering across stores that are handled by different coherence controllers.  These can be different L3 slices (and/or Home Agents) in a single package, or, in the more difficult case, controllers in independent packages across which the stores alternate.

There are fewer steps in the case where the "data" and "flag" are in the same cache line, but extra care needs to be taken in that case because it is easier for the polling activity of the consumer to take the cache line away from the producer before it has finished doing the updates to the data part of the cache line.  This can result in more performance variability and reduced total performance, especially in cases with multiple producers and multiple consumers (with locks, etc.).
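
For contrast, here is a sketch of the same-line variant, assuming a 64-byte line holding 56 bytes of payload plus the flag (names are illustrative). Correctness is preserved by the release/acquire pair, but, as noted above, the consumer's polling can steal the line from the producer between the payload stores, costing extra intervention round trips:

    /* sketch: "data" and "flag" packed into one 64-byte cache line.
     * Single producer, single consumer only; more parties need locks.  */
    #include <stdatomic.h>

    struct msg {
        long payload[7];                 /* 56 bytes of data             */
        atomic_long seq;                 /* doubles as the flag; 0=empty */
    };

    static _Alignas(64) struct msg slot; /* flag and data share one line */

    static void send_msg(const long *src)
    {
        while (atomic_load_explicit(&slot.seq, memory_order_acquire) != 0)
            ;                            /* wait for the slot to drain   */
        for (int i = 0; i < 7; i++)
            slot.payload[i] = src[i];    /* consumer polling may steal
                                            the line between these stores */
        atomic_store_explicit(&slot.seq, 1, memory_order_release);
    }

    static void recv_msg(long *dst)
    {
        while (atomic_load_explicit(&slot.seq, memory_order_acquire) == 0)
            ;                            /* poll the shared line         */
        for (int i = 0; i < 7; i++)
            dst[i] = slot.payload[i];
        atomic_store_explicit(&slot.seq, 0, memory_order_release);
    }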

"Dr. Bandwidth"

Excellent !!

Your discussion provides critical information for use in developing high(er)-throughput queue systems. It also indicates that a simplistically coded queue will likely perform noticeably worse.

Thanks for the information.

Jim Dempsey

@"Dr. Bandwidth"'s excellent #4 response above - answers my question at:
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-ar...

So am I to believe that the cache invalidation sequence described in comment #4 above is the only way for a program to explicitly and atomically transfer a 64-bit value from one physical (or logical/hyperthreaded) core to another without going out to RAM and back, on a Haswell Core i7-4910MQ mobile chip that does not have QPI? Am I correct to assume my mobile Haswell chip has no QPI ring bus? If I am wrong, is there a way for code to explicitly use the ring bus, beyond arranging for a consumer core to access an address that has just been written by a producer core into the L2/L3 caches (not flushed) while the consumer is prefetching/reading that address?

The Core i7-4910MQ does have an internal ring that connects the 4 cores with each other and with the 4 slices of the L3.   See, for example, slide 7 of Intel's presentation at https://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.80-Pro...

The sequence of events for a two-cache-line producer/consumer setup is similar to what I outlined above, except that there is no case where you might need to send invalidate commands to another chip....
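
In code terms, for the 64-bit-value question: nothing special is required to route the transfer through the caches. An ordinary release store on one core paired with an acquire load on the other is serviced cache-to-cache over the on-die ring (via the sequence in comment #4); DRAM is not normally involved. A minimal sketch (zero as the "empty" sentinel is an assumption of this example):

    /* sketch: passing a 64-bit value between cores needs no special
     * instructions -- the coherence protocol moves the line directly
     * from the producer's cache through the L3, not through DRAM.     */
    #include <stdatomic.h>
    #include <stdint.h>

    static _Alignas(64) atomic_uint_fast64_t mailbox;

    /* producer core */
    static void post(uint64_t v)
    {
        atomic_store_explicit(&mailbox, v, memory_order_release);
    }

    /* consumer core: returns the first nonzero value it sees */
    static uint64_t take(void)
    {
        uint64_t v;
        while ((v = atomic_load_explicit(&mailbox, memory_order_acquire)) == 0)
            ;   /* each retry after an invalidate re-fetches the line
                 * cache-to-cache via the L3 slice that owns it         */
        return v;
    }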

"Dr. Bandwidth"
