Haswell L2 cache bandwidth to L1 (64 bytes/cycle)?

Hello,

I'm having problems achieving the 64 bytes/cycle L2 to L1 cache bandwidth on the Haswell core.  I can only achieve 32 bytes/cycle.  Does anyone know if this is possible?  Several other reviewers on the web are having the same problem.  Was this a typo on the Intel slides for Haswell?  My current code is using integer loads.  Is the claimed 64 bytes/cycle dependent on SSE loads?  I'm stumped.

Thanks,

Stephen

 

Tom's hardware:

L1D, L2, and L3 cache bandwidth are all up on our Core i7-4770K compared to my preview piece. However, only the L1D yields the doubling of bandwidth we were expecting. Given 64 bytes/cycle (compared to 32 for Ivy Bridge), this number should still be much higher than it is.

http://www.tomshardware.com/reviews/core-i7-4770k-haswell-review,3521-12...

www.extremetech.com:

We’re surprised that the L2 cache is just 7% faster on Haswell, but the massive L1 bandwidth is a huge jump.

http://www.extremetech.com/computing/157125-haswell-review-intels-core-i...

 

Attachment: Cache Latency.JPG (142.79 KB)

The Haswell core can only issue two loads per cycle, so they have to be 256-bit AVX loads to have any chance of achieving a rate of 64 Bytes/cycle.
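(That is, 2 load ports x 32 Bytes per 256-bit load = 64 Bytes/cycle, whereas two 4-Byte DWORD loads can move at most 8 Bytes/cycle into registers.)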

"Dr. Bandwidth"

Thanks for the comment John.   I will try using 256 bit AVX loads.  

My current program is just doing a DWORD load with a stride of 64 bytes.  Naturally, this is a fake benchmark, but in theory it should be loading a 64 byte line into the L1 from the L2 each clock.   For some reason the code can't get past the 32 bytes / clock rate.  What surprises me even more is that I can't find any benchmark or article that demonstrates the Haswell cpu running at 64 bytes / clock from the L2.  Not one. 

Now, if Haswell can achieve 64 bytes / clock with two 256-bit AVX loads, then why not with a single DWORD load with a 64-byte stride?  This might imply that the L2/L1 logic is somehow coupled with the AVX unit and not the integer unit.  That would be really strange.

Anyhow, I will give it a try with AVX loads.  I welcome any help in figuring this out.  

Thanks again,

Stephen

How is the DWORD stride loaded? Are AVX registers or 64-bit GP registers used to load the data?

Thanks for asking, Iliyapolak.  Here is the basic loop for my integer code.  It is just a simple load and pointer increment that repeats thirty-two times.  In theory the Haswell core should be able to do a load every clock, and each load should bring a 64-byte line into the L1 from the L2 -- but it doesn't.  It is taking two clocks per load, just like Ivy Bridge or Sandy Bridge.

I will implement an AVX version on Monday.   

Thanks, 

Stephen

; Load 256K minus 32K into L2.
; Then touch the oldest 32K * 4 bytes again.

        mov     ecx, 32*1024*4          ; bytes left to touch (128 KiB)
        mov     esi, BUFFER
top:
        mov     eax, dword [esi]        ; one DWORD load per 64-byte line
        add     esi, 64
        ; ... the MOV/ADD pair repeats 32 times ...
        sub     ecx, 64*32              ; 32 lines touched per unrolled pass
        jne     top
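For reference, a rough sketch of what the AVX variant might look like (not Stephen's actual code; same placeholder constants as above, BUFFER assumed 32-byte aligned, untested): two 256-bit loads cover each 64-byte line.

        mov     ecx, 32*1024*4
        mov     esi, BUFFER
top:
        vmovapd ymm0, [esi]             ; low 32 bytes of the line
        vmovapd ymm1, [esi + 32]        ; high 32 bytes of the line
        add     esi, 64
        ; ... the VMOVAPD/VMOVAPD/ADD group repeats 32 times ...
        sub     ecx, 64*32
        jne     top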

 

 

>>> In theory the haswell core should be able to do a load every clock>>>

I think a rate of 64 bytes loaded per clock can be achieved only with the help of AVX registers. My guess is that, at the level of the internal implementation, load MOV instructions are held in the load buffer and their data is loaded into physical registers, probably waiting for the AGU to calculate the address and for the L1D cache to deliver the data. I do not know whether the problem is related to the number of physical integer registers being loaded with the data. I was also thinking about the possibility that the data is transferred internally through 32-bit physical registers, hence the lower bandwidth per cycle.

 

There is absolutely no doubt that you can only move 64 Bytes per cycle into registers by using AVX instructions.  There are very clearly only 2 load units (ports 2 and 3 shown in Figure 2-1 of the Intel Architectures Optimization Manual, document 248966-030, September 2014), so only two load uops can be executed per cycle.   The biggest load payload is 256 bits == 32 Bytes, and 64 Bytes / 2 loads = 32 Bytes per load.

If you assume that whole cache lines are being transferred between the L1 and L2, then you can compute a "hardware bandwidth" by multiplying the number of L1 misses/second by 64 Bytes/miss.  Since this approach does not concern itself with getting data all the way into registers, you can use any type of load to generate the misses.  Implementation recommendations:

  • A stride of 64 Bytes for reads on a data structure that is larger than L1 cache and smaller than L2 cache should give an L1 miss every cycle.
  • When using small (4KiB) pages, the random mapping of virtual to physical addresses will mean that you can't count on retaining data structures that are the full size of the L2.   Fortunately the L1 is well-behaved, so you actually only need to access a data structure that is slightly larger than the L1 to get it to completely flush every time through.  64 KiB is a good size for flushing the L1 cache without too much risk of missing the L2.  (Using 2MiB pages eliminates this problem.)
  • You will want to ignore the timings from the first pass through the measurement, since they will be contaminated both by L2 misses and by writebacks of dirty data from the L1.
  • Inline RDTSC or RDTSCP instructions should be used for timing, and the loop should be sized so that the overhead of these timer reads (typically ~30-35 cycles, but I have not tested this on Haswell) is negligible.
  • Check the assembly code to make sure that there are not extraneous instructions that could cause pipeline bubbles.  I usually make sure the loop is unrolled a few times.
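To make the recommendations above concrete, here is a rough skeleton of the timed portion of such a test (a sketch only, untested on Haswell; BUFFER is an assumed 64 KiB, 64-byte-aligned region that has already been read once so that first-pass effects are excluded):

        rdtscp                          ; start timestamp (clobbers EAX, EDX, ECX)
        mov     edi, eax                ; keep the low 32 bits
        mov     esi, BUFFER
        mov     ecx, 65536/64           ; 1024 lines, one DWORD load per line
top:
        mov     eax, dword [esi]
        mov     eax, dword [esi+64]
        mov     eax, dword [esi+128]
        mov     eax, dword [esi+192]    ; unrolled 4x to bury loop overhead
        add     esi, 256
        sub     ecx, 4
        jnz     top
        rdtscp                          ; stop timestamp
        sub     eax, edi                ; elapsed cycles (low 32 bits only)
        ; "hardware bandwidth" ~= 65536 Bytes / elapsed cycles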
"Dr. Bandwidth"

I suppose that at the hardware level the physical AVX registers could be using multiplexed 256-bit-wide paths to the load/store ports, hence the 64-byte/cycle bandwidth. In the case of physical GP registers, a 64-bit-wide path is probably used to transfer the multiplexed data, hence the lower bandwidth.

Guys, all theory aside, I find the lack of a practical/ACTUAL C/ASM snippet featuring the MAJOR ability of Haswell to fetch 64B per clock stupid and crying out to be implemented, no?!

Seriously, call me a dummy amateur but this is unacceptable.

I asked the same thing of one of the developers of the superb Everest/AIDA, and he very politely stated that AIDA is PRO; sadly, I am AM.

http://forums.aida64.com/topic/1326-new-cache-and-memory-benchmarks-in-a...

Unfortunately Intel has not disclosed enough of the details of their cache implementation to guide the development of L2 cache bandwidth benchmarks.   I have some ideas, but not enough time to implement and test them...

I think it is safe to assume that Intel is telling the truth when they say that Haswell can transfer 64 Bytes per cycle from the L2 to the L1 -- but that is only a very small part of the story!  Moving data from the L2 cache to the L1 data cache is always associated with motion of data from the L1 data cache to the core, and it is not at all clear how these transfers interact.  For example, on Sandy Bridge (where I have done most of my testing), we know that the L1 data cache is composed of 8 banks, each handling accesses to a different 8-Byte field of a 64-Byte cache line, and each capable of handling only one read access per cycle.  The L2 to L1 interface is reported to be capable of delivering 32 Bytes per cycle, but we don't know if the data is delivered by writing to four consecutive banks on each of two cycles, or by writing to all eight banks every other cycle.  Even for "read-only" access patterns the data must be written into the L1 data cache, and we don't know how the L1 data cache banks handle the mixture of reads (to the core) and writes (incoming data from the L2).  We don't know if a demand load bringing data from the L2 to the L1 data cache bypasses that data to the core (thus eliminating the need to read the data from the L1 data cache), but if the L1 hardware prefetcher is active then at least some of the lines being read from L2 to L1 will get there in advance of the demand load, so those lines will have to be read from the L1.

The situation is more complex if the core is performing both reads and writes.  Timing issues might be critical, and timing is extremely difficult to control if there are any stores in the instruction stream.   It is also not clear how reads and writes interact in the L1 data cache banks, and not clear how L1 writebacks to the L2 and L1 refills from the L2 interact with L1 cache reads and writes.  Do L1 writebacks (which are reads of the L1 cache array) conflict with (i.e. use the same read ports as) reads from the core?  Does the writeback access all 8 L1 data cache banks for one cycle or four banks in each of two cycles?  Performance in the presence of writebacks might be strongly impacted by the existence of small "victim buffers" between the L1 and L2 caches (to prevent conflicts between the victim write and the incoming data read, which always go to the same cache set).   

For Haswell the cache organization has clearly changed, but the descriptions don't go into enough detail to understand how the L2 to L1 transfers will interact with L1 to core transfers.   The result may eventually be easier to understand than the banked L1 on Sandy Bridge, but there are lots of possible implementations and it is very hard to design microbenchmarks to disambiguate across the possibilities.  It is also likely that the real implementation is more complex than any hypothesis that an outside party is likely to come up with, which would result in either a "false positive" in the microbenchmark testing, or a set of results that is inconsistent with any of the proposed models.

"Dr. Bandwidth"

Thank you for sharing your knowledge, Dr. Bandwidth. It shows how 'simple' things such as the major feature of the leading CPU are deceptively complicated and shrouded in corporate secrecy, as if the goal were to prevent the plebs from reaching these speeds.

 

The 64 bytes/cycle represents the data bus width from the L2->L1. That is doubled over prior generations @ 32 bytes/cycle. The sustained BW from L2->L1 is less than this peak capability but is also nearly doubled on Haswell over prior generations.

Vish

Our investigations of Haswell indeed indicate that it is possible to achieve a bandwidth of more than 32B/c (about 43B/c) from L2 to L1 in load-only scenarios. This is under the assumption that the L1 is single-ported, i.e. data can only be transferred from L2 to L1 in cycles that don't overlap with data transfers from L1 to registers. In the attached plot you can see that it takes approximately two cycles to compute the dot product for one cache line with data residing in L1 (two vfmadds and four vmovpds taking two cycles, with two AVX loads happening each cycle).

With data residing in the L2 cache, we'd expect a total of four clock cycles (two additional cycles to transfer two cache lines from L2 to L1), but we actually see about three (or less in the case of Broadwell) cycles, which corresponds to something in the ballpark of 40B/c.

However, for streaming access patterns that involve evictions from L1 to L2, I observed that the bandwidth degrades to 32B/c. I've read somewhere that the L1-L2 eviction bandwidth is only 32B/c, but even so this should result in a 'mixed' bandwidth higher than 32B/c depending on the load-to-evict ratio, e.g. for a copy-a-to-b kernel: load a (43B/c), write-allocate b (43B/c), evict b (32B/c) should yield an average L2 bandwidth of about 39B/c. Can someone comment on that?
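(Spelling that estimate out per 64-byte line copied: 64/43 + 64/43 + 64/32 ≈ 5.0 cycles for 3 x 64 = 192 Bytes of L1-L2 traffic, i.e. roughly 38.6 B/c.)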

Attachment: ddot_fma.png (12.35 KB)


Hi Johannes,

Could you post the code for your ddot_fma kernel here?

Thanks


The source code would be interesting, as there are factors besides L2 bandwidth which could limit performance of fma dot product.  For example, gcc gives better performance without fma, as lack of "riffling" makes the overall latency of fma the limiting factor.

One may doubt whether plain dot products are included in the mix of application kernels which would have motivated the increased bandwidth between L2 and L1 implemented in Haswell.  Parallel dot products, as in dgemm, might be among the motivators, but those allow for L1 locality of at least one operand.
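For illustration, a "riffled" inner loop for such a kernel might look like the following. This is a sketch of the general technique only, not Johannes' actual code; register assignments and unroll depth are arbitrary, and more accumulators (or deeper unrolling) may be needed to fully cover the 5-cycle FMA latency. Assume rsi and rdi point to 32-byte-aligned x and y, rcx holds the element count (a multiple of 16), and ymm0-ymm3 are zeroed beforehand.

top:
        vmovapd     ymm4, [rsi]          ; x[i..i+3]
        vmovapd     ymm5, [rsi+32]       ; x[i+4..i+7]
        vmovapd     ymm6, [rsi+64]
        vmovapd     ymm7, [rsi+96]
        vfmadd231pd ymm0, ymm4, [rdi]    ; sum0 += x * y  (independent accumulators)
        vfmadd231pd ymm1, ymm5, [rdi+32] ; sum1 += x * y
        vfmadd231pd ymm2, ymm6, [rdi+64]
        vfmadd231pd ymm3, ymm7, [rdi+96]
        add         rsi, 128
        add         rdi, 128
        sub         rcx, 16              ; 16 doubles consumed from each vector
        jnz         top
        ; afterwards: add ymm0..ymm3 together and reduce horizontally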

I finally got around to looking at the attached charts and realized that I am confused....

The chart shows performance approaching 2 cycles "per cache line" for the largest L1-containable sizes.  Assuming that "per cache line" means "for processing 8 FMAs" (i.e., reading one cache line from each of the two input vectors), this corresponds to a bandwidth of 128 Bytes in 2 cycles, or 64 Bytes/cycle, which is the expected value for a core that can execute 2 32-Byte loads per cycle.

Moving to L2-contained data, the chart shows that the performance is bounded below by 5 cycles "per cache line".  This is only 40% of the performance of the L1-contained version, so it corresponds to a sustained rate of 25.6 Bytes/cycle.  This is only 40% of the stated L2-L1 bandwidth of 64 Bytes/cycle, and 20% lower than the peak 32 Bytes/cycle of the previous generation.

Assuming that the L1 Data Cache is "single ported" (i.e., it can either provide data to the core or receive refills from the L2 in any cycle) should result in a 4-cycle minimum time "per cache line" -- 2 cycles to execute the 4 loads to move the 128 Bytes from the L1 Data Cache to the core plus 2 cycles to move two new lines from the L2 cache to the L1 Data Cache.

Looking at the chart, I don't see any evidence of execution in less than 5 cycles -- every data point from 40 KiB to ~120 KiB is at or above 5 cycles.  This suggests that in addition to being "single ported", there is at least one additional cycle required in this access pattern, and it is hard to understand why such an overhead would be required.  (With stores and writebacks it is pretty easy to come up with scenarios that should result in such delays.)   

So I finally see where the "43 Bytes/cycle" mentioned above comes from -- it *assumes* that the L1 can only provide data to the core or receive data from the L2 in any cycle, and then derives an average L2 to L1 rate based on the excess cycles -- 128 Bytes in 3 excess cycles is 42.67 Bytes/cycle.   I think it is important to be clear that the "43 bytes/cycle" is not a measurement, but an interpretation based on what is almost certainly an oversimplified model of the actual implementation.  The measurement is 25.6 Bytes per cycle for the DDOT kernel operating with data that is expected to be in the L2 cache.  There may be other, more obscure, explanations for the discrepancy between this value and the 64 Byte/cycle burst rate claimed for the L2 to L1 interface.

These numbers are very close to 2x faster than the previous generation of products, but they are still disappointing.  The L1-contained DDOT takes 2 cycles and requires 2 4-wide FMAs, so it is running at 1/2 the peak arithmetic rate of the chip.  The L2-contained DDOT drops this from 50% of peak (8 FLOPS/cycle) to 20% of peak (3.2 FLOPS/cycle), which could be provided by an SSE implementation with separate 128-bit FP add and FP multiply units running at 20% lower frequency -- comparable to technology released by Intel 13-14 years ago.
 

"Dr. Bandwidth"

I notice the assembler code above uses 32-bit instructions and registers.  DWORD is 32 bits, AFAIK for historical reasons.  I would try using 64-bit registers (rax, etc.).  I think the assembler is smart enough to use the correct 64-bit ops, but to be sure use explicit 64-bit loads such as movq.

I think the data bus is still 64 bits, please correct me if I am wrong. 

Benchmarking 64-bit vs. 128-bit SSE4.x (xmm0 and such) loads, I see the 128-bit loads not being any faster than the 64-bit loads.  That would point to the data bus being 64 bits wide.
