Hi, I have been looking at the cache latencies of my Intel SB part. I build a ptr chase that takes a given size of memory and breaks it into so many granular chunks. I use the first array entry every granular element to point to the next. For instance.. in a size spanning 128KB and a granularity of 32KB I create a walk of:a -> a[32768 * 2]a[32768 * 2] -> a[32768 * 1]a[32768 * 1] -> a[32768 * 3]a[32768 * 3] -> a The cache size of the L1 is 32KB and has an associativity of 8, so one might expect that the index to a specific way is set via bits 14:12. I do a walk with granular steps of 4KB spanning (8 + n) * 4KB for various n, and I observe: * for n=0, I get 4 cycles of latency, as expected the L1 latency * for n=1, I am stressing the associativity of the L1, but I don't get the L2 latency, I'm inbetween, say 5-7 cycles sometimes and other times very close to the L2 latency * for n=2, I'm getting the L2 latency, makes sense(question: why for n=1 am I sometimes getting into a mode where I get numbers close the L1 latency, when for the given cachesize and associativity I should be in the L2, is there some buffering for victims to the L2 that my test is hitting in some state with this trivial test?) Then I thought I'd measure the transition between the L2 and L3. The L2 is reported as 256KB and non inclusive of the L1. The associativity is 8, so one might expect the index to the way of the L2 to be determined via bits 17-15. I do a walk with granular steps of 32KB spanning (8 + n) * 32KB for various n, and I observe: * for n=0, I get the L1 latency, which is expected, the L1 is 8 way associative and we found earlier it maps every one of the 8 steps of 32KB in this test to different ways.. and since there are 8 steps, we fit in the associativity of the L1. * for n=1, I get 5.2 cycles, which isn't the L2 latency, why? * for n=2 and greater, I observe the L3 latency, and not the L2 latency, why!?(question: It appears that there's some interesting behavior going on between the L1 and L2. * is the L1 a write back cache to the L2? Appears not. The behavior looks alot like the L2 doesn't exist. * if I do a walk of steps of 4096, once I do 64 of these (which maps the associativity of the L2, but doesn't include that of the L1, which would require 72 such steps) I observe the latency increase from the L2 for 64, which is 12, to 14 for 65 steps, 17 for 66 steps, and so on. I'm observing behavior like either the L1 doesn't exist or a way of the L2 isn't there.)This test is using large pages, 2MB, so TLB misses are not occuring, and the start of the array is aligned to the start of the 2MB page. Lastly, the pointer chase is 8x unrolled. See below for an illustration of what the code looks like, maybe the HW pref is catching for walks with a few steps (say 8), that the address loaded via the rip is the same. For a walk with 16 steps, every 2 loop iterations you would get the same address touched via that rip:1002420: 48 8b 3f mov (%rdi),%rdi1002423: 48 8b 3f mov (%rdi),%rdi1002426: 48 8b 3f mov (%rdi),%rdi1002429: 48 8b 3f mov (%rdi),%rdi100242c: 48 8b 3f mov (%rdi),%rdi100242f: 48 8b 3f mov (%rdi),%rdi1002432: 48 8b 3f mov (%rdi),%rdi1002435: 48 8b 3f mov (%rdi),%rdi1002438: 49 ff cd dec %r13100243b: 75 e3 jne 1002420I can play with removing the unrolling in this test.. but maybe the hw pref is getting in the way. Still.. for a walk spanning 64 * 4KB, you fit in the L2, for 65 * 4KB, you don't. Yet I'm observing that we are actually going to the L3. Is the hardware prefetcher speculating and evicting data from the L1 that I'm not requesting?Any help in understanding this strange behavior is very much appreciated. As stated before.. the steps are randomized but there is a periodicity that the hw pref might catch, depending upon it's use of the rip and the address history for that rip.Perfwise
For more complete information about compiler optimizations, see our Optimization Notice.