Unusual Pointer Chasing Memory Latency on SLES 11 SP2 with E5-2670


Hi,

I am running on a 2S Intel motherboard, S2600GZ, with 2 x E5-2670 CPUs. I measure the cache latencies I expect (4 clocks from L1, 12 from L2, 40 from L3), but when I measure memory latency with the huge-page test described below on SLES 11 SP2, I get either X or 2X the latency from run to run. In some runs the latency is 80-90 ns and in others it's 160-180 ns. I'm sure the true latency isn't the latter. I've pulled 1 CPU out of the motherboard, thinking I might be inadvertently accessing its memory, but that hasn't rectified this issue. Do you have any idea why I'm observing this behavior?

The test does the following:
1) Allocates a large span of memory, say 32MB.
2) Accesses an 8B element at random every 4096B, but only 1 access per 4096B block.
3) That access contains the pointer to the next access, and so on.
4) Once you've made the measurement, you flush every step of the walk using CLFLUSH.
5) Repeat until you get a good memory latency estimate.

I've affinitized the process with "numactl" to no avail.

Lastly, I've accessed every 128KB of a likewise 32MB array, measured the latency of that pointer chase, and don't observe this behavior. I get a reproducible number for the latency in that test.

Any pointers or information as to things I should be aware of are greatly appreciated.

perfwise
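The five steps above can be sketched in C roughly as follows. This is a minimal illustration, not the poster's actual test: the function names (`build_random_chain`, `chase`, `flush_chain`, `ns_per_load`) are hypothetical, the pointer slot is assumed to sit at offset 0 of each 4096B block, the random order comes from a plain `rand()` shuffle, and huge-page allocation is left out, so the caller is assumed to supply a suitably aligned buffer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime */
#endif
#include <assert.h>
#include <emmintrin.h>            /* _mm_clflush, _mm_mfence (x86 SSE2) */
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define BLOCK 4096                /* one 8B pointer slot per 4096B block */

/* Steps 2-3: link one slot per block into a single random cycle.
 * Assumption: the slot is at offset 0 of each block. Returns the head. */
static void **build_random_chain(char *buf, size_t bytes, unsigned seed)
{
    size_t n = bytes / BLOCK;
    assert(buf != NULL && n > 0);
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL) return NULL;
    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++) {            /* slot i points to slot i+1 */
        void **slot = (void **)(buf + order[i] * BLOCK);
        *slot = buf + order[(i + 1) % n] * BLOCK;
    }
    void **head = (void **)(buf + order[0] * BLOCK);
    free(order);
    return head;
}

/* Dependent loads: each load's address comes from the previous load. */
static void **chase(void **p, size_t steps)
{
    while (steps--) p = (void **)*p;
    return p;
}

/* Step 4: evict every line of the walk from the cache hierarchy. */
static void flush_chain(void **head, size_t n)
{
    void **p = head;
    for (size_t i = 0; i < n; i++) {
        void **next = (void **)*p;
        _mm_clflush(p);
        p = next;
    }
    _mm_mfence();                               /* wait for the flushes */
}

/* One timed lap over n blocks; returns ns per load, or -1.0 on error. */
static double ns_per_load(void **head, size_t n)
{
    struct timespec t0, t1;
    flush_chain(head, n);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **end = chase(head, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (end != head) return -1.0;               /* a full lap ends at the head */
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)n;
}
```

For the 32MB case, a caller would allocate the buffer (for example with `aligned_alloc(BLOCK, 32u << 20)`), build the chain once with `n = bytes / BLOCK`, and call `ns_per_load` repeatedly (step 5). Without huge pages every step lands on a different 4KB page, so TLB misses would be folded into the result.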


Folks, just replying to see if anyone has any ideas. I've removed 1 socket from my 2P server platform and still see the same results. Is there a program you suggest to measure the memory latency on this platform? It seems to me this would be a basic thing someone could answer. Any help is very much appreciated.

Perfwise

I forgot to mention: I'm using huge pages to keep TLB misses from interfering with my measurements. I get the following cache latencies (in clocks) for this server processor:
L1: 4
L2: 12
L3: 39-40
What's the expected memory latency (if it's publicly available), and can you confirm the L3 latency? On my SB desktop system the L3 latency ranges from 25-30 clocks but is typically 29. I was surprised to see the increase in latency, but now there are 8 cores on the ring bus.

Perfwise

Update: I just downloaded Lmbench3, and it's reporting 127 ns. I know that's bogus. So any pointers to what's expected, with 1S populated in this 2S motherboard, the processor from my first post, and DDR3-1600, are greatly appreciated, if anyone knows.

Perfwise

Hello perfwise,
Sorry to not respond earlier.
Are you able to turn off the prefetchers in the BIOS and then just try a standard memory latency test: 64 byte stride, sequential, linked list, dependent loads?
With 2MB pages, the TLB penalty should be negligible.
This would eliminate any question about your methodology.
You might be doing everything right; I'm just not so familiar with random loads and clflushing.

I would probably check (with prefetchers off), using a standard latency test, the result for regular (4KB) pages, then for 2MB pages, using a 64 byte stride and something like a 40MB array size (or even an 80MB array).

I would expect the 4KB latency to be similar to the 2MB latency, but I personally haven't used 2MB pages very much.

Pat, I'll try that, but would you have a rough idea of what the minimum memory latency is for DDR3-1600 at 2.6 GHz? Also, what's the expected L3 latency for a 20MB L3 on this chip? Thanks

Perfwise

All, just inquiring whether we know the L3 latency and the memory latency (page closed) I should expect on my SB server system. Any help is appreciated. I would think this would be a simple request to answer, since this knowledge is somewhat important in understanding the performance of the system. Thanks and have a great day.

Perfwise

Hello Perfwise,
Sorry for the delay.
I'm getting about 76 ns/LLC_miss for a page-hit. This is a 'load to use' latency.

This test has the config:
prefetchers off,
turbo off,
run for 20 seconds,
40MB array
stride 64 bytes
linked list, dependent load
run the test on 1 cpu from each node simultaneously
I used cpu 3 and cpu 18 (I try to avoid running on cpu 0).
memory malloc'd on same cpu on which test was run
so I malloc'd a 40MB array on cpu 3 and ran the latency test on cpu 3. Same method for cpu 18.

For the page-miss case, I used a stride of 4096 bytes and a 512MB array size.
I got a latency of 87.3 ns/miss.
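A standard strided latency test of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the tool used for the numbers above: the names `build_stride_chain`, `walk` and `stride_latency_ns` are hypothetical, memory comes from plain `aligned_alloc` with default page size, and CPU pinning, turbo and prefetcher settings are assumed to be handled outside the program (for example in the BIOS and with numactl).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Link buf[0] -> buf[stride] -> buf[2*stride] -> ... -> back to buf[0].
 * Returns the head, or NULL if the stride cannot hold an aligned pointer. */
static void **build_stride_chain(char *buf, size_t bytes, size_t stride)
{
    if (stride < sizeof(void *) || stride % sizeof(void *) != 0) return NULL;
    size_t n = bytes / stride;
    if (n == 0) return NULL;
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + i * stride) = buf + ((i + 1) % n) * stride;
    return (void **)buf;
}

/* Dependent loads: the next address is the value just loaded. */
static void **walk(void **p, size_t loads)
{
    while (loads--) p = (void **)*p;
    return p;
}

/* Allocate, link, and time `loads` dependent loads.
 * Returns ns per load, or -1.0 on allocation or parameter error. */
static double stride_latency_ns(size_t bytes, size_t stride, size_t loads)
{
    assert(bytes % 64 == 0 && loads > 0);
    char *buf = aligned_alloc(64, bytes);       /* cache-line aligned */
    if (buf == NULL) return -1.0;
    void **head = build_stride_chain(buf, bytes, stride);
    if (head == NULL) { free(buf); return -1.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **end = walk(head, loads);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    /* Using `end` keeps the compiler from discarding the walk. */
    double result = (end == NULL) ? -1.0 : ns / (double)loads;
    free(buf);
    return result;
}
```

Under these assumptions, the page-hit configuration would correspond to something like `stride_latency_ns(40u << 20, 64, loads)` and the page-miss configuration to `stride_latency_ns(512u << 20, 4096, loads)`, with `loads` chosen large enough to run for many seconds. With the hardware prefetchers left on, the 64 byte sequential walk would likely report a much lower figure than the true load-to-use latency.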

For an L3/LLC hit, using a 10MB array size, I get a latency of 40.2 clockticks.
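Since the thread mixes nanoseconds and clockticks, a small conversion helper makes the figures comparable. This sketch assumes the core runs at the 2.6 GHz mentioned earlier in the thread (turbo off); the function names are hypothetical.

```c
#include <assert.h>

/* Convert between nanoseconds and core clocks at a given frequency in GHz.
 * Assumption: the core clock is fixed (turbo off), e.g. 2.6 GHz here. */
static double ns_to_clocks(double ns, double ghz)
{
    assert(ghz > 0.0);
    return ns * ghz;          /* 1 GHz = 1 clock per ns */
}

static double clocks_to_ns(double clocks, double ghz)
{
    assert(ghz > 0.0);
    return clocks / ghz;
}
```

At 2.6 GHz, `ns_to_clocks(76.0, 2.6)` gives about 197.6 clocks for the page-hit case, `ns_to_clocks(87.3, 2.6)` gives about 227 clocks for the page-miss case, and `clocks_to_ns(40.2, 2.6)` gives about 15.5 ns for the L3 hit.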

