level 2 and 3 cache misses on Xeon E5

Hi,

I am trying to use the perf commands on a Xeon E5 family processor (2620) to get the cache miss ratios for L2 and L3. I have tried the perf stat commands and read the Intel Software Developer's Manual, but I am still confused about three issues:

- How do I use the umask value with the event number to get the events? The perf stat operand has the format -rNNN, so we need the hex digits. How are they produced from the event number and mask?

- Which events must I use to get L2 misses? The descriptions of the events are a bit confusing.

- For the Xeon case, where there is an L3, are LLC misses = L3 misses = total cache misses (as measured by perf stat)?

Any advice would be highly appreciated.

Thanks,

George

If you have VTune please use it.

The lack of documentation of the "raw" event options for the "perf" commands is irritating.  Fortunately the easy cases really are easy.

For example, if you want to monitor Event 24H, Umask 01H (Demand Data Read Requests that Hit in the L2 cache), the command would be:

                perf stat -e r0124 a.out

So the Umask goes first, then the Event Number.  This is the same as the format used by the performance counter event select registers, as described in Chapter 18 of Volume 3 of the Intel Architecture Software Developers Guide.

The "perf" command automatically fills in reasonable default values for the higher-order bits of the performance counter event select register, so the command above should be equivalent to:

               perf stat -e r00430124

You can choose different high-order bits, which "perf" may or may not allow, depending on the specific bits.  A simple example would be to count in user mode only and not kernel mode, which requires clearing bit 17 (the OS flag) while leaving bit 16 (the USR flag) set, to give:

               perf stat -e r00410124
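
The encoding above can be sketched in a few lines of Python. This is only an illustration: the names `evtsel` and `perf_raw` and the `kernel` parameter are made up here, and the sketch assumes the IA32_PERFEVTSELx layout from the SDM (event in bits 0-7, umask in bits 8-15, USR = bit 16, OS = bit 17, EN = bit 22).

```python
# Bit positions in IA32_PERFEVTSELx, per the Intel SDM, Volume 3.
USR_BIT = 1 << 16   # count while running in user mode
OS_BIT = 1 << 17    # count while running in kernel mode
EN_BIT = 1 << 22    # enable the counter


def evtsel(event, umask, usr=True, kernel=True, enable=True):
    """Build an event select value from an event number and a umask."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in one byte")
    value = (umask << 8) | event   # umask goes above the event number
    if usr:
        value |= USR_BIT
    if kernel:
        value |= OS_BIT
    if enable:
        value |= EN_BIT
    return value


def perf_raw(value):
    """Format a value as 'r' plus hex digits, as in the commands above."""
    return "r%08x" % value if value > 0xFFFF else "r%04x" % value


print(perf_raw((0x01 << 8) | 0x24))               # r0124
print(perf_raw(evtsel(0x24, 0x01)))               # r00430124
print(perf_raw(evtsel(0x24, 0x01, kernel=False))) # r00410124
```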

The hard part is understanding exactly what the various events mean.  This one is not too bad -- it counts loads that miss the L1 cache and hit in the L2 cache.  It does not count how much data gets moved from the L2 to the L1 because the hardware prefetchers try to get the data into the L1 before it is needed.   It also does not tell you whether the data was found in the L2 because it was still there from a previous load or if it was in the L2 because one of the L2 hardware prefetchers grabbed it in advance of the actual load. 

John D. McCalpin, PhD "Dr. Bandwidth"

Thanks, that is really helpful.

Do you know which event, or combination of events, I should use for simple L2 misses (from no particular source) and L1i misses? And what if I need the ratio of those?

Almost all of the available information for the Xeon E5-26xx series performance counters is in Section 19.4 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual.  I use document 325384, revision 048, dated September 2013.

For L2 misses, you can (in principle) use counters in the L1, counters in the L2, or counters in the off-core interface (between the L2 and L3).   From Table 19-7 (in section 19.4), the primary event in the L2 that allows you to count misses is 24H:

  • Event 24H, Umask 02H (appears to) count demand read requests (from L1 Dcache misses) that miss in the L2.

    • This event is not directly documented, but is implied by the combination of Umask 01H (counts demand read requests that hit in the L2) and Umask 03H (counts all demand read accesses to the L2).  This implies that Umask 02H counts misses, and in my tests it appears to count these correctly.
  • Event 24H, Umask 08H counts store "read for ownership" requests that miss the L1 Data cache and also miss in the L2 cache.
  • Event 24H, Umask 20H counts instruction fetches that miss the L2 cache.

Adding these three values together should give the total number of L2 misses due to data cache and instruction cache misses.   It should not count L2 cache misses associated with L1 hardware prefetcher accesses.
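
As a rough sketch of that sum, assuming the three counts have already been collected (for example with `perf stat -e r0224,r0824,r2024`, using the umask-then-event format): the function names and any sample numbers are invented for illustration. The demand-read miss ratio uses Umask 03H (all demand read accesses to the L2) as the denominator.

```python
# Raw perf codes for Event 24H: umask first, then event number.
L2_MISS_EVENTS = {
    "demand_read_miss": "r0224",  # Umask 02H (inferred, not directly documented)
    "rfo_miss": "r0824",          # Umask 08H, store read-for-ownership misses
    "code_read_miss": "r2024",    # Umask 20H, instruction fetch misses
}


def perf_event_list(events=L2_MISS_EVENTS):
    """Comma-separated event list suitable for 'perf stat -e'."""
    return ",".join(events.values())


def total_l2_misses(demand_read_miss, rfo_miss, code_read_miss):
    """L2 misses from demand reads, stores (RFO) and instruction fetches."""
    return demand_read_miss + rfo_miss + code_read_miss


def demand_read_miss_ratio(demand_read_miss, all_demand_reads):
    """Fraction of demand reads (Umask 03H count) that missed the L2."""
    if all_demand_reads == 0:
        return 0.0
    return demand_read_miss / all_demand_reads
```

Note that this total still excludes hardware prefetcher traffic, so it is a lower bound on the lines actually fetched into the L2.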

As an alternative for L2 cache misses, you can use event B0H.  This measures "off-core" requests (from the L2 to the rest of the chip).   Umask 01H counts read requests, Umask 02H counts instruction requests, and Umask 04H counts store (read-for-ownership) requests.

In my experiments, the values from event B0H match the corresponding values from event 24H.    It *may* be possible to combine the Umasks to obtain the sum of the counts in a single run, but in my experience this is hit and miss with the Intel performance counters.  Unless the manual shows the combined mask as a separate entry you should not assume that combining will work.  It often does, but you need to test each case to be sure -- in this case you would run a test code with Event B0H/Umask 01H, Event B0H/Umask 02H, Event B0H/Umask 04H and compare the sum of those results with what you get from Event B0H/Umask 07H.
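
The comparison described above could be scripted along these lines. This is a minimal sketch: the function name and the 2% tolerance are arbitrary assumptions, chosen only because counts taken in separate runs never match exactly.

```python
def combined_umask_agrees(individual_counts, combined_count, rel_tol=0.02):
    """Check a combined-umask count against the sum of the per-umask counts.

    individual_counts: counts from separate runs (e.g. Umask 01H, 02H, 04H).
    combined_count:    count from one run with the combined mask (e.g. 07H).
    rel_tol:           allowed relative difference (assumed value).
    """
    expected = sum(individual_counts)
    if expected == 0:
        return combined_count == 0
    return abs(combined_count - expected) / expected <= rel_tol
```

If the check fails on a repeatable test code, treat the combined mask as unsupported for that event and measure the Umasks separately.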

For instruction cache misses, you can use Event 24H/Umask 30H -- this is documented as counting instruction cache misses that hit in the L2 plus instruction cache misses that miss in the L2, so it should cover all instruction cache misses.

There are many other events that relate to L2 traffic -- for example L1 Data cache writebacks that hit or miss in the L2 (which is not inclusive of the L1 caches in the Sandy Bridge core) and L2 hardware prefetches that hit or miss in the L2.  

John D. McCalpin, PhD "Dr. Bandwidth"

Hi all

I have two questions. But first here is a lengthy background description:

I need to measure cache performance for different lookup algorithms and different 
data structures on a Xeon E5 v2 CPU.

I am wrapping code around rdmsr() and wrmsr() functions to do this. 

From Intel's PCM source code I have extracted the following configuration:

  wrmsr(i, IA32_PERF_GLOBAL_CTRL,         0x0);
  wrmsr(i, IA32_FIXED_CTR_CTRL,         0x333);
  wrmsr(i, IA32_PMC0,                     0x0);
  wrmsr(i, IA32_PERFEVTSEL0,         0x43412e); // Event: 2EH Umask: 41H
  wrmsr(i, IA32_PMC1,                     0x0);
  wrmsr(i, IA32_PERFEVTSEL1,         0x4308d2); // Event: D2H Umask: 08H
  wrmsr(i, IA32_PMC2,                     0x0);
  wrmsr(i, IA32_PERFEVTSEL2,         0x4307d2); // Event: D2H Umask: 07H
  wrmsr(i, IA32_PMC3,                     0x0);
  wrmsr(i, IA32_PERFEVTSEL3,         0x4302d1); // Event: D1H Umask: 02H
  wrmsr(i, IA32_PERF_GLOBAL_CTRL, 0x70000000f);
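
To double-check the event/umask annotations on the IA32_PERFEVTSELx values above, the fields can be unpacked with a short Python sketch. The helper name `decode_evtsel` is made up for this illustration, and the bit positions (event in bits 0-7, umask in bits 8-15, USR = bit 16, OS = bit 17, EN = bit 22) are assumed from the SDM register layout.

```python
def decode_evtsel(value):
    """Split an IA32_PERFEVTSELx value into its main fields."""
    return {
        "event": value & 0xFF,
        "umask": (value >> 8) & 0xFF,
        "usr": bool(value & (1 << 16)),     # count in user mode
        "os": bool(value & (1 << 17)),      # count in kernel mode
        "enable": bool(value & (1 << 22)),  # counter enabled
    }


for value in (0x43412E, 0x4308D2, 0x4307D2, 0x4302D1):
    fields = decode_evtsel(value)
    print("0x%06x -> Event %02XH, Umask %02XH"
          % (value, fields["event"], fields["umask"]))
```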

In Intel's source code, the following names are used for the four programmable counters:

PMC0: L3Miss
PMC1: L3UnsharedHit
PMC2: L2HitM
PMC3: L2Hit

And the formulas for calculating cache hits/misses are:

L2Miss = L2HitM + L3UnsharedHit + L3Miss
L3HitRatio = (L3UnsharedHit + L2HitM)/(L3UnsharedHit + L2HitM + L3Miss)
L2HitRatio = L2Hit / (L2Hit + L2HitM + L3UnsharedHit + L3Miss)
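
Written out as a small Python sketch, with the four counter readings passed in as plain integers: the function name and the zero-denominator handling are assumptions made for this illustration and are not taken from the PCM source.

```python
def cache_ratios(l3_miss, l3_unshared_hit, l2_hit_m, l2_hit):
    """Apply the three formulas above to four raw counter readings."""
    # Every load that missed the L2 was satisfied by one of these three.
    l2_miss = l2_hit_m + l3_unshared_hit + l3_miss
    l3_hit = l3_unshared_hit + l2_hit_m
    l2_accesses = l2_hit + l2_miss

    return {
        "L2Miss": l2_miss,
        # The L3HitRatio denominator is the same sum as L2Miss.
        "L3HitRatio": l3_hit / l2_miss if l2_miss else 0.0,
        "L2HitRatio": l2_hit / l2_accesses if l2_accesses else 0.0,
    }


print(cache_ratios(l3_miss=10, l3_unshared_hit=30, l2_hit_m=10, l2_hit=150))
```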

So that was the background; here come the questions :-)

1) In the tables describing event codes and umask values I couldn't find the event
code/umask pair D2H/07H corresponding to PMC2/L2HitM.

Is this because the umask is the OR of the three documented umask values,
D2H/(01H OR 02H OR 04H)?

2) Where can I find more detailed descriptions of the event codes and 
umask values that will help me understand the formulas above?

Best regards

Morten Jagd Christensen
