Cache miss ratio of nehalem

Hi,

With VTune 9.1, it's possible to estimate the percentage of cycles spent on long-latency data accesses, such as LLC and MLC misses.
Besides that, how can I measure the miss ratio of each cache level (L1/L2/L3) on Nehalem? What is the formula?

Regards,
Jie Jiang


Hi Jie,

We can estimate the percentage of cycles spent on long-latency data accesses:

For third-level misses: ((MEM_LOAD_RETIRED.LLC_MISS * 180) / CPU_CLK_UNHALTED.THREAD) * 100

For second-level misses: (((MEM_LOAD_RETIRED.LLC_UNSHARED_HIT * 35) + (MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM * 74)) / CPU_CLK_UNHALTED.THREAD) * 100

If the percentage is significant (> 20%), consider reducing misses.

Use VTune Analyzer to drill down to the source line, investigate the cause, and change your code.
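
As a minimal sketch of the two estimates above, the following Python applies the same fixed penalties (180, 35 and 74 cycles) to raw event counts. The function names and the sample counts are hypothetical, chosen only for illustration; they are not VTune output.

```python
# Fixed per-event penalties (in cycles), taken from the formulas above.
LLC_MISS_PENALTY = 180
LLC_UNSHARED_HIT_PENALTY = 35
OTHER_CORE_L2_PENALTY = 74
THRESHOLD_PCT = 20.0  # the "significant" threshold quoted above


def llc_miss_cycle_pct(llc_miss, clk_unhalted):
    """Estimated % of cycles lost to third-level (LLC) misses."""
    return llc_miss * LLC_MISS_PENALTY / clk_unhalted * 100


def mlc_miss_cycle_pct(llc_unshared_hit, other_core_l2_hit_hitm, clk_unhalted):
    """Estimated % of cycles lost to second-level (MLC) misses."""
    cost = (llc_unshared_hit * LLC_UNSHARED_HIT_PENALTY
            + other_core_l2_hit_hitm * OTHER_CORE_L2_PENALTY)
    return cost / clk_unhalted * 100


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    clk = 1_000_000_000
    l3 = llc_miss_cycle_pct(500_000, clk)
    l2 = mlc_miss_cycle_pct(4_000_000, 1_000_000, clk)
    for name, pct in (("LLC", l3), ("MLC", l2)):
        flag = "consider reducing misses" if pct > THRESHOLD_PCT else "ok"
        print(f"{name} miss cost: {pct:.1f}% of cycles ({flag})")
```

Because the penalties are fixed constants, the result is only an estimate of the cost, not a measurement of actual stall cycles.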

Regards, Peter

Hi Peter,

Thanks for your reply.
However, I'd like to know the formula for calculating the L1/L2/L3 cache miss ratio.
Intel manuals give formulas for Itanium and Core 2 processors, but not for Core i7 processors.
Maybe the estimated percentage of cycles due to L2/L3 cache misses is a better metric than a simple cache miss ratio?

Regards,
Jie

Hi Jie,

The formulas I posted last time indicate how cache misses affect the application's overall run time.

If you want to know the cache miss ratio for each level, here are examples (for memory loads):
1. L1: L1D_CACHE_LD.I_STATE / L1D_CACHE_LD.MESI
2. L2: (MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM) / L2_RQSTS.LOADS
3. L3: MEM_LOAD_RETIRED.LLC_MISS / (MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM)

Users can also define their own event ratios, e.g. for instruction cache misses.
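
The three load ratios listed above can be computed from raw counts with a small helper. This is only a sketch: the function name `load_miss_ratios` and the sample counts are hypothetical, and the dictionary keys simply reuse the event names from the list.

```python
def load_miss_ratios(counts):
    """Return per-level load miss ratios from a dict of raw event counts."""
    # Loads satisfied by the L3 or by another core's L2; the formulas
    # above use this sum as the L2 miss count.
    l2_misses = (counts["MEM_LOAD_RETIRED.LLC_UNSHARED_HIT"]
                 + counts["MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM"])
    return {
        "L1": counts["L1D_CACHE_LD.I_STATE"] / counts["L1D_CACHE_LD.MESI"],
        "L2": l2_misses / counts["L2_RQSTS.LOADS"],
        "L3": counts["MEM_LOAD_RETIRED.LLC_MISS"] / l2_misses,
    }


# Hypothetical counts, for illustration only.
sample = {
    "L1D_CACHE_LD.I_STATE": 50,
    "L1D_CACHE_LD.MESI": 1000,
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 18,
    "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM": 2,
    "L2_RQSTS.LOADS": 40,
    "MEM_LOAD_RETIRED.LLC_MISS": 5,
}
ratios = load_miss_ratios(sample)
for level, ratio in ratios.items():
    print(f"{level} load miss ratio: {ratio:.3f}")
```

Note that the L3 formula divides misses by hits rather than by total accesses, so its value can exceed 1.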

Regards, Peter

Hi Peter,
I have tested the load latency formulas as you suggested.
However, something is strange.

My platform is Intel Nehalem (Core i7) on Linux.

Here are some event counts collected by pfmon-3.9/perfmon2.

Index Description Counter Value
============================================================================================
1 L1D_CACHE_LD:I_STATE (description not available)................. 1792152169
2 L1D_CACHE_LD:MESI (description not available).................... 3601420667
3 MEM_LOAD_RETIRED:LLC_MISS (description not available)............ 3203586
4 MEM_LOAD_RETIRED:LLC_UNSHARED_HIT (description not available).... 331743878
5 MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM (description not available) 0
6 L2_RQSTS:LOADS (description not available)....................... 718837824
7 CPU_CLK_UNHALTED:THREAD (description not available).............. 7310483484
8 FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION (description not available). 1633902124
9 FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION (description not available). 0

Note that if we calculate the MLC miss cost as
(((MEM_LOAD_RETIRED.LLC_UNSHARED_HIT * 35) + (MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM * 74)) / CPU_CLK_UNHALTED.THREAD) * 100, the resulting percentage is about 158.827%. This is intuitively wrong, since any overhead should be smaller than the total run time.
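
For anyone who wants to re-check this arithmetic, here is a small sketch that parses the three relevant rows of the pfmon listing and evaluates the formula. The regular expression and the `parse_counts` helper are my own assumptions about the row layout, not part of pfmon.

```python
import re

# The three counter rows from the pfmon listing above that the formula uses.
PFMON_LINES = """\
4 MEM_LOAD_RETIRED:LLC_UNSHARED_HIT (description not available).... 331743878
5 MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM (description not available) 0
7 CPU_CLK_UNHALTED:THREAD (description not available).............. 7310483484
"""

# Row layout assumed here: index, event name, free text, trailing integer count.
LINE_RE = re.compile(r"^\s*\d+\s+(\S+)\s.*?(\d+)\s*$")


def parse_counts(text):
    """Map event name -> count for each line that looks like a counter row."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_counts(PFMON_LINES)
mlc_cost_pct = (
    (counts["MEM_LOAD_RETIRED:LLC_UNSHARED_HIT"] * 35
     + counts["MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM"] * 74)
    / counts["CPU_CLK_UNHALTED:THREAD"] * 100
)
print(f"MLC miss cost: {mlc_cost_pct:.3f}%")
```

With the counts listed above, this reproduces the figure of roughly 158.827%.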

Measurements with perfsuite-1.0.0/perfctr give similar results.

So I'm wondering whether the calculation method applies only to VTune, since VTune samples rather than counting PMU events (as pfmon and perfsuite do).

Any idea?

Hi Jie,

This is a problem when event-based sampling runs alongside a perfmon2-like program, because both use the PMU in the processor.

Perhaps there is a conflict over PMU resources between VTune Analyzer and perfmon2, so the result is incorrect.

Regards, Peter

Hi Peter,

Perhaps my previous description was not clear and caused a misunderstanding.

VTune works by sampling and call graph; it provides no PMU counting mode.
The sampled values are statistically imprecise.

When I care about the absolute value of an event count, counting mode is a better choice, and pfmon-like tools provide it. However, when I use pfmon to count PMU events (ALONE, not with VTune, therefore NO PMU resource sharing issues) and apply the formulas you gave (which also appear in the Intel manuals), the MLC cache miss penalty (about 150%) confuses me. How can such a penalty be larger than the total run time of the program?

So I'm wondering whether the above formula applies to VTune sampling results only (since MEM_LOAD_RETIRED.LLC_UNSHARED_HIT is a PEBS event).
Or does it also apply to counts from pfmon, and pfmon is not working correctly here? But another tool, perfsuite, gives similar results.

What do you think?

Regards,
Jie

Hi Jie,

Sampling data collection does use the PMU in counting mode, so we can't trust sampling results when pfmon is also using the PMU.

Look at your results:
MEM_LOAD_RETIRED:LLC_UNSHARED_HIT (description not available).... 331743878 -> this is much higher than expected, and unreasonable
MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM (description not available) 0
CPU_CLK_UNHALTED:THREAD (description not available).............. 7310483484

So please don't use sampling together with an application that uses the PMU.

Regards, Peter

Hi Peter,

I am trying to measure the L3 miss ratio. Here are my numbers with the ompscr benchmark c_lu:

MEM_LOAD_RETIRED.LLC_MISS events: 90318400
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT events: 4810432
MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM events: 404736

Using the above formula, I got an L3 cache miss ratio of ~17. What does this mean? Is it 0.17? Also, does this need to be normalized by the number of samples?

Thanks,
pranith.

It seems that LLC_MISS is high!

LLC miss/hit ratio = 90318400 / (4810432 + 404736) ≈ 17. That means about 17 LLC misses for every LLC hit, on average.
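
To make the difference between "17" and "0.17" concrete, here is a short sketch using the counts from the post above. The second figure, misses divided by misses plus hits, is an assumed alternative definition that treats the sum of these events as the total number of L3 lookups; it is not one of the formulas given earlier in the thread.

```python
# Event counts reported for the c_lu benchmark in the post above.
llc_miss = 90318400
llc_unshared_hit = 4810432
other_core_l2_hit_hitm = 404736

hits = llc_unshared_hit + other_core_l2_hit_hitm

# The ratio from the formula used in this thread: misses per hit.
misses_per_hit = llc_miss / hits

# Assumed alternative: fraction of lookups that miss, always in [0, 1].
miss_fraction = llc_miss / (llc_miss + hits)

print(f"misses per hit: {misses_per_hit:.2f}")
print(f"miss fraction:  {miss_fraction:.3f}")
```

So the value is a ratio of about 17.3 misses per hit, not 0.17; under the assumed alternative definition it corresponds to roughly 94.5% of lookups missing.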

There is no need to normalize PMU counts if you use Intel VTune Performance Analyzer.

Could c_lu be a special memory test within the benchmark? Can you test other, ordinary applications to compare results?

Regards, Peter

Thanks Peter!

I was also trying to find out the MPKI. I looked through the documentation, but it was not much help.

The problem is that we need INST_RETIRED counted while the sampling is taking place, but there is no such counter.
We have a counter that gives INST_RETIRED for the entire program, not only during sampling.

Also, is there any way to disable sampling in this case? I want to measure the miss events throughout the program.

Thanks,
pranith.

You can add or remove the INST_RETIRED.ANY event by modifying the sampling activity.

Why do you want to disable sampling data collection? Do you want to collect performance data inside your program? If so, VTune Analyzer can't interpret that data, but you can still use the formulas we discussed above.

Regards, Peter

Hi,

Sorry, I was not clear in my question.

I want to measure MPKI (misses per kilo-instruction). Using sampling events, I got the number of L2 miss events.
To get MPKI, I need the number of instructions retired during this sampling. Is there a counter for this?

I want to disable sampling because there is a slight variation (~2%) when I re-run the same program. To eliminate this, I want to collect the miss events throughout the program.

Regards,
Pranith.

Hi Pranith,

Now I think I understand what you need.

You have to include the INST_RETIRED.ANY event in sampling data collection to know the total instructions executed for the process or module of interest. Meanwhile, you may disable the collection of L2 miss events in your program and remove the other events from the sampling configuration.

Then, in another run without sampling data collection, you can collect the miss events throughout the program.

So you can compute MPKI = L2 misses * 1000 / INST_RETIRED.ANY
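
A minimal sketch of that calculation, with a hypothetical helper name `mpki` and made-up counts:

```python
def mpki(misses, instructions_retired):
    """Misses per thousand retired instructions."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return misses * 1000 / instructions_retired


# Hypothetical counts, for illustration only.
l2_misses = 2_500_000
inst_retired_any = 1_000_000_000
print(f"MPKI: {mpki(l2_misses, inst_retired_any):.2f}")
```

Both counts should come from the same scope (the same process or module and the same run phase), otherwise the ratio mixes unrelated intervals.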

Regards, Peter
