How to calculate L1 and L2 cache miss rate?

Hello, everyone:
I am a new user of Intel VTune. I want to measure the L1 and L2 cache miss rates on an Intel Core 2 Quad Q6600 processor. Are the following formulas the right way to compute the L1 and L2 miss rates?

L1: L1D_CACHE_LD.I_STATE / L1D_CACHE_LD.MESI
L2: L2D_CACHE_LD.I_STATE / L2D_CACHE_LD.MESI

By the way, I have another question about measuring a multithreaded application. I run two threads on core 0 and core 1 of the Q6600, which share the L2 cache. One thread is the main thread, the other is a prefetch thread. How can I measure the impact of the prefetch thread on the main thread? I mean, how do I evaluate the benefit of the prefetch thread?
Hoping for your response!


Hi,

The Q6600 is an Intel Core 2 processor. Your main thread and prefetch thread can access data in the shared L2 cache. How to evaluate the benefit of the prefetch thread? You can use the VTune Analyzer to measure L2 cache misses in the main thread and compare two situations: 1) with the prefetch thread; 2) without the prefetch thread.

To measure L2 cache misses, modify the sampling activity: "Configure Sampling" -> Ratios -> add "L2 Cache Miss Rate" to the "Selected Ratios:" list.
You can verify in the "Selected events:" list that the event L2_LINES_IN.SELF.ANY was added. The sampling result will display the "L2$ Miss Rate" data for you.
L2 Cache Miss Rate = L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY; you can find more information in the help file.

Hope it helps.

Regards, Peter

Quoting - Peter Wang (Intel)

Hi, Peter
Thanks for your response. I don't know why the L2 cache miss rate in the VTune manual differs from the definition in my textbook. There, the global L2 miss rate is (L2 misses / memory references), and the local L2 miss rate is (L2 misses / L2 references). What do you think about those definitions? And how would I calculate the L2 cache miss rate according to those formulas - I mean, which events should I select to measure?

I'm not sure I understand you correctly - there is no concept of a "global" or "local" L2 miss in the VTune Analyzer.

L2_LINES_IN indicates all L2 misses, including instruction-prefetch misses.
MEM_LOAD_RETIRED.L2_LINE_MISS indicates all L2 misses, excluding instruction-prefetch misses.

Miss rates for both events will be calculated by the VTune Analyzer automatically.

Hope that I have answered your questions.

Regards, Peter

Quoting - Peter Wang (Intel)


Hi, Peter
The following definitions are cited from a lecture: people.cs.vt.edu/~cameron/cs5504/lecture8.pdf
Please have a look.
Definitions:

- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)

For a particular application on 2-level cache hierarchy:
- 1000 memory references
- 40 misses in L1
- 20 misses in L2

Calculate local and global miss rates

- Miss rateL1 = 40/1000 = 4% (global and local)
- Global miss rateL2 = 20/1000 = 2%
- Local Miss rateL2 = 20/40 = 50%

For a 32 KByte first-level cache and an increasing second-level cache size:

- An L2 smaller than L1 is impractical
- The global miss rate is similar to a single-level cache's miss rate, provided L2 >> L1
- The local miss rate is not a good measure for a secondary cache

Cited from: people.cs.vt.edu/~cameron/cs5504/lecture8.pdf
So I want to measure the global and local L2 miss rates.
What is your opinion?


Quoting - explore_zjx


Hi,

Finally I understand what you meant :-) Actually, "local miss rate" and "global miss rate" are not part of the VTune Analyzer's terminology.

Note that a cache miss rate can also be defined by the user: you divided by "memory references", but the VTune Analyzer divides by "instructions retired".

According to your requirements, I suggest defining the G-miss rate and L-miss rate as:
G-miss rate = MEM_LOAD_RETIRED.L2_LINE_MISS / INST_RETIRED.ANY
L-miss rate = MEM_LOAD_RETIRED.L2_LINE_MISS / MEM_LOAD_RETIRED.L1D_MISS

Again, how to define an event ratio is your decision - the VTune Analyzer provides typical event ratios, but users can redefine them as they like.

Regards, Peter

Quoting - Peter Wang (Intel)


Thanks very much.

Quoting - Peter Wang (Intel)


This article: http://software.intel.com/en-us/articles/using-intel-vtune-performance-a...

shows the MEM_LOAD_RETIRED.L2_LINE_MISS event. Is it more helpful than L2_LINES_IN?
And do I need to calculate the ratio myself, or can I enter the formula into VTune?

Quoting - softarts


MEM_LOAD_RETIRED.L2_LINE_MISS measures L2 data cache misses only.

L2_LINES_IN covers both L2 data cache misses and L2 instruction cache misses. If you have a lot of "branchy" code, L2_LINES_IN is helpful!

You can use the VTune Analyzer's default definition:
L2 Cache Miss Rate = L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY

This result will be displayed in the VTune Analyzer's report - no action is required from the user!

Or you can use your own definition, for example if you don't care about L2 misses caused by instruction prefetching:
L2 Cache Miss Rate = MEM_LOAD_RETIRED.L2_LINE_MISS / INST_RETIRED.ANY

This result will NOT be displayed in the VTune Analyzer's report, and the user cannot enter this formula into the report!

Regards, Peter

Quoting - Peter Wang (Intel)


My application's measurement results are:

L2 cache misses (MEM_LOAD_RETIRED.L2_LINE_MISS): about 30-50%.

L1 cache misses: MEM_LOAD_RETIRED.L1 is 700 and INST_RETIRED.ANY = 0 - does that mean the L1 cache miss rate is infinite?

Is there an ideal value for the cache miss rate?

Quoting - softarts


I don't know why you got "INST_RETIRED.ANY = 0" - I guess that number is the sample count, not the event count.

The VTune Performance Analyzer has a default SAV (Sample After Value) setting for each selected event. A sample count of zero means your app ran only briefly (the event count was less than the SAV). You can increase the workload or change the default SAV value (by modifying your VTune activity).

By the way, the penalty of an L1 cache miss is low. Usually you can ignore it.

Regards, Peter

Quoting - Peter Wang (Intel)

By the way, the penalty of L1 cache miss is low. Usually you can ignore this.

Probably so, in cases where the L2 miss rate is high. However, if you suspect a significant L1 miss rate, you should also consider the L1 TLB miss rate.

Can you elaborate on how I can make use of the CPU cache in my program?
I know that at the OS level the cache is maintained automatically, based on which memory addresses are frequently accessed,
but if we could deliberately keep a specific part of my program in the CPU cache, that would help optimize my code.
Please give me a proper solution for using the cache in my program.
Please!

Quote:

Sigehere S. wrote:


Generally speaking, these are the steps to optimize your program's use of the L1/L2 caches:
1. Use events such as MEM_LOAD_UOPS_RETIRED.L2_HIT (which implies an L1 miss) and MEM_LOAD_UOPS_RETIRED.L2_MISS for event-based sampling data collection, to find the code areas with high L1/L2 cache misses (you can also use the predefined Memory Access Analysis directly, if you don't want to define your own analysis type).
2. Investigate how the code areas with high L1/L2 cache misses access memory (loads and stores) - usually there is a loop, or the function is called from another function that has a loop.
3. Investigate the data structures used in the loop, and understand their memory layout.
4. Try to keep your algorithm's working set within 256 KB, and remember that the cache line size is 64 bytes. Concentrate data accesses in a specific, linearly addressed area. For example, use a "structure of arrays" instead of an "array of structures" - i.e., access p->a[], p->b[], etc.
5. Don't use a big stride to access data in a loop; keeping accesses within 64 bytes is better.
6. Pad your data structures if they are not 64-bit aligned on a 64-bit OS.
7. Adjust your algorithm to hoist invariant data out of loops where possible (reducing load operations).
8. If threads of a multithreaded application share data, use locks and avoid false sharing.
9. Any other ideas I've missed, feel free to append.

Again, you need to reason about your code together with its memory layout, then find ideas to optimize it - and use the VTune(TM) Amplifier to verify.

Hope it helps.

Thanks, Peter

>>>4. Ensure that your algorithm accesses memory within 256KB, and cache line size is 64bytes. Please concentrate data access in specific area - linear address. For example, use "structure of array" instead of "array of structure" - assume you use p->a[], p->b[], etc.>>>

In one of the older Intel documents (related to optimization of the Pentium III) I read about a hybrid approach, so-called hybrid arrays of SoA. Is this still recommended for the newest Intel processors?

Thanks in advance.

It's good programming style to think about memory layout - not just for a specific processor. Maybe an advanced processor (or a compiler's optimization switches) can overcome a poor layout, but a good layout is never harmful.

Quote:

Peter Wang (Intel) wrote:

It's good programming style to think about memory layout - not just for a specific processor. Maybe an advanced processor (or a compiler's optimization switches) can overcome a poor layout, but a good layout is never harmful.

Thanks Peter
