How can the cache miss penalty be estimated more accurately on Ivy Bridge?

Most of the time, users consult the Tuning Guides and Performance Analysis Papers for the different Intel® Core™ processor generations to optimize their applications.
Estimating the cache miss penalty is usually the first step, because an LLC (last-level cache) miss is expensive for the CPU. The usual formula is shown below, using Ivy Bridge as an example:
% of cycles spent on memory access (LLC misses) = (MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS * 180) / CPU_CLK_UNHALTED.THREAD
In other words, the formula assumes a latency of 180 cycles for every memory load that misses the LLC. However, this is an average value, and it does not always match what happens at run time.
Is there another method that captures *runtime* performance data and comes closer to the actual behavior? The answer is "Yes". The Intel® 64 and IA-32 Architectures Optimization Reference Manual describes new events, supported in VTune™ Amplifier XE 2013, that can reflect the *runtime* latency of LLC misses.
To estimate the exposure of DRAM traffic on third generation Intel Core processors, the remainder of L2_PENDING is used for MEM Bound:
%MEM Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Miss_fraction / CLOCKS
Where L3_Miss_fraction is:
MEM_L3_WEIGHT * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS * MEM_L3_WEIGHT)
The correction factor MEM_L3_WEIGHT is approximately the external memory to L3 cache latency ratio. A factor of 7 can be used for the third generation Intel Core processor family.
Let’s run a simple test to see what advantage the new events offer.
Example code:
=========================== 
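#include <stdio.h>
#define NUM 1024
double a[NUM][NUM], b[NUM][NUM], c[NUM][NUM];
/* Naive matrix multiply: the inner loop reads b column-wise, a stride of
   NUM doubles per iteration, which is unfriendly to the cache. */
void multiply(void)
{
    unsigned int i, j, k;
    for (i = 0; i < NUM; i++) {
        for (j = 0; j < NUM; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < NUM; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}
int main(void)
{
    /* run the matrix multiply code that the collection measures */
    multiply();
    return 0;
}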
====================================================
# amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD,CYCLE_ACTIVITY.STALLS_L2_PENDING,MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS,MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS -- ./matrix
amplxe: Using result path `/home/peter/r005runsa'
amplxe: Executing actions 50 % Generating a report                             
Collection and Platform Info
----------------------------
Parameter                 r005runsa
------------------------  ----------------------------------------------------------------------------
Application Command Line  ./matrix 
Computer Name             ivb01
Environment Variables     
MPI Process Rank          
Operating System          2.6.32-279.el6.x86_64 Red Hat Enterprise Linux Server release 6.3 (Santiago)
Result Size               4144003
User Name                 root
CPU
---
Parameter          r005runsa
-----------------  -------------------------------------------------
Frequency          3500000000
Logical CPU Count  8
Name               3rd generation Intel(R) Core(TM) Processor family
Summary
-------
Elapsed Time:  7.332
Event summary
-------------
Hardware Event Type                Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
---------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.THREAD            28510042765                14255                             2000003
CYCLE_ACTIVITY.STALLS_L2_PENDING   12598018897                6299                              2000003
MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS  1600112                    16                                100007
MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS   15356447                   307                               50021
amplxe: Executing actions 100 % done
Now we can use two methods to estimate the cost of LLC misses.
1. (Old, estimated data) % of cycles spent on memory access (LLC misses) = (MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS * 180) / CPU_CLK_UNHALTED.THREAD = 1600112 * 180 / 28510042765 = 1.01%
2. (New, calculated from runtime data) L3_Miss_fraction is:
7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + 7 * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS) = 7 * 1600112 / (15356447 + 7 * 1600112) = 11200784 / 26557231 = 0.422
%MEM Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Miss_fraction / CLOCKS = 12598018897 * 0.422 / 28510042765 = 18.6%
The new method starts from the L2 pending stall cycles, which in this case are 44% of all CPU clocks (CYCLE_ACTIVITY.STALLS_L2_PENDING / CPU_CLK_UNHALTED.THREAD). The L3 miss fraction attributes 42.2% of those stall cycles to LLC misses, so LLC misses account for 18.6% of all CPU clocks. This is more accurate than old method 1, which only multiplies the LLC miss count by an assumed latency and ignores the measured stall cycles.