How to estimate the cache miss penalty more accurately on Ivy Bridge?

Most of the time, users consult the Tuning Guides and Performance Analysis Papers for the different Intel® Core™ processor generations when optimizing their applications.
Estimating the cache miss penalty is usually the first consideration, because a last-level cache (LLC) miss is expensive for the CPU. See the formula below (using Ivy Bridge as an example):
% of cycles spent on memory access (LLC misses) = (MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS * 180) / CPU_CLK_UNHALTED.THREAD
That is, we assume a latency of 180 cycles for every memory load that misses the LLC. However, this is an average value, and it does not always hold for a given workload.
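The fixed-latency estimate above can be sketched in C as follows. This is only an illustration: the function name `llc_miss_cycle_fraction` and the sample counts in the note below are hypothetical, while the event names and the 180-cycle constant are taken from the formula.

```c
#include <stdint.h>

/* Assumed average LLC-miss latency in core cycles, taken from the formula above. */
#define LLC_MISS_LATENCY_CYCLES 180.0

/*
 * Fraction of core cycles attributed to LLC misses (multiply by 100 for a percentage).
 *   llc_miss_count: MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS
 *   clk_unhalted:   CPU_CLK_UNHALTED.THREAD
 * Returns 0.0 when no cycles were counted, to avoid dividing by zero.
 * (Hypothetical helper, not part of any Intel tool.)
 */
double llc_miss_cycle_fraction(uint64_t llc_miss_count, uint64_t clk_unhalted)
{
    if (clk_unhalted == 0)
        return 0.0;
    return ((double)llc_miss_count * LLC_MISS_LATENCY_CYCLES) / (double)clk_unhalted;
}
```

For instance, with a made-up 1,000,000 LLC misses over 1,000,000,000 unhalted cycles, the function returns 0.18, that is, 18% of cycles. Because the 180-cycle constant is fixed, the result scales linearly with the miss count regardless of how much of that latency was actually exposed.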
Is there another method that captures *runtime* performance data and is closer to the actual behavior? The answer is "Yes". See the Intel® 64 and IA-32 Architectures Optimization Reference Manual: VTune™ Amplifier XE 2013 supports new events that reflect the *runtime* latency of LLC misses.
To estimate the exposure of DRAM traffic on third generation Intel Core processors, the remainder of L2_PENDING is used for MEM Bound:
%MEM Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Miss_fraction / CLOCKS
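Where L3_Miss_fraction is:
L3_Miss_fraction = MEM_L3_WEIGHT * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_L3_WEIGHT * MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS)
The correction factor MEM_L3_WEIGHT is approximately the external memory to L3 cache latency ratio. A factor of 7 can be used for the third generation Intel Core processor family.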
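The %MEM Bound calculation can be sketched in C as shown here. The sketch assumes that L3_Miss_fraction is the weighted share of LLC misses, MEM_L3_WEIGHT * LLC_MISS / (LLC_HIT + MEM_L3_WEIGHT * LLC_MISS), as the Optimization Reference Manual defines it, and that CLOCKS stands for CPU_CLK_UNHALTED.THREAD. The function names and the sample counts in the note below are hypothetical.

```c
#include <stdint.h>

/* External-memory to L3 latency ratio suggested for 3rd generation Intel Core. */
#define MEM_L3_WEIGHT 7.0

/*
 * Weighted share of LLC misses among loads that reached the L3 (assumed definition).
 *   llc_hit:  MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS
 *   llc_miss: MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS
 */
double l3_miss_fraction(uint64_t llc_hit, uint64_t llc_miss)
{
    double weighted_miss = MEM_L3_WEIGHT * (double)llc_miss;
    double denom = (double)llc_hit + weighted_miss;
    return (denom == 0.0) ? 0.0 : weighted_miss / denom;
}

/*
 * Fraction of cycles stalled on DRAM (multiply by 100 for %MEM Bound).
 *   stalls_l2_pending: CYCLE_ACTIVITY.STALLS_L2_PENDING
 *   clocks:            assumed to be CPU_CLK_UNHALTED.THREAD
 */
double mem_bound_fraction(uint64_t stalls_l2_pending, uint64_t llc_hit,
                          uint64_t llc_miss, uint64_t clocks)
{
    if (clocks == 0)
        return 0.0;
    return (double)stalls_l2_pending * l3_miss_fraction(llc_hit, llc_miss)
           / (double)clocks;
}
```

With made-up counts of 700 LLC hits and 100 LLC misses, the weighted miss fraction is 0.5; combined with 400 stall cycles out of 1000 clocks, `mem_bound_fraction` returns 0.2. Unlike the fixed 180-cycle estimate, the result here is driven by measured stall cycles, so it follows the latency the workload actually experienced.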
Let's run a simple test to see what advantage the new events offer.
Example code:
===========================
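#include <stdio.h>

#define NUM 1024

/* Global arrays are zero-initialized; the values do not matter for profiling. */
double a[NUM][NUM], b[NUM][NUM], c[NUM][NUM];

/* Naive matrix multiply: c = a * b. */
void multiply(void)
{
    unsigned int i, j, k;

    for (i = 0; i < NUM; i++) {
        for (j = 0; j < NUM; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < NUM; k++) {
                /* b[k][j] walks down a column, so each k touches a different cache line. */
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

int main(void)
{
    /* Run the matrix multiply code that is being profiled. */
    multiply();
    return 0;
}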
====================================================
amplxe: Using result path `/home/peter/r005runsa'
amplxe: Executing actions 50 % Generating a report
Collection and Platform Info
----------------------------
Parameter                 r005runsa
------------------------  ----------------------------------------------------------------------------
Application Command Line  ./matrix
Computer Name             ivb01
Environment Variables
MPI Process Rank
Operating System          2.6.32-279.el6.x86_64 Red Hat Enterprise Linux Server release 6.3 (Santiago)
Result Size               4144003
User Name                 root
CPU
---
Parameter          r005runsa
-----------------  -------------------------------------------------
Frequency          3500000000
Logical CPU Count  8
Name               3rd generation Intel(R) Core(TM) Processor family
Summary
-------
Elapsed Time:  7.332
Event summary
-------------
Hardware Event Type                Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
---------------------------------  -------------------------  --------------------------------  -----------------