I have been using VTune for a while and would appreciate some advice about the metrics I'm trying to measure (I'm using a processor from the Harpertown family, Core microarchitecture):
1) Stall time: The processor's documentation states that it can issue/retire up to 4 instructions per cycle. Assuming that the ideal CPI is therefore 0.25, may I compute the relative stall time as (Measured_CPI - 0.25) / Measured_CPI? E.g., assuming that the measured CPI is 1.25, is it correct to say that the total stall time is 80% ((1.25 - 0.25) / 1.25 = 1 / 1.25 = 0.8)?
2) L2 miss penalty: How accurate is it to compute the stall time due to L2 misses as L2_misses * avg_mem_latency? Also, what is the most precise way to measure average memory latency? I've tried to use the counter "BUS_REQUEST_OUTSTANDING", as suggested in the Intel 64 and IA-32 Optimization Reference Manual, but the results do not make sense: in some cases, VTune reports more BUS_REQUEST_OUTSTANDING events than CPU_CLK_UNHALTED.CORE events.
3) L2 cache miss rate: Is the built-in "L2 Cache Miss Rate" ratio provided by VTune inconsistent with what most of us consider "miss rate" (number of L2 misses divided by number of L2 accesses)? Since "L2 Cache Miss Rate" is computed as L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY, shouldn't it be called "misses per instruction"? Is it correct to compute the L2 miss rate as:
L2_RQSTS.SELF.ANY.I_STATE / L2_RQSTS.SELF.ANY.MESI ?
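For concreteness, the formulas in the three questions above can be sketched as a few small Python helpers. This is only an illustration of the arithmetic being asked about, not a validated methodology: the function names are made up for this post, the inputs are whatever event counts VTune reports, and the L2 miss penalty product assumes that misses never overlap, which is itself part of the question.

```python
# Sketch of the ratios asked about above. Function names are hypothetical;
# inputs are raw event counts as reported by the profiler.

IDEAL_CPI = 0.25  # assumption from question 1: up to 4 instructions per cycle


def stall_fraction(measured_cpi, ideal_cpi=IDEAL_CPI):
    """Question 1: (Measured_CPI - ideal_CPI) / Measured_CPI."""
    return (measured_cpi - ideal_cpi) / measured_cpi


def l2_miss_stall_cycles(l2_misses, avg_mem_latency_cycles):
    """Question 2: L2_misses * avg_mem_latency.

    Assumes every miss stalls for the full latency with no overlap
    between outstanding misses, so this is at best an upper bound.
    """
    return l2_misses * avg_mem_latency_cycles


def l2_misses_per_instruction(l2_lines_in_self_any, inst_retired_any):
    """Question 3: the built-in ratio, L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY."""
    return l2_lines_in_self_any / inst_retired_any


def l2_miss_rate(l2_rqsts_i_state, l2_rqsts_mesi):
    """Question 3: proposed ratio, L2_RQSTS.SELF.ANY.I_STATE / L2_RQSTS.SELF.ANY.MESI."""
    return l2_rqsts_i_state / l2_rqsts_mesi


if __name__ == "__main__":
    # The worked example from question 1: measured CPI of 1.25.
    print(f"stall fraction: {stall_fraction(1.25):.0%}")
```

With the measured CPI of 1.25 from question 1, stall_fraction gives 0.8, i.e. the 80% figure mentioned there.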