I started using VTune to analyze the performance of my application and found that many of the counters do not tally. I have printed a few of them below. To start with, CPU_CLK_UNHALTED.THREAD shows a number that doesn't match the CPU time. The CPU I'm using is an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (Haswell), so I was expecting the cycles spent on the thread to match the CPU time at 2.5 GHz. Please note that for this run I sampled a single logical core, and that core was running a single thread exclusively (the core is isolated). The numbers are off by miles: based on the event counter, the program only ran for CPU_CLK_UNHALTED.THREAD / 2.5 GHz = 0.7 sec!
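For reference, the cycles-to-seconds arithmetic can be reproduced with a short Python sketch. The counter values and the CPU time are copied from the report in this post. The assumption, which nothing in this post confirms, is that both the core clock and the TSC tick at the nominal 2.5 GHz. The helper name `seconds_at` is only for illustration.

```python
# Values copied from the VTune report in this post.
NOMINAL_HZ = 2.5e9                      # assumed: nominal frequency of the E5-2680 v3
CPU_TIME_S = 42.334                     # "CPU Time" from the summary
CLK_UNHALTED_THREAD = 1_770_002_655     # CPU_CLK_UNHALTED.THREAD
CLK_UNHALTED_REF_TSC = 105_834_158_751  # CPU_CLK_UNHALTED.REF_TSC


def seconds_at(cycles, hz=NOMINAL_HZ):
    """Convert a cycle count to seconds at a fixed clock rate (hypothetical helper)."""
    return cycles / hz


thread_s = seconds_at(CLK_UNHALTED_THREAD)
ref_s = seconds_at(CLK_UNHALTED_REF_TSC)

print(f"THREAD  / 2.5 GHz = {thread_s:.3f} s")
print(f"REF_TSC / 2.5 GHz = {ref_s:.3f} s")
print(f"Reported CPU time = {CPU_TIME_S:.3f} s")
```

Under that assumption, REF_TSC converts to roughly the reported CPU time, while THREAD converts to about 0.7 s, so only the THREAD count looks off.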
Besides this, the event counts for the cache misses don't tally either. I was expecting L1_miss = L2_hit + L2_miss and L2_miss = L3_hit + L3_miss. There is also the reported CPI: as per the events, the CPI would be 0.013! This can't be the case, as the best theoretical CPI we can achieve is 0.25 (the pipeline is 4-wide).
Am I missing something? Could it be that I have a misconfigured VTune, as in the event masks and umask values are being passed incorrectly because VTune thinks we are on a different microarchitecture (say Haswell vs Ivy Bridge)? Any hints would be appreciated!
Elapsed Time: 60.059s
CPU Time: 42.334s
CPI Rate: 0.013
Total Thread Count: 1
Paused Time: 0s
Hardware Event Type               Hardware Event Count  Hardware Event Sample Count  Events Per Sample
CPU_CLK_UNHALTED.REF_TSC               105,834,158,751                       52,917            2000003
CPU_CLK_UNHALTED.THREAD                  1,770,002,655                          885            2000003
INST_RETIRED.ANY                       141,332,211,998                       70,666            2000003
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS            526,415,792                          752             100003
MEM_LOAD_UOPS_RETIRED.L1_MISS_PS        52,654,078,981                        3,761            2000003
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS            247,807,434                          354             100003
MEM_LOAD_UOPS_RETIRED.L2_MISS_PS           199,505,985                          285             100003
MEM_LOAD_UOPS_RETIRED.L3_HIT_PS             47,970,139                          137              50021
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS            38,516,170                          110              50021
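
To make the mismatch concrete, here is a minimal Python sketch of the CPI and cache-tally checks described above. The counts are copied from the table. The short dictionary keys and the helper names `cpi` and `tally_gap` are my own shorthand, not VTune names.

```python
# Hardware event counts copied from the table above.
counts = {
    "CLK_THREAD": 1_770_002_655,    # CPU_CLK_UNHALTED.THREAD
    "INST": 141_332_211_998,        # INST_RETIRED.ANY
    "L1_MISS": 52_654_078_981,      # MEM_LOAD_UOPS_RETIRED.L1_MISS_PS
    "L2_HIT": 247_807_434,          # MEM_LOAD_UOPS_RETIRED.L2_HIT_PS
    "L2_MISS": 199_505_985,         # MEM_LOAD_UOPS_RETIRED.L2_MISS_PS
    "L3_HIT": 47_970_139,           # MEM_LOAD_UOPS_RETIRED.L3_HIT_PS
    "L3_MISS": 38_516_170,          # MEM_LOAD_UOPS_RETIRED.L3_MISS_PS
}

BEST_CPI = 0.25  # theoretical best CPI quoted in this post


def cpi(cycles, instructions):
    """Cycles per instruction (hypothetical helper)."""
    return cycles / instructions


def tally_gap(upper_miss, lower_hit, lower_miss):
    """upper_miss - (lower_hit + lower_miss); 0 would mean the two levels tally."""
    return upper_miss - (lower_hit + lower_miss)


cpi_value = cpi(counts["CLK_THREAD"], counts["INST"])
l1_gap = tally_gap(counts["L1_MISS"], counts["L2_HIT"], counts["L2_MISS"])
l2_gap = tally_gap(counts["L2_MISS"], counts["L3_HIT"], counts["L3_MISS"])

print(f"CPI from events: {cpi_value:.3f} (best expected: {BEST_CPI})")
print(f"L1_miss - (L2_hit + L2_miss) = {l1_gap:,}")
print(f"L2_miss - (L3_hit + L3_miss) = {l2_gap:,}")
```

Both gaps come out far from zero, and the computed CPI is well below 0.25, which is the inconsistency I am asking about.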