Vtune hardware event counts wrong?

Vtune hardware event counts wrong?

Hi all,

I started using Vtune to analyze the performance of my application and found that many of the counters are not tallying. Below I just printed a few of them. To start with CPU_CLK_UNHALTED.THREAD is showing a number which doesn't match the CPU time. The CPU I'm using is a Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (Haswell). Based on this I was expecting the cycles spent on the thread to match the CPU time @2.5Ghz. Please note that for this run I just sampled a single logical core and that core was running a single thread exclusively (core is isolated). The above doesn't match by miles, based on the event counter the program only run for CPU_CLK_UNHALTED.THREAD / 2.5Ghz = 0.7 sec!

Besides this we can also see the events counts for the cache misses. They don't tally either. I was expecting L1_miss = L2_hit + L2_miss; L2_miss = L3_hit + L3_miss. There is also the CPI reported... as per the events the CPI would be of 0.013!! This can't be the case as the best theoretical CPI we can achieve is 0.25 (there are 4 execution pipes).

Am I missing something? Could it be that I have a missconfigured Vtunes: as in the event masks and Umask values are being incorrectly passed by vtune because it thinks we are on a different uarchitecture (say Haswell vs Ivy Bridge)? Any hints would be appreciated!

Elapsed Time: 60.059s
CPU Time: 42.334s
CPI Rate: 0.013
    
Total Thread Count: 1
Paused Time:  0s
    
Hardware Events
Hardware Event Type
    
Hardware Event Count                                     Hardware Event    Sample Count   Events Per Sample
CPU_CLK_UNHALTED.REF_TSC                    105,834,158,751   52,917              2000003
CPU_CLK_UNHALTED.THREAD                      1,770,002,655      885                   2000003
INST_RETIRED.ANY                                       141,332,211,998   70,666              2000003
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS        526,415,792        752                  100003
MEM_LOAD_UOPS_RETIRED.L1_MISS_PS     52,654,078,981    3,761                2000003
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS        247,807,434        354                  100003
MEM_LOAD_UOPS_RETIRED.L2_MISS_PS    199,505,985         285                  100003
MEM_LOAD_UOPS_RETIRED.L3_HIT_PS        47,970,139          137                  50021
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS     38,516,170          110                  50021

 

Best,

Matias

3 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

The REF_TSC counter matches the reported CPU seconds perfectly...

The INST_RETIRED.ANY looks fine -- about 1.33 instructions per TSC cycle.

The CPU_CLK_UNHALTED.THREAD number is weird and obviously wrong.  (The value implies 80 instructions per cycle, which would be really exciting if it were real....)

There could be some confusion with the user and kernel enable bits.

There could be interference from an NMI watchdog process (these often use CPU_CLK_UNHALTED.THREAD).

The only way I know how to debug this stuff is to use msrtools to dump the performance counter configuration registers during the run. At that point it is usually just as easy to do all the measurements myself and skip using VTune entirely.....

"Dr. Bandwidth"

Hi John,

Thanks for the prompt reply!

Regarding NMI watchdog I disabled it (echo 0 > /proc/sys/kernel/nmi_watchdog)

Not only the clock ticks are not matching but also the L1/L2/L3 counts as well. I re run these counters using perf and getting the raw counters from Intel's User Manual Vol2 Chapter 19 for the above mentioned CPU, and these do add up (there is a glitch on L2_Miss = L3_Hit + L3_Miss) but the counters do make much more sense in general...

I'm particularly interested in Vtunes because of the event sampling feature... roughly matching the event counts to the instruction counter (and source line) where they are generated is very useful to pinpoint the code regions that need attention.

Meanwhile I'll take a look into msrtools!

root@test:/md> perf stat -e instructions,r01d1,r08d1,r02d1,r10d1,r04d1,r20d1,r01d3 -C 4 -B sleep 42

 Performance counter stats for 'sleep 42':

       136220124158 instructions              #    0.00  insns per cycle         [49.97%]
       39301638501 r01d1                                                        [49.92%]   <= L1_Hit
         136907962 r08d1                                                        [49.95%]    <= L1_Miss
          92917335 r02d1                                                        [49.98%]     <= L2_Hit
          44093038 r10d1                                                        [50.00%]     <= L2_Miss
          19069173 r04d1                                                        [40.06%]      <= L3_Hit
           1205756 r20d1                                                        [40.07%]       <= L3_Miss
           1242890 r01d3                                                        [40.04%]       <= L3 Missed but serviced by Local Dram

      42.000461652 seconds time elapsed\

Best,

Matias

Leave a Comment

Please sign in to add a comment. Not a member? Join today