I used VTune to profile an MPI+OpenMP program running on a Xeon Phi processor (native mode). However, the time VTune reports does not match the time measured by my own timer, as shown below: the VTune CPU Time (305.954) is more than 3x the elapsed time (90.255). The times VTune reports for individual functions are even larger than the 90.255 my timer gives for the whole program. I'm almost certain my timer is correct, so VTune seems to be producing wrong results somehow. The problem is consistent across runs with different numbers of ranks and threads, and in some of those runs the discrepancy is 100x-200x. I'd appreciate any ideas on how this could happen and how I might fix it!
Elapsed Time (by my timer): 90.255 s
VTune summary:
  CPU Name: Intel(R) Xeon(R) / Core i7 980X Processor
  Logical CPU Count: 244
  CPU Time: 305.954 s
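A quick sanity check you can do on these numbers is to compare them as aggregate CPU time versus wall-clock time. This rests on an assumption: VTune's CPU Time is summed over all threads (and over all profiled ranks), while the timer measures elapsed wall-clock time. If that holds, then dividing CPU Time by elapsed time gives the average number of hardware threads that were busy. That number can legitimately be well above 1 on a machine with 244 logical CPUs. The helper names below are hypothetical and only illustrate the arithmetic:

```python
# Hypothetical helpers for comparing an aggregated CPU-time figure with wall-clock time.
# Assumption: the profiler's "CPU Time" is the sum of CPU time over all threads/ranks,
# whereas "elapsed" is wall-clock time measured once for the whole run.


def effective_concurrency(cpu_time_s: float, elapsed_s: float) -> float:
    """Average number of hardware threads kept busy during the run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_time_s / elapsed_s


def aggregate_cpu_time(per_thread_cpu_s: list[float]) -> float:
    """Sum per-thread CPU times, the way an aggregated CPU-time metric would."""
    return sum(per_thread_cpu_s)


if __name__ == "__main__":
    # Numbers from the run above, in seconds.
    elapsed = 90.255
    cpu = 305.954
    print(f"average busy threads: {effective_concurrency(cpu, elapsed):.2f}")
    # Four threads, each busy for the whole run, would already add up to more CPU time.
    print(f"4 fully busy threads: {aggregate_cpu_time([elapsed] * 4):.3f} s")
```

Under this assumption, 305.954 s of CPU time over 90.255 s corresponds to roughly 3.4 busy threads on average. A 100x-200x ratio would correspond to 100-200 threads busy at once, which a 244-thread run could produce.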