I have the following question.
When we collect hardware events for a multi-threaded program on Xeon Phi, are the statistics for each event reported cumulatively or per thread? For example, does CPU_CLK_UNHALTED behave like the CPU time reported by Linux 'time', giving the total clock cycles used by the application across all of its threads rather than something like elapsed time? Is this correct?
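To make the distinction I am asking about concrete, here is a minimal sketch of the two reporting models. The per-thread counts are made-up numbers for illustration, not measurements, and the names `per_thread_cycles`, `cumulative` and `per_thread` are hypothetical. It does not use or represent any profiler API.

```python
# Hypothetical per-thread CPU_CLK_UNHALTED counts, keyed by thread id.
# These values are invented for illustration only.
per_thread_cycles = {0: 1_200_000, 1: 1_150_000, 2: 1_300_000, 3: 1_250_000}


def cumulative(counts):
    """Cumulative reporting: a single total summed over all threads."""
    return sum(counts.values())


def per_thread(counts):
    """Per-thread reporting: one value per thread, in thread-id order."""
    return [counts[tid] for tid in sorted(counts)]


total_cycles = cumulative(per_thread_cycles)
thread_cycles = per_thread(per_thread_cycles)

print("cumulative:", total_cycles)
print("per thread:", thread_cycles)
```

If reporting is cumulative, I would expect to see only the single total. If it is per thread, I would expect one value for each thread.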
How are cache_fill and other hardware events reported? In general, are they accumulated over all cores, or counted only for the core specified in the -collect cpu-mask?