I am experimenting with the Intel Performance Counter Monitor (PCM) tool. I reused some of its code to write a kernel module that reads performance counter data. I basically follow the procedure in PCM::program() to set up the on-core counters, and then use rdmsr/wrmsr to read and write the performance counters. I found that the collected data are not accurate when the time between two reads is short. For example, here is my procedure:
/* routines to start the counter for the number of branch instructions, mimicking the PCM::program() code */
/* routines to read the counter, using rdmsr and wrmsr */
for (i = 0; i < 1000; ++i) arr[i] = 1;
/* routines to read the counter again, using rdmsr and wrmsr */
The number of branch instructions should be 1000, but the reading consistently shows about 6500 (after minus before). I am aware that rdmsr has some latency, probably 100+ cycles, but an extra 5500 branch instructions seems far too many for 100+ cycles. I am not sure whether this is caused by my setup or whether performance counters should not be used this way. Can someone give me some suggestions? Thanks.
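
For reference, here is a minimal sketch of the arithmetic involved in the steps above. It is not the actual module code and not PCM's implementation. It assumes the architectural "branch instructions retired" event (event 0xC4, umask 0x00) programmed through IA32_PERFEVTSEL0, and a 48-bit general-purpose counter (the real width is reported by CPUID leaf 0AH and may differ). The helper names evtsel_encode, pmc_delta, and pmc_net are hypothetical; the actual rdmsr/wrmsr calls are left out so the sketch compiles in user space.

```c
#include <stdint.h>

/* Architectural MSR addresses, listed for reference only. */
#define MSR_IA32_PMC0        0xC1u
#define MSR_IA32_PERFEVTSEL0 0x186u

/* IA32_PERFEVTSELx flag bits. */
#define EVTSEL_USR (1ull << 16) /* count while running at CPL > 0 */
#define EVTSEL_OS  (1ull << 17) /* count while running at CPL 0   */
#define EVTSEL_EN  (1ull << 22) /* enable the counter             */

/* Architectural "branch instructions retired" event. */
#define EVT_BR_INST_RETIRED   0xC4u
#define UMASK_BR_INST_RETIRED 0x00u

/* Hypothetical helper: build the value to write to IA32_PERFEVTSELx. */
static inline uint64_t evtsel_encode(uint8_t event, uint8_t umask,
                                     uint64_t flags)
{
    return (uint64_t)event | ((uint64_t)umask << 8) | flags;
}

/*
 * Hypothetical helper: difference between two raw counter reads for a
 * counter that is `width` bits wide. Masking the subtraction keeps the
 * result correct across a single wraparound.
 */
static inline uint64_t pmc_delta(uint64_t before, uint64_t after,
                                 unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ull : ((1ull << width) - 1);
    return (after - before) & mask;
}

/*
 * Hypothetical helper: subtract a calibrated baseline (the delta seen
 * between two back-to-back reads with no work in between) from a
 * measured delta, clamping at zero.
 */
static inline uint64_t pmc_net(uint64_t measured, uint64_t baseline)
{
    return (measured > baseline) ? (measured - baseline) : 0;
}
```

In a kernel module, the value from evtsel_encode would be written with wrmsr to MSR_IA32_PERFEVTSEL0 and the raw counts read with rdmsr from MSR_IA32_PMC0. Whether EVTSEL_USR, EVTSEL_OS, or both are set determines which privilege levels contribute to the count, so the flags are worth checking against what the loop is expected to measure. Measuring the delta of two back-to-back reads first and passing it to pmc_net as the baseline is one way to separate the cost of the measurement itself from the branches in the loop.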