The accuracy of the performance counter statistics

Hi ,

I am trying out the Intel Performance Counter Monitor (PCM) tool. I reuse some of its code in a kernel module that reads performance counter data. I basically follow the procedure in PCM::program() to set up the core counters, and then use rdmsr/wrmsr to read and write the performance counters. I found that the collected data are not accurate when the time between two reads is small. For example, here is my procedure:

/* routines to start the counter of # of branch instructions, mimic PCM:program() code*/

/* routines to read the counter, using rdmsr and wrmsr*/

for ( i =0; i < 1000; ++i) arr[i] = 1;

/* routines to read the counter again, using rdmsr and wrmsr*/

The number of branch instructions should be 1000, but the reading consistently shows about 6500 (after - before). I am aware that rdmsr has some latency, probably 100+ cycles. But an extra 5500 branch instructions seems too large for 100+ cycles. I am not sure whether this is because of my setup, or whether performance counters should not be used in this way. Can someone give me some suggestions? Thanks.

Hello Xin,

You are running into the overhead of calling the driver. Your user mode (ring 3) code has to do a call to the driver, which calls the kernel (switching from ring 3 to ring 0), does the rdmsr/wrmsr instruction, and returns. And there are probably multiple calls to the driver per call to PCM.

If you truly want to read the counters with the minimum overhead, you can use the rdpmc instruction, but this is not easy. Usually rdpmc is not enabled for use from ring 3; the PCE bit in CR4 has to be set. On Linux there is a driver that enables rdpmc (https://github.com/andikleen/simple-pmu). It works on older versions of Linux. I don't know of a Windows driver that enables rdpmc. Even if you enable rdpmc from ring 3, you will only be able to read the core PMU counters (3 fixed and 3-8 programmable core counters). You will still have to make a trip into ring 0 to do a wrmsr, or to rdmsr any MSR besides the core PMC counters.

I've used Andi's driver to do very low overhead measurements before. But it is not for the faint of heart.

Hope this helps,

Pat

Hi Patrick,

Thank you for your quick response. I actually reuse only the code of PCM::program() and put it in a kernel module, not the code for accessing MSRs. To read or write an MSR, I use assembly code like this:

asm volatile ("\trdmsr\n" : "=a" (lo), "=d" (hi) : "c" (msr))

to directly access msr in the kernel module. This does not involve ring transitions.

Using rdpmc should be one solution to reduce the overhead. Based on your experience, what is the granularity that rdpmc/rdmsr can achieve? Can they measure 1000 or even 100 instructions/cycles accurately?

Thank you

I figured out the problem. It was caused by a mistake in my code. Now the readings seem very accurate.

I am wondering where I can find documentation that discusses the accuracy of the performance counters in general. Can anyone give me some pointers? Thanks.

What kind of accuracy do you mean? I think that only information about the accuracy of the rdtsc instruction is freely available.

Performance counters do not have an option to count events as a function of the instruction pointer. In your case they will simply increment the counter by looking at the uops that constitute branch instructions. Moreover, when there is a high frequency of context switches, not only your thread will be measured.

Hello iliyapolak,

Thank you for your reply. For example, if I have only one instruction in between two rdmsr instructions (set up to count the number of retired instructions), will the difference between the two readings be exactly 1? My test result is not exactly 1, but 3. In fact, this is accurate enough for me. But I am wondering whether someone has documented this more comprehensively, for example for different counters (core counters, uncore counters, etc.) or different instructions (rdpmc, rdmsr, etc.).

I quickly googled it and found this post about the latency of rdtsc. It suggests that there should be at least 1000 cycles between two readings for the count to be accurate. Should I make a similar assumption when using rdmsr?

http://software.intel.com/en-us/forums/topic/305287

Hello Xin,

I'm confused. Are you saying you are putting code into the ring0 driver and trying to time it in ring0? Because rdmsr can only be executed in ring0.

The rdpmc and rdmsr instructions take about 100-200 cycles. Yes counts are accurate. But you are again sort of confusing me. rdmsr gets the value of the counter (if the msr that rdmsr is reading is a PMU msr). So if you programmed clockticks.ref into the counter, read the msr, run some code and then reread the msr, you would get the unhalted clockticks and it would reflect the overhead of the rdmsr instruction. But if you programmed instructions.retired and reran your test, you would get the number of instructions plus 1 for the rdmsr.

Does that make sense?

Pat

Hi Xin

What are you trying to count, the number of branches?

I think that for any measurement based on CPU clock cycles, the measured code should run (or be looped) for longer than the total cost of the measuring instructions, for example rdtsc.

Hi Patrick,

Yes, I put the code into a kernel module, so all of this code runs in ring 0. Here is what my code looks like:

PMC1 is programmed to count the number of retired instructions (0xc0 for the event number and 0x00 for the umask, according to Table 19-1 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3: System Programming Guide).

long long int lo_start, lo_end, hi_start, hi_end;

long msr = IA32_PMC1;

/*program performance counter routines*/

asm volatile ("rdmsr" : "=a" (lo_start), "=d" (hi_start) : "c" (msr));   //start to read

asm volatile ("mov $0, %%r10" : : : "r10");                              //run one dummy instruction

asm volatile ("rdmsr" : "=a" (lo_end), "=d" (hi_end) : "c" (msr));       //read again

After putting the low and high 32-bit values together, the difference between end and start is 3. Shouldn't I expect a difference of 1?

Thank you
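
A minimal sketch of the "putting low and high together" step, written as plain C so it can be checked outside the kernel. The function names are hypothetical, and the 48-bit counter width is an assumption (CPUID leaf 0x0A reports the actual width on a given processor).

```c
#include <assert.h>
#include <stdint.h>

/* Join the EDX:EAX halves returned by rdmsr/rdpmc into one 64-bit value. */
uint64_t combine_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Difference between two readings of a counter that is `width` bits wide.
 * Masking to the counter width keeps the result correct if the counter
 * wrapped around once between the two readings. */
uint64_t counter_delta(uint64_t start, uint64_t end, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t mask = (width == 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
    return (end - start) & mask;
}
```

With the readings from the post, counter_delta(combine_halves(lo_start, hi_start), combine_halves(lo_end, hi_end), 48) would give the reported difference.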

You need to look at the assembly code, not the __asm() stuff. So you'll have to disassemble the compiled code. I think you'll see there are 2 instructions between the two rdmsr instructions.

Hi Patrick,

You are right. I did see two extra mov instructions between the rdmsr instructions, which explains the reading. The reading seems very accurate even though rdmsr has some latency. Thank you very much.

Inline RDPMC instructions (or the corresponding RDMSR instruction in the kernel) should count correctly for even very small code sections.  I just tested a bunch of loops that did nothing but execute RDPMC instructions and when I used the RDPMC instructions to count branches (Event 0x88, Umask 0x81), the values incremented exactly when they were supposed to -- every iteration for the original loop, and once every 8 iterations when I unrolled the loop by 8.

On the other hand, the RDPMC instruction takes time, and that can distort several aspects of the code under test.  The overhead of the RDPMC instruction is almost certain to vary across products.   On my Xeon E5-2680 (Sandy Bridge EP) systems, repeated consecutive calls to the cycle counting event (Event 0x3C, Umask 0x00, or the corresponding fixed-function event accessed by executing RDPMC with counter number of (1<<30)+1) almost always show deltas of 39 cycles from one reading to the next when I save the low-order 32-bits into a cache-contained array.  This increases by a few cycles if I combine the upper and lower 32-bit results into a 64-bit value and save it into a cache-contained array, with 43 cycles as the most common delta in cycle count values.

Some care needs to be taken with measuring small code sections, since the RDPMC instruction is not guaranteed to be ordered with respect to surrounding instructions.  I saw no deviations in my simple loop that was not doing anything except reading the PMC and saving the results, but more complex loops could result in out-of-order execution.
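
A minimal sketch of the RDPMC counter selection mentioned above, assuming GCC or Clang inline assembly. The helper names are hypothetical. The rdpmc() wrapper itself faults unless it runs in ring 0 or CR4.PCE has been set, so only the index helper is safe to call from an ordinary user program.

```c
#include <assert.h>
#include <stdint.h>

/* ECX bit 30 tells RDPMC to read a fixed-function counter. */
#define RDPMC_FIXED_FLAG (UINT32_C(1) << 30)

/* ECX value for fixed-function counter n. n = 1 gives (1<<30)+1,
 * the core-cycle counter referred to in the post. */
uint32_t rdpmc_fixed_index(unsigned n)
{
    assert(n < 8); /* hypothetical sanity bound, not an architectural limit */
    return RDPMC_FIXED_FLAG | n;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Execute RDPMC for the given index and join EDX:EAX into 64 bits. */
static inline uint64_t rdpmc(uint32_t index)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return ((uint64_t)hi << 32) | lo;
}
#endif
```

A programmable counter is selected by passing its plain number (0, 1, ...) as the index instead.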

"Dr. Bandwidth"

I was thinking about the possibility of a polluted measurement when the latency of the instruction that takes the measurement is greater than the latency of the profiled instructions. For example, the measured instruction(s) can execute out of order, or even in parallel with the profiling instruction, and because of the shorter latency of the profiled instruction even minuscule changes (in counted CPU cycles) cannot be measured effectively.

Hello Illyapolak,

If one is really worried about miscounts resulting from out-of-order instruction flow, one can put a serializing instruction before the rdmsr (or rdpmc). cpuid is a serializing instruction; rdtscp is not architecturally serializing, but it does wait until all prior instructions have executed. So you will see lots of cycles wasted as you drain the pipeline, but you eliminate the out-of-order worries. I've never really run into a situation where I needed to worry about it anyway.

Pat
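
A minimal sketch of the "serialize, then read" pattern described above, assuming GCC or Clang. The counter read is passed in as a function pointer (hypothetical read_counter_fn) so the same harness could wrap rdmsr in a kernel module or rdpmc in user space; fake_counter and fake_work are stand-ins that only exist to exercise the harness without ring 0 access.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*read_counter_fn)(void); /* hypothetical: wraps rdmsr or rdpmc */
typedef void (*work_fn)(void);             /* the code region being measured */

/* Drain the pipeline with CPUID on x86; elsewhere only a compiler barrier. */
void serialize(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t a = 0, b, c = 0, d;
    __asm__ volatile ("cpuid" : "+a"(a), "=b"(b), "+c"(c), "=d"(d) : : "memory");
    (void)b; (void)d;
#else
    __asm__ volatile ("" ::: "memory");
#endif
}

/* Serialize before each counter read and return the difference. */
uint64_t measure(read_counter_fn read_counter, work_fn work)
{
    assert(read_counter && work);
    serialize();
    uint64_t start = read_counter();
    work();
    serialize();
    uint64_t end = read_counter();
    return end - start;
}

/* Stand-ins: a software "counter" that the fake work advances by 3. */
static uint64_t fake_ticks;
uint64_t fake_counter(void) { return fake_ticks; }
void fake_work(void) { fake_ticks += 3; }
```

The second CPUID is itself counted when the event is retired instructions or cycles, so its cost would still need to be calibrated and subtracted.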

Hello Dr. McCalpin

Thank you for the information. Your measurement results are similar to mine; the latency is about 30 cycles on Nehalem-EP processors.
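
A minimal sketch of how such a read overhead could be estimated from back-to-back readings, by taking the most common delta as in the measurements quoted in this thread. The function name is hypothetical, and the readings are assumed to have been collected into an array beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the most frequent difference between consecutive readings.
 * Quadratic in n, which is fine for a few thousand samples.
 * On a tie, the delta that appears first wins. */
uint64_t most_common_delta(const uint64_t *readings, size_t n)
{
    assert(readings != NULL && n >= 2);
    uint64_t best = readings[1] - readings[0];
    size_t best_count = 0;
    for (size_t i = 1; i < n; ++i) {
        uint64_t d = readings[i] - readings[i - 1];
        size_t count = 0;
        for (size_t j = 1; j < n; ++j) {
            if (readings[j] - readings[j - 1] == d)
                ++count;
        }
        if (count > best_count) {
            best_count = count;
            best = d;
        }
    }
    return best;
}
```

Using the mode rather than the mean keeps occasional outliers (for example an interrupt landing between two reads) from inflating the estimate.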

Using a serializing instruction may be a good idea. So I can just use two CPUID instructions to guard the measured code region, if I don't care about the performance degradation.

thanks.

And I've seen folks stick in mfence instructions sometimes if they are worried about exact counts of memory loads/stores.

Note that in ring 3 (user land) you may also get interrupts right in the middle of your code, which may mess up your counts.

Hi All, 

I tried to obtain hardware performance counter data on Windows (Windows 7, 32-bit, x86, Intel Xeon processor) using a kernel-mode driver, but I was not successful.
In my custom driver, I wrote the following assembly code to read the counter data:

NTSTATUS DriverEntry (...){

    __asm {
        mov ecx, 0x309;    // fixed IA32_PERF_FIXED_CTR0 -- Inst_Retired.Any
        rdmsr;
        mov lowvalue, eax;
        mov highvalue, edx;
    }

    DbgPrint("MSR output: %x \t %x \r\n", lowvalue, highvalue);

}

Could you please help me find out whether I am making a mistake here, and let me know how I can get this counter data? I also replaced the "rdmsr" instruction with "rdpmc", but that was not successful either.

Thanks in advance for the help. 

It would help to have some idea of what you mean by "not successful"....

The fixed-function performance counter accessed via MSR 0x309 has to be enabled by setting (1) bit 32 of the IA32_PERF_GLOBAL_CTRL MSR (0x38F), and (2) bit 0 (and also bit 1 if you want to count in user space as well as kernel space) of the IA32_FIXED_CTR_CTRL MSR (0x38D).

You also need to be sure that you are reading the "before" and "after" values on the same core.   I don't know how that is done in Windows, but in Linux device driver (kernel) code this is usually done by setting up an inter-processor interrupt targeting the desired core so that it will be the one reading the MSR.    It is probably also possible to pin the kernel thread to the desired core for the duration of the test (?).   Pinning the thread to a single core is also required for user-space code that uses the RDPMC instruction.
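
A minimal sketch of the two enable values described above, computed in plain C. The macro and function names are hypothetical; the layout assumed here is one 4-bit field per fixed counter in IA32_FIXED_CTR_CTRL and bit 32 + n in IA32_PERF_GLOBAL_CTRL for fixed counter n.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_EN_OS   UINT64_C(0x1) /* count while in ring 0 */
#define FIXED_EN_USR  UINT64_C(0x2) /* count while in rings above 0 */
#define FIXED_EN_PMI  UINT64_C(0x8) /* raise an interrupt on overflow */

/* Value for IA32_FIXED_CTR_CTRL (0x38D) covering one fixed counter. */
uint64_t fixed_ctr_ctrl_field(unsigned counter, uint64_t flags)
{
    assert(counter < 3 && (flags & ~UINT64_C(0xF)) == 0);
    return flags << (4 * counter);
}

/* Value for IA32_PERF_GLOBAL_CTRL (0x38F) enabling one fixed counter. */
uint64_t global_ctrl_fixed_enable(unsigned counter)
{
    assert(counter < 3);
    return UINT64_C(1) << (32 + counter);
}
```

For the counter at MSR 0x309 (fixed counter 0) counting only kernel code, this gives 0x1 for IA32_FIXED_CTR_CTRL and 0x100000000 for IA32_PERF_GLOBAL_CTRL.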

"Dr. Bandwidth"

Thank you for the kind reply. 

I followed as said above, but I got another error.

    __asm {    
        mov eax, 0x80000000       // bit 32 is set
        xor edx, edx                       // edx = 0 
        mov ecx, 0x38F                 //   IA32_PERF_GLOBAL_CTRL   msr
        wrmsr
        
        mov eax, 0x00000001       // only bit 0 is set, as we count in kernel space            
        xor edx, edx                       
        mov ecx, 0x38D                     // IA32_FIXED_CTR_CTRL  msr
        wrmsr
            
        mov ecx, 0x309
        rdmsr
        mov lowvalue, eax
        mov highvalue, edx
    }    

I got the following error with the fault at wrmsr.

  ExceptionCode: c0000005 (Access violation)

 FAULTING_SOURCE_CODE:
    48:     __asm {
    49:         mov eax, 0x80000000
    50:         xor edx, edx
    51:         mov ecx, 0x38F
>   52:         wrmsr
    53:
    54:         mov eax, 0x00000001
    55:         xor edx, edx
    56:         mov ecx, 0x38D
    57:         wrmsr

Your code is trying to set bit 31, not bit 32.  (Bit addresses start with zero, not one).   For wrmsr, bit 32 of the MSR is bit 0 of EDX, so EAX should be 0 and EDX should be 1.   From the discussion in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual, only bits 0, 1, 2 and 32, 33, 34 are writable, the rest are reserved and the hardware does actually track which bits are writable.
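
A minimal sketch of the bit 31 versus bit 32 distinction, splitting a 64-bit MSR value into the EAX and EDX halves that wrmsr expects. The names are hypothetical, and the writable mask simply encodes the bits listed above, so it may differ on other processors.

```c
#include <assert.h>
#include <stdint.h>

/* Writable bits of IA32_PERF_GLOBAL_CTRL as listed above: 0-2 and 32-34. */
#define GLOBAL_CTRL_WRITABLE ((UINT64_C(7) << 32) | UINT64_C(7))
static_assert(GLOBAL_CTRL_WRITABLE == UINT64_C(0x700000007),
              "mask should cover bits 0-2 and 32-34");

uint32_t msr_low(uint64_t v)  { return (uint32_t)v; }         /* goes in EAX */
uint32_t msr_high(uint64_t v) { return (uint32_t)(v >> 32); } /* goes in EDX */

/* Nonzero if the value sets only bits from the writable mask. */
int global_ctrl_value_ok(uint64_t v)
{
    return (v & ~GLOBAL_CTRL_WRITABLE) == 0;
}
```

Here 0x80000000 (bit 31) fails the check, while 1 << 32 passes and splits into EDX = 1, EAX = 0.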

"Dr. Bandwidth"

 

Dear John McCalpin, 

Thank you for the reply. I have two queries. 

1) I tried to configure a general-purpose performance counter to generate a PMI after every N retired instructions.

For N=1001, I write -1000 to MSR 0xc1. After 1001 retired instructions, the PMI should occur due to overflow.
However, my implementation shows that the PMI is generated only once. Could you please let me know whether my configuration is correct or whether I am missing something? Do I have to write -1000 to MSR 0xc1 again while handling the interrupt?

Following is my setup. 

__asm {                    
    
    //IA32_PERF_GLOBAL_OVF_CTRL MSR
        xor edx, edx                      
        mov eax, 0x00000001                 
        mov ecx, 0x390                     
        wrmsr     
    
    //IA32_PERF_GLOBAL_CTRL MSR address    0x38F      
        xor edx, edx                      
        mov eax, 0x00000001    
        mov ecx, 0x38F                      
        wrmsr
    
    //set -1000 as a overflow counter    
        mov eax, -1000
        mov ecx, 0xc1
        wrmsr

        xor edx, edx                 
        mov eax, 0x005100C0   
        mov ecx, 0x186                    
        wrmsr
        
        }

2) How can I configure a fixed-function performance counter to generate a PMI after N retired instructions?

Thank you in advance.

I have never worked on the interrupt handlers for PMIs, but I do believe that they normally reset the counter to (MAXVAL - trip_count) before returning to the user code.   Otherwise you would have to wait the full 2^48 increments before the next overflow.  For retired instructions this will probably take longer than you are interested in waiting.....  

Many Intel processors have limitations in what you are allowed to write to the programmable counter MSRs, so newer processors provide a "full-width" alias for each of these.   E.g. for Counter 0, the counter that you normally read is MSR 0xC1 (IA32_PMC0), but if you want to write more than the lower 32 bits you need to write to MSR 0x4C1 (IA32_A_PMC0).   This is described in Section 18.2.5 of Volume 3 of the SW Developer's Guide.

To use one of the fixed function performance counters the procedure is almost identical.  You write (MAXVAL - trip_count) to the IA32_FIXED_CTR0 counter MSR and set up the IA32_FIXED_CTR_CTRL MSR to enable overflows on that counter.
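
A minimal sketch of the reload value a PMI handler would write back, assuming the overflow is signaled when the counter wraps from all-ones to zero, so that a start value of 2^width - trip_count overflows after trip_count increments. The function names are hypothetical and the 48-bit width in the example is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Start value for a `width`-bit counter (for a full-width write). */
uint64_t pmc_reload_value(uint64_t trip_count, unsigned width)
{
    assert(width >= 1 && width < 64);
    uint64_t modulus = UINT64_C(1) << width;
    assert(trip_count >= 1 && trip_count < modulus);
    return modulus - trip_count;
}

/* EAX value for a legacy 32-bit write, where bit 31 is sign-extended
 * into the upper counter bits; this needs trip_count <= 2^31. */
uint32_t pmc_legacy_low32(uint32_t trip_count)
{
    assert(trip_count >= 1 && trip_count <= UINT32_C(0x80000000));
    return (uint32_t)(0u - trip_count);
}
```

For a trip count of 1000 this gives 0xFFFFFFFFFC18 as the full 48-bit value, and 0xFFFFFC18 as the EAX value of a legacy write, which matches the "mov eax, -1000" in the setup code above.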

"Dr. Bandwidth"

Hello Patrick,

I would really appreciate it if you could provide sample code showing how to serialize the MSR instructions (RDMSR and WRMSR). I found an article, "How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures", in which the author explains how to use CPUID, RDTSC, and RDTSCP to read cycle counts more accurately for long loops. Can this be applied to MSR accesses? The mfence and lfence instructions have also been used to reduce disturbances or noise. I am not sure which approach gives the most accurate results.

 

 

As I described at https://software.intel.com/en-us/forums/software-tuning-performance-opti..., "serializing" is a complex subject, and requires that you state your requirements with a great deal of precision.   In many cases the type of serialization that you think you want is simply not possible without huge (many hundreds of cycles) overheads.

Even when the definitions look fairly precise, further investigation often shows that there are cases that are not covered.  For example the RDTSCP instruction is not allowed to "execute" until all prior instructions in program order have "executed".  One problem is that "execution" is not an instantaneous event.  Every instruction is pipelined to some degree, and instructions that access memory are in the "executing" state for anywhere between ~4 cycles and >1000 cycles.  A more precise definition would require distinguishing between the times that an instruction "begins execution" and "completes execution".   Even this may not be precise enough, since the results of "completing execution" become visible at different times to different functional units, depending on register bypass and/or cache bypass implementations.   Does the definition say that the RDTSCP instruction cannot "begin" execution (which takes ~36 cycles) until all prior instructions have "completed" execution?  Or does the definition say that the RDTSCP instruction cannot begin execution until such a point that the TSC value returned is guaranteed to correspond to a time no earlier than the latest cycle in which any prior instruction "completed execution"?   The definition of "execution" and the concepts of "before" and "after" become even fuzzier when you consider that the RDTSCP instruction is microcoded, executing about 22 uops, with a minimum repeat latency of 30 cycles or longer.

Fortunately for your current use case, the WRMSR instruction is listed as a serializing instruction in Section 8.3 of Volume 3 of the Intel Architectures Software Developer's Manual.  RDMSR is not a serializing instruction, but that may or may not matter, depending on what your specific requirements are.   I should add that, like CPUID, WRMSR is a very slow instruction.   I don't have recent timings, but if I recall correctly this was taking 100-200 cycles on a Xeon E3 (Sandy Bridge) processor.  Because the MSR interface is an abstraction to a communication network that spans the entire chip, it seems likely that the latency of MSR reads and writes will vary depending on the core making the request and the physical location of the register being accessed. For MSRs with thread scope or core scope the RDMSR and WRMSR instructions can only access the *local* copies.  If you want to read/write MSRs associated with a different logical processor, you will need to set up an interprocessor interrupt, which probably has a cost of a few thousand cycles.

"Dr. Bandwidth"
