performance counters interrupt and virtualization

I'm trying to write an extension to KVM that stops execution after a fixed number of branch instructions (for example 1000).
I've programmed PERFEVTSEL0, set PMC0 (MSR 0xC1) to -1000, and written an ISR for the PMI.
The hardware raises an interrupt, which causes a vmexit, but when I read the PMC0 register the value is more than 0. Why is that?
Are the performance counters not precise?


Hello Yoav,

The counter will continue to increment until it is stopped. There are probably multiple branches taken in order to service the ISR. You might be able to reduce the extra counting by only counting ring 3 events (as opposed to ring 0 events) but this may not be what you want to do.

It seems like some chips may have a 'freeze counters on overflow' bit you can set (but it has been a while since I last read a 'how to program the counters' document). It also seems like setting the value so low (1000) runs the risk of triggering your ISR a lot.


The interrupt is supposed to cause a vmexit and stop the counter, but it doesn't, and the counter has a positive value. Am I missing something?

I don't know if you are missing anything. There isn't enough detail. Programming the counters is hard enough without adding the complexity of a virtual machine.


What details are missing?

How many branches are in the ISR/vmexit code path before you get to whatever it is that is supposed to stop the counter from counting and what sort of 'greater than 0' counts are you getting?

Usually, if no virtual machines are involved, you have to explicitly stop the counter. I don't know anything about what happens when VMs are involved.

There are a few branches, but the vmexit is supposed to switch off the performance counters (HOST_IA32_PERF_GLOBAL_CTRL is 0). After the vmexit I read the PMC0 counter, and the value is sometimes more than 0, which means that I missed by a couple of branches.
I'll try to rephrase: are the performance counters for branches accurate? Can it be that the IRQ is raised not at the time of the counter overflow but a few branches later?


Yes, the branch event is accurate (as far as I know). But probably the vmexit has branches and interrupts are branches. The vmexit seems to be a complex (more than one instruction) procedure. I'm guessing that the vmexit process has conditional branches in it.

Obviously I don't know much about the vmexit. Do you know if this guess is correct for the vmexit?

That is not the case; I don't think it's the problem. Let's say I want to implement a branch stopper, e.g. stop after 1000 branches and gather statistics. How would I go about implementing this without virtualization?

2. set MSR_P6_PERFCTR0 (msr 0xc1)  = -1000
3. set IA32_PERFEVTSEL0:
3.1. evt_sel = 0xc4
3.2. umask = 0x00
3.3. usr = 1
3.4. int = 1
3.5. en = 1

am i missing anything?

Looks good. It should work. If you have a test 'looping' kernel that does 1000 loops, then your ISR should get invoked. Any interrupts that happen during the test will increment the count, and the long jump to the ISR will be another increment.

What chip are you using? On sandybridge, I don't see a 0xc4 event that uses a umask=0.

I'm using a Haswell chip, but when reading the PMC0 value I am still getting a value larger than 0. Do you have a code sample that uses the performance counter interrupt feature?

No, I don't know of a sample ISR driver. You might also collect BR_INST_RETIRED.FAR_BRANCH (0xc4, umask=0x40) which will count interrupts (IIRC) and see if BR_INST_RETIRED.FAR_BRANCH is equal to the over count.

It doesn't account for the bad count. Could it be a HW bug?

(1) Using Umask 0x00 with Event 0xC4 seems to be inviting trouble for two reasons: First, it is an "architectural" event and "architectural" events are often less tightly specified than the machine-specific events.  Second, it counts all branch instructions, which may include control transfers that you are not thinking of counting.   In particular, it may count control transfers that are necessary to get to the code that stops the counter from continuing to count.

(2) You never said how many "extra" counts you are seeing.  Is it 1? 10? 100?     Performance counters are sometimes exact, but Chapter 19 of Volume 3 of the SW developer's guide starts with the warning:

  • The counter values reported by the performance-monitoring events are approximate and believed to be useful as relative guides for tuning software.

(3) Using inline RDPMC instructions I have seen that the related event BR_INST_EXEC.TAKEN_CONDITIONAL (Event 0x88, Umask 0x81) is exact on my Xeon E5-2680 (Sandy Bridge) systems -- see the comments in another forum thread at  

If you want to know if the counter is correct, the only way to avoid extraneous code is to put inline RDPMC instructions right where you want them.  Even that is not guaranteed in all cases because of ordering issues, but the cores tend to execute in FIFO order so it is usually correct.  There are sneaky tricks for enforcing ordering of RDPMC instructions using false dependencies, but you have to avoid the register-zeroing idioms that the hardware recognizes.  That is too long a topic for today.


"Dr. Bandwidth"

Hi John,
I've changed the event to ROB_MISC_EVENT_LBR_INSERTS, which is machine-specific for Haswell, and activated the LBR to select the specific branches that I want. But I'm still getting extra events (up to 10 more).

Ten extra branches seems like a lot, but in a virtualized environment the number of extra layers of software could include this many.

It would be interesting to compare this against the counts in a non-virtualized environment.  I don't know if any tools enable this to be done directly, but it should be relatively easy to hack the kernel code that processes the performance monitor interrupt (__perf_event_overflow in kernel/events/core.c, if I am reading the code correctly) to get it to read the current value of the counters before it does anything else.   For a one-time test you could just add a kernel debug print of the values obtained to see whether the counter has incremented above zero in this (presumably much shorter) code path.

Of course one would really prefer some form of user-mode interrupt support to avoid the kernel crossing entirely (since all the kernel is doing is sending the data back to the user run-time for processing), but that is a much larger topic.

"Dr. Bandwidth"

Hi  Dr. John McCalpin,

I configured the architectural performance monitoring counter INTEL_MSR_PERFMON_CRT1.

2. set INTEL_MSR_PERFMON_CRT1   = -20000
3. set IA32_PERFEVTSEL1:
3.1. evt_sel = 0xc0
3.2. umask = 0x00
3.3. usr = 1
3.4. int = 1
3.5. en = 1

My program is running in kernel mode but I am counting user program instructions.

I set the NMI vector in local APIC to trigger an exception when the counter overflows.

The exception is not taken into account by the processor exactly when the overflow occurs. There is a variable excess at each run, from 10 to 50 instructions.

How can I configure the LAPIC NMI to trigger the NMI sooner (when the overflow occurs), without any delay?

Thank you for your attention


I don't use interrupt-based performance monitoring very often, and have never written this sort of code myself, so this is beyond my expertise.

Some processors allow "freezing" the performance counters on PMIs, which should eliminate the extra counts you are seeing.  This is discussed in Section 17.4.7  of Volume 3 of the SW Developer's Manual (document 325384-058). 

I don't know if it is possible to make the interrupt happen "sooner" (which is a difficult concept in an out-of-order processor), but the PEBS subsystem was designed to ensure that the processor state is captured at the time of sampled events.  This is discussed in Chapters 17, 18, and 19 of Volume 3 of the Software Developer's Manual. 

Linux kernels have included PEBS support for a while.  It is not easy to understand, but combining the documentation in Volume 3 with the examples in the various Linux kernels should provide some insight into how to make PEBS work.  

Several recent processor generations have support for a version of the INST_RETIRED event that has hardware support to reduce the "PEBS shadow in IP distribution" (Event 0xC0, Umask 0x01), but this feature only appears to be available for the retired instructions event, and not for the branch events that you are interested in.

I don't know if any of these are directly applicable to your problem, but at least they seem to be related topics!


"Dr. Bandwidth"
