Core i7/Xeon 5500 (Nehalem) Performance Counters

I'm trying to understand how to program the performance counters, when HyperThreading (HTT) is enabled, to be able to separate the counts by Logical Processor (LP).

Back in the P4 days, if I wanted LP-specific counts, I programmed 2 counters to count the same event: I set the Active Thread field in each CCCR to 11b (any thread), set the T0_OS/USR bits in one ESCR (to count on LP0 only), and set the T1_OS/USR bits in the other ESCR (to count on LP1 only).

Reading the Nehalem docs it's a bit confusing how to do the above. The only thread control available is the ANY bit (bit 21) in the PerfEvtSel registers. According to the manual (Software Developer's Manual, Vol. 3B, page 18-54):

When set to 1, it enables counting the associated event conditions (including matching the thread's CPL with the
OS/USR setting of IA32_PERFEVTSELx) occurring across all logical processors sharing a processor core. When bit
21 is 0, the counter only increments the associated event conditions (including matching the thread's CPL with the
OS/USR setting of IA32_PERFEVTSELx) occurring in the logical processor which programmed the
IA32_PERFEVTSELx MSR.

Does that mean that
- The hardware "remembers" which LP programmed the counter?
(and the meaning of PerfEvtSelX.ANY=0 is really "this" LP only)
- To program a counter to count events for a specific LP the code that programs the counter must run on that LP?

For example, if I want to count INST_RETIRED.ANY_P by LP, do I have to program a different counter on each LP, while running on that LP, and set PerfEvtSelX.ANY=0 for each counter?

Setting PerfEvtSelX.ANY=1 would not be what I want because that would give the aggregate instructions retired count for *both* LPs (the core). The only other choice is to set PerfEvtSelX.ANY=0, but that doesn't offer the choice of which LP will/should increment the counter.

The same seems to be true for the Fixed-Function counters. Setting the ANY bit(s) in IA32_FIXED_CTR_CTRL_MSR has the same effect as the ANY bit in the PerfEvtSelX registers, which makes sense.

Any clarifications would be appreciated.

Enio.
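
For reference, the control bits discussed in this post can be sketched in a few lines of Python. This is only an illustration of the architectural bit layout (USR bit 16, OS bit 17, ANY bit 21, EN bit 22 in IA32_PERFEVTSELx, and one 4-bit field per counter in IA32_FIXED_CTR_CTRL). The helper names `perfevtsel` and `fixed_ctr_ctrl_field` are hypothetical, and the event code used (C0H/00H, the architectural "Instructions Retired" encoding) is an assumption that should be checked against the event tables for the target processor.

```python
# Bit positions in IA32_PERFEVTSELx (architectural layout).
USR = 1 << 16   # count while CPL > 0
OS = 1 << 17    # count while CPL == 0
ANY = 1 << 21   # AnyThread: count for all LPs sharing the core
EN = 1 << 22    # enable the counter


def perfevtsel(event, umask, usr=True, os_=True, any_thread=False):
    """Build an IA32_PERFEVTSELx value (hypothetical helper, not an Intel API)."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask are 8-bit fields")
    value = event | (umask << 8) | EN
    if usr:
        value |= USR
    if os_:
        value |= OS
    if any_thread:
        value |= ANY
    return value


def fixed_ctr_ctrl_field(index, usr=True, os_=True, any_thread=False):
    """Return the 4-bit control field for fixed counter `index`,
    already shifted to its position in IA32_FIXED_CTR_CTRL."""
    if index not in (0, 1, 2):
        raise ValueError("this sketch assumes 3 fixed-function counters")
    # Bit 0: count in ring 0, bit 1: count in rings > 0, bit 2: AnyThread.
    field = (1 if os_ else 0) | (2 if usr else 0) | (4 if any_thread else 0)
    return field << (4 * index)


# Assumed encoding: architectural "Instructions Retired" (event C0H, umask 00H).
INST_RETIRED = (0xC0, 0x00)

per_lp = perfevtsel(*INST_RETIRED)                     # ANY=0: this LP only
per_core = perfevtsel(*INST_RETIRED, any_thread=True)  # ANY=1: whole core
```

With ANY clear the value is 0x4300C0; setting ANY adds bit 21 and gives 0x6300C0. The value by itself says nothing about which LP counts; per the manual text quoted above, that depends on which logical processor's MSR is written.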

To obtain end-user support for Intel hardware products, please visit:
http://supportmail.intel.com/scripts-emf/welcome.aspx

-Gina

Hi
One thing you might not have noticed is that,
In Nehalem, the counter and configuration MSR interfaces that software uses are per-thread, whereas in the NetBurst PMU, the counter and configuration MSR interfaces are shared between the two logical processors in the same core.
So, with Nehalem, the T0xxx and T1xxx bits are no longer needed because the PMC and PERFEVTSELx MSRs are no longer shared.
If you keep the ANY bit zero, by default you get event counts that are separate for each logical processor.

Quoting - Shih Kuo (Intel)

Hi
One thing you might not have noticed is that,
In Nehalem, the counter and configuration MSR interfaces that software uses are per-thread, whereas in the NetBurst PMU, the counter and configuration MSR interfaces are shared between the two logical processors in the same core.
So, with Nehalem, the T0xxx and T1xxx bits are no longer needed because the PMC and PERFEVTSELx MSRs are no longer shared.
If you keep the ANY bit zero, by default you get event counts that are separate for each logical processor.

First, thanks for your answers. Of course, that leads to more questions ...

Sometimes, such as when describing the ANY bit, the docs seem to imply that the hardware somehow knows which LP writes to the PERFEVTSELx registers. However, the very first paragraph in Vol. 3B, section 18.17 says the Nehalem processor family supports 4 PMCs and 3 fixed-function counters in the processor core.
I couldn't find any explicit mention in the docs about the fact that the PMCs and fixed-function counters (FFCs) are per LP. Maybe it's there somewhere and I just didn't see it.

I take "in the processor core" to mean just that. The processor core, in the case of HyperThreading, is comprised of 2 LPs. Thus I assume that the 4 PMCs and 3 FFCs are shared by all LPs in the processor core. But you're saying that is not the case and that there are really 8 PMCs and 6 FFCs per core. Not what the docs say at all.
If the PMCs, FFCs and PERFEVTSELx registers are really per-thread/LP then that's great. It makes life a lot simpler.

There must be some amount of sharing, however, otherwise there would not be a need for the ANY bit.
If the counters are really not shared then can I effectively count 8 events per core by programming 8 different events (4 in each LP) and setting the ANY bit in all PERFEVTSELx registers to 1?

I don't have a machine to play with the counters at the moment. I'm updating my code to support Nehalem and would like to get it right.

Enio.

I think you were trying to extrapolate a generic statement to pin down details of the hardware implementation. That's not really germane to how to use the programming interface of the PMUs. Table B-5 has a "scope" column that lists whether each MSR interface is shared or not shared. There are many innovative ways of implementing hardware to support the scope of the programming interface listed in that column. It's outside the scope of the SDM to describe how different hardware implements the logic behind a particular shared or non-shared MSR interface.
Hope this helps

Quoting - Shih Kuo (Intel)

I think you were trying to extrapolate a generic statement to pin down details of the hardware implementation. That's not really germane to how to use the programming interface of the PMUs. Table B-5 has a "scope" column that lists whether each MSR interface is shared or not shared. There are many innovative ways of implementing hardware to support the scope of the programming interface listed in that column. It's outside the scope of the SDM to describe how different hardware implements the logic behind a particular shared or non-shared MSR interface.
Hope this helps

I don't want to know how the hardware implements the PMUs. I want to understand how to program the controls to get the counts I need. The basic question was:

Are the PMCs shared by all LPs in the core or are they per LP?

Your answer was that "with Nehalem, the T0xxx and T1xxx bits are no longer needed because the PMC and PERFEVTSELx MSRs are no longer shared."

If that is the case and, in fact, the PMCs and PERFEVTSEL registers aren't shared, then I think it was a valid follow-up question to ask whether there were really 8 PMCs/PERFEVTSEL registers per core or not. Your answer didn't fit with what the manuals said.
Can I use PMC0 on LP0 and PMC0 also on LP1 counting different events? If the answer is Yes then great - the counters aren't shared. If the answer is No, and that's an OK answer, then the counters are shared. No big deal.
Either one is fine, I just wanted to know which it was because it makes a difference to what needs to be done to program the counters.

Enio.

Quoting - EnioPineda

Can I use PMC0 on LP0 and PMC0 also on LP1 counting different events?
Enio.

The answer is yes.
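
To make that answer concrete, here is a minimal sketch of programming PMC0 on two logical processors with different events, leaving ANY at 0 on both. It assumes a Linux system with the `msr` driver loaded (`/dev/cpu/<n>/msr`) and root privileges, that logical CPUs 0 and 1 are the LPs of interest (the real sibling pairs are listed under `/sys/devices/system/cpu/cpu<n>/topology/thread_siblings_list`), and that bit 0 of IA32_PERF_GLOBAL_CTRL is already set on both. The function names and the two event encodings (architectural Instructions Retired, C0H/00H, and Branch Instructions Retired, C4H/00H) are illustrative choices, not something taken from this thread.

```python
import os
import struct

IA32_PMC0 = 0xC1
IA32_PERFEVTSEL0 = 0x186

USR, OS, EN = 1 << 16, 1 << 17, 1 << 22   # ANY (bit 21) is left clear


def evtsel(event, umask):
    """PERFEVTSEL value counting user and kernel events on this LP only."""
    return event | (umask << 8) | USR | OS | EN


def plan_pmc0(events_by_cpu):
    """Return {cpu: [(msr, value), ...]} that programs PMC0 on each CPU."""
    plan = {}
    for cpu, (event, umask) in events_by_cpu.items():
        plan[cpu] = [
            (IA32_PERFEVTSEL0, 0),                     # stop the counter
            (IA32_PMC0, 0),                            # clear the count
            (IA32_PERFEVTSEL0, evtsel(event, umask)),  # select event, enable
        ]
    return plan


def wrmsr(cpu, msr, value):
    """Write a 64-bit MSR on `cpu`; the msr driver runs the write there."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), msr)
    finally:
        os.close(fd)


def rdmsr(cpu, msr):
    """Read a 64-bit MSR on `cpu`."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def apply_plan(plan):
    """Perform the planned writes (needs root and the msr driver)."""
    for cpu, writes in plan.items():
        for msr, value in writes:
            wrmsr(cpu, msr, value)


# Hypothetical assignment: LP0 counts instructions, LP1 counts branches.
plan = plan_pmc0({0: (0xC0, 0x00), 1: (0xC4, 0x00)})
```

Calling `apply_plan(plan)` and later `rdmsr(0, IA32_PMC0)` and `rdmsr(1, IA32_PMC0)` would then read the two counts separately. If the counters behave as described in this thread, the two reads return counts for different events even though both use PMC0.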
