Intel PCM: discrepancy between PCM and PAPI

Hi,

I was trying to compare the output of PCM vs PAPI for a software router tool (Click)

I find a non-trivial difference between the output of PAPI and PCM for simple metrics like total cycles and total L3 misses. Each tool is self-consistent (modulo some stochastic noise)

With PAPI: PAPI_TOT_CYC = 560548883, PAPI_L3_TCM = 993702
With PCM: getCycles = 1288193707, getL3CacheMisses = 746465

If I understand the semantics correctly, both are accessing the same
hardware counters and for the same workload the values should be in the
same ballpark, but this huge discrepancy is really puzzling

Are there known issues that result in different outputs with different performance tools?

Any suggestions?

Thanks!

There are different ways to measure L3 cache misses and they vary with architecture. Therefore, let's focus on the cycles first. Are you measuring cycles on 1 core or on the complete CPU? PCM reports the cycles including turbo mode (in contrast to "reference cycles"). Is PAPI doing the same?

What kind of CPU are you using and how many threads is your workload using?

Kind regards
Thomas

Hi,

getCycles returns the count for the CPU_CLK_UNHALTED.THREAD event (which also accounts for Turbo Boost, as Thomas mentioned). Which underlying hardware event does PAPI map to PAPI_TOT_CYC?

In PCM the mapping of getL3CacheMisses to a HW event depends on the processor/architecture type. What is your CPU model?

Are you measuring (aggregated) cycles for all cores, for a particular core, or for just your thread(s) when using PAPI?
PCM can measure cycles/cache misses for particular cores or sockets or the whole system. The scope of the measurement depends on what PCM state object you are using: CoreCounterState or SocketCounterState or SystemCounterState.

Best regards,
Roman
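
To illustrate why the measurement scope matters, here is a toy model in C++. It does not use the PCM API: CoreSample and the three helper functions are hypothetical names made up for this sketch. It only shows that a socket-wide or system-wide count is a sum over cores, so it can be much larger than the count for the single core a thread runs on.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-core sample for illustration; this is NOT a PCM type.
struct CoreSample {
    int socket;            // socket the core belongs to
    std::uint64_t cycles;  // unhalted cycles counted on that core
};

// Core scope: cycles of one core only.
std::uint64_t coreScopeCycles(const std::vector<CoreSample>& cores, std::size_t core) {
    return cores.at(core).cycles;
}

// Socket scope: sum over the cores that belong to one socket.
std::uint64_t socketScopeCycles(const std::vector<CoreSample>& cores, int socket) {
    std::uint64_t total = 0;
    for (const CoreSample& c : cores) {
        if (c.socket == socket) total += c.cycles;
    }
    return total;
}

// System scope: sum over every core in the machine.
std::uint64_t systemScopeCycles(const std::vector<CoreSample>& cores) {
    std::uint64_t total = 0;
    for (const CoreSample& c : cores) total += c.cycles;
    return total;
}
```

With one busy core and a few lightly loaded ones, the system-scope total exceeds the busy core's own count by whatever the other cores contributed during the same interval.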

Thomas and Roman,

Thanks for your inputs.

1. I am running a single-threaded program on an Intel Xeon CPU X5560 @ 2.80GHz (26)

2. I am running PCM in the default mode (I didn't set any of the Core/Socket or System settings)

3. PAPI seems to map PAPI_TOT_CYC to the same counter as well, as far as I can see:

Here's the output of

papi_avail -e PAPI_TOT_CYC
Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version : 4.1.3.0
Vendor string and code : GenuineIntel (1)
Model string and code : Intel Xeon CPU X5560 @ 2.80GHz (26)
CPU Revision : 5.000000
CPUID Info : Family: 6 Model: 26 Stepping: 5
CPU Megahertz : 2793.259033
CPU Clock Megahertz : 2793
Hdw Threads per core : 1
Cores per Socket : 4
NUMA Nodes : 1
CPU's per Node : 4
Total CPU's : 4
Number Hardware Counters : 16
Max Multiplex Counters : 512
--------------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.

Event name: PAPI_TOT_CYC
Event Code: 0x8000003b
Number of Native Events: 1
Short Description: |Total cycles|
Long Description: |Total cycles|
Developer's Notes: ||
Derived Type: |NOT_DERIVED|
Postfix Processing String: ||
Native Code[0]: 0x40000000 |UNHALTED_CORE_CYCLES|
Number of Register Values: 4
Register[ 0]: 0x0000003c |Event Code|
Register[ 1]: 0x0000003c |Event Code|
Register[ 2]: 0x0000003c |Event Code|
Register[ 3]: 0x0000003c |Event Code|
Native Event Description: |count core clock cycles whenever the clock signal on the specific core is running (not halted). Alias to event CPU_CLK_UNHALTED:THREAD

Hi VS,

a few more questions:

How do you run PCM: are you using the command line pcm.x and starting your program from pcm.x to measure and output the metrics? If you use the command line pcm.x interface, you can post the output here to shed more light. Or did you instrument your program/function calls using the PCM API (retrieving SystemCounterState objects and calling getCycles on these objects)?

Best regards,
Roman

Hello V S,
I have some questions about the results you are seeing.
The questions are kind of basic but need to be asked...

What workload are you running while you are taking the measurements?
The PAPI measurement (at 5.6e8 cycles) covers about 0.2 CPU seconds.
The PCM measurement (at 1.29e9 cycles) covers about 0.46 CPU seconds.
If you are measuring a basically idle system then the 'unhalted cycles' can vary quite a bit depending on what random process is running.
Or if the CPUs are halted, then the unhalted clockticks won't increment.
I would expect that, if you ran a workload that kept all the CPUs busy for, say, 10 seconds, then PAPI and PCM would agree within a percent or so.
Pat
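
The conversion from cycle counts to CPU seconds above can be sketched as a one-line helper. This is a minimal sketch; cyclesToSeconds is a hypothetical name, and it assumes the core ran at a constant 2.8 GHz (the nominal frequency of the X5560), which ignores Turbo Boost and frequency scaling.

```cpp
#include <cstdint>

// Convert an unhalted-cycle count into approximate busy CPU time.
// Assumption: the core ran at a constant frequency for the whole
// interval, which is only an approximation on real hardware.
double cyclesToSeconds(std::uint64_t cycles, double frequencyHz) {
    return static_cast<double>(cycles) / frequencyHz;
}
```

Calling cyclesToSeconds(560548883, 2.8e9) gives roughly 0.20 seconds for the PAPI count, and cyclesToSeconds(1288193707, 2.8e9) gives roughly 0.46 seconds for the PCM count.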

I instrumented the code to use the PCM API (not the cli tool)

Well, these are all valid questions!

1. This is the "Click" modular router. I am basically reading an offline packet trace and processing it with some modules within Click.

2. I don't think the CPUs are halted on disk reads. I also ran a similar workload where I load all packets into memory first, and the numbers are similar.

3. I also checked CPU utilization with atop etc., and it is usually close to 100% (so not really stalled while running).

Re: your point about random processes: I did several runs, and the numbers within each library are self-consistent and consistently different from the other library.

Hi VS,

Does Click/your workload include processing in a Linux kernel module? You can tell if you observe some "system time" in the top or vmstat utility.

Does PAPI account for clock cycles spent outside of your user thread (in the kernel)? Intel PCM accounts for every clock tick on the cores, no matter whether it was system (ring 0) or user (ring 3) CPU time.

--
Roman

Hi all,
I have a similar problem and came across this post from a Google search. The cycles reported by PCM and the cycles measured by custom rdtsc code don't match. I make sure that threads are affinitized to certain cores, I do the rdtsc measurements on the same thread, and the rdtsc and PCM calls are at the same lines.

I understand that SystemCounterState reports across all the sockets/cores, so I changed my code to use CoreCounterState. However, the problem persists: the cycle count reported by PCM is now the lower one. rdtsc reports about 12 billion cycles and PCM reports about 9 billion. I'm still curious why there is such a difference.

To give more information, the machine I'm working with is an Intel Xeon L5520 2.26GHz with 2 sockets; each CPU has 4 cores with HT enabled. I'm only using 8 threads.

Thanks in advance for the help. Btw, here is the code for rdtsc:
unsigned long lo, hi;
/* rdtsc returns the time stamp counter in EDX:EAX */
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
tick = (unsigned long long) hi << 32 | lo;

Hello cagribal,
The rdtsc instruction returns the time stamp counter.
The time stamp counter (on the L5520 Nehalem-based processor) continues counting when the CPU is halted.
The getCycles() routine uses the CPU_CLK_UNHALTED.THREAD event, which stops counting when the CPU is halted.
This is probably the difference that you are seeing.
Pat
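
As a rough sanity check on the numbers above, the ratio of the two counts can be computed directly. This is a minimal sketch with a hypothetical helper name, unhaltedFraction. Reading the ratio as "share of time the core was not halted" additionally assumes that the core clock ran at the same rate as the time stamp counter, which does not hold under Turbo Boost or frequency scaling.

```cpp
#include <cstdint>

// Ratio of unhalted core cycles to time stamp counter ticks measured
// over the same interval on the same core.
// Assumption: core clock and TSC tick at the same rate; otherwise the
// ratio mixes halted time with frequency changes.
double unhaltedFraction(std::uint64_t unhaltedCycles, std::uint64_t tscTicks) {
    if (tscTicks == 0) return 0.0;  // avoid division by zero
    return static_cast<double>(unhaltedCycles) / static_cast<double>(tscTicks);
}
```

For about 9 billion unhalted cycles against about 12 billion TSC ticks, the ratio is 0.75.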

Edit: It seems CPU_CLK_UNHALTED.THREAD is affected by CPU speed throttling, frequency changes and Turbo Boost. I disabled these features and finally got consistent cycle numbers. Another option is using CPU_CLK_UNHALTED.REF, which is not affected by these and is incremented based on a reference clock.

Hi,

The getCycles() function returns the CPU_CLK_UNHALTED.THREAD event count. It is the number of core clock cycles while the clock signal on a specific core is running (not halted).

The counter does not advance in the following conditions:
- an ACPI C-state other than C0 (normal operation)
- HLT
- STPCLK# pin is asserted
- being throttled by TM1
- during the frequency switching phase of a performance state transition

The getRefCycles() function returns the CPU_CLK_UNHALTED.REF event count, which is the number of reference clock cycles while the clock signal on the core is running. The reference clock operates at a fixed frequency, irrespective of core frequency changes due to performance state transitions. Note that the CPU_CLK_UNHALTED.THREAD count can exceed the CPU_CLK_UNHALTED.REF count if Turbo Boost kicks in.

Roman
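
The relationship between the two counts can be sketched as follows. The helper names relativeFrequency and averageActiveFrequencyHz are hypothetical, not PCM functions, and the sketch assumes the reference clock ticks at the nominal frequency you pass in, which depends on the processor.

```cpp
#include <cstdint>

// Ratio of core cycles (CPU_CLK_UNHALTED.THREAD) to reference cycles
// (CPU_CLK_UNHALTED.REF) counted over the same interval.
// A value above 1.0 suggests the core ran faster than the reference
// clock (e.g. Turbo Boost); below 1.0 suggests it ran slower.
double relativeFrequency(std::uint64_t threadCycles, std::uint64_t refCycles) {
    if (refCycles == 0) return 0.0;  // core was halted for the whole interval
    return static_cast<double>(threadCycles) / static_cast<double>(refCycles);
}

// Average core frequency while not halted.
// Assumption: the reference clock runs at nominalHz.
double averageActiveFrequencyHz(std::uint64_t threadCycles,
                                std::uint64_t refCycles,
                                double nominalHz) {
    return relativeFrequency(threadCycles, refCycles) * nominalHz;
}
```

Both counts stop while the core is halted, so this ratio reflects frequency changes only, not idle time.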

One can find documentation for the PCM methods in Doxygen format in the cpucounters.h header. HTML documentation can be easily generated from it (the included Doxygen project file is called "Doxyfile").
