Intel® Performance Counter Monitor - A better way to measure CPU utilization

July 25, 2010 1:00 AM PDT


Download Code Sample
IntelPerformanceCounterMonitorV1.7.zip

The complexity of computing systems has tremendously increased over the last decades. Hierarchical cache subsystems, non-uniform memory, simultaneous multithreading and out-of-order execution have a huge impact on the performance and compute capacity of modern processors.

Figure 1: “CPU Utilization” measures only the time a thread is scheduled on a core

Software that understands and dynamically adjusts to the resource utilization of modern processors has performance and power advantages. The Intel® Performance Counter Monitor provides sample C++ routines and utilities to estimate the internal resource utilization of the latest Intel® Xeon® and Core™ processors, so that software can use this information to gain a significant performance boost.

When the CPU utilization does not tell you the utilization of the CPU

The CPU utilization number obtained from the operating system (OS) is a metric that has been used for many purposes, such as product sizing, compute capacity planning and job scheduling. The current implementation of this metric (the number that the UNIX* “top” utility and the Windows* Task Manager report) shows the portion of time slots that the CPU scheduler in the OS could assign to the execution of running programs or the OS itself; the rest of the time is idle. For compute-bound workloads, the CPU utilization metric calculated this way predicted the remaining CPU capacity very well on the architectures of the 1980s, which had much more uniform and predictable performance than modern systems. Advances in computer architecture have made this metric unreliable: multi-core and multi-CPU systems, multi-level caches, non-uniform memory, simultaneous multithreading (SMT), pipelining, out-of-order execution and so on.

Figure 2: The complexity of a modern multi-processor, multi-core system

A prominent example is the non-linear CPU utilization on processors with Intel® Hyper-Threading Technology (Intel® HT Technology). Intel® HT Technology is a great performance feature that can boost performance by up to 30%. However, end users who are unaware of HT are easily confused by the reported CPU utilization: consider an application that runs a single thread on each physical core. The reported CPU utilization is then 50%, even though the application can already use 70% to 100% of the execution units. Details are explained in [1].
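The arithmetic behind the 50% figure can be sketched in a few lines. This is a minimal illustration and not part of the Intel PCM package: the function name `reportedUtilization` and the machine used in the usage note are assumptions made for this example. It models the OS metric simply as busy logical CPUs divided by all logical CPUs.

```cpp
#include <cassert>

// Hypothetical helper (not an Intel PCM routine): the share of logical CPUs
// that the OS scheduler sees as busy, which is what "CPU utilization" reports.
double reportedUtilization(int busyLogicalCpus, int totalLogicalCpus) {
    assert(totalLogicalCpus > 0);
    assert(busyLogicalCpus >= 0 && busyLogicalCpus <= totalLogicalCpus);
    return static_cast<double>(busyLogicalCpus) / totalLogicalCpus;
}
```

For an assumed machine with 4 physical cores and Intel HT Technology enabled (8 logical CPUs), an application with one thread per physical core gives `reportedUtilization(4, 8)`, which is 0.5. The OS reports 50% although every physical core is already executing a thread.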

A different example is the CPU utilization for “memory throughput”-intensive workloads on multi-core systems. The bandwidth test “stream” already saturates the capacity of the memory controller with fewer threads than there are cores available, so the reported CPU utilization stays below 100% although no memory bandwidth is left.
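A deliberately simplified model makes the saturation effect concrete. The function `achievedBandwidth` and all numbers in the usage note are assumptions for illustration only; real memory subsystems do not scale this linearly.

```cpp
#include <algorithm>
#include <cassert>

// Simplified, hypothetical model: every thread streams at a fixed rate until
// the memory controller's peak bandwidth is reached. Beyond that point,
// additional threads add no throughput.
double achievedBandwidth(int threads, double perThreadGBs, double peakGBs) {
    assert(threads >= 0 && perThreadGBs >= 0.0 && peakGBs >= 0.0);
    return std::min(threads * perThreadGBs, peakGBs);
}
```

With assumed values of 8 GB/s per thread and a 24 GB/s memory controller, three threads already reach the limit. On an 8-core system the OS would then report 3/8 = 37.5% CPU utilization, which suggests plenty of headroom where none exists for this kind of workload.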

Abstraction Level for Performance Monitoring Units

The good news is that Intel processors already provide the capability to monitor performance events inside processors. In order to obtain a more precise picture of CPU resource utilization we rely on the dynamic data obtained from the so-called performance monitoring units (PMU) implemented in Intel’s processors. We concentrate on the advanced feature set available in the current Intel Xeon 5500, 5600, 7500, E7 and Core i7 processor series [2-4].

We have implemented a basic set of routines with a high-level interface that are callable from a user's C++ application and provide various CPU performance metrics in real time. In contrast to other existing frameworks like PAPI* and Linux* “perf”, we support not only core but also uncore PMUs of Intel processors (including the recent Intel Xeon E7 processor series). The uncore is the part of the processor that contains the integrated memory controller and the Intel® QuickPath Interconnect to the other processors and the I/O hub. In total, the following metrics are supported:
  • Core: instructions retired, elapsed core clock ticks, core frequency including Intel® Turbo boost technology, L2 cache hits and misses, L3 cache misses and hits (including or excluding snoops).
  • Uncore: read bytes from memory controller(s), bytes written to memory controller(s), data traffic transferred by the Intel® QuickPath Interconnect links.
Intel® PCM version 1.5 (and later) also supports Intel® Atom™ processors, but counters such as memory and Intel® QPI bandwidth and L3 cache misses will always show 0 because the Intel® Atom™ processor has no L3 cache, no on-die memory controller and no Intel® QPI links.

Intel® PCM version 1.6 supports on-core performance metrics (such as instructions per clock cycle and L3 cache misses) of the 2nd generation Intel® Core™ processor family (Intel® microarchitecture code name Sandy Bridge) and adds experimental support for some earlier Intel® microarchitectures (e.g. Penryn), which can be enabled by defining PCM_TEST_FALLBACK_TO_ATOM in cpucounter.cpp.
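Several of the core metrics listed above are ratios of raw counter differences taken between two points in time. The following sketch shows that general idea under stated assumptions: `CounterSnapshot`, `instructionsPerCycle` and `l3HitRatio` are hypothetical names invented for this example and are not the data structures or routines of the package, and counter wrap-around is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of raw counters (not the library's own structure).
struct CounterSnapshot {
    std::uint64_t instructionsRetired;
    std::uint64_t coreCycles;
    std::uint64_t l3Hits;
    std::uint64_t l3Misses;
};

// Difference between two readings of a monotonically increasing counter.
// Assumes the counter did not wrap between the two readings.
std::uint64_t delta(std::uint64_t before, std::uint64_t after) {
    assert(after >= before);
    return after - before;
}

// Instructions retired per elapsed core clock tick in the measured interval.
double instructionsPerCycle(const CounterSnapshot& before, const CounterSnapshot& after) {
    const std::uint64_t cycles = delta(before.coreCycles, after.coreCycles);
    if (cycles == 0) return 0.0;  // empty interval: nothing to report
    return static_cast<double>(delta(before.instructionsRetired, after.instructionsRetired)) / cycles;
}

// Fraction of L3 accesses in the interval that were hits.
double l3HitRatio(const CounterSnapshot& before, const CounterSnapshot& after) {
    const std::uint64_t hits = delta(before.l3Hits, after.l3Hits);
    const std::uint64_t misses = delta(before.l3Misses, after.l3Misses);
    if (hits + misses == 0) return 0.0;  // no L3 accesses in the interval
    return static_cast<double>(hits) / (hits + misses);
}
```

Working on differences rather than absolute values is what allows a metric to be computed for any code section of interest instead of for the whole run.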

I want to see these counters!

As an additional goody, the package includes easy-to-use command line and graphical utilities that are based on these routines. They can be used out of the box by users who cannot or do not want to integrate the routines into their code but want to monitor and understand the CPU capacity limits in real time.

Figure 3 shows the screen shot of the command line utility on the Windows* platform. Whereas the Linux* version can rely on the MSR kernel module that is provided with the Linux kernel, no such facility is available on Windows. For Windows, a sample implementation of a Windows driver provides a similar interface.

Figure 3: Intel Performance Counter Monitor command line version

But there is more to come. For the Linux operating system, the package includes an adaptor that plugs into the KDE* utility ksysguard. Using this daemon, it is possible to graph the various metrics in real-time. Figure 4 shows a screen shot where some of the metrics are displayed during a workload run.

Figure 4: The KDE utility ksysguard on Linux can graph performance counters using a plug-in.

Since these utilities provide direct insight into the system, they can even be used to quickly find and understand fundamental performance bottlenecks in real time. (In contrast to the Intel® VTune™ Performance Analyzer, they will not, however, tell you which parts of the application are causing the performance issue.)

Since version 1.5, the Intel® Performance Counter Monitor package contains a Windows* service, based on Microsoft .Net* 2.0 or later, that creates performance counters that can be shown in the Perfmon program delivered with the Microsoft Windows* OS. Microsoft's perfmon is capable of showing many useful performance counters on the Windows* OS, such as disk activity, memory usage and CPU load. More information about perfmon for Windows* 7 and Windows* 2008/R2 is available from Microsoft (perfmon has been available for many releases of Windows). Please read the Windows_howto.rtf file on how to install and remove the service for Intel® PCM.

For all of the above-mentioned hardware counters on Nehalem- and Westmere-based platforms, a corresponding perfmon counter is created, so all features supported by perfmon, such as logging over time to a file or database, are also available for these counters. For Intel® Atom™ processors, the perfmon counters for memory and Intel® QPI bandwidth and L3 cache misses will always show 0 for the reasons mentioned above. In a future update of Intel® Performance Counter Monitor, the service will show only the available counters.

Figure 5: Windows* Perfmon showing data from Intel® Performance Counter Monitor





Intel® Performance Counter Monitor inside your programs

Thanks to the abstraction layer that the library provides, it has become very easy to monitor the processor metrics inside your application. Before their usage, the performance counters need to be initialized. Afterwards, the counter state can be captured before and after the code section of interest. Different routines capture the counters for cores, sockets, or the complete system, and store their state in corresponding data structures. Additional routines provide the possibility to compute the metric based on these states. The following code snippet shows an example for their usage:

    PCM * m = PCM::getInstance();
    if (m->program() != PCM::Success)   // program the counters
        return -1;                      // an error occurred during programming
    // capture the system-wide counter state before the code of interest
    SystemCounterState before_sstate = getSystemCounterState();
    // [run your code here]
    // capture the state again afterwards
    SystemCounterState after_sstate = getSystemCounterState();
    std::cout << "Instructions per clock: " << getIPC(before_sstate, after_sstate)
              << " L3 cache hit ratio: " << getL3CacheHitRatio(before_sstate, after_sstate)
              << " Bytes read: " << getBytesReadFromMC(before_sstate, after_sstate)
              // [and so on]…
              << std::endl;
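The capture-before, run, capture-after pattern of the snippet above can be wrapped in a small generic helper. This is only a sketch: `measureSection` is a hypothetical name that does not exist in the package, and the `capture` argument stands in for a routine such as `getSystemCounterState()`.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not part of Intel PCM): runs `work` and returns the
// counter state captured immediately before and immediately after it.
template <typename Capture, typename Work>
auto measureSection(Capture capture, Work work)
    -> std::pair<decltype(capture()), decltype(capture())> {
    auto before = capture();  // state before the code section of interest
    work();                   // the code section being measured
    auto after = capture();   // state after it has finished
    return {before, after};
}
```

The returned pair can then be passed to whichever metric routine is needed, for example `getIPC(states.first, states.second)`. Keeping the two captures in one place makes it harder to pair a "before" state with the wrong "after" state.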


“CPU resource”-aware scheduling

To assess the potential impact of having precise resource utilization, we have implemented a simple scheduler that executes 1000 compute-intensive and 1000 memory-bandwidth-intensive jobs in a single thread. The challenge was the existence of non-predictable background load on the system, a rather typical situation in modern multi-component systems with many third-party components. Figure 5 depicts a possible schedule for a scheduler that is unaware of the background activity.

Figure 5: Scheduler without Intel® Performance Counter Monitor

If the scheduler can detect (using the provided routines) that a lot of the memory bandwidth is currently used by a different process, it can adjust its schedule accordingly. Our simulations show that such a scheduler executes the 2000 jobs 16% faster than a generic unaware scheduler on the test system.
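The article does not spell out the scheduling policy, so the following is only one plausible sketch of the idea. The names `JobKind`, `JobQueues` and `pickNextJob` and the 0.5 threshold are assumptions for illustration; the background bandwidth fraction would in practice be derived from the uncore memory controller counters.

```cpp
#include <cassert>

enum class JobKind { ComputeBound, MemoryBound, None };

// Hypothetical bookkeeping: how many jobs of each kind are still waiting.
struct JobQueues {
    int computeJobsLeft;
    int memoryJobsLeft;
};

// Assumed policy: while other processes use a large share of the memory
// bandwidth, prefer compute-bound jobs; otherwise prefer memory-bound jobs.
// backgroundBandwidthFraction is the share of peak memory bandwidth (0..1)
// currently consumed by background load.
JobKind pickNextJob(const JobQueues& queues, double backgroundBandwidthFraction,
                    double busyThreshold = 0.5) {
    assert(backgroundBandwidthFraction >= 0.0 && backgroundBandwidthFraction <= 1.0);
    const bool memoryBusy = backgroundBandwidthFraction >= busyThreshold;
    if (memoryBusy && queues.computeJobsLeft > 0) return JobKind::ComputeBound;
    if (!memoryBusy && queues.memoryJobsLeft > 0) return JobKind::MemoryBound;
    // The preferred kind is exhausted: fall back to whatever is left.
    if (queues.computeJobsLeft > 0) return JobKind::ComputeBound;
    if (queues.memoryJobsLeft > 0) return JobKind::MemoryBound;
    return JobKind::None;
}
```

A scheduler loop would sample the counters, call a function like this before each job and decrement the matching queue. The fallback branch keeps the single worker thread busy even when only the less suitable kind of job remains.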

Figure 6: Scheduler using Intel® Performance Counter Monitor

Changelog

Version 1.0
- Initial release

Version 1.5
- Integration into Windows* perfmon
- Intel® Atom™ support

Version 1.6
- Intel Xeon E7 series support (Intel microarchitecture code name Westmere-EX)
- On-core performance metrics of 2nd generation Intel® Core™ processor family (Intel® microarchitecture code name Sandy Bridge)
- Highly experimental support of some earlier Intel® microarchitectures (e.g. Penryn). Enable by defining PCM_TEST_FALLBACK_TO_ATOM in cpucounter.cpp
- Enhanced Linux KDE ksysguard plugin
- New options for the command line pcm utility
- Support of >64 cores on Windows 7 and Windows Server 2008 R2
- Support of Performance Monitoring Unit Sharing Guideline to prevent collisions with other processor performance monitoring agents (e.g. Intel® VTune™ Performance Analyzer)

Version 1.7
- Intel PCM is distributed under new BSD license
- Support additional processor models with Intel® microarchitecture code name Nehalem
- New metrics: timestamps via RDTSCP instruction, C0 active core residency and a few other derived metrics
- Extended custom core configuration facility/mode
- Bug fixes

For questions and comments about Intel PCM and its use-cases, we recommend the Software Tuning, Performance Optimization & Platform Monitoring forum.

[1] Drysdale, Gillespie, Valles, “Performance Insights to Intel® Hyper-Threading Technology”
[2] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2
[3] Intel® Xeon® Processor 7500 Series Uncore Programming Guide
[4] Peggy Irelan and Shihjong Kuo, “Performance Monitoring Unit Sharing Guide”
[5] David Levinthal, “Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors”

Intel, Xeon, Core, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. The software license text is included into the code sample.

Intel® Turbo Boost Technology requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost

Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

This software is subject to the U.S. Export Administration Regulations and other U.S. law, and may not be exported or re-exported to certain countries (Burma, Cuba, Iran, North Korea, Sudan, and Syria) or to persons or entities prohibited from receiving U.S. exports (including Denied Parties, Specially Designated Nationals, and entities on the Bureau of Export Administration Entity List or involved with missile technology or nuclear, chemical or biological weapons).