Intel® Performance Counter Monitor (Intel® PCM) is discontinued. Instead, we will contribute updates and new features to the fork Processor Counter Monitor on GitHub.
Short URL for this page: www.intel.com/software/pcm
Roman Dementiev, Thomas Willhalm, Otto Bruggeman, Patrick Fay, Patrick Ungerer, Austen Ott, Patrick Lu, James Harris, Phil Kerly, Patrick Konsor, Andrey Semin, Michael Kanaly, Ryan Brazones, Rahul Shah, Jacob Dobkins
The complexity of computing systems has tremendously increased over the last decades. Hierarchical cache subsystems, non-uniform memory, simultaneous multithreading and out-of-order execution have a huge impact on the performance and compute capacity of modern processors.
Software that understands and dynamically adjusts to the resource utilization of modern processors has performance and power advantages. The Intel® Performance Counter Monitor provides sample C++ routines and utilities to estimate the internal resource utilization of the latest Intel® Xeon® and Core™ processors and gain a significant performance boost.
The CPU utilization number obtained from the operating system (OS) is a metric that has been used for many purposes, such as product sizing, compute capacity planning, and job scheduling. The current implementation of this metric (the number that the UNIX* "top" utility and the Windows* task manager report) shows the portion of time slots that the CPU scheduler in the OS could assign to the execution of running programs or the OS itself; the rest of the time is idle. For compute-bound workloads, the CPU utilization metric calculated this way predicted the remaining CPU capacity very well on the architectures of the 1980s, which had much more uniform and predictable performance than modern systems. Advances in computer architecture have made this metric unreliable: multi-core and multi-CPU systems, multi-level caches, non-uniform memory, simultaneous multithreading (SMT), pipelining, out-of-order execution, and so on all break the link between scheduled time and consumed capacity.
A prominent example is the non-linear CPU utilization on processors with Intel® Hyper-Threading Technology (Intel® HT Technology). Intel® HT Technology is a great performance feature that can boost performance by up to 30%. However, HT-unaware end users are easily confused by the reported CPU utilization. Consider an application that runs a single thread on each physical core: the reported CPU utilization is 50%, even though the application can use up to 70%-100% of the execution units. Details are explained in [1].
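The arithmetic behind the 50% figure can be sketched as follows. This is only an illustration and not part of Intel PCM: the function `reportedUtilization` and its parameters are hypothetical, and the sketch assumes that the OS reports utilization as the share of logical CPUs that have a running thread.

```cpp
#include <cassert>

// Hypothetical helper (not part of Intel PCM): OS-style CPU utilization,
// in percent, computed as busy logical CPUs divided by all logical CPUs.
double reportedUtilization(int busyLogicalCpus, int physicalCores, int threadsPerCore) {
    assert(physicalCores > 0 && threadsPerCore > 0);
    const int logicalCpus = physicalCores * threadsPerCore;
    assert(busyLogicalCpus >= 0 && busyLogicalCpus <= logicalCpus);
    return 100.0 * busyLogicalCpus / logicalCpus;
}
```

For a system with four physical cores and two hardware threads per core, one busy thread per core gives `reportedUtilization(4, 4, 2)`, which is 50, although most of the execution resources of every core may already be in use.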
A different example is the CPU utilization for "memory throughput"-intensive workloads on multi-core systems. The bandwidth test "stream" already saturates the capacity of memory controller with fewer threads than there are cores available.
The good news is that Intel processors already provide the capability to monitor performance events inside processors. In order to obtain a more precise picture of CPU resource utilization we rely on the dynamic data obtained from the so-called performance monitoring units (PMU) implemented in Intel's processors. We concentrate on the advanced feature set available in the current Intel® Xeon® 5500, 5600, 7500, E5, E7 and Core i7 processor series [2-4].
We have implemented a basic set of routines with a high-level interface that are callable from user C++ applications and provide various CPU performance metrics in real time. In contrast to other existing frameworks like PAPI* and Linux* "perf", we support not only core but also uncore PMUs of Intel processors (including the recent Intel® Xeon® E7 processor series). The uncore is the part of the processor that contains the integrated memory controller and the Intel® QuickPath Interconnect to the other processors and the I/O hub. In total, the following metrics are supported:
Intel® PCM version 1.5 (and later) also supports Intel® Atom™ processors but counters like memory and Intel® QPI bandwidth and L3 Cache Misses will always show 0 because there is no L3 Cache in the Intel® Atom™ processor and no on-die memory controller or Intel® QPI links.
Intel® PCM version 1.6 supports on-core performance metrics (like instructions per clock cycle and L3 cache misses) of the 2nd generation Intel® Core™ processor family (Intel® microarchitecture code name Sandy Bridge) and adds experimental support for some earlier Intel® microarchitectures (e.g. Penryn), which can be enabled by defining PCM_TEST_FALLBACK_TO_ATOM in cpucounter.cpp.
As an additional goody, the package includes easy-to-use command line and graphical utilities that are based on these routines. They can be used out of the box by users who cannot or do not want to integrate the routines into their code but want to monitor and understand the CPU capacity limits in real time.
Figure 3 shows the screen shot of the command line utility on the Windows* platform. Whereas the Linux* version can rely on the MSR kernel module that is provided with the Linux kernel, no such facility is available on Windows. For Windows, a sample implementation of a Windows driver provides a similar interface.
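To illustrate what relying on the Linux MSR kernel module means in practice, here is a minimal sketch of a user-space read through the `/dev/cpu/<n>/msr` device nodes that the module exposes. It is not taken from the Intel PCM sources; the helper names `msrDevicePath` and `readMsr` are hypothetical. The sketch assumes the `msr` module is loaded and the process has sufficient privileges; otherwise the read simply fails.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: path of the per-CPU device node of the Linux msr module.
std::string msrDevicePath(int cpu) {
    assert(cpu >= 0);
    return "/dev/cpu/" + std::to_string(cpu) + "/msr";
}

// Hypothetical helper: reads one 64-bit model-specific register.
// The register number is used as the file offset of an 8-byte read.
// Returns false if the device is missing, access is denied, or the read fails.
bool readMsr(int cpu, std::uint32_t msr, std::uint64_t& value) {
    const int fd = ::open(msrDevicePath(cpu).c_str(), O_RDONLY);
    if (fd < 0) {
        return false;
    }
    const ssize_t n = ::pread(fd, &value, sizeof(value), static_cast<off_t>(msr));
    ::close(fd);
    return n == static_cast<ssize_t>(sizeof(value));
}
```

A caller would check the return value before using the result, for example `std::uint64_t tsc; if (readMsr(0, 0x10, tsc)) { ... }`, where 0x10 is the time stamp counter MSR. On Windows there is no equivalent device node, which is why the package ships a sample driver.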
But there is more to come. For the Linux operating system, the package includes an adaptor that plugs into the KDE* utility ksysguard. Using this daemon, it is possible to graph the various metrics in real-time. Figure 4 shows a screen shot where some of the metrics are displayed during a workload run.
See figures 9 and 10 below for PCM version 2.0 versions of these screenshots.
Since these utilities provide direct insight into the system, they can even be used to quickly find and understand fundamental performance bottlenecks in real time. (In contrast to the Intel® VTune™ Performance Analyzer, however, they will not tell you which parts of the application are causing the performance issue.)
Since version 1.5, the Intel® Performance Counter Monitor package contains a Windows* service, based on Microsoft .Net* 2.0 or later, that creates performance counters that can be shown in the Perfmon program delivered with the Microsoft Windows* OS. Microsoft's perfmon is capable of showing many useful performance counters on the Windows* OS, such as disk activity, memory usage, and CPU load. More information about perfmon for Windows* 7 and Windows* 2008/R2 can be found here (but perfmon has been available for many releases of Windows now). Please read the Windows_howto.rtf file on how to install and remove the service for Intel® PCM.
For all of the above-mentioned hardware counters on the Nehalem and Westmere based platforms, a corresponding perfmon counter is created, so all features supported by perfmon, such as logging over time to a file or database, are also available for these counters. For Intel® Atom™ processors, the perfmon counters for memory and Intel® QPI bandwidth and L3 Cache Misses will always show 0 for the reasons mentioned above. In a future update of Intel® Performance Counter Monitor, the service will only show the available counters.
Thanks to the abstraction layer that the library provides, it has become very easy to monitor the processor metrics inside your application. Before their usage, the performance counters need to be initialized. Afterwards, the counter state can be captured before and after the code section of interest. Different routines capture the counters for cores, sockets, or the complete system, and store their state in corresponding data structures. Additional routines provide the possibility to compute the metric based on these states. The following code snippet shows an example for their usage:
    PCM * m = PCM::getInstance();
    // program counters, and on a failure just exit
    if (m->program() != PCM::Success) return;

    SystemCounterState before_sstate = getSystemCounterState();

    [run your code here]

    SystemCounterState after_sstate = getSystemCounterState();

    cout << "Instructions per clock:" << getIPC(before_sstate,after_sstate)
         << "L3 cache hit ratio:" << getL3CacheHitRatio(before_sstate,after_sstate)
         << "Bytes read:" << getBytesReadFromMC(before_sstate,after_sstate)
         << [and so on]...
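The before/after pattern works because every metric is derived from the difference between two counter snapshots. The following self-contained sketch shows that idea only; it does not reproduce the PCM implementation. The type `CounterSnapshot` and the functions `ipcBetween` and `l3HitRatioBetween` are hypothetical stand-ins, and the hit-ratio formula (hits divided by hits plus misses) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of raw counter values (not a PCM type).
struct CounterSnapshot {
    std::uint64_t instructions;
    std::uint64_t cycles;
    std::uint64_t l3Hits;
    std::uint64_t l3Misses;
};

// Instructions per clock over the interval between two snapshots.
double ipcBetween(const CounterSnapshot& before, const CounterSnapshot& after) {
    assert(after.cycles > before.cycles);
    return static_cast<double>(after.instructions - before.instructions) /
           static_cast<double>(after.cycles - before.cycles);
}

// Fraction of L3 accesses in the interval that hit the cache.
// Returns 1.0 when the interval contains no L3 accesses at all.
double l3HitRatioBetween(const CounterSnapshot& before, const CounterSnapshot& after) {
    const std::uint64_t hits = after.l3Hits - before.l3Hits;
    const std::uint64_t misses = after.l3Misses - before.l3Misses;
    const std::uint64_t total = hits + misses;
    return total == 0 ? 1.0 : static_cast<double>(hits) / static_cast<double>(total);
}
```

Keeping the snapshots as plain values and computing metrics in free functions mirrors the shape of the calls in the snippet above: the measured code is bracketed by two cheap state captures, and all arithmetic happens afterwards.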
To assess the potential impact of having precise resource utilization data, we have implemented a simple scheduler that executes 1000 compute-intensive and 1000 memory-bandwidth-intensive jobs in a single thread. The challenge was the existence of unpredictable background load on the system, a rather typical situation in modern multi-component systems with many third-party components. Figure 6 depicts a possible schedule for a scheduler that is unaware of the background activity.
Figure 6: Scheduler without Intel® Performance Counter Monitor
If the scheduler can detect (using the provided routines) that a lot of the memory bandwidth is currently used by a different process, it can adjust its schedule accordingly. Our simulations show that such a scheduler executes the 2000 jobs 16% faster than a generic unaware scheduler on the test system.
Figure 7: Scheduler using Intel® Performance Counter Monitor
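The decision such a bandwidth-aware scheduler makes can be sketched as below. This is a hypothetical policy written for illustration, not the scheduler used in the experiment: the names `JobKind` and `pickNextJob`, the threshold parameter, and the fallback rule are all assumptions. The input `usedBandwidthFraction` stands for the share of memory bandwidth currently consumed by other processes, which a real implementation might derive from the memory controller counters.

```cpp
#include <cassert>
#include <cstddef>

enum class JobKind { ComputeBound, MemoryBound };

// Hypothetical policy: run compute-bound jobs while other processes occupy
// the memory bandwidth, and memory-bound jobs while bandwidth is free.
// Falls back to whichever kind of job is still left.
JobKind pickNextJob(double usedBandwidthFraction, double threshold,
                    std::size_t computeJobsLeft, std::size_t memoryJobsLeft) {
    assert(computeJobsLeft + memoryJobsLeft > 0);
    const bool memoryBusy = usedBandwidthFraction >= threshold;
    if (memoryBusy && computeJobsLeft > 0) {
        return JobKind::ComputeBound;
    }
    if (!memoryBusy && memoryJobsLeft > 0) {
        return JobKind::MemoryBound;
    }
    return computeJobsLeft > 0 ? JobKind::ComputeBound : JobKind::MemoryBound;
}
```

An unaware scheduler would pick jobs in a fixed order regardless of `usedBandwidthFraction`, so its memory-bound jobs would sometimes compete with the background load for the memory controller.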
Intel PCM version 2.0 adds support for the Intel® Xeon E5 series processor based on Intel microarchitecture codenamed Sandy Bridge EP/EN/E. This processor has a new uncore with lots of monitoring options.
For general info on the Intel® Xeon® E5 processors see this page.
For Intel® Xeon® E5 technical info see this page.
Below is a block diagram of the new processor from the Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide.
Figure 8: Intel® Xeon® E5 series block diagram
The Xeon E5 series processor's uncore has multiple 'boxes' similar to the Xeon E7 processor (Intel microarchitecture codename Westmere-EX). Intel PCM v2.0 supports Intel® QPI and memory metrics for the new processor.
Comparing the output of 'pcm.exe 1' version 1.7 versus version 2.0 on a Xeon E7 (Westmere-EX) based system, the primary differences are:
The PCM version 2.0 information below applies to the Intel® Xeon® E5 series processor.
PCM version 2.0 adds more Intel® QPI info:
Please note that the availability of Intel® QPI information may depend on BIOS support for the Xeon E5 uncore performance monitoring units and on the BIOS settings.
PCM version 2.0 also adds energy usage info:
For the Intel® Xeon® E5 series processor, PCM version 2.0 also provides the pcm-power utility. The MSVS Windows project file for this utility is in the PCM-Power_Win directory.
The pcm-power utility displays, for all cases:
The pcm-power '-m' option displays IMC (Integrated Memory Controller) PMU (Performance Monitoring Unit) power state info. The valid options are:
The pcm-power '-p' option displays PCU (power control unit) PMU power state info. The valid options are:
In addition to the command line tools the graphical plugins for Linux Ksysguard and Windows* Perfmon have been extended with essential energy related metrics (C-states, thermal headroom, processor and DRAM energy).
Figure 9: Intel PCM version 2.0 Ksysguard plugin showing energy metrics.
Figure 10: Intel PCM version 2.0 Windows* Perfmon plugin showing energy metrics.
For questions and comments about Intel PCM and its use-cases, we recommend the Software Tuning, Performance Optimization & Platform Monitoring forum.
[1] Drysdale, Gillespie, Valles "Performance Insights to Intel® Hyper-Threading Technology"
[2] Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2
[3] Peggy Irelan and Shihjong Kuo "Performance Monitoring Unit Sharing Guide"
Intel, Xeon, Core, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. The software license text is included into the code sample.
Intel® Turbo Boost Technology requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
This software is subject to the U.S. Export Administration Regulations and other U.S. law, and may not be exported or re-exported to certain countries (Burma, Cuba, Iran, North Korea, Sudan, and Syria) or to persons or entities prohibited from receiving U.S. exports (including Denied Parties, Specially Designated Nationals, and entities on the Bureau of Export Administration Entity List or involved with missile technology or nuclear, chemical or biological weapons).
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804