Intel® Microarchitecture Codename Nehalem Performance Monitoring Unit Programming Guide (Nehalem Core PMU)

Download entire Intel® Microarchitecture Codename Nehalem Performance Monitoring Unit Programming Guide Core [PDF 654KB]

Preface

This document contains advance information. While every effort has been made to ensure the accuracy of the information contained herein, some errors may occur. Please contact thomas.m.johnson@intel.com if you have questions or comments.

This document describes the programming interface to the performance monitoring hardware on the Nehalem processor core. It does not exhaustively describe all of the performance monitoring events that can be counted on Nehalem. A detailed description of these events may be released separately.

Terms

Table 1: List of Terms

Term Definition
BTM Branch Trace Message. A message sent on the system bus which external hardware can capture and thereby develop a reconstruction of program control flow.
BTS Branch Trace Store. A memory buffer containing a collection of branch trace messages.
Clear In reference to register programming, this means a bit is programmed to binary zero (0).
CPL Current Privilege Level. The current privilege level at which the processor is executing (Ring 0, 1, 2, or 3).
DCU Data cache. The cache closest to the processor core. This cache provides data to the core with the minimum latency.
EBS Event Based Sampling. A technique in which counters are pre-loaded with a large negative count and configured to interrupt the processor on overflow. When a counter overflows, the interrupt service routine captures profiling data.
GO Globally Observable. The point in time at which data in the machine is architecturally observable.
GP General Protection (fault).
ISR Interrupt Service Routine.
LBR Last Branch Record. A facility which provides branch trace information either through special bus cycles on the system bus, or through records written to a user defined memory buffer (the BTS).
LLC Last-level cache. The lowest level of cache, after which memory requests must be satisfied by system memory.
MLC Mid-level cache. This is the intermediate level cache which lies between the DCU and LLC.
MSR Model Specific Register. PMU counter and counter control registers are implemented as MSRs. They are accessed via the rdmsr and wrmsr instructions. Certain counter registers can also be accessed via the rdpmc instruction.
NHM Nehalem. Specifically the Nehalem processor core.
PEBS Precise Event Based Sampling. A special counting mode in which counters can be configured to overflow, interrupt the processor, and capture machine state at that point.
PerfMon Short for Performance Monitoring.
PMI Performance Monitoring Interrupt. This interrupt is generated when a counter overflows and has been programmed to generate an interrupt, or when the PEBS interrupt threshold has been reached. The interrupt vector for this interrupt is controlled through the Local Vector Table in the Local APIC.
PMU Performance Monitoring Unit
RFO Read for ownership. When a cache line is written that misses in the cache, it must first be read into the cache so that the line exists in cache and can then be modified.
RO A bit is read-only.
RW A bit is readable and writeable.
Set In reference to register programming, this means a bit is programmed to binary one (1).
SMM System Management Mode.
SMT Simultaneous Multi-threading.
Supervisor (SUP), or privilege level 0 Supervisor state is the most privileged state of execution. Typically operating system code executes in privilege level 0.
TBS Time Based Sampling. A technique in which a time base is used to determine when to capture profiling data. This time base can be a timer interrupt or the occurrence of a certain number of other events, such as clock ticks or instructions retired.
Thread A hardware thread of execution, as provided by Hyper-Threading Technology.
Uop Micro-operation. Macro instructions are broken down into uops within the machine, and these uops are executed by the execution units.
User (USR), or privilege level 1, 2, or 3 Intel processors operate in privilege levels zero through three, where lower numbered privilege levels operate in a more privileged state. User (or privilege levels 1, 2, or 3) refers to less privileged states of execution. User code typically executes at privilege level 3.
WO A bit is write-only.
WO1 A bit is write-only, and should be written as '1' (set).

About this document

This is a programmer's reference manual for the Nehalem core performance monitoring unit (PMU). It is intended for current tool owners who require documentation updates for Nehalem-based platforms. It is not intended for first-time tool developers or as a user analysis guide. Additional documents providing that information will be made available at a later date.

Nehalem-based PMU Architecture

Intel processor cores have included a Performance Monitoring Unit (PMU) for many years. This unit provides the ability to count the occurrence of micro-architectural events, which expose some of the inner workings of the processor core as it executes code.

One use of this capability is to create a list of events from which certain performance metrics can be calculated. Software configures the PMU to count events over an interval of time and report the resulting event counts. Using this methodology, performance analysts can characterize overall system performance.
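As a rough illustration of the counting methodology, the sketch below turns two raw counter snapshots into event counts and then into a derived metric (cycles per instruction). It is not taken from this guide: the helper names, the snapshot values, and the choice of metric are hypothetical, and the 48-bit counter width is an assumption that a real tool would query from the hardware rather than hard-code.

```python
# Assumption: general-purpose counters are 48 bits wide. A real tool should
# query the width from the hardware instead of hard-coding it.
COUNTER_WIDTH_BITS = 48


def counter_delta(start, end, width_bits=COUNTER_WIDTH_BITS):
    """Events counted between two raw reads, tolerating a single wraparound."""
    return (end - start) % (1 << width_bits)


def cycles_per_instruction(cycles, instructions):
    """Derived metric: average core cycles spent per retired instruction."""
    if instructions == 0:
        raise ValueError("no instructions retired in the interval")
    return cycles / instructions


# Hypothetical raw counter snapshots taken at the start and end of an interval.
start = {"cycles": 1_000_000, "instructions": 400_000}
end = {"cycles": 3_500_000, "instructions": 1_400_000}

deltas = {name: counter_delta(start[name], end[name]) for name in start}
cpi = cycles_per_instruction(deltas["cycles"], deltas["instructions"])
print(f"CPI over the interval: {cpi:.2f}")
```

Taking the difference modulo the counter range keeps the result correct even if the counter wrapped once between the two reads, which is why the width matters here.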

The PMU also provides facilities to generate a hardware interrupt through the Local APIC integrated within the processor core or logical thread. In this case, software pre-loads an event counter register with a "sample after value" so that a hardware interrupt is generated after N occurrences of the event. In the interrupt handler, software collects additional architectural state, which gives analysts information about the performance of specific areas of application code. This methodology is sometimes referred to as profiling the execution of an application.
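The arithmetic behind a sample-after value can be sketched as follows. Pre-loading a counter with "minus N" means writing the two's-complement value 2^width - N, so the counter overflows after exactly N further events. The function names are hypothetical, and the 48-bit width is an assumption that should be confirmed against the hardware.

```python
# Assumption: the counter is 48 bits wide; confirm against the hardware.
COUNTER_WIDTH_BITS = 48


def preload_value(sample_after, width_bits=COUNTER_WIDTH_BITS):
    """Raw value to write so that the counter overflows after `sample_after` events."""
    modulus = 1 << width_bits
    if not 0 < sample_after < modulus:
        raise ValueError("sample-after value must fit in the counter")
    # Two's-complement encoding of -sample_after within the counter width.
    return modulus - sample_after


def events_until_overflow(counter_value, width_bits=COUNTER_WIDTH_BITS):
    """Number of increments left before the counter wraps past its maximum."""
    return (1 << width_bits) - counter_value


raw = preload_value(100_000)
print(hex(raw), events_until_overflow(raw))
```

A larger sample-after value lowers the interrupt rate and the profiling overhead, at the cost of fewer samples per unit of work.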

Products based on the Nehalem processor core can collect event data under both of these scenarios. In addition, these products include various platform features (the uncore) integrated on the same die as the processor core. The uncore is essentially everything on the processor chip that is not part of the core. This includes point-to-point interconnect logic, memory controllers, and last-level caches, among other things. The uncore also provides an additional PMU facility that can interrupt the processor core so that profiling information may be collected. This document describes the Nehalem processor core PMU.
