Analyze Pipeline Stalls on 64-Bit Intel Architecture

February 28, 2009 11:00 PM PST



Challenge

Identify and categorize the causes of pipeline stalls for maximum performance on the Itanium® 2 processor and Itanium® processor. The objective of microarchitectural optimization is to maximize the flow of instructions through the CPU’s functional units. This is equivalent to minimizing the CPU cycles in which the core pipeline is stalled. To accomplish this methodically, the causes of the pipeline stalls that occur while an application executes must be categorized quantitatively. The dominant sources of pipeline stalls can then be remedied efficiently.

On Itanium processors, each stage that can stall the pipeline has an associated event that allows the accumulation of the stalls due to that stage. In addition, these events are prioritized so that in the event of multiple stall conditions, each stalled cycle is attributed to one and only one stage of the pipeline. Stalled CPU cycles are assigned to the highest priority pipeline stage experiencing a stall during that cycle.

Priority of the stall increases as the end of the pipeline is approached. This creates a sum rule whose components divide the stalled CPU cycles between various architectural subsystems. These cycle accounting components correspond to different architectural features in the pipeline and in themselves categorize the dominant architectural features responsible for the execution inefficiency.
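The prioritized attribution described above can be illustrated with a small simulation. This is only a sketch of the accounting idea, not of the hardware: the stage labels and the sample trace are hypothetical, chosen for illustration.

```python
# Hypothetical stage labels, ordered from the upstream (front) end of the
# pipeline to the downstream (back) end. They are NOT hardware event names.
STAGES_UPSTREAM_TO_DOWNSTREAM = ["front_end", "rse", "exe", "l1d_fpu", "flush"]


def attribute_stalls(cycles):
    """Attribute each cycle to at most one stage.

    `cycles` is an iterable of sets; each set names the stages that are
    stalled during that cycle. A cycle with several stalled stages is
    charged to the most downstream one, which has the highest priority.
    """
    counts = {stage: 0 for stage in STAGES_UPSTREAM_TO_DOWNSTREAM}
    counts["unstalled"] = 0
    for stalled in cycles:
        # Walk from the end of the pipeline backwards; first match wins.
        for stage in reversed(STAGES_UPSTREAM_TO_DOWNSTREAM):
            if stage in stalled:
                counts[stage] += 1
                break
        else:
            # No stage was stalled during this cycle.
            counts["unstalled"] += 1
    return counts


# Made-up five-cycle trace for illustration.
trace = [set(), {"front_end"}, {"front_end", "exe"}, {"rse", "flush"}, set()]
counts = attribute_stalls(trace)
print(counts)
```

Because every cycle lands in exactly one bucket, the buckets always add up to the total cycle count, which is what makes the sum rule possible.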

The objective of microarchitectural performance tuning is to minimize the components of the sum rule that correspond to pipeline stalls. In other words, minimize cycles that are not spent issuing instructions to the functional units.


Solution

Count individual events on the Itanium 2 processor, derive analogous events on the Itanium processor, and categorize both sets to allow direct comparison. On Itanium processors, even cycles spent issuing instructions to the functional units invoke an event that can be explicitly counted. On Itanium 2 processors, cycles spent issuing instructions to the functional units must be calculated from other events.

The event Back_End_Bubble.ALL accumulates the cycles in which the instruction pipeline stalled for any reason. The result of CPU_Cycles - Back_End_Bubble.ALL is equal to the number of cycles spent issuing instructions to the functional units. To determine the breakdown of the pipeline stalls, apply the main sum rule:

Back_End_Bubble.ALL =
BE_Flush_Bubble
+ BE_L1D_FPU_Bubble
+ BE_EXE_Bubble
+ BE_RSE_Bubble
+ Back_End_Bubble.FE 
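As a rough sketch, the sum rule can be applied to a set of collected counts as follows. The event names are the ones used in this article; the function name and the sample numbers are made up for illustration, and the code assumes the counts have already been gathered by some profiling tool.

```python
# Components of the main sum rule, in the order given in this article.
COMPONENTS = [
    "BE_Flush_Bubble",
    "BE_L1D_FPU_Bubble",
    "BE_EXE_Bubble",
    "BE_RSE_Bubble",
    "Back_End_Bubble.FE",
]


def stall_breakdown(counts):
    """Return (unstalled cycles, sum-rule residual, per-component fractions)."""
    total_stalled = counts["Back_End_Bubble.ALL"]
    # Cycles spent issuing instructions to the functional units.
    unstalled = counts["CPU_Cycles"] - total_stalled
    # Zero when the components add up exactly to Back_End_Bubble.ALL.
    residual = total_stalled - sum(counts[name] for name in COMPONENTS)
    # Share of all CPU cycles lost to each component.
    fractions = {name: counts[name] / counts["CPU_Cycles"] for name in COMPONENTS}
    return unstalled, residual, fractions


# Illustrative numbers only; they are not measurements.
sample = {
    "CPU_Cycles": 1000,
    "Back_End_Bubble.ALL": 400,
    "BE_Flush_Bubble": 20,
    "BE_L1D_FPU_Bubble": 180,
    "BE_EXE_Bubble": 120,
    "BE_RSE_Bubble": 10,
    "Back_End_Bubble.FE": 70,
}
unstalled, residual, fractions = stall_breakdown(sample)
print(unstalled, residual)
for name, share in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```

Sorting the fractions puts the dominant source of stalls first, which is the component to address first.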
 

The components are listed in order of decreasing priority (i.e., from the downstream to the upstream end of the core pipeline). Each component accumulates stall cycles caused by a different architectural subsystem. These events resemble the Itanium processor performance monitoring events, which are prioritized in the same way. The following table compares the Itanium 2 processor events and the Itanium processor events used to count cycles lost in stalls:

Events for Itanium 2 Processor Cycle Accounting | Events for Itanium Processor Cycle Accounting
BE_Flush_Bubble | Pipeline_Backend_Flush_Cycle
BE_L1D_FPU_Bubble, BE_EXE_Bubble | Data_Access_Cycle, Dependency_Scoreboard_Cycle
BE_RSE_Bubble | RSE_Active_Cycle.d
Back_End_Bubble.FE | Unstalled_Backend_Cycle, Inst_Access_Cycle, Taken_Branch_Cycle.d

 

For more information on these events, see the manual, "Introduction to Microarchitectural Optimization for Itanium® Processors."

To compare the Itanium 2 and Itanium processor cycle accounting events equitably, you must group the events into sets that correspond to identical causes of pipeline stalls. The Itanium 2 processor events BE_L1D_FPU_Bubble and BE_EXE_Bubble and the Itanium processor events Data_Access_Cycle and Dependency_Scoreboard_Cycle form such a set.

Between the two pairs of events, you can account for all memory access stalls and scoreboarded register dependency stalls. On Itanium processors, the memory access stalls and scoreboard dependency stalls are assigned to individual counters: Data_Access_Cycle and Dependency_Scoreboard_Cycle, respectively. On Itanium 2 processors, these stalls are assigned to counters more closely tied to the architectural subsystems.

The Itanium processor counter RSE_Active_Cycle.d is a derived quantity and is equal to the difference of:

Memory_Cycle - Data_Access_Cycle
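The derived counter and the grouping of events into comparable sets might be sketched as follows. The event names come from this article; the group labels, function names, and sample numbers are hypothetical. Only the three rows of the comparison that pair unambiguously are included, so the front-end row is left out.

```python
# Each group pairs the BE_* bubble events with the *_Cycle events that
# this article lists for the same cause of stall.
# The group labels are illustrative, not official names.
GROUPS = {
    "backend_flush": (
        ["BE_Flush_Bubble"],
        ["Pipeline_Backend_Flush_Cycle"],
    ),
    "data_access_and_scoreboard": (
        ["BE_L1D_FPU_Bubble", "BE_EXE_Bubble"],
        ["Data_Access_Cycle", "Dependency_Scoreboard_Cycle"],
    ),
    "rse": (
        ["BE_RSE_Bubble"],
        ["RSE_Active_Cycle.d"],
    ),
}


def with_derived_rse(cycle_counts):
    """Add RSE_Active_Cycle.d = Memory_Cycle - Data_Access_Cycle."""
    derived = dict(cycle_counts)
    derived["RSE_Active_Cycle.d"] = (
        cycle_counts["Memory_Cycle"] - cycle_counts["Data_Access_Cycle"]
    )
    return derived


def compare_groups(bubble_counts, cycle_counts):
    """Return {group: (bubble-event total, cycle-event total)}."""
    cycle_counts = with_derived_rse(cycle_counts)
    return {
        group: (
            sum(bubble_counts[e] for e in bubble_events),
            sum(cycle_counts[e] for e in cycle_events),
        )
        for group, (bubble_events, cycle_events) in GROUPS.items()
    }


# Illustrative numbers only; they are not measurements.
bubble_counts = {
    "BE_Flush_Bubble": 20,
    "BE_L1D_FPU_Bubble": 180,
    "BE_EXE_Bubble": 120,
    "BE_RSE_Bubble": 10,
}
cycle_counts = {
    "Pipeline_Backend_Flush_Cycle": 25,
    "Data_Access_Cycle": 200,
    "Dependency_Scoreboard_Cycle": 90,
    "Memory_Cycle": 215,
}
comparison = compare_groups(bubble_counts, cycle_counts)
print(comparison)
```

Summing within each group before comparing avoids matching individual counters that split the same stalls differently on the two processors.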


Source

Introduction to Microarchitectural Optimization for Itanium® Processors