Prioritize Bottlenecks on the Itanium Processor

March 9, 2009 1:00 AM PDT



Challenge

Prioritize performance bottlenecks by their impact on performance so that they can be resolved in order of importance. The key to optimizing an application is to use performance monitoring events to identify the dominant stall contributions and their relative importance. Removing cache misses that contribute little to the performance loss is an inefficient and wasteful exercise.


Solution

Express bottleneck information in terms of a normalized ratio. While the sum rule shown in the separate item, Analyze Pipeline Stalls on 64-Bit Intel® Architecture, communicates this information, the impact of each bottleneck is easier to grasp quickly when it is presented as a normalized ratio.

The naïve approach is to normalize to CPU_CYCLES. This is a zero-sum game: the fractions always add up to one, so reducing one stall component inflates the shares of the others, which makes it difficult to judge your progress. In general, it is better to normalize to a measure of the amount of work done (an application-dependent definition), or to Retired Itanium® Instructions (event code 8).
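To see why normalizing to CPU_CYCLES obscures progress, consider the following minimal Python sketch. All counts are invented for illustration, and the names `stall_a` and `stall_b` are hypothetical stand-ins for two stall components, not real event names.

```python
# Illustrative counts only; these are not measurements.
# stall_a and stall_b are hypothetical stand-ins for two stall components.
before = {"CPU_CYCLES": 1000, "IA64IR": 500, "stall_a": 400, "stall_b": 200}
# Suppose an optimization removes 200 cycles of stall_a and nothing else.
after = {"CPU_CYCLES": 800, "IA64IR": 500, "stall_a": 200, "stall_b": 200}


def per_cycle(counts, name):
    """Fraction of all cycles attributed to one component."""
    return counts[name] / counts["CPU_CYCLES"]


def per_instruction(counts, name):
    """Cycles of one component per retired instruction."""
    return counts[name] / counts["IA64IR"]


for name in ("stall_a", "stall_b"):
    print(name,
          "| per cycle:", per_cycle(before, name), "->", per_cycle(after, name),
          "| per instruction:", per_instruction(before, name), "->",
          per_instruction(after, name))
```

In this sketch `stall_b` is untouched, yet its per-cycle share rises from 0.2 to 0.25 simply because the total shrank. Its per-instruction value stays at 0.4, so only the component that actually improved shows a change.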

For the purposes of the following discussion, the accumulated data will be normalized to the event IA64_Inst_Retired (IA64IR). This is a subevent of IA64_Tagged_Instructions_Retired and defined to count all instructions with a true predicate and all branch instructions, regardless of predicate. It is therefore slightly different from the corresponding counter on Itanium processors.

The application efficiency analysis expressed in Cycles Per Instruction (CPI) has a simple algebraic structure that enables you to drill down into any component of the stall cycle accounting. This is simplified by the denominator, IA64_Inst_Retired (IA64IR), which is common to all the components and relatively stable even as the optimization progresses:

CPI = CPU_CYCLES/IA64IR 

As you optimize the application algorithmically or with compiler flags, the total number of instructions (path length) changes. This in turn causes variations in CPI. However, IA64IR remains one of the best normalizations for standardized (i.e., comparable) execution-efficiency measurement.

The objective of microarchitectural optimization is the reduction of stall cycles. You can measure this reduction using the Back_End_Bubble event. If you define the quantity CYC_RET_INST as the number of cycles spent retiring instructions, then

CYC_RET_INST = CPU_CYCLES - Back_End_Bubble 

This is equivalent to the Itanium processor event All_Stops_Dispersed. Thus, we can write:

CPI = CYC_RET_INST/IA64IR + Back_End_Bubble/IA64IR 

and

Back_End_Bubble/IA64IR = ΣBE_Bubble_Components/IA64IR
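The decomposition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the counts are invented, `Other_Bubble` is a hypothetical catch-all for the components not named in this article, and Back_End_Bubble is taken to be the sum of its components as the sum rule states (in practice you would read each counter from your profiling tool).

```python
def cpi_breakdown(cpu_cycles, ia64ir, bubble_components):
    """Split CPI into a retiring term plus one term per stall component.

    bubble_components maps a component name to its stall-cycle count.
    """
    # Sum rule: Back_End_Bubble is the sum of its components.
    back_end_bubble = sum(bubble_components.values())
    # CYC_RET_INST = CPU_CYCLES - Back_End_Bubble
    cyc_ret_inst = cpu_cycles - back_end_bubble
    breakdown = {"CYC_RET_INST": cyc_ret_inst / ia64ir}
    for name, cycles in bubble_components.items():
        breakdown[name] = cycles / ia64ir
    return breakdown


def rank_stalls(breakdown):
    """Return the stall terms sorted by CPI contribution, largest first."""
    stalls = {k: v for k, v in breakdown.items() if k != "CYC_RET_INST"}
    return sorted(stalls.items(), key=lambda item: item[1], reverse=True)


# Invented counts for illustration only.
components = {
    "BE_EXE_Bubble": 600,
    "BE_L1D_FPU_Bubble": 300,
    "Other_Bubble": 100,  # hypothetical catch-all, not a real event name
}
breakdown = cpi_breakdown(cpu_cycles=2000, ia64ir=1000,
                          bubble_components=components)
for name, value in rank_stalls(breakdown):
    print(f"{name}: {value:.2f} cycles per instruction")
```

Because every term shares the denominator IA64IR, the terms add up to the total CPI, and the ranked list gives the order in which the stall components deserve attention.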

The BE_Bubble_Components (the elements in the cycle accounting sum rule) can be further broken down with the use of subevents and in some cases with models incorporating penalties for the occurrence of other architectural monitoring events (called “occurrence events”). This is particularly true for the memory access stalls that contribute to BE_EXE_Bubble and BE_L1D_FPU_Bubble. This leads to relations like:

BE_Bubble_Component/IA64IR ~ Σ(Occurrence_Events*Penalty)/IA64IR

where the sum extends over the architectural monitoring events (and in many cases differences between counts of such events) that contribute to the pipeline stalls accumulated in the corresponding BE_Bubble stall counter.

This last step is not as accurate as it was on Itanium processors, due to the out-of-order data returns from the L2 cache and the complex scheduling that can result in the OzQ. The Itanium processor L2 FIFO queue made such modeling remarkably accurate. With the OzQ, out-of-order returns allow access penalties to mask each other, making the situation more complex. However, you can still use memory access modeling as a guide to help determine where the most benefit is likely to be gained.
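The occurrence-event model described above might be sketched as follows. Everything specific here is an assumption made for illustration: the event names, the counts, and the penalty values are hypothetical and are not taken from any processor documentation. The sketch also shows one occurrence count derived as a difference between two other counts.

```python
def modeled_stall_per_instruction(occurrences, penalties, ia64ir):
    """Estimate a stall component as sum(occurrences * penalty) / IA64IR."""
    stall_cycles = sum(count * penalties[name]
                       for name, count in occurrences.items())
    return stall_cycles / ia64ir


# Hypothetical counts; not real event names or measurements.
l1d_misses = 1000
l2_misses = 200

occurrences = {
    # Accesses served by L2, derived as a difference between two counts.
    "l2_hit": l1d_misses - l2_misses,
    "l2_miss": l2_misses,
}
# Hypothetical penalties in cycles; substitute values for your system.
penalties = {"l2_hit": 5, "l2_miss": 100}

modeled = modeled_stall_per_instruction(occurrences, penalties, ia64ir=10000)
print(f"modeled stall: {modeled:.2f} cycles per instruction")
```

A model like this charges every occurrence its full penalty, so when penalties mask each other it will tend to overstate the stall. Comparing the modeled value with the measured BE_Bubble component indicates how far to trust it as a guide.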


Source

Introduction to Microarchitectural Optimization for Itanium® Processors