February 19, 2009 11:00 PM PST
by Richard J. Greco
Hotspot analysis using tools like the Intel VTune™ Performance Analyzer is a common technique for identifying bottlenecks and improving software performance. The problem with this technique is that while it identifies where a hotspot exists, it does not identify why the hotspot exists. This paper describes an analysis methodology using the cycle accounting events on the Itanium® 2 processor that overcomes this problem and provides more accurate identification of the location of hotspots. If you are new to using the Intel VTune Performance Analyzer, you may want to check out the Related Links at the end of this article.
Traditionally, users of tools like the VTune Performance Analyzer use samples of elapsed time to discover the locations in a program where the majority of the execution time occurs. The source code at these hotspots is then analyzed to determine why the software is bottlenecked at that point. Once the bottleneck is known, appropriate changes are made to eliminate the bottleneck, and performance is improved.
There are two problems with this approach. First, the process of determining why a program is stalled once you know the hotspot is not always easy. This is especially true on wide issue processors like those in the Itanium processor family, where up to six instructions can be issued on a single clock, and any one of the instructions can be the source of the bottleneck. The second problem is that this technique always looks at 100% of the execution time. Places in the program that are heavily optimized, but where significant amounts of execution time are spent, also appear in the hotspot profile. Sample intervals have to be carefully crafted so the hotspots you are interested in show up in the hotspot profile reported by the performance tools.
Performance engineers working with the Itanium processors have an alternative that allows them to first identify why a program is bottlenecked, then selectively profile only the stalled portions of the execution time. Using this approach, the performance engineer can generate hotspot profiles of only the locations where the program is stalled, and know the precise reason for the stall at each location.
This alternative methodology is built around the Stall Cycle Events provided on Itanium processors. These special performance events monitor the amount of time the processor is stalled, and prioritize the reasons for the stall. Once the important stall reasons are known, the next step is to use standard profiling techniques to locate the program hotspots for these stalls.
The implementation of the Stall Cycle Events is similar, but not identical, on the Itanium processor and the Itanium 2 processor. While the methodology works for either processor, the remainder of this paper discusses how to use it on the Itanium 2 processor. To use this methodology on an Itanium processor, simply substitute the Itanium processor Stall Cycle Events for the Itanium 2 processor Stall Cycle Events described below. These are documented in the Intel Itanium 2 processor manuals (see Related Links at the end of this article).
This section is a brief introduction to the Itanium 2 processor Stall Cycle Events, covering what you need to use them to characterize an application. A complete description of the Itanium 2 processor Stall Cycle Events can be found in the document "Intel Itanium 2 Processor Reference Manual for Software Development and Optimization", available through the Related Links at the end of this article. The Itanium 2 processor separates Stall Cycle Events into two sets: front end stalls and back end stalls. Because the front end and back end of the Itanium 2 processor operate asynchronously, and each set of events counts over 100% of the execution time, comparing the two sets against each other is meaningless.
For purposes of performance characterization, always start with the Back End Cycle Events. If the back end is stalled, those stalls have to be removed first. A Back End Stall Cycle Event is provided that identifies when front end stalls are significant. Only if this event says that front end stalls are significant is it necessary to characterize using both front end and back end stall events.
Stall Cycle Events only count processor clocks when the processor is stalled. Two events count the total cycles the Back End and the Front End (respectively) are stalled:
BE_BUBBLE.ALL
FE_BUBBLE.ALL
Subtracting the count for one of these events from CPU_CYCLES gives the total time the front end or back end (respectively) is not stalled and is retiring instructions. BE_BUBBLE.ALL always represents the upper limit of the time that could be removed by optimizing the program to remove all stalls. If, when you characterize the program, BE_BUBBLE.ALL is small (so the processor is retiring instructions on most cycles), the instruction stream is well optimized, and performance improvements need to come from either reducing the number of instructions or creating more instruction level parallelism so more instructions execute on each clock. Only highly optimized programs exhibit this behavior; generally, BE_BUBBLE.ALL will be a significant portion of the execution time for your program. When this occurs, profile on CPU_CYCLES to identify the parts of the application that will have the biggest impact on performance if changed.
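The arithmetic in this paragraph can be sketched in a few lines of Python. The function name and the counter values are hypothetical, chosen only for illustration. The speedup bound is a derived quantity (total cycles divided by unstalled cycles), not something the processor reports.

```python
def stall_upper_bound(cpu_cycles: int, be_bubble_all: int) -> dict:
    """Derive unstalled cycles and the best-case speedup from two event counts."""
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    if not 0 <= be_bubble_all <= cpu_cycles:
        raise ValueError("be_bubble_all must be between 0 and cpu_cycles")

    unstalled = cpu_cycles - be_bubble_all
    # If every back end stall were removed, only the unstalled cycles would remain.
    max_speedup = cpu_cycles / unstalled if unstalled else float("inf")
    return {
        "unstalled_cycles": unstalled,
        "stalled_fraction": be_bubble_all / cpu_cycles,
        "max_speedup": max_speedup,
    }


# Hypothetical counts, not measurements from a real run.
summary = stall_upper_bound(cpu_cycles=1_000_000, be_bubble_all=400_000)
print(summary)
```

A stalled fraction close to zero corresponds to the well-optimized case described above, where further gains have to come from fewer instructions or more instruction level parallelism.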
Both BE_BUBBLE.ALL and FE_BUBBLE.ALL can be broken into multiple events that give detailed information about the nature of the stalls. The first level of breakdown of the stall cycle events is prioritized so that only one event increments on any clock. Prioritization mimics the operation of the pipeline so the more serious stalls are always reported.
BE_BUBBLE.ALL can be separated into six subevents that sum to the value counted by the BE_BUBBLE.ALL event (listed in priority order):
BE_FLUSH_BUBBLE.XPN — the processor is stalled due to an exception or interrupt.
BE_FLUSH_BUBBLE.BRU — the processor is stalled due to a mispredicted branch.
BE_L1D_FPU_BUBBLE.ALL — the processor is waiting for exception detection to complete for either memory operations or floating point operations.
BE_EXE_BUBBLE.ALL — the processor is waiting for an operand to be returned from memory or from an execution unit.
BE_RSE_BUBBLE.ALL — the processor is waiting for the Register Stack Engine to complete operations.
BE_BUBBLE.FE — the processor back end is stalled waiting for instructions to be fetched by the front end.
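One way to picture the prioritization is a toy model in which each clock carries a set of active stall conditions and only the highest-priority one is counted. This illustrates the counting rule described above and is not a description of the hardware; the function name and the per-clock input format are assumptions.

```python
# First-level back end stall events, in the priority order listed above.
BE_PRIORITY = [
    "BE_FLUSH_BUBBLE.XPN",
    "BE_FLUSH_BUBBLE.BRU",
    "BE_L1D_FPU_BUBBLE.ALL",
    "BE_EXE_BUBBLE.ALL",
    "BE_RSE_BUBBLE.ALL",
    "BE_BUBBLE.FE",
]


def attribute_cycles(cycles):
    """Count each stalled clock once, under its highest-priority active event."""
    counts = {name: 0 for name in BE_PRIORITY}
    for active in cycles:
        for name in BE_PRIORITY:
            if name in active:
                counts[name] += 1
                break  # lower-priority reasons on this clock are not counted
    return counts


# Hypothetical per-clock stall conditions (an empty set means no stall).
cycles = [
    {"BE_EXE_BUBBLE.ALL", "BE_BUBBLE.FE"},
    {"BE_FLUSH_BUBBLE.BRU", "BE_EXE_BUBBLE.ALL"},
    set(),
    {"BE_BUBBLE.FE"},
]
counts = attribute_cycles(cycles)
print(counts)
```

Because each stalled clock is attributed to exactly one event, the six counts add up to the number of stalled clocks, which is why the subevents sum to BE_BUBBLE.ALL.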
Because many reasons exist for the events BE_L1D_FPU_BUBBLE.ALL, BE_EXE_BUBBLE.ALL and BE_RSE_BUBBLE.ALL, subevents provide detailed information about the sources of the stall. These subevents are not prioritized, so if multiple problems exist at once (such as waiting on an operand from memory and waiting on an operand from an ALU in the same issue group), each reason will be reported. If BE_BUBBLE.FE is high, the reasons can be inferred by looking at the subevents for FE_BUBBLE.ALL.
FE_BUBBLE.ALL can be separated into seven subevents that sum to the value counted by the FE_BUBBLE.ALL event (listed in priority order):
FE_BUBBLE.FEFLUSH — the front end stalled because of a front end flush.
FE_BUBBLE.TLBMISS — the front end stalled because of a level 1 or level 2 ITLB miss.
FE_BUBBLE.IMISS — the front end stalled because of an L1I cache miss.
FE_BUBBLE.BRANCH — the front end stalled by a branch recirculate.
FE_BUBBLE.FILL_RECIRC — the front end stalled by a recirculate for a fill operation.
FE_BUBBLE.BUBBLE — the front end stalled because of a branch prediction bubble.
FE_BUBBLE.IBFULL — the front end is stalled because the Instruction Buffer is full.
For compiled code, front end stalls can usually be removed by increasing optimization levels, using profile guided optimization, or using inter-procedural or global optimization at compile time.
To understand these events and the reasons they occur, you can find detailed event descriptions, a detailed pipeline description, and detailed processor architecture in the "Intel Itanium 2 Processor Reference Manual for Software Development and Optimization."
Because the Stall Cycle Events provide detailed information about why an application is stalled, they can characterize the execution behavior of an application to identify what needs to be optimized.
The first step is to determine the causes for back end stalls. While these can be counted in total by the event BE_BUBBLE.ALL, it is more useful to sample subcomponents of BE_BUBBLE.ALL. The Itanium 2 processor is capable of monitoring four performance events at one time. The following four Stall Cycle Events sum to the event BE_BUBBLE.ALL:
BE_FLUSH_BUBBLE.ALL
BE_BUBBLE.L1D_FPU_RSE
BE_EXE_BUBBLE.ALL
BE_BUBBLE.FE
By sampling all four in one run of the program, you will guarantee the independent counts will sum to BE_BUBBLE.ALL.
If your program has uniform execution characteristics (or the characteristics have a dominant series of operations) and a short execution time, you can sample these over the entire program execution. If your program has a long execution time and uniform characteristics, take a sample for some part of the execution time after initialization has completed. If your program has varying execution characteristics, take several equal-duration samples at different points during the execution.
Next, sum these four events and subtract them from the execution time (if you sampled over the entire execution) or the sample interval (if you sampled for less than total execution time) to get the count of the cycles the back end is not stalled. You can either compare the numbers, or plot them as shown in Figure 1.
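The subtraction and comparison described above might look like the following sketch. The counts are invented for illustration and are not the data behind Figure 1; the function name and the text-bar output are likewise assumptions.

```python
def back_end_breakdown(interval_cycles: int, stall_counts: dict) -> dict:
    """Return each category's share of the interval, including unstalled cycles."""
    stalled = sum(stall_counts.values())
    if interval_cycles <= 0 or stalled > interval_cycles:
        raise ValueError("stall counts must not exceed the sampled interval")
    cycles = dict(stall_counts)
    cycles["unstalled"] = interval_cycles - stalled
    return {name: value / interval_cycles for name, value in cycles.items()}


# Hypothetical counts from one run that sampled all four events together.
counts = {
    "BE_FLUSH_BUBBLE.ALL": 50_000,
    "BE_BUBBLE.L1D_FPU_RSE": 30_000,
    "BE_EXE_BUBBLE.ALL": 500_000,
    "BE_BUBBLE.FE": 20_000,
}
shares = back_end_breakdown(1_000_000, counts)

# Print the categories from largest to smallest with a simple text bar.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24}{share:7.1%}  {'#' * round(share * 40)}")
```

Pass the total execution time in cycles as the first argument if you sampled the whole run, or the length of the sample interval otherwise.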
This chart shows that the primary reason for stalls is Execution Unit Stalls (BE_EXE_BUBBLE.ALL). Before moving to profiling, a second characterization run should be made to determine which of the many reasons counted by this event are the source of the stall. Characterizing with the following four events generally pinpoints the dominant causes:
BE_EXE_BUBBLE.GRALL
BE_EXE_BUBBLE.GRGR
BE_EXE_BUBBLE.FRALL
BE_EXE_BUBBLE.ARCR_PR_CANCEL_BANK
In rare cases the composite event BE_EXE_BUBBLE.ARCR_PR_CANCEL_BANK will be dominant. If it is, simply make another characterization run using the three subevents of BE_EXE_BUBBLE that count each of these uniquely.
Similarly, if the value for BE_BUBBLE.L1D_FPU_RSE is dominant, characterize again using the following events to separate this composite Stall Cycle Event into its individual components:
BE_RSE_BUBBLE.ALL
BE_L1D_FPU_BUBBLE.L1D
BE_L1D_FPU_BUBBLE.FPU
The event BE_L1D_FPU_BUBBLE.FPU uniquely counts waiting on floating-point exceptions, while the events BE_RSE_BUBBLE.ALL and BE_L1D_FPU_BUBBLE.L1D can be separated using other Stall Cycle Events. These are documented in the "Intel Itanium 2 Processor Reference Manual for Software Development and Optimization".
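The drill-down described in this section can be summarized as a lookup from the dominant first-level event to the events for the next characterization run. The table below is assembled from the event lists in this article, except that the mapping for BE_FLUSH_BUBBLE.ALL is inferred from the event names. The function name, the split into runs of four, and the sample counts are assumptions made for illustration.

```python
MAX_COUNTERS = 4  # the Itanium 2 processor monitors four events at one time

NEXT_LEVEL = {
    # Inferred from the event names; treat this entry as an assumption.
    "BE_FLUSH_BUBBLE.ALL": ["BE_FLUSH_BUBBLE.XPN", "BE_FLUSH_BUBBLE.BRU"],
    "BE_BUBBLE.L1D_FPU_RSE": [
        "BE_RSE_BUBBLE.ALL",
        "BE_L1D_FPU_BUBBLE.L1D",
        "BE_L1D_FPU_BUBBLE.FPU",
    ],
    "BE_EXE_BUBBLE.ALL": [
        "BE_EXE_BUBBLE.GRALL",
        "BE_EXE_BUBBLE.GRGR",
        "BE_EXE_BUBBLE.FRALL",
        "BE_EXE_BUBBLE.ARCR_PR_CANCEL_BANK",
    ],
    "BE_BUBBLE.FE": [
        "FE_BUBBLE.FEFLUSH",
        "FE_BUBBLE.TLBMISS",
        "FE_BUBBLE.IMISS",
        "FE_BUBBLE.BRANCH",
        "FE_BUBBLE.FILL_RECIRC",
        "FE_BUBBLE.BUBBLE",
        "FE_BUBBLE.IBFULL",
    ],
}


def plan_next_runs(first_level_counts: dict):
    """Pick the dominant event and group its subevents into runs of at most four."""
    dominant = max(first_level_counts, key=first_level_counts.get)
    events = NEXT_LEVEL[dominant]
    runs = [events[i:i + MAX_COUNTERS] for i in range(0, len(events), MAX_COUNTERS)]
    return dominant, runs


# Hypothetical first-level counts.
dominant, runs = plan_next_runs({
    "BE_FLUSH_BUBBLE.ALL": 50_000,
    "BE_BUBBLE.L1D_FPU_RSE": 30_000,
    "BE_EXE_BUBBLE.ALL": 500_000,
    "BE_BUBBLE.FE": 20_000,
})
print(dominant, runs)
```

The seven front end events do not fit in four counters, so the sketch splits them across two runs. Counts gathered in separate runs may not sum exactly to their parent event, since the program can behave slightly differently from run to run.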
Focused Profiling
Now that the characterization is complete, use a profiling tool like the VTune Performance Analyzer to find the hotspots for the Stall Cycle Events that are the dominant part of the execution time. When you profile using a Stall Cycle Event you know the exact reasons that cause the event. For the example in Figure 1, profiling on just BE_EXE_BUBBLE.ALL shows that the hotspots are all related to the consumption of an operand. This means that when looking at the assembly language associated with the application, any instruction that does not consume an operand can be ignored. By profiling on the Stall Cycle Event, both the hotspot and the reason for the hotspot are identified.
Also, because the profile was collected on just BE_EXE_BUBBLE.ALL, the number of counted clocks is about half of the clocks counted with CPU_CYCLES. The sample interval can be reduced by half and still produce a sample file of about the same size. By using a smaller sample interval, hotspots are more accurately identified.
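The interval adjustment follows from keeping the expected number of samples constant. A minimal sketch, assuming a sample is taken every fixed number of event counts; the function name and the numbers are hypothetical, and real tools may expose this setting differently.

```python
def scaled_sample_interval(cpu_interval: int, cpu_cycles: int, event_count: int) -> int:
    """Scale a CPU_CYCLES sampling interval so a sparser event yields a similar sample count."""
    if cpu_interval <= 0 or cpu_cycles <= 0 or event_count < 0:
        raise ValueError("interval and cycle counts must be positive")
    # Samples taken = count / interval, so shrink the interval by the same
    # ratio by which the stall event is rarer than CPU_CYCLES.
    return max(1, cpu_interval * event_count // cpu_cycles)


# Hypothetical values: the stall event counts about half as often as CPU_CYCLES.
interval = scaled_sample_interval(
    cpu_interval=100_000, cpu_cycles=1_000_000, event_count=500_000
)
print(interval)
```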
This white paper has discussed a methodology for using the Stall Cycle Events on the Itanium 2 processor to identify where hotspots exist in a program.
Stall Cycle Events characterize a program to learn what is causing the program to stall the processor pipeline. When processor stalls are a significant portion of the execution time, Stall Cycle Events identify what program changes must be made to improve application performance.
Once the dominant Stall Cycle Events are known, profile using them. The hotspots identified all have known causes. This allows the performance engineer to know both where and why hotspots exist, making them easier to remove. Also, because only a portion of the execution time is monitored by each Stall Cycle Event, smaller sample intervals can be used, allowing for more accurate identification of hotspots.
Related Links
Secrets of the Intel VTune™ Performance Analyzer
Intel Itanium 2 Processor Reference Manual for Software Development and Optimization
Richard Greco is an applications engineer with Intel's Software and Solutions Group.