Characterize Application Performance with Stall Events on 64-Bit Architecture


Challenge

Use front-end and back-end stall-cycle events to characterize the performance of an application on the Intel® Itanium® 2 processor. The Itanium® 2 processor separates stall-cycle events into two sets:

  • Front-end stalls
  • Back-end stalls


Solution

Analyze back-end stall-cycle events separately from, and before, front-end stall-cycle events. The front end and back end of the Itanium 2 processor operate asynchronously, and each set of events counts across 100% of the execution time, so comparing the two sets against each other is meaningless.

For performance characterization, always start with the back-end stall-cycle events. If the back end is stalled, those stalls must be removed first. One back-end stall-cycle event, BE_BUBBLE.FE, identifies when front-end stalls are significant. Only when this event shows that front-end stalls are significant is it necessary to characterize with both front-end and back-end stall events.
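
As a minimal sketch of this ordering, the Python function below decides which event sets to characterize from raw counter totals. The function name, the argument names, and the 10% cutoff are assumptions made for illustration; the text does not define a numeric threshold for "significant".

```python
def event_sets_to_characterize(cpu_cycles, be_bubble_fe, threshold=0.10):
    """Return the stall-event sets to analyze, back end first.

    threshold is an assumed cutoff, not a value taken from Intel documentation.
    """
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    # The back-end events are always characterized first.
    sets = ["BE_BUBBLE.ALL"]
    # BE_BUBBLE.FE counts cycles the back end spends waiting on the front end;
    # add the front-end events only when that share of cycles is significant.
    if be_bubble_fe / cpu_cycles >= threshold:
        sets.append("FE_BUBBLE.ALL")
    return sets
```

With 1,000,000 total cycles, a BE_BUBBLE.FE count of 250,000 would add the front-end set, while a count of 20,000 would leave only the back-end set.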

Stall-cycle events count processor clocks only when the processor is stalled. Two events count the total cycles during which the back end and the front end, respectively, are stalled:

  • BE_BUBBLE.ALL
  • FE_BUBBLE.ALL

Subtracting the count for either event from CPU_CYCLES gives the total time that the corresponding part of the pipeline (back end or front end) is not stalled; for the back end, this is the time spent retiring instructions. BE_BUBBLE.ALL always represents the upper limit of the time that could be removed by optimizing the program to remove all stalls. If BE_BUBBLE.ALL is small when you characterize the program (so the processor is retiring instructions during most cycles), the instruction stream is well optimized, and further performance improvements must come from either reducing the number of instructions or creating more instruction-level parallelism so that more instructions execute on each clock. Only highly optimized programs exhibit this behavior; generally, BE_BUBBLE.ALL will be a significant portion of the execution time for your program. When it is, profile on CPU_CYCLES to identify the parts of the application where changes will have the biggest impact on performance.
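
The arithmetic in this paragraph can be sketched in a few lines of Python. The function and key names are hypothetical, and the speedup bound is a derived figure that assumes the instruction count and instruction-level parallelism stay unchanged while every back-end stall is removed.

```python
def stall_summary(cpu_cycles, be_bubble_all, fe_bubble_all):
    """Summarize unstalled cycles and the stall-removal upper bound."""
    if cpu_cycles <= 0:
        raise ValueError("cpu_cycles must be positive")
    be_unstalled = cpu_cycles - be_bubble_all  # back end retiring instructions
    fe_unstalled = cpu_cycles - fe_bubble_all  # front end not stalled
    return {
        "be_unstalled_cycles": be_unstalled,
        "fe_unstalled_cycles": fe_unstalled,
        # Share of execution time that removing all stalls could recover.
        "be_stall_fraction": be_bubble_all / cpu_cycles,
        # Best-case speedup if every back-end stall cycle disappeared.
        "max_speedup": (cpu_cycles / be_unstalled
                        if be_unstalled > 0 else float("inf")),
    }
```

The two unstalled figures are reported separately and not compared with each other, since both event sets count across the whole execution time.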

Both BE_BUBBLE.ALL and FE_BUBBLE.ALL can be broken into multiple events that give detailed information about the nature of the stalls. The first level of breakdown of the stall-cycle events is prioritized so that only one event increments on any clock. Prioritization mimics the operation of the pipeline, so the most serious stall on any given clock is the one reported.

BE_BUBBLE.ALL can be separated into six subevents that sum to the value counted by the BE_BUBBLE.ALL event (listed here in priority order):

  • BE_FLUSH_BUBBLE.XPN - the processor is stalled due to an exception or interrupt.
  • BE_FLUSH_BUBBLE.BRU - the processor is stalled due to a mispredicted branch.
  • BE_L1D_FPU_BUBBLE.ALL - the processor is waiting for exception detection to complete for either memory operations or floating-point operations.
  • BE_EXE_BUBBLE.ALL - the processor is waiting for an operand to be returned from memory or from an execution unit.
  • BE_RSE_BUBBLE.ALL - the processor is waiting for the Register Stack Engine to complete operations.
  • BE_BUBBLE.FE - the processor back end is stalled waiting for instructions to be fetched by the front end.
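
Because these six subevents are prioritized and sum to BE_BUBBLE.ALL, each one can be expressed as a share of the total back-end stall time. The sketch below assumes the counts arrive as a plain dictionary keyed by event name (a hypothetical input format, not the output of any particular profiling tool). It also reports the difference between the subevent sum and BE_BUBBLE.ALL, which may be non-zero if the counters were collected in separate runs.

```python
# The six first-level subevents, in the priority order listed above.
BE_SUBEVENTS = [
    "BE_FLUSH_BUBBLE.XPN",
    "BE_FLUSH_BUBBLE.BRU",
    "BE_L1D_FPU_BUBBLE.ALL",
    "BE_EXE_BUBBLE.ALL",
    "BE_RSE_BUBBLE.ALL",
    "BE_BUBBLE.FE",
]

def be_breakdown(counts):
    """Rank back-end stall sources and check them against BE_BUBBLE.ALL."""
    total = counts["BE_BUBBLE.ALL"]
    if total <= 0:
        raise ValueError("BE_BUBBLE.ALL must be positive")
    # Share of back-end stall cycles attributed to each subevent.
    shares = {name: counts[name] / total for name in BE_SUBEVENTS}
    # Largest contributor first; ties keep priority order.
    ranked = sorted(BE_SUBEVENTS, key=lambda name: counts[name], reverse=True)
    # Zero when the subevents account for exactly the measured total.
    residual = sum(counts[name] for name in BE_SUBEVENTS) - total
    return ranked, shares, residual
```

The first entry of the ranked list is the stall source to investigate first; if it is BE_BUBBLE.FE, the front-end events become relevant.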

Because BE_L1D_FPU_BUBBLE.ALL, BE_EXE_BUBBLE.ALL, and BE_RSE_BUBBLE.ALL can each have many causes, subevents provide detailed information about the sources of the stall. These subevents are not prioritized, so if multiple problems exist at once (such as waiting on an operand from memory and waiting on an operand from an ALU in the same issue group), each reason will be reported. If BE_BUBBLE.FE is high, the reasons can be inferred by looking at the subevents for FE_BUBBLE.ALL.
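
One practical consequence of the missing prioritization is that second-level subevent counts can add up to more than their parent event. The sketch below illustrates the bookkeeping under the assumption that every stalled cycle of the parent is reported by at least one subevent. The function name is hypothetical, and the subevent names in the usage note are placeholders, not real event names.

```python
def subevent_shares(parent_count, subevent_counts):
    """Express non-prioritized subevents as shares of their parent event.

    Returns (shares, extra_reports). Shares may sum to more than 1.0 because
    one stalled cycle can be reported by several subevents at once.
    """
    if parent_count <= 0:
        raise ValueError("parent_count must be positive")
    shares = {name: count / parent_count
              for name, count in subevent_counts.items()}
    # Reports beyond one per stalled cycle, i.e. additional reasons that
    # were counted in cycles another subevent also counted.
    extra_reports = sum(subevent_counts.values()) - parent_count
    return shares, extra_reports
```

For a parent count of 200,000 with placeholder subevents SUB_A at 150,000 and SUB_B at 100,000, the shares are 0.75 and 0.5, and 50,000 reports come from cycles where both reasons were present.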

FE_BUBBLE.ALL can be separated into seven subevents that sum to the value counted by the FE_BUBBLE.ALL event (listed in priority order):

  • FE_BUBBLE.FEFLUSH - the front end stalled because of a front-end flush.
  • FE_BUBBLE.TLBMISS - the front end stalled because of a level 1 or level 2 ITLB miss.
  • FE_BUBBLE.IMISS - the front end stalled because of an L1I cache miss.
  • FE_BUBBLE.BRANCH - the front end stalled by a branch recirculate.
  • FE_BUBBLE.FILL_RECIRC - the front end stalled by a recirculate for a fill operation.
  • FE_BUBBLE.BUBBLE - the front end stalled because of a branch-prediction bubble.
  • FE_BUBBLE.IBFULL - the front end is stalled because the Instruction Buffer is full.

For compiled code, front-end stalls can usually be removed by increasing optimization levels, using profile-guided optimization, or using inter-procedural or global optimization at compile time.

You can find detailed event descriptions, a detailed pipeline description, and detailed processor architecture in the "Intel Itanium 2 Processor Reference Manual for Software Development and Optimization."


Source

Performance Analysis of Applications Running on Itanium® Processors

 

