Determine which memory accesses are causing the EXE pipeline stalls accumulated by the BE_EXE_Bubble counter. This counter accumulates stall cycles in the EXE stage of the pipeline, and most memory-access stall cycles are accumulated here. These stall cycles occur mostly because the data being loaded into registers is not ready for consumption by the functional units.
These dependency stalls break down into two sources:
- Data has not arrived from the memory subsystem in time for use by the functional unit.
- Data from a functional unit was not written back to the register in time for use by a subsequent instruction.
In both cases, the problem is that there are too few independent instructions between the point where the data is placed in the register (by a load or as an instruction's output) and its later use as input to another instruction to absorb the latency of getting the data into the register.
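The relationship between latency and independent instructions can be illustrated with a deliberately simplified model. The sketch below assumes a single-issue, in-order pipeline and a hypothetical 5-cycle load latency; the function name `exe_stall_cycles` is made up for illustration. Itanium processors issue several instructions per cycle, so this shows only the principle, not actual cycle counts.

```python
def exe_stall_cycles(producer_latency: int, independent_instrs: int) -> int:
    """Cycles a consumer waits in a toy single-issue, in-order pipeline.

    Assumption: the producer issues at cycle 0, its result is usable at
    cycle `producer_latency`, and one instruction issues per cycle.
    """
    if producer_latency < 1 or independent_instrs < 0:
        raise ValueError("latency must be >= 1 and instruction count >= 0")
    # Without stalls, the consumer would issue right after the producer
    # and the independent instructions scheduled between them.
    earliest_issue = 1 + independent_instrs
    # Any latency not covered by independent work shows up as stall cycles.
    return max(0, producer_latency - earliest_issue)


# Hypothetical 5-cycle load latency with varying amounts of independent work.
for gap in (0, 2, 4):
    print(gap, exe_stall_cycles(5, gap))
# prints "0 4", then "2 2", then "4 0"
```

In this model, four independent instructions fully absorb the assumed 5-cycle latency; with fewer, the uncovered cycles become stall cycles of the kind BE_EXE_Bubble accumulates.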
Use the Intel® VTune™ Performance Analyzer to analyze the subevents of the BE_EXE_Bubble counter. The following table shows the subevents for BE_EXE_Bubble that can be selected with a specific umask value. In the VTune analyzer, the subevents are predefined with the specific umask values:
| Subevent | Description |
| --- | --- |
| ALL | Back-end was stalled by exe |
| GRALL | Back-end was stalled by exe due to GR/GR or GR/load dependency |
| FRALL | Back-end was stalled by exe due to FR/FR or FR/load dependency |
| PR | Back-end was stalled by exe due to PR dependency |
| ARCR | Back-end was stalled by exe due to AR or CR dependency |
| GRGR | Back-end was stalled by exe due to GR/GR dependency |
| CANCEL | Back-end was stalled by exe due to a canceled load |
| BANK_SWITCH | Back-end was stalled by exe due to bank switching |
| ARCR_PR_CANCEL_BANK | Back-end was stalled by exe due to ARCR, PR, CANCEL or BANK_SWITCH |
| (remaining umask values) | (* nothing will be counted *) |
In the case of general registers, the two classes of dependency stalls can be explicitly measured:
- BE_EXE_Bubble.GRALL accumulates all integer-data dependency stalls.
- BE_EXE_Bubble.GRGR accumulates stall cycles due to integer functional-unit latencies not being completely absorbed by the instruction scheduling.
Using the subevents, the stall cycles attributed to loading integer data can be approximated by the difference:
Integer load stall cycles ≈ BE_EXE_Bubble.GRALL - BE_EXE_Bubble.GRGR
Frequently, collecting data on BE_EXE_Bubble.GRALL alone is sufficient, because the compiler usually hides the functional-unit latency. The difference is not exact, because these two counters are not prioritized, and there are situations where both could increment on a single cycle. For example, consider the case where a load and an MM instruction are issued on the same cycle and the load misses in L1. If the code tries to use both results too soon, both stall conditions are true, but the functional-unit stall masks the memory-access stall.
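As a minimal sketch of the arithmetic, the approximation (GRALL minus GRGR) can be computed from collected counter totals as follows. The function names and the counter values are hypothetical and only illustrate the calculation; they are not output from the VTune analyzer.

```python
def integer_load_stall_estimate(grall: int, grgr: int) -> int:
    """Approximate integer-load stall cycles as GRALL minus GRGR."""
    if grall < 0 or grgr < 0:
        raise ValueError("counter values must be non-negative")
    # Clamp at zero: counts collected in separate runs may not be
    # exactly consistent with each other (an assumption of this sketch).
    return max(0, grall - grgr)


def share_of_cycles(stall_cycles: int, total_cycles: int) -> float:
    """Fraction of all elapsed cycles spent in the given stall category."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return stall_cycles / total_cycles


# Hypothetical counter totals, chosen only to illustrate the arithmetic.
grall, grgr, cycles = 1_200_000, 150_000, 10_000_000
load_stalls = integer_load_stall_estimate(grall, grgr)
print(load_stalls)  # 1050000
print(f"{share_of_cycles(load_stalls, cycles):.1%} of cycles")
```

Because the two counters can overlap on a single cycle, the result should be read as an estimate rather than an exact cycle count.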
Because there is no corresponding pair of subevents for floating-point loads, there is some uncertainty in measuring the stall cycles due to floating-point data access. If only the performance events can be used, you must also look at the associated memory-subsystem occurrence events (other than BE_EXE_Bubble.FRALL) to determine the severity of floating-point data-access stalls. With the VTune analyzer, you can simply drill down to the disassembly view, where code inspection should make the source of the stalls obvious.
Introduction to Microarchitectural Optimization for Itanium® Processors