Analyze Memory Accesses on 64-Bit Intel Architecture


Challenge

Determine which memory accesses cause the EXE pipeline stalls accumulated by the BE_EXE_Bubble counter. This counter accumulates stall cycles in the EXE stage of the pipeline, and most memory-access stall cycles are counted there. These stall cycles occur mostly because the data loaded into registers is not ready for consumption by the functional units.

These dependency stalls break down into two sources:

  • Data has not arrived from the memory subsystem in time for use by the functional unit.
  • Data from a functional unit was not written back to the register in time for use by a subsequent instruction.

 

In both cases, the problem is that too few independent instructions sit between the instruction that places the data in the register (a load or a functional-unit result) and the later instruction that consumes it, so the latency of getting the data into the register is not absorbed.
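
The relationship between latency and scheduling distance can be sketched with a toy in-order model. This is only an illustration: the function name, its parameters, and the latency values are assumptions made for the example, not measured Itanium figures or part of any Intel tool.

```python
def exposed_stall_cycles(producer_latency: int, scheduled_distance: int) -> int:
    """Stall cycles seen by the consumer in a simplified in-order model.

    producer_latency: cycles from issuing the producer (a load or a
        functional-unit instruction) until its result is usable.
        Hypothetical value supplied by the caller.
    scheduled_distance: cycles of independent work the scheduler placed
        between the producer and its first consumer.
    """
    if producer_latency < 0 or scheduled_distance < 0:
        raise ValueError("cycle counts must be non-negative")
    # Latency not covered by independent work shows up as stall cycles.
    return max(0, producer_latency - scheduled_distance)


# Illustrative numbers only: a 6-cycle producer with 2 cycles of
# independent work leaves 4 cycles exposed; 6 or more cycles hide it.
print(exposed_stall_cycles(6, 2))
print(exposed_stall_cycles(6, 6))
```

In this model, the stall disappears once the scheduled distance reaches the producer latency, which is what the compiler tries to achieve when it schedules independent instructions between a load and its use.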


Solution

Use the Intel® VTune™ Performance Analyzer to analyze the subevents of the BE_EXE_Bubble counter. The following table shows the subevents for BE_EXE_Bubble that can be selected with a specific umask value. In the VTune analyzer, the subevents are predefined with the specific umask values:

Extension              PMC.umask      Description
ALL                    b0000          Back-end was stalled by exe
GRALL                  b0001          Back-end was stalled by exe due to GR/GR or GR/load dependency
FRALL                  b0010          Back-end was stalled by exe due to FR/FR or FR/load dependency
PR                     b0011          Back-end was stalled by exe due to PR dependency
ARCR                   b0100          Back-end was stalled by exe due to AR or CR dependency
GRGR                   b0101          Back-end was stalled by exe due to GR/GR dependency
CANCEL                 b0110          Back-end was stalled by exe due to a canceled load
BANK_SWITCH            b0111          Back-end was stalled by exe due to bank switching
ARCR_PR_CANCEL_BANK    b1000          ARCR, PR, CANCEL or BANK_SWITCH
---                    b1001-b1111    (* nothing will be counted *)
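
For scripts that post-process raw counter configurations, the umask assignments listed above can be kept in a small lookup table. The following is a minimal sketch; the dictionary and the helper function are hypothetical names introduced for illustration and are not part of the VTune analyzer.

```python
# umask values for BE_EXE_Bubble, transcribed from the table in this article.
BE_EXE_BUBBLE_SUBEVENTS = {
    0b0000: "ALL",
    0b0001: "GRALL",
    0b0010: "FRALL",
    0b0011: "PR",
    0b0100: "ARCR",
    0b0101: "GRGR",
    0b0110: "CANCEL",
    0b0111: "BANK_SWITCH",
    0b1000: "ARCR_PR_CANCEL_BANK",
}


def subevent_name(umask: int) -> str:
    """Return the qualified subevent name for a 4-bit umask value."""
    if not 0 <= umask <= 0b1111:
        raise ValueError("umask must fit in 4 bits")
    extension = BE_EXE_BUBBLE_SUBEVENTS.get(umask)
    if extension is None:
        # Values b1001 through b1111 count nothing, per the table.
        return "(nothing counted)"
    return f"BE_EXE_Bubble.{extension}"


print(subevent_name(0b0101))  # BE_EXE_Bubble.GRGR
```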

In the case of general registers, the two classes of dependency stalls can be explicitly measured:

  • BE_EXE_Bubble.GRALL accumulates all integer-data dependency stalls.
  • BE_EXE_Bubble.GRGR accumulates stall cycles due to integer functional-unit latencies not being completely absorbed by the instruction scheduling.

Using the subevents, the stall cycles attributed to loading integer data can be approximated from the difference, as:

BE_EXE_Bubble.GRALL - BE_EXE_Bubble.GRGR

Frequently, just collecting data on BE_EXE_Bubble.GRALL is sufficient, because the compiler usually hides the functional-unit latency. The difference is not exact, because these two counters are not prioritized, and there are situations where both could increment on a single cycle. For example, consider the case where a load and an MM instruction are issued on the same cycle and the load misses in L1. If the code tries to use both results too soon, then both stall conditions would be true, but the functional-unit stall would mask the memory-access stall.
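
The GRALL minus GRGR approximation can be applied to exported counter totals as follows. This is a sketch under stated assumptions: the function names are hypothetical, the counts in the usage lines are invented for illustration, and the total cycle count is assumed to come from a separately collected cycle counter.

```python
def estimate_integer_load_stalls(grall: int, grgr: int) -> int:
    """Approximate integer load stall cycles as GRALL minus GRGR.

    The result is an estimate: the two subevents are not prioritized,
    so a functional-unit stall can mask a memory-access stall.
    """
    if grall < 0 or grgr < 0:
        raise ValueError("counter values must be non-negative")
    # Clamp at zero in case counts from separate runs are noisy.
    return max(0, grall - grgr)


def stall_fraction(stall_cycles: int, total_cycles: int) -> float:
    """Share of all cycles spent in the estimated stalls."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return stall_cycles / total_cycles


# Invented counts, for illustration only.
load_stalls = estimate_integer_load_stalls(grall=1_200_000, grgr=150_000)
print(load_stalls)
print(f"{stall_fraction(load_stalls, 10_000_000):.1%}")
```

When GRGR is small relative to GRALL, the estimate is close to GRALL itself, which is why collecting GRALL alone is often enough.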

Because there is no corresponding pair of subevents for floating-point loads, the stall cycles due to floating-point data access cannot be measured as directly. If you can rely only on the performance events, you must look at the associated memory-subsystem occurrence events in addition to BE_EXE_Bubble.FRALL to determine the severity of floating-point data-access stalls. With the VTune analyzer, you can simply drill down to the disassembly view, where code inspection should make the cause obvious.


Source

Introduction to Microarchitectural Optimization for Itanium® Processors

 

