Identify Back-End Bubbles on 64-Bit Intel® Architecture


Identify a processor back-end bubble on the Intel® Itanium® processor. A 'bubble' is any delay in the processor pipeline, that is, a cycle in which a pipeline stage has no useful work to do. The 'back end' is the part of the pipeline where instructions are executed and, once complete, retired. There are five main causes of back-end bubbles in the Itanium 2 processor:

  • Pipeline flush (BE_FLUSH_BUBBLE)
  • Stalls in the L1 data-cache or floating-point processing-unit pipelines (BE_L1D_FPU_BUBBLE)
  • Stalls in the execution stage of the pipeline due to data not being available (BE_EXE_BUBBLE)
  • A need for the Register Stack Engine to free registers for the current stack (BE_RSE_BUBBLE)
  • A lack of instructions coming from the Front End (BACK_END_BUBBLE.FE)
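
As a minimal sketch of how the five causes above might be compared once counts for each event are in hand, the following C fragment stores the event names from the list and picks the cause with the most cycles. The enum, the function names, and the idea of passing the counts as an array are assumptions made for illustration; they are not part of the VTune analyzer or of any Intel API, and the counts themselves would have to be read from your own sampling results.

```c
#include <assert.h>
#include <stddef.h>

/* The five back-end bubble causes, in the order listed above.
   These identifiers are hypothetical and exist only for this sketch. */
enum be_cause { BE_FLUSH, BE_L1D_FPU, BE_EXE, BE_RSE, BE_FE, BE_CAUSE_COUNT };

/* Event names exactly as they appear in the list above. */
static const char *const be_event_name[BE_CAUSE_COUNT] = {
    "BE_FLUSH_BUBBLE",
    "BE_L1D_FPU_BUBBLE",
    "BE_EXE_BUBBLE",
    "BE_RSE_BUBBLE",
    "BACK_END_BUBBLE.FE"
};

/* Return the event name for a cause. */
const char *cause_name(enum be_cause c)
{
    assert(c >= 0 && c < BE_CAUSE_COUNT);
    return be_event_name[c];
}

/* Return the cause with the largest cycle count (first one wins on a tie). */
enum be_cause dominant_cause(const unsigned long long cycles[BE_CAUSE_COUNT])
{
    assert(cycles != NULL);
    enum be_cause worst = BE_FLUSH;
    for (int c = 1; c < BE_CAUSE_COUNT; c++) {
        if (cycles[c] > cycles[worst]) {
            worst = (enum be_cause)c;
        }
    }
    return worst;
}

/* Return one cause's share of the sum of the five supplied counts. */
double cause_share(const unsigned long long cycles[BE_CAUSE_COUNT], enum be_cause c)
{
    assert(cycles != NULL && c >= 0 && c < BE_CAUSE_COUNT);
    unsigned long long total = 0;
    for (int i = 0; i < BE_CAUSE_COUNT; i++) {
        total += cycles[i];
    }
    assert(total > 0);
    return (double)cycles[c] / (double)total;
}
```

Calling dominant_cause() on an array of five counts and passing the result to cause_name() gives the name of the event to investigate first; cause_share() expresses how much of the supplied total that event represents.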


The following matrix-multiplication code (MatrixMultiply.c) provides an example of code that runs sub-optimally (the printf statement is present solely to make sure the compiler does not optimize the code into nothingness):

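#include <stdio.h>
#include <stdlib.h>

/* Declared static so that three 512x512 int arrays (3 MB in total)
   do not overflow the default stack. */
static int a[512][512], b[512][512], c[512][512];

int main(void) {
    int i, j, k;

    /* Fill all three matrices with pseudo-random values. */
    for (i = 0; i < 512; i++) {
        for (j = 0; j < 512; j++) {
            a[i][j] = rand();
            b[i][j] = rand();
            c[i][j] = rand();
        }
    }

    /* Naive triple-nested matrix multiply: c = c + a * b. */
    for (i = 0; i < 512; i++) {
        for (j = 0; j < 512; j++) {
            for (k = 0; k < 512; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

    /* Print one in-bounds element so the compiler cannot discard the loops. */
    printf("Multiply done, %d\n", c[511][511]);
    return 0;
}
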

Use the Intel® VTune™ Performance Analyzer to sample the application using the BACK_END_BUBBLE-ALL counter.

Open a 64-bit command window and compile the matrix-multiplication code given above with the following command:

ecl MatrixMultiply.c /o matrix.exe 


Copy the code to a 64-bit machine if it is not there already. Open the VTune analyzer and start a new project (the example is called 'Matrix' for convenience). Start a new Sampling Wizard and select Win32*/Win64*/Linux* profiling. Enter the path to your executable in the “Application to Launch” field and uncheck “Run Activity when done with wizard.” Click “Finish.”

Next, select “Configure” on the menu bar and “Modify ... <sampling> collector.” Click on “Events” and add the BACK_END_BUBBLE-ALL counter. Click on the green arrow in the VTune analyzer window, and the sampling results should appear.

Notice that the execution of the triple-nested loop took more than five billion clockticks (about 5.3 billion) and that BACK_END_BUBBLE-ALL events accounted for 4.5 billion of them. On a 1 GHz Itanium 2 processor, these figures mean that the matrix multiply took 5.3 seconds, of which 4.5 seconds (85% of the execution time) were spent waiting on the back end.
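
The arithmetic behind those figures can be sketched in a few lines of C. The function names below are hypothetical helpers written for this illustration, not part of the VTune analyzer; the inputs are the clocktick counts and the 1 GHz clock rate quoted above.

```c
#include <assert.h>

/* Convert a clocktick count to seconds at a given clock frequency in Hz. */
double ticks_to_seconds(double ticks, double hz)
{
    assert(hz > 0.0);
    return ticks / hz;
}

/* Fraction of the total clockticks that were attributed to stalls. */
double stall_fraction(double stall_ticks, double total_ticks)
{
    assert(total_ticks > 0.0);
    return stall_ticks / total_ticks;
}
```

With the numbers from this example, ticks_to_seconds(5.3e9, 1e9) gives 5.3 seconds of run time, ticks_to_seconds(4.5e9, 1e9) gives 4.5 seconds of back-end stalls, and stall_fraction(4.5e9, 5.3e9) is roughly 0.85, which is the 85% quoted above.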

If BACK_END_BUBBLE had not been the culprit, other sampling runs using a counter with -ALL at the end of its name could have been used, or even run simultaneously with the BACK_END_BUBBLE run. For the sake of simplicity, this example takes the shortcut of using only one counter.

This item is part of a series, which is introduced in the separate item "How to Resolve Back-End Bubbles on 64-Bit Intel® Architecture."


Identifying Root Causes Using the VTune™ Performance Analyzer

