How to Perform Back-End Bubble Root-Cause Analysis on 64-Bit Intel® Architecture


Challenge

Identify the root cause of a back-end processor bubble on the Intel® Itanium® 2 processor. A separate item, How to Identify Back-End Bubbles on 64-Bit Intel® Architecture, shows how to use the Intel® VTune™ Performance Analyzer to detect a bubble. Detection alone does not resolve the performance issue; the root cause of the bubble must be determined first.


Solution

Create a VTune analyzer activity to analyze the bubble. Start from the environment state reached after stepping through the solution in the item How to Identify Back-End Bubbles on 64-Bit Intel® Architecture. In the VTune analyzer, go back to “Modify <…><sampling> collector.” A new window pops up, warning that you are about to modify an activity that already has results associated with it.

Choose the first option, "Make a copy of the Activity and modify the copy." When the sampling box appears, add the counters listed below to the right side of the window shown here. (The right window normally has a 'Sample After' column; it was moved out of the picture in order to show the full counter name.)



After applying the changes and exiting the window, rename the new project to “BACK_END root cause.” Next, run the new activity by pressing the green arrow on the VTune analyzer toolbar.

At the end of the run, notice that BE_EXE_BUBBLE-ALL has three billion events associated with it and that every other counter added is an order of magnitude smaller:
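The "order of magnitude" comparison can also be checked outside the tool if the event totals are copied out by hand. The following is a minimal sketch, not part of the VTune analyzer: the function name dominant_counter, the factor parameter, and every count other than the three billion reported for BE_EXE_BUBBLE-ALL are hypothetical, and the other counter names are assumed because the original counter list was in a screenshot.

```python
def dominant_counter(counts, factor=10):
    """Return the counter whose total is at least `factor` times every
    other counter's total, or None if no counter dominates that clearly."""
    if not counts:
        return None
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    top_name, top_count = ranked[0]
    for _, count in ranked[1:]:
        if top_count < factor * count:
            return None  # another counter is within the same order of magnitude
    return top_name


# Hypothetical totals: only the BE_EXE_BUBBLE-ALL figure comes from the text;
# the other names and values are placeholders for illustration.
sample_counts = {
    "BE_EXE_BUBBLE-ALL": 3_000_000_000,
    "BE_FLUSH_BUBBLE-ALL": 120_000_000,
    "BE_L1D_FPU_BUBBLE-ALL": 250_000_000,
    "BE_RSE_BUBBLE-ALL": 4_000_000,
}

print(dominant_counter(sample_counts))
```

If no counter dominates by the chosen factor, the function returns None, which suggests that more than one bubble category needs a closer look.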



Choose to modify the collector and make a copy of the activity again. Remove all the counters, keeping CPU_CYCLES or IA64_INST_RETIRED-THIS if desired. Add all counters with BE_EXE as a prefix. Exit the collector window, rename the project to BE_EXE, and then run it. The results look somewhat like this:



The problem areas are GRALL and FRALL, which count stalls involving the general and floating-point registers, respectively. The VTune Performance Analyzer online help states that GRALL and FRALL events occur when an instruction depends on a register written by another instruction, or on a register that is still waiting for a load to deliver its result. After examining the core of the matrix code, it is evident that each iteration is independent of the others, so the stalls must be due to waiting for loads to complete. The primary cause of load stalls is that the required data is not located in cache. This implies inefficiency in memory access.
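To illustrate how memory access order alone can change how often data must be fetched, here is a minimal sketch of a toy model. It is an assumption for illustration only: the matrix code is not shown in this item, so the row-wise and column-wise traversals, the function name cache_line_fetches, the 8-byte element size, and the 128-byte line size are all hypothetical. The model remembers only the most recently touched cache line, which is far simpler than a real cache hierarchy.

```python
def cache_line_fetches(rows, cols, order, elem_bytes=8, line_bytes=128):
    """Count accesses that touch a different cache line than the previous
    access, for a row-major matrix traversed in the given order.

    Toy model: only the most recently used line is treated as cached.
    """
    if order == "row":
        indices = ((r, c) for r in range(rows) for c in range(cols))
    elif order == "col":
        indices = ((r, c) for c in range(cols) for r in range(rows))
    else:
        raise ValueError("order must be 'row' or 'col'")

    fetches = 0
    last_line = None
    for r, c in indices:
        address = (r * cols + c) * elem_bytes  # byte offset in row-major layout
        line = address // line_bytes           # cache line holding that offset
        if line != last_line:
            fetches += 1
            last_line = line
    return fetches


# Same 64 x 64 matrix, same number of element accesses, different order.
row_fetches = cache_line_fetches(64, 64, "row")
col_fetches = cache_line_fetches(64, 64, "col")
print(row_fetches, col_fetches)
```

In this model the column-wise walk lands on a new line at every access, while the row-wise walk reuses each line for 16 consecutive elements. A real processor has several cache levels and can hold many lines at once, so the gap is usually smaller in practice, but the direction of the effect matches the load stalls described above.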

This item is part of a series, which is introduced in the separate item "How to Resolve Back-End Bubbles on 64-Bit Intel® Architecture."


Source

Identifying Root Causes Using the VTune™ Performance Analyzer

 

