Quit Stalling and Execute!

David Levinthal's class at the SW Enabling Summit, Software Optimization and Performance Analysis on Intel® Core™ Micro-Architecture, introduced an alternative way of looking at performance tuning: considering the "stall" cycles occurring in the micro-architecture, cycles where execution is delayed because of resource limits. In my last note I talked about micro-operations (μops) retired versus dispatched and presented David's definition of stalls as cycles where no μops get dispatched. This time we'll delve deeper into the sources of stalls and see how they fit into our methods for performance analysis.


David's methodology was developed while studying applications whose work is dominated by loops employing floating-point instructions. As a Technical Consulting Engineer for the VTune™ Analyzer team, David has seen action working on many of the High Performance Computing applications that run on Intel® Architecture. Although not all applications are dominated by such loops, the insights gleaned there can be applied to many performance problems.


Consider CPI (Cycles Per Instruction), which in conventional parlance means cycles per instruction retired. It's a convenient measure of CPU efficiency, but not always a very useful one. Part of the problem is in the nature of the ratio. If we do our optimization job right, we expect to decrease both the number of instructions completed (i.e., retired) and the time (number of cycles) it takes to execute them; that is, our improvements should decrease both the numerator and the denominator of the CPI ratio. With one change making the ratio bigger and the other making it smaller, the effect of our improvements on the indicator is muted. If a "goodness" indicator doesn't show positive results for better code, what use is it? Better just to count the number of cycles executed to complete a task (a measure of time), and then think about the efficiency of instructions a different way. (To be fair, in server applications that exhibit a fairly constant CPI, it can be combined with pathlength, the number of instructions it takes to produce some quantum of work, say completing a transaction, to produce an indicator, cycles per quantum, that does reflect performance. Often improvements in CPI come at a cost in pathlength, or vice versa.)
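
To make that concrete, here's a tiny worked example in C. The pathlength and CPI numbers are invented purely for illustration, but they show how a change that makes CPI look worse can still be a clear win once you multiply through to cycles per transaction.

    #include <stdio.h>

    /* Illustrative only: the values below are made up to show how CPI and
     * pathlength combine into cycles per transaction, the metric that
     * actually tracks performance. */
    int main(void)
    {
        /* Before tuning (hypothetical numbers) */
        double pathlength_before = 1000000.0;  /* instructions retired per transaction */
        double cpi_before        = 1.0;        /* cycles per instruction retired */

        /* After tuning: fewer instructions, but each one is a bit "slower" */
        double pathlength_after  = 600000.0;
        double cpi_after         = 1.2;

        double cycles_before = cpi_before * pathlength_before;  /* 1,000,000 cycles */
        double cycles_after  = cpi_after  * pathlength_after;   /*   720,000 cycles */

        printf("cycles per transaction: before=%.0f after=%.0f (%.0f%% faster)\n",
               cycles_before, cycles_after,
               100.0 * (cycles_before - cycles_after) / cycles_before);
        return 0;
    }

CPI rose from 1.0 to 1.2, yet the transaction completes in 28% fewer cycles, which is the number that matters.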


Our alternative to considering retirement is to think about deviations from "ideal" execution (those stall cycles). Minimizing the number of instructions and maximizing the efficiency of their execution will be the pillars of this methodology. We must give each equal weight in order to squeeze the maximum performance out of our code. Instruction-reducing ploys such as vectorization make more efficient use of execution resources, while stall analysis improves the mix of instructions to balance resource use.
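
As a quick illustration of the instruction-reducing side, here's a minimal C sketch of the same loop written with scalar arithmetic and with SSE2 intrinsics. The function names are mine, and the vector version assumes 16-byte-aligned data and an even element count; the point is simply that packed instructions retire roughly half as many arithmetic μops for the same work.

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Scalar version: one addition retired per element. */
    void add_scalar(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Vectorized version: each packed add handles two doubles, so roughly
     * half as many arithmetic instructions reach retirement for the same
     * work. Assumes n is even and the pointers are 16-byte aligned. */
    void add_vector(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(&a[i]);
            __m128d vb = _mm_load_pd(&b[i]);
            _mm_store_pd(&c[i], _mm_add_pd(va, vb));
        }
    }

In practice the compiler's vectorizer will often do this transformation for you; the intrinsics are spelled out here only to make the instruction-count savings visible.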


Stalls can be divided into several categories: stalls due to mispredicted branches, execution stalls, front-end stalls, and cycles that are wasted executing μops that never get retired. Using various event registers in the processor, we can measure the cost of some of these behaviors and begin to understand how time is spent executing our application. The process of dividing time spent into these categories of execution is called cycle accounting, and it can reveal a lot about the behavior of an application. We'll start digging into the meat of that next time.
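
To give a feel for the bookkeeping, here's a rough sketch of such a partition in C. The counter values are placeholders for data you would collect with the VTune analyzer or another event-counting tool, and the proportional estimate of cycles wasted on non-retired μops is a simplification of mine, not a formula from David's methodology.

    #include <stdio.h>

    /* A rough, illustrative cycle-accounting partition. All inputs are
     * placeholders for measured event counts. */
    int main(void)
    {
        double total_cycles    = 1.0e9;   /* unhalted core cycles (placeholder) */
        double stall_cycles    = 3.0e8;   /* cycles with no uops dispatched (placeholder) */
        double uops_dispatched = 1.5e9;   /* placeholder */
        double uops_retired    = 1.2e9;   /* placeholder */

        double dispatch_cycles = total_cycles - stall_cycles;

        /* Estimate how many dispatch cycles fed uops that never retired,
         * i.e., work thrown away on mispredicted paths. */
        double wasted_fraction = 1.0 - uops_retired / uops_dispatched;
        double wasted_cycles   = dispatch_cycles * wasted_fraction;
        double useful_cycles   = dispatch_cycles - wasted_cycles;

        printf("useful: %.0f  wasted: %.0f  stalled: %.0f (of %.0f total)\n",
               useful_cycles, wasted_cycles, stall_cycles, total_cycles);
        return 0;
    }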



All opinions here are my own and are not the position of Intel Corporation or its subsidiaries.


Intel, Intel Core, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

For more complete information about compiler optimizations, see our Optimization Notice.