Top Down Methodology for Software Performance Analysis

I wanted to start this blog by discussing software performance optimization in general, and then delve into some cool tools and analysis techniques our team has developed in follow-up blogs. Software performance optimization is a complex science. After years of analyzing software performance, I’ve been taught, and relearned many times after various failures, that you must start with a structured approach to diagnosing performance issues. For example, at Intel we are often so focused on finding and fixing microarchitectural issues that we zoom straight in to find what we are hoping to find, or what our bias from previous experience tells us to expect. However, if you start analysis without looking at the higher-order issues first (e.g. your app is calling sleep unnecessarily, making excessive system calls, etc.), you can find and fix microarchitectural issues until the cows come home and not get any significant performance gain. What is needed is a “Top Down” analysis that keeps your decisions about what to investigate data driven, and avoids wasting time and effort on wild goose chases.

Top Down Methodology
The “Top Down” methodology is an ordered, structured way to analyze application performance. You look at the higher-order performance issues and indicators first; then, based on that data, you follow up with additional investigation and/or dig deeper into the lower tiers of analysis. Below are the three main tiers of performance issues, with some examples:

1) System Level Issues
Examples: I/O bottlenecks (e.g. disk or network), high system call or context switch rate, high page faults, high privileged time, etc.

2) Application Level Issues
Examples: Use of an inefficient algorithm or API, excessive or poor use of synchronization/locking, poor usage of cache (e.g. poor choice or implementation of data structures, unnecessarily iterating through an entire dataset or array multiple times, etc.)

3) Microarchitectural Issues
Examples: Cache misses, branch mispredicts, various code generation or architecture specific issues, high latency instruction usage, etc.
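To make tier 2 a bit more concrete, here is a minimal sketch, in Python with illustrative function names of my own, of one classic application-level pattern: iterating through an entire dataset multiple times when a single pass would do. (In CPython the built-ins may actually win on wall-clock time; the point is the traversal pattern itself, which matters most in compiled code where each extra pass costs memory bandwidth and cache.)

```python
# Sketch of an application-level issue: traversing a dataset three times
# when one pass suffices. Function names are illustrative, not from any API.

def stats_three_passes(data):
    # Three separate full traversals of the dataset.
    return min(data), max(data), sum(data)

def stats_one_pass(data):
    # One traversal computing all three results together.
    lo = hi = data[0]
    total = 0
    for x in data:
        if x < lo:
            lo = x
        if x > hi:
            hi = x
        total += x
    return lo, hi, total

data = list(range(1000))
# Both forms compute the same answer; only the access pattern differs.
assert stats_three_passes(data) == stats_one_pass(data) == (0, 999, 499500)
```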

It is often tempting to skip system-level analysis and proceed straight to the lower-level issues. For instance, the first thing developers usually want to know is “what are the hotspots in my code?”. However, looking at the hotspots wouldn’t do much good if you were at 100% disk utilization with an average disk queue length of 5. You may still need the hotspots to determine why you’re pounding the disk so hard, but you wouldn’t even know that the disk was your primary issue unless you had first looked at the system-wide counters using a tool such as Perfmon or Typeperf, both provided with Windows.
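As a toy illustration of a system-level issue that a hotspot view alone wouldn’t explain, here is a hedged Python sketch (file names and functions are my own invention) of an excessive system call rate: writing a file one byte at a time through an unbuffered descriptor issues thousands of write calls, where a single buffered write issues one. A system-wide counter view would show the syscall storm immediately.

```python
# Sketch of a system-level issue: excessive system calls.
# Illustrative only; real analysis would use system counters, not timers.
import os
import tempfile

payload = b"x" * 4096

def write_per_byte(path):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for i in range(len(payload)):
            os.write(fd, payload[i:i + 1])   # one write(2) syscall per byte
    finally:
        os.close(fd)

def write_once(path):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, payload)                # one syscall for the whole buffer
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    slow = os.path.join(d, "slow.bin")
    fast = os.path.join(d, "fast.bin")
    write_per_byte(slow)
    write_once(fast)
    # Both produce identical output; they differ ~4096x in syscall count.
    with open(slow, "rb") as f1, open(fast, "rb") as f2:
        assert f1.read() == f2.read() == payload
```

On Linux you could confirm the difference with a tracing tool such as strace; on Windows, the System Calls/sec counter in Perfmon would show the same story.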

Once you have determined that you do not have any gross system-level issues, the next step is typically to use a performance tool to analyze where you are spending your time, or “clockticks”. At Intel we would use tools such as PTU, VTune, or Amplifier to accomplish this. Often, just seeing modules or particular functions taking much more time than you would expect can point to specific performance issues.
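If you don’t have those tools handy, the same idea can be sketched with Python’s built-in cProfile as a stand-in; the function names below are illustrative, and the deliberately lopsided workload stands in for a real application:

```python
# Sketch of hotspot analysis using the standard-library cProfile profiler.
import cProfile
import io
import pstats

def hot_function():
    # Deliberately expensive: should dominate the runtime.
    return sum(i * i for i in range(200_000))

def cold_function():
    # Cheap by comparison.
    return sum(range(100))

def workload():
    cold_function()
    return hot_function()

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time so the dominant call paths rise to the top.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
assert "hot_function" in report  # the hotspot shows up in the report
```

The report won’t tell you whether the hot function is slow for a good reason; that judgment, and the follow-up into the lower tiers, is still on you.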

The last tier is to look for microarchitectural issues. With literally hundreds of performance counter events available on our latest processors, we can measure everything from which level of cache we are hitting or missing in, to branch mispredicts, to various other known stalls in the CPU. Of course, that is the bad news as well, since it can be a daunting task to collect, analyze, and understand all of this data. In future blogs I hope to go into some tools and techniques that we have available to accomplish this analysis. In the meantime, if you’re itching to get started, please give our recently released Intel® Performance Bottleneck Analyzer (PBA) a try. PBA collects the most important performance events on our CPUs, performs automatic analysis to identify static and dynamic performance issues with approximate costs, and displays the analysis at various granularities in a graphical interface.

This is just a general guideline for doing a data-driven, structured performance analysis. Also, note that there is often overlap between the performance tiers and analysis techniques. As an example, you may notice that you are disk bound during system-level analysis. Then in your hotspots you notice that you are spending a lot of time in memory management code (application-level analysis). Then you also notice many cache misses while looking at the CPU event counters (microarchitectural analysis). As you go through the levels of analysis, the data are like pieces of a puzzle that you fit together to support, and eventually prove, your theory. Sound like fun? It is, so happy hunting!

By following a top down methodology and using a data-driven approach to performance analysis, you can save yourself a lot of time and anguish. In future blogs, I hope to post more about our recently released toolset, the Intel® Performance Bottleneck Analyzer (PBA), and some of the cool features we have made available in it, such as the Top Down Counter Analysis Methodology, which applies the same top down principle to the hundreds of CPU event counters available on our latest microarchitecture, codenamed Sandy Bridge.

For more complete information about compiler optimizations, see our Optimization Notice.