As the first in a collection of articles, this article helps an experienced software programmer like you understand how to optimize an application for Hyper-Threading Technology using the Intel® VTune™ Performance Analyzer. The discussion assumes that you already understand how to use this tool to sample on various processor events and to generate a call graph. If you are new to using this tool, you should review the Getting Started Tutorial that comes with the analyzer before continuing with this article. (See the Intel® VTune™ Performance Analyzer web page for more information.)
So, you are finally ready to take the big step and start optimizing an application to use Hyper-Threading Technology. You already know a bunch of things about Hyper-Threading Technology, such as how one processor can execute two threads simultaneously, how some of the processor's resources are shared and others are replicated, and how new tools like OpenMP and the Intel VTune Performance Analyzer make the transition from single-threaded code to multi-threaded code easier. You also know that in order to maximize the performance benefit of Hyper-Threading Technology, your application must contain threads optimized for performance. But, where and how to start optimizing?
The first thing to do is to get the latest copy of the Intel VTune Performance Analyzer. Version 6.1 includes processor event counters specific to Hyper-Threading Technology and integrates Automatic Hotspot Analysis, formerly called AHA, with the Intel Tuning Assistant feature. This tool is an important part of the optimization process for both single-threaded and multi-threaded applications. With it, you can locate time-based hotspots - the areas of the application that consume the most time and are therefore the focal point of optimizations - and root out thread-specific issues, like load imbalance, regions of high overhead, idle time, and processor architectural issues.
You already know that in order to use the Intel VTune Performance Analyzer you must create a benchmark that runs your application, or at least the portions that you want to optimize, in a repeatable fashion. You can use the same benchmark for a multi-threaded application as you run for a single-threaded application, with the exception that it is very helpful if the benchmark can be run repeatedly without any user intervention. This capability is a timesaver because the Intel Tuning Assistant feature needs to collect multiple processor event counters that, unfortunately, cannot all be collected during one sampling session. And, since we are talking about optimizations, why not start by optimizing the analysis process? Once the benchmark is ready for action, it is time to look for places to optimize.
Discovering What to Optimize and Where to Add Threads
Before you begin changing your application, you need to decide what to change and how best to change it. The decision process begins by locating the application's hotspots using time-based sampling. Most applications contain a small number of very significant hotspots, like the one shown in Figure 1, and these are the places where you should focus your optimization energy.
Figure 1. Sampling Clearly Shows a Hotspot
But don't jump in and start threading these areas right away. These hotspots are "local" hotspots, and optimizations on local hotspots should focus first on improvements that don't use threads, like reducing the number of cache misses, branch mispredictions, and so on. Before reading any further, stop to think about where the hotspots are located in your application, and ask yourself why those regions are taking so much time and how you might best improve them. Then, go improve them. Having done so, come back and finish reading.
OK, you're back reading and now you're confident that all the easy, or should I say easier, local hotspot optimizations have been made. Now, you want to improve performance even more by using threads.
Threading is best used on the global hotspots. To find them, you either need to be very familiar with the application or need to turn to the call graph feature of the Intel VTune Performance Analyzer. Your goal is to find large chunks of the call tree that together form a collective global hotspot and are therefore candidates for optimization using threading. Identifying the big-picture hotspot is important for multi-threaded optimizations because the best optimizations focus on dividing work evenly among multiple threads, and the more work you identify, the greater your chances for a robust multi-threaded solution.
Once you have identified the global hotspots using a combination of time-based sampling, call graph, and your own knowledge, you can begin designing and developing a multi-threaded solution that decomposes the largest possible problem into independent and balanced threads of execution.
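As a rough illustration of decomposing a problem into balanced pieces, the following Python sketch divides a fixed number of independent work items among a given number of threads so that no thread receives more than one item more than any other. The function name split_evenly and its parameters are hypothetical, and the sketch assumes every item costs about the same to process, which your own application may not guarantee.

```python
def split_evenly(n_items, n_workers):
    """Return (start, stop) index pairs that cover range(n_items).

    Chunk sizes differ by at most one item, so each worker gets
    (almost) the same number of items.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    base, extra = divmod(n_items, n_workers)
    chunks = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional item each.
        size = base + (1 if worker < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


print(split_evenly(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Equal item counts only produce equal run times when the items themselves are uniform, which is why the measured behavior still has to be checked with the analyzer.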
Analyzing Load Imbalance
By now, you have identified the global hotspots, designed a robust solution, coded and tested it, and you are now looking for ways to make more performance optimizations on your multi-threaded code. Luckily for you, the Intel VTune Performance Analyzer contains many features, some that you have even used before, that help detect the four predominant thread performance issues: load imbalance, excessive overhead, idle time, and processor architectural issues like 64K memory aliasing and false sharing.
Load imbalance occurs when one or more of the processors, logical or physical, are sitting idle waiting for others to finish work. Most of the time, load imbalance is simply caused by one thread finishing its allocated work before the others. It usually gets worse as the number of processors increases, because it becomes harder to split up the work into progressively smaller chunks of execution that take the same amount of time. The Intel VTune Performance Analyzer shows load imbalance in two ways. First, you can inspect the amount of time that threads take to execute using both sampling and call graph. Figure 2 shows the sampling results where one thread contains significantly more samples than the others do.
Figure 2. Load Imbalance Detected Using Sampling
In Figure 3, you can look at the Self Time column to see how the same results are visualized using call graph.
Figure 3. Load Imbalance Detected Using Call Graph and the Self Time Column
On its own, this display of thread execution times does not indicate load imbalance. Only with the additional design knowledge of how the application is supposed to work - for example, you identify the threads that should take the same time to execute - will you be sure that this is conclusive evidence of a load imbalance.
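To put a number on what the per-thread views show, you could copy the sample counts out of the analyzer by hand and compare the busiest thread against the average. The Python sketch below does that arithmetic. The function name, the thread names, and the counts are all hypothetical, and the comparison is only meaningful for threads that were designed to take the same time.

```python
def imbalance_ratio(samples_per_thread):
    """Return busiest-thread samples divided by the mean samples per thread.

    1.0 means the threads were perfectly balanced; larger values mean the
    busiest thread did proportionally more work than the average thread.
    """
    counts = list(samples_per_thread.values())
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one thread with samples")
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical counts read from a sampling session.
samples = {"worker-0": 4000, "worker-1": 2000, "worker-2": 2000}
print(imbalance_ratio(samples))
```

With these made-up counts, the busiest thread collected one and a half times the average, which would be worth investigating if all three workers were meant to share the work equally.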
Figure 4 shows the other type of load imbalance. When you view the CPU information by clicking the CPU button on the toolbar, the Analyzer colors the samples by processor. In an ideally balanced situation, each processor would execute equal amounts of work, and that balance would show up as an equal amount of color distributed across all of the samples on the graph. However, in Figure 4, you can see that the colors are not equal, meaning that one processor is doing more work than the other. Again, on its own this information does not automatically mean you have a load imbalance. However, the designers of this application should know whether they expect these threads to consume the same amount of time or, in other words, to be load balanced. Only after knowing the expectations of how these threads were designed to run can you make an accurate determination of the load balance. Improving the load balance, for the most part, means rethinking the design and splitting up the work into more evenly distributed workloads.
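One common way to split work more evenly when the tasks have unequal costs is a greedy "largest task first" assignment: sort the tasks by estimated cost and always hand the next one to the least-loaded worker. This is a general scheduling heuristic, not something the analyzer does for you, and it does not always find the best possible split. The sketch below assumes you already have a cost estimate per task. The function name, the task names, and the costs are hypothetical.

```python
import heapq


def assign_tasks(task_costs, n_workers):
    """Greedily assign tasks to workers, largest estimated cost first.

    Returns (assignment, loads): the task names given to each worker and
    the total estimated cost each worker ends up with.
    """
    # Min-heap of (current load, worker index); pop gives the least-loaded worker.
    heap = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    by_cost = sorted(task_costs.items(), key=lambda item: item[1], reverse=True)
    for task, cost in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + cost, worker))
    loads = [sum(task_costs[task] for task in tasks) for tasks in assignment]
    return assignment, loads


# Hypothetical per-task cost estimates, for example in milliseconds.
costs = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
assignment, loads = assign_tasks(costs, 2)
print(assignment, loads)
```

For these made-up costs both workers end up with the same total, whereas handing out the tasks in their original order, two or three per worker, would not.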
Figure 4. Load Imbalance Shown by Viewing Samples Per Processor
Defined as anything that takes away from the continual progress made on a workload, overhead is a nightmare for performance, whether you are using threads or not. However, with multiple processors comes the added chance you will inadvertently multiply the overhead. For example, if you create too many threads, task-swapping overhead can become a significant source of inefficiency, not to mention the performance loss associated with all the calls to create the threads in the first place. Luckily, the Intel VTune Performance Analyzer can detect overhead.
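A common way to avoid paying thread-creation cost for every piece of work is to create a fixed pool of threads once and reuse it. The Python sketch below sizes the pool to the number of logical processors the operating system reports. The function names and the work performed are hypothetical, and the example only illustrates the structure: in CPython, the global interpreter lock keeps CPU-bound Python code from running in parallel on threads, so a native-code application is where this pattern actually pays off.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Stand-in for real work: sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def run_with_pool(chunks, max_workers=None):
    """Process every chunk on a reusable pool instead of one thread per chunk."""
    # Default to one thread per logical processor reported by the OS.
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the input order of the chunks in its results.
        return list(pool.map(process_chunk, chunks))


print(run_with_pool([[1, 2], [3], [4, 5, 6]]))  # [5, 9, 77]
```

However many chunks you submit, the number of threads created stays at the pool size, so thread creation and task swapping do not grow with the workload.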
Viewing the sampling results by Sampling Processes, as shown in Figure 5, gives you the first clear view of overhead. In this view, the overhead probably lies in all or most of the samples that were not collected in the application's process. Usually, these samples belong to the system processes, background tasks, and processes running unexpectedly.
Figure 5. Overhead Shown By Looking at Sampling Per Process
To see the samples that are part of your application's process on the Sampling Process view, select just your process or processes and then select View by Module, as shown back in Figure 1. In this view, samples not collected in the application's module are direct calls into other modules, such as the operating system, and these calls could be the cause of overhead.
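The same arithmetic applies to both views: add up the samples that fall outside your own process (or module) and divide by the total. The Python sketch below assumes you have copied the counts out of the analyzer by hand. The function name, the process names, and the numbers are hypothetical.

```python
def outside_share(samples_by_name, own_names):
    """Return the fraction of samples collected outside the named processes or modules."""
    total = sum(samples_by_name.values())
    if total == 0:
        raise ValueError("no samples collected")
    own = sum(count for name, count in samples_by_name.items() if name in own_names)
    return (total - own) / total


# Hypothetical per-process sample counts from one sampling session.
by_process = {"myapp.exe": 7000, "System": 2000, "explorer.exe": 1000}
print(outside_share(by_process, {"myapp.exe"}))
```

A large share does not prove that the samples are overhead, since some of them may be legitimate work done on your behalf, but it tells you where to look first.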
You've made a good start by using these two methods, looking for overhead outside your process and overhead caused by your process, but these two views alone might not give you the complete picture. Overhead could still exist in your module, and to detect it, you must drill down to the function level, or even the source code level, to see what code is executing and what parts take a reasonable or unreasonable amount of time.
In addition to time-based sampling, call graph can also be very helpful to track down overhead. Look around the call graph trying to identify a function that is taking much longer than expected. For example, Figure 6 shows that the function SendMessage is taking a surprisingly long time and should be investigated.
Figure 6. Call Graph Shows SendMessage Is Taking an Unexpectedly Long Amount of Time
The Special Case of Idle Time
Load imbalance and synchronization issues can create idle time, which, of course, is bad. Luckily, you can detect idle time with a variety of simple techniques. The first method involves looking for samples collected in module processr.sys. Looking back at Figure 1, you can see that samples were collected in the module labeled processr.sys, meaning that the processor was idle. With operating system symbols installed, drilling down into module processr.sys takes you to a function named something similar to AcpiIdle, and the disassembly shows a nearby halt (hlt) instruction. Another method is to compare the number of time-based samples collected with the expected number of samples. When the number of samples collected is less than expected, usually 1000 per second, the deficit indicates that the processor went to sleep and the clock tick counter stopped incrementing. The extra clock ticks might show up in Ring 0 or not at all, depending upon the operating system, upon any power saving features enabled in the BIOS, and upon what the other logical processor is executing. The Intel VTune Performance Analyzer displays this information in the Legend, as shown in Figure 4.
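The sample-deficit method comes down to simple arithmetic, sketched here in Python. The function name is hypothetical, and the default rate of 1000 samples per second is taken from the typical figure above. Use whatever rate your own sampling session was configured with.

```python
def estimate_idle_fraction(collected, duration_s, rate_hz=1000):
    """Estimate the fraction of time the processor was idle from a sample deficit.

    collected  -- time-based samples actually collected
    duration_s -- length of the sampling session in seconds
    rate_hz    -- expected samples per second (assumed, typically 1000)
    """
    expected = rate_hz * duration_s
    if expected <= 0:
        raise ValueError("duration and rate must be positive")
    # More samples than expected is treated as no deficit at all.
    deficit = max(expected - collected, 0)
    return deficit / expected


# Hypothetical session: 10 seconds long, 7500 samples collected.
print(estimate_idle_fraction(7500, 10))  # 0.25
```

In this made-up session, 10,000 samples were expected and 2,500 are missing, which suggests the processor was asleep for roughly a quarter of the run.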
Another technique uses call graph to detect idle time. In call graph, idle time is called Wait Time, which is the amount of time that a function (self wait time) or that a portion of the call tree (total wait time) went idle, as shown in Figure 7.
Figure 7. Call Graph Detecting Idle Time
Finally, the counter monitor feature also detects processor idle time. The Processor performance object contains a % Processor Time counter that tracks the time the processor is busy, that is, not idle. To see the idle time directly, in the Process performance object select the % Processor Time counter for the Idle instance. This performance object is shown in Figure 8.
Figure 8. Counter Monitor Detecting Idle Time
Reducing the amount of idle time requires some amount of recoding. Sometimes, the change is as easy as calling a different operating system function, like PostMessage instead of SendMessage or EnterCriticalSection instead of WaitForSingleObject. Other times, you might have to redesign how the work is split among the threads to achieve best results.
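The difference between the two messaging styles can be illustrated with a small Python analogy. This is not the Windows API. The MessageLoop class and its post and send methods are hypothetical stand-ins that only mimic the behavior that matters here: send blocks the caller until the receiving thread has handled the message, while post queues the message and returns immediately so the caller can keep working.

```python
import queue
import threading

_STOP = object()  # private sentinel that tells the loop thread to exit


class MessageLoop:
    """Hypothetical handler thread, loosely analogous to a window's message loop."""

    def __init__(self, handler):
        self._handler = handler
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            message, done = self._inbox.get()
            if message is _STOP:
                break
            try:
                self._handler(message)
            finally:
                if done is not None:
                    done.set()  # wake up a caller blocked in send()

    def post(self, message):
        """Queue the message and return at once; the caller never waits."""
        self._inbox.put((message, None))

    def send(self, message):
        """Queue the message and block until the handler has finished with it."""
        done = threading.Event()
        self._inbox.put((message, done))
        done.wait()

    def close(self):
        """Handle everything already queued, then stop the loop thread."""
        self._inbox.put((_STOP, None))
        self._thread.join()


handled = []
loop = MessageLoop(handled.append)
loop.send("a")   # caller idles until "a" has been handled
loop.post("b")   # caller continues immediately
loop.close()
print(handled)   # ['a', 'b']
```

Every call to send is a point where the calling thread can sit idle, which is why replacing it with post, where the design allows, can remove wait time.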
The Intel Tuning Assistant feature is the quickest and easiest way to detect processor issues. The brute force method is to configure sampling to collect all the events from the Performance Tuning for Hyper-Threading Technology category, as shown in Figure 9.
Figure 9. Most Useful Processor Events for Hyper-Threading Technology
After running a sampling session, select your module and pick the menu item Get Tuning Advice (F8). Advice based on the issues related to the collected processor events will appear, as shown in Figure 10. Sometimes, a second sampling session, with additional event counters from the other Performance Tuning event groups, may be required.
Figure 10. Advice Received From The Intel Tuning Assistant
This brute force approach has a drawback. Since collecting multiple counters means executing the application multiple times, analysis time may be wasted collecting counters that do not apply to your application. You can save this time by using only the counters that are relevant to your application. For example, if you know your application does not contain a bunch of random branches, you don't have to sample the branch counters. Every counter that you add could require an additional sampling and calibration run, depending upon the combination of events that are selected.
Using threads for optimizations can result in great performance gains. But even the best solutions will run into some amount of thread-specific headaches like correctness, load balancing, overhead, and processor issues. The same tool you already use for single-threaded applications, the Intel VTune Performance Analyzer, addresses these multi-threaded issues. The sampling and call graph features can detect thread-specific issues such as idle time, system overhead, and load imbalance. Furthermore, the Intel Tuning Assistant detects the processor architectural issues, even the ones specific to Hyper-Threading Technology, without requiring you to become an expert in the processor architecture or even to read the processor manuals.
About the Author
Richard Gerber joined Intel in 1991, and through years of learning-by-doing, he has become an expert in performance programming and optimizations. He has worked on numerous multimedia projects, 3D libraries, and computer games. As a software engineer, he works on the Intel® VTune™ Performance Analyzer and trains developers on optimization techniques.