Download ArticleDownload The Serial On-Ramp to the Multicore Highway: Preparing to Parallelize Code [PDF 143KB]
Without proper preparation, parallelizing code might provide no lift, or worse, it could diminish performance. What are the rules of the road for getting code ready for a parallel world? We speak to the experts.
By Andrew Binstock
When writing software for games and data visualization, it is tempting to optimize code on the fly. That is, as you code, you make small design and implementation tweaks to the code-change a data type here, fix a data structure there, execute function B before function A, and so on. The temptation to make these small modifications in the name of performance should be resisted-its goal is at odds with the primary mission of most developers: writing and delivering reliable functionality. Later, when the code is working, true systemic optimization and parallelization can be profitably undertaken.
This article discusses how these two goals-coding and optimizing on the fly-are opposed and how performance experts approach performance improvement. It explains how they systematically prepare their code for optimization and how the optimization process is done so that their efforts are effective and don’t interfere with working code.
Functionality vs. PerformanceA safe rule in software development is that if an eminent computer scientist recommends not doing something, it’s worth thinking about whether you too should avoid it. When two eminent computer scientists warn against the same practice, then the entire burden of proof falls on the developer who disregards the advice. Perhaps the only adage enunciated in identical words by two prominent computer scientists is: “Premature optimization is the root of all evil.” This maxim, first stated by Tony Hoare, the inventor of Quicksort, became widely known when it was later quoted by Donald Knuth. (For this reason, the quotation is often attributed to Knuth.)
In Hoare’s and Knuth’s heyday, the temptation to cut corners for the sake of speed was nearly overwhelming because all hardware was slow, often very slow. Using shortcuts that could save processor cycles or reduce I/O was sometimes the only way to guarantee that jobs could finish within the allotted time.
Coming forward 40 years, we find that today’s systems have extraordinary horsepower and that the need for performance is no longer related to completing jobs on time. Rather, as we find in the gaming industry, performance is an imperative for competitive reasons. Games that have better performance and that can leverage the hardware well fare better in the market than games that don’t. And given the intensely competitive games market, this advantage can spell the difference between success and failure. That’s a lot of pressure to optimize!
Unfortunately, for developers and managers who deal with this pressure, it can lead to cutting corners in the same way that Knuth and Hoare railed against-writing bad code just to go faster. Every time software is optimized or parallelized before a feature is correctly implemented, costs accumulate quickly. What are these costs?
1) Impaired code readability. Optimization necessarily transforms algorithms and their implementation. Consequently, getting implementations right, especially in a parallel context in which coding is already difficult, makes the task more complex. Moreover, debugging is more difficult as it can be difficult to determine what non-intuitive code is doing.
2) Harder to maintain the code. Programmers who come after the original developer will find optimized code much more difficult to read. Even simple changes can become mysterious code snippets that subsequent developers are loath to touch. Such code is often marooned forever. Even if the code is actually slowing the performance, developers unsure of what it’s doing won’t touch it for fear of breaking a feature. Consequently, optimization or parallelization, when done unwisely, makes code unmaintainable.
3) Generally does not improve performance. When performance engineers set to work on tuning code, the first thing they do is profile the code. They don’t use their long experience in optimization to guess at what code to parallelize or to tune. So, neither should developers. Careful measurement is the only sure way to identify what needs to be optimized and how. Parallelize the wrong code or do it inexpertly and performance can suffer.
So then how should code be parallelized and optimized?
Make it right. Then, make it parallel.The first and most important step is to get the code working correctly. This means getting the algorithm right and doing all the standard testing on it to make sure it works as requested by the user and in accordance with the specs. In most cases, this requirement means writing serial code and assuring its correctness as serial code.
Because in your later optimization and parallelization, you will need to validate that you have not changed the results of the code, you will want to write plenty of unit tests and even functional tests. Running these on the parallelized code will give you confidence that your optimization has not accidentally unhinged your earlier validated implementation.
Once you’re sure the code is working correctly and you’ve got the unit tests ready to go, you’re still not quite ready. The next thing to do is make sure that the code contains no bugs, such as memory leaks. If leaks are hard to spot in regular code, they’re even more difficult to find in threaded code. Debugging threaded code is much simpler than it was even a few years ago, but despite that progress, it remains considerably harder than cleaning up single-threaded code. So, this is the point at which to check for memory leaks, excess memory consumption, and the like.
In sum, you want the code to be as clean as possible before you begin surgery. One last check that you’ll find well worth your time is to use Intel® Parallel Inspector (/en-us/intel-parallel-inspector/) to make sure your code does not contain bugs or poor designs that can cause problems in a parallel implementation. (Figure 1) The tool, which is a plugin to Microsoft Visual Studio* and works on C/C++ code, is a handy way to make sure you’re ready to parallelize your code. Note that while Intel Parallel Inspector is available as a standalone tool, it also comes bundled in Intel® Parallel Studio (/en-us/intel-parallel-studio-home/), along with Intel® Parallel Amplifier (parallel performance analysis) and Intel® Parallel Composer (parallel compiler and debugger). I’ll come back to this product shortly.
Figure 1. Panels in Intel® Parallel Inspector, showing a lock problem (top) with detail (bottom).
What to ParallelizeNow that the code is known good, has no unwanted side effects, and is supported by a set of unit tests, it’s time to figure out what to parallelize. The initial temptation might be to want to parallelize as much as possible: tear the code into discrete strings of operations-functional decomposition-and run those in parallel. Then take tasks that handle lots of data (such as rendering) and process the data in parallel (data decomposition). Between the two, you might think that you can put the CPU cores and the GPU engine to work at maximum load, thereby turning your machine into a code-blasting beast. This admirable goal, however, is often thwarted by reality.
The first reality to be addressed is that processor cores today execute roughly 2.5 to 3 billion instructions per second. At that speed, many tasks that execute serially are already running very, very fast. This is confirmed by the fact that much of the time, processor cores are running at far less than 100% capacity. Hence, there is lots of code that will not derive much, if any, benefit from being parallelized.
Because parallelizing and optimizing are tasks that carry costs-notably, the time required to change the code, as well as the added complexity of debugging and code maintenance-it’s imperative to parallelize code only where a true performance advantage can be gained. To find such code, it is necessary to profile the software.
Performance profilers, such as Intel Parallel Amplifier (Figure 2) and Intel® VTune™ Performance Analyzer, among others, can show hot spots where performance has become constrained or impeded.
Figure 2. Intel® Parallel Amplifier shows hot spots as a list of functions or methods that took the longest to execute. The function is selected in the left panel, its execution graph appears in the middle, and the call stack appears on the right. Other panels can show execution time to individual code statements. (Courtesy Intel Corp.)
Good performance analysis tools highlight hotspots in descending order of cost (Figure 2); that is, the amount of execution time consumed by the responsible function(s). The best policy is to work on the major hot spots and then progress to the lesser ones as needed. In most applications, a handful of hot spots stand out from the rest of the functions, and so it is easy to know where to focus.
On some programs, however, it can be more difficult to know. As a general rule, functions that consume less than 5% of total execution time will rarely deliver benefits that are worth the effort of parallelization. That is, the reduction of 5% of total runtime to a lower number will frequently not be perceptible, especially in games, visualization, and other software with a computer-human interface.
Performance measurement should also show the usage of system resources, so as to provide a view (as in Figure 3) of how well the workload is being distributed across processing resources.
Figure 3. Intel® Parallel Amplifier, here showing the use of processor cores. The length of the colored bars in the middle panel show core utilization while the highlighted function (left panel) is running. The graph in the right panel shows that the function mostly uses one core. (Courtesy Intel Corp.)
When examining hardware execution: the CPU-GPU relationship should be given special attention. In all graphically intensive software, the work done by these processing units needs to be reasonably matched in order to obtain the maximum system performance. Because of this, it can be valuable to profile performance running the code on systems with different graphical adaptors. This will reveal how much the GPU aids or hurts performance and whether the hot spots change location and size as the GPU’s power is changed.
This is important data, because if hot spots are being caused by the GPU’s performance, parallelizing them is not likely to solve the problem. Rather, the graphics adapter needs to be upgraded or the entire interface to the GPU should be examined to determine what better forms of data streaming-if any-will improve the overall performance.
If the fundamental problem is that the GPU is too slow for the demand placed on it, then the question of optimization needs to be re-evaluated more holistically than just fixing execution hot spots. As I mentioned earlier, in an ideal scenario the CPU and GPU are properly matched. A high-end CPU and a low-end GPU will result in poor performance, with the CPU mostly waiting. If the relationship is reversed, the performance will again be slow with the GPU doing the waiting. In both cases, performance is suboptimal. (For more information on this issue, see the book “Video Game Optimization,” by Ben Garney and Eric Preisz. http://is.gd/eQcwt )
Which Hot Spot?Once the costliest hot spots have been identified and the GPU’s role understood, it’s time to begin working on the code.
If there is one hot spot that is disproportionately bigger than the others, it should be the first target. Before touching the code, though, it is wise to examine the hot spot function and determine its suitability to parallelization. Other articles on this site discuss functional and data decomposition and these approaches should be understood before undertaking this analysis. In its simplest form, the question is: what can be made parallel? If large blocks of data are being handled-as is often the case in graphics-oriented software-these blocks can often be broken into smaller chunks, each of which can then be processed by a separate thread, with every thread performing the same function. The size of the block and the number of threads will be discussed in the next article in this series, but a reasonable rule of thumb is to be liberal on both the number of threads and the number of chunks. The optimal number is determined by profiling the software again and seeing what works and what doesn’t. However, danger lurks in optimizing for a specific hardware configuration. The software will surely run on older and newer platforms with more or less cores, so it’s generally wise to use rough guidelines for deciding how to divvy up data, rather than tuning the parallelization to one specific set up.
Sometimes the hot spot is not processing large amounts of data. Instead, it’s doing a lot of different tasks, some of which are slow. In such a case, the principal approach is to see which of those tasks can be run in parallel with the slow tasks. Ideally, these tasks are independent of the slow tasks or have a manageable dependence. For example, disk I/O is a very slow operation that cannot be substantially improved programmatically. However, it is possible to create a task that begins to process the disk data as soon as it’s read, rather than waiting for all the data to be read in before proceeding. This approach does not diminish the time lost by the slow task, but it does improve the amount of work done during that period. When slow tasks dominate a hot spot and they don’t allow much parallelization, it’s often useful to reconsider design. In the case of disk I/O, for example, can caching data reduce the need for I/O? Can the data be arranged so that the first reads provide enough information for some processing tasks to be spun off in parallel, etc.?
At some point, you might conclude that the hot spot you’re examining is just not amenable to parallelization. If so, it’s important to accept this and move on. You should examine all the primary hot spots to determine the ease with which they can be made parallel. It’s often best to start working on the easy ones first. A common experience is that tackling the easy hot spots provides sufficient performance lift that it’s no longer urgent-sometimes no longer economical-to resolve the remaining one or two hot spots. This judgment cannot be made, however, until all the major hot spots have been examined and assessed.
A Final TipIn the next installment, I’ll discuss how to do the work of parallelizing the hot spots. Meanwhile, there is one useful tip I want to share. An indispensable tool in doing this work is a notebook. This can be an online notebook or a physical binder with paper. Parallelizing code generates lots of data that is valuable to capture. For example, profiling the code on multiple machines to assess the GPU’s effect will generate a slew of numbers that are very definitely worth saving. You will want to refer to them as you consider various optimizations.
An easy way to track the numbers is in a spreadsheet, although I’ve found a simple table in a wiki works somewhat better and it enables easy sharing. The notebook should also hold notes and observations about the suitability for parallelization of the hot spots. What did you observe? What might work? What won’t? This information has two benefits: 1) it leaves bread crumbs for future optimizations and 2) it helps you see similarities between hot spots and possibly identify changes that will improve more than one hot spot. The discipline of taking notes during coding is largely a dying practice, but in optimization, it repays the time spent many times over. If the final form of your notebook is indeed electronic, make sure to check it into the source code management system.