Download Article
Download The Serial On-Ramp to the Multicore Highway: Preparing to Parallelize Code [PDF 143KB]Without proper preparation, parallelizing code might provide no lift, or worse, it could diminish performance. What are the rules of the road for getting code ready for a parallel world? We speak to the experts.
By Andrew Binstock
When writing software for games and data visualization, it is tempting to optimize code on the fly. That is, as you code, you make small design and implementation tweaks to the code-change a data type here, fix a data structure there, execute function B before function A, and so on. The temptation to make these small modifications in the name of performance should be resisted-its goal is at odds with the primary mission of most developers: writing and delivering reliable functionality. Later, when the code is working, true systemic optimization and parallelization can be profitably undertaken.
This article discusses how these two goals-coding and optimizing on the fly-are opposed and how performance experts approach performance improvement. It explains how they systematically prepare their code for optimization and how the optimization process is done so that their efforts are effective and don’t interfere with working code.
Functionality vs. Performance
A safe rule in software development is that if an eminent computer scientist recommends not doing something, it’s worth thinking about whether you too should avoid it. When two eminent computer scientists warn against the same practice, then the entire burden of proof falls on the developer who disregards the advice. Perhaps the only adage enunciated in identical words by two prominent computer scientists is: “Premature optimization is the root of all evil.” This maxim, first stated by Tony Hoare, the inventor of Quicksort, became widely known when it was later quoted by Donald Knuth. (For this reason, the quotation is often attributed to Knuth.)In Hoare’s and Knuth’s heyday, the temptation to cut corners for the sake of speed was nearly overwhelming because all hardware was slow, often very slow. Using shortcuts that could save processor cycles or reduce I/O was sometimes the only way to guarantee that jobs could finish within the allotted time.
Coming forward 40 years, we find that today’s systems have extraordinary horsepower and that the need for performance is no longer related to completing jobs on time. Rather, as we find in the gaming industry, performance is an imperative for competitive reasons. Games that have better performance and that can leverage the hardware well fare better in the market than games that don’t. And given the intensely competitive games market, this advantage can spell the difference between success and failure. That’s a lot of pressure to optimize!
Unfortunately, for developers and managers who deal with this pressure, it can lead to cutting corners in the same way that Knuth and Hoare railed against-writing bad code just to go faster. Every time software is optimized or parallelized before a feature is correctly implemented, costs accumulate quickly. What are these costs?
1) Impaired code readability. Optimization necessarily transforms algorithms and their implementation. Consequently, getting implementations right, especially in a parallel context in which coding is already difficult, makes the task more complex. Moreover, debugging is more difficult as it can be difficult to determine what non-intuitive code is doing.
2) Harder to maintain the code. Programmers who come after the original developer will find optimized code much more difficult to read. Even simple changes can become mysterious code snippets that subsequent developers are loath to touch. Such code is often marooned forever. Even if the code is actually slowing the performance, developers unsure of what it’s doing won’t touch it for fear of breaking a feature. Consequently, optimization or parallelization, when done unwisely, makes code unmaintainable.
3) Generally does not improve performance. When performance engineers set to work on tuning code, the first thing they do is profile the code. They don’t use their long experience in optimization to guess at what code to parallelize or to tune. So, neither should developers. Careful measurement is the only sure way to identify what needs to be optimized and how. Parallelize the wrong code or do it inexpertly and performance can suffer.
So then how should code be parallelized and optimized?
Make it right. Then, make it parallel.
The first and most important step is to get the code working correctly. This means getting the algorithm right and doing all the standard testing on it to make sure it works as requested by the user and in accordance with the specs. In most cases, this requirement means writing serial code and assuring its correctness as serial code.Because in your later optimization and parallelization, you will need to validate that you have not changed the results of the code, you will want to write plenty of unit tests and even functional tests. Running these on the parallelized code will give you confidence that your optimization has not accidentally unhinged your earlier validated implementation.
Once you’re sure the code is working correctly and you’ve got the unit tests ready to go, you’re still not quite ready. The next thing to do is make sure that the code contains no bugs, such as memory leaks. If leaks are hard to spot in regular code, they’re even more difficult to find in threaded code. Debugging threaded code is much simpler than it was even a few years ago, but despite that progress, it remains considerably harder than cleaning up single-threaded code. So, this is the point at which to check for memory leaks, excess memory consumption, and the like.
In sum, you want the code to be as clean as possible before you begin surgery. One last check that you’ll find well worth your time is to use Intel® Parallel Inspector (/en-us/intel-parallel-inspector/) to make sure your code does not contain bugs or poor designs that can cause problems in a parallel implementation. (Figure 1) The tool, which is a plugin to Microsoft Visual Studio* and works on C/C++ code, is a handy way to make sure you’re ready to parallelize your code. Note that while Intel Parallel Inspector is available as a standalone tool, it also comes bundled in Intel® Parallel Studio (/en-us/intel-parallel-studio-home/), along with Intel® Parallel Amplifier (parallel performance analysis) and Intel® Parallel Composer (parallel compiler and debugger). I’ll come back to this product shortly.

What to Parallelize
Now that the code is known good, has no unwanted side effects, and is supported by a set of unit tests, it’s time to figure out what to parallelize. The initial temptation might be to want to parallelize as much as possible: tear the code into discrete strings of operations-functional decomposition-and run those in parallel. Then take tasks that handle lots of data (such as rendering) and process the data in parallel (data decomposition). Between the two, you might think that you can put the CPU cores and the GPU engine to work at maximum load, thereby turning your machine into a code-blasting beast. This admirable goal, however, is often thwarted by reality.The first reality to be addressed is that processor cores today execute roughly 2.5 to 3 billion instructions per second. At that speed, many tasks that execute serially are already running very, very fast. This is confirmed by the fact that much of the time, processor cores are running at far less than 100% capacity. Hence, there is lots of code that will not derive much, if any, benefit from being parallelized.
Because parallelizing and optimizing are tasks that carry costs-notably, the time required to change the code, as well as the added complexity of debugging and code maintenance-it’s imperative to parallelize code only where a true performance advantage can be gained. To find such code, it is necessary to profile the software.
Performance profilers, such as Intel Parallel Amplifier (Figure 2) and Intel® VTune™ Performance Analyzer, among others, can show hot spots where performance has become constrained or impeded.

Good performance analysis tools highlight hotspots in descending order of cost (Figure 2); that is, the amount of execution time consumed by the responsible function(s). The best policy is to work on the major hot spots and then progress to the lesser ones as needed. In most applications, a handful of hot spots stand out from the rest of the functions, and so it is easy to know where to focus.
On some programs, however, it can be more difficult to know. As a general rule, functions that consume less than 5% of total execution time will rarely deliver benefits that are worth the effort of parallelization. That is, the reduction of 5% of total runtime to a lower number will frequently not be perceptible, especially in games, visualization, and other software with a computer-human interface.
Performance measurement should also show the usage of system resources, so as to provide a view (as in Figure 3) of how well the workload is being distributed across processing resources.

When examining hardware execution: the CPU-GPU relationship should be given special attention. In all graphically intensive software, the work done by these processing units needs to be reasonably matched in order to obtain the maximum system performance. Because of this, it can be valuable to profile performance running the code on systems with different graphical adaptors. This will reveal how much the GPU aids or hurts performance and whether the hot spots change location and size as the GPU’s power is changed.
This is important data, because if hot spots are being caused by the GPU’s performance, parallelizing them is not likely to solve the problem. Rather, the graphics adapter needs to be upgraded or the entire interface to the GPU should be examined to determine what better forms of data streaming-if any-will improve the overall performance.
If the fundamental problem is that the GPU is too slow for the demand placed on it, then the question of optimization needs to be re-evaluated more holistically than just fixing execution hot spots. As I mentioned earlier, in an ideal scenario the CPU and GPU are properly matched. A high-end CPU and a low-end GPU will result in poor performance, with the CPU mostly waiting. If the relationship is reversed, the performance will again be slow with the GPU doing the waiting. In both cases, performance is suboptimal. (For more information on this issue, see the book “Video Game Optimization,” by Ben Garney and Eric Preisz. http://is.gd/eQcwt )
Which Hot Spot?
Once the costliest hot spots have been identified and the GPU’s role understood, it’s time to begin working on the code.If there is one hot spot that is disproportionately bigger than the others, it should be the first target. Before touching the code, though, it is wise to examine the hot spot function and determine its suitability to parallelization. Other articles on this site discuss functional and data decomposition and these approaches should be understood before undertaking this analysis. In its simplest form, the question is: what can be made parallel? If large blocks of data are being handled-as is often the case in graphics-oriented software-these blocks can often be broken into smaller chunks, each of which can then be processed by a separate thread, with every thread performing the same function. The size of the block and the number of threads will be discussed in the next article in this series, but a reasonable rule of thumb is to be liberal on both the number of threads and the number of chunks. The optimal number is determined by profiling the software again and seeing what works and what doesn’t. However, danger lurks in optimizing for a specific hardware configuration. The software will surely run on older and newer platforms with more or less cores, so it’s generally wise to use rough guidelines for deciding how to divvy up data, rather than tuning the parallelization to one specific set up.
Sometimes the hot spot is not processing large amounts of data. Instead, it’s doing a lot of different tasks, some of which are slow. In such a case, the principal approach is to see which of those tasks can be run in parallel with the slow tasks. Ideally, these tasks are independent of the slow tasks or have a manageable dependence. For example, disk I/O is a very slow operation that cannot be substantially improved programmatically. However, it is possible to create a task that begins to process the disk data as soon as it’s read, rather than waiting for all the data to be read in before proceeding. This approach does not diminish the time lost by the slow task, but it does improve the amount of work done during that period. When slow tasks dominate a hot spot and they don’t allow much parallelization, it’s often useful to reconsider design. In the case of disk I/O, for example, can caching data reduce the need for I/O? Can the data be arranged so that the first reads provide enough information for some processing tasks to be spun off in parallel, etc.?
At some point, you might conclude that the hot spot you’re examining is just not amenable to parallelization. If so, it’s important to accept this and move on. You should examine all the primary hot spots to determine the ease with which they can be made parallel. It’s often best to start working on the easy ones first. A common experience is that tackling the easy hot spots provides sufficient performance lift that it’s no longer urgent-sometimes no longer economical-to resolve the remaining one or two hot spots. This judgment cannot be made, however, until all the major hot spots have been examined and assessed.
A Final Tip
In the next installment, I’ll discuss how to do the work of parallelizing the hot spots. Meanwhile, there is one useful tip I want to share. An indispensable tool in doing this work is a notebook. This can be an online notebook or a physical binder with paper. Parallelizing code generates lots of data that is valuable to capture. For example, profiling the code on multiple machines to assess the GPU’s effect will generate a slew of numbers that are very definitely worth saving. You will want to refer to them as you consider various optimizations.An easy way to track the numbers is in a spreadsheet, although I’ve found a simple table in a wiki works somewhat better and it enables easy sharing. The notebook should also hold notes and observations about the suitability for parallelization of the hot spots. What did you observe? What might work? What won’t? This information has two benefits: 1) it leaves bread crumbs for future optimizations and 2) it helps you see similarities between hot spots and possibly identify changes that will improve more than one hot spot. The discipline of taking notes during coding is largely a dying practice, but in optimization, it repays the time spent many times over. If the final form of your notebook is indeed electronic, make sure to check it into the source code management system.

Comments
Thanks for the article, I enjoyed reading it.
The maxim "Premature optimization is the root of all evil" is a tautology. Of course premature optimization is bad, otherwise it wouldn't be premature. That understood, the main task of the programmer interested in writing fast, correct code in a production environment comes down to deciding what is, and what is not, premature.
I absolutely agree with the practice of test driven development, with functional tests in place that reinforce our confidence in the correctness of code we have a safety net that ensures that optimizations are not made at the expense of reliability.
However, the single largest factor in the performance of code is the choice of algorithm. No amount of parallelization will make, say, a bubblesort perform, on average, as well as a quicksort. It is also true that some algorithms are more amenable to paralleization than others. Those two facts in mind, that is why I take issue with the oft quoted maxim "Optimize the serial algorithm first". The serial algorithm may not be suitable for parallelization in the first place, so it is worth doing some thinking in advance. Is this operation likely to be a bottleneck in the program? Will it benefit from parallelization? If so, it might be wise to think about parallelism at that point, and choose an algorithm appropriately. That, to my mind, is not premature.
@Planetmarshall: Agreed with your point key points. And certainly designing software and choosing algos with parallelization in mind is not premature at all. Thanks for your comments. ---Andrew
Nice article. But I have a different school of thought on few of the points mentioned above.
1) Use of right data types and data structures at the time of initial phase of coding should be imbibed and encouraged. This should be based upon previous experiences under similar situations and this should not be viewed as Premature optimization.
2) Impaired code readability and Harder to maintain the code : These two factors should never impede any effort towards achieving better performance. The reason for making this assertion is that since enough test cases and code quality checking measures will be in place for a project (making an assumption to support my argument :) ), the same can be applied to see that the code still continues to be in better shape even after making the intended changes. But in reality, the above mentioned two factors prevent achieving significant performance improvements through architecture re-designing. So, the focus should only be on achieving maximum possible performance, rather than on the well-being of the future developer(s) who'll be maintaining the code.
@Mohammed.
"... should be imbibed and encouraged." I don't know what you mean by imbibe here. It is surely not the word you mean. As to encouragement, certainly, I have no argument with using the right data types and data structures. I'm not sure I understand your point.
"impaired code readability...should never impede any effort towards achieving better performance." I mean no disrespect when I say this, but this view is completely wrong. By which I mean it's wrong conceptually, wrong in the implications it creates, and wrong in practice. You should remove it from your programming worldview. It will never be the right path to choose. Moreover, there are many, many organizations where if you articulated that position during a job interview, the interview would immediately draw to a close.
@Andrew
Thanks for your feedback. Let me explain in detail, my point of view.
By imbibe, I mean inculcating the good way of coding by sheer practice.
I found the next comment really very very interesting.
Let me elaborate my view. Given the ownership of a Module (be it a part of an application or the entire application itself), the only aim that the Developer responsible for the Module, must have is that, the Module allocated to him is at the Top compared to similar competitor modules, in terms of performance, functionality, stability and scalability. Pardon me if you feel I sound too demanding. Now, to achieve this simple aim, should the developer be limited by factors like Impaired code readability and Harder to maintain the code ? I do agree that if code changes for performance tuning changes are not done properly, then it could lead to issues. But, our focus should be on overcoming the shortcomings, rather than make an excuse of the same.
Now, coming to the point where it is mentioned that, sticking to my point of view, will result in loss of Job opportunities with 'many, many organizations'. With all due respect, I would ask, when one knows the goal to be reached, does fear of losing few job opportunities really matter ?
Within the very limited goals of paralysing itsy bitsy pieces of an otherwise serial program then I agree with the author. But surely this is a very limited goal. A program that truly exploits multicore has to be designed as such; it has a different structure. The analysis has to be different; the designer has to think parallel from the very start. As to quoting Hoare and Knoth they were talking about something quite different. It is also true that the blind adherence to a form of words should remain within the domain of Religion.
This article has some interesting views. Here are a couple of thoughts that the article brought to mind.
First making code better at any point in development is good. DOCUMENTING code at any point is critical, so if changes are made without documentation, one can expect programmers who follow later will have issues. My experience tells me that even initial code needs to be documented if you want people to understand it later.
Parallel processing is useful only when a number of separate variables can be solved at the same time, such as diverse data searches on different servers. Otherwise in my experience the current processors are so fast it is not relevant or wise to split the tasks.
One rule of the thumb for us is the more code the more chance for error. So parallel processing should be a last resort to solving a problem.
The point in the article to write the code and debug without the parallel processing is excellent. The next step is to see if users even notice a speed bump. If the users notice the bump, then work on changing the code to enhance performance otherwise you just saved yourself a lot of work and trouble.