Download Code Source: Intel Sample Source Code License Agreement
In the process of parallelizing software, once developers have a structured approach for establishing a tuning methodology and workload, as discussed in the first paper in this series, they are ready to locate opportunities for parallelism. Those opportunities correspond to specific bottlenecks in the serial code that lend themselves to parallelization. This paper begins by defining bottlenecks and discussing where they come from, before identifying three top strategies for locating them:
- Optimize serial code before looking for parallelism opportunities, to avoid distraction by serial inefficiencies, helping to ensure that you will focus on the right hotspots.
- Use the tools available to maximum benefit, including the Intel® VTune™ Performance Analyzer and Intel® Parallel Advisor Lite.
- If necessary, tag key pieces of code for redesign with algorithms that lend themselves to parallel processing on n-core systems.
What Are Bottlenecks?
Before we discuss the actual strategies for identifying the parallelism opportunities this paper is named for, we need to establish some background. Specifically, it's important to know what performance bottlenecks are and the characteristics of bottlenecks that can be addressed using software multi-threading. The first order of business is to establish the difference between a hotspot and a bottleneck:
- A hotspot is a place where the processor spends a lot of time, which can be measured using analytical tools during the performance tuning process. Since a lot of time is spent on these sections of the code, it stands to reason that shortening the amount of time spent on them will speed up the overall operation of the code.
At the same time, however, it is also important to note that just because the processor spends a lot of time in a certain operation, that doesn't necessarily imply that the section of code is inefficient. It may simply be that a lot of work is being accomplished there.
- A bottleneck is an area of code that is a hotspot because of inefficiency; in this framework, therefore, some hotspots are bottlenecks, and all bottlenecks are hotspots1. To put it another way, hotspots are potential bottlenecks, so identifying hotspots is a sound first step toward finding bottlenecks. Using this definition of a bottleneck effectively zeroes in on opportunities for optimization.
The focus of this paper is on bottlenecks whose inefficiency on multi-core hardware is because that code doesn't take adequate advantage of hardware parallelism. Bottlenecks may also arise from inefficiencies in serial code, and the relationship between those two types of inefficiency is discussed later in this paper, in the section, "Strategy One: Optimize Serial Code Before Looking for Parallelism Opportunities." While we tend to think of hardware parallelism as being provided by multiple physical cores within a single processor, it is also useful for the sake of completeness to recognize two other common types of hardware parallelism:
- Intel® Hyper-Threading Technology is a hardware feature that providesthe ability of a processor core to support two threads simultaneously.
- Symmetric multi-processing is parallelism provided by the presence of multiple processors within a single system.
Opportunities for addressing bottlenecks using software multi-threading arise when processor-intensive code can be broken down into multiple tasks that can be accomplished simultaneously. It is also important to keep in mind that some slow parts of applications do not lend themselves to parallelization. For example, if the processor is not the rate-determining factor (such as when the processor has to wait for I/O operations or is memory-bound), parallelization may not be a viable approach to optimization.
Where Do Parallelizable Bottlenecks Come From?
Now that we have established what bottlenecks are and have touched briefly on what types of bottlenecks can be addressed using software multi-threading, we are ready to turn our attention to more specific matters about what causes those bottlenecks and how to go about finding them. This discussion will lay the foundations for our top three strategies for identifying the best opportunities for performance improvements related to parallelism.
To start with what may be an obvious observation, bottlenecks that can benefit from parallelization occur within single-threaded or procedural code. It may be slightly less obvious to point out that those serial portions may also occur within a larger application that is already multi-threaded. This distinction is useful in considering parallelization as an iterative process. The developer identifies the parts of an application where threading can benefit overall performance the most and applies threading there first. At that point, the application is part parallel and part serial, where some of the serial portions may also lend themselves to threading in the future.
Given this understanding of the typical multi-threaded application, the task of identifying bottlenecks is finding places where missed opportunities for parallelism exist. Examples of such opportunities, which are discussed in more detail below in the section, "Strategy Three: If Necessary, Tag Key Pieces of Code for Redesign," include the following:
- Processor-intensive tasks that could be performed in parallel but aren't. Missed opportunities in this category include operations such as updating a user interface at the same time as performing some type of back-end processing and are sometimes known as "task-level parallelism."
- Data sets that could be processed in parallel but aren't. Missed opportunities in this category include operations such as converting multiple database records to a different format such as a comma-delimited file and are sometimes known as "data-level parallelism."
As a rule of thumb, looking for these design patterns as early in the design process as possible is best. Identifying these opportunities before initial coding is ideal, since it can help to prevent the need for redundant programming work. That said, it is more usual to identify opportunities for parallelism throughout the software lifecycle, in an iterative process as described above.
The Top Three Strategies for Addressing Parallelizable Bottlenecks
So far, this paper has laid the theoretical foundations for identifying parallelism opportunities in software. We have discussed what bottlenecks are (and what they aren't), as well as what types are good candidates for performance improvement by means of parallelization. We have also covered common issues that cause those bottlenecks, as well as some of the concepts that underlie looking for them.
This section turns to the strategies themselves, with an eye toward helping you create a process for implementing parallelism as simply and effectively as possible in your software projects. These three strategies are not exhaustive, but taken in order, they provide a sound basis for identifying key opportunities in your code to implement software multi-threading.
Strategy One: Optimize Serial Code Before Looking for Parallelism Opportunities
Since finding the best opportunities for parallelization depends on finding parts of the application where the processor spends the most time, it stands to reason that you must first minimize the effect of serial inefficiency. If you inadvertently focus on places where the processor spends more time than it should simply because the serial code is sub-optimal, you will have an inaccurate picture of parallelization opportunities.
Another way of looking at the same concept is that, if serial optimization changes the parts of the application where the processor spends the most time, you will necessarily look at different pieces of code to find parallelizable bottlenecks. A well-optimized serial application provides a strong foundation upon which to base your parallelization efforts. As a corollary to this strategy, being as familiar as possible with the operation of the serial algorithms will be helpful as you identify parallelization opportunities, as well as build the threaded versions.
The white paper "Optimizing Software for Multi-core Processors" explains in more depth how serial optimization fits into the process of parallel programming.
Strategy Two: Use the Tools Available to Maximum Benefit
While a theoretical understanding is a prerequisite for identifying opportunities for parallelization, a variety of programming tools exist to help. Judicious use of these tools helps make those identifications more accurate, as well as improving the efficiency of the process. They are therefore invaluable in improving the quality of the threaded software, as well as containing the investment of time and effort required.
Intel VTune Performance Analyzer is designed to provide profiling of applications that rapidly identifies hotspots. The tool outputs a visual display that shows how the execution path travels through the code, allowing multiple views where a programmer can drill down to the specific piece of code responsible for specific behaviors. Using the VTune environment, development teams can easily see where the processor is spending the most time, which is very valuable in establishing opportunities for optimization of both serial and parallel code. It also integrates with Intel® Thread Profiler, which helps visualize the operation of threads and the interactions between them , so developers can see where imbalances exist within an application's threading model.
Intel Parallel Advisor Lite is a technology preview of an add-on to Intel® Parallel Studio, a multi-part parallelism toolkit that works with Microsoft Visual Studio* to streamline software parallelization. It focuses on modelling parallelism in serial code, to help show how parallelism will benefit the application. This approach allows you to prepare your application for multi-threading to strteamline the overall process by finding artifacts within the serial code that should be resolved before parallelizing. The tool guides you through the process of identifying good candidates for parallelization and addressing data-access issues that might otherwise arise in the threaded version. The preparation of a serial application using Intel Parallel Advisor Lite can help smooth out the entire parallelization process, making it simpler and more successful.
Intel® Software Development Products in general are designed with parallel software in mind. While discussing the full spectrum of those tools is beyond the scope of this paper, the reader should note that those tools address the full software lifecycle. In addition to helping find opportunities for multi-threading, they aid in threading implementation, tuning, and troubleshooting.
"Intel® Parallel Advisor Lite" is an article that introduces the tool and describes how to integrate it successfully into the programming process.
Strategy Three: If Necessary, Tag Key Pieces of Code for Redesign
When analyzing code for parallelization opportunities, it's also necessary to keep an eye out for design aspects of the programming that don't lend themselves to parallelization. The reasoning behind this strategy is fairly simple. Serial code is designed with the intention of completing a set of tasks in succession, one after the other. Parallel code identifies pieces of work that are independent of one another, so they can be done simultaneously.
To create a set of steps that can be completed in parallel, it may be necessary to break the work up differently than in the serial code. There are various techniques for doing so, which are collectively known as decomposition. The main ways of breaking work up for parallelization are task decomposition and data decomposition. A brief discussion of each will introduce the concepts behind arranging data and work for parallel processing. Note that the goal here is not to discuss how to actually redesign algorithms, but simply to recognize some fundamental types of opportunities.
- Task decomposition (also known as "functional decomposition") consists of identifying different tasks that can be carried out at the same time, each of which can be completed on its own thread. For example, a word processor might continually refresh the display to reflect the user's work, perform background save and printing functions, and cross-reference each word of text as it is written against a saved dictionary list. A "master thread" would also be needed to coordinate the work of the other threads.
- Data decomposition consists of simultaneously performing the same task on multiple pieces of data. A classic example is applying a graphical effect to an image, such as lightening a photograph. Each pixel of the picture needs to have the same algorithm applied to it (calculating new color information), and many can be completed simultaneously, as long as a master thread keeps track of the overall work.
Each of these types of decomposition requires a different approach to breaking up the work being performed by the code than the serial version, and in many cases, that will require a fundamentally different approach. Some caution is required here to ensure that the potential benefit is appropriate to the time and cost requirements of recoding. On the other hand, for serial portions of code that take a substantial amount of processor time, fundamental redesign is sometimes the best option.
"Multi-Core Programming, from Intel® Press, contains a wealth of theoretical and practical advice on parallel algorithm design.
While parallelism is a key design precept, opportunities for threading can be found at any point in the application lifecycle. The concepts in this paper will get you started. The next paper in this series will help identify threading optimization issues such as ensuring the right number of threads or resolving dependencies among threads. Good hunting.
Share in the Community
Tell us about your efforts in performance tuning parallel applications, including what has worked and what hasn't, and connect with industry peers as well as Intel experts to help resolve any outstanding issues you have (or help someone else out): Community Forum: Threading on Intel® Parallel Architectures
Resources for Further Study
Now that this article has whetted your appetite for resources that can improve the results of your performance tuning efforts, the following sites will help you take it to the next level:
- Intel® Parallel Programming and Multi-Core Developer Community provides cutting-edge documentation, tools, and community resources like blogs and forums for the software industry.
- Intel® Software Development Products streamline the process of coding and optimizing software, with special attention to effective threading throughout the software lifecycle.
- Technical Books for Multi-Core Software Developers is a list that identifies some of the best resources available.
1 Some people will take issue with this distinction, since a bottleneck could be a minor inefficiency that doesn't necessarily take a lot of processor time (and is therefore not a hotspot). Since we are concerned here only with bottlenecks that represent good opportunities for optimization, however, we use this slightly limited definition.