| Last Modified On : | September 4, 2008 11:42 AM PDT |
Rate |
|
Selecting APIs and Tools to Simplify Your Approach to Parallelism [PDF 101KB]
Software-development organizations must address the rise of hardware parallelism, making best use of the APIs and other tools at their disposal, and they must do so without bringing day-to-day operations to a halt. The long-term view is that a very high degree of parallelism is needed for applications, and yet, re-architecting existing software from the ground up is typically prohibitive. This paper advocates an incremental, iterative approach to incorporating parallelism into applications, while keeping an eye on the need for continuous improvement as time goes on.
The trend toward multi-core processors will continue into the foreseeable future, and software must embrace parallelism to take full advantage of it. Multi-threading is a common means of providing that parallelism, and an ecosystem of tools and best practices have arisen around that effort. At the same time, many developers characterize threading as a necessary evil at best, and a potential path to project failure at worst.
In reality, application threading is a very complex undertaking that brings with it the promise of great rewards, in the form of increased performance, responsiveness, and productivity. The added performance enabled by multi-threading an application also makes it possible to add more features to an application, or even to fundamentally change the way users interact with data.
Beginning from the business goal of making software scale to current and future generations of hardware, software makers must choose among a number of approaches. Software being built from the ground up, of course, should be built with an open-ended amount of parallelism in mind, since processors with very large numbers of cores are assuredly coming. For established applications, the same long-term goal is clear, but how to handle that goal in the near term is more complicated.
Re-architecting applications from scratch to support parallelism is advocated in some circles, although the expense and time requirements associated with that approach are often prohibitive. Joel Spolsky, author of "Joel on Software," points out that programmers will always want to get rid of whatever code already exists and start from a clean, green field, but that doing so can be a business catastrophe. In fact, he refers to rewriting code from scratch as "the single worst strategic mistake that any software company can make."
To bring together a long-term, robust vision with the near-term requirement to keep products shipping, incremental introduction of parallelism is the clear approach. Therefore, it's necessary to find relatively simple ways to introduce parallelism into applications. That means automating the process where possible, depending on threaded libraries where you can, and where you need to thread manually, choosing the simplest way to achieve your goals.
In addition to the incremental approach to parallelization advocated here, development organizations must also decide how much control they need to have over the threading in a specific algorithm, and how much that control is worth to them. Note that some applications may require hybrid threading styles.
Threading options are not mutually exclusive, but in general, Intel advises against explicit threading. The tradeoff is that, because explicit threading is very low-level, it is very powerful and flexible, but also very complex. James Reinders of Intel's Software Development Products team is explicit on the subject:
"Do not use raw native threads (Pthreads, Windows threads, Boost threads, and the like). Threads and MPI are the assembly languages for parallelism. They offer maximum flexibility, but require too much time to write, debug, and maintain."[1]
In the interest of full disclosure, it's important to acknowledge that explicit threading enables certain complex implementations that developers may want to put in place, such as for scheduling specific threads to exert fine-grained control over them. It also provides for open-ended control over issues such as error handling, fine-grained mechanisms to control thread-processor mapping, and synchronization between a subset of threads.
Therefore, some developers will choose to take the explicit threading approach in some cases, but this paper's incremental approach focuses on the following techniques instead:
Threading can often scale algorithms very well, but it can also introduce unpredictable results or performance penalties. Multiple threads updating the same global variable can cause data loss, and improper synchronization among threads can make the threaded version of an application perform far worse than the serial version it replaces.
This section introduces a few common threading issues as a means of demonstrating the complexity of manually introducing parallelism using explicit threading. Of course, these and similar issues can arise whether or not an application uses explicit threading, but in general, as the complexity of your threading implementation increases, so does the likelihood of problems.
When more than one thread accesses memory without correct synchronization, data races can occur, which cause unpredictable computational results. Figure 1 graphically depicts a race condition, where depending on which thread runs first, the resulting value of X will be different. If thread 1 runs before thread 2, the program will exit the parallel region with X equal to 3. If thread 2 runs before thread 1, the program will exit the parallel region with X equal to 43.
Figure 1. A data-race condition
If multiple threads reach a condition where each is waiting for the other, a deadlock occurs, which will cause the application to hang. Figure 2 shows such a condition, where the combination of locks on A and B by threads 1 and 2 prevent either thread from proceeding, effectively deadlocking the application. In far more complex real-world situations, bugs such as this may only occur under certain specific conditions, making them very difficult to predict or recreate.
Figure 2. A deadlock condition
In addition to error conditions such as these, threaded code is subject to performance issues that can easily make a threaded program run more slowly than it should, and even more slowly than the serial version. For instance, excessive locking is a common source of performance deficits; the serial nature of locked regions limits scalability, and too-frequent locking drives up cache evictions and slows down threads Correcting issues such as these can be difficult and time-consuming, as well as being difficult to know when the optimal scenario has been reached for a variety of runtime conditions.
Instead of the grand rewrite using explicit threading, consider a more incremental approach, where instead of trying to re-write a parallel version of your application, you introduce threading gradually, making the simplest changes first. Focus on introducing parallelism that provides performance your application needs—attend the user, not the hardware. Optimize the serial version of your code first, and if you still need higher performance, consider the following approach:
For additional guidance in setting goals and establishing processes, see the following Intel® Software Network resources:
OpenMP has the advantage over explicit threading that it is relatively simple, high level, and vendor/OS-neutral. It consists of a set of pragmas, APIs, and environment variables, and it is supported by compilers on a wide range of platforms, including Windows, Linux, and Mac OS.
OpenMP is designed to support the incremental addition of parallelism to an application, so you can work on one portion of the program at one time, without making any dramatic changes to the code as a whole. In fact, the serial code statements themselves do not, in general, need to be modified, which reduces the likelihood of introducing new bugs into the code. OpenMP creates parallel regions within the code by the addition of a parallel pragma in C/C++ or directive in Fortran, as shown in Figure 3. After the directive, the main (master) thread, shown in magenta, spawns the appropriate number of slave threads (based on the work and available hardware resources) to complete the parallel work. At the end of the parallel region, the execution path pauses while all the threads complete their work, whereupon the slave threads are suspended, and the program returns to serial execution.
Figure 3. OpenMP PARALLEL directive creating a parallel region of code
Part of OpenMP's flexibility is that, if the compiler does not recognize the OpenMP pragma or directive, it simply ignores it, and the code compiles serially. Likewise, OpenMP can be disabled, and the codebase will compile serially. This characteristic (which is governed by ANSI standards) makes OpenMP extremely non-invasive. Another strength of OpenMP is its ability to determine the correct number of threads and automate the distribution of work. This is particularly relevant in light of the increasing number of cores in future processors.
The obverse side of this simplicity is that OpenMP does not in and of itself make code thread-safe. It is still necessary to detect and correct threading errors, which can be facilitated by the use of Intel® Thread Checker. You may also find certain situations where OpenMP does not provide the fine-grained control needed to optimally schedule individual threads. For example, in cases where various aspects of a workload would benefit from having specific numbers of threads assigned to them, operating at specific priorities, native APIs for explicit threading may provide better performance, albeit at the cost of added complexity.
Ultimately, the trade-off between OpenMP and explicit threading is a familiar one. It provides simplicity, at some cost in terms of control and visibility into the heart of what the code is doing behind the scenes, although OpenMP does provide some APIs that can give a higher degree of detail in some cases. For the most part, however, OpenMp is a suitable means of obtaining a fairly high degree of parallel speedup, balanced with favorable characteristics in terms of the time, effort, and cost involved.
Threading libraries have begun to appear on the market, which promise to provide pre-threaded functions without delving into the low-level details of how they work. Intel® Threading Building Blocks is a C++ template-based runtime library built to simplify the porting of common serial data structures and algorithms to efficient parallel code. The resulting application can have excellent performance, scalability and reliability on Windows*, Linux* and Mac OS*, automatically detecting the number of available cores and scaling the application accordingly.
Compared to native threads, Intel Threading Building Blocks is an easier to use, portable solution for C++ developers to introduce application threading for increased performance, scalability and reliability. There is no need to re-write, re-test, and re-tune common parallel data structures and algorithms. Your application simply makes calls to the Intel Threading Building Blocks library, which has parallel programming contained within it. Because Intel Threading Building Blocks is not intended to cover all threading problems, it is designed to coexist with other threading techniques.
The library provides highly tuned, parallel versions of common algorithms, allowing application designers to focus on a higher level of abstraction via tasks and scalable patterns. Intel Threading Building Blocks enables you to specify tasks instead of threads. As discussed elsewhere in this paper, programming directly in terms of threads is complex and tends to lead to inefficient programs, because threads are a low-level and heavy construct that is close to the hardware. The Intel Threading Building Blocks run-time library automatically schedules these tasks onto threads in a way that makes efficient use of processor resources.
Intel Threading Building Blocks emphasizes scalable, data parallel programming. The approach of breaking a program up into separate functional blocks and assigning a separate thread to each block does not scale as well, because the number of functional blocks is typically fixed. In contrast, Intel Building Blocks emphasizes data parallel programming, which enables threads to work on different parts of a collection. Data parallel programming scales to larger numbers of processors by dividing the collection into smaller pieces, or by handling bigger collections so that there are more pieces to go around.
In conjunction with Intel Threading Building Blocks, applications can make use of the algorithms in other Intel libraries. Intel® Integrated Performance Primitives are built to accelerate multimedia, communications, and other applications. Intel® Math Kernel Library provides math routines for scientific, engineering, and financial applications that require maximum performance. Both are fully thread-safe and contain a large number of threaded functions to aid in the expression of parallelism in applications.
When implementing parallelism in software, one needs to find balance between accommodating the future of hardware and the need to bring software roadmaps forward smoothly. Multi-threading software is hard to do well, which is a good argument for implementing parallelism with a low degree of invasiveness. Start with the simplest changes first, and move to the more difficult ones as performance requirements dictate. As the industry continues to mature, APIs and tools for implementing parallelism will continue to become more powerful, and companies should position themselves now to take advantage of those advances as they come.
As you build new parts of your applications or entirely new applications, think parallel. Rather than analyzing a problem in terms of a sequential set of steps toward a solution, consider how the data can be broken into separate pieces that can be performed in parallel. In a sense, the mental transition is very similar to the difference between figuring out how to do something yourself and figuring out how to delegate it to a team of people. The modularization involved in this approach will be readily recognized by programmers, and the actual discrete means of implementing that parallelism will follow.
The following materials provide a point of departure for further research on this topic:
Matt Gillespie is an independent technical author and editor working out of the Chicago area and specializing in emerging hardware and software technologies. Before going into business for himself, Matt developed training for software developers at Intel Corporation and worked in Internet Technical Services at California Federal Bank. He spent his early years as a writer and editor in the fields of financial publishing and neuroscience.
Copyright© 2008 Intel Corporation. All rights reserved.
[1] Dr. Dobbs' Portal, September 5, 2007. "Rules for Parallel Programming for Multicore." http://www.ddj.com/hpc-high-performance-computing/201804248.

English | 中文 | Русский | Français
Matthew Gillespie
|