Get the basics about parallel algorithms, parallel programming APIs, and the tools required to start writing your own parallel programs.
Today’s leading hardware designs depend on parallelism; that is, multiple activities running at the same time. Performance inside a CPU depends on instruction level parallelism. All the major CPU vendors are putting multiple cores on each die. Typical servers have multiple sockets per box. Clusters of servers and grids of systems spanning the globe are becoming commonplace. Parallelism has gone main stream and if software developers want to stay relevant, they had better learn to deal with it.
A good place to start is by taking a closer look at parallel hardware. Supercomputing designers have been building high performance computers by connecting smaller computers into larger systems since the early 80’s. Most of these designs fall into one of two camps: distributed memory multiple instruction multiple data (MIMD) or shared memory MIMD architectures. In MIMD, each processing element has its own stream of instructions and its own stream of data. We call the system a “shared memory” system when there is a single address space that all the processing elements share. If the memories are distinct and the processing elements can only interact through the networks connecting them, we call the system a “distributed memory” system.
Both these architectures are important and, depending on the problem being solved, a software developer may need to choose one over the other. Over the next few years, however, the numbers of shared memory systems will explode. As Intel puts two cores on a single piece of silicon and as multi-socket systems continue to grow, shared memory systems are going to become the norm. Even our laptops and eventually our handheld computers will use multi-core CPU’s and will present the programmer with a parallel hardware platform.
So how is a programmer supposed to deal with these parallel systems? There are two options depending on the workloads and the needs of intended users. For the first option, the goal is to complete a collection of independent jobs in less time. In other words, the throughput or aggregate capacity of the system is emphasized. In this case, each individual task can run on a single processor. All that the software developer needs to do is make sure a job scheduler (usually buried inside the OS) effectively manages the workload. The second option occurs when single jobs must complete in less time. Today’s business climate, for example, often requires answers from a computation fast enough to support real time decision making. In this case, a single job needs to run in less time. We call exploiting parallelism inside a single job to make it run in less time for a given problem size, “parallel programming.”
Parallel programming was for years the domain of the hard core high performance computing (HPC) software developer. Today, however, all software developers need to understand parallel programming. And while sometimes you get lucky and it’s easy to create a parallel program, in most cases, it can be quite challenging.
Fortunately, HPC programmers have been working with parallelism for decades. They’ve learned a great deal about parallel algorithms, parallel programming APIs, and the tools required to make the programmer’s life easier and more productive. We will outline these issues here and hopefully point you in the right direction to learn what you need to start writing your own parallel programs.
We start with some computer science background on parallel programming. Then we will look at the specific APIs and programming environments available for programming shared memory MIMD computers. We then close with concluding remarks and a list of resources you can access to learn more.
Parallel Processing: The Computer Science Perspective
The foundation of parallel programming is concurrency: the condition of a problem or a system in which two or more tasks are active simultaneously. Operating systems have been exploiting concurrency for years to make systems more responsive. In the case of a single processor, the tasks take turns. Their instructions are interleaved and scheduled on that processor. This optimizes system resource utilization and hides system latencies, but it doesn’t speed up the runtime for a single stream of instructions.
To get a fixed amount of work done in less time, you need to follow a process like this:
- examine a problem that contains useful concurrency
- rearrange the problem or craft specialized algorithms that expose this concurrency
- run it on a computer with multiple processing elements
Let’s take a look at each of these steps in isolation.
First, a problem must contain concurrency. If a problem is inherently structured as a single task with a fixed order of events, there just is no concurrency to work with. For example, I am typing this document with Microsoft Word*. The program must wait until I enter a keystroke before it can act. This imposes a fundamental sequential ordering to the task. There just no concurrency to exploit.
Now consider the rendering of an image. This is the processes where a 3D model is turned into a realistic image by following the light rays between the viewer’s eye and the sources of light. Each ray is independent of the other rays and therefore the computation for each ray can be handled independently. This problem is a task-level parallel problem since you are using the multiple tasks to define the parallelism. And since each ray is independent, the programmer doesn’t have to do anything special to coordinate the tasks as they execute. We call these sorts of task-level problems “embarrassingly parallel.”
Finally, consider a more complicated problem where you are modeling the stresses on a jet engine’s gas turbine. Here the concurrency is much more complicated to expose. The core of this problem is to superimpose a mesh over the turbine and solve the differential equations at each point on the mesh. This seems to be one massive task, but when you look at the problem in terms of the data you see that you can partition the grid into larger blocks and solve the interior of these blocks simultaneously. Computing the blocks cannot be done independently since they need to interact at the boundaries, but you can overlap these updates with computation on the interior and extract large amounts of concurrency. This approach finds the parallelism in the data and how i t is decomposed. Algorithms of this type are called geometric decomposition algorithms.
Selecting a parallel algorithm that exposes the concurrency in a problem can be very complicated. Fortunately, software engineers have been working on these types of algorithms for over 20 years and have reduced most algorithms to a collection of Design Patterns. These patterns have been organized into an interconnected network of patterns stretching from the high level problem statement to the parallel algorithm and finally to the low level constructs needed to implement the parallel program. This network of patterns is called a pattern language. Learn more about this pattern language for parallel programs in Patterns for Parallel Programming (Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill, Addison Wesley, 2004).
Once you have a parallel algorithm, you need to translate the algorithm into source code and run it on a parallel computer. We will talk about this important step in the next section. For now, however, we need to explore in a bit more detail the way a parallel program runs on a parallel computer.
A modern operating system organizes its work in terms of processes. A Process is a collection of resources that support the execution of a program’s instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. In other words, a process is a “heavyweight” unit of execution that carries its state (including its address space) with it.
To move a process around the machine or swap the active process with another process is very expensive and requires a great deal of work. This greatly complicates an Operating System’s' attempts to optimize utilization of system resources. Hence, modern operating systems include a simpler, lighter weight unit of execution called a thread. Threads are simple executing agents that are associated with a process and share the process's environment. Each thread has its own stack, but for the most part, the memory available to a thread comes from the shared address space owned by the process. This means that switching the context between threads is inexpensive making it easy to move threads around the system.
Threads are the natural unit of execution when programming shared memory computers. The overall application has a single process and the concurrency comes by mapping the parallel algorithm onto multiple threads owned by the processes. We call this multi-threaded programming. Since the threads share an address space, interaction between the threads is simple to manage inside the program. And since so little state is associated with each thread, it’s easy to create programs where the number of threads varies to meet the specific needs of an application.
Note, however, that multi-threaded programming is not the only option on shared memory machines. It is possible to map your parallel algorithm onto multiple processes. The processes don’t share an address space, so interaction between processes can only occur by explicitly passing information between them (for example, using a message passing API such as MPI). This approach has been popular in the scientific computing community where programs often need to run on both distributed memory systems (such as clusters) and large shared memory sys tems. We believe, however, that more mainstream applications will not need to support clusters and for most programmers, the more pragmatic approach for parallel programming is based on threads.
Programming with Threads
Once you have your problem in hand, have identified the concurrency in your problem and have found a parallel algorithm that exposes that concurrency, the real fun begins: you need to translate that algorithm into a parallel program. In other words, you need an Application Programming Interface (API) that lets you express concurrency in a program’s source code.
For multi-threaded programming, the most common APIs are OpenMP* and explicit threading libraries such as pthreads (in a Unix environment) or Windows* threads. The choice of which API to use is complicated and depends on the skills of your programming team and the sorts of problems you wish to solve.
OpenMP is a collection of directives and supporting runtime library routines to support multi-threaded application programming. When writing OpenMP programs, you typically find the most time consuming loops and then use directives to tell the compiler to split the loop iterations between threads. OpenMP can also be used to define task queues and for more general purpose multi-threaded parallelism, but the most common case is loop-level parallelism. The advantage of OpenMP is its simplicity. There are only a small number of directives to work with and usually they do not change the underlying program semantics. As a bonus, OpenMP lets the programmer adopt a programming discipline in which a program smoothly evolves from a working sequential program into a working parallel program; thereby supporting testing at each phase of the transition.
Explicit threading libraries, however, are much more complicated to use. The code to execute in a thread is packaged into a routine. The threads are then forked at a specific point in the program passing the desired routine to the function launching the thread. Considerable amount of code must be written to create the thread and the code that will run within the thread. Consequently, explicit threading libraries are much more difficult and error prone to work with. Hence, the weakness of working with an explicit threading API is that you “must” control all the low level details of managing the threads. On the other hand, the strength of working with an explicit threading API is that you control all the low level details of managing the threads. Since the programmer controls everything, the programmer can handle a wider range of algorithms and do more to tune their program to the needs of a particular system. Finally, while OpenMP requires the availability of an OpenMP enabled compiler, an explicit threads API just needs an interface to the multi-threading libraries. Hence an explicit threads API can support a larger range of compilers and languages.
Regardless of which API the programmer uses, it is important to take great care to assure the resulting program executes quickly. We measure performance in a parallel program in terms of speedup. This is the ratio of the programs runtime in parallel to the runtime of the best available sequential algorithm.
S = t(sequential)/t(parallel)
When you have a perfectly parallel program, the speedup should scale linearly with the number of processing elements (P); i.e., S should equal P. This is called “perfectly linear speedup.” Programmers only rarely attain perfect linear speedup. The problem is that even in a highly parallel problem, there are still numerous sources of overhead, which limits both performance and potential speedup.
The most important issue to consider is the serial fraction of a problem and its impact on the speedup. The rule governing this relation is called Amdahl’s law. Here is a quick and easy development of Amdahl’s law. Consider a problem with total runtime T(s). We can break this down into two parts, that fraction that can run in parallel (fp) and that fraction which can only run serially (fs). The serial fraction arises from operations such as I/O overhead, fixed startup costs, or work required to process the results from the parallel job. As you increase the number of processors available to work on a problem, only the parallel fraction speeds up so the time as a function of the number of processors (P) is:
T(P) = (fs + fp/P)T(s) = (fs + (1-fs)/P)T(s)
Plugging this into the definition of speedup:
S = T(s)/(fs+(1-f(s))/P)T(s) = 1/(f(s) + (1-f(s))/P)
which in the limit of large P gives us
S = 1/f(s)
These equations showing the impact of an algorithm’s serial fraction are vital when thinking about parallel algorithms. If you have parallelized only 80% of your algorithm, the best speedup you will ever get is 5, regardless of how many processors you have. With two processors, the best speedup you can get is only 1.4. Therefore if you care about performance, you need to pay a great deal of attention to minimizing the fraction of work in your algorithm that can only run serially.
What makes parallel programming so challenging is that in addition to performance problems arising from the parallel algorithm, the issues familiar to serial programs still exist. For example, you must be careful to manage your memory access patterns so processing elements are close to the data they will use. This is even more the case as you move to multiprocessor systems; especially as they grow in size and the costs of accessing different regions of memory vary from one processor to the next (i.e. non-uniform memory architecture or NUMA systems).
High performance is the name of the game in parallel programming. But quickly getting the wrong answers is undesirable. Correctness in a multi-threaded program can be very challenging. The fact that the threads all share a single address space greatly simplifies parallel programming. But this feature also raises unique challenges since threads may share data in ways the programmer did not anticipate. When a program’s output changes as the detailed scheduling of threads changes, it is called “a race condition.” These are among the most dangerous of all programming errors since it can be almost impossible to prove that you don’t have any race conditions in your program. It may run perfectly the first 1000 times, for example, but on the 1001st time, the thread schedules line up a certain way and a memory conflict causes the program to fail.
To eliminate race conditions in multi-threaded programming, the programmer must assure that every conceivable way the instructions can be interleaved yields the correct results. This means that the programmer must identify objects that multiple threads need to read or write and protect them with synchronization constructs so every order of access provides the correct results.
For example, consider a simple program that computes two independent results (A and B) and then combines them with a third value to compute a final answer. Here is a multi-threaded program to carry this out on two threads:
|A = BigJob()
||B = BigJob()
|Res += A
||Res += B
If the calculation of one value (e.g. A) takes much longer than the other value (e.g. B) this program would probably generate the correct result. If both threads try to sum their values into Res at the same time, however, the results are unpredictable. For example, if A = 1, B = 2 and Res is 3 (prior to the calculation of A and B), the result could be:
|Final value of Res
||Sequence of ops
||if you get lucky and the A and B calculations take widely different times
||if thread 1 reads Res, but while it's doing the sum, thread 2 reads Res, does its sum and writes the result after thread 1 completes
||if thread 2 reads Res, but while it's doing the sum, thread 1 reads Res, does its sum and writes the result after thread 2 completes
Each of these values result from a legitimate interleaving of the instructions in the multi-threaded program. But only the first one represents the correct answer. The programmer, therefore, must constrain the program so no matter how the instructions are interleaved it finds the correct answer. Commonly, a synchronization construct protec ts the update of Res. Every multi-threading API includes an assortment of synchronization constructs. With OpenMP, the atomic construct is the natural synchronization construct to use in this case. It assures that only one thread at a time will do the protected operation and that the operation with one thread will complete before the other thread is allowed to begin. The resulting program follows:
|A = BigJob()
||B = BigJob()
|#pragma omp atomic
||#pragma omp atomic
|Res += A
||Res += B
While it is easy to see these problems in tiny code segments, in a real application where the memory conflicts may be buried in thousands of lines of code spread out between many different files, it can be extremely difficult to make sure you have found all of the potential memory conflicts. Similarly, finding all the sources of parallel inefficiency can be extremely challenging. The programmer must have a deep understanding of the program to deal with performance and correctness issues. The best way to do this is with the right tools. Intel has a suite of tools for analyzing the system-level behavior of your program (Intel® VTune™ Performance Analyzer), the memory conflicts in a multi-threaded program (Intel® Thread Checker) and to visualize the parallel performance of your program. These tools are essential if you are serious about creating effective multi-threaded programs.
Parallelism is coming to everything from laptops to high-end servers. It will exist at multiple levels with multiple cores per chip, multiple sockets per box, and multiple boxes per system (cluster). While only some programmers will need to worry about clusters, all programmers will need to worry about shared memory multiprocessors. And the way you program these systems is with multi-threading.
Threads are lightweight processes that share an address space. Programmers use an API such as OpenMP or an explicit threading library to create these threads and coordinate their activity. The challenge is to minimize parallel overhead while assuring that every legal way to interleave the threads results in the correct answers. To do this by “inspection alone” is extremely difficult. The key is to use the right set of tools to help you.
This brief article is just an introduction to the high-level issues raised in multi-threaded programming. Its goal is to provide context and to provide an overview of the key issues programmers need to pursue. To master multi-threaded programming, do some additional reading.
To learn about parallel algorithms and how to “think parallel,” design patterns can help greatly. The following book presents the major patterns parallel programming experts have developed over the years:
- Tim Mattson, Beverely Sanders, and Berna Massingill. Patterns for Parallel Programming: Addison Wesley Professional
- To learn more about OpenMP, read the specifications available on the OpenMP Web site at: www.openmp.org*
- For those new to directive driven multi-threaded programming, a more pedagogical discussion may be important. While it is slightly out of date (referencing only the 1.0 specification), the basic ideas are still accurate and well developed in: Parallel Programming in OpenMP, Chandra, Rohit, San Francisco, Calif: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1558606718
- For libraries for multi-threading, both pthreads and Windows threads, we recommend several good books at: http://www.intel.com/intelpress
- While we encourage you to buy books and support struggling authors, Intel has a wealth of free information on the web. The following table shows you some useful links for items discussed in this paper:
About the Author
Timothy G. Mattson has a Ph.D. in chemistry (1985, U.C. Santa Cruz) for his research on Quantum Scattering theory. He has been with Intel since 1993 and is currently a research scientist in Intel's Parallel Algorithms Laboratory where he works on technologies to support the expression of parallel algorithms. Tim's life is centered on his family, snow skiing, science and anything that has to do with kayaks.