"Thinking in Parallel"
"Developing for Terascale-on-a-Chip Computing" is a series of three articles on looking at the future of programming for many-core parallel computers that run in the teraflop and above range from a single processor package – terascale-on-a-chip. This is the second article in the series.
Developing for terascale-on-a-chip platforms takes parallel programming complexity to a whole new level. Evolving sequential code into a parallel program for dual-core and quad-core platforms might allow a programmer to parallelize loops and break up easily managed data structures into code that can run concurrently with little overhead. That’s not enough for terascale-on-a-chip.
When you’re dealing with 64 or more cores, you need many more tasks than cores, so the system can do useful work while some tasks wait on memory accesses and other poorly scalable activities. Having excess concurrency in an algorithm to hide latencies is called “parallel slackness.”
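To make this concrete, here is a minimal sketch (my own illustration, not code from the article) of parallel slackness in C++ with OpenMP: the data is cut into many more chunks than there are cores, and dynamic scheduling lets idle threads grab ready chunks while others stall on memory. The process_chunk function and the chunk multiplier are placeholder assumptions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-chunk work; the real computation would go here.
void process_chunk(std::vector<double>& data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        data[i] = data[i] * 2.0 + 1.0;
}

void process_all(std::vector<double>& data, int num_cores) {
    // Many more tasks than cores (the factor of 16 is an arbitrary assumption).
    const std::size_t num_chunks = static_cast<std::size_t>(num_cores) * 16;
    const std::size_t chunk_size = (data.size() + num_chunks - 1) / num_chunks;

    // schedule(dynamic) lets an idle thread pick up the next ready chunk,
    // hiding the latency of chunks that stall on memory accesses.
    #pragma omp parallel for schedule(dynamic)
    for (long long c = 0; c < static_cast<long long>(num_chunks); ++c) {
        const std::size_t begin = static_cast<std::size_t>(c) * chunk_size;
        const std::size_t end =
            begin + chunk_size < data.size() ? begin + chunk_size : data.size();
        process_chunk(data, begin, end);
    }
}
```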
To develop highly concurrent programs you have to expose as much concurrency as you can in your problem. And that requires spending a lot more time thinking in the problem domain before going to the program domain, because it takes iterations of going back and forth between
- Decomposing the problem into collections of tasks, and
- Analyzing and minimizing the impact of the data dependencies that the decomposition exposes.
Some problems end up being naturally, or embarrassingly, parallel, where tasks are completely independent from each other – there are no data dependencies between them. Most programs, however, do not offer this luxury; there’s always something that must be communicated, synchronized, or shared. So, you have to refine your problem to minimize these dependencies as you iteratively expose concurrency, organize and manage dependencies, and structure (and possibly restructure) your algorithm. This is the “wetware” part of parallel programming. And there are few (if any) tools that automate this process.
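As a small, hedged illustration (the functions here are mine, not the article's): a loop that transforms each element independently is embarrassingly parallel, while a sum over the same elements carries a dependency on a single shared total that has to be managed, in this case with an OpenMP reduction.

```cpp
#include <cmath>
#include <vector>

// Embarrassingly parallel: every iteration is independent, so the loop can be
// split across threads with no communication or synchronization.
void sqrt_all(std::vector<double>& v) {
    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(v.size()); ++i)
        v[i] = std::sqrt(v[i]);
}

// Not embarrassingly parallel: every iteration updates the same total, a data
// dependency that must be managed; reduction(+:total) gives each thread a
// private copy and combines the copies at the end.
double sum_all(const std::vector<double>& v) {
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < static_cast<long long>(v.size()); ++i)
        total += v[i];
    return total;
}
```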
Fortunately, there’s a whole industry of developers who have been creating highly parallel programs for years. Their experiences are incredibly valuable to all programmers as they expand their skills to create codes for future terascale-level computing platforms. These experts have already helped populate the programming industry with usable codes, development tools, programming models, and a pattern language readily available to programmers.
A Parallel Programming Pattern Language
The results of many designs, from architecture to programs and beyond, are built from a collection of easily recognized, commonly used, well-known design concepts, or patterns, from that discipline. They are assembled into a web of patterns and expressed as the final product in concrete, steel, or programming code. Over the years, several patterns have emerged in parallel programming that developers use to solve specific types of issues in their problems and build efficient, highly concurrent code. Structured into a pattern language, these patterns can help guide you through the multiple phases of creating efficient programs.
Figure 1 shows a schematic of this parallel programming pattern language.
Figure 1. Parallel Programming Pattern Language schematic
Parallel programming patterns are organized into design spaces to help you
- Find concurrency,
- Structure the algorithm based on characteristics exposed by the concurrency,
- Discover which programming structures support the structure of your algorithm, and
- Decide on which specific coding implementation mechanisms to use to express your design.
The Finding Concurrency design space is concerned with the original problem domain, where you work out the parts of your problem before mapping them into source code. Here, you decompose your problem and understand the dependencies that the decomposition exposes. For some developers, the concurrency and dependencies will be readily apparent. But for most of us, it takes many iterations of moving back and forth between the decomposition and dependency analysis patterns before we’re ready to move on to the Algorithm Structure design space. And it’s absolutely critical to spend enough time in this space before proceeding, because you can miss valuable design options if you move on too soon.
From the Finding Concurrency design space, you know the concurrent tasks in your problem and the dependencies between them. The goal of the Algorithm Structure design space is to devise a strategy or high-level structure for an algorithm that executes those tasks in parallel. In most cases, a particular “organizing principle” will jump out from the design. The decomposition into blocks of mostly independent data elements or the scheduling of tasks might stand out as the major feature of the problem. Or the problem may naturally decompose into a small set of functions with data flowing between them. Whatever the case may be, this organizing principle will guide you to the right Algorithm Structure patterns to consider.
Note that at this point in the design, we are still largely thinking at an abstract level. Issues pertaining to source code in a program will begin to emerge, but the focus is still on the formal algorithm design itself.
As you move into the Supporting Structure patterns, the issues move from abstract algorithms to source code. The Supporting Structures design space focuses on programming and computing environments, for example a distributed memory or shared memory environment. The Supporting Structures patterns help you deal with the following types of issues:
- How can I structure my parallel program to make interactions more manageable and easier to integrate with the core computations? SPMD pattern.
- How can I effectively load balance all the activities so that my parallel program stays highly efficient? Master/Worker pattern (sketched with the Shared Queue pattern in the example after this list).
- How do I efficiently translate a program dominated by computationally intensive loops into a parallel program? Loop Parallelism pattern.
- How can I construct a parallel program dominated by complicated sets of dynamic tasks that vary as the program executes and cannot be managed effectively with parallel loops? Fork/Join pattern.
- How do I explicitly manage shared data inside a set of concurrent tasks? Shared Data pattern.
- How can concurrently executing threads and processes safely share a queue data structure? Shared Queue pattern.
- How do I partition arrays among multiple threads or processes so the resulting program is both readable and efficient? Distributed Array pattern.
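To give one concrete, hedged illustration of how two of these patterns combine, the sketch below pairs Master/Worker with a Shared Queue using standard C++ threads: the master fills a queue with work items, and each worker repeatedly pulls the next item until the queue is empty. The integer task ids and the trivial "work" are placeholders.

```cpp
#include <algorithm>
#include <atomic>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Shared Queue pattern: a queue that concurrent workers can safely pop from.
struct SharedQueue {
    std::queue<int> items;
    std::mutex lock;                        // protects the queue from concurrent access

    std::optional<int> pop() {
        std::lock_guard<std::mutex> guard(lock);
        if (items.empty()) return std::nullopt;
        int item = items.front();
        items.pop();
        return item;
    }
};

int main() {
    // Master: enqueue many work items (placeholder task ids).
    SharedQueue queue;
    for (int task = 0; task < 1000; ++task)
        queue.items.push(task);

    // Workers: each thread pulls the next available item until none remain,
    // which balances the load across threads automatically.
    std::atomic<long long> result{0};
    const unsigned num_workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w)
        workers.emplace_back([&] {
            while (auto task = queue.pop())
                result += *task;            // placeholder "work"
        });

    for (auto& t : workers) t.join();
    return 0;
}
```

In a real program the queue would hold richer task descriptions, and a condition variable or a sentinel value would handle the case where the master keeps producing work while the workers are already running.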
The patterns in this space map fairly clearly to the patterns in the Algorithm Structure design space and, while they can be applied to several programming models – OpenMP,* MPI,* and Java* – they often map better to one programming model than another. Thus, they help you choose which structures will support your algorithm in a particular programming model.
Some of these patterns are well supported in the different programming environments, easing the work developers must do to express these structures. The patterns, however, are not hard and fast constructs. They represent flexible structures that can be combined to deliver the functionality the program demands and meet the programmer’s objectives.
Parallel programs use multiple threads or processes to take advantage of concurrency in the algorithm. And, it’s highly likely the threads and processes will need to share information and work together in various ways. Thus, at the basic level, the program will need to manage the creation and destruction of threads and processes, communicate amongst them, and synchronize their work. Various implementation mechanisms enable these activities. But, they can differ from one programming model to another – whether you’re working with OpenMP, MPI, or Java. Each model has its own set of pragmas, directives, constructs, etc. to implement concurrent activities and manage dependencies.
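As a rough illustration of how those mechanisms differ (my own sketch, assuming the same computation expressed under two of the models above): OpenMP expresses the thread management and the dependency with a single directive, while MPI makes the processes and the communication explicit through a library call.

```cpp
#include <mpi.h>
#include <vector>

// OpenMP: one directive handles thread creation, loop splitting, and the
// combination of per-thread partial maxima.
double max_shared(const std::vector<double>& v) {
    double best = v.empty() ? 0.0 : v[0];
    #pragma omp parallel for reduction(max : best)
    for (long long i = 0; i < static_cast<long long>(v.size()); ++i)
        best = v[i] > best ? v[i] : best;
    return best;
}

// MPI: each process finds the maximum of the portion of the data it owns,
// then the partial results are combined with an explicit collective call.
double max_distributed(const std::vector<double>& local_part, MPI_Comm comm) {
    double local = local_part.empty() ? 0.0 : local_part[0];
    for (double x : local_part)
        local = x > local ? x : local;
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, comm);
    return global;
}
```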
Programmers can continue to use the tools and models they’re familiar with. Writing highly concurrent programs doesn’t change the mechanics of coding. It impacts how you think about your problem from the beginning. As mentioned in the first article in this series, Intel researchers and developers are actively working on solutions that will help programmers develop for terascale-on-a-chip, many-core-based systems – whether the computing model is shared or distributed memory, or something new to come out of the industry.
Keeping an Open Mind
While much of the computing world is used to working with shared memory computing models, terascale-on-a-chip can eventually bring the computing capacity of a clustered, distributed computing model to a workstation. That will give programmers a whole new option for how to look at coding their problem. In other words, just because the platform appears as a shared memory or non-uniform memory architecture (NUMA) system doesn’t mean the parallel application is limited to one computing model. Even today’s very large NUMA supercomputers used by both industry and national laboratories are often programmed using a distributed memory model, using message passing (MPI). The reasons for this include
- Many of the parallel computing codes used for supercomputing were originally written for distributed memory models and reused in these newer machines.
- As shared memory systems get bigger, programs also get bigger, increasing the complexity of synchronizing and protecting shared data to prevent race conditions (a small example follows this list).
- Distributed model-based applications are easier to debug, even though they require a lot more upfront thinking about organization and communication amongst processes. The more massive the code becomes, the more involved the debugging becomes, and a parallel program multiplies that complexity. While new tools will need to be developed to ease debugging on terascale machines, the smaller, self-contained processes of a distributed computing model, with data sharing explicitly controlled through message passing, simplify finding problems in parallelized code.
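Here is a small, hedged example of the hazard behind that last point: in a shared memory model, two threads that increment one counter without synchronization race on the shared data and routinely lose updates; protecting the counter with a mutex fixes it, but in a large program every such shared structure has to be found and guarded.

```cpp
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    long counter = 0;
    std::mutex counter_lock;

    // Data race: both threads read-modify-write the same counter with no
    // synchronization, so increments are frequently lost.
    auto racy = [&] {
        for (int i = 0; i < 1000000; ++i)
            ++counter;
    };
    std::thread a(racy), b(racy);
    a.join();
    b.join();
    std::cout << "unsynchronized total (often less than 2000000): " << counter << "\n";

    // Fix: explicitly protect the shared data on every update.
    counter = 0;
    auto safe = [&] {
        for (int i = 0; i < 1000000; ++i) {
            std::lock_guard<std::mutex> guard(counter_lock);
            ++counter;
        }
    };
    std::thread c(safe), d(safe);
    c.join();
    d.join();
    std::cout << "synchronized total: " << counter << "\n";
    return 0;
}
```

With a distributed memory model, each process would own its own counter and the totals would meet only at an explicit message-passing step, which is exactly what makes such problems easier to localize when debugging.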
The vast majority of developers today are intimately familiar with shared memory platforms, but as terascale-on-a-chip platforms evolve, they will likely be more exposed to distributed memory computing models in a workstation-sized platform. Therefore, programmers should keep an open mind about how their new problems might best map to a distributed computing model.
A Grand Challenge
With terascale-on-a-chip computing, developers have a grand new challenge before them. Fortunately, many experienced programmers have already walked this road. The existing and evolving tools and a parallel programming pattern language provide a firm foundation, while Intel and other industry researchers work on developing new tools and programming models for future terascale-on-a-chip machines. What will come out of these machines is really up to our imaginations.
- Mattson, Timothy G., Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2005.
- Gamma, Erich, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
About the Author
Ken Strandberg writes technical articles, white papers, seminars, web-based training, and technical marketing and interactive collateral for emerging technology companies, Fortune 100 enterprises, and multi-national corporations. Mr. Strandberg’s technology areas include Software, Industrial Technologies, Design Automation, Networking, Medical Technologies, Semiconductor, and Telecom. His technology background enables him to write from an engineering perspective for technical audiences. Ken’s work has appeared in national engineering magazines, such as EE Times and EE Design, and on Fortune 100 enterprise web sites. Mr. Strandberg lives in Nevada and can be reached at email@example.com.
The preceding discussion attempts to summarize the text, Patterns for Parallel Programming by Mattson, Sanders, and Massingill. For more details on this pattern language, see the book. Illustrations and text used by permission.