# Is Parallel Programming Hard?

Posted by Ilya Mirman on Thu, Apr 16, 2009

(The following is a guest post from Prof. Guy Blelloch of Carnegie Mellon University.)

"Is Parallel Programming Hard?"

I believe that at the right level of abstraction, it is not and in fact can be as easy as sequential programming. Note, however, that "at the right level of abstraction" should be considered with care. Pragmatically there are some problems today with taking advantage of the "easiness" of parallel programming. Let me describe what I consider the three main problems and why I expect these can and will be solved.

• Parallel Thinking: The first problem is that programmers have been trained for years in sequential thinking. Most of us have been taught over and over since our very earliest days of programming how to think about programming in a sequential way. This permeates just about everything we do in code development, from the abstract description and analysis of algorithms, to the coding of these algorithms, and from debugging and verifying code, to the way that we read code.
• Wheat from Chaff: The second problem is that most existing parallel programming environments do not properly separate the wheat from the chaff. Parallel programming today often involves a huge number of decisions that have little to do with the inherent parallelism at hand and more to do with specifics of the particular programming model, compiler, or machine type. Since the various choices are not cleanly separated, the programmer is left with the sense that parallel programming is a hugely complicated task.
• Determinism: The third problem is a lack of support for deterministic parallelism and overuse of non-deterministic parallelism. By deterministic parallelism I mean a parallel program for which the results and partial results are the same independent of how the computation is scheduled or the relative timing of instruction among processors. In non-deterministic parallelism, the results depend on such scheduling and timing.

These three problems are not unrelated, as I will return to at the end, but let me first address each in more detail.

#### Sequential Vs. Parallel Thinking

When we think of a summation in mathematics we often write it using the Sum symbol. In studying mathematics most of us do not think of it as a loop but rather as just the sum of the elements---we implicitly understand that addition is associative and the order does not matter. Yet many programming courses spend weeks describing how to write loops to take sums of elements and similar tasks. Guy Steele pointed out in a talk that once you put an x=0 in your code, you are already doomed to think sequentially. I argue that there are many problems that we are taught to solve in some sequential way, but for which, when thought about abstractly, the parallel solution is natural.
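To make the x=0 point concrete, here is a small Python sketch of my own (not from the post): the accumulator loop forces a total order on the additions, while a pairwise tree reduction imposes no such order and is correct purely because addition is associative.

```python
def tree_sum(xs):
    """Sum without a sequential accumulator: pair up elements and
    reduce in O(log n) rounds. Associativity is the only thing that
    makes this equal to the left-to-right loop below."""
    if not xs:
        return 0
    while len(xs) > 1:
        # Each round adds adjacent pairs; all pairs are independent
        # and could be computed in parallel.
        xs = [xs[i] + xs[i + 1] if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

data = [1, 2, 3, 4, 5]
# Sequential thinking: x = 0 plus a loop forces one fixed order.
acc = 0
for v in data:
    acc += v
assert acc == tree_sum(data) == 15
```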

Quicksort is my favorite example of an algorithm that is naturally extremely parallel but rarely taught as such. Here is the code for Quicksort from Aho, Hopcroft and Ullman's textbook "The Design and Analysis of Computer Algorithms", a book for us old timers:

`procedure QUICKSORT(S):`
`  if S contains at most one element then return S`
`  else`
`    begin`
`      choose an element a randomly from S;`
`      let S1, S2, S3 be the sequences of elements in S less than,`
`        equal to, and greater than a, respectively;`
`      return (QUICKSORT(S1) followed by S2 followed by`
`        QUICKSORT(S3))`
`    end`

The interesting property of the pseudocode is that, whether they intended it or not, the authors wrote a highly parallel description of the algorithm. In the code there are two forms of parallelism. First, there is parallelism from the recursive calls to Quicksort. The code does not indicate that these calls need to be executed sequentially---it only says how the results should be put together. Second, there is parallelism from the fact that when selecting the elements of S1 (and similarly S2 and S3), we can compare all the keys to the pivot a in parallel. In combination these two forms of parallelism lead to an algorithm that is extremely parallel---in a theoretical sense it can effectively take advantage of O(n log n) processors. This means that in theory, when sorting 20 million keys we could effectively take advantage of over a million processors. Practice might be less generous, but we would be happy with enough parallelism for 100,000 or even 10,000 processors.
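The pseudocode above translates almost directly into runnable Python (my own rendering), with comments marking where the two forms of parallelism live:

```python
import random

def quicksort(s):
    """Functional quicksort mirroring the AHU pseudocode.

    The two recursive calls below are independent and could be forked
    in parallel; the three partition comprehensions compare every key
    to the pivot, which could also be done in parallel.
    """
    if len(s) <= 1:
        return s
    a = random.choice(s)
    s1 = [x for x in s if x < a]   # all comparisons independent
    s2 = [x for x in s if x == a]
    s3 = [x for x in s if x > a]
    # Independent subproblems: a parallel runtime may run them concurrently.
    return quicksort(s1) + s2 + quicksort(s3)

print(quicksort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```

Nothing about the logic changes when the recursive calls run concurrently, which is exactly the point: the description is already parallel.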

Yet most of our classes teach Quicksort without pointing out that it is parallel. We carefully go through the sequential execution and analysis and argue that it takes O(n log n) expected time. Most students are left clueless about the amount of parallelism in the algorithm, and certainly about how to analyze it. For example, if one took advantage of only one of the two forms of parallelism, how much parallelism would remain?
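That analysis can be sketched numerically. Assuming an idealized even split, quicksort's work satisfies W(n) = n + 2W(n/2) and its span (critical-path length) satisfies S(n) = log2(n) + S(n/2), since a parallel partition can be done in O(log n) depth. A small Python check of these recurrences (my own illustration, not from the post):

```python
import math

def work(n):
    # W(n) = n + 2*W(n/2): total operations, same as the sequential time.
    if n <= 1:
        return 1
    return n + 2 * work(n // 2)

def span(n):
    # S(n) = log2(n) + S(n/2): the partition contributes O(log n) depth
    # per level, and the two recursive calls run concurrently.
    if n <= 1:
        return 1
    return math.log2(n) + span(n // 2)

n = 1 << 20  # about a million keys
print(f"work        ~ {work(n):.3e}")          # Theta(n log n)
print(f"span        ~ {span(n):.1f}")          # Theta(log^2 n)
print(f"parallelism ~ {work(n) / span(n):.3e}")  # work / span
```

The work/span ratio bounds how many processors can be used effectively; dropping either form of parallelism changes the span recurrence and shrinks that ratio.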

#### Separating the Wheat from the Chaff

In the early days of sequential programming, people programmed with gotos, important languages lacked recursion (requiring the user to simulate it with an explicit stack), and it was reasonably common to drop into assembly code. I'm often surprised when I go back and read some of the algorithm papers from 40 years ago. They typically spend pages describing details we would never spend time on today (e.g., memory allocation). The typical programmer today almost never thinks about gotos, simulating recursion, or how to implement memory management.

It seems that parallel programming today has some of the same issues that sequential programming had in its early days. In the interest of performance there has been significant emphasis on low-level details, and often these low-level details are not properly separated from the high-level ideas in parallelism. Various libraries and language extensions require the user, for example, to specify stack sizes for forked tasks, implement their own schedulers or specify scheduling policies, or understand the mapping of memory to processors. When presented with this level of detail, programming seems much harder than it is, and the core ideas are obscured by the rest. The programmer might be left believing that specifying a stack size or scheduling policy when forking a task is an inherent necessity of parallel programming.

#### Deterministic vs. Non-Deterministic Parallelism

Surely the most important property of a program is correctness. A major concern about parallel programming, therefore, should be whether it makes it harder to write and maintain correct programs. Indeed it would be quite unfortunate if parallel programs were less reliable. Most programmers who have done some coding of parallel or concurrent applications, however, have experienced how much more difficult it can be to convince oneself that such a program is correct, or to debug it. Correctness of parallel programs is therefore a legitimate concern.

The problem with developing correct parallel programs has mostly to do with the non-determinism introduced by the different interleavings of instructions in parallel streams. Different executions might interleave the instructions differently depending on many factors, including what else is running on the machine, how many processors are being used, the initial state of the machine, the layout of the input in memory and its effect on cache or page misses, or varying delays in communication networks. A program might work correctly on odd numbers of processors but incorrectly on even. It might seem to work on Tuesdays but not Wednesdays. Such non-determinacy makes regression testing of questionable utility. It also makes debugging much more difficult, since a bug might not be repeatable.
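The classic lost-update race shows how interleaving alone changes the answer. The toy sketch below (my own, not from the post) deterministically simulates two interleavings of two "threads", each performing the read-modify-write `r = x; x = r + 1` on a shared counter:

```python
def run(schedule):
    """Execute two simulated threads, each doing `r = x; x = r + 1`,
    in the instruction order given by `schedule` (a list of thread ids)."""
    x = 0
    regs = {0: None, 1: None}   # one private register per thread
    pc = {0: 0, 1: 0}           # per-thread program counter
    for tid in schedule:
        if pc[tid] == 0:
            regs[tid] = x       # read the shared counter
        else:
            x = regs[tid] + 1   # write back the incremented value
        pc[tid] += 1
    return x

# Each increment completes before the next starts: both updates land.
print(run([0, 0, 1, 1]))  # 2
# Both threads read before either writes: one update is lost.
print(run([0, 1, 0, 1]))  # 1
```

A real machine picks the interleaving for you, differently on every run; this is precisely the behavior that defeats repeatable testing.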

Non-determinacy is not only a problem with testing and debugging programs, but in verifying their logic. Whether the verification is done in one's head, on the back of an envelope, in a lemma in a paper, or using automated verification tools, the non-determinism requires considering all feasible interleavings. Many bugs have been found in published algorithms for concurrent data structures, and it is well known in the verification community that verifying programs with arbitrary interleavings is much more difficult than verifying a sequential program.

Does this mean that we are doomed? The answer is no. The fact of the matter is that programs or parts of programs that don't involve non-determinacy from the environment do not need to be non-deterministic. Furthermore it is known how to have a programming model and/or development environment enforce determinacy. If we consider the Quicksort code above, although it is parallel, the logic is such that it will always return the same answer independent of how it is run. The programmer need not consider interleavings.

As described in a previous post on this blog, one way to get deterministic programs is to give the program a serial semantics. In this approach the programmer reasons about correctness with respect to a particular serial ordering of the program (typically a depth-first traversal of the fork-join structure). The language then guarantees that if the program has no race conditions, any feasible interleaving of the execution will return the same result as this sequential ordering. In the case of Cilk++, race conditions can be checked using a race detector. This is perhaps not a perfect solution, since races can only be detected on the test cases, but it is as good as any regression testing. Also, it is much easier to verify that a program or program fragment has no race conditions than to verify correctness in general.
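One way to see the guarantee: a race-free fork-join program returns the same answer as its serial elision. The Python sketch below (my own illustration; the fork/join here is a thread pool standing in for Cilk-style spawn/sync) computes a divide-and-conquer sum whose result is identical whether the forked half runs serially or concurrently, because the two halves touch disjoint data:

```python
from concurrent.futures import ThreadPoolExecutor

def psum(xs, pool=None):
    """Race-free divide-and-conquer sum. With pool=None this is the
    serial elision; with a pool, the left half runs concurrently.
    Because the halves share no mutable state, both give the same result."""
    if len(xs) <= 4:
        return sum(xs)
    mid = len(xs) // 2
    if pool is None:
        left, right = psum(xs[:mid]), psum(xs[mid:])
    else:
        fut = pool.submit(psum, xs[:mid])   # "fork" (runs serially inside)
        right = psum(xs[mid:], pool)
        left = fut.result()                 # "join"
    return left + right

data = list(range(1000))
with ThreadPoolExecutor(max_workers=4) as pool:
    assert psum(data) == psum(data, pool) == sum(data)  # 499500
```

The programmer reasons only about the serial version; the scheduler is free to interleave however it likes.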

#### In Summary: It Is All About Abstraction

Although presented separately, I believe the three problems I described are really three faces of the same problem. The problem is one of abstraction---both identifying the right abstractions and working at the right level of abstraction.

In the case of sequential vs. parallel thinking, I believe that at the right level of abstraction they are really not that much different. In both the summation example and Quicksort, when expressed at the right level the same code can be read as either sequential or parallel. This is not to say that all algorithms are naturally parallel when abstracted. In the case of merging two sequences, for example, the standard algorithm that starts at the front of each sequence is inherently sequential. On the other hand there is a very simple and efficient divide-and-conquer parallel algorithm. Converting programmers to parallel thinking will require some effort and time, but it is actually an opportunity to simultaneously raise the level of abstraction at which programmers approach problems.
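A standard formulation of the divide-and-conquer merge mentioned above splits one sorted sequence at its middle element, binary-searches for that element's position in the other, and merges the two halves independently. A Python sketch of my own, assuming both inputs are sorted:

```python
from bisect import bisect_left

def pmerge(a, b):
    """Divide-and-conquer merge of two sorted lists. The two recursive
    calls are independent and could run in parallel, giving polylog span
    instead of the O(n) span of the front-to-back sequential merge."""
    if len(a) < len(b):
        a, b = b, a               # always split the longer list
    if not a:
        return []
    mid = len(a) // 2
    j = bisect_left(b, a[mid])    # where a[mid] would fall in b
    # Independent subproblems: a parallel runtime may fork them.
    return pmerge(a[:mid], b[:j]) + [a[mid]] + pmerge(a[mid + 1:], b[j:])

print(pmerge([1, 3, 5], [2, 4, 6]))  # [1, 2, 3, 4, 5, 6]
```

The sequential two-finger merge and this one compute the same result; only the shape of the computation, and hence the available parallelism, differs.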

In the case of separating the wheat from the chaff, it is really about identifying the right abstractions for parallel programming and avoiding or deprecating other, non-essential issues. Of course there is no universal agreement on what the important issues are, but it is important to err in the direction of fewer core ideas rather than more ad hoc ones, even at the risk of losing some performance. My opinion is that as a programming language and runtime system Cilk++ does an excellent job of separating the wheat from the chaff. It introduces a small set of parallel constructs, each with few if any options, and presents a simple model for analyzing performance and correctness. Indeed I think that divide-and-conquer parallelism, parallel loops, analyzing performance in terms of work and span, and analyzing correctness using sequential semantics are perhaps the four most important concepts in deterministic parallel programming.

In the case of non-determinism, I also believe this is an issue of abstraction. Under the hood the program is always going to be non-deterministic, since there is no reasonable way to run a parallel program in lock-step in exactly the same way on every run. Beyond the technical problems, such lock-step execution would likely be terribly inefficient. The idea is therefore to abstract away the non-deterministic behavior so that the programmer sees only deterministic behavior and any non-determinacy is hidden in the run-time system. Daring programmers might expose non-determinacy at some levels but then encapsulate it into deterministic components. Even in non-deterministic environments where a program is interacting with concurrent requests from the outside world, the majority of the program should be deterministic, and non-determinism should be limited to what the task actually requires. I think this topic needs more study, and many parallel programmers need significantly more education on the implications of non-determinacy and on approaches to deterministic parallel programming.

In summary I believe that when parallel programming is brought to the right level of abstraction it is not inherently difficult. It requires parallel thinking, simple programming constructs, and limited use of non-determinism.

Interesting article. Personally, I'd like to think of Parallel Programming
in a way that transcends the moment.

This starts with the supposition that ... "we"
+) do not have, and
+) never will have
such a capability as a parallel debugger; i.e.:
+) serial debugger, yes
+) O(1) debugger (very small constant), yes,
+) O(P) debugger, never ever

Therefore, a lot depends on our answer to the question:

What is Parallel Programming?

Take I:

Parallel programming is nothing more than the management of a
collection of serial tasks

where:
management refers to the policies by which (a) tasks are scheduled,
(b) premature terminations are handled, (c) preemptive support is provided,
(d) communication primitives are enabled/disabled, and (e) the manner in
which resources are obtained and released,

and
serial tasks are classified in terms of either (a) Coarse grain --
where tasks may not communicate prior to conclusion, or (b) Fine grain --
where tasks may communicate prior to conclusion.

Take II:

There is a tight coupling between:
+) Management of tasks (across resources -- threads/processes) on one hand, and
+) Communication primitives any of the said tasks may utilize on the other-hand

I.e. they are two sides of the same coin.

So, for example, if I want to schedule a task in isolation from any other task,
I need to ensure that the said task will not be able to communicate with any
other task.

Conversely, for example, if I want N tasks to communicate among themselves,
I have no choice but to manage the N tasks as a single unit; i.e. for example,
if one task goes down, the rest of them are forcibly shutdown.

Take this approach to its logical conclusion, and you have a wide and rich
variety of (scheduler + communication primitives) tuples.

posted @ Friday, April 17, 2009 2:51 PM by Minesh B. Amin

I think the previous blog article by Steve Lewin-Berlin on serial semantics gets to your issue on parallel debugging. In particular, if you have a serial semantics then you only really need to debug the serial version. In this context parallel programming is just about performance and not about correctness. Of course this does not include what I referred to as non-deterministic parallelism...but as I argued, the use of such parallelism should be minimized. I actually like to call this form of non-deterministic parallelism "concurrent programming" as opposed to "parallel programming". People have been doing it for 40 years as part of multiprocess operating systems for machines that have no parallelism.

posted @ Saturday, April 18, 2009 10:06 AM by Guy Blelloch

Seems like the biggest difference between "concurrent programming" and "parallel programming" is the sheer scale
and the number of "finite state machines" distributed across all components. So, while the code that allows a
browser to interact with a web-server may involve "concurrent programming", the stuff that happens at the back-end
of the web-server seems much more involved than a multiprocess operating system.

I must admit my take on parallel programming is biased by the experience gained and lessons learnt while developing
two, and thus far, unique commercial parallel fine-grain applications. The main challenge in both cases was to
deliver scalable parallel performance by leveraging (a) massive amount of existing, highly specialized serial
software code-base, and (b) serial software engineers.

It seems like there are two general ways by which people approach this topic. One is "software developer" centric.
The other, and my preference, is "end-user" centric.

The root cause of most of the difficulties in parallel programming can be traced back to one key point.

There is an "asymptotic" parity between the resources available to the serial software developer on the one hand,
and the end-user of the serial software on the other-hand. This is most definitely not the case when it comes to
(non "GPU") parallel software -- a point borne out by the fact that, by my count, there are only __5__ commercial
parallel fine-grain applications in the market today that can deliver speed-ups in the range of between 40 and 140
using up-to 200 machines (am excluding applications that leverage GPUs).

Therefore, given the asymmetry in resources, the only way to make progress is to both acknowledge and manage the
inherent complexity. Specifically, the need to fix "parallel" bugs without having to reproduce them.

In other words, a parallel application must have three attributes: (a) serial component that is deterministic,
(b) parallel component that must be non-deterministic given that the hardware systems do not work off a single
global clock, and (c) an API that exposes the parallel component in a deterministic manner.

Furthermore, the parallel component, unlike the serial component, must be correct by construction. And, to the extent
that it isn't, all assumptions made by the parallel component must be enforced and tracked. Why? Bugs in the parallel
component must be fixed without having to reproduce them (due to the aforementioned asymmetry in resources).

posted @ Saturday, April 18, 2009 7:35 PM by Minesh B. Amin

I agree with the main point of the article - that we need to find ways to abstract parallel programming so that application designers can focus on the tasks at hand and not on the underlying hardware implementation. This will become even more evident as the number of processor cores on a chip grows. Is it feasible to expect a programmer to keep track of 16 parallel paths in sequential code?

Programming languages like LabVIEW are implementing this abstraction today. Using LabVIEW, programmers can "think parallel" by laying down parallel loops, subroutines, or any other piece of code. The LabVIEW compiler automatically maps sections of non-dependent code to different threads so that they can be executed in parallel.

I also agree that determinism needs to be carefully considered in parallel applications. Keeping track of the dependencies between pieces of parallel code is key to avoiding race conditions. Using LabVIEW, programmers specify a data-flow representation of their code which inherently keeps track of any dependencies.

In short, I think that, as mentioned, abstraction is the key. Programmers need to focus on writing code that contains both the dependencies and the information necessary to execute different parts in parallel. Graphical programming is one approach that has been verified to provide these benefits.

posted @ Monday, April 20, 2009 2:51 PM by Casey Weltzin
