About the pain of parallel programming

If you’ve ever heard about parallel programming, it probably sounded like a painful endeavor. Those with experience know this is largely true of “traditional” approaches, which bolt parallel constructs onto mainstream programming languages like C++ or Fortran. In light of decades of research in parallel computing this is an irritating situation - and more so now that multi-core systems have been mainstream for years.

 But why does parallelism hurt? And does it really have to?

You probably guessed already that I don’t think it needs to be that painful, but let’s defer that for a moment. So what is it that makes traditional approaches so painful? The pain lies in what they require the programmer to think about, and more importantly in what they leave implicit. The programmer is required to explicitly state which computations should run in parallel. Implicitly, however, the programmer must also ensure that no race conditions can occur; it is up to the developer to guarantee that the computations they declared as parallel do not depend on each other. Without offering a way to specify/define dependences (if one computation depends on another, we say there is a dependence between the two), these tools simply require (or assume) that none exist. Prominent examples of tools for C/C++ that make you guarantee the non-existence of dependences are OpenMP, MPI, TBB and Cilk. Without becoming too philosophical, it should be obvious that proving the non-existence of something is very hard (and in some cases impossible). Not uncommonly, this approach brings programmers to the verge of despair by making them reason about what doesn’t exist.

That said, the key to less painful parallelism becomes apparent. It is not in syntax, more powerful parallel constructs, automatic parallelization or new programming languages. As long as dependences stay implicit, getting parallelism won’t become significantly simpler. “Inverting” the problem, however, simplifies it: it is much easier to state what exists than to prove what doesn’t. If we define what can *not* go in parallel, i.e. which (dependent) computations must run in order, then we also know what can go in parallel: everything else. So why not just let the programmer define the dependences (or the required orderings) between computations? It is a simple thought process, mostly because the necessary information is known to the developer anyway: when writing a computation kernel, the programmer knows what input/information it needs, so it is clear what it depends on. If all such dependences are declared, executing independent pieces in parallel is an almost straightforward task.

In other words, the key is making the relevant information explicit rather than tediously working around what stays implicit. Serial languages (like C/C++) hide this information even though the programmer knows it: they have no way to express it directly and natively. Tools for automatic parallelization then try to uncover what is not stated explicitly - in general a futile endeavor.

So what’s needed is a program structure that explicitly expresses the dependences actually required to find a semantically correct execution order of compute kernels. A runtime can then take care of the parallelization, and constructs like “parallel_for”, pipelining and the like become superfluous. Expressing the dependences removes the need to make parallelism explicit, or even to think about it, yet still exposes the parallelism available in the program.

Inverting the traditional approach to parallelism - thinking about what needs to be serial/ordered instead - not only makes parallelism easier but also uncovers more of it, because no explicit construct limits the program to one specific type of parallelism.

There are different approaches to this, such as streaming and functional languages. In a series of articles I will introduce Intel’s C++ library/runtime implementation of CnC*, which directly implements the idea above. Without specializing to a certain domain, it provides purity of design and proven scalability from single-core to multi-core systems up to clusters of multi-core workstations. I will demonstrate that the thought process is simple and that even detecting errors is usually easy, because the runtime has the information needed to issue meaningful warnings and messages.

For the impatient: CnC’s homepage offers the free download, papers, talks and tutorials.


* “Concurrent Collections” (formerly TStreams) is a programming model developed by Kath Knobe. Its new name is rather misleading, so we prefer the abbreviation “CnC”. The full name of Intel’s offering is “Intel® Concurrent Collections for C++”.