Parallelism as a First-Class Citizen in C and C++: The Time Has Come

It is time to make parallelism a full first-class citizen in C and C++.  Hardware is once again ahead of software, and we need to close the gap so that application developers can exploit the hardware without resorting to low-level programming.

The time has come for high-level constructs for task and data parallelism to be explicitly added to C and C++.  This will make parallel programming in C and C++ fully portable, easily intelligible, and consistently decipherable by a compiler.

Library solutions provide fertile ground for exploration, but language solutions are ultimately superior. Intel Threading Building Blocks (TBB) has been the most popular library solution for parallelism in C++. For more than five years, TBB has grown and proven itself. It is time to take it to the next level, and move to language solutions to support what we can now be confident is needed in C and C++.
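
To make the contrast concrete, here is a minimal sketch of what even a simple parallel loop looks like as a library call, using TBB's compact lambda-based parallel_for overload (illustrative only; the function name scale_tbb is mine):

    #include <tbb/parallel_for.h>

    void scale_tbb(int n, float *a, float s) {
        // The parallelism lives inside a template library call. To the
        // compiler this is an opaque function invocation, not a parallel
        // construct it can analyze, schedule, or check.
        tbb::parallel_for(0, n, [=](int i) {
            a[i] *= s;
        });
    }

This works well, but the parallel semantics are invisible to the compiler – which is exactly the gap a language solution closes.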

We need mechanisms for parallelism that have strong space-time guarantees, simple program understanding, and serialization semantics – all things we do not have in C and C++ today.

We should tackle both task and data parallelism, and as an industry we know how.

  • Task parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to individual processor cores. Mapping should be the job of tools (including run-time schedulers), not explicit programming. Hence, the goal is to shift all programming to tasks, not threads. This has many benefits that have been demonstrated often. Simple fork/join support is fundamental (spawn/sync in Cilk Plus terminology), and a "parallel for" avoids spawning iterations one at a time in a serial loop, expressing the parallelism of the whole loop directly.
  • Data parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to SIMD instructions vs. multiple processor cores vs. attached computing (GPUs or co-processors). Mapping should be the job of tools, not explicit programming. Hence, the goal is to shift all programming back to mathematical expressions, not intrinsics or explicitly parallel algorithm decompositions. 

The solutions that show the most promise are documented in the Cilk™ Plus open specification (http://cilkplus.org).  They are as follows:

For data parallelism:

  • Explicit array syntax, to eliminate the need to write a (serial) loop across elements just to apply the same operation many times
  • Elemental functions, to eliminate the need for function authors to write anything other than the simple, single-element version of a function. This leaves the compiler to create wider versions for efficiency, coalescing operations to match the width of the SIMD instructions.
  • Support for reduction operations in a way that makes semantic sense. For instance, this can be done via an explicit way to have private copies of data that shadow global data used by parallel tasks (the number of which is not explicitly known to the program). Perhaps this starts to overlap with task parallelism…
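
A minimal sketch of all three in Cilk Plus syntax (per the open specification; the names saxpy and data_parallel_sketch are mine, and the elemental-function annotation is spelled __declspec(vector) by the Intel compiler and __attribute__((vector)) by the GCC implementation, so treat the spelling as illustrative):

    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    // Elemental function: the author writes only the scalar, single-element
    // version; the compiler may generate SIMD-width versions of it.
    __declspec(vector) float saxpy(float a, float x, float y) {
        return a * x + y;
    }

    void data_parallel_sketch(int n, float a,
                              const float *x, const float *y, float *z) {
        // Array notation: one statement replaces a serial element-by-element
        // loop (scalars like 'a' are broadcast across the section).
        z[0:n] = a * x[0:n] + y[0:n];

        // Mapping the elemental function over array sections; the compiler
        // can coalesce the calls to match the SIMD width.
        z[0:n] = saxpy(a, x[0:n], y[0:n]);

        // Reducer: each parallel strand gets a private view of 'sum', and
        // the views are combined deterministically when the strands join.
        cilk::reducer_opadd<float> sum(0.0f);
        cilk_for (int i = 0; i < n; ++i)
            sum += z[i];
        float total = sum.get_value();
        (void)total;  // use 'total' here
    }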

For task parallelism:

  • spawn/sync, to spawn a function as a task, and to wait for outstanding spawns to complete
  • parallel for, to avoid the need to serially spawn individual loop body instances – and to make it very clear that all iterations are ready to run concurrently.
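
As a minimal sketch of both constructs using the Cilk Plus keywords (cilk_spawn, cilk_sync, and cilk_for come from the open specification; fib is just the usual illustration):

    #include <cilk/cilk.h>

    // spawn/sync: the programmer names the tasks; the runtime scheduler,
    // not the programmer, maps them onto cores.
    long fib(int n) {
        if (n < 2) return n;
        long a = cilk_spawn fib(n - 1);  // may run in parallel with the next line
        long b = fib(n - 2);
        cilk_sync;                       // wait for the spawned call to complete
        return a + b;
    }

    // parallel for: all iterations are declared ready to run concurrently,
    // instead of being spawned one at a time by a serial loop.
    void scale_all(int n, double *a, double s) {
        cilk_for (int i = 0; i < n; ++i)
            a[i] *= s;
    }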

Like other popular programming languages, neither C nor C++ was designed as a parallel programming language. Parallelism is always hidden from the compiler and needs "discovery." Compilers are not good at "complex discovery" – they are much better at optimizing and packaging up things that are explicit. Explicit constructs for parallelism solve this and make strong compiler support more likely. The constructs do not need to be numerous, just enough for other constructs to build upon… fewer is better!

For something as involved, or complex, as parallelism, incorporating parallelism semantics into the programming language improves both the expressiveness of the language and the efficiency with which the compiler can implement parallelism.

Years of investigation and experimentation have yielded some great results. Compiler writers have found they can offer substantial benefits for ease of programming, performance, debugging, and portability.  These have appeared in a variety of papers and talks over the years, and could be the topic of future blogs.

Top-of-mind thoughts:
 

  • Both C and C++ are important.  No solution should be specific to only one.
  • There is strong value in adding some basic task parallelism and data parallel support as a first class citizen into both C and C++. The time has come.
  • Maintaining serial semantics (see four-reasons-why-parallel-programs-should-have-serial-semantics) is important. A program should be able to be understood (and executed) as a serial program. This is critical for making it easy to reason about a program that uses parallelism. (A serialization sketch follows this list.)
  • Task parallelism: nothing is more proven than the simple spawn/sync and parallel for of Cilk Plus.
  • Data parallelism: nothing is simpler than extending the syntax to make data parallel operation explicit via array operations such as a[:] = b[:] + c[:]. Fortran 90 added similar capabilities over two decades ago!
  • Data parallelism: elemental functions have an important role and should be included.
  • Task parallelism goal: shift parallel programming to explicit language features, making parallelism easy to express and exploit, so that it can be optimized by a compiler and more easily tested and debugged for data races and deadlock.
  • We need strong space-time guarantees in parallelism constructs.
  • Data parallelism goal: shift programming to EXPLICIT and EASY-TO-FIND (exploit) data parallelism, so all varieties of hardware can be addressed
  • Everything proposed here is incredibly easy to teach. The power that a compiler can place underneath is a big bonus, of course! That power makes all of these constructs very compelling. The data parallel syntax is immediately convincing by its compact form, but the task parallel constructs are convincing as well.
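
On the serial-semantics point (promised above): a Cilk Plus program can be built and debugged as an ordinary serial program simply by eliding the keywords. A minimal sketch, assuming the stub header shipped with the Cilk Plus distribution:

    // Including the stub header first maps each keyword to its serial elision:
    //   cilk_spawn -> (nothing),  cilk_sync -> (nothing),  cilk_for -> for
    #include <cilk/cilk_stub.h>
    #include <cilk/cilk.h>

    long fib(int n) {
        if (n < 2) return n;
        long a = cilk_spawn fib(n - 1);  // now just a plain function call
        long b = fib(n - 2);
        cilk_sync;                       // now a no-op
        return a + b;
    }

One source serves both builds: reasoning and debugging start from the serial execution, and a race-free parallel execution produces the same answer.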


None of this is radical – and none of it need be proprietary.

If we keep it simple (KISS) and don’t get carried away adding other stuff, we can add these and make a fundamental and important advance for task and data parallelism in C and C++… the languages that lead the evolution to parallelism, and that are becoming more (not less) important in the future.

- james
 


Comments
anonymous:

I've been working on a utility library that provides some constructs to simplify CUDA coding....

Using it, you can write programs that don't need CUDA installed or available, run and debug them in serial mode, and later recompile on CUDA with zero lines of code changed. I've written a large, complicated simulation program with it and it worked perfectly. It's a very C++ish API, with the aim of reducing the coding effort.

The next plan is to allow OpenMP and perhaps TBB to be seamlessly integrated into it, so you can forget about the framework stuff and concentrate on your parallelism.

I intend to put this out as GPL/LGPL as soon as I finish a stage of rewriting. I'm also planning to include some of what OpenCV does (frankly, OpenCV's code quality stinks – it looks like it was written by scientist types in the 1990s). I'm also looking at implementing image processing using GL shaders, so you can take advantage of GPU parallelism without CUDA or OpenCL.

Here's how a simple bitmap rotation example looks in this framework current version : http://pastebin.com/egaH1eu8

Contributors welcome...

Paul Jurczak:

James, Cilk Plus certainly is a strong contender. Thumbs up for making it an open source project!

James R.:

The good news, Paul, is that my "call to arms," as Raymond called it, is my way of voicing similar concerns. I don't mind the variety on the sidelines where we experiment and learn. I think it gets disruptive when we reach the point where we all need a solution, but there are too many of them – and they differ for the wrong reasons. It's time to start the process of standardizing on high-level (abstract) and effective parallelism capabilities for C and C++. And PLEASE – make them the same (except, of course, when C++ needs something that simply doesn't translate to C). By starting now, the debate can shift to "what to standardize" instead of "if"… and we might have an adopted standard this decade. I'm patient enough… but ready to see us get onto this path. I'm also bold enough to say "we think we have the answer mostly right with Cilk Plus, and you can try it out!" Not because I'm stuck-up and fixed in my ways… but because I want to hear from those who would say "we have a better idea, and we've implemented it, and we can compare and discuss!" Let's do so!

Paul Jurczak:

It is a little bit of a mixed blessing to see a major new parallel toolkit appear every couple of months. Variety can be good, but parallel programming is difficult enough without having to worry about which tool will flourish and survive the next winter. Is it going to be Cilk Plus, ArBB, C++ AMP, CUDA, OpenCL, or something else?

RaymondT:

Hi James,

Good post on the call to arms :) liked it a lot. I totally agree with your assessment that "maintaining serial semantics" is important, though I think it's incredibly challenging to implement "shift programming to EXPLICIT and EASY-TO-FIND..." data parallelism; then again, I'm not aware of all the current literature on this subject. Having programmed in CUDA, the closest thing I can relate to w.r.t. spawn would be the '<<<>>>' syntax, but admittedly CUDA doesn't have the expressivity of Cilk at this point in time; I guess that's due to their different approaches and philosophies.
