Four Reasons Why Parallel Programs Should Have Serial Semantics

Submit New Article

Published On :   October 29, 2009 1:00 AM PDT
Rate
 


by Steve Lewin-Berlin

Some parallel programming environments require the developer to relearn the fundamentals of programming in order to think in parallel. Cilk++ takes a different approach. One basic design principle of Cilk++ is that Cilk++ programs have serial semantics, that is, a Cilk++ program can be understood (and executed) as a serial program. The Cilk++ keywords were designed to make this serialization look similar to the parallel Cilk++ program.

 

This simple but powerful principle guides more than just the appearance of Cilk++ programs. It also simplifies the process of developing and testing a Cilk++ program, and it allows Cilk Arts to provide powerful, efficient, and provably correct tools.

(Serial semantics should not be confused with sequential consistency, which is a concept invented by Leslie Lamport and discussed extensively in the literature of parallel computing. A parallel computing system with sequentially consistent memory guarantees that, in Lamport's words, "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." In contrast, serial semantics means that the program can be executed using a single thread of control as an ordinary serial program - one does not need a conceptual parallel model to understand -- or execute -- the program.)

Cilk++ Serial Semantics

First, let's take a closer look at how Cilk++ provides serial semantics.

Cilk++ provides three new keywords: cilk_spawn, cilk_for, and cilk_sync. Each of these has an intuitive serial interpretation:

Keyword

Parallel meaning

Serial meaning

cilk_spawn

Allow a called function to run in parallel with its caller

Call a function normally, where the caller waits for the called function to return before resuming

cilk_sync

Wait for parallel activity to complete

Do nothing (there is no parallel activity)

cilk_for

Allow multiple iterations of the body of a loop to run in parallel

Execute a standard for loop in which loop iterations execute one at a time

As you can see, a Cilk++ program can be converted to a serial program quite easily: simply replace cilk_for with for, and delete the cilk_spawn and cilk_sync keywords. The Cilk++ development system provides a compile-time switch to build the "serialization" of a Cilk++ program. Using Cilk++ for Linux, the option [-fcilk-stub] is used with cilk++ to "stub out" Cilk features. Using Cilk++ for Windows, run cilkpp with the [/cilkp cpp] option.

Okay, I hear you saying, "Fine, your fancy parallel program can devolve into a serial program. So what? If I wanted the serial version, I would have just used C++ in the first place!" Glad you raised the issue! A parallel programming model with serial semantics maintains four real advantages over program models without serial semantics.

  1. The equivalent serial C++ program can easily be debugged and analyzed using existing development tools. This is an ideal way to debug a parallel program without worrying about parallel interaction, and without needing to consider multiple stacks and program counters. Also, as vanilla C++, all symbol and debug information is available to standard system and third-party tools. Tools that are easy to run on the serialization but might require custom versions to operate on the parallel program include performance profilers, code coverage tools, security analysis tools, memory leak checkers, and any other tools that work with C++ programs.
  2. The serial semantics of Cilk++ programs allows Cilk Arts to build better tools that analyze the performance and correctness of Cilk++ programs, and these tools offer strong guarantees about the program. For example, our Cilkscreen race detector runs the test program in a serialized mode on a single processor. With information about the parallel structure of the program (i.e., where the spawns and syncs occur), Cilkscreen can analyze all possible schedules of the program that might execute on any number of processors. In other words, Cilkscreen can detect whether race conditions exist that could manifest under any possible execution of the program. The serial semantics of Cilk++ permits Cilkscreen to analyze the parallel program while running in a single worker (the Cilk++ name for a thread of execution) on a single processor.
  3. The serial semantics of Cilk++ provides a real advantage to program testing and quality assurance. Cilk++ is designed to ensure that parallel programs written in Cilk++ can be deterministic, reliably producing results identical to the serialization. This guarantee makes it easy to leverage traditional testing tools that assume that multiple runs of a program always return the same result if given the same input. The serial semantics of Cilk++ eliminates the need to consider timing and scheduling considerations when evaluating program correctness, and it makes it simple to compare the output of multiple program runs for consistency. Repeatability is one of the key properties that make Cilk++ programs testable and reliable.
  4. Serial semantics provides high performance while allowing you to design your program without specifying the number of processors on which the program will execute. The Cilk++ runtime scheduler takes care of efficiently load balancing your program across however many processors are available. When a portion of your parallel computation executes on a single processor, Cilk++ can execute it just like ordinary C++ code, taking full advantage of all the compiler optimizations and runtime efficiencies that a good C++ system offers. By starting from good single-core performance, Cilk++ ensures that a program with sufficient parallelism gets good speedup whether it is run on a large number of processors or just a few.

What's the Catch?

Perhaps this all sounds too good to be true - and maybe now you're thinking, "Okay, so what‘s the catch?" All right, there are a few things I've glossed over. For one, since there are some kinds of parallelism that actually don't have serial semantics, you can't write programs using these parallelization strategies in Cilk++. For example, Cilk++ doesn't support producer/consumer, software pipelining, or message passing. Nevertheless, although we're giving something up to have to serial semantics, our experience is that less is more. The four advantages outlined above readily make up for any loss in generality.

There's a second catch, however. Note that in the third point above, I said that Cilk++ programs can be deterministic, not that they are deterministic. How so? Well, the output of a program with a determinacy race (see "Are Determinacy Races Lurking in Your Multicore Application") can depend on how the program is scheduled on multiple processors, and the race condition that leads to this nondeterministic behavior is usually a bug. In most programming environments, you'd be stuck at this point trying to find the race bug by laborious program inspection. In contrast, in the Cilk++ programming environment, Cilkscreen can help you find and eliminate these races, giving you your determinism back. Thus, this "catch," though problematic, can be dealt with effectively precisely because of serial semantics: in effect, Cilkscreen uses the serialization as a benchmark for correctness of a parallel execution.

Sometimes, however, a Cilk++ program is intentionally nondeterministic. For example, some search programs use a concurrent hash table to "remember" the results of intermediate subsearches so that the subsearch doesn't have to be reexecuted if it reoccurs. In this context, each slot of the hash table may contain a mutual-exclusion lock to allow parallel branches of a computation to safely access and update the items stored in the slot. Cilkscreen correctly ensures that races are not declared on operations protected by the same lock.

In practice, however, we have found that most Cilk++ programs require few, if any, locks. In particular, Cilk++'s parallel control constructs obviate the need for the locks that other parallel programming models require for interthread communication and synchronization. Moreover, for many common situations where locking would seem to be necessary to avoid a race, Cilk++ provides hyperobjects, a novel data structure that can resolve races on global variables without sacrificing performance or determinism. By avoiding locks and using hyperobjects, Cilk++ sidesteps many performance anomalies caused by locks, such as lock contention, which can slow down parallel programs significantly, and deadlock, which may cause your application to freeze midexecution.

The Bottom Line

To sum up, Cilk++'s serial semantics provide three key benefits - performance, reliability, and productivity:

  • Scalable performance comes from a model that doesn't rely on the programmer knowing the number of available processors in advance. Developers can use traditional serial performance tuning tools to find hotspots and improve both serial and parallel performance.
  • Reliability is enhanced through the testing advantages of repeatability and determinism, as well as the ability to build an efficient and provably correct race detector that can compare the parallel semantics of the program with the serial semantics provided by the program's serialization.
  • Finally, programmers can be more productive when they use familiar paradigms for new development, and of course, legacy serial code can be converted into Cilk++ easily, because it does not need to be rewritten to adopt a completely different programming model such as data-parallel or message-passing styles.

Tags: 

COMMENTS

Nice article, and I'm always glad to hear about different approaches to concurrency issues.  

The only thing that I see here is that it suffers from the master thread, worker thread problem of being confined to extremely well defined sets of data. Much like OpenMP this works great when you're talking about a multicore desktop a with shared memory.  

Using primitives like this is great idea but it doesn't solve the whole problem which is both shared and non-shared memory types of concurrency. 

/revealbias 

Check out erlang :D

posted @ Friday, December 12, 2008 4:43 PM by John Bender


John, 

Your observation that Cilk++ is designed for shared memory systems is absolutely correct, though not all shared memory systems fall in the desktop class. The trend that we see is toward larger shared memory systems. 

I think Erlang offers some good ideas, but it doesn't solve the same problems that Cilk++ addresses, such as providing a parallel strategy for legacy code, easing the transition to parallelism for programmers trained to solve problems serially, and offering a high-performance and reliable solution that scales from one to many processors with minimal overhead. 

Our goal is not to solve all concurrent programming problems with a sitngle solution, but to offer a great answer to what we think is the problem that most developers will need to tackle over the next few years - making the transition from single processor systems to multicore parallel computers.

posted @ Friday, December 12, 2008 5:33 PM by Steve Lewin-Berlin


@Steve 

Spot on with both points. Shared Memory does not mean desktop class, and there is room for both solutions to be sure.  

I certainly the agree that creating language constructs to ease the pain of parallelizing code is pretty important these days.

posted @ Monday, December 15, 2008 8:08 AM by John Bender


Hi Steve! 

I disagree with the premise that scalable parallel programs can be expressed with control flow style on a shared memory space. Sure, we can upgrade legacy code for shared memory multiprocessors with a few markups, but at some scale, physical reality requires us to abandon the shared memory view (Matteo pointed me to the 'Horizons of Parallel 
Computation' paper which proves it).  

When shared memory control flow style stops scaling, composable dataflow pipelines provide the only reasonable mechanism for optimizing process locality and interprocess communication (hardware designers call it "place and route"). 

For process graphs with each node having degree 2 (linear pipelines), the ratio between best and worst case placement of processes in a processor array grows as O(N^(3/2)) in the number of processing elements. At 8 cores you don't have a big problem synchronizing your shared memory space, but how will Cilk address communication issues among 64 cores?

posted @ Tuesday, December 16, 2008 12:48 PM by Amir


Amir, 

I have personally used shared-memory Cilk successfully on 256 processors ten years ago, when the 256 processors filled a room. (The room and the processors were known as ``SGI Origin''). We used Cilk to play chess at the time, and it was working perfectly well on that many cores. 

That being said, I do find Gianfranco's argument persuasive, and it is probably true that in the fullness of time every computer system becomes a huge systolic array. However, we are nowhere close to that point (yet).

posted @ Tuesday, December 16, 2008 1:55 PM by Matteo Frigo


Hi Matteo, 

These days 256 cores fills 4 slots on a PCI bus :) 

Let me step back from my original statement a little. Many big problems (perhaps most of the biggest problems) do not have the sort of interprocess communication constraints to limit them in even widely distributed processor networks. Cilk will rock for these problems. Folding@home, crypto-cracking, chess, and many data parallel operations fit this space. Even the NP-Hard placement optimizations distribute very well with minimal interprocess communication so I don't need to optimize the placement of the parallel processes that are optimizing the placement of some other parallel processes...  

Other problems, like N-body simulations, have a hopeless all-to-all IPC topology and can't be rescued by any placement optimizations. 

But then there's this huge middle ground of problems like funnel sort, spectral transforms, filter banks, FDTD methods and production systems where process placement is built into the way we think about the problem. As we scale these problems, the widely held false assumption of O(1) memory access time breaks down to the physically accurate O((nln n)^(1/2)) from HOPC and the only option is to abandon the global memory model. Someone will have to rewrite these algorithms into communicating processes instead of synchronized tasks. 

I'd like to explore this more. I'll have to talk to you when I'm done with my current contract.

posted @ Tuesday, December 16, 2008 3:31 PM by Amir


Cilk appears, from what I can see, to be an evolution of the parallelization methods used in, say, Occam (ie: simple, effective, explicit parallelizing) with a large dose of the ideas from Manchester University's 1974 paper on parallel programming documenting how it should mostly be done in the compiler with minimal-to-no intrusion into the source. 

Yes, these are old ideas, but let's face it. The Transputer failed to go anywhere useful and the Manchester paper was largely ignored. Parallel methods since then have been horribly complex and daunting to the point that the maxim has become Abandon Hope, All Ye Who Press Enter Here. 

Yes, shared memory is a headache, as DSM (Distributed Shared Memory) is often slow or only available on very expensive iron. Some languages add a keyword of "mobile" to denote when a given block of parallel code need not communicate over shared memory and so isn't faced with this problem. I can't see going from three keywords to four killing the Cilk++ team, but if I were them, I'd want to make sure that was the right fourth keyword to add. If you want to keep the keywords down, you want to make sure that an addition is the best of all possible additions. A suggestion, even if it did sound neat, would be completely insufficient to base a fairly significant code change on.

posted @ Tuesday, December 23, 2008 5:52 PM by Jonathan Day