| October 29, 2009 12:00 AM PDT | |
The Question...
A few people have asked, more or less, "How can I return from a function without waiting for functions I called to complete?"
The answer is, you can't.
In Cilk++ there is an implicit cilk_sync at the end of every function. These three functions behave the same for practical purposes:
void f1() { f2(); }
void f1() { cilk_spawn f2(); }
void f1() { cilk_spawn f2(); cilk_sync; }
In each case function f1 does not return until after f2 returns.
Why Sync?
The sync helps reduce complexity of the implementation, and it also discourages complexity in programs, but most importantly it makes Cilk++ superior to its predecessors in two ways that we believe are essential to effective resource management and ease of use.
First, the implicit sync ensures that resource use of a program does not grow out of proportion to the parallelism in the program. Some older parallel languages turned out to have unbounded resource use -- the longer your program runs, the more memory it uses. In contrast, Cilk++ has been proven to have bounded resource use. On a P-processor system, a Cilk++ program uses no more than P times as much stack memory as it does on a single-processor system. The proof of this property depends on the structure of the control flow graph, including the assumption that functions sync before returning.
Second, a sync at the end of every function helps give a race-free parallel program the same behavior as the corresponding serial program ("serial semantics"). A call to a function works the same regardless of whether the called function contains internal spawns. You don't have to worry about whether the function (which may be in a library you don't have source code for) left another function running in the background that you have to clean up after. Only the spawns visible in the current scope affect a function's behavior.
What Cilk++ is (and is not)
More generally, Cilk++ is not designed for explicit task management: cilk_spawn is not pthread_create. Explicit task management is complicated. Cilk++ is much simpler. Cilk++ keywords grant permission for parallel execution rather than commanding parallel execution. This seemingly small difference actually gives Cilk++ massive advantages.
One important implication is that cilk_spawn is cheap. A spawn just does a little bookkeeping: making a note in a data structure saying that a function call may execute in parallel and checking on return from the function whether it actually executed in parallel. In our current implementation, the cost of cilk_spawn is comparable to the overhead of about 4 function calls. In contrast, pthread_create costs about 1800 times the cost of a function call.
When Cilk++ does need to migrate work to load-balance among the processors, however, it can be expensive -- tens of thousands of cycles instead of tens of cycles. The Cilk++ runtime system guarantees thread migration to be rare if the application has sufficient parallelism. (What is sufficient is a topic we'll address in the future.) Consequently, the overhead of the Cilk++ runtime system for most applications is at most a couple percent, and performance of Cilk++ applications generally scales up linearly with the number of processors.
The requirement that every function syncs before returning means that certain styles of programming work better with Cilk++ than others. Divide-and-conquer techniques tend to work well, for example. Quicksort is an example of such an algorithm. A cilk_for loop is another -- when you change the keyword "for" to "cilk_for" the compiler rewrites your loop as a recursive function that repeatedly divides the range of the loop.
Some multithreaded programs, taken as a whole, do not map naturally to Cilk++. These programs may pass messages back and forth between threads or poll for changes made by another thread. The problem with these programs is that they do not have serial semantics. Serial C++ programs never interleave execution of different functions. A solution is to look for a part of the program with serial semantics. For example, if a GUI thread and a compute thread exchange messages, it may be possible to use Cilk++ within the part of the compute thread that does the work.
Because you can't force creation of a thread, you don't need to explicitly destroy threads either. Cilk++ garbage collects spawned functions. You can spawn off a million function calls in a long running loop without a cilk_sync, if you like. On a four core system your program won't use more than four times the stack space of a single-threaded program. It may be inefficient for one function to spawn a million times (another topic we'll talk about that later), but our system guarantees that the result will not be a million threads running in parallel.
What this means for Cilk++ developers
- Give your functions serial semantics. A function does some work and returns when it's done.
- Think about decomposing the problem into parts, not managing threads.
How best to divide the problem into parts, and what to do when your program doesn't seem to fit these principles, are subjects for future discussion.
For more complete information about compiler optimizations, see our Optimization Notice.

