| Last Modified On : | October 28, 2009 12:59 PM PDT |
Rate |
|
Posted by Matteo Frigo originally on www.cilk.com on Mon, Apr 13, 2009
Some users of Cilk++ do not like the fact that Cilk++ functions and C/C++ functions are distinct types. (We do not like this distinction either.) The purpose of this (somewhat nerdy) article is to explain why this distinction exists in the first place, and to show that bad things happen in systems that do not make such a distinction. Briefly, the source of the difficulty is that C and C++ are normally compiled on a linear stack, whereas parallel code really should target a cactus stack. I now explain what this means.
Pretty much all implementations of C and C++ use a linear (contiguous) stack as storage for procedure activation records (also called stack frames). A procedure call pushes the return address onto the stack, and the callee allocates space on the stack for its own local variables etc. This simple linear stack mechanism works very well in a serial world, and it is really efficient.
The linear stack structure is generally inadequate for parallel code, however. To illustrate why, consider the pseudo-code below, which recursively spawns four task in a balanced binary-tree fashion. (The comments in the code show which processor executes which task, as will be discussed in detail later.) While the program is expressed in pseudo-Cilk syntax, you should think of it as an abstract program that could well be expressed in Cilk++, TBB tasks, OpenMP tasks, or your own favorite tasking system.
A() { /* A executes on P0 */
spawn B(); /* B executes on P0 */
spawn C(); /* C executes on P1 */
sync;
}
B() { /* B executes on P0 */
spawn D(); /* D executes on P0 */
spawn E(); /* E executes on P2 */
sync;
foo(); /* foo is a C or C++ function */
}
C() { /* C executes on P1 */
spawn F(); /* F executes on P1 */
spawn G();
sync;
}
main() {
A();
}
Assume that we have three processors P0, P1, and P2, available for executing the program. Assume that you insist that each processor have its own linear stack, and that the calling conventions of A, B, and C be compatible with normal C/C++ code. (These assumptions will be forced upon you if your task scheduler is implemented as a C or C++ library.) Also assume that a scheduler allocates the tasks A-G to the three processors. We don't really care about how the scheduler works, but the scheduler should be "reasonable" in the sense that it tries not to leave processors idle if there is work to do somewhere in the system. (Otherwise, you can just implement a scheduler that maps the whole program onto one processor, which is not too useful.)
Execution of the program starts with A executing on P0, say. A now spawns B and C. Assume that B stays on P0, and that C executes on another processor, say P1. Let's now focus on B, which spawns D and E. As before, assume that D stays on P0, and that E is scheduled on P2, which is idle. Meanwhile, P1 executes C, which spawns F (on P1) and G. The situation is now as follows: P0 is working on D, and its stack contains D, B, and A. P1 is working on F, and its stack contains F and C. P2 is working on E, and its stack contains E only. G is ready for execution but nobody is working on it, because all processors are busy doing something else.
Assume now that D completes, and thus P0 becomes idle. P0's stack still contains the activation records of B and A, which cannot be deallocated while E is still executing. Because task G is the only work available in the system, a "reasonable" scheduler would start executing G on P0. P0's stack now contains G, B, and A. (This is where things start breaking down, because G is not a descendant of B, and yet G is pushed onto the stack after B. However, we are not in trouble yet.)
Assume now that E completes. Both children of B have now terminated, and in principle B can continue after the sync and call foo(). Task B cannot really continue at this point, however, because the normal C/C++ calling conventions assume that foo() can extend the stack where B resides. If foo() were executed, it would overwrite the stack portion used by G, right after the activation record of B. The scheduler is now stuck: work is available but it cannot be scheduled because of lack of stack resources. (Here, "stuck" means "slow", not "deadlocked").
The really annoying thing about this example is how simple it is. All we are trying to do is traverse a balanced binary tree with four leaves, and still the scheduler cannot proceed because of a bottleneck that the programmer cannot control.
You may think that getting stuck is bad enough (or at least "unreasonable"), but in fact the situation is even worse: The program may run out of memory. Specifically, in our example, P0's stack may overflow. For example, assume that B and G need a "large" activation record about as large as P0's stack, and all other tasks use a "small" activation record of size approximately 0. You can execute the program sequentially if you remove all spawns and syncs, and the stack will not overflow because B and G are never live at the same time in a sequential execution. However, in our example, G and B are both on P0's stack at the same time, causing overflow. (Good luck preallocating a stack large enough that is guaranteed to work in all cases).
What went wrong? The scheduler got stuck because we imposed three constraints on our system: we want activation records to be allocated on linear stacks, we want to employ exactly one stack per processor, and we want no processors to be idle if ready tasks exist in the system. It appears that you cannot satisfy have all three constraints at the same time.
How do practical systems work around this problem? Cilk++ (and its predecessor MIT Cilk) relaxes the constraint that activation records be contiguous. In other words, Cilk++ does not use a linear stack, but instead it allocates activation records on a cactus stack, which is really a heap-allocated tree of activation records. (We use a special-purpose fast memory allocator for this purpose.) This choice lets us implement a system that is guaranteed to use the equivalent storage of one stack per processor (although the stacks are now noncontiguous), and where the scheduler never gets stuck. The price to pay for these properties is that Cilk++ and C++ are two distinct languages with different calling conventions, and the type system enforces this distinction.
The Intel TBB library takes a different approach. TBB uses linear stacks, as is unavoidable for a C++ library, and it restricts the scheduler so as not to run out of stack space. Specifically, if I understand correctly, in TBB a "thief" processor only steals a tasks if the task's stack depth is at least as large as the thief's current stack depth. Consequently, TBB cannot always execute a ready task when a processor is idle. (Arch Robison, the principal author of TBB, is a smart guy and he is well aware of this issue).
You can also imagine systems that throw RAM at the problem by allocating more stacks on demand, up to a certain threshold, above which which they must leave processors idle. (Good luck understanding where either the time or the memory is going if you have one of these environments.) You can further imagine various virtual memory tricks that remap virtual cactus stack pages in order to maintain the illusion of linear stacks. (We did this in MIT Cilk in 1996, and it is a can of worms). If you have a wild imagination, you can also envision various hardware mechanisms for supporting cactus stacks. The Burroughs B6500 had such mechanisms in the late '60s. (By the way, the cactus stack is also called the spaghetti stack, and Guy Steele wrote a paper "Macaroni is better than spaghetti" describing various improvements. I am not kidding).
To go back to our original question, why did we (Cilk Arts) choose to implement Cilk++ on top of a cactus stack, thus forcing Cilk++ and C++ to be distinct languages? Briefly, here is our position:
In this view, trading some typing convenience for unrestricted scheduling appears to us to be the best compromise.
COMMENTS
posted @ Thursday, April 16, 2009 1:49 AM by Aydin Buluc
posted @ Friday, May 08, 2009 1:13 PM by Chris Ryland
