Cilk Wins Most Influential PLDI Paper Award

Submit New Article

Published On :   October 29, 2009 1:00 AM PDT
Rate
 


by Matteo Frigo

Last week I went to the 2008 Programming Language Design and Implementation (PLDI) conference to receive the award for the most influential paper published in PLDI 1998. The award committee had the following to say about our paper:

"The 1998 PLDI paper "Implementation of the Cilk-5 Multithreaded Language" by Matteo Frigo, Charles E. Leiserson, and Keith H. Randall introduced an efficient form of thread-local deques to control scheduling of multithreaded programs. This innovation not only opened the way to faster and simpler runtimes for fine-grained parallelism, but also provided a basis for simpler parallel recursive programming techniques that elegantly extend those of sequential programming. The stack-like side of a deque acts just like a standard procedure stack, while the queue side enables breadth-first work-stealing by other threads. The work-stealing techniques introduced in this paper are beginning to influence common practice, such as the Intel Threading Building Blocks project, an upcoming standardized fork-join framework for Java, and a variety of projects at Microsoft."

At Cilk Arts we are really honored by this recognition. I would like to express my deepest gratitude to Kathleen Fisher and to the rest of the award committee for honoring us in this way.

Our 1998 paper is based on a couple of simple ideas that taken together yield a powerful point of view on parallel programming.

The first idea is that creating a new thread does not have to be expensive. For example, the spawn keyword in the MIT Cilk implementation costs you something like 10-20 cycles. (Mohr, Kranz, and Halstead were the first to show how to do this with their lazy task creation technique, which we optimized by eliminating locks in the common case.)

From the point of view of the programmer, cheap expression of parallelism eliminates the whole issue of thread granularity that you have in, e.g., pthreads programs, where you must make sure that the thread that you create performs enough work to amortize the overhead of the thread creation. (Then you get into thread pools, which are a nightmare to manage when you have more than one library managing its own threads.) In Cilk, in contrast, you do not have to worry too much about the cost of a spawn: If you think that a function can be executed in parallel, just spawn it---it will probably not slow down your program.

The second idea is harder to explain, and you will have to read the paper for the details, but it boils down to this. In the execution of a parallel program, one can distinguish between cycles that would be spent on a single-core execution (called the “work term” in the paper), and cycles that are spent solely because of parallelism, e.g., spent migrating work from one core to another (called the “critical-path term” in the paper). When implementing a parallel programming language, the implementer usually faces tradeoffs between reducing overheads in the work term or reducing overheads in the critical-path term. The idea is that, in Cilk, we *postulate* that the work term is more important.

Why is the work term more important than the critical path term? (By the way, in more recent publications we don't say “critical path,” and use the word “span” instead.) The paper explains that the work term is more important whenever your program has more parallelism than the number of availables cores. This tends to be the case in Cilk programs because 1) expressing parallelism is cheap, so Cilk programs tend to spawn millions of threads, and 2) if your program did not have enough parallelism you would not bother running it on many cores anyway. This assumption of abundant “parallel slackness” leads to several implementation choices explained in the paper, such as the “two-clone” strategy and an asymmetric mostly lock-free protocol for work-stealing.

These ideas form a powerful point of view that continues to be fruitful to this day, and that has impacted the implementation of the new Cilk++ features such as exception handling and hyperobjects. I look forward to writing another paper on these topics some day.

PS: I would also like to mention another deserving paper from PLDI 1998: “Optimizing direct-threaded code by selective inlining,” by Ian Piumarta and Fabio Riccardi. This paper describes the technology that was later used in QEMU, and it qualifies in my opinion as one of the greatest hacks of all time.


COMMENTS

We provided a link to this blog on http://www.multicoreinfo.com , which is a hub for multicore processor resources.

posted @ Tuesday, June 17, 2008 11:40 AM by Suren