Global Variable Reconsidered

Submit New Article

October 29, 2009 1:00 AM PDT


by Charles Leiserson

In a widely applauded article published in 1973 and entitled, “Global variable considered harmful,” Bill Wulf and Mary Shaw argued, “We claim that the non-local variable is a major contributing factor in programs which are difficult to understand.” Thirty-five years later, however, nonlocal variables are still very much in vogue in production code. Moreover, as software developers look toward multicore-enabling their legacy codebases, nonlocal variables pose a significant obstacle to parallelization, because they can cause race bugs, a topic I discussed in an earlier blog. In this blog, I’ll discuss some of the strategies for coping with global and other nonlocal variables when parallelizing software.

 

The Lure of Nonlocal Variables

To begin with, what were Wulf and Shaw concerned about, and why are nonlocal variables nevertheless so prevalent? To be clear, by nonlocal variable, I mean one that is declared outside of the scope in which it is used. A global variable is a nonlocal variable declared in the outermost program scope.

Wulf and Shaw were concerned that when a variable is used far away from its definition, a local inspection of the code cannot easily determine its meaning. They identified several reasons for eschewing nonlocal variables. Among them, they observed that if a function modifies the nonlocal variable, this side-effect can be difficult to reason about. In their words, “Untold confusion can result when the consequences of executing a procedure cannot be determined at the site of the procedure call.”

The clear alternative to referencing nonlocal variables is to reference local ones, which one can do by passing the variable as an argument to function calls. The problem with this approach is that it leads to parameter proliferation — long argument lists to functions for passing numerous, frequently used variables. For every global variable referenced in a function, an extra parameter would have to be declared to the function to pass that variable. Not only would that lead to functions with dozens of extra arguments, there also the cost of passing the arguments to consider.

As a practical matter, nonlocal variables are just damn convenient. If I have a code operating on a large data structure, why should I have to pass the data structure to each and every function that operates on the data structure? It’s much more convenient to keep the function interfaces clean with fewer parameters and simply refer to the data structure as a global variable or, commonly in object-oriented programming, as a nonlocal member variable.

A Real-world Example

Parallel programming exacerbates the problems with nonlocal variables, however, because they can cause races. To illustrate the issues, let me take an example from one of Cilk Arts’s customers who supplies 3D computer-aided design software. They represent a large mechanical assembly as a tree of subassemblies down to the individual parts, which may be complex geometric objects. For example, the first few levels of a pickup truck might be represented something like this:


One of the operations this customer’s product does is collision detection, which involves listing all the parts in an assembly that intersect a given target object. For example, to compute a cutaway, the target might be a plane through 3-space. I’ve abstracted such a code below:

Node *target;

std::list<Node *> output_list;

...

void walk(Node *x)

{

switch (x->kind) {

case Node::LEAF:

if (target->collides_with(x))

{

output_list.push_back(x);

}

break;

case Node::INTERNAL:

for (Node::const_iterator child = x.begin();

child != x.end();

++child)

{

walk(child);

}

break;

}

}

In this example, the walk function starts at the root of the assembly and recursively visits all the subassemblies. If it encounters an internal node, it recursively walks the subassemblies. If it encounters a leaf (a part), it tests whether the part intersects the target object, and if it does, it appends the part to the global variable output_list. At the end of the computation, output_list contains all the parts that intersect target.

On the surface, this code is easy to parallelize using Cilk++. The programmer simply needs to replace the for keyword with cilk_for, which allows each of the children to be walked in parallel:

Node *target;

std::list<Node *> output_list;

...

void walk(Node *x)

{

switch (x->kind) {

case Node::LEAF:

if (target->collides_with(x))

{

output_list.push_back(x); // race bug

}

break;

case Node::INTERNAL:

cilk_for (Node::const_iterator child = x.begin();

child != x.end();

++child)

{

walk(child);

}

break;

}

}

Unfortunately, the output_list global variable causes a race in the line indicated. When two branches of the computation try to update output_list, they may collide. (See my blog on determinacy races.)

Solution #1: Locking

One way to avoid the data race is to use mutual-exclusion locks, or mutexes, to ensure that only one subcomputation is appending to output_list at a time (modified code is in red):

Node *target;

std::list<Node *> output_list;

cilk::cilk_mutex output_list_mutex;

...

Void walk(Node *x)

{

switch (x->kind) {

case Node::LEAF:

if (target->collides_with(x))

{

output_list_mutex.lock();

output_list.push_back(x);

output_list_mutex.unlock();

}

break;

case Node::INTERNAL:

cilk_for (Node::const_iterator child = x.begin();

child != x.end();

++child)

{

walk(child);

}

break;

}

}

This code is now correct, but it may exhibit an adverse performance problem. If many parts must be added to output_list, the code may be slowed down by lock contention as multiple processors all vie for access to the mutex and then serialize their accesses to the update code. Although this kind of “hotspot” can drastically curtail parallelism, the code modifications are fairly local and straightforward, if you can find all the data races. Another limitation of this approach, though, is that the final list is produced in a non-deterministic (or jumbled) order, depending on how the computation is scheduled.

Solution #2: Refactor the code

Another strategy to solve the race on the global variable output_list is to refactor the code to eliminate nonlocal references. The idea is to construct portions of the output locally and combine them on the way up the call tree, passing the partial results as a returned value or an output parameter of the walk function. Although walk is a void function, the following code illustrates the approach using an output parameter, because that’s the more general way to solve the problem when the function already returns a value or if there are more than one nonlocal variables to eliminate:

Node *target;

std::list<Node *> output_list;

...

void walk(Node *x, std::list<Node *> o_list)

{

switch (x->kind) {

case Node::LEAF:

if (target->collides_with(x))

{

o_list.push_back(x);

}

break;

case Node::INTERNAL:

std::vector<std::list<Node *> > children_list(x.num_children);

cilk_for (Node::const_iterator child = x.begin();

child != x.end();

++child)

{

walk(child, children_list[child]);

}

for (int i=0; i < x.num_children; ++i)

{

o_list.splice(

o_list.end(),

children_list[i]

);

}

break;

}

}

In this code, a local array children_list is constructed to hold the output lists from each of the children. An array is necessary, rather than reusing a single child list for all of the children, because otherwise a race might occur as the children are computing their results in parallel. After the parallel recursive walk produces all the output lists from all of the children, a serial for loop gathers all of the results into the output paramenter o_list. The initial call to the walk of the entire assembly passes the global variable output_list as its second parameter.

Although the refactoring solution incurs some overhead due to parameter passing, it generally produces code with predictably good performance. Moreover, unlike the locking solution, it constructs output_list with elements in the same order as in the serial execution. In contrast, locking may cause the elements of output_list to appear in a jumbled order.

Unfortunately, however, refactoring can mean revising the logic of substantial sections of code, and the prospect of refactoring a large codebase is daunting. Should you think you’re safe because you don’t use recursion or lots of nested function calls, think again! The same problem can arise when the body of a parallel loop accesses variables declared outside the loop.

Solution #3: Hyperobjects

Cilk++ provides a novel mechanism called hyperobjects, which mitigate data races on nonlocal variables without the performance problems endemic to locking or the need for code restructuring — the best of both worlds. The basic idea is that different parallel branches of the computation may see different views of the hyperobject, but combined they provide a single, consistent view. The following code uses a particular kind of hyperobject called a reducer:

Node *target;

cilk::hyper_ptr<cilk::reducer_list_append<Node *> > output_list;

...

void walk(Node *x)

{

switch (x->kind) {

case Node::LEAF:

if (target->collides_with(x))

{

(*output_list).push_back(x); // Or use ->

}

break;

case Node::INTERNAL:

cilk_for (Node::const_iterator child = x.begin();

child != x.end();

++child)

{

walk(child);

}

break;

}

}

A reducer over a type T is a hyperobject for which T’s member function reduce() implements a (usually associative) binary operator ? and its constructor implements an identity e. In our example code, the reduce() function for reducer_list_append implements list concatenation, which is an associative operation: if A, B, and C are lists and ? is list concatenation, we always have (A ? B) ? C = A ? (B ? C) .

A declaration hyper_ptr<T> foo instantiates foo as a hyperpointer to a reducer of type T, which is initialized to e. In our code, output_list is a hyperpointer to a list, which is initially empty.

A dereference *foo produces a local view of type T of the reducer. Views may differ from point to point in the program and are combined using reduce() to produce a value for the global reducer hyperobject. In our code, dereferencing *output_list yields a local view of output_list, to which the object x is appended. At the end of the computation, output_list contains all the list elements, and as in the refactoring solution, they appear in exactly the same order as they are in the list produced by a serial execution.

The reducer solution implemented in Cilk++ has low overhead. Dereferencing costs about the same as a null function call, and there is no overhead for calling or spawning functions. As long as there is enough parallelism in the application, the reduce() function, which is invoked as an up-call by the Cilk++ runtime system to combine views, is executed relatively infrequently. Specifically, the number of calls to reduce() is at most proportional to the span of the computation (see my blog on parallelism).

The Cilk++ distribution includes a library for common reducers, including list concatenation, sum, minimum, maximum, and logical operations. Moreover, programmers can write their own. Cilk Arts is also developing other kinds of hyperobjects. We’ll keep you informed about Cilk Arts innovations in this blog and on the Cilk Arts webpage.


COMMENTS

I just stumbled over this blog and really like the well-focused articles. Please, keep up the good work ;-) !

One short remark: shouldn't the "o_list" parameter of the refactored "walk" member function be a reference, like in:

void walk(Node *x, std::list<Node *>& o_list)
.........................................................^............

Otherwise the lists contents are copied to the argument variable when called and all work is done on the copy and not on the original object passed into "walk". The caller wouldn't ever get any results back from the "walk" member function.

Kind regards,
Bjoern

posted @ Wednesday, October 29, 2008 3:22 PM by Bjoern Knafla


Bjoern,

you are correct, of course. Thanks
for catching the bug.

posted @ Thursday, October 30, 2008 4:29 PM by Matteo Frigo