| October 29, 2009 12:00 AM PDT | |
Races and Reducers: the Risks and Rewards
One of the biggest challenges in building a parallel program is dealing with data races. Cilk++ offers several tools and techniques to find and eliminate races from your program. Reducers provide a a powerful mechanism for eliminating races, but as Spiderman said, "with great power comes great responsibility."
The Problem
I recently looked at a customer problem that at first looked like an obvious application for reducers. Briefly, the task was to sort a collection of objects into separate bins based on certain properties of the objects. The simplest parallel implementation, converting the for loop to a cilk_for loop, introduces data races on the bins if two or more objects can be put into the same bin.
At first look, this seemed like a good opportunity to use reducers. The contents of each bin is a list of elements, so why not just have each bin hold a list reducer? Reducers are extremely efficient when used sparingly, but unfortunately can become expensive if you create thousands or even millions of reducers, as would be needed in this case.
The other safe approach to handling races is to use locks to protect access to the shared memory locations. In this case, that would either mean using a single lock to protect the entire collection (safe, but bad because the high contention on this lock would eliminate all advantages of parallelism) or to create and acquire a lock for each bin (also safe, but bad because it requires thousands or millions of expensive locks.)
Luckily, there is another way.
The Approach
Mutex locks provided by the operating system are expensive to create and acquire, and are generally more heavy weight than is required for this problem. However, there is a much lower-cost alternative. The underlying hardware offers atomic instructions, which are effectively locks around single hardware instructions. These are the same instructions that are used to implement mutexes, but we can use them directly to save a significant amount of overhead for our problem at hand. It is important to note that there is no free lunch - atomic instructions require a memory barrier that can cost 40 or more instruction cycles. This is far more expensive than the few cycles used by an unlocked read/write, but still much less than the thousands required to create and acquire mutexes. You can think of these as small locks that lock just the read/write hardware operation to ensure that no other processor can race on the memory between the read and the write.
My approach is to use the atomic instruction to safely update the bins in a way that will not create data races. As with any use of locks, this approach prevents data races, but does not guarantee a deterministic result. In this case, each bin will hold the correct list of elements, but the order of the elements in the list is indeterminate.
The Solution
An atomic swap operation for pointers is available under different names in both Gnu and Windows C/C++ compilers:
| GNU: | __sync_lock_test_and_set |
| MSVC: | InterlockedExchangePointer |
An implementation of a thread-safe sets can be implemented as a template with the class C of the set element as the template parameter:
#ifdef __GNUC__ |
Figure 1 shows the set before the update:
| Figure 1. |
The next diagram in Figure 2 shows the data after the new element (nel) has been created and "swapped in", while the address of the old set is stored in a temporary (tmp). Note that the data is in an inconsistent state at this stage, but, as Figure 4 shows later, it is still comletely thread-safe:
| Figure 2. |
In Figure 3 the contents of tmp variable is copied to the next field of the nel object thus comleting the update:
| Figure 3. |
The diagram in Figure 4 shows what happens if two updates happen in parallel. It assumes that right after the first atomic swap (Figure 2) another threads comes in and does its own atomic swap before the first thread had a chance to finish the update (as in Figure3). The diagram in Figure 4 assumes that the second thread creates objects tmp2 and nel2. Note that objects tmp and nel are visible only to the first thread, while tmp2 and nel2 -- only to the second.
| Figure 4. |
While the situation in Figure 4 looks messy, it becomes perfectly kosher once each thread comletes its update, i.e. when the first thread moves tmp to nel.next and the second thread -- tmp2 to nel2.next :
| Figure 5. |
If the order of the updates was swapped, i.e. thread 2 did the AtomicSwap before thread 1 the situation would be almost the same but the relative order of nel and nel2 in the set would be reversed. Which, of course, is perfectly OK for unordered sets.
From Sets to Hash Tables
Once we have implemented a "safe" unordered set we can extend the implementation to hash tables by using safe sets to chain the values in the hash table. In Figure 6 a hash table is shown as an array of pointers to individual chains:
| Figure 6. |
Notes
- Note that this implementation only guarantees that two add operations would never cause a determinancy race, but add operation could race with other operations on the set, like iteration: If another thread tries to iterate over the set before the next field of the new element nel is updated (Figure 2) then the result would be unpredictable since we haven't even initialized the next field of the new element.
- When the above sample code is run under the Cilkscreen tool the latter will complain about the data race on the elt0 member field of the SafeSet object. This complaint can be suppressed by using the __cilkscreen_disable_checking and the __cilkscreen_enable_checking pseudo-calls. The details can be found in the Cilk++ Programmer's Guide, but the add function from the initial code fragment would look something like this:
void add (const C & x) {
SafeSetElement* nel = new SafeSetElement (x);
__cilkscreen_disable_checking
SafeSetElement* tmp = AtomicSwap (&elt0, nel);
__cilkscreen_enable_checking
nel->next = tmp;
};
Results and Conclusions
The example above illustrates that, while reducers might be very powerful and easy to use in a large set of algorithms, there are situations where reducers are not the best answer.
The example described above has been benchmarked on a quad-core Intel CPU. To make the benchmark simulate a real problem, we've added some cycle-burning code to represent the work that the application would do to select the objects of interest and calculate the appropriate bin. The actual benchmarking program can be found on this page.
The results are listed in the table below, which contains the times for different number of processors (specified by the CILK_NPROC environment variable ) as well as for the serial C++ execution of the program. A couple of points are worth highlighting:
- There is a slight overhead in Cilk++-compiled code when it is run on a single core (about 0.2% in the table below). This overhead can be caused by memory barriers built into the atomic swap instruction, as well as by the Cilk++ runtime system.
- Telling Cilk++ that there are more cores than the actual number is often counter-productive (too many cooks spoiling the performance broth?). We specify the number of cores for benchmarking purposes, but in practice it is usually best to let Cilk++ use the actual number of available processors (4 in this case).
| Core Count | Execution Time | Speed-up relative to CILK_NPROC=1 | Speed-up relative to serial execution |
| Serial | 85.694 | 1.002 | 1.000 |
| CILK_NPROC=1 | 85.828 | 1.000 | 0.998 |
| CILK_NPROC=2 | 43.182 | 1.988 | 1.985 |
| CILK_NPROC=3 | 28.805 | 2.980 | 2.975 |
| CILK_NPROC=4 | 21.958 | 3.909 | 3.903 |
For more complete information about compiler optimizations, see our Optimization Notice.

