Cilk Screen Part II: I found a race. Now what do I do?

This is the second article in a series on the Cilk Screen Race Detector. The previous article, An Introduction to the Cilk Screen Race Detector, described:

  • How Cilk Screen monitors your application for races in your Cilk parallel code
  • Where to get the Cilk Tools; the Cilk screen race detector and the Cilk view scalabitiy analyzer
  • How to run Cilk screen on your application

So you've run Cilk screen on your application, and it's pointed to dozens of races in your code. What do you do about it?

1. Refactor your code

The first question to ask is whether you can refactor your code to remove the race. Or sometimes you can use an alternate algorithm which avoids the problem.

We worked with a customer who wanted to upgrade their CAD system to perform a large calculation in parallel. The data for each object known to the CAD system was stored in a set of large arrays of structs, and the calculation modified those structs, causing a race. We modified the code to create an array of class instances for each object in the user’s model. Then we used a cilk_for to iterate over all of the class instances, writing the results of the analysis were written into class member variables. Once the cilk_for completed the results were combined and the underlying structs were modified.

The key thing here is that the time to generate the class instances and combine the results at the end of the calculation was a very small fraction of the time spent doing the calculation.

2. Use reducers to eliminate races on global variables

Reducers are designed to allow you to use a global variable in parallel strands. A reducer must have an associative operation to combine two views [that is, (a OP b) OP c == a OP (b OP c)], and a well-defined identity value. Reducers have a number of useful properties:

  • Each strand has a private view of the reducer so no locks are required to access it. Contention for locked regions can have a major performance penalty.
  • Each view of a properly written reducer is allocated in a separate cache line, so you’re not slowed down by false sharing.
  • The reducer views are combined by the Cilk runtime in the same order that they’d be if the code was executed serially, so you get the same results, every time. Predictable result ordering makes regression testing much easier!

The Cilk runtime includes a library of sample reducers you can use in your code. Or you can write your own custom reducers.

One thing to be aware of is that your operation may be associative for integer math when it is not for floating point math. For example, summing a list of integer values will always give the same result, regardless of what order they're in. Summing a list of floating point values frequently will not.

  • The conversion of a number from base 10 to binary is not exact due to the finite precision used to represent floating point numbers. For example, (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)
  • Floating point numbers are stored as a mantissa and exponent. When adding two floating point numbers, the mantissa of the smaller value is shifted and the exponent is increased until the exponents agree, then the two mantissas are summed.  So when adding a small value to a very large one, the small value may essentially be 0.  So the order of the numbers matters.  If you sum a long list of numbers, you'll get a different result if the numbers are in random order than if they're sorted so the small values are summed first.

3. Use a lock to serialize access to a global resource

Sometimes there’s just no alternative to using a lock to serialize access to a global resource. In general, using locks should be your last resort. But in legacy code, it may be the easiest way to prevent a race. Things to keep in mind with locks are:

  • Using a lock introduces contention for the lock. If two strands need access to the locked resource at the same time, one will have to wait for the other, serializing the code. As the number of strands attempting to access the resource increases, the time spent waiting for the resource can swamp the time gained by executing in parallel.
  • If you use more than one lock, you introduce the possibility of deadlocks. Consider the case where strand A acquires lock 1, and strand B acquires lock 2. If strand A now needs lock 2, it cannot proceed until strand B releases it. If strand B now requires lock 1, neither strand can ever complete. This state is called a deadly embrace. When more than two resources are involved, a deadly embrace can become quite complex and difficult to figure out.
  • Using a lock does not guarantee ordering. The locked list portion of the example in the Cilk Plus Reducer Tutorial demonstrates this. Guaranteed ordering is one of the advantages of using reducers.

Cilk screen will automatically detect the use of many OS locks and assume that any modifications that occur while holding the same set of locks are not races. On Windows*, Cilk screen will track CRITICAL_SECTIONs, and on Linux* Cilk screen will track mutexes.

4. Use a fake mutex

There are cases where you’ve got a race, but it’s benign. Consider the following code:

int found = false;
cilk_for (int i = 0; i < ARY_MAX; i++)
{
    if (42 == ary[i])
        found = true;
}

The writes to found on multiple strands are a race, but we really don’t care. All that matters is that some strand set found; to true. We don't care which strand it was. We can tell Cilk screen to ignore this race using a fake mutex. For example:

#include <cilktools/fake_mutex.h>

int found = false;
cilkscreen::fake_mutex m;

cilk_for (int i = 0; i < ARY_MAX; i++)
{
    if (42 == ary[i])
    {
        m.lock();
        found = true;
        m.unlock();
    }
}

It's important to note that a benign race may not be a correctness problem, but it it can be a performance problem. Each time a core writes to the memory location containing the modified variable, it needs to invalidate the cache line containing that memory location in all other cores. Each of the cores that wants to read anything in that cache line must re-fetch the current value from memory, which is much slower than simply reading it from the copy in cache. A cache line is the unit of memory that the CPU's cache works in. On current Intel processors, cache lines are 64-bytes.

cilkscreen::fake_mutex uses metacalls to notify Cilk screen that a lock has been acquired and released. Metacalls are a technique for notifying Cilk screen of some event.  The metacalls are always present in the code and have very low overhead.

To make notifying Cilk screen about locks even easier, Cilk screen provides a lock guard class which implements the RAII (Resource Acquisition Is Initialization) technique:
#include <cilktools/fake_mutex.h>
#include <cilktools/lock_guard.h>

int found = false;
cilkscreen::fake_mutex m;

cilk_for (int i = 0; i < ARY_MAX; i++)
{
    if (42 == ary[i])
    {
        cilkscreen::lock_guard g(m);
        found = true;
    }
}

When the lock_guard instance g is initialized, it’s constructor will call the lock() method of the fake_mutex m. When g goes out of scope it’s destructor will call the unlock() method of m. Using the RAII technique means you’ll never have to worry about forgetting to release a lock again.

You can use cilkscreen::lock_guard with any class that implements lock() and unlock(). For example, std::mutex. Of course, you can also use std::lock_guard instead of cilkscreen::lock_guard. cilkscreen::lock_guard is provided for those users who aren’t using a C++11-compliant compiler.


More information about Intel Cilk Plus can be found at the Intel Cilk Plus community website. To ask questions about Cilk Plus, discuss issues, or report bugs, please visit the Cilk Plus forum.

For more complete information about compiler optimizations, see our Optimization Notice.