Benign data races: what could possibly go wrong?

Hi,

I’ve been involved with data race detection tools for about 5 years now. Currently, I am working on ThreadSanitizer, a data race detector for C/C++ and Go language. I am getting recurring questions from users about so called “benign” data races -- Why does the tool flag them? Are they actually harmful? How to suppress them? I think it’s the time to write up a detailed answer.

So, a data race occurs when two threads access the same variable concurrently and at least one of the accesses is a write. Data races are one of the most common and hardest to debug types of bugs in concurrent systems. I hope it’s not necessary to explain that data races on complex data structures (like strings and hash maps) are undoubtedly harmful and can lead crashes and memory corruption. They do.

But consider we have such an "innocent" code as:

int op_count;
...
op_count++;  // Executed by several threads, it’s OK if it’s not 100% precise.


First, it’s undefined behavior according to C++ standard, C standard, POSIX, Go memory model and any other relevant combination of standards. The fact that the behavior is undefined means that it can lead to basically any runtime behavior, including accidental nuclear missile launch (see an example below). And it’s declared as undefined for a reason.
The correct way to express such pattern is to use atomic operations. C/C++ provide standard <atomic> header for that, Go provides sync/atomic package and there are other implementations (gcc/clang builtin atomics, Threading Building Blocks, Google perf tools, etc). Despite common misconception, atomic operations do not necessarily incur significant overheads. There are so called relaxed loads and stores that can express the pattern w/o incurring additional costs (except perhaps suppressing some optimizations that can lead to bad things). Here is the correct way to express the pattern in C++:

#include <atomic>
std::atomic<int> op_count;
...
op_count.store(op_count.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);


Second, usage of atomic operations greatly improves code readability. Plain memory accesses and shared memory synchronization are hideously different operations, and that must be visible in the code. If a shared memory operation is as invisible in the code as simple “x->cnt++”, that will waste developers time trying to understand what actually happens in the code and lead to introduction of bugs during maintenance. Verbose atomic operations clearly say “Be extremely cautious here!”

Third, data race detection tools will bark at such code, and you can miss important warnings in the noise about “benign” races. Think about compiler warnings, you keep your code clean just to not miss the important ones. The same with data races, race-free code will allow you to quickly identify new bugs and perhaps even setup continuous automatic testing for data races.

OK, if you are already convinced, go test and clean up your code. If not, follow with me to the most interesting part -- I am going to show how even the most “innocent” and “benign” races can break badly.

To make it more interesting, let’s assume that we are running on IA-32/Intel64 architecture with quite strong memory consistency and 1/2/4/8-byte memory accesses are atomic. Consider we have such an “innocent” code as:

int stop;
…
stop = 1;  // Says other threads to stop.
…
// Other threads:
while (!stop) {
  ...
}



Compilers assume that programs are race-free (otherwise it’s undefined behavior and all bets are off). So if a program stores to a variable X, the compiler can legally reuse X’s storage for any temporal data inside of some region of code before the store (e.g. to reduce stack frame size). That is, the code can be compiled as:

foo = …;  // Calculate some temporal value.
stop = foo;  // Spill from register into the stop variable.
…
foo = stop;  // Restore from the stop variable into a register.
… = … foo ...;  // Use foo for some computations.
…
stop = 1;


Note that such transformation can’t break any legal race-free programs, because the stop variable can’t be accessed concurrently by other threads and so nobody can see the garbage value. Such transformations can work across function boundaries due to inlining. So if you have a function that solely sets the stop variable, it won’t help.
As a result of such transformation, the stop variable will have random values during program execution. In this particular case other threads will exit prematurely.

Now let’s consider that several threads write to a shared variable without proper synchronization:

int op_count;
…
op_count++;  // Executed by several threads.


Consider that the compiler does the same transformation as in the example above (spills some temp value into op_count), and that the temp value is a function pointer (potentially from virtual table, if you don’t use raw function pointers).
Consider that one thread spills pointer to write_file() function, and another thread spills pointer to launch_nuclear_missile() function. When the first thread restores its pointer, it will get the wrong pointer to launch_nuclear_missile(). Our “innocent” data race just lead to an accidental missile launch. Oops.

So, if a data race involves non-atomic writes, it always can go wrong.
But what if we modify shared state with atomic operations but read with plain loads? Like in the following example:

int stat1, stat2;
...
__sync_fetch_and_add(&stat1, x);
__sync_fetch_and_add(&stat1, y);
...
void print_stats(std::ostream& stream) {
  auto s1 = stat1;  // Racy read.
  auto s2 = stat2;  // Racy read.
  stream << (s2 ? s1/s2 : 0);
}


This can also break after all. The compiler can legally re-read the stat2 variable after the check, and it yet leads to division by zero exception.

You may say that it can’t possibly affect your code, because you have a very simple code along the lines of:

class ref_counted {
  int ref_count_;
  ...
  void ref_counted::acquire() {
    if (ref_count_ <= 0)  // Racy read.
      die(“incorrect reference count”);
    __sync_fetch_and_add(&ref_count_, 1);
  }
  ...
};




Are you joking me? This can’t possibly break!
You are right. Almost.

Modern compilers work in mysterious ways. Especially at high optimization levels. If you disassemble sufficiently large program, you can find code like this (I’ve observed such code with both gcc, clang, -O0 and -O2):

# This one is from OpenSSL:
0000000000fa0390 <widefelem_diff>:
  fa0390: mov    %rdi,-0x8(%rsp)
  fa0395: mov    %rsi,-0x10(%rsp)
  fa039a: mov    -0x8(%rsp),%rsi
  fa039f: mov    (%rsi),%rdi    # Load from memory into the register.
  fa03a2: mov    0x8(%rsi),%rax
  fa03a6: mov    %rdi,(%rsi)    # Store the same value back into memory.

# These two are from elsewhere:
  a398b2: mov    -0x38(%rbp),%rcx
  a398b6: mov    (%rcx),%rdx    # Load from memory into the register.
  a398b9: mov    0x8(%rcx),%rdi
  a398bd: mov    %rdx,(%rcx)    # Store the same value back into memory.

  a2e892: mov    0x30(%rcx),%rax
  a2e896: mov    0x38(%rcx),%rdx    # Load from memory into the register.
  a2e89a: mov    %rdx,0x38(%rcx)    # Store the same value back into memory.


The code is strange but it can’t break correct race-free programs. If a code reads a variable, there must be no concurrent writes. And store of the same value can’t affect potential concurrent reads. But it can break our racy ref_counted code. Consider that a thread loads ref_count_ = 2 into a register, then another thread increments it to 3, then the first thread stores 2 back into ref_count_. Oops.

There are other literature on the peril of data races (“How to miscompile programs with “benign” data races” by Hans-J. Boehm, “C++ and the Perils of Double-Checked Locking” by Scott Meyers and Andrei Alexandrescu). But they describe more obvious dangers, I wanted to show that even the most “innocent” data races can lead to bad things.

The bottom line. Just say No to “benign” races. Even if you still think that that particular data race is 100% safe (which I doubt), it’s still formally incorrect, fragile during code maintenance and produces noise under race detection tools.
Use automatic data race detection tools, like ThreadSanitizer or Intel Parallel Inspector. Use them regularly during continuous testing (like it’s done by the Chromium project). Test your systems under realistic workloads with the tools to find more bugs. And, of course, fix the races. That will make your life much easier and will save your team lots of time.

Pour de plus amples informations sur les optimisations de compilation, consultez notre Avertissement concernant les optimisations.