Benign Data Races: What Could Possibly Go Wrong?


I’ve been involved with data race detection tools for about 5 years now. Currently, I am working on ThreadSanitizer, a data race detector for C/C++ and Go. I keep getting recurring questions from users about so-called “benign” data races: Why does the tool flag them? Are they actually harmful? How do I suppress them? I think it’s time to write up a detailed answer.

So, a data race occurs when two threads access the same variable concurrently and at least one of the accesses is a write. Data races are one of the most common and hardest to debug types of bugs in concurrent systems. I hope it’s not necessary to explain that data races on complex data structures (like strings and hash maps) are undoubtedly harmful and can lead to crashes and memory corruption. They do.

But suppose we have some "innocent" code like this:

int op_count;
op_count++;  // Executed by several threads, it’s OK if it’s not 100% precise.

First, it’s undefined behavior according to the C++ standard, the C standard, POSIX, the Go memory model and any other relevant combination of standards. The fact that the behavior is undefined means that it can lead to basically any runtime behavior, including an accidental nuclear missile launch (see an example below). And it’s declared as undefined for a reason.
The correct way to express such a pattern is to use atomic operations. C and C++ provide the standard <atomic> header for that, Go provides the sync/atomic package, and there are other implementations (gcc/clang builtin atomics, Threading Building Blocks, Google perf tools, etc). Despite a common misconception, atomic operations do not necessarily incur significant overhead. There are so-called relaxed loads and stores that can express the pattern without incurring additional costs (except perhaps suppressing some optimizations that could lead to bad things). Here is the correct way to express the pattern in C++:

#include <atomic>
std::atomic<int> op_count;
op_count.store(op_count.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);

Second, usage of atomic operations greatly improves code readability. Plain memory accesses and shared-memory synchronization are drastically different operations, and that must be visible in the code. If a shared-memory operation is as inconspicuous in the code as a simple “x->cnt++”, developers will waste time trying to understand what actually happens in the code, and bugs will be introduced during maintenance. Verbose atomic operations clearly say: “Be extremely cautious here!”

Third, data race detection tools will bark at such code, and you may miss important warnings in the noise about “benign” races. It’s like compiler warnings: you keep your code clean so as not to miss the important ones. The same goes for data races: race-free code lets you quickly identify new bugs, and perhaps even set up continuous automated testing for data races.

OK, if you are already convinced, go test and clean up your code. If not, follow me to the most interesting part -- I am going to show how even the most “innocent” and “benign” races can break badly.

To make it more interesting, let’s assume that we are running on the IA-32/Intel64 architecture, with its fairly strong memory consistency, where aligned 1/2/4/8-byte memory accesses are atomic. Consider the following “innocent” code:

int stop;
// Main thread:
stop = 1;  // Tells other threads to stop.
// Other threads:
while (!stop) {
  // Do some work.
}


Compilers assume that programs are race-free (otherwise it’s undefined behavior and all bets are off). So if a program stores to a variable X, the compiler can legally reuse X’s storage for any temporary data within some region of code before the store (e.g. to reduce stack frame size). That is, the code can be compiled as:

foo = …;  // Calculate some temporary value.
stop = foo;  // Spill it from a register into the stop variable.
foo = stop;  // Restore it from the stop variable into a register.
… = … foo …;  // Use foo for some computations.
stop = 1;


Note that such a transformation can’t break any legal race-free program, because the stop variable can’t be accessed concurrently by other threads, and so nobody can observe the garbage value. Such transformations can work across function boundaries due to inlining, so having a function that does nothing but set the stop variable won’t help.
As a result of this transformation, the stop variable will hold random values during program execution. In this particular case, other threads may exit prematurely.

Now let’s consider that several threads write to a shared variable without proper synchronization:

int op_count;
op_count++;  // Executed by several threads.


Suppose the compiler does the same transformation as in the example above (spills some temporary value into op_count), and that the temporary value is a function pointer (potentially from a virtual table, if you don’t use raw function pointers).
Suppose one thread spills a pointer to the write_file() function, and another thread spills a pointer to the launch_nuclear_missile() function. When the first thread restores its pointer, it gets the wrong one: the pointer to launch_nuclear_missile(). Our “innocent” data race just led to an accidental missile launch. Oops.

So, if a data race involves non-atomic writes, it can always go wrong.
But what if we modify shared state with atomic operations but read it with plain loads? As in the following example:

int stat1, stat2;
__sync_fetch_and_add(&stat1, x);
__sync_fetch_and_add(&stat2, y);

void print_stats(std::ostream& stream) {
  auto s1 = stat1;  // Racy read.
  auto s2 = stat2;  // Racy read.
  stream << (s2 ? s1/s2 : 0);
}

This can break after all: the compiler can legally re-read the stat2 variable after the zero check, which leads to a division-by-zero exception.

You may say that this can’t possibly affect your code, because your code is very simple, along the lines of:

class ref_counted {
  int ref_count_;
  void acquire() {
    if (ref_count_ <= 0)  // Racy read.
      die("incorrect reference count");
    __sync_fetch_and_add(&ref_count_, 1);
  }
};



Are you kidding me? This can’t possibly break!
You are right. Almost.

Modern compilers work in mysterious ways. Especially at high optimization levels. If you disassemble a sufficiently large program, you can find code like this (I’ve observed such code with both gcc and clang, at both -O0 and -O2):

# This one is from OpenSSL:
0000000000fa0390 <widefelem_diff>:
 fa0390: mov    %rdi,-0x8(%rsp)
 fa0395: mov    %rsi,-0x10(%rsp)
 fa039a: mov    -0x8(%rsp),%rsi
 fa039f: mov    (%rsi),%rdi    # Load from memory into the register.
 fa03a2: mov    0x8(%rsi),%rax
 fa03a6: mov    %rdi,(%rsi)    # Store the same value back into memory.

# These two are from elsewhere:
 a398b2: mov    -0x38(%rbp),%rcx
 a398b6: mov    (%rcx),%rdx    # Load from memory into the register.
 a398b9: mov    0x8(%rcx),%rdi
 a398bd: mov    %rdx,(%rcx)    # Store the same value back into memory.

 a2e892: mov    0x30(%rcx),%rax
 a2e896: mov    0x38(%rcx),%rdx    # Load from memory into the register.
 a2e89a: mov    %rdx,0x38(%rcx)    # Store the same value back into memory.


The code is strange, but it can’t break correct race-free programs: if code reads a variable, there must be no concurrent writes, and a store of the same value back can’t affect any legal concurrent reads. But it can break our racy ref_counted code. Suppose a thread loads ref_count_ = 2 into a register, then another thread increments it to 3, then the first thread stores 2 back into ref_count_, silently losing the increment. Oops.

There is other literature on the perils of data races (“How to miscompile programs with “benign” data races” by Hans-J. Boehm, and “C++ and the Perils of Double-Checked Locking” by Scott Meyers and Andrei Alexandrescu). But they describe more obvious dangers; I wanted to show that even the most “innocent” data races can lead to bad things.

The bottom line: just say No to “benign” races. Even if you still think that a particular data race is 100% safe (which I doubt), it’s still formally incorrect, fragile under code maintenance, and produces noise in race detection tools.
Use automatic data race detection tools, like ThreadSanitizer or Intel Parallel Inspector. Use them regularly during continuous testing (as the Chromium project does). Test your systems under realistic workloads with the tools to find more bugs. And, of course, fix the races. That will make your life much easier and will save your team lots of time.





In your thread stop example, would the problem be solved by making the variable "stop" atomic? If not, how would you recommend handling this? Thanks.

randomizer:

Hi Jacob,

I wrote "inside of some region of code". I didn't want to go into all the details, but it was meant to cover exactly your example. The region of code is bounded by synchronization operations, so your example is correct and obviously should work. The compiler can still do some dirty tricks inside the critical section; the compiler itself ensures that the current thread is not affected, and the mutex ensures that other threads are not affected.

Hi Dmitry,

Thanks for your article. In one of your examples you stated "So if a program stores to a variable X, the compiler can legally reuse X’s storage for any temporal data inside of some region of code before the store (e.g. to reduce stack frame size)."

I changed the example by guarding each write and read of the variable with a lock, so there is no data race anymore. If the compiler could still reuse stop's storage before the store (for example, in "// Computation"), the execution would be incorrect, so the compiler cannot do this. Am I right? Does that mean that a compiler checks whether the store is inside a lock?

int stop;

// Computation

lock();
stop = 1;
unlock();

// Other threads:
while (!local_stop) {
  lock();
  if (stop) local_stop = 1;
  unlock();
}

randomizer:

I think the code can be affected by:
"So if a program stores to a variable X, the compiler can legally reuse X’s storage for any temporal data inside of some region of code before the store (e.g. to reduce stack frame size)."

Ran into this "benign" race condition in our code base. I've been having a really tough time thinking about what could "catch fire" with this code. It appears that the memory at NULL_INT_ALPHA will only ever contain the value zero, though multiple threads could be assigning to it at the same time.

unsigned int NULL_INT_ALPHA = 0;

void ClampIntColor(unsigned int &r, unsigned int &a = NULL_INT_ALPHA)
{
  unsigned int outR = std::min(255u, r);
  unsigned int outA = std::min(255u, a);
  if (outR != r || outA != a)
  {
    r = outR;
    a = outA;
  }
}

Any ideas on how this could "catch fire"? I would really like to pass it on to our developers.

randomizer:

It would also fix the data race, but it is more expensive and may be unnecessary.

In the first example, why can't you use the increment operator of std::atomic, like this:

std::atomic<int> op_count;
op_count++;

And what if neither C++11 nor TBB is available? Say the target language is Objective-C without C++.

To avoid the re-read issue, declare stat1 and stat2 as atomic, using the C++11 <atomic> header, and read them using the atomic load operations defined in that header. Or if C++11 is not available, use the TBB equivalents :-)

Nice blog. I like the pictures. A technical note: C++11 essentially prohibits compilers from introducing stores that were not written by the programmer. See 1.10 paragraph 22 of the C++ standard. Previous versions of the standard did permit such an irksome possibility.

