Volatile: Almost Useless for Multi-Threaded Programming

There is a widespread notion that the keyword volatile is good for multi-threaded programming. I've seen interfaces with volatile qualifiers justified as "it might be used for multi-threaded programming". I thought it was useful until the last few weeks, when it finally dawned on me (or if you prefer, got through my thick head) that volatile is almost useless for multi-threaded programming. I'll explain here why you should scrub most of it from your multi-threaded code.

Hans Boehm points out that there are only three portable uses for volatile. I'll summarize them here:

    • marking a local variable in the scope of a setjmp so that the variable does not roll back after a longjmp.

    • memory that is modified by an external agent or appears to be because of a screwy memory mapping

    • signal handler mischief



None of these mention multi-threading. Indeed, Boehm's paper points to a 1997 comp.programming.threads discussion where two experts said it bluntly:

"Declaring your variables volatile will have no useful effect, and will simply cause your code to run a *lot* slower when you turn on optimisation in your compiler." - Bryan O'Sullivan

"...the use of volatile accomplishes nothing but to prevent the compiler from making useful and desirable optimizations, providing no help whatsoever in making code "thread safe"." - David Butenhof


If you are multi-threading for the sake of speed, slowing down code is definitely not what you want. For multi-threaded programming, there are two key issues that volatile is often mistakenly thought to address:

    1. atomicity

    2. memory consistency, i.e. the order of a thread's operations as seen by another thread.



Let's deal with (1) first. Volatile does not guarantee atomic reads or writes. For example, a volatile read or write of a 129-bit structure is not going to be atomic on most modern hardware. A volatile read or write of a 32-bit int is atomic on most modern hardware, but volatile has nothing to do with it. It would likely be atomic without the volatile. The atomicity is at the whim of the compiler. There's nothing in the C or C++ standards that says it has to be atomic.
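To make issue (1) concrete, here is a minimal sketch of an increment that really is atomic. It assumes a C++11 or later compiler, whose std::atomic postdates this post; the function name count_with_atomic is made up for illustration. The same loop written as counter += 1 on a volatile int would compile to a separate load, add, and store, so concurrent increments could be lost.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each increment a shared counter
// `iterations` times. std::atomic makes every increment one indivisible
// read-modify-write; volatile gives no such guarantee.
int count_with_atomic(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                // Atomicity is all that is needed here, so relaxed ordering suffices.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return counter.load();
}
```

With this sketch, count_with_atomic(4, 10000) returns exactly 40000, because no increment can be lost.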

Now consider issue (2). Sometimes programmers think of volatile as turning off optimization of volatile accesses. That's largely true in practice. But that's only the volatile accesses, not the non-volatile ones. Consider this fragment:

    volatile int Ready;
    int Message[100];

    void foo( int i ) {
        Message[i/10] = 42;
        Ready = 1;
    }


It's trying to do something very reasonable in multi-threaded programming: write a message and then send it to another thread. The other thread will wait until Ready becomes non-zero and then read Message. Try compiling this with "gcc -O2 -S" using gcc 4.0, or icc. Both will do the store to Ready first, so it can be overlapped with the computation of i/10. The reordering is not a compiler bug. It's an aggressive optimizer doing its job.

You might think the solution is to mark all your memory references volatile. That's just plain silly. As the earlier quotes say, it will just slow down your code. Worse yet, it might not fix the problem. Even if the compiler does not reorder the references, the hardware might. In this example, x86 hardware will not reorder it. Neither will an Itanium™ processor, because Itanium compilers insert memory fences for volatile stores. That's a clever Itanium extension. But chips like Power™ will reorder. What you really need for ordering are memory fences, also called memory barriers. A memory fence prevents reordering of memory operations across the fence, or in some cases, prevents reordering in one direction. Paul McKenney's article Memory Ordering in Modern Microprocessors explains them. Sufficient for discussion here is that volatile has nothing to do with memory fences.
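As a hedged sketch of what one-directional fences look like in source code, here is the Ready/Message fragment rewritten with a store-with-release and a load-with-acquire. It assumes C++11 std::atomic, which did not exist when this was written; wait_and_read and run_handoff are hypothetical helpers added only to exercise the hand-off.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int Message[100];           // plain data, no volatile needed
std::atomic<int> Ready{0};  // the flag carries the ordering

void foo(int i) {
    assert(i >= 0 && i < 1000);
    Message[i / 10] = 42;
    // Release: the store to Message cannot be moved after this store,
    // by either the compiler or the hardware.
    Ready.store(1, std::memory_order_release);
}

// Hypothetical receiver: spin until the flag is set, then read the message.
int wait_and_read(int i) {
    // Acquire: the read of Message cannot be moved before this load.
    while (Ready.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield();
    }
    return Message[i / 10];
}

// Hypothetical driver: runs the receiver in a second thread.
int run_handoff(int i) {
    Message[i / 10] = 0;
    Ready.store(0, std::memory_order_relaxed);
    int seen = 0;
    std::thread receiver([&seen, i] { seen = wait_and_read(i); });
    foo(i);
    receiver.join();
    return seen;
}
```

Only the flag is special here. Message stays an ordinary array, so the optimizer remains free to work on everything except the one ordering that matters.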

So what's the solution for multi-threaded programming? Use a library or language extension that implements the atomic and fence semantics. When used as intended, the operations in the library will insert the right fences. Some examples:

    • POSIX threads

    • Windows™ threads

    • OpenMP

    • TBB



For example, the parallel reduction template in TBB does all the right fences so you don't have to worry about them.
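The same point can be sketched with a mutex and a condition variable. This version assumes the C++11 standard library counterparts of the POSIX primitives rather than the pthread calls themselves; Mailbox, post, take, and sum_via_mailbox are made-up names. Nothing is declared volatile, because locking and unlocking the mutex supply the required ordering.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical one-slot mailbox guarded by a mutex.
struct Mailbox {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    std::vector<int> message;
};

// Sender: fill the slot under the lock, then wake the receiver.
void post(Mailbox& box, std::vector<int> msg) {
    {
        std::lock_guard<std::mutex> lock(box.m);
        box.message = std::move(msg);
        box.ready = true;
    }  // unlocking publishes the writes made under the lock
    box.cv.notify_one();
}

// Receiver: sleep until the slot is full, then empty it.
std::vector<int> take(Mailbox& box) {
    std::unique_lock<std::mutex> lock(box.m);
    box.cv.wait(lock, [&box] { return box.ready; });
    box.ready = false;
    return std::move(box.message);
}

// Hypothetical driver: one thread posts a message, another sums it.
int sum_via_mailbox() {
    Mailbox box;
    int total = 0;
    std::thread receiver([&box, &total] {
        for (int v : take(box)) {
            total += v;
        }
    });
    post(box, {1, 2, 3, 4});
    receiver.join();
    assert(!box.ready);  // the receiver emptied the slot
    return total;
}
```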

I spent part of this week scrubbing volatile from the TBB task scheduler. We were using volatile for memory fences because version 1.0 targeted only x86 and Itanium. For Itanium, volatile did imply memory fences. And for x86, we were just using one compiler, and catering to it. All atomic operations were in the binary that we compiled. But now with the open source version, we have to pay heed to other compilers and other chips. So I scrubbed out volatile, replacing each use with explicit load-with-acquire and store-with-release operations, or in some cases plain loads and stores. Those operations themselves are implemented using volatile, but that's largely for Itanium's sake. Only one volatile remained, ironically on an unshared local variable! See file src/tbb/task.cpp in the latest download if you're curious about the oddball survivor.
- Arch


43 comments

Arch D. Robison (Intel)'s picture

What matters for multithreading is:
<ol>
<li>Atomicity</li>
<li>Visibility of memory operations</li>
<li>The order in which memory operations become visible.</li>
</ol>
Volatile in C and C++ flunks on all three counts. All volatile does is prevent a compiler from caching a variable, which is orthogonal to the points above. It just slows programs down. I'll go into points (1) and (2) in detail. My original post already flogged (3). As I'll show, (2) is enough to put a stake through volatile for portable multi-threaded programming.

As Tom notes, atomicity requires using the proper atomic instructions. Volatile does not address this. E.g., even if I declare x as volatile, x+=1 is not going to be compiled as an atomic increment. So special instructions outside the scope of the C/C++ standards must be used to access/modify x atomically. But that implies that <em>every</em> atomic access to x is outside the scope of the C/C++ standards. As far as the C/C++ standards are concerned, all the compiler sees is the address of x (or reference to x) being passed to routines outside the ken of the compiler. So declaring x as volatile is pointless; the compiler cannot cache loads/stores to x because in principle it does not see the loads/stores to x.

Of course for specific compiler implementations we might know that a volatile load or store of a certain size is always compiled as an atomic operation. That's how TBB implements its internal __TBB_load_with_acquire and __TBB_store_with_release operations for some platforms. But we declare only the formal parameter as pointer-to-volatile and do not declare the actual variable as volatile, because what we are doing is platform specific. Indeed, if we were squeaky-clean about it, we would not even declare the formal parameter as pointer-to-volatile, but hide this platform-specific detail completely by casting the formal parameter to pointer-to-volatile. The portable portions of code should not declare any variables as volatile. They should call __TBB_load_with_acquire and __TBB_store_with_release to do the atomic loads and stores.

Now consider point (2), where the ISO C/C++ volatile is downright counterproductive. (Except on Itanium, because of an Intel-specific interpretation of volatile.) Consider the job queue example. Let's assume the queue holds either zero or one elements. Such a queue can be implemented as a shared pointer R in memory that is either NULL or points to the queue's element. I'm simplifying the queue to make the core issue more obvious. With serious queue implementations the same issues strike with a vengeance.
<ol>
<li>Thread 1 computes a matrix product M and atomically sets R to point to M.</li>
<li>Thread 2 waits until R!=NULL and then uses M as a factor to compute another matrix product.</li>
</ol>
In other words, M is a message and R is a ready flag.

Let's consider sequentially consistent machines. Never mind the memory fence issues that are critical to real machines and not addressed by volatile. I'll show that volatile is both a performance killer <em>and</em> useless for a sequentially consistent machine. Here's the key excerpt in the C99 standard:
<blockquote>The least requirements on a conforming implementation are:
<ul>
<li>At sequence points, volatile objects are stable in the sense that previous accesses are
complete and subsequent accesses have not yet occurred.</li>
</ul>
</blockquote>
Note that only volatile objects are required to be "stable". Volatile accesses have no effect on non-volatile accesses. Thus to rely on volatile in the example requires declaring <em>both</em> R and M as volatile. Declaring M as volatile slows down operations on it significantly. In general, much of multithreaded programming relies on a notion of privatization, where one thread operates on an object (such as M) and then passes it off to another thread. At any point in time, the object is being accessed only by one thread. But if we are going to depend upon volatile to get the hand-off right, we have to declare the object as volatile. That inflicts a heavy penalty on access to the object. In principle, we just have to ensure that when handing off an object, that every location associated with the object was last written with a volatile store before the hand off, and that every first read of a location by the receiving thread is done with a volatile load after the hand-off. Keeping track of that information would definitely be a pain.

But it gets worse. Consider a thread handing off a std::map object to another thread. There's no way for a thread to even know all the locations inside the implementation of std::map and mark them all volatile. I suppose a program could serialize the object into a volatile buffer, hand off the buffer, and reconstruct it from the buffer. That essentially inflicts all the pain of message-passing onto shared memory programming. Shared memory programming has enough pain. What we really want are memory fences that force the std::map to be written to memory by the sending thread before the receiving thread reads it. We do not want to turn off the optimizer, but merely enforce the order of some writes and reads.
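Here is a rough sketch of that hand-off using the zero-or-one element queue R from earlier. It assumes C++11 std::atomic, which is newer than this comment; Table, producer, consumer, and run_map_handoff are hypothetical names. The map is built with ordinary, fully optimizable code, and a single store-with-release publishes every location inside it.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <string>
#include <thread>

using Table = std::map<std::string, int>;

// The zero-or-one element queue: null means empty.
std::atomic<Table*> R{nullptr};

// Sending thread: build the map privately, then publish it.
void producer() {
    Table* m = new Table;
    (*m)["rows"] = 3;
    (*m)["cols"] = 4;
    // Release: every write that built the map is ordered before this store.
    R.store(m, std::memory_order_release);
}

// Receiving thread: wait for the pointer, then use the map as plain data.
int consumer() {
    Table* m = nullptr;
    // Acquire: reads of the map cannot be moved before this load.
    while ((m = R.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();
    }
    assert(m->size() == 2);  // the whole map is visible, not just the pointer
    int cells = m->at("rows") * m->at("cols");
    delete m;
    return cells;
}

// Hypothetical driver for the two threads.
int run_map_handoff() {
    R.store(nullptr, std::memory_order_relaxed);
    int cells = 0;
    std::thread receiver([&cells] { cells = consumer(); });
    producer();
    receiver.join();
    return cells;
}
```

Neither thread needs to know how std::map lays out its nodes. The ordering is attached to the hand-off, not to each location in the object.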

To summarize, multi-threaded programming is about atomicity and very precise control of the order in which memory operations become visible. Volatile does not address atomicity. Marking an object as volatile turns off caching of the object's value, which is a terribly imprecise and inefficient way to achieve the desired order of visibility, because all locations passed between threads would have to be marked volatile. To get the correct order efficiently requires some notion of memory fencing, which is outside the current C/C++ standards, but will be in future versions of those standards.

anonymous's picture

There are a couple of misunderstandings in your article. While it is true that using volatile alone and expecting this to make anything thread-safe is a naive (and wrong) assumption, it is not true that volatile is useless, or nearly so.

What volatile does is prevent the compiler from caching a value in a register and from doing optimisations that remove operations in which the variable is involved.
This is an important property, which is necessary in multi-threaded applications. Atomicity in loads or stores is not related to the C/C++ standard, but is a hardware feature (as long as the addressed units are no larger than register size). However, again, you miss the point here. It does not matter whether or not these are atomic.
What matters is that load-modify-store and compare-exchange functions can be made atomic by using the proper instructions (via intrinsics, assembly, or kernel functions). This is what is needed to properly synchronize data between threads. If you can't be sure that the compiler won't optimize out a variable, or hold it in a register, or perform any other smart stuff, then this doesn't work.

While it is true that volatile variables are a lot slower to access, even more so if atomic instructions are used (up to 10-15 times slower), the statement that this is "definitively not what you want with threads" shows that you really haven't understood. It is, in fact, EXACTLY what you want.
What you don't want to happen is one thread on one core/cpu increment a counter while you use the now invalid value in another thread on another cpu. What you don't want to happen is one thread freeing (or simply changing) memory that you are still accessing in another (currently waiting) thread. What you don't want to happen is two threads taking the same head element from a job queue at the same moment, performing the same work twice and finally calling delete on the same pointer twice.
All these issues can of course be safely synchronized by locking/unlocking before every access. However, THIS is what you don't want for the sake of performance.
Constructs using volatile variables in combination with atomic instructions (read up on "lockfree programming") offer a much better solution, especially in highly congested scenarios. It is a lot better to burn two dozen CPU cycles using atomic instructions on a volatile than having every access synchronized by two syscalls (lock/unlock) which will chew up several hundred to thousand cycles each.

anonymous's picture

Volatile semantics depend on the language, but volatile essentially means it's possible to reach inside an object's state-process through some external modification. The higher truth it seems you are trying to lift up is that volatile is an attempt at pruning non-determinism, but it's too coarse grained because it interferes with the assumptions optimizing compilers make.

An example of an alternative to using volatile is to use a surrogate object that manages the resource. For example, in Java, Bill Pugh's Initialization on Demand Holder Idiom uses a surrogate object to wrap a place holder for a resource that hasn't been acquired yet. It's a little strange to think of it as a surrogate object, but it helps to realize that once the resource it's guarding is used, all threads effectively see the same shadow, even though the blueprint for each thread might use a different variable name to refer to the shadow. All threads can manipulate the shadow.

Fences and Atoms are, in my eyes, kinds of surrogate objects. I'm not sure if that is a useful metaphor, though. I'm just 23 and don't have enough experience writing multi-threaded programs with complicated resource contention issues.

Also, I think your opposition to the use of volatile might be found here: http://www.ddj.com/cpp/184403766 and he discusses it more here: http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
Andrei also has two C++ standards survey papers related to this: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2004/n1680.pdf
http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1777.pdf
