Volatile: Almost Useless for Multi-Threaded Programming

There is a widespread notion that the keyword volatile is good for multi-threaded programming. I've seen interfaces with volatile qualifiers justified as "it might be used for multi-threaded programming". I thought it was useful too, until the last few weeks, when it finally dawned on me (or if you prefer, got through my thick head) that volatile is almost useless for multi-threaded programming. I'll explain here why you should scrub most of it from your multi-threaded code.

Hans Boehm points out that there are only three portable uses for volatile. I'll summarize them here:

    • marking a local variable in the scope of a setjmp so that the variable does not roll back after a longjmp.

    • memory that is modified by an external agent or appears to be because of a screwy memory mapping

    • signal handler mischief



None of these mention multi-threading. Indeed, Boehm's paper points to a 1997 comp.programming.threads discussion where two experts said it bluntly:

"Declaring your variables volatile will have no useful effect, and will simply cause your code to run a *lot* slower when you turn on optimisation in your compiler." - Bryan O'Sullivan

"...the use of volatile accomplishes nothing but to prevent the compiler from making useful and desirable optimizations, providing no help whatsoever in making code "thread safe"." - David Butenhof


If you are multi-threading for the sake of speed, slowing down code is definitely not what you want. For multi-threaded programming, there are two key issues that volatile is often mistakenly thought to address:

    1. atomicity

    2. memory consistency, i.e. the order of a thread's operations as seen by another thread.



Let's deal with (1) first. Volatile does not guarantee atomic reads or writes. For example, a volatile read or write of a 129-bit structure is not going to be atomic on most modern hardware. A volatile read or write of a 32-bit int is atomic on most modern hardware, but volatile has nothing to do with it. It would likely be atomic without the volatile. The atomicity is at the whim of the compiler. There's nothing in the C or C++ standards that says it has to be atomic.

Now consider issue (2). Sometimes programmers think of volatile as turning off optimization of volatile accesses. That's largely true in practice. But that's only the volatile accesses, not the non-volatile ones. Consider this fragment:

    volatile int Ready;
    int Message[100];
    void foo( int i ) {
        Message[i/10] = 42;
        Ready = 1;
    }


It's trying to do something very reasonable in multi-threaded programming: write a message and then send it to another thread. The other thread will wait until Ready becomes non-zero and then read Message. Try compiling this with "gcc -O2 -S" using gcc 4.0, or icc. Both will do the store to Ready first, so it can be overlapped with the computation of i/10. The reordering is not a compiler bug. It's an aggressive optimizer doing its job.

You might think the solution is to mark all your memory references volatile. That's just plain silly. As the earlier quotes say, it will just slow down your code. Worse yet, it might not fix the problem. Even if the compiler does not reorder the references, the hardware might. In this example, x86 hardware will not reorder it. Neither will an Itanium™ processor, because Itanium compilers insert memory fences for volatile stores. That's a clever Itanium extension. But chips like Power™ will reorder. What you really need for ordering are memory fences, also called memory barriers. A memory fence prevents reordering of memory operations across the fence, or in some cases, prevents reordering in one direction. Paul McKenney's article Memory Ordering in Modern Microprocessors explains them. Sufficient for discussion here is that volatile has nothing to do with memory fences.
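For comparison, here is how the Ready/Message fragment could be repaired with the acquire/release operations from C++11's &lt;atomic&gt; (a later standard than this post; the send/receive names are mine). The release store and acquire load supply exactly the one-directional fences the example needs:

```cpp
#include <atomic>

std::atomic<int> Ready(0);
int Message[100];

void send(int i) {
    Message[i/10] = 42;
    Ready.store(1, std::memory_order_release);   // release: the Message store cannot sink below this
}

int receive(int i) {
    while (Ready.load(std::memory_order_acquire) == 0)  // acquire: the Message read cannot hoist above this
        ;                                               // spin until the flag is set
    return Message[i/10];                               // guaranteed to see 42
}
```

Neither variable is volatile, yet both the compiler and the hardware are now forbidden from the reordering described above.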

So what's the solution for multi-threaded programming? Use a library or language extension that implements the atomic and fence semantics. When used as intended, the operations in the library will insert the right fences. Some examples:

    • POSIX threads

    • Windows™ threads

    • OpenMP

    • TBB



For example, the parallel reduction template in TBB does all the right fences so you don't have to worry about them.
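For instance, with POSIX threads the Ready/Message example needs no volatile at all — a sketch (the send/receive names are mine). The lock/unlock calls imply the needed fences, and the lock makes the flag check and the message access atomic as a unit:

```cpp
#include <pthread.h>

int Ready = 0;            // plain int: no volatile anywhere
int Message[100];
pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;

void send(int i) {
    pthread_mutex_lock(&M);     // lock/unlock act as the memory fences
    Message[i/10] = 42;
    Ready = 1;
    pthread_mutex_unlock(&M);
}

int receive(int i) {            // returns -1 if the message is not ready yet
    int r = -1;
    pthread_mutex_lock(&M);
    if (Ready)
        r = Message[i/10];
    pthread_mutex_unlock(&M);
    return r;
}
```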

I spent part of this week scrubbing volatile from the TBB task scheduler. We were using volatile for memory fences because version 1.0 targeted only x86 and Itanium. For Itanium, volatile did imply memory fences. And for x86, we were just using one compiler, and catering to it. All atomic operations were in the binary that we compiled. But now with the open source version, we have to pay heed to other compilers and other chips. So I scrubbed out volatile, replacing them with explicit load-with-acquire and store-with-release operations, or in some cases plain loads and stores. Those operations themselves are implemented using volatile, but that's largely for Itanium's sake. Only one volatile remained, ironically on an unshared local variable! See file src/tbb/task.cpp in the latest download if you're curious about the oddball survivor.
- Arch


43 comments

Arch D. Robison (Intel):

John Regehr has a blog post at http://blog.regehr.org/archives/28 that is worth reading on the subject.

For further evidence that volatile is a worthless brick for multithreaded programming, note that the C++0x draft (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2009/n3000.pdf) provides proper bricks in the form of a memory ordering model (see 29.3) and atomic operations that do not rely on volatile (rest of chapter 29). There are both volatile and non-volatile overloads of the atomic operations. Unless you are dealing with memory mapped I/O, the non-volatile forms should suffice for multi-threaded programming.

anonymous:

I didn't mention this in my last comment but should have. Of course using volatile still doesn't protect you from CPU reordering, so barriers/fences are needed anyway! (Except on VC2005+ and Windows, where volatile variables have read acquire and write release semantics including hardware barriers!)

anonymous:

I like this article, makes you think.

Unfortunately I don't like the examples. They feel a bit contrived. For instance in this example:

    volatile int Ready;
    int Message[100];
    void foo( int i ) {
        Message[i/10] = 42;
        Ready = 1;
    }

It's worth pointing out that MSVC will order the operations correctly because they specifically treat volatile that way. For everything else, it should only be expected that the two operations won't be reordered if both Ready and Message are volatile. Anyone who understands the reordering implications of volatile and the compiler they're using would make both variables volatile, in which case it *would* be useful for multithreaded programming, assuming the compiler conforms to this expectation. GCC does... and I think any standards conforming compiler should as well, but I do not have backup for this assertion handy.

Regardless, an explicit barrier is the better solution, but making everything volatile is the "lazy" way because you can enforce ordering without a specific understanding of where it is and is not actually needed. I agree with the general premise... don't sprinkle volatile all over in the hope that it will help. It's better to identify the areas where fences and barriers are needed and use them directly. But sometimes this isn't practical... it's notoriously hard to guarantee correctness for lock-free code and many people may prefer the lazy safety of volatile (used correctly).

anonymous:

volatile is needed in g++ generated code. I've seen non-volatile member variables not changing values when accessed from another thread and volatile fixed the problem IMMEDIATELY.

If you don't use volatile where appropriate, you deserve what you get.

I'll post an example that demonstrates the problem when I get some time.

anonymous:

heh, your argument is highly compelling. You've given numerous concrete examples to support your claim, so I must admit I'm convinced.

Just out of curiosity, did you actually *read* any of the rationale in this post? It's all factual, whether or not you understood it though is a potentially different story.

anonymous:

A little late but I have to chime in. This is a pretty silly post. Volatile is vitally necessary for implementing the very things you say are needed for multithreading and this is sort of like saying you don't need bricks you only need walls. Made of bricks. Many people use volatile in silly manners but what C++ feature is that not true for?

anonymous:

I'm not sure if it's been mentioned before, but volatile has some interesting uses in multi-threaded programming. It's just that ensuring atomic or correctly-ordered accesses to primitive-typed variables is not one of them.

This article from 2001 describes how to use the volatile qualifier to automatically catch situations with race conditions:
http://www.ddj.com/cpp/184403766

However, the volatile keyword does uncover lots of codegen and optimization bugs in compilers: http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf
https://www.securecoding.cert.org/confluence/display/seccode/DCL17-C.+Beware+of+miscompiled+volatile-qualified+variables

anonymous:

David, I have to disagree.

First, although I thought that it was obvious in context, let me explain what I "really meant" by shared objects. I mean mutable objects that may be accessed concurrently by multiple threads. Our codebase is only unusual in the sense that it's multithreaded. I've long considered one of the core skills of multithreaded programming to be the ability to separate shared objects from thread-local objects.

With this definition in mind, it should be obvious that saying "all objects in memory are shared" seems as facile as saying "all objects in memory are writable." Of course this is to one extent or another true, but knowing that most of our memory is writeable doesn't stop us from wanting our code to be const correct. The point is not whether the hardware is capable of writing or sharing data, the point is what the programmer's intent is. (If there's still confusion as to what *my* intent was in my last post, I used the word "shared" to mean "shared between multiple threads." Sorry for not being more clear.) When I mark an object "const" I know full well that I'm not waving a magic wand and making my object impossible to alter. But I am marking it in such a way that the compiler will catch unintentional misuse.

If you really believe that this technique kills perf, you might want to read Alexandrescu's article more closely. Your perf will suffer if you do something like make your data members public, or mark all of your access functions volatile (and forget to cast away the volatile-ness before accessing your data members). If you write properly encapsulated classes and use Alexandrescu's LockingPtr<> template, the volatile gets cast away before it has a chance to infect your codegen. The compiler is free to optimize as usual. I can't speak for all compilers and platforms, but MSVC 2008 appears to have no trouble optimizing code in the scope of a LockingPtr<>.

One easy mistake to make is to over-volatilize things. For instance, you don't want to mark every data member volatile (in fact, you never want to mark built in types as volatile, for reasons that Alexandrescu addresses in his article). But if you're used to writing concurrent code, this should just be common sense. If you want to insert ten items into a list, you don't want to put a critical section into your List::Insert function. If at all possible, you write code outside the list to acquire the lock, add all ten items, then release it. Likewise, you want to mark things volatile at the highest possible level. The presence of the volatile keyword means "accessing this data requires a lock." If you place the volatile qualifier on a top-level data structure, then once you lock the structure you can cast away the volatile and have safe, optimized access to all of its members and their members and so on.
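The pattern described above can be sketched in a few lines. This is a minimal take on the LockingPtr&lt;&gt; idea from Alexandrescu's article, not his exact code, and the Queue/push names are illustrative. The shared object is marked volatile at the top level; LockingPtr acquires the lock and casts the volatile away in one step, so code holding the lock gets plain, fully optimizable access:

```cpp
#include <mutex>

// Minimal LockingPtr sketch: lock acquisition and volatile-stripping combined.
template<typename T>
class LockingPtr {
    T* p_;                  // volatile already cast away
    std::mutex* m_;
public:
    LockingPtr(volatile T& obj, std::mutex& m)
        : p_(const_cast<T*>(&obj)), m_(&m) { m_->lock(); }
    ~LockingPtr() { m_->unlock(); }
    T* operator->() const { return p_; }
    T& operator*()  const { return *p_; }
};

struct Queue {
    int items[8];
    int count;
};

volatile Queue sharedQ = {};    // non-volatile member functions cannot be
std::mutex qMutex;              // called through sharedQ without LockingPtr

void push(int x) {
    LockingPtr<Queue> q(sharedQ, qMutex);   // lock + cast in one step
    q->items[q->count++] = x;               // plain, optimized access
}
```

Since the compiler rejects non-volatile operations on a volatile-qualified object, forgetting the LockingPtr shows up at compile time rather than as a race at runtime.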

The "in debug, fault if not locked" paradigm is in our codebase too. It complements but does not replace the "volatile correct" paradigm. One works at runtime, the other at compile time. Neither one is guaranteed to find all threading bugs, but both are helpful.

Your last point is an excellent one. Given the choice I'd much rather be able to define my own qualifiers. In the absence of that choice, I'm abusing volatile. I'm not going to argue that it's a best practice for everyone. But when your job is to turn hundreds of thousands of lines of single-threaded code into a parallel application, you'll take whatever helps. :-)

Ian

P.S. I'd be happy to continue this conversation if you have more questions or can prove me wrong about this. :-) Email me at ian dot lewis at intel dot com.

anonymous:

Let's assume this code is executed on an x86 Intel processor.

    volatile int Ready;
    int Message[100];
    void foo( int i ) {
        Message[i/10] = 42;
        Ready = 1;
    }

As stated earlier, the store to Ready can be moved ahead of the store to Message[i/10]. However, suppose that the code was rewritten such that a memory fence is inserted between the two assignment statements.

    void foo( int i ) {
        Message[i/10] = 42;
        __asm { mfence }
        Ready = 1;
    }

Let's further assume that the compiler will not reorder instructions around the inline assembly. If Ready can be cached in a register, then the mfence has no impact. Thus, volatile is necessary, but not sufficient, for writing correct, lock-free, multi-threaded code. Can anyone either validate or disprove my understanding of the value of volatile for lock-free multithreading?

anonymous:

The problem with doing that, Ian, is that it kills performance.

In any event, unless your codebase is very unusual, it makes no sense to "identify shared objects". *All* objects in memory are shared.

Depending on what you really mean, there's likely a way to do it that doesn't compromise performance. For example, code that basically says, "we must hold lock X here because we manipulate object Y -- in a debug build, fault if we do not". Or code that says, "we should have released lock X by the time we got here -- in a debug build, fault if we did not, in a release build, release the lock".

It would be nice, however, if C++ supported custom qualifiers that you could apply to variables just to mark them, to fail at compile time if they're not matched, and to select the desired overload. If I get some time, I'll see if I can throw a proposal together.

