Volatile: Almost Useless for Multi-Threaded Programming

There is a widespread notion that the keyword volatile is good for multi-threaded programming. I've seen interfaces with volatile qualifiers justified as "it might be used for multi-threaded programming". I thought it was useful until the last few weeks, when it finally dawned on me (or if you prefer, finally got through my thick head) that volatile is almost useless for multi-threaded programming. I'll explain here why you should scrub most of it from your multi-threaded code.

Hans Boehm points out that there are only three portable uses for volatile. I'll summarize them here:

    • marking a local variable in the scope of a setjmp so that the variable does not roll back after a longjmp.

    • memory that is modified by an external agent or appears to be because of a screwy memory mapping

    • signal handler mischief



None of these mention multi-threading. Indeed, Boehm's paper points to a 1997 comp.programming.threads discussion where two experts said it bluntly:

"Declaring your variables volatile will have no useful effect, and will simply cause your code to run a *lot* slower when you turn on optimisation in your compiler." - Bryan O' Sullivan

"...the use of volatile accomplishes nothing but to prevent the compiler from making useful and desirable optimizations, providing no help whatsoever in making code "thread safe". " - David Butenhof


If you are multi-threading for the sake of speed, slowing down code is definitely not what you want. For multi-threaded programming, there are two key issues that volatile is often mistakenly thought to address:

    1. atomicity

    2. memory consistency, i.e. the order of a thread's operations as seen by another thread.



Let's deal with (1) first. Volatile does not guarantee atomic reads or writes. For example, a volatile read or write of a 129-bit structure is not going to be atomic on most modern hardware. A volatile read or write of a 32-bit int is atomic on most modern hardware, but volatile has nothing to do with it. It would likely be atomic without the volatile. The atomicity is at the whim of the compiler. There's nothing in the C or C++ standards that says it has to be atomic.
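
To make the distinction concrete, here is a minimal sketch. It uses C++11's std::atomic, which postdates this article, as the portable way to actually get atomicity; volatile only keeps the compiler from caching or eliding the access:

    #include <atomic>

    struct Wide { int lo; int hi; };   // 64 bits: wider than one bus transaction
                                       // on 32-bit hardware

    volatile Wide w;                   // volatile does NOT make accesses to w atomic:
                                       // the two fields are still written one at a time
    std::atomic<int> counter(0);       // atomic by specification, not by luck

    void update() {
        w.lo = 1;                      // a concurrent reader can observe the new lo
        w.hi = 2;                      // paired with the old hi (a torn update)
        counter.fetch_add(1);          // guaranteed indivisible read-modify-write
    }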

Now consider issue (2). Sometimes programmers think of volatile as turning off optimization of volatile accesses. That's largely true in practice. But that's only the volatile accesses, not the non-volatile ones. Consider this fragment:

    volatile int Ready;
    int Message[100];

    void foo( int i ) {
        Message[i/10] = 42;
        Ready = 1;
    }


It's trying to do something very reasonable in multi-threaded programming: write a message and then send it to another thread. The other thread will wait until Ready becomes non-zero and then read Message. Try compiling this with "gcc -O2 -S" using gcc 4.0, or icc. Both will do the store to Ready first, so it can be overlapped with the computation of i/10. The reordering is not a compiler bug. It's an aggressive optimizer doing its job.

You might think the solution is to mark all your memory references volatile. That's just plain silly. As the earlier quotes say, it will just slow down your code. Worse yet, it might not fix the problem. Even if the compiler does not reorder the references, the hardware might. In this example, x86 hardware will not reorder it. Neither will an Itanium™ processor, because Itanium compilers insert memory fences for volatile stores. That's a clever Itanium extension. But chips like Power™ will reorder. What you really need for ordering are memory fences, also called memory barriers. A memory fence prevents reordering of memory operations across the fence, or in some cases, prevents reordering in one direction. Paul McKenney's article Memory Ordering in Modern Microprocessors explains them. Sufficient for discussion here is that volatile has nothing to do with memory fences.
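
For a concrete picture, here is the same fragment with the ordering made explicit. This sketch uses C++11 atomics, which arrived after this article was written; it is one way a library can supply the fences just described:

    #include <atomic>

    std::atomic<int> Ready(0);
    int Message[100];

    void foo( int i ) {
        Message[i/10] = 42;
        // Release store: neither the compiler nor the hardware may move
        // the write of Message below this store.
        Ready.store(1, std::memory_order_release);
    }

    void consume() {
        // Acquire load: reads of Message may not move above this load.
        while( Ready.load(std::memory_order_acquire)==0 ) {}
        int m = Message[0];
        (void)m;
    }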

So what's the solution for multi-threaded programming? Use a library or language extension that implements the atomic and fence semantics. When used as intended, the operations in the library will insert the right fences. Some examples:

    • POSIX threads

    • Windows™ threads

    • OpenMP

    • TBB



For example, the parallel reduction template in TBB does all the right fences so you don't have to worry about them.
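
For instance, a sum over an array might look like the sketch below. It uses TBB's later lambda-based parallel_reduce interface rather than the version 1.0 template; all of the fencing happens inside the library:

    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <functional>

    float ParallelSum( const float a[], size_t n ) {
        return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0,n),   // the iteration space
            0.0f,                              // identity value for the reduction
            [=]( const tbb::blocked_range<size_t>& r, float sum ) {
                for( size_t i=r.begin(); i!=r.end(); ++i )
                    sum += a[i];               // serial work on one chunk
                return sum;
            },
            std::plus<float>() );              // how to combine partial sums
    }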

I spent part of this week scrubbing volatile from the TBB task scheduler. We were using volatile for memory fences because version 1.0 targeted only x86 and Itanium. For Itanium, volatile did imply memory fences. And for x86, we were just using one compiler, and catering to it. All atomic operations were in the binary that we compiled. But now with the open source version, we have to pay heed to other compilers and other chips. So I scrubbed out volatile, replacing it with explicit load-with-acquire and store-with-release operations, or in some cases plain loads and stores. Those operations themselves are implemented using volatile, but that's largely for Itanium's sake. Only one volatile remained, ironically on an unshared local variable! See file src/tbb/task.cpp in the latest download if you're curious about the oddball survivor.
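
To give the flavor of those wrappers, here is a hypothetical sketch (not the real TBB code, which is per-architecture) of how load-with-acquire and store-with-release can be built on volatile for scalar types, for Itanium's sake:

    template<typename T>
    T load_with_acquire( const volatile T& location ) {
        // The volatile read keeps the compiler from caching or moving the load;
        // Itanium compilers also emit ld.acq for it. Other architectures would
        // need an explicit fence after the load.
        return location;
    }

    template<typename T>
    void store_with_release( volatile T& location, T value ) {
        // Itanium compilers emit st.rel for a volatile store. Other
        // architectures would need an explicit fence before the store.
        location = value;
    }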
- Arch


Comments

This article is ridiculous. It should instead be titled "Volatile is useless for doing things it is not meant to be used for, like enforcing memory ordering". Anyone that has written high performance, multi-threaded code (most likely in the embedded space), MUST use volatile. Try writing any high performance lock free synchronization without volatile. It is an absolute requirement. The programmer needs some way of telling the compiler "yo, don't put this memory location in a register, because it might be changed by some external agent". The fact that many individuals mistakenly believe that volatile will somehow result in a magical ordering of memory operations (it won't, both the compiler and the underlying HW will often reorder instructions), or somehow provide atomicity (not sure why anyone would think this), is not the problem with volatile, but rather with the communication of what volatile actually does. To properly write high performance shared memory programs, the programmer needs a way of specifying memory ordering (via barriers, etc), performing atomic operations (via intrinsics), and preventing compiler optimizations such as register allocation, dead code elimination, LICM, etc., that can change the behaviour of your intended memory operations. Volatile is the ONLY option for the last one.


This article is wrong.

Given the code

int done = false;
int somecounter = 0;

void foo() {
    while (!done) {
        somecounter++;
    }
}

The compiler would be free to optimize the loop out completely, and just use a jump with no check after checking once at the function entry, since it knows that the value of done is false when entered, and nothing in the function can change it. By marking the variable volatile the compiler knows that it may change outside the scope of the method, and must always check the condition.

Yes, you can have reordering problems, but a 'threading library' won't solve this either, because the compiler would still be free to reorder the check (the call into the library) before the usage (in the ready case), unless the compiler has specific knowledge that the library call means "do not reorder"; and if that were the case, it could apply the same logic to the 'volatile' keyword.

Even if the compiler does not reorder, the CPU might, which is why you still need fences. That the volatile keyword does not ensure a fence (like it does in Java) is just plain silly.


This article is simply wrong. Volatile is absolutely necessary in any situation where a compiler could make wrong assumptions about a variable keeping its value constant just because a function does not change it itself.

You can add as many mutexes and memory fences as you like; that won't change the fact that a compiler's optimizer can make wrong assumptions about variables that are wrongly not marked as volatile.


Here's my take in plain English:

You should in general not have an unprotected writer such that a reader might catch an incomplete update. That is why synchronization objects exist. By using a lock to take control of an object before a write, no one, by definition, will be able to access the object, and therefore no one will be in danger of having the value change beneath its feet. Hope that makes sense.
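
A minimal sketch of that pattern, assuming C++11's std::mutex (any lock, such as a pthread mutex, works the same way):

    #include <mutex>

    std::mutex m;
    int shared_value = 0;

    void writer( int v ) {
        std::lock_guard<std::mutex> guard(m);  // take the lock before writing...
        shared_value = v;                      // ...so no reader can see a partial update
    }

    int reader() {
        std::lock_guard<std::mutex> guard(m);  // readers take the same lock
        return shared_value;
    }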


Definitely an interesting find! However, I don't agree with the author that this re-ordering is not a GCC bug but just an aggressive optimization, for the following reasons:
1. The C++ standard says nothing about threading, so the compiler should be pessimistic about any aggressive optimization. If you have any object/structure which is global or has had its address "escape" from whatever local context it was declared in, then the re-ordering should not be allowed.
2. It does not preserve "as-if" semantics in the visible behavior of the program (the user sees a different behavior than what he described) (ref: ISO C++ standard, especially 1.9 Program Execution, paragraph 5).
3. By default the compiler needs to respect sequence points (e.g. assignment statements) (http://en.wikipedia.org/wiki/Sequence_point). Reordering should only be possible if the compiler can see the full definitions of the code and can determine they have no side effects, so that the re-ordering is legal while still being pessimistic.

Thanks
-Abhishek

B.T.W. here is the output from a few other C++ compilers. I think I have one more reason to dump gcc now :-)

LLVM-GCC/clang (showing just the LLVM bitcode)
define void @foo(i32 %i) nounwind {
entry:
%div = sdiv i32 %i, 10 ; <i32> [#uses=1]
%tmp1 = load %struct.tx_buffer** @buffer ; <%struct.tx_buffer*> [#uses=1]
%idxprom = sext i32 %div to i64 ; <i64> [#uses=1]
%arrayidx = getelementptr inbounds %struct.tx_buffer* %tmp1, i64 0, i32 0, i64 %idxprom ; <i32*> [#uses=1]
store i32 42, i32* %arrayidx
%tmp3 = load %struct.tx_buffer** @buffer ; <%struct.tx_buffer*> [#uses=1]
%tmp4 = getelementptr inbounds %struct.tx_buffer* %tmp3, i64 0, i32 1 ; <i32*> [#uses=1]
store i32 15, i32* %tmp4
ret void
}

Intel compiler (icc)
0000000000400518 <foo>:
400518: 55 push %rbp
400519: 48 89 e5 mov %rsp,%rbp
40051c: 48 83 ec 10 sub $0x10,%rsp
400520: 89 7d f8 mov %edi,0xfffffffffffffff8(%rbp)
400523: 8b 45 f8 mov 0xfffffffffffffff8(%rbp),%eax
400526: ba 0a 00 00 00 mov $0xa,%edx
40052b: 89 55 f0 mov %edx,0xfffffffffffffff0(%rbp)
40052e: 99 cltd
40052f: 8b 4d f0 mov 0xfffffffffffffff0(%rbp),%ecx
400532: f7 f9 idiv %ecx
400534: 48 63 c0 movslq %eax,%rax
400537: 48 8b 15 8a 04 10 00 mov 1049738(%rip),%rdx # 5009c8 <buffer>
40053e: c7 04 82 2a 00 00 00 movl $0x2a,(%rdx,%rax,4)
400545: 48 8b 05 7c 04 10 00 mov 1049724(%rip),%rax # 5009c8 <buffer>
40054c: c7 80 90 01 00 00 0f movl $0xf,0x190(%rax)
400553: 00 00 00
400556: c9 leaveq
400557: c3 retq

Microsoft C++ compiler

void foo(int i)
{
004113B0 push ebp
004113B1 mov ebp,esp
004113B3 sub esp,0C0h
004113B9 push ebx
004113BA push esi
004113BB push edi
004113BC lea edi,[ebp-0C0h]
004113C2 mov ecx,30h
004113C7 mov eax,0CCCCCCCCh
004113CC rep stos dword ptr es:[edi]
buffer->Message[i / 10] = 42;
004113CE mov eax,dword ptr [i]
004113D1 cdq
004113D2 mov ecx,0Ah
004113D7 idiv eax,ecx
004113D9 mov edx,dword ptr [buffer (417140h)]
004113DF mov dword ptr [edx+eax*4],2Ah
buffer->Ready = 15;
004113E6 mov eax,dword ptr [buffer (417140h)]
004113EB mov dword ptr [eax+190h],0Fh
}


Hmm, seems to me there is no consensus... I've always approached volatile in terms of making sure that a 32-bit value that is being shared across threads (and thus across CPUs) does not get cached on one CPU and then modified while it is being read on another CPU. For example, let's take this variable declaration:

__declspec(align(32)) unsigned int unIndex = 0;

On some threads, I want to increment the value by one, so I use:

InterlockedIncrement((long volatile *) &unIndex);

But on some other threads, I want to just 'read' the value. Well, there is no inherent 'interlocked read' expression. I've seen people do:

unsigned int unCurrentValue = (unsigned int) InterlockedCompareExchange((long volatile *) &unIndex, (long) unIndex, (long) unIndex);

but I think that is just silly. One should just be able to do:

unsigned int unCurrentIndex = unIndex;

Unfortunately, if unIndex is NOT declared as volatile, there is a real risk that making changes to this value within one thread running on one CPU will not be properly reflected within another thread reading the value on another CPU. By declaring:

__declspec(align(32)) volatile unsigned int unIndex = 0;

As such, unIndex would be thread safe as long as we use InterlockedIncrement() (or other interlocked semantic to modify the value) and then reads are just a straight up assignment (i.e. unsigned int unCurrentIndex = unIndex). I've implemented an IPC mechanism using memory mappings and utilizing these semantics (as opposed to mutexes) to drive a rotating linked list type of mechanism and the speed gain was in the 5000x range (as opposed to using mutexes).

In that regard, I believe that volatile can provide some considerable performance enhancement when it comes to reading and writing to boundary aligned variables that are 32 bits or less on 32 bit systems, and 64 bits or less on 64 bit systems. Correct me if I am wrong.
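
Pulling the pieces of that pattern together, a sketch (Windows API; the alignment is what keeps the plain read atomic on x86):

    #include <windows.h>

    __declspec(align(4)) volatile LONG unIndex = 0;

    void Writer() {
        InterlockedIncrement( &unIndex );  // atomic read-modify-write
    }

    LONG Reader() {
        return unIndex;                    // aligned 32-bit volatile read: a plain
                                           // atomic load on x86, no interlock needed
    }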


Arch, et al.

Was there any thought by the C++ standards committee of adding an attribute, usable with or without volatile, that enforces write and/or read fences? In C++ you can enforce this with protected variables modifiable only by member functions, but in C you cannot. IMHO this would be a better approach. Example:

__declspec(mfence) bool ready = false; // *** not available __declspec ***

-------------
My 2 cents:

IMO writing to or reading from volatile ought to enforce memory fencing (flush/refresh barrier) and obey ordering as presented in the source.
Unfortunately, this appears to NOT be the case.

volatile UCHAR* data = (volatile UCHAR*)0x1234560;
volatile UCHAR* status = (volatile UCHAR*)0x1234561;
...
while (*msg)
{
    *data = *msg++;
    *status = 1;
    while (*status)
        _mm_pause();
}

Compiler optimizations should not (must not) reorder the statements, nor assume *status is always 1, nor hoist the last three lines of the loop body out of the loop.

Jim Dempsey

www.quickthreadprogramming.com


The article is good, but could be improved by talking about memory barrier semantics.

The source I found most useful is the Linux kernel's memory-barriers documentation, which describes four kinds of barriers. The Linux source code contains examples of those barriers on a wide variety of architectures, and, looking through it, examples of where memory barriers are required. The memory-barriers documentation checked into the Linux source tree was an eye-opener, and the best I can offer the practical programmer.

All of the above assumes C (and thus C++) as your language.




Walt Howard's comment makes a good jumping-off point:
> volatile is needed in g++ generated code. I've seen non-volatile member variables not changing
> values when accessed from another thread and volatile fixed the problem IMMEDIATELY.

It fixed the problem in the 98% of cases that you were able to test. And likely in 99.999% of the remaining cases that you weren't able to test. But probably not in the 1 case that will be encountered when your product is running in that bank in Tokyo in the middle of a busy trading day and takes their network down for twenty minutes.

And since it only happens one in every gazillion times, you'll never manage to reproduce the crash in your lab to find the bug. But of course, it will almost certainly happen again six months later at the same customer who will then throw your product in the trash (if they did not indeed do so the first time).

> If you don't use volatile where appropriate, you deserve what you get.

If you are simply adding 'volatile' to variables that are shared between threads thinking that fixes your shared-data problem without bothering to understand why it may not, you will eventually reap the reward you deserve.

