Are memory stores atomic on modern multicore systems?

Consider the example code below,

int msg = 0;

P1:           P2:
msg = 1;      r = msg;

Is it OK to write such code? Is there any possibility that P2 reads some "intermediate" value (i.e., neither 0 nor 1)?

I have researched this topic for a while. However, I am still not sure whether we are allowed to write the above code.

Some said we should use a mutex.

Some said that a native-word store is atomic.

Some said an aligned native-word store is atomic.

Which one is right? Can programmers just write the above code, then assume that a native-word / aligned native-word store is atomic?

Any comment will be appreciated.

When an int, whose size may be 2, 4, or 8 bytes, resides within an alignment of 2, 4, or 8 bytes respectively, then the store is atomic (on current Intel Architecture). On earlier Intel architectures (80386, 80486, maybe the first Pentium) this was restricted to 2 and 4 bytes.

When the value is not naturally aligned (aligned to its size) but the entire int still resides within one cache line, the store would be atomic on some processors. However, do not rely on this: the guarantee of atomicity is for naturally aligned data within one cache line, for 1, 2, 4, and 8 bytes (and 16 bytes under some circumstances).
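(As an aside, a minimal C11 sketch of how one could request and verify the natural alignment described above; compilers already align a plain int this way, so this is purely illustrative.)

#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Request natural alignment explicitly (the default for a plain int anyway). */
static alignas(sizeof(int)) int msg = 0;

int main(void)
{
    /* A naturally aligned int cannot straddle a cache line,
       which is the precondition for the atomicity guarantee above. */
    assert(((uintptr_t)&msg % sizeof(int)) == 0);
    return 0;
}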

Jim Dempsey

Thanks for your reply.

Let me ask in another way,

int msg = 0;

P1:           P2:
msg = 1;      r = msg;

Can we write the above code and assume that "msg = 1" is done atomically, no matter what compiler we use and no matter what platform we run on?

Or should we instead rely on the guarantees provided by programming languages or special APIs?

Seems atomic read / write depends on a lot of things. See the link below.

http://www.lambdacs.com/cpt/FAQ.html#Q377

"Is it OK to write such code?"
Short answer: no, use an atomic type instead, that's what they're there for.

Jim is right, of course (well, I assume), but that stuff is an atomic implementer's business, not an atomic user's.
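For illustration, a minimal sketch of what "use an atomic type" looks like with the C1x/C11 <stdatomic.h> interface discussed later in this thread (the function names are placeholders):

#include <stdatomic.h>

atomic_int msg = ATOMIC_VAR_INIT(0);

/* P1 */
void writer(void)
{
    atomic_store(&msg, 1);       /* an indivisible store by definition */
}

/* P2 */
int reader(void)
{
    return atomic_load(&msg);    /* either 0 or 1, never a torn value */
}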

I suggest you also use volatile

volatile int msg = 0;

The reason being, without volatile, the compiler is free to rearrange or eliminate the store sequences by P1 as long as it does not affect P1, IOW without regard to the effects on P2. Likewise for the code execution in P2 without regard to P1.

Also, if msg can be updated simultaneously (msg used for 2-way message passing) then use InterlockedExchange (or an equivalent asm statement).

*** BAD

int msg = 0;

P1:                     P2:
msg = 1;                r = msg;
while(msg)              msg = 0;
    _mm_pause();        ...

Without volatile, the P1 while loop will not see P2 returning msg back to 0

*** good

volatile int msg = 0;

P1:                     P2:
msg = 1;                r = msg;
while(msg)              msg = 0;
    _mm_pause();        ...

Now while(msg) gets a fresh copy of msg on each iteration.

Jim Dempsey

Often atomicity is not enough. Often order of operations is critical too. E.g., if "msg==1" means that "there is other data that is ready to read", the write "msg=1" has to guarantee that earlier writes become visible before "msg==1" becomes visible. See http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/ for a discussion of why volatile is useless for such guarantees.
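To make that concrete, here is a minimal C11 sketch of the pairing described above; the function names and the payload value are illustrative only:

#include <stdatomic.h>

int        payload = 0;                    /* ordinary data written before the flag */
atomic_int msg     = ATOMIC_VAR_INIT(0);   /* "data is ready" flag */

/* P1 */
void publish(void)
{
    payload = 42;
    /* release: all earlier writes (payload) become visible
       no later than the moment msg == 1 becomes visible */
    atomic_store_explicit(&msg, 1, memory_order_release);
}

/* P2 */
int try_read(void)
{
    /* acquire: if we observe msg == 1, we are guaranteed to observe payload == 42 */
    if (atomic_load_explicit(&msg, memory_order_acquire) == 1)
        return payload;
    return -1;   /* not ready yet */
}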

Yes, atomicity and ordering are both critical.

However, I found an interesting paper "FastForward for Concurrent Threaded Pipelines" in which the authors proposed a lock-free queue and claimed the correctness of the queue only depends on atomicity.

The authors proposed the enqueue and dequeue functions as follows,

// enqueue function
put_nonblock(...) {
    if (NULL == queue[head]) {
        queue[head] = ptr;
        head = NEXT(head);
    }
}

// dequeue function
get_nonblock(...) {
    if (NULL != queue[tail]) {
        ptr = queue[tail];
        queue[tail] = NULL;
        tail = NEXT(tail);
    }
}

The basic idea of FastForward is to use the queue entry itself to indicate the empty or full condition rather than using head and tail. Therefore, head and tail are no longer shared variables.

I quoted here what the authors said about the correctness of FastForward.

...notice that this optimization also allows a CLF implementation that only depends on the coherence protocol (and not the consistency model) to ensure mutual exclusion.

To see this, recall that stores are atomic on all coherent memory systems, thus the getter and putter will never update the same node simultaneously since the getter waits for a non-NULL value and the setter a NULL value.

I am surprised that FastForward can rely only on atomicity, as the authors claim, but I don't know whether their claim is right. Any comment will be appreciated.

They should still use atomic: if it's not needed, it probably also will not cost anything, and otherwise, well...

Meanwhile I hope nobody is thinking "volatile" anymore, because it's either a nonportable equivalent for atomic, or it isn't useful at all.

Quoting jimdempseyatthecove
The reason being, without volatile, the compiler is free to rearange or eliminate the store sequences by P1 as long as it does not affect P1.

Where do you get the info?

According to ISO C/C++ volatile has nothing to do with ordering. Even if a var is volatile, compilers are free to rearrange stores around them. For sure.

> I am surprised the FastForward can only rely the atomicity as the authors claimed

And for queue construction they do not rely on both atomicity and ordering. This fact is also irrelevant here.
You should look at how they synchronize passing of user data rather than synchronization of queue cells.

The question is senseless w/o context. There are way too many answers to it. What is the context (language, compiler, hardware) you are interested in?

Quoting Dmitriy Vyukov
> I am surprised the FastForward can only rely the atomicity as the authors claimed

And for queue construction they do not rely on both atomicity and ordering. This fact is also irrelevant here.
You should look at how they synchronize passing of user data rather than synchronization of queue cells.

Hey, wait, the following code is just incorrect:

if (NULL != queue[tail]) {
ptr = queue[tail];
queue[tail] = NULL;
tail = NEXT(tail);
}

... until it's C++ where the code can have any meaning... or Microsoft C and vars are declared volatile... anyway it's incorrect w/o fences.

Assume we implement FastForward in C. Since the authors of FastForward did not mention anything about "volatile", let us assume that all the variables are simple data types (i.e., no volatile or other qualifier).

FastForward uses the queue entry to indicate the empty or full condition. In our case, NULL means the queue entry (slot) is empty. I repost the enqueue and dequeue functions here.

// enqueue function
put_nonblock(...) {
    if (NULL == queue[head]) {
        queue[head] = ptr;
        head = NEXT(head);
    }
}

// dequeue function
get_nonblock(...) {
    if (NULL != queue[tail]) {
        ptr = queue[tail];
        queue[tail] = NULL;
        tail = NEXT(tail);
    }
}

Note that head and tail are NOT shared variables anymore.

The enqueue function first checks whether the queue entry is empty, using NULL as the indicator. If the queue entry is empty, it inserts data into the entry and advances head.

The dequeue function does a similar thing. After retrieving data from the queue entry, it uses NULL to mark the entry as an empty slot.

The correctness of FastForward relies on atomic store operations, as claimed by the authors. Since you said it's incorrect w/o fences, could you show me how ordering can break the code?

Thanks.

Maybe I've been out of the loop too long, or I'm missing some precondition here, but I don't see how this works using ordinary C/C++. Saying memory stores are atomic on modern multicore systems isn't saying much; unless you've got a queue entry bridging two cache lines, most memory stores are going to be a cache line at a time. Am I missing something in this scenario?

// enqueue function                Thread 1                          Thread 2
put_nonblock(...) {
  if (NULL == queue[head]) {     Finds queue[head] NULL        Instant later does the same
    queue[head] = ptr;           queue[head] = ptr1            Overwrites ptr1 with ptr2 
    head = NEXT(head);           head = head2                  head = head3
  }
}

And likewise in the dequeue function? Without an atomic binding of the read and the write (in this case the read and write of queue[head]), how do you avoid races?

Hmm... FastForward is designed for the single-producer, single-consumer scenario. I forgot to mention this.

If the ptr values are going to actually point to something, you'll obviously need a release-store operation in the enqueue, and a load-acquire in the dequeue, and atomics can provide that. For simple data items, you still have to know for certain that the loads and stores are indivisible, so either you implicitly trust the compiler to do what you expect (odds are good, but not 100.00%, and a static analysis tool worth its salt will surely complain about races because by definition that's what you'll have), or you still use atomics (which can be relaxed, i.e., no fences, and they won't cost you anything) or inline assembly (that's like a roll-your-own atomic).

If you want to play with fire anyway, how certain are you that the compiler won't optimise away access to the circular buffer? Odds are very good, but again not 100.00% (maybe the compiler is super smart but not super super smart?), so you'll have to implicitly trust the compiler to do what you expect a second time: out-of-order stores or out-of-order loads wouldn't upset the algorithm, but they should still go to memory in a finite amount of time, so maybe it would be a good idea (though not optimal) to make the buffer at least volatile, if you allow any kind of optimisation by the compiler.

But still, the only reason not to use atomics is if you're somehow allergic to them, because they're the right tool for the job and they won't cost you anything (you'll have to set the fence, though: relaxed for simple data, an elementary fence for pointers, because C++0x seems to want sequential consistency as the default, which seems like selling a race bike with training wheels that you'll probably want to take off before an actual race). Dmitriy, what's your take on that?

(Added) Interestingly, with pointers to data the producer only needs to release on the store (the load is just a NULL), and the consumer only needs to acquire on the load (the store is just a NULL), so apparently default memory semantics on atomic operations are not always the final choice. Also, the producer shouldn't assume that a single compare-and-swap will be better, because that'll cost an arm and a leg on x86 and the algorithm doesn't require inter-operation atomicity here; I don't think anybody has invented single negative-compare-and-swaps for the consumer, though?

(Corrected typo after seeing #18: traning->training)

(Edit for clarity: shouldn't->wouldn't)

It should look as follows (C1x atomic primitives are used):

size_t head;
size_t tail;
atomic_address queue[QUEUE_LENGTH];

put_nonblock(...)
{
    if (atomic_load_explicit(&queue[head], memory_order_relaxed) == 0)
    {
        atomic_store_explicit(&queue[head], ptr, memory_order_release);
        head = NEXT(head);
    }
}

get_nonblock(...)
{
    if (atomic_load_explicit(&queue[tail], memory_order_relaxed) != 0)
    {
        ptr = atomic_load_explicit(&queue[tail], memory_order_consume);
        atomic_store_explicit(&queue[tail], 0, memory_order_relaxed);
        tail = NEXT(tail);
    }
}

> The enqueue function first checks whether the queue entry is empty, using NULL as the indicator. If the queue entry is empty, it inserts data into the entry and advances head.

It does not work that way.

> Since you said it's incorrect w/o fences, could you show me how ordering can break the code?

For sure. Consider:

// this is part of the user's code
msg->data = 47;
// and this is put_nonblock()
if (queue[head] == 0)
{
    queue[head] = msg;
    head = NEXT(head);
}

Either C/C++ compiler OR hardware is free to reorder the code as:

if (queue[head] == 0)
{
    queue[head] = msg;
    head = NEXT(head);
}
msg->data = 47;

As you may see, single-threaded semantics are preserved by the transformation, and that's what both the C/C++ compiler and the hardware preserve by default.
However, multi-threaded semantics are definitely broken by the transformation - the consumer will receive a partially constructed message.

The same kind of transformation can occur on the consumer side too (hence the memory_order_consume ordering constraint). However, the details are much more wicked. You may think of it as:

if (queue[tail] != 0)
{
    msg = queue[tail];
    queue[tail] = 0;
    tail = NEXT(tail);
    data = msg->data;
    printf("%d", data);
}

transformed to:
data = *(0x12345678);
if (NULL != queue[tail])
{
    ptr = queue[tail];
    if (ptr != 0x12345678)
        data = *ptr;
    queue[tail] = NULL;
    tail = NEXT(tail);
    printf("%d", data);
}

That's also perfectly correct single-threaded optimization that breaks multi-threaded code.

> C++0x seems to want to have sequential consistency as the default, which
seems like selling a race bike with training wheels that you'll probably
want to take off before an actual race). Dmitriy, what's your take on
that?

Yeah, race bike with training wheels. LOL
Why in this world would one want atomic operations with seq_cst everywhere? I guess mutexes can be faster than that, because they do not have to be globally ordered.
And personally I do not want any defaults for atomics, on the contrary, I want to be as explicit and as voluble as possible.

"ptr = atomic_load_explicit(&queue[tail], memory_order_consume);"
I think that should be memory_order_acquire, shouldn't it? But I still don't "get" memory_order_consume, so maybe it's just me...

Quoting Raf Schietekat

"ptr = atomic_load_explicit(&queue[tail], memory_order_consume);"
I think that should be memory_order_acquire, shouldn't it? But I still don't "get" memory_order_consume, so maybe it's just me...

It may be either way depending on user-level semantics provided by the queue.

consume is faster and enough in 99% of real-world usages.

Could you explain it, though? What's the difference in generated instructions, in performance? What makes that 1% so significantly different that we need to differentiate consume from acquire?

(Added) I'm guessing the difference is on the compiler side only, with acquire meaning that everything is flushed out first and reread after the load, whereas consume only flushes whatever has a dependency relationship?

This article might be helpful in understanding the general issues:

Memory Models: A Case for Rethinking Parallel Languages and Hardware

Solving the memory model problem will require an ambitious and cross-disciplinary research direction.

Sarita V. Adve, Hans-J. Boehm

Communications of the ACM Vol. 53 No. 8, Pages 90-101

Free for now at: http://cacm.acm.org/magazines/2010/8/96610-memory-models-a-case-for-rethinking-parallel-languages-and-hardware/fulltext

See also: http://lambda-the-ultimate.org/node/4041

- Daniel

Quoting Raf Schietekat
Could you explain it, though? What's the difference in generated instructions, in performance? What makes that 1% so significantly different that we need to differentiate consume from acquire?

(Added) I'm guessing the difference is on the compiler side only, with acquire meaning that everything is flushed out first and reread after the load, whereas consume only flushes whatever has a dependency relationship?

Yes and no.

Yes, acquire affects all subsequent operations, whereas consume affects only dependent operations.

No, it affects both compiler and hardware.

As for performance, for x86/SPARC TSO there is no difference - all loads are acquire anyway. For PPC/SPARC RMO/IA64 there is a difference - acquire includes hardware fence, while consume is implicit (the same way acquire is implicit on x86). For mythical Alpha, both acquire and consume include hardware fence.

So if you target only x86, don't burden your mind with this.

If you target a broader class of architectures, consume is preferable. There are special tricks like "dependency injection".
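A minimal sketch of the difference in C11 terms (the names are illustrative only; note that in practice most compilers conservatively promote consume to acquire):

#include <stdatomic.h>
#include <stddef.h>

struct msg { int data; };

/* written by the producer with memory_order_release */
_Atomic(struct msg *) mailbox = ATOMIC_VAR_INIT(NULL);

int read_mailbox(void)
{
    /* consume: only loads that are data-dependent on p (p->data below) are
       ordered after the pointer load; on PPC/IA64 this needs no hardware fence,
       whereas acquire would also order unrelated accesses and would need one.
       On x86/SPARC TSO both are free. */
    struct msg *p = atomic_load_explicit(&mailbox, memory_order_consume);
    return p ? p->data : -1;
}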

#22 "This article might be helpful in understanding the general issues:"
But not with specific issues. :-)

#23 "As for performance, for x86/SPARC TSO there is no difference - all loads are acquire anyway. For PPC/SPARC RMO/IA64 there is a difference - acquire includes hardware fence, while consume is implicit (the same way acquire is implicit on x86). For mythical Alpha, both acquire and consume include hardware fence."
Only Alpha? I didn't know that. Still, to conservatively approximate consume on pre-C++0x, you need a compiler fence, right? Do you happen to have any idea about what effect this might typically have on performance (from best to worst: real consume vs. approximated consume vs. acquire)?

Quoting Raf Schietekat

#22 "This article might be helpful in understanding the general issues:"
But not with specific issues. :-)

#23 "As for performance, for x86/SPARC TSO there is no difference - all loads are acquire anyway. For PPC/SPARC RMO/IA64 there is a difference - acquire includes hardware fence, while consume is implicit (the same way acquire is implicit on x86). For mythical Alpha, both acquire and consume include hardware fence."
Only Alpha? I didn't know that. Still, to conservatively approximate consume on pre-C++0x, you need a compiler fence, right? Do you happen to have any idea about what effect this might typically have on performance (from best to worst: real consume vs. approximated consume vs. acquire)?

> Only Alpha?

As far as I know, yes. It's able to load data via a pointer before loading the pointer. Yes, it's impossible. But it's able.


> Still, to conservatively approximate consume on pre-C++0x, you need a compiler fence, right?

Yes. Volatile load + compiler fence.

> Do you happen to have any idea about what effect this might typically have on performance

Compiler fence usually has virtually zero impact on performance.
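A sketch of what that approximation might look like (assuming a GCC-style inline-asm clobber for the compiler-only fence; this is pre-C1x, non-portable code):

struct msg { int data; };

/* the pointer itself is volatile, so the load of it cannot be optimised away */
struct msg *volatile mailbox;

/* compiler-only fence: no instruction is emitted, but the compiler
   may not move memory accesses across it */
#define COMPILER_FENCE() __asm__ __volatile__("" ::: "memory")

int read_mailbox(void)
{
    struct msg *p = mailbox;   /* volatile load of the pointer */
    COMPILER_FENCE();          /* keep the dependent load below the pointer load */
    return p ? p->data : -1;
}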

"> Only Alpha?
As far as I know, yes. It's able to load data via a pointer before loading the pointer. Yes, it's impossible. But it's able."
It doesn't seem all that improbable if the logic to perform speculative loads has a mind of its own that doesn't know, perhaps because it is not even located inside the processor, whether the pattern it imagines originated in a local register-based loop or from values obtained from elsewhere? But now I'm just speculating myself. :-)

"> Still, to conservatively approximate consume on pre-C++0x, you need a compiler fence, right?
Yes. Volatile load + compiler fence."
Let's make that an inline-assembler load + compiler fence, just to sleep easier.

"> Do you happen to have any idea about what effect this might typically have on performance
Compiler fence usually has virtually zero impact on performance."
Is that from measurements, and how smart was the compiler about consume? It seems difficult to predict otherwise, because the number of registers to be reloaded would depend both on the processor architecture and on what typically happens in a program around a load+consume. And if you have to reload a lot of it anyway, what's the real cost of an acquire? But never mind, it was just out of curiosity.

(Added after #26) Good article, though (the one mentioned in #22, I mean). It's nice to know that part of why I still don't fully understand this stuff may be that the state of the art itself is still flawed, after 3 decades of research. :-)

Question probably for Dmitriy: do you have any opinion about perhaps having a "produce" fence that pairs with "consume"? At the very least it would serve as self-documentation, even if it would always have the same effect as "release" (whether inherently or as a conservative approximation), and a test application could verify the pairing statically and/or dynamically, even if only to validate the purpose. But perhaps the compiler could also avoid spilling (sic?) registers to memory, if it knows that there's only a "consume" on something unrelated at the other end? I can't put my finger on it, but it feels :-) somehow wrong that this pairing has not been proposed for C++0x.
