Spin-Locks vs. Atomic Instructions

Hello all,

I have been implementing some algorithms from "The Art of Multiprocessor Programming" in TBB. I have found, however, that the book sometimes relies upon the Java volatile keyword. The best way to mimic this functionality in TBB, it seems, is to have a spin-lock wrapped around the variable that would have been volatile in Java.
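
For instance, something like this minimal sketch is what I have in mind (using tbb::spin_mutex; the class and member names are just mine, for illustration):

#include "tbb/spin_mutex.h"

// A variable guarded by a spin-lock, standing in for a Java volatile.
// Minimal sketch; the names here are mine, not from the book or TBB.
class guarded_int {
    tbb::spin_mutex mutex;
    int value;
public:
    int load() {
        tbb::spin_mutex::scoped_lock lock(mutex);  // acquires; releases on scope exit
        return value;
    }
    void store(int v) {
        tbb::spin_mutex::scoped_lock lock(mutex);
        value = v;
    }
};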

I have found that the limited size of atomic variables is a problem even with simple algorithms. For instance, if I want to atomically alter two variables, I need to pack them into a single atomic. What I have done so far is use spin-locks around the variables that I would normally want to alter atomically. How well would this scale as the number of threads/cores increases?
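
For example, to alter two 32-bit values together I end up with something like this (a sketch only, assuming a lock-free 64-bit tbb::atomic on the platform):

#include "tbb/atomic.h"
#include <stdint.h>

// Two 32-bit values packed into one 64-bit atomic so they change together.
tbb::atomic<uint64_t> packed;

// Atomically increment the high half without disturbing the low half.
void bump_high() {
    uint64_t old = packed;
    for (;;) {
        uint64_t desired = old + (uint64_t(1) << 32);
        uint64_t seen = packed.compare_and_swap(desired, old);
        if (seen == old) break;  // CAS succeeded
        old = seen;              // lost a race: retry with the fresh value
    }
}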

Is software transactional memory worth looking at, performance-wise? Can it really offer any better performance than spin-locks?

Do I have other alternatives to atomic/spin-locks in writing fast parallel algorithms?

Thanks,

AJ

My bet: if you take care to put the mutex in the same cache line as what you want to protect, the only significant penalty you pay relative to an atomic operation is the risk of convoying, but I have nothing to back that up (no pun intended). I see no problem of scalability (if used with my super duper spin lock algorithm), and if the protected sections are short enough, STM doesn't buy you any performance either (only ease of use). Just make your protected sections short enough by, e.g., taking a snapshot, doing any expensive operations, getting a lock, making sure there are no ABA issues, and completing the "transaction" ("optimistic locking"?). Again, like those expensive operations, all this is completely speculative: please correct any misconceptions. You (and I) should probably also look into the optimisations that can be had from techniques involving mostly-read data, where it still pays to incur a huge penalty for writes because so many more reads were so much faster (as I understand it); I'm sure Dmitriy Vyukov can provide some pointers (again).
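
To make that concrete, here's an entirely speculative sketch of the snapshot/validate/commit idea (a version counter stands in for ABA detection; all the names are mine):

#include "tbb/spin_mutex.h"

// Entirely speculative sketch of "optimistic locking": snapshot, compute
// outside the lock, then briefly validate and commit.
struct Protected {
    tbb::spin_mutex mutex;
    unsigned version;  // bumped on every commit; detects interference/ABA
    int data;
};

int expensive_transform(int);  // the costly work we keep outside the lock

void update(Protected& p) {
    for (;;) {
        unsigned v;
        int snapshot;
        {   // short critical section: take a consistent snapshot
            tbb::spin_mutex::scoped_lock lock(p.mutex);
            v = p.version;
            snapshot = p.data;
        }
        int result = expensive_transform(snapshot);  // no lock held here
        {   // short critical section: validate and commit
            tbb::spin_mutex::scoped_lock lock(p.mutex);
            if (p.version == v) { p.data = result; ++p.version; return; }
        }   // somebody interfered: retry with a fresh snapshot
    }
}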

This reminds me of my dilemma with Atomic floats; the information in that thread might be of use for you too, AJ. Anyway, I also dreamed of a near future of transactional memory architectures, but as Arch pointed out in that thread (with some articles), transactional memory is still in its infancy and likely years down the road from becoming really useful when performance matters. If you just need a little boost, or a scalable system, it's probably OK, but not if you're trying to juice the machine.

Anyway, my tests of random access to arrays of spin-locked values and atomic values showed spin-locks were better by a small margin; but then again, as Raf mentions, I had them on the same cache line as the values.

I actually found that the data structure I wrapped with a spin_lock performed very well. No official benchmarks though.

Quoting - AJ
I have been implementing some algorithms from "The Art of Multiprocessor Programming" in TBB. I have found, however, that the book sometimes relies upon the Java volatile keyword. The best way to mimic this functionality in TBB, it seems, is to have a spin-lock wrapped around the variable that would have been volatile in Java.

The precise representation of Java's volatile is an atomic variable with sequentially consistent operations on it.
However, in many cases an algorithm may not require sequential consistency, but Java just doesn't have more precise instruments than sequentially consistent volatiles (shame on it). So if you are using TBB atomics you may replace sequentially consistent operations with acquire/release operations in such cases.
For example, on x86 a sequentially consistent store is LOCK XCHG, whereas a store-release is just a plain MOV.
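
For example, publication needs only release/acquire (a sketch in C++0x-style notation; TBB atomics can express the same thing):

#include <atomic>

// Sketch in C++0x-style notation: release/acquire is enough to publish
// data, and on x86 both operations below compile to plain MOVs.
std::atomic<int> ready(0);
int payload;

void producer() {
    payload = 42;
    ready.store(1, std::memory_order_release);  // plain MOV on x86;
                                                // seq_cst would be LOCK XCHG
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {  // plain MOV on x86
        int r = payload;  // guaranteed to read 42
        (void)r;
    }
}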

Quoting - AJ
I have found that the limited size of atomic variables is a problem even with simple algorithms. For instance, if I want to atomically alter two variables, I need to pack them into a single atomic. What I have done so far is use spin-locks around the variables that I would normally want to alter atomically. How well would this scale as the number of threads/cores increases?

I can't follow you here. Volatile variables are limited in size just as atomics are (otherwise they too would be implemented with something like a mutex). So what's the problem?

Note that:

volatile struct X
{
int m;
int n;
};

is no more than:

struct X
{
volatile int m;
volatile int n;
};

However, I'm not sure that Java has structs or allows declaring them as volatile, so just ignore this part.

Quoting - AJ
What I have done so far is use spin-locks around the variables that I would normally want to alter atomically. How well would this scale as the number of threads/cores increases?

If the data structure is under light load, then it doesn't matter.
If the data structure is under heavy load, then no, a spin-mutex is no more scalable than an atomic.
There are just no scalable centralized shared data structures. Period.

Quoting - AJ
Is software transactional memory worth looking at, performance-wise? Can it really offer any better performance than spin-locks?

Nope. IMVHO, STM is the slowest of all techniques. If you are frequently mutating the data structure, then STM will offer you horrible performance and scalability. If you are mostly reading the data structure, then some STM implementations will offer reasonable scalability by using something like the SeqLock technique. But there are still techniques that handle read-mostly workloads much better.

Quoting - AJ
Do I have other alternatives to atomic/spin-locks in writing fast parallel algorithms?

You are looking in the wrong direction. Atomic operations can only somewhat improve a bad, non-scalable design.
If you need scalability (not scalability as in STM - "Our STM offers good scalability... well, we don't say it to you, but on 16 cores our STM doesn't even come close to a single-threaded implementation..." - but real scalability: 16 times faster on 16 cores), then you had better look at things like privatization, partitioning, immutability, distribution, amortization, copy-on-write, partial-copy-on-write, the lock-free reader pattern, etc.
Consider the following benchmarks:
http://software.intel.com/en-us/forums/showthread.php?t=60494
The spin-mutex implementation shows scalability of 0.63 even on 2 cores (linear scalability would be 2.0), even though a hash map is quite a distributed data structure in itself. If you measure some inherently centralized data structure, the performance degradation will be much more severe.
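
To illustrate privatization (a sketch, assuming a TBB recent enough to have enumerable_thread_specific): give each thread its own counter, and pay only on the rare read:

#include "tbb/enumerable_thread_specific.h"

// Sketch of privatization: per-thread counters, combined only when read.
tbb::enumerable_thread_specific<long> counters;

void on_event() {
    ++counters.local();  // touches only this thread's copy: no sharing,
                         // no atomics, scales linearly with core count
}

long read_total() {       // rare, slow path: walk all thread-local copies
    long sum = 0;
    for (tbb::enumerable_thread_specific<long>::const_iterator
             it = counters.begin(); it != counters.end(); ++it)
        sum += *it;
    return sum;
}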

Quoting - Dmitriy Vyukov
If you need scalability (not scalability as in STM - "Our STM offers good scalability... well, we don't say it to you, but on 16 cores our STM doesn't even come close to a single-threaded implementation..." - but real scalability: 16 times faster on 16 cores)

Regarding STM. Consider Deuce-STM:
http://sites.google.com/site/deucestm/Home
(notice "...lets you develop highly concurrent applications..." on the main page).
Now consider benchmark results:
http://sites.google.com/site/deucestm/documentation/benchmarks-1
On fig. 2 you can see that the lock-based implementation yields 3.5*10^6 ops/sec on 1 thread; let's assume that a purely single-threaded implementation w/o the lock would yield 5*10^6 ops/sec. Linear scalability on 16 cores would then be 80*10^6 ops/sec. You can see that the STM-based implementation on 16 cores yields only 2.5*10^6 ops/sec.
Fig. 1 is even worse. Notice how they stretch the Y-axis so that STM looks somehow scalable. LOL.
STM benchmarkers ought to draw on their graphs the line that represents the abstract linear scalability of the purely single-threaded implementation (w/o locks) to show the feebleness of STM. LOL.
Note that the TL2 STM algorithm is quite state of the art.

#4 "Precise representation of Java's volatile is atomic variable with sequentially consistent operations on it."
How do you figure that? Aren't the only sequentially consistent among themselves (rhetorical)? Anyway, that's what I modelled with memory semantics "mut_ord" (for "mutually ordered"), which is less strict and, because it wouldn't make sense otherwise, maybe a little bit faster than "ordered" (I didn't like "seq_cst" for "sequentially consistent" because you never quite know what it means either anyway, as this discussion shows), although not on all architectures (no x86 I don't recall that it makes any difference).

#5 "I can't follow you here. volatile variables are limited in size just as atomics (otherwise they are also implemented with something like mutex). So what's the problem?"
I'm not sure it was implied that Java volatile applies to entire objects, but I do recall some discussions about applying atomic to entire classes etc. I'd have to check what C++0x is planning to do there.

#6 "There is just no scalable centralized shared data-structures. Period."
Admitted, but there are still degrees of scalability: some mechanisms quickly degrade through convoying (locks) and livelock or something very similar to it (bad locks, bad lock-free algorithms).

On a related point, what do you think: will Rock rock?

Quoting - Raf Schietekat

#4 "Precise representation of Java's volatile is atomic variable with sequentially consistent operations on it."
How do you figure that? Aren't the only sequentially consistent among themselves (rhetorical)? Anyway, that's what I modelled with memory semantics "mut_ord" (for "mutually ordered"), which is less strict and, because it wouldn't make sense otherwise, maybe a little bit faster than "ordered" (I didn't like "seq_cst" for "sequentially consistent" because you never quite know what it means either anyway, as this discussion shows), although not on all architectures (no x86 I don't recall that it makes any difference).

I figured that from the language spec. They order all memory accesses.

Quoting - Raf Schietekat

On a related point, what do you think: will Rock rock?

I can't say anything regarding Rock itself, but I think the HTM feature will rock amongst hardcore low-level programmers for whom it's worth optimizing for one particular Sun processor. I think that the main consumers will be Solaris and the Sun JVM.
HTM is much better than STM because it allows overhead-free, perfectly scalable read-only transactions. Although there are some caveats too (at least for the unmanaged world): for example, if one reads an object from a concurrent container, one must either include the object processing in the transaction or use something like reference counting, and both options are not very good.
Mutating operations are also handled better to some degree. But because of its limitations, I think the reasonable usage is to construct a double-CAS or triple-CAS from HTM, and then work in terms of these CASes.
But for now it's more of a theoretical interest for me; I am curious when AMD will finally implement HTM.
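
For example, a double-CAS on top of HTM could look like this (a sketch only; htm_begin/htm_commit are hypothetical intrinsics standing in for whatever the hardware actually offers):

// Sketch of a double-CAS built from HTM. htm_begin/htm_commit are
// HYPOTHETICAL intrinsics, not a real ISA or library.
bool htm_begin();   // starts a transaction; false means it aborted
void htm_commit();  // commits all reads/writes made in the transaction

bool dcas(int* a, int cmp_a, int new_a,
          int* b, int cmp_b, int new_b) {
    if (!htm_begin())
        return false;        // aborted: caller retries or falls back
    if (*a != cmp_a || *b != cmp_b) {
        htm_commit();        // read-only transaction: nothing to undo
        return false;
    }
    *a = new_a;
    *b = new_b;
    htm_commit();            // both stores become visible atomically
    return true;
}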

#11 "I figured that from language spec. They order all memory accesses."
I'm afraid that this is just in your imagination, although I'm always eager to be educated. :-)

(Added) I'm prepared to assume that you know internally what you're talking about, but then you should add some more knowledge barriers so that we don't have to reconstruct what you know out of thin air (sorry, couldn't help myself). My take is that Java volatiles are only sequentially consistent with regard to other volatiles (not quite what you wrote); with regard to non-volatiles, they basically only do load-acquire or release-store. This is what I tried to mimic with mut_ord memory semantics (mutually ordered with other such atomics). The implication is that if you have an algorithm in C++ with ordered (C++0x's seq_cst) atomics, it is not enough (in general) to substitute Java volatile for just those atomics (doesn't RRD model them as weaker?); the other way around, substituting C++ ordered atomics for Java volatiles is more expensive than strictly needed (although I'm still assuming that the difference may be relatively minor). Am I mistaken?

#0 "The best way to mimick this functionality in TBB, it seems, is to have a spin-lock wrapped around the variable that would have been volatile in Java."
If all "volatiles" that may interact share the same mutex, that would be sufficient, but more expensive than necessary, even disregarding the possibility that Java volatile is overkill. My patch's mut_ord_atomic is meant to be a closer match, although something more appropriate to the algorithm's needs can probably be found instead; feel free to share any benchmarking outcomes.

Quoting - Raf Schietekat
#11 "I figured that from language spec. They order all memory accesses."
I'm afraid that this is just in your imagination, although I'm always eager to be educated. :-)

(Added) I'm prepared to assume that you know internally what you're talking about, but then you should add some more knowledge barriers so that we don't have to reconstruct what you know out of thin air (sorry, couldn't help myself). My take is that Java volatiles are only sequentially consistent with regard to other volatiles (not quite what you wrote); with regard to non-volatiles, they basically only do load-acquire or release-store. This is what I tried to mimic with mut_ord memory semantics (mutually ordered with other such atomics). The implication is that if you have an algorithm in C++ with ordered (C++0x's seq_cst) atomics, it is not enough (in general) to substitute Java volatile for just those atomics (doesn't RRD model them as weaker?); the other way around, substituting C++ ordered atomics for Java volatiles is more expensive than strictly needed (although I'm still assuming that the difference may be relatively minor). Am I mistaken?

Accesses to Java volatiles (along with other synchronization actions - mutex lock/unlock, thread start/end, etc.) form a total global order S over all synchronization actions (which agrees with program order). Any two actions of the same thread (including plain memory accesses) are always connected by the happens-before relation (according to program order).
This is what I was trying to say. Probably my English lets me down.
And I still state that the equivalent of Java volatiles is an atomic var with seq_cst operations on it.
I'd better just paste the spec:

17.4.4 Synchronization Order
Every execution has a synchronization order. A synchronization order is a total order over all of the synchronization actions of an execution. For each thread t, the synchronization order of the synchronization actions (17.4.2) in t is consistent with the program order (17.4.3) of t.
Synchronization actions induce the synchronized-with relation on actions...

17.4.5 Happens-before Order
Two actions can be ordered by a happens-before relationship. If one action happens-before another, then the first is visible to and ordered before the second.
If we have two actions x and y, we write hb(x, y) to indicate that x happens-before y.
If x and y are actions of the same thread and x comes before y in program order, then hb(x, y).
There is a happens-before edge from the end of a constructor of an object to the start of a finalizer (12.6) for that object.
If an action x synchronizes-with a following action y, then we also have hb(x, y).
If hb(x, y) and hb(y, z), then hb(x, z).
...

17.4.2 Actions
...
Synchronization actions, which are:
Volatile read. A volatile read of a variable.
Volatile write. A volatile write of a variable.
...

This is what I was trying to say.

Let's narrow it down: do you agree that a normal load before a Java volatile load or a normal store after a Java volatile store can be reordered, and that this is not allowed with C++0x seq_cst, which would make them not equivalent? If you do not agree, why not?

(Added) Or maybe they are equivalent and my patch's "ordered" is stronger than C++0x seq_cst?

Quoting - Raf Schietekat
Let's narrow it down: do you agree that a normal load before a Java volatile load or a normal store after a Java volatile store can be reordered, and that this is not allowed with C++0x seq_cst, which would make them not equivalent? If you do not agree, why not?

(Added) Or maybe they are equivalent and my patch's "ordered" is stronger than C++0x seq_cst?

I do not agree.

17.4.5 Happens-before Order
...
If x and y are actions of the same thread and x comes before y in program order, then hb(x, y).
...

If you have a normal load A sequenced before a volatile load B in program order, then A happens-before B.
If you have a normal store A sequenced after a volatile store B in program order, then B happens-before A.

"I do not agree."
Well, apparently you didn't always see them as equivalent yourself. I'm just repeating what Doug Lea wrote in "The JSR-133 Cookbook" ("Can Reorder" table, near the top)... I'm pretty sure he has read the same things you are quoting, him being part of JSR 133's Expert Group and all (in case anybody's wondering: this specification was subsequently incorporated into the Java Language Specification). I can see the apparent contradiction, and I probably should dive into the JLS sometime to get to the bottom of this, but right now I have this table, and some vague memory of another text that ascribed total order only to the volatiles among themselves and specifically complained about how expensive this is on x86 without providing the same full-fence guarantees in general so that code would be written that would run only on x86 (in case anybody from Intel is gloating right now: I think Itanium would be affected, too), or something like that (I could have just dreamt it, though).

Quoting - Raf Schietekat
"I do not agree."
Well, apparently you didn't always see them as equivalent yourself.

It has nothing to do with seq_cst atomics. It's about stand-alone seq_cst fences which are a bit strange in C++0x.

Quoting - Raf Schietekat
I'm just repeating what Doug Lea wrote in "The JSR-133 Cookbook" ("Can Reorder" table, near the top)... I'm pretty sure he has read the same things you are quoting, him being part of JSR 133's Expert Group and all (in case anybody's wondering: this specification was subsequently incorporated into the Java Language Specification). I can see the apparent contradiction, and I probably should dive into the JLS sometime to get to the bottom of this, but right now I have this table, and some vague memory of another text that ascribed total order only to the volatiles among themselves and specifically complained about how expensive this is on x86 without providing the same full-fence guarantees in general so that code would be written that would run only on x86 (in case anybody from Intel is gloating right now: I think Itanium would be affected, too), or something like that (I could have just dreamt it, though).

I think you are mixing up formal semantics and implementation details. Notice the name of the paper: "... for compiler writers". I believe the possible reorderings you mentioned do not affect formal semantics. Consider: C++0x's seq_cst loads are implemented as plain loads on x86; there is no need to issue MFENCEs before and after loads - formal sequential consistency is still provided.
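
Roughly the mapping I mean (a sketch of a typical x86 lowering, as I understand it, not a mandate):

// Typical x86 lowering of C++0x atomic operations (a sketch):
//   load,  memory_order_seq_cst  -> MOV        (plain load)
//   store, memory_order_seq_cst  -> LOCK XCHG  (or MOV + MFENCE)
//   load,  memory_order_acquire  -> MOV
//   store, memory_order_release  -> MOV
// The whole fence cost is attached to seq_cst stores, so seq_cst loads
// stay plain, and formal sequential consistency is still provided.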

Well, I shouldn't have referred to that discussion without first reading it through, sorry, I just thought that maybe you just forgot that you really agree with me. :-)

I don't recall ever having that much trouble with quantum physics...

On the one hand you disagree with "a normal load before a Java volatile load or a normal store after a Java volatile store can be reordered" because of the reasons you give in #17, and on the other hand those exact same reorderings in the Cookbook "do not affect formal semantics".

I used to call that a direct contradiction, before I entered the twilight zone of memory semantics.

Quoting - Raf Schietekat
I don't recall ever having that much trouble with quantum physics...

On the one hand you disagree with "a normal load before a Java volatile load or a normal store after a Java volatile store can be reordered" because of the reasons you give in #17, and on the other hand those exact same reorderings in the Cookbook "do not affect formal semantics".

I used to call that a direct contradiction, before I entered the twilight zone of memory semantics.

Quantum physics relates to the real world, so it has to be consistent. Formal memory models do not.

Some reorderings may happen at the machine instruction level, but you are not able to detect whether they happen or not. So, on one hand, yes, they happen. On the other hand, you can think that they do not happen. Is that a contradiction? I don't know.

So formal memory models do not have to be consistent, and you can be an expert without being able to spot a contradiction? To me that sounds more like astrology than computer science, and I refuse to accept that.

Here's another one, from a FAQ for us dummies: "Writing to a volatile field has the same memory effect as a monitor release, and reading from a volatile field has the same memory effect as a monitor acquire." Nice and consistent with my observation from the Cookbook "that a normal load before a Java volatile load or a normal store after a Java volatile store can be reordered" (as I interpreted it).

Yet you specifically "do not agree" (#17). So who is mistaken here? You can't all be right, or can you?

Quoting - Raf Schietekat
So formal memory models do not have to be consistent, and you can be an expert without being able to spot a contradiction? To me that sounds more like astrology than computer science, and I refuse to accept that.

Here's another one, from a FAQ for us dummies: "Writing to a volatile field has the same memory effect as a monitor release, and reading from a volatile field has the same memory effect as a monitor acquire." Nice and consistent with my observation from the Cookbook "that a normal load before a Java volatile load or a normal store after a Java volatile store can be reordered" (as I interpreted it).

Yet you specifically "do not agree" (#17). So who is mistaken here? You can't all be right, or can you?

Ok, let's narrow it down one more time: are you interested in the machine instruction level (you are a compiler writer)? Or are you interested in program semantics (you are an end user, a programmer)? The answer will vary depending on your choice. If you are a compiler writer, forget about the specification and use the cookbook. If you are a programmer, forget about the cookbook and use the specification. But then do not try to extrapolate the rules of one world to the other.

That's ridiculous: why would these points of view not be in complete agreement? Well, they could both be providing a partial view depending on what hat you're wearing, but then there should also be a description from which both can be deduced. In this case I am still assuming a priori that a Cookbook can be deduced from the Java Language Specification, and that Doug Lea's Cookbook is a correct deduction.

I think that in #17 you were misinterpreting happens-before, maybe because it has such a horribly deceptive name: even within section 17.4.5 it is made clear that this relationship is only a building block for the specification, not a final description of what reorderings are allowed, and it is not even true in general that (as you write) "you are not able to detect whether they happen or not".

Quoting - Raf Schietekat
That's ridiculous: why would these points of view not be in complete agreement? Well, they could both be providing a partial view depending on what hat you're wearing, but then there should also be a description from which both can be deduced. In this case I am still assuming a priori that a Cookbook can be deduced from the Java Language Specification, and that Doug Lea's Cookbook is a correct deduction.

I think that in #17 you were misinterpreting happens-before, maybe because it has such a horribly deceptive name: even within section 17.4.5 it is made clear that this relationship is only a building block for the specification, not a final description of what reorderings are allowed, and it is not even true in general that (as you write) "you are not able to detect whether they happen or not".

Ok, another ridiculous example: single-threaded instruction reordering by an optimizing compiler (hmmm... probably it's actually the same example, but I hope the point is clearer here). Do such reorderings happen? The answer depends on whom you ask. If you ask a compiler writer, you will get "Yes, of course! We do tons of reordering in optimizing mode". If you ask a programmer, you will get "No, definitely not! How is it possible that I write one program and the compiler reorders some operations?! The execution of my single-threaded program is always sequentially consistent, so NO reordering is allowed!".
So you get 2 opposite answers. Is it ridiculous? Is it a contradiction?

The compiler writer should know better than to frighten the innocent programmer with incomplete information. It's even worse if he doesn't provide a more specific answer when prompted. It's a failure to communicate effectively.

Actually, the programmer is also wrong: only the observed order of sequence points will be preserved.

Quoting - Raf Schietekat
The compiler writer should know better than to frighten the innocent programmer with incomplete information. It's even worse if he doesn't provide a more specific answer when prompted. It's a failure to communicate effectively.

As you can see, this has not caused any problems in real life over tens of years with millions of programmers. So I believe there is no problem at all. And the glue here is "documented/observable behavior": the task of the programmer is to rely on documented/observable behavior, and the task of the compiler writer is to preserve documented/observable behavior, nothing more and nothing less.
The Java specification says nothing about reordering of machine instructions in machine code; it only declares visible sequential consistency. And you are trying to combine low-level reordering at the machine instruction level with high-level observable behavior.

"Ne supra crepidam sutor iudicaret." - Plinius

Apparently you want the cobbler to stick to his last: a programmer should stay in his cubicle and is not supposed to be thinking for himself other than within the boundaries set for him by his betters. But the serial-code comparison is flawed (it does not address the reorderings that are visible, let alone the possible problems in the context of multithreaded code), and you draw the wrong conclusions (I say that the compiler writer should have given more complete information, and you say that "as [I] can see" it doesn't matter), so what is the relevance, other than creating a distraction?

Of course I am trying to combine both viewpoints! Why shouldn't a user be allowed to consider what might happen if a normal store follows a volatile store, and why, and what the corresponding machine code will be? I do not subscribe to the idea that specifications should be obscure, and that knowledge should trickle down a hierarchy from high priesthood to simpleton programmers. A significant number of the latter category should be able to go directly to the specification to resolve issues, and this specification should be as transparent as it can be (I have my doubts). I see contradictions between what you say and what even a FAQ says (I referred to a FAQ for programmers as well as to the Cookbook supposedly for compiler writers), and I should be able to say that you can't both be right (meant as a euphemism). You also misinterpreted the "high-level" specification, because you confused "happens-before" with observable behaviour by invoking this part of the specification for disagreeing with my example.

(To the administrators: can we sell BBB PPV tickets? :-) )

Quoting - Raf Schietekat

"Ne supra crepidam sutor iudicaret." - Plinius

Apparently you want the cobbler to stick to his last: a programmer should stay in his cubicle and is not supposed to be thinking for himself other than within the boundaries set for him by his betters. But the serial-code comparison is flawed (it does not address the reorderings that are visible, let alone the possible problems in the context of multithreaded code), and you draw the wrong conclusions (I say that the compiler writer should have given more complete information, and you say that "as [I] can see" it doesn't matter), so what is the relevance, other than creating a distraction?

Of course I am trying to combine both viewpoints! Why shouldn't a user be allowed to consider what might happen if a normal store follows a volatile store, and why, and what the corresponding machine code will be? I do not subscribe to the idea that specifications should be obscure, and that knowledge should trickle down a hierarchy from high priesthood to simpleton programmers. A significant number of the latter category should be able to go directly to the specification to resolve issues, and this specification should be as transparent as it can be (I have my doubts). I see contradictions between what you say and what even a FAQ says (I referred to a FAQ for programmers as well as to the Cookbook supposedly for compiler writers), and I should be able to say that you can't both be right (meant as a euphemism). You also misinterpreted the "high-level" specification, because you confused "happens-before" with observable behaviour by invoking this part of the specification for disagreeing with my example.

(To the administrators: can we sell BBB PPV tickets? :-) )

Sorry, I can't read Latin.

I don't understand what you want me to say :) I can try to make some more statements, probably I will guess:

The programmer is allowed to look at the generated machine code.

The relevance of the example with single-threaded reordering: some reorderings are visible nowhere except in the machine code, so they are of little interest to anybody except compiler developers.

A normal store may hoist above a volatile store in the machine code. A normal load may sink below a volatile load in the machine code.

The above fact is of little interest to the programmer.

A programmer may work on a compiler and hardware platform which do not do the above-mentioned reordering for years, and then move to a compiler and/or hardware which do. Nothing will change for the programmer.

Java volatiles do provide sequential consistency for all memory accesses (including surrounding normal memory accesses).

Happens-before is the most significant determinant of observable behaviour regarding multithreading, because it determines what reads are "allowed to observe" ("visible side effect" or "visible sequence of side-effects") and the presence/absence of data races (i.e. basically everything that the memory model defines regarding multithreading).

I took some time to read section 17 of the Java Language Specification. Jolly! Well, not quite. I still have the impression that many of the formal specifications are just magical incantations for their own sake, where more mundane descriptions, perhaps even in terms of fences, could suffice (an implementer would still be allowed to move those fences around and even eliminate some, of course, but that's just because he wants to lure customers, who can still write their programs against a more straightforward contract, unaware of what really happens underneath for the better performance to happen). I'm sure that the authors are very likeable when you meet them in person, but why did they feel the need to impose those theoretical formalisms on us non-academics?

One statement I found particularly dubious: "This is an extremely strong guarantee for programmers. Programmers do not need to reason about reorderings to determine that their code contains data races." The guarantee is about "A program is correctly synchronized if and only if all sequentially consistent executions are free of data races.", so this means that the programmer does not need to reason about reorderings if he just checks... all possible sequentially consistent executions!

I would certainly like to see some evidence about whether it is really necessary to describe the behaviour this way, that it's not just showing off. Now I'm left wondering if this emperor is wearing clothes at all.

"Sorry, I can't read Latin."
I was just trying to find out the English equivalent of a Dutch expression, found a literal translation (just after the Latin), got suspicious, and traced it back to its origins.

"I don't understand what you want me to say"
For example, that the distinction between formal semantics and possible execution was artificial?

"Programmer is allowed to look at the generated machine code."
Not just machine code: I consider the conceptual fences just another way of representing program semantics; a library has to make conservative corresponding choices, but a compiler might be able to perform much more significant optimisations.

"Normal store may hoist above volatile store in the machine code. Normal load may sink below volatile load in the machine code."
Now you say that, but previously you insisted about volatiles that: "They order all memory accesses." They order only some (those in correctly synchronised code).

"Above fact is of little interest for programmer."
It's just one way to reason about this. You shouldn't prescribe how people should think if one formalism is as good as the other.

"Programmer may work on a compiler and hardware platform which do not do above mentioned reordering for years. And them move to the compiler and/of hardware which do. Nothing will change for programmer."
That still applies if you specify things in terms of conceptual fences etc.

"Java volatiles do provide sequential consistency for all memory accesses (including surrounding normal memory accesses)."
They help, but you still have to "synchronise correctly". By itself, the compiler is not smart enough to diagnose incorrectly synchronised code, so the specification does not guarantee sequential consistency, only something like happens-before consistency.

"Happens-before is the most significant determinant of observable behaviour regarding multithreading. Because it determines "allowed to observe" ("visible side effect" or "visible sequence of side-effects") and presence/absence of data races (i.e. basically everything what memory model defines regarding multithreading)."
But you forget that for "happens-before" to have real meaning, the observer must share a happens-before relationship with the other actions, otherwise it may observe that the normal write moves up across the volatile write. Maybe that's an assumption you made, but I'm not telepathic, I can only go by what you write.

Note that we're talking about Java here, where the semantics are less flexible and much of the specification still moderately human, but in C++ the situation seems more problematic, without good reason: unlike for Java, a perfectly good specification can be made in terms of mostly pre-existing concepts like "volatile" side effects (or is it necessary to be able to coalesce atomic accesses?), external-function-like compiler fences (or an equivalent like an empty GNU asm statement that clobbers all memory), and some low-level details about what fences can do. It doesn't need so much new stuff at the language level (one can write an atomics library with pre-C++0x facilities), so this feels more like assassination by committee than good design to me. It will likely get accepted, though, because the voters will either have drunk the Kool-Aid or don't want to admit that they don't understand it. Convincingly arguing against it seems a lot more difficult (and I don't feel inclined to go bang my head against the wall).

"I would certainly like to see some evidence about whether it is really necessary to describe the behaviour this way, that it's not just showing off. Now I'm left wondering if this emperor is wearing clothes at all."

Propose your simpler specification.

""I don't understand what you want me to say"
For example, that the distinction between formal semantics and possible execution was artificial?"

It's not artificial. Example with single-threaded optimizations: the formal semantics is that NO reordering is possible (beyond what the partial order of expressions allows), while the possible executions mock it (just look at the compiled code).

""Programmer is allowed to look at the generated machine code."
Not just machine code: I consider the conceptual fences just another way of representing program semantics; a library has to make conservative corresponding choices, but a compiler might be able to perform much more significant optimisations."

Ok, is there ANY reason to include in the specification the statement that normal stores may hoist above a volatile store? Other than additional complication.

""Normal store may hoist above volatile store in the machine code. Normal load may sink below volatile load in the machine code."
Now you say that, but previously you insisted about volatiles that: "They order all memory accesses." They order only some (those in correctly synchronised code)."

These 2 statements are not conflicting. Notice "in the machine code" part.

""Above fact is of little interest for programmer."
It's just one way to reason about this. You shouldn't prescribe how people should think if one formalism is as good as the other."

I don't think that the "formalism" with fences (described in the cookbook) is as good as the specification. What comes to mind: fences say nothing about global order, Sparc RMO model is not the weakest so reasoning based on is not suitable for weaker machines (your program will crash on weaker machines), Sparc membars are not the bast way to describe semantics (for example, #LoadLoad|#LoadStore is actually much stronger than required load-acquire - it's terrible for the abstract memory model to force stronger ordering), there are more modern ISAs based on acquire/release primitives (ld.acq does NOT implements '#LoadLoad|#LoadStore' - use stronger fence!!!). I believe there are much more reasons.

""Java volatiles do provide sequential consistency for all memory accesses (including surrounding normal memory accesses)."
They help, but you still have to "synchronise correctly". By itself, the compiler is not smart enough to diagnose incorrectly synchronised code, so the specification does not guarantee sequential consistency, only something like happens-before consistency."

If there is something "not correct" in the program, then who cares? Java had to provide semantics for "not correct" programs, but the reason is not that one must write incorrect programs; it's just that one doesn't want a "non correct" applet downloaded from the web to format one's hard drive. C++0x does not care about "non correct" programs at all - it's just UB.

""Happens-before is the most significant determinant of observable behaviour regarding multithreading. Because it determines "allowed to observe" ("visible side effect" or "visible sequence of side-effects") and presence/absence of data races (i.e. basically everything what memory model defines regarding multithreading)."
But you forget that for "happens-before" to have real meaning, the observer must share a happens-before relationship with the other actions, otherwise it may observe that the normal write moves up across the volatile write. Maybe that's an assumption you made, but I'm not telepathic, I can only go by what you write."

And how then did you interpret it? If a program contains a single volatile access somewhere, is it then totally sequentially consistent?

"Note that we're talking about Java here, where the semantics are less flexible and much of the specification still moderately human, but in C++ the situation seems more problematic, without good reason: unlike for Java, a perfectly good specification can be made in terms of mostly pre-existing concepts like "volatile" side effects (or is it necessary to be able to coalesce atomic accesses?), external-function-like compiler fences (or an equivalent like an empty GNU asm statement that clobbers all memory), and some low-level details about what fences can do. It doesn't need so much new stuff at the language level (one can write an atomics library with pre-C++0x facilities), so this feels more like assassination by committee than good design to me. It will likely get accepted, though, because the voters will either have drunk the Kool-Aid or don't want to admit that they don't understand it. Convincingly arguing against it seems a lot more difficult (and I don't feel inclined to go bang my head against the wall)."

Ok, propose your full specification. Then we will see which is simpler.
C++0x learns exactly from such "simpler" specifications, one of which is constantly patched, and the other of which everybody complains is too simple to be useful.

"Propose your simpler specification."
I think that hardly anything at all is needed at the language level (KISS), just some essential details about memory locations, starting and stopping a program, etc. Sugar for thread-local storage is also nice to have, although all this still perpetuates the myth that threads are the user's way to concurrency, not just the underlying infrastructure. Most of "1.10 Multi-threaded execution" should be removed though, or relegated to a non-normative appendix. A supporting library can easily use existing properties, like compiler fences (the equivalent of calling a function whose implementation is unknown, mandating the compiler to make everything externally visible, so that inter-thread actions have access to all reachable state and cannot be erroneously optimised away, although maybe a volatile qualification in an atomic might help as well), to make this a hardware problem, and then it can use hardware instructions to tell that hardware how to behave. It has already been demonstrated that way, and it works well with the idea of C/C++ as a language close to the hardware. I think that the burden is on whoever wants to make those big changes to prove that they are required (and consistent with the possibility of code compiled or assembled from other languages, not subject to the same formal specifications, existing in the same program).

"It's not artificial. Example with single-threaded optimizations: formal semantics is than NO reordering possible (further than allowed by partial order of expressions), possible executions mock at it (just look at the compiled code)."
It hardly appears "formal" to require that execution should appear, as far as side effects are concerned, to occur in program order between sequence points. BTW, I am not advocating to specify the allowed compiled code, or exact placement of fences, just a straightforward specification instead of one that appears to be floating in the air, one that is written for programmers instead of predigested for an arguably far more limited audience. Did I miss anything in the pre-multithreading specification about code transformation into data-flow graphs and a formal specification of the set of well-formed executions? I don't think so. Some people need to reason about the code in those terms, but they should probably also be capable of reasoning in terms of an integration of the programmer's view and the compiler writer's view. Compiler vendors can still form their own A-teams to do all kinds of magic behind the scenes, based on all manner of elaborate graph theories and theorem provers, to shave off some more time here and there, but I as a user don't want to be bothered with those theories, and probably neither does a debug build.

"Ok, is there ANY reason to include into specification the statement that normal stores may hoist above volatile store? Other than additional complication."
Not specifically, but you made a big point of (erroneously) saying that they can't. It should be straightforward to shift between reasoning in terms of happens-before and in terms of reordering.

"These 2 statements are not conflicting. Notice "in the machine code" part."
There is no essential difference between source code and machine code in this regard.

"I don't think that the "formalism" with fences (described in the cookbook) is as good as the specification. What comes to mind: fences say nothing about global order, Sparc RMO model is not the weakest so reasoning based on is not suitable for weaker machines (your program will crash on weaker machines), Sparc membars are not the bast way to describe semantics (for example, #LoadLoad|#LoadStore is actually much stronger than required load-acquire - it's terrible for the abstract memory model to force stronger ordering), there are more modern ISAs based on acquire/release primitives (ld.acq does NOT implements '#LoadLoad|#LoadStore' - use stronger fence!!!). I believe there are much more reasons."
How do you mean that "fences say nothing about global order" (surely the language has to be executed at some point)? Maybe the Cookbook is oriented toward a particular machine model, but is there any reason why that objection could not (easily) be met by substituting more specific abstract fences, e.g., let's call it "anchored acquire" (where the acquire specifies the variable on which the corresponding release must have occurred), and giving a similar list of conservative approximations? But what situation (in Java!) would be able to make use of something less powerful than #LoadLoad|#LoadStore, if #LoadLoad is apparently already implicitly used as anchored (ld.acq, cheaper than an unanchored #LoadLoad, right?): at first sight Itanium will be actually more expensive because #LoadLoad|#LoadStore seems to require adding mf (semi-rhetorical question)?

"If there is something "not correct" in the program, then who cares?"
As long as the compiler does not reject programs that are not correctly synchronized by its own definition, I would like to know that the behaviour is still defined, so that I can reason about what it will do. As a human, I may even be able to rule out possible races where even elaborate data flow analysis normally associated with optimising builds would not suffice (I concede that it may be more likely that a tool will find a human error than that a false positive occurs, but still...). Maybe the observing code does not participate in the main logic and can be cheaper without correct synchronisation (pure guess)? In short, I don't want to have to use a tool to prove correctness and hold my hand just to write some basic multithreaded code, even though I still want to have the option to use one like your own if I find myself in deep water.

"Java just had to provide semantics for "not correct" programs, but the reason is not that one must write not correct programs, but just because one don't want "non correct" applet downloaded from web format his harddrive. C++09 does not cares about "non correct" programs at all - it's just UB."
Sounds worrisome... What is "UB"?

"And how then you interpreted it? If program contains single volatile access somewhere than it's totally sequentially consistent?"
No, but I don't want to have to prove that the program is correctly synchronised just to create the illusion that its entire execution is sequentially consistent. Or at least I don't want a recommendation to correctly synchronise to also be a gag order about non-compliant code.

"Ok, propose your full specification. Then we will see which is simplier.
C++09 exactly learns from "simplier" specifications, one of which is constantly patched, and the other everybody complains too simple to be useful."
I gather that if fate had wanted me to play a role in the C++0x memory model, somebody would have told me to put down James Reinders' book (that's just for dramatic effect: I don't even remember whether I had it yet) and scoot over to where this was being voted into the working draft (that you can take literally). Also, a year and a half has passed since then, so I don't hold much hope for anything to come out of this, other than maybe some mental exercise and the opportunity to vent. So here goes:

Has anybody demonstrated any significant performance benefit of the present convoluted proposal relative to a simple one that uses the already existing full compiler fences (opaque calls to empty functions)? If not, what is the point, other than enabling the compiler vendors to say, like Rene, when he was caught in a compromising situation with Yvette, in the delightful "'Allo 'Allo!": "You stupid programmer! Don't you see that evaluation A is not dependency-ordered before evaluation B, because, even though A performs a release operation on M, and B performs a consume operation on M, it does not read any value written by any side effect in the release sequence headed by A, and neither is there an evaluation X that carries a dependency to B with A dependency-ordered before X? So how could you possibly think that A inter-thread happens before B?"? What is the point of making contracts this difficult (while none of STL is officially thread safe yet, in N2800 anyway, not even evaluating vector::operator[]), if not to hide behind them? Who benefits, other than the lawyers? Why should I not be bored by now with semantics (for all the right reasons), and only concerned with data structures and algorithms?

But if somebody can promise me a fair chance at influencing the outcome, I'll write up a counterproposal in full detail. Or is there something that already documents why it would not work, after all?

Here's a flipped situation that N2800 curiously provides itself (in 29.1 Order and Consistency):

[ Note: The requirements do allow r1 == r2 == 42 in the following example, with x and y initially zero:

// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);

// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);

However, implementations should not allow such behavior. - end note ]

Even disregarding the matter of that exhortation's appropriateness (how would you know that your code will survive in the wild if you let it grow up in a pampered environment?), it is probably entirely superfluous, or should be: why would any implementation even accidentally arrive at this result? This seems like just a manifestation of the craziness of the rules (and I don't mean crazy like a fox). In my book, writing to an atomic, even a relaxed one, would be like accessing a volatile, which would disallow speculation and hence out-of-thin-air outcomes (well, I hope?). That would rule out any other outcome than r1 == r2 == 0, and I don't understand by what twists and turns of perverted logic one was posited: how would two threads possibly interact (through the coherent cache and/or context switches!) for speculative execution in the multiverse to come up with this self-fulfilling prophecy and commit to its outcome? I can imagine speculative local writes, with support to prevent circular reasoning, but surely these writes would not be set free in an environment where such support is absent? Does anybody know a hardware company that might have somebody to enlighten us?

At some point I'll have to tune it down again, of course, but right now I'm saving on therapy sessions, and I'm hoping it's less boring this way. If not, please just tell me. Meanwhile, I have of course condemned myself to go deeper into this, anyway, and the scientific method dictates that I should try and find good reasons for this state of affairs, so I shall spend some more time reading, including some papers that were referred to me (or was I referred to them?). Hmm, hopefully at the end of all this I won't have to come in twice a week instead...
