Memory Semantics and C++0x

"Formal definitions of the memory model were rejected as unreadable by the vast majority of programmers." - Pthreads standard, as quoted in article mentioned here

I would like to review some issues related to the upcoming revision of the C++ standard, by looking at some of the relevant texts. A good first choice seems to be "Threads Cannot Be Implemented As a Library", a must-read seminal article (or in my case must-reread, because it is an important reference in another article that I'm reading now for the first time). For brevity, I will try to limit my comments to what immediately concerns me now: a potential middle ground between the naivety exposed in this article, and unreadably rigorous formality.

In "4.1 Concurrent modification", a speculative code transformation is outlined to show that the concept of "race" should be defined. However, when used with C++0x's relaxed atomic operations, formally there is no "race" (by its definition in C++0x), but nothing has really changed, and relaxed atomic operations are still considered problematic. In my proposed reimplementation of the TBB atomics library, I have used the qualifier "volatile" specifically to disallow speculative writes, relying on the current specification that describes the behaviour of a program partly in terms of how it treats "volatile" variables (details not repeated here). So, doesn't that solve the problem (and without any language change), or can it still reoccur on the hardware side? If so, what hardware would that be, and, e.g., would such hardware not already be problematic for "volatile" accesses that have nothing to do with multithreading?

In "4.2 Rewriting of Adjacent Data", a clear argument is made to define "memory locations", an unavoidable change.

In "4.3 Register promotion", it is made clear that such promotion has to be restricted, but I see nothing that should worry a programmer: determining which promotions are still safe seems strictly a compiler writer's problem.

In "5.1 Expensive Synchronization: An Example", a plea is made... to allow races!

So, enough to make you worry about using multithreading now, but only requiring simple changes, I would say, and not a hint yet of threatening that the behaviour of programs containing races is undefined (quite to the contrary).


Will it really make significant sense if it will be 29.1 and not 1.10? Somewhere semantics of atomics and fences must be described, right? Since we are talking about industrial language backed by ISO. Or you want to leave semantics of atomics and fences formally undefined as it is now with TBB atomics? What is your goal?
I think you may ask your question over comp.programming.threads, there are some people directly related to C++0x memory model hanging around...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"I want the truth!" "You can't handle the truth!" - Lt. Daniel Kaffee and Col. Nathan R. Jessep

"Will it really make significant sense if it will be 29.1 and not 1.10?"
Normative or non-normative: big difference.

"Somewhere semantics of atomics and fences must be described, right?"
I don't know yet. Does it also describe what happens if you compile legacy code that uses POSIX threads? Does it also describe what happens if you link with assembler code, with FORTRAN, or use it as Objective-C++? Aren't we supposed to be able to take responsibility for that based on less totalitarian rulings, that are actually intelligible, and that don't break down so easily?

"Since we are talking about industrial language backed by ISO. Or you want to leave semantics of atomics and fences formally undefined as it is now with TBB atomics?"
It seems better not to speak of it than to legislate that "delete_all_files(); abort();" is a legitimate execution of the sieve example in the article mentioned above, or of maybe most multithreading code currently in use, including many programs based on TBB, because that's what "undefined" means. Things should be robust enough without any need to prove the absence of the proposed standard's specific definition of data races. Using a library like TBB should not cause undefined behaviour as long as it is not completely reimplemented on top of C++0x primitives. It is not "leaving semantics undefined" if some things don't require a revision, and a revision that blows its own horn without recognising that is suspect to me.

"What is your goal?"
I don't like what I'm reading, and if this emperor is not wearing clothes somebody should say something, somebody should play devil's advocate before canonisation can proceed. I think that the current version is not fit for ratification, because it contains an unwarranted and significant regression regarding innocuous races, however much the standard decrees that there is no such thing. I also particularly resent the condescending lie about spurious failure of try_lock() being boasted about in the preparatory paper I'm currently reading, although I see that N2800 contains a potentially valid excuse for that.

"I think you may ask your question over comp.programming.threads, there are some people directly related to C++0x memory model hanging around..."
I may do that later.

(Correction 2009-03-24) See in the text.

Quoting - Raf Schietekat
"I want the truth!" "You can't handle the truth!" - Lt. Daniel Kaffee and Col. Nathan R. Jessep

My fear number one is that we'll never see a C++0x... we have just 9 months left...

My fear number two is since we have only 9 months left, C++0x is going to be rushed out and cause more harm than good with its ever more unintelligible wording, and undetermined behavior cases.

My fear number three is a lousy C++0x will make C++ even more arcane, and compilers even more nitpicky (talking to you GCC), and drive everyone nuts, thus sending C++ to hell, so we'll never see a C++1x or C++2x.

Honestly, every day that passes makes programming-friendly languages more and more attractive; I'm really beginning to pray that D, or even C# or Objective-C, replaces C++. I mean, writing correct platform-independent C++ that "actually does something" is already about as hard as programming can get nowadays; writing straight C or assembly is the more sensible option most of the time.

Quoting - robert.jay.gould

My fear number one is that we'll never see a C++0x... we have just 9 months left...

My fear number two is since we have only 9 months left, C++0x is going to be rushed out and cause more harm than good with its ever more unintelligible wording, and undetermined behavior cases.

My fear number three is a lousy C++0x will make C++ even more arcane, and compilers even more nitpicky (talking to you GCC), and drive everyone nuts, thus sending C++ to hell, so we'll never see a C++1x or C++2x.

Honestly, every day that passes makes programming-friendly languages more and more attractive; I'm really beginning to pray that D, or even C# or Objective-C, replaces C++. I mean, writing correct platform-independent C++ that "actually does something" is already about as hard as programming can get nowadays; writing straight C or assembly is the more sensible option most of the time.

+1 to this post. I wonder when the C++ community is going to focus on developer productivity and ease of use.

Quoting - shachris23

+1 to this post. I wonder when the C++ community is going to focus on developer productivity and ease of use.

The same day when Java/C#/Haskell/Erlang communities will start focusing on high performance, access to the underlying hardware, access to the underlying OS features, systems programming, etc.
There are just different languages for different purposes...


I revisited "Threads Cannot Be Implemented As a Library" (see above) because of a reference to it in the very first paragraph of "Foundations of the C++ Concurrency Memory Model". I have not yet read it in full, but here are some comments already.

One thing that immediately stands out even in the abstract is the complete turnaround with regard to data races: "race-free", "We give no semantics to programs with data races.". Not that I think that data races are an essential part of multithreading, far from it, but it seems rather disturbing that the behaviour of a program should become undefined because of a technique as described in the previous article, let alone an accidental slip-up against the unintelligible rules that define what is a "race", or something that happens in external code not subject to the C++ standard. Literally "undefined" means that the program might decide to start deleting all your files, clearly not an acceptable situation. In the abstract are some inviting phrases like "intuitive race definition" and "simple model", but I've already had a look at N2800 (the latest working draft for C++0x), and I beg to differ.

Is there really a machine that would allow r1=r2=1 in Figure 1 if, by declaring X and Y as even relaxed atomic (implemented to leverage the semantics of "volatile"), at least the compiler would refrain from otherwise entirely appropriate speculative writes? The example does identify a race if no atomics are used, but it would be nice to be assured at this point that this is exclusively due to compiler optimisation. I would like to think that hardware speculation is done in a multiverse, where each speculative universe is isolated from all others, and choices are only committed when tests are resolved based on non-speculative data. This would preclude r1=r2=1, because that would require inter-universe infection across the coherent cache or across a context switch. Is this a valid assumption, or do some machines really behave that way?

On pp. 2-3, the article presents "atomic" as a synonym for "in a single total order" or implicitly "sequentially consistent" unless explicitly qualified as "low-level". It designates as an "unfortunate overload" its use "in the sense that no other individual thread can see a partially updated value". Actually, atomic literally means "indivisible" (atoms were conceived in ancient Greece as the elementary building blocks of matter), and this of course refers to the behaviour of a read-modify-write sequence (no other operation can come between them), whether this is encoded in a single machine instruction, or as a sequence of separate instructions that succeed or fail as a whole, like the A(tomic) in an ACID transaction. I myself fail to see how "indivisibility" could possibly have anything to do with the relative order in which various observers may see the outcomes of a set of single and/or atomic operations, so I find the newspeak in this article confusing and annoying.

In "3. Making Trylock Efficient" on p. 4, trylock() is said to potentially spuriously fail as a "simple solution" to avoid "introducing the complexity of different types of synchronization or a happens-before relationship to define a data race". My comments about this in an earlier posting should suffice.

(Clarification 2009-03-25) I do also see its use "in the sense that no other individual thread can see a partially updated value" as part of the real meaning of "atomic" (together with indivisible operation sequences), even though on modern hardware with its wide-enough data buses this typically only requires attention to proper alignment. Note that, strictly speaking, this describes isolation instead of atomicity, but atomics provide both (entangled) properties.

(Second clarification 2009-03-25) The "sequence of separate instructions that succeed or fail as a whole" (on architectures that work this way) that I likened to "the A(tomic) in an ACID transaction" is of course only superficially related, because in the case of atomics only the last instruction tries to write anything (a write+commit combo that may succeed, or fail, in which case the user has to start all over).

(Added 2009-03-28) More comments to be expected (interesting article).

Quoting - Dmitriy Vyukov

The same day when Java/C#/Haskell/Erlang communities will start focusing on high performance, access to the underlying hardware, access to the underlying OS features, systems programming, etc.
There are just different languages for different purposes...

high performance: 90% of the high-performance libraries for anything from threading, to databases, to text processing, to hardware graphics, are written in C. I actually think it's more like 98%, but I'll leave it at 90% just in case.

hardware: I've worked on ordinary computers, their hardware is accessed best through C or assembly, same goes for mobile devices, embedded devices, game machines, etc... I've worked on over a dozen different computing devices, and have yet to see one that provides hardware interfaces in C++, that are not a wrapper of their C API.

OS features: You mean Windows? I think you actually have easier, less error-prone access through C#, or again C. Macs provide a nice Objective-C interface over their C interface; the C++ interface? They literally threw it away. Or Unix/Posix? That's C too. Can't think of an OS that relies mostly on C++, can you?

Systems Programming: Ok, as sad as it is, about half of the systems that keep our world in order (in this day and age) are written in COBOL... sigh... the other half is mostly C and Java.

My verdict is: if people want real control and/or performance they use C, and not C++. C++ is too crazy, hard, and platform-dependent for real performance. You're basically left betting on the compilers at that point, since the same code's performance is compiler-dependent. If you want easy, you use Java, C#, or Objective-C.

If you want both you should use C++, because it was supposed to be nicer than programming in C but with just as good performance, but that promise is being broken; it's getting ever crazier and more undefined. And now all of the sudden they are saying that threaded code/constructs people have been using for decades are broken? I mean how can they say Posix threads is broken, even Windows threading API works fine!

Ok, so the committee wants to drive us nuts by saying we can no longer rely on our code because some imaginary hardware, and overzealous compiler writers (GCC as usual comes to mind here), might screw us over by over-optimizing illogical/irrational code constructs?

Now the real issue is they may not be saying this at all. But for myself, and I probably speak for many developers in the trenches, their wording is unintelligible. They SHOULD write specs in a way that doesn't confuse their audience, or they will lose their audience. C++ was never meant to be an academic language, but they keep trying to make it so. If I wanted academic I'd use Haskell, it's about as fast as C++ anyways.

#3 "My fear number one is that we'll never see a C++0x... we have just 9 months left..."
How do you figure that? We have 6 years, 8 months and 1 week until the end of '0F (as the joke goes)!

#3 "My fear number two is since we have only 9 months left, C++0x is going to be rushed out and cause more harm than good with its ever more unintelligible wording, and undetermined behavior cases."
Was anything learnt at all from "The Rise and Fall of CORBA", with its terribly clever C++ mapping (emphasis on terribly), see last page?

#3 "My fear number three is a lousy C++0x will make C++ even more arcane, and compilers even more nitpicky (talking to you GCC), and drive everyone nuts, thus sending C++ to hell, so we'll never see a C++1x or C++2x."
I hear this a lot, that C++ is going to go away, and it disturbs me. I want C++ to be there "for the ages", on top of C, on top of the hardware. It's languages like COBOL and Pascal that may come and go, not C++ (warts and all, just not too many, please). But I'll leave it at that as far as C++ vs. everything else goes.

#7 "And now all of the sudden they are saying that threaded code/constructs people have been using for decades are broken? I mean how can they say Posix threads is broken, even Windows threading API works fine!"
I must point out that the problem is not quite imaginary, I'm only questioning the form the proposed solution is taking.

Implementations appear to have converged on characteristics that make it possible to write correct multi-threaded applications, though largely, we believe, based on painful experiences rather than strict adherence to standards.

This single line of the paper on threading libraries is where the real issue is, IMHO. What's wrong with learning from experience and THEN standardizing that? Why do people think the standard needs to come first? I mean, a "standard" literally means the "standard way people do stuff", so the committee should be basing the standard on the hard-earned knowledge of hardware makers, library makers and programmers, not ignoring what works and has worked, to replace it with some imaginary scenario where we have to become paranoid of every single line of code that was perfectly valid, but might no longer be because implementations are now undefined behavior. They should turn successful implementations into defined behavior.

Quoting - robert.jay.gould

high performance: 90% of the high-performance libraries for anything from threading, to databases, to text processing, to hardware graphics, are written in C. I actually think it's more like 98%, but I'll leave it at 90% just in case.

hardware: I've worked on ordinary computers, their hardware is accessed best through C or assembly, same goes for mobile devices, embedded devices, game machines, etc... I've worked on over a dozen different computing devices, and have yet to see one that provides hardware interfaces in C++, that are not a wrapper of their C API.

OS features: You mean Windows? I think you actually have easier, less error-prone access through C#, or again C. Macs provide a nice Objective-C interface over their C interface; the C++ interface? They literally threw it away. Or Unix/Posix? That's C too. Can't think of an OS that relies mostly on C++, can you?

Systems Programming: Ok, as sad as it is, about half of the systems that keep our world in order (in this day and age) are written in COBOL... sigh... the other half is mostly C and Java.

My verdict is: if people want real control and/or performance they use C, and not C++. C++ is too crazy, hard, and platform-dependent for real performance. You're basically left betting on the compilers at that point, since the same code's performance is compiler-dependent. If you want easy, you use Java, C#, or Objective-C.

If you want both you should use C++, because it was supposed to be nicer than programming in C but with just as good performance, but that promise is being broken; it's getting ever crazier and more undefined. And now all of the sudden they are saying that threaded code/constructs people have been using for decades are broken? I mean how can they say Posix threads is broken, even Windows threading API works fine!

I meant "C/C++", sorry for that; usually I try to clearly distinguish "C", "C++" and "C/C++".
C and C++ are basically equal in this context. As far as I understand the C++09 memory model, atomics and the threading API will be directly employed by the next C standard (AFAIK people from the C group were especially invited to the C++ memory model working group, and that's why we see all those atomic_exchange_explicit()).

Btw, Symbian is OS with C++ interface, and Android is OS with Java interface.


Quoting - robert.jay.gould
Now the real issue is they may not be saying this at all. But for myself, and I probably speak for many developers in the trenches, their wording is unintelligible. They SHOULD write specs in a way that doesn't confuse their audience, or they will lose their audience. C++ was never meant to be an academic language, but they keep trying to make it so. If I wanted academic I'd use Haskell, it's about as fast as C++ anyways.

Ok, C++09 specification aside for a moment. How many of your colleagues are able to read, understand and write synchronization which relies on low-level atomics with fine-grained memory ordering constraints? How many C++ developers overall do you think are able to do this? How many developers are able to write from scratch at least a primitive double-checked initialization, clearly explain why they write it this way, and why it is guaranteed to work?
I can't understand why you and Raf are blaming the specification. It's not the specification that is complicated, difficult, unintelligible, non-intuitive, etc-etc. It's the domain itself (thread interleavings, absence of global order, mutual ordering, reorderings, control-dependencies, data-dependencies, atomicity, etc-etc).
One can make complicated specification for simple thing (bad). One can make simple specification for simple thing (good). One can make complicated specification for complicated thing (good). But one can NOT make simple specification for complicated in itself thing (impossible and basically what you and Raf are proposing).
Relaxed memory models are somehow similar to the theory of relativity (time is a property of location). Are there simple descriptions of the theory of relativity which are accessible to housewives?
The C++ memory working group put enormous effort into making the specification at least not over-complicated (further than the domain itself), into making it self-consistent and integral, and into providing a high-level ("what") specification and not a low-level ("how") specification.
Making a specification on the low level ("how") is generally a bad thing. For example, if it is based on "fences" (which define possible local reorderings around them), then what to do with an implementation which does not provide correct ordering but then has some post-mortem means to recover from inconsistency (a trick actually used on Alpha)? The user anyway will not be able to look at the assembly code and figure out what is happening. Or what to do with the absence of a global order? In the absence of a global order the model anyway does not reduce to the simple "fences + sequential consistency" model.
In any case, if you do not need or want to use the low-level memory model, or are not able to, C++09 also specifies a simple subset - sequentially consistent operations or even just mutexes (basically POSIX). You are always allowed to stay on that level.
Blaming is easy. Orders-of-magnitude simpler specifications for relaxed memory models (as well as for the theory of relativity and quantum physics) are welcome.


Quoting - Raf Schietekat
Literally "undefined" means that the program might decide to start deleting all your files, clearly not an acceptable situation.

Ok, and what happens when you call a virtual method of an already destroyed object? LOL. Defining the results of data races as undefined does not add nor cut anything from current C++. If you care about your files you may program in sandboxed JavaScript.


#10 "I meant "C/C++", sorry for that; usually I try to clearly distinguish "C", "C++" and "C/C++".
C and C++ are basically equal in this context. As far as I understand the C++09 memory model, atomics and the threading API will be directly employed by the next C standard (AFAIK people from the C group were especially invited to the C++ memory model working group, and that's why we see all those atomic_exchange_explicit())."

Time to have a look at N2800's "29 Atomic operations library [atomics]".

Alarm bells immediately go off when I look at the synopsis: where is the header file? "17.6.2.3 Headers" only mentions <cstdatomic> in "Table 14 - C++ headers for C library facilities", and there is not an <atomic> (or another <(std)atomic(s)> variant) to be found in the entire document. Where is the C++ atomic<T> template going to fit? Is it an afterthought, maybe? I don't see C++ <string> contents inside C's <string.h>, for a good reason, and I expect the same for atomics: any C functions should only be available if a specific header is included, and vice versa. Why should C++ be a slave to C's need to use names like "memory_order_release" because C does not have namespaces (if not as a ploy to impose the default memory_order_seq_cst)?

What might possibly be the difference between atomic_flag and atomic_bool? We're not deemed worthy of a rationale (only of a silly joke about atomic objects being "neither active nor radioactive"). Should we guess, or just blindly accept? I never saw a need to differentiate. On what machine is providing fewer operations necessary and sufficient to give access to a lock-free implementation?

"// For each of the integral types:" forgets that bool is also an integral type.

How does addition for address types take into account the size of the referent: the atomic<T*> template can do that, but why should it publicly inherit from atomic_address? Will the operations available on atomic_address have their original meaning, or will they be overloaded to suddenly know about the referent size? In what way isn't that confusing? What happens for void*, which does not have a specific template specialisation?

Where are floating-point atomics?

In "29.1 Order and Consistency [atomics.order]", what might "regular (non-atomic)" possibly mean? Is it another instance of equating "atomic" with "sequentially consistent"?

I still don't know what a "consume operation" is supposed to mean.

"Implementations shall not move an atomic operation out of an unbounded loop." Why should implementations be able to move an atomic out of any loop? Are they allowed to do that with a volatile variable? If not, what would be the use of allowing things to be done to an atomic that would not be allowed with a volatile?

In "29.3.1 Integral Types [atomics.types.integral]", why are there only variants of compare_exchange, which really means compare_store? Why aren't all the other combinations available as well?

Why are there no atomic operations that do not return the previous value, which can easily be implemented as locked instructions on x86?

I have not combed through all the details yet, though.

I still have the same objection against memory_order as a non-constant parameter, where only a constant value makes sense: this can be enforced as a template parameter, but apparently C++ is not to benefit from features that are not also available in C. Conversely, weak/strong can easily, depending on the underlying architecture, either be ignored, or be used naturally in the loop that synthesises strong from weak, so whyever should it be an affix where memory order is not even required to be constant?

And of course I also object against sequential consistency as the only possible default (a reflection of the draft's imposition of sequentially consistent semantics throughout), instead of being able to configure an atomic based on its usage, or even to always require explicit semantics.

So it's safe to say that I'm not entirely happy with the current draft, and I think that C++ users would benefit from a revision before this becomes the standard.

#11 "Ok, C++09 specification aside for a moment. How many of your colleagues are able to read, understand and write synchronization which relies on low-level atomics with fine-grained memory ordering constraints?"
"What's an atomic?" :-)

"How many C++ developers overall do you think are able to do this?"
Once anybody gets going in the world of atomics, I don't see why simple release/acquire should be such a challenge. Or even relaxed atomics. Just pick the right tool for the job, and it takes a very particular job to actually require what C++0x is planning to make the default.

"How many developers are able to write from scratch at least a primitive double-checked initialization, clearly explain why they write it this way, and why it is guaranteed to work?"
How about the following (I failed to come up with a reason why I would need anything beyond release/acquire):

static mutex m;
static atomic<T*> s_apT;

T* getInstance() {
  T* l_pT;
  if(NULL==(l_pT = s_apT.load())) { // acquire, to match store()
    mutex::lock l(m); // only one thread should create
    if(NULL==(l_pT = s_apT.load())) { // match store()
      s_apT.store(l_pT = new T()); // release: don't publish until construction complete
    }
  }
  return l_pT; // object data acquired or locally created
}

"I can't understand why you and Raf are blaming the specification. It's not the specification that is complicated, difficult, unintelligible, non-intuitive, etc-etc. It's the domain itself (thread interleavings, absence of global order, mutual ordering, reorderings, control-dependencies, data-dependencies, atomicity, etc-etc)."
Both, I guess. But it's like, e.g., lift over an airfoil: you don't really need Navier-Stokes and circulation theories to make sense of it, that's just gobbledygook to impress people, without imparting any useful insight like the causal relationship mainly from lower pressure to higher airspeed instead of the other way around (mainly, because they reinforce each other).

"One can make complicated specification for simple thing (bad). One can make simple specification for simple thing (good). One can make complicated specification for complicated thing (good). But one can NOT make simple specification for complicated in itself thing (impossible and basically what you and Raf are proposing)."
If you write a program in assembler using fences, would you still use these theories? That's your prerogative, of course, but why impose them? Why shouldn't anybody be allowed to construct a cantilever-spar cable-stayed bridge if he so prefers, based on general physics, instead of being forced into a suspended-deck suspension bridge because the theorists decided that all bridges should be like that, so surely the behaviour of any other type of bridge must be "undefined"?

"Relaxed memory models are somehow similar to the theory of relativity (time is a property of location). Are there simple descriptions of the theory of relativity which are accessible to housewives?"
A symptom of a derogatory attitude, perhaps?

"The C++ memory working group put enormous effort into making the specification at least not over-complicated (further than the domain itself), into making it self-consistent and integral, and into providing a high-level ("what") specification and not a low-level ("how") specification."
I don't want effort, I want results. And why should it even be high-level: maybe I do want access to all that basic physics can offer, instead of getting a CAD system that can only design suspended-deck suspension bridges and is clueless about any other design. (I'm big on metaphors.)

"Making a specification on the low level ("how") is generally a bad thing. For example, if it is based on "fences" (which define possible local reorderings around them), then what to do with an implementation which does not provide correct ordering but then has some post-mortem means to recover from inconsistency (a trick actually used on Alpha)? The user anyway will not be able to look at the assembly code and figure out what is happening."
I may have created confusion by using the word "fence" where I meant conceptual fence, not necessarily identifiable as such on a particular architecture, i.e., I would consider "release" a basic operation. In my defence, I have to point out that it is rare for any text to discuss what a fence actually is, or more to the point, what varieties exist. I think it should be possible to reason in terms of conceptual fences, which can then be conservatively mapped to whatever the machine provides, even if the opposite direction is too hard, like a disassembler is never able to fully reconstruct the original program even if the code was not deliberately obfuscated.

"Or what to do with the absence of a global order? In the absence of a global order the model anyway does not reduce to the simple "fences + sequential consistency" model."
Well, then it would simply be "fences + something else". Sequential consistency may be desirable, but it is not a panacea for instant insight.

"In any case, if you do not need/want/are not able to use the low-level memory model, C++09 also specifies a simple subset - sequentially consistent operations or even just mutexes (basically POSIX). You are always allowed to stay on that level."
I don't think that is what this discussion is about.

"Blaming is easy. Orders of magnitude simpler specifications for relaxed memory models (as well as the theory of relativity and quantum physics) are welcome."
What is so terrible about simply using what already exists: compiler fences (opaque function calls or special assembler code) and an atomics library that generates some conservative approximations? How much is to be gained (literally: how much performance can the compiler create) by forcing the programmer to buy into a model that can do some things, but that also decrees that whatever else he wants to do, and previously successfully did, is now undefined? It's the difference between "here are the physics, and if you want you can buy a CAD system of your choice to help you figure it all out" and "we won't tell you about the physics, you'll just have to buy into our system and choose something within its range of solutions".

#12 "Ok, and what happens when you call a virtual method of an already destroyed object? LOL. Defining results of data races as undefined does not add nor cut anything from current C++. If you care about your files you may program in sandboxed javascript."
There was a time that only doing dangerous things produced undefined program behaviour (I may be romanticising), but now it's everything outside the arbitrary subset that was chosen. You don't need that totalitarian viewpoint to avoid an elementary mistake like calling an object being or already having been destroyed. The current draft is panicky and hysterical, because, e.g., when two threads can write to the same int at the same time, only the value of the int should be undefined, not the behaviour of the program.

Quoting - Dmitriy Vyukov
How many of your colleagues are able to read, understand and write synchronization that relies on low-level atomics with fine-grained memory ordering constraints? How many C++ programmers overall do you think are able to do this? How many developers are able to write from scratch at least a primitive double-checked initialization, clearly explain why they wrote it this way, and why it is guaranteed to work?

This is THE problem. I'd say about a third of my co-workers have been using C/C++ for over a decade (myself included), and another third for half a decade, and none of us could get this right in our first 20 attempts without feeling the compiler or hardware might betray us, so we're all paranoid. A language that makes its users paranoid is fundamentally doing something wrong.

Relaxed memory models are somehow similar to the theory of relativity (time is a property of location). Are there simple descriptions of the theory of relativity which are accessible to housewives?

How about that model where you put a driver in a "super-duper-ultra-turbo" fast car, and when he stops his watch's time will be different from the coach's watch, or the model where an astronaut (or cosmonaut) travels to another star at the speed of light and when they come back everyone they knew is old or dead? Everyone knows these examples, and they are more than enough for everyone to understand that such a phenomenon exists, and more or less the consequences; of course not enough to do the math, but our atomics model should be like this too. The average programmer with 5 years (even this much required experience is crazy) of C/C++ under their belt should have absolutely no problem using this sort of stuff, but that's not the case.

Quoting - Raf Schietekat
#12 "Ok, and what happens when you call a virtual method of an already destroyed object? LOL. Defining results of data races as undefined does not add nor cut anything from current C++. If you care about your files you may program in sandboxed javascript."
There was a time that only doing dangerous things produced undefined program behaviour (I may be romanticising), but now it's everything outside the arbitrary subset that was chosen. You don't need that totalitarian viewpoint to avoid an elementary mistake like calling an object being or already having been destroyed. The current draft is panicky and hysterical, because, e.g., when two threads can write to the same int at the same time, only the value of the int should be undefined, not the behaviour of the program.

Ok, consider following code:

char placeholder[sizeof(object)]; // alignment aside for a moment
object* g_obj; // = 0

// thread 1
g_obj = new (placeholder) object;

// thread 2
if (g_obj)
{
    if (g_obj == (object*)placeholder)
        g_obj->virtual_func();
}

How would you define behavior of the program? Provide formal specification which will completely define behavior.

Note that the loaded value is completely defined here - we know that it's the *right* pointer. The problem is deeper here, it's not only about separate undefined values. Anyway, if a program contains some undefined values, does it make sense to reason about its behavior further? An undefined value will be used in some way in the program, so basically uncertainty will propagate through the program like the plague.
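For contrast, a sketch (with a hypothetical `object` definition of my own) of how the race in the example disappears under C++0x if the pointer itself is an atomic published with release/acquire:

```cpp
#include <atomic>
#include <new>

struct object {
    int calls = 0;
    virtual void virtual_func() { ++calls; }
    virtual ~object() {}
};

alignas(object) char placeholder[sizeof(object)];
std::atomic<object*> g_obj{nullptr};

// thread 1: the release store orders construction before publication
void producer() {
    g_obj.store(new (placeholder) object, std::memory_order_release);
}

// thread 2: the acquire load guarantees a fully constructed object
void consumer() {
    if (object* p = g_obj.load(std::memory_order_acquire))
        p->virtual_func();
}
```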

Well, yes, the specification might provide a 100-page description of all possible combinations of thread interactions and the races involved, and for some of them define behavior, for some of them define several possible behaviors, and for some of them still say just UB (see example above). Hmmm... were you not saying that the specification must not be overloaded?
And what to do with virtual functions, etc? How would you define behavior which involves virtual functions provided that their implementation is implementation defined?

If you are loading an int and then passing it to printf, then we may say that the output value will be undefined. And what if we are loading a char*? You must be insane if you want to cope with this in the specification. Ok, probably not; show your variant of the spec. What I see to date is that IMHO you want to underspecify significant moments, overspecify insignificant moments and bind an abstract specification to some particular implementation...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - robert.jay.gould

This is THE problem. I'd say about a third of my co-workers have been using C/C++ for over a decade (myself included), and another third for half a decade, and none of us could get this right in our first 20 attempts without feeling the compiler or hardware might betray us, so we're all paranoid. A language that makes its users paranoid is fundamentally doing something wrong.

I am not sure I get you. So since the matter is too complicated in itself, you are proposing just to eliminate it from the specification and provide a POSIX-like mutex-based simple model, right?


Quoting - robert.jay.gould
How about that model where you put a driver in a "super-duper-ultra-turbo" fast car, and when he stops his watch's time will be different from the coach's watch, or the model where an astronaut (or cosmonaut) travels to another star at the speed of light and when they come back everyone they knew is old or dead? Everyone knows these examples, and they are more than enough for everyone to understand that such a phenomenon exists, and more or less the consequences; of course not enough to do the math, but our atomics model should be like this too. The average programmer with 5 years (even this much required experience is crazy) of C/C++ under their belt should have absolutely no problem using this sort of stuff, but that's not the case.

Ah, I see. So by formal specification you mean just some examples, which indeed show that the matter itself is brain-damagingly complicated and totally at odds with prior knowledge. Catch the formal specification:
"Different threads may see actions in different orders. If several threads write to the variable then it may contain garbage. Program order is not respected as observed by other threads."


Quoting - Dmitriy Vyukov

Ah, I see. So by formal specification you mean just some examples, which indeed show that the matter itself is brain-damagingly complicated and totally at odds with prior knowledge. Catch the formal specification:
"Different threads may see actions in different orders. If several threads write to the variable then it may contain garbage. Program order is not respected as observed by other threads."

There we go; see, it can be explained in simple terms. Not saying that this is enough for a standard, or formal documentation of any type, but just that it can be explained in terms that even the layperson can understand.

#17 "How would you define behavior of the program? Provide formal specification which will completely define behavior."
That's silly: g_obj is a non-atomic that's being concurrently written and read (so thread 2 might theoretically read a partially updated value, even though that's unlikely to happen on current hardware), and it also might be written before the constructor has completed (so virtual_func() might be called before its time, and it being a virtual function makes things infinitely worse). Why would anyone need the proposed formal semantics to see that? The difference is that with the proposed formal semantics, I would give up reasoning, because I don't understand them (who does?).

"Note that the loaded value is completely defined here - we know that it's the *right* pointer. The problem is deeper here, it's not only about separate undefined values. Anyway, if a program contains some undefined values, does it make sense to reason about its behavior further? An undefined value will be used in some way in the program, so basically uncertainty will propagate through the program like the plague."
That must depend on what the undefined value can cause to happen, but I don't need a specification to predigest my conclusions there, especially not in the form of blanket hysteria.

"Well, yes, the specification might provide a 100-page description of all possible combinations of thread interactions and the races involved, and for some of them define behavior, for some of them define several possible behaviors, and for some of them still say just UB (see example above). Hmmm... were you not saying that the specification must not be overloaded?"
There's no reason why an alternative specification would be 100 pages long. I still don't know what UB means. Remind me: what did I say regarding an "overloaded specification", and where?

"And what to do with virtual functions, etc? How would you define behavior which involves virtual functions provided that their implementation is implementation defined?"
Now that's a legitimate example of undefined behaviour of the entire program.

"If you are loading an int and then passing it to printf, then we may say that the output value will be undefined. And what if we are loading a char*? You must be insane if you want to cope with this in the specification."
Nobody needs new semantics to know that printing an undefined int is merely useless and printing an undefined pointer as a C string is somewhat more problematic.

"Ok, probably not; show your variant of the spec. What I see to date is that IMHO you want to underspecify significant moments, overspecify insignificant moments and bind an abstract specification to some particular implementation..."
I'm just testing the waters: reading documents, and testing whether anyone is responsive to my objections so far. If the answer there is negative, it would be pointless to proceed further, for lack of resources. The current proposal may not underspecify significant moments, but it definitely overspecifies insignificant moments (like printing an undefined integer). The current proposal both underspecifies (what happens when a program is linked with non-C(++) code?) and overspecifies (run for your lives: it's the terrible Undefined Integer!). I don't see why reasoning in terms of conceptual fences or whatever they might be called (that can be conservatively implemented on various architectures, or first guide program transformation) would be unsuitable.

(Correction 07:40 UTC) See text.

(Correction 2009-03-31) Well, both would be underspecifications, actually. So then what am I supposed to be underspecifying?

Quoting - Dmitriy Vyukov

I am not sure I get you. So since the matter is too complicated in itself, you are proposing just to eliminate it from the specification and provide a POSIX-like mutex-based simple model, right?

No, not exactly that. I'm saying that managing threads in "general terms" should be no more complicated than the POSIX-like mutex-based model. Place a big lock around the critical code and don't let anyone else touch it until you're done.

Most times this solution is perfectly acceptable. So I'd say locked code should not be affected at all by new standards; everyone (with a degree of competence and some experience) can understand locked code in its current state, and it's a proven model. Any new feature that breaks this model is evil.
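The big-lock model in its minimal C++0x form (a sketch; the names `m`, `shared_counter` and `increment` are my own):

```cpp
#include <mutex>

std::mutex m;
int shared_counter = 0;

// "Place a big lock around the critical code": every access to the
// shared state goes through the same mutex, and simple interleaving
// semantics apply.
void increment() {
    std::lock_guard<std::mutex> lock(m);  // acquire on entry
    ++shared_counter;                     // exclusive access
}                                         // release on scope exit (RAII)
```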

Lockless code/algorithms/containers/etc. are something the "masters" tell us common programmers not to try to write, because it's very tricky, and even the "masters" get it wrong many times, so we've tried not to touch this stuff for many years (but I think there should be a way to democratize this too, although it's probably not realistic).

However simple atomics as they happen to exist, with their "standard" quirks by empiric accident and/or design, at this point in time, have been in our (the common programmer's) field for a while, and we commonly rely on them for our tasks, and we have plenty of folklore about how they should behave. I think we don't want people messing with this folklore.

So the current accidental specs of atomics should not be taken away from us; they should just make these empirical atomics the standard for C++0x. And if there is a need for a new tool, make a new tool; call them Bosons or Fermions, for example.

Quoting - robert.jay.gould
"No, not exactly that. I'm saying that managing threads in "general terms" should be no more complicated than the POSIX-like mutex-based model. Place a big lock around the critical code and don't let anyone else touch it until you're done.
Most times this solution is perfectly acceptable. So I'd say locked code should not be affected at all by new standards; everyone (with a degree of competence and some experience) can understand locked code in its current state, and it's a proven model. Any new feature that breaks this model is evil."

This model is not broken. Mutex-based synchronization is not affected by the C++0x standard.

"Lockless code/algorithms/containers/etc. are something the "masters" tell us common programmers not to try to write, because it's very tricky, and even the "masters" get it wrong many times, so we've tried not to touch this stuff for many years (but I think there should be a way to democratize this too, although it's probably not realistic)."

Don't rely on memory barriers for synchronization? Only if you aren't aware of Relacy Race Detector!

"However simple atomics as they happen to exist, with their "standard" quirks by empiric accident and/or design, at this point in time, have been in our (the common programmer's) field for a while, and we commonly rely on them for our tasks, and we have plenty of folklore about how they should behave. I think we don't want people messing with this folklore.
So the current accidental specs of atomics should not be taken away from us; they should just make these empirical atomics the standard for C++0x. And if there is a need for a new tool, make a new tool; call them Bosons or Fermions, for example."

The standard makes exactly those empirical atomics the standard, i.e.:

std::atomic<int> x;

x.compare_exchange_strong(cmp, xchg);

And exactly as you said, there is a new thing - "atomics with fine-grained memory ordering", i.e.

std::atomic<int> x;

x.compare_exchange_weak(cmp, xchg, memory_order_acq_rel, memory_order_relaxed);
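To sketch how the weak variant with fine-grained orderings is typically used (the surrounding function `add_if_even` is my own invention): it may fail spuriously, so it goes in a retry loop, with a cheaper ordering on the failure path.

```cpp
#include <atomic>

std::atomic<int> x{0};

// Typical compare_exchange_weak pattern: retry on spurious failure.
// Success uses acquire+release ordering; failure performs no
// synchronization, so relaxed suffices there.
void add_if_even(int delta) {
    int cur = x.load(std::memory_order_relaxed);
    while (cur % 2 == 0 &&
           !x.compare_exchange_weak(cur, cur + delta,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
        // on failure, cur has been reloaded; loop and re-test
    }
}
```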

If you are trying to read the formal language specification and failing to understand it, you should probably start with Hans Boehm's non-formal articles:
http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2480.html
Here is complete list:
http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/

Note that the simple sequentially consistent model and the relaxed model are clearly separated:

Simpler rules for programmers

Based on this definition, as shown in N2338, it becomes a theorem that programs using locks and atomic variables with default ("sequentially consistent") ordering to protect other shared variables from simultaneous access (other than simultaneous read access) behave as though they were executed by simply interleaving the actions of each thread. We expect that this is the rule that will be taught to most programmers.


"#17 "How would you define behavior of the program? Provide formal specification which will completely define behavior."
That's silly: g_obj is a non-atomic that's being concurrently written and read (so thread 2 might theoretically read a partially updated value, even though that's unlikely to happen on current hardware), and it also might be written before the constructor has completed (so virtual_func() might be called before its time, and it being a virtual function makes things infinitely worse). Why would anyone need the proposed formal semantics to see that? The difference is that with the proposed formal semantics, I would give up reasoning, because I don't understand them (who does?)."

Everyone who is not just doing cowboy programming needs a formal specification, to get an explicit answer as to whether a program works now and will keep working on new compilers, hardware, OSes, etc. That's what "formal" is needed for.

""Note that the loaded value is completely defined here - we know that it's *right* pointer. The problem is deeper here, it's not only about separate undefined values. Anyway, if program contains some undefined values, does it make sense to reason about it behavior further? Undefined value will be used is some way in the program, so basically uncertainty will propagate through the program like the plague."
That must depend on what the undefined value can cause to happen, but I don't need a specification to predigest my conclusions there, especially not in the form of blanket hysteria."

Do you see any sense or value in a more detailed specification wrt this aspect? Frankly, I do not. Even if I load just an int which I am going to use to index a statically allocated array, I can check that the index does not exceed bounds; however, I cannot check that the index is semantically correct. So what? Yes, the program will not be crashing, but it will transfer money to my account instead of yours.
However, if you are programming against some particular platform, then you can indeed figure out the semantics of data races - they are typically defined for the platform. But then how does this relate to the international standard?

Btw, what about future platforms on which any data race will cause a hardware interrupt, just as integer overflows and unaligned data accesses do now? Have you thought about those?

UB is Undefined Behavior.

"If you are loading an int and then passing it to printf, then we may say that the output value will be undefined. And what if we are loading a char*? You must be insane if you want to cope with this in the specification."
Nobody needs new semantics to know that printing an undefined int is merely useless and printing an undefined pointer as a C string is somewhat more problematic.

What semantics are you talking about??? The standard basically says that accessing an undefined string is UB. Just in more general terms, just as an abstract standard must do.

"Ok, probably not; show your variant of the spec. What I see to date is that IMHO you want to underspecify significant moments, overspecify insignificant moments and bind an abstract specification to some particular implementation..."
I'm just testing the waters: reading documents, and testing whether anyone is responsive to my objections so far. If the answer there is negative, it would be pointless to proceed further, for lack of resources. The current proposal may not underspecify significant moments, but it definitely overspecifies insignificant moments (like printing an undefined integer). The current proposal both underspecifies (what happens when a program is linked with non-C(++) code?) and overspecifies (run for your lives: it's the terrible Undefined Integer!). I don't see why reasoning in terms of conceptual fences or whatever they might be called (that can be conservatively implemented on various architectures, or first guide program transformation) would be unsuitable.

(Correction 07:40 UTC) See text.

(Correction 2009-03-31) Well, both would be underspecifications, actually. So then what am I supposed to be underspecifying?

Effects of calling external non-C++ code must be defined by the platform which provides such means. This is not the responsibility of the C++ standard; it defines how a C++ program behaves. Simple.


Quoting - Raf Schietekat
I don't see why reasoning in terms of conceptual fences or whatever they might be called (that can be conservatively implemented on various architectures, or first guide program transformation) would be unsuitable.

Ok, and how would you define the simple example where one thread makes a store-release and another thread makes a load-acquire in terms of fences? Will you not revert to the terms "happens-before" and "sequenced-before" (or analogs, or just replicate the meaning of these terms everywhere)?

How would you define the following code in terms of fences:

int data;
std::atomic<int> flag;

// thread 1:
data = 1;
flag.store(1, release);

// thread 2:
if (flag.load(relaxed))
    flag.store(2, relaxed);

// thread 3:
if (flag.load(acquire) == 2)
    assert(data == 1);

?

Please provide a sketch of definitions "based on fences". Probably I am missing something; how can this be defined much more simply in terms of fences?


#25 "Ok, and how would you define the simple example where one thread makes a store-release and another thread makes a load-acquire in terms of fences? Will you not revert to the terms "happens-before" and "sequenced-before" (or analogs, or just replicate the meaning of these terms everywhere)?"
To describe the same facts, two descriptions are not going to be so radically different that they do not use common elements, of course. I'm only criticising that the proposed specification seems more suitable for automatic tools and egghead theoreticians (only meant affectionately, of course) than real-life programmers, once a proud race of self-reliant people, now reduced to pitiful shadows of themselves, unsure whether the code they once wrote and confidently used in production still passes muster with the new standard (one tiny little benign race somewhere in a corner and the whole program nevertheless comes to a screeching halt, but only if you're lucky!), and getting hooked like a drug addict on that nifty data-flow analysis tool where the intention is to interpret the clues to make changes in the program until one day hopefully the race-free status icon turns green.

"How would you define the following code in terms of fences:

[...]

Please provide a sketch of definitions "based on fences". Probably I am missing something; how can this be defined much more simply in terms of fences?"
I plead guilty to newspeak with regard to "fences", and I will try to avoid using the word for a while.

I'm not sure, because I don't quite understand the specification, but I would feel a lot less disempowered if I could reason along the lines that thread 1 conceptually writes flag={release(data),1}, then thread 2 writes flag={relaxed,2}, and then thread 3 fails to synchronise because it sees the wrong label (so the assert fails). Would "consume" semantics in thread 2 save the day, maybe, and what exactly would be the toll in performance?
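If I read the draft's "release sequence" rules correctly (hedged: this is my reading, not gospel), the assert can indeed fire as written, because thread 2's separate load and store break the chain from thread 1's release store; but if thread 2 instead used a read-modify-write, even a relaxed one, the sequence would continue and thread 3's acquire load of the value 2 would synchronize with thread 1's release store. A sketch (with hypothetical function names):

```cpp
#include <atomic>
#include <cassert>

int data = 0;
std::atomic<int> flag{0};

// thread 1
void t1() {
    data = 1;
    flag.store(1, std::memory_order_release);
}

// thread 2: a relaxed read-modify-write (unlike a separate load and
// store) continues the release sequence headed by thread 1's store
void t2() {
    int expected = 1;
    flag.compare_exchange_strong(expected, 2, std::memory_order_relaxed);
}

// thread 3: an acquire load that reads the 2 written by the RMW
// synchronizes with thread 1's release store, so the assert holds
void t3() {
    if (flag.load(std::memory_order_acquire) == 2)
        assert(data == 1);
}
```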

How much more satisfying would it also be if programmers were at least deemed worthy to know about the mechanics of it all, so that there is at least hope of taking control of our own destiny beyond the confines of the virtual reality presented to us. What does it really mean to perform a release or an acquire, and exactly what goes wrong when thread 2 intervenes (perhaps running on a core close to the one running thread 1, seeing a preview of the update on flag and being able to interfere with the cache coherence protocol in such a way that it can get its new value of flag out in front of the data released by thread 1)? What hardware works that way, what do the protocols look like, in principle? How would a "consume" be implemented to restore causal reasoning? Now there's just formal semantics with no link to anything tangible, so either you join a cult of people who like nothing more than to recite abstract incantations, or you just accept that the world you thought you knew has changed into something unrecognisable and incomprehensible, where you're like the farmer who can no longer collect his own seeds for the next season but has to buy them from the GMO oligopolists. (Too much? I find it wonderfully therapeutic, though...)

Multithreading is of course a ridiculously flawed approach to parallelism, because so much can go wrong. It would be far better to have solutions that free us from having to consider all possible interactions, or at least to only use conduits that have limited scope, like queues that take only self-contained elements from here to there like science-fiction wormholes between universes, not to mention entirely new paradigms like functional languages, free of user-visible side effects. Maybe metadata could be elevated in the language to be able to express not only invariants, but what synchronisation mechanisms are used with each specific variable, instead of considering either no state transfer (relaxed) or full-state transfer (release/acquire), and I wouldn't be surprised if this could even be used to improve performance.

But meanwhile we have C++, and we have to find a way to make its users actually understand it again.

#26 "But meanwhile we have C++, and we have to find a way to make its users actually understand it again."
My advice to the committee: don't spring this on the hoi polloi after your next druid convention yet (note how I'm epically bridging locations if not eras), first find something that resonates with the masses, not just with the vendors of design tools that will do heavy-duty program flow graph crunching because the human element is not in control anymore. Take some lessons from moribund CORBA: at least get sufficient experience with a reference implementation (maybe "sequentially consistent" is just a red herring?), and only then make it a standard. Does it need to be like what in my country would be called a "program law" (everything and the kitchen sink in one big text, rendering the parliament powerless, also with regard to individual items that the government wants to smuggle through because they otherwise wouldn't pass)? Do a TR for multithreading (except for no-brainers like legislating memory locations), and instead give us labeled loops already (like in Java), and try/finally (synchronous destruction of local objects doesn't work, unless with the significant clutter of a friend declaration and member references to function-local variables, just to be able to do, e.g., exit tracing and postcondition checking (cleanup would remain c/o RAII), which now forces single-return programming guidelines upon innocent cubicle dwellers all over the world, even though it's not even exception-safe).

#27 "maybe "sequentially consistent" is just a red herring?"
One of the correct choices TBB made, I think, is to give tbb::atomic release/acquire semantics. I extended the range of possibilities, but I didn't keep release/acquire by default merely for backward compatibility. I think that this default atomic is sort of like the natural dual of a lock, which also has acquire/release semantics, except the other way around: a lock is active holding a mutex, with interesting stuff happening between acquire and release, while an atomic would be "active" tunnelling state from one end (before the release-store) to the other (after the load-acquire), after which time you can safely lose interest. No seq_cst anywhere in sight, and nobody is missing it (well, I'm not).

I also decided to make atomic configurable with a default-semantics policy (which unfortunately required a prefix instead of a template argument, because of technical reasons related to C++). It is far safer to reason about an atomic like that. So now an atomic has a default policy rel_acq, which implies default release for a store, acquire for a load, and rel_acq when store and load are combined. You can override the default policy, and a relaxed_atomic would be relaxed about all its operations, while an ordered_atomic would make all its operations ordered (note that mut_ord will probably be the equivalent of C++0x's seq_cst, while ordered, somewhat heavier, will be sequentially consistent with any counterparty). Even though it remains possible, normally you would not override an operation's policy-directed default semantics, which eliminates a lot of room for mistakes, takes away some of the clutter, and makes it far easier to reason about what an atomic is doing.

The example overrides what would normally be a default-policy (rel_acq_)atomic with some explicitly relaxed operations, so this means that the user is alerted that something not altogether safe is going on, so he would not easily be fooled.

Together with my other criticisms against the current proposal for atomics, I think this demonstrates that we may not be getting the best deal if the current proposal goes to print as-is.

Before I possibly continue my comments about "Foundations of the C++ Concurrency Memory Model" (#6), here are already some observations about another text that references it, and that was also recommended to me: "Thread Basics", "a much more introductory paper [that] explains the underlying philosophy".

About the introduction:

"For present purposes hardware threads are similar enough to multiple cores, which are similar enough to multiple processors on separate chips, that we don't make the distinction."
And yet the distinction is so very important to any decision on whether to relax Independent-Reads-Independent-Writes sequential consistency, as discussed in the previous article. I always get suspicious when too much reality seems to get sacrificed for an abstraction.

"Even "experts" have often advocated contradictory approaches."
Without examples or even just references, that's somewhat of a cheap shot.

"To us, this is an easier model to explain. And programs built to be correct under this model will execute correctly on a genuinely parallel implementation. (Unfortunately, programs that are incorrect under this model may fail in much more mysterious ways on multicore implementations.)"
So it's a simplification of what can happen, but "correct" programs under the simplification are also "correct" in the unrestricted case? How can that be? Where is it explained? Isn't it deceptive to first only look at a simplified situation?

About "The problem with pure sequential consistency":

"And, as we will see below, insisting on pure sequential consistency generally does not provide us with any real advantage."
Whatever does that mean?

About "Data races and a more restrictive model":

"We then guarantee sequential consistency only when the program avoids data races."
That's certainly a lot nicer than saying that the presence of a data race makes for "undefined" program behaviour. But what does it mean? That a data race is like a catastrophe between sequential consistency and executions that are not simply interleaved? I'm rather underwhelmed at this point, but maybe I missed something? Also, how valuable is this result supposed to be, and why?

About "Why we haven't lost anything important":

"At first glance, the restriction of sequential consistency to data-race-free programs seems like an unnecessary complication for multithreaded programming, which is certainly already a sufficiently difficult problem. However, there are other important reasons to discourage data races. In fact, this restriction affects almost exclusively programs that we genuinely want to discourage."
This should be interesting... but after reading the section I'm not satisfied: I'm looking for the crucial difference between data races that make the program misbehave and ones that don't. Sure, it is generally a good thing to avoid them altogether, but why should they be so uniformly disastrous?

"Even if it is a synchronization variable, the increment may consist of two atomic accesses to c as above, instead of a single one to both read and write c, thus allowing the same interleaving as above."
This makes no sense: why wouldn't a "synchronization variable" have an atomic increment? A Java volatile is a strange creature, but I see no reason to replicate its quirks elsewhere.
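For the record, here is the difference in C++0x terms (a sketch; the exact Java semantics are beside the point): a single read-modify-write operation cannot lose updates, whereas an increment compiled as a separate atomic load and atomic store can.

```cpp
#include <atomic>
#include <thread>

// Incrementing via a single read-modify-write: two threads can never
// interleave inside fetch_add, so the total is always exact.
int count_with_fetch_add(int per_thread) {
    std::atomic<int> c{0};
    auto inc = [&c, per_thread] {
        for (int i = 0; i < per_thread; ++i)
            c.fetch_add(1);
    };
    std::thread t1(inc), t2(inc);
    t1.join();
    t2.join();
    return c.load();  // always exactly 2 * per_thread
}

// The problematic two-access version the quote describes is, in disguise:
//     c.store(c.load() + 1);
// Two threads can both load the same old value between those accesses, so
// the final count can land anywhere between per_thread and 2 * per_thread,
// which is why no fixed result can be asserted for it.
```

This is exactly why one expects a "synchronization variable" to offer the single-operation form.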

"Programs with data races almost never produce consistently correct results across various hardware and compiler platforms. (The literature sometimes talks about "benign data races". Most of them are not benign in this sense.) Thus it makes little sense to sacrifice performance to accommodate such programs that are almost certainly incorrect."
Now that's devious: I feel that "undefined behaviour" is right around the corner. And who is talking about sacrificing performance: the example in the first article reviewed here, by the same author, is about better performance courtesy of a benign data race!

About "Avoiding data races at higher levels of abstraction":

"In the general case, it is the library's responsibility to specify what combinations of concurrent calls will introduce data races."
It's funny you should say that, because N2800 doesn't even specify whether std::vector::operator[] is safe, let alone concurrent use of the reference so obtained.

About "A few Java specifics":

"Hence it cannot allow arbitrary behavior for data races."
That's nice. So will the compiler reject the program because it cannot prove that it is race-free? No, apparently "very sophisticated authors" may still use them. So why shouldn't this be allowed in C++ as well, without rendering behaviour of the entire program "undefined" (which means that one would be lucky if the program just crashes)?

About "A few C++0x specifics":

"Following established convention, we will use the term C++0x to refer to the next C++ standard, in spite of the fact that we do not expect it to be formally adopted as an ISO standard until 2010 or 2011."
We can only hope that it won't be rushed out in 2009...

"Since C++ is not designed to provide protection against untrusted code, it guarantees nothing in the event of a data race. Any program allowing a data race produces "undefined behavior"."
And that's crazy: why the radicalism, why shouldn't there be intermediate guarantees and prohibitions that allow the programmer to take responsibility? Why wouldn't any damage be local to an affected variable?

"In the absence of data races, and provided that certain low-level library facilities are not used (see below) it again provides sequential consistency. Also like Java, this is not immediately apparent from the official language description. In this case the language description is made more complicated not by the treatment of data races, but by those low-level library facilities."
Again, why is "sequential consistency" so all-important that it should have such a radical impact, when in fact it does both too much (often standing in the way of performance with, e.g., reference counting) and too little (see "Avoiding data races at higher levels of abstraction") to be a panacea? Or am I confused by the use of seq_cst for what is decreed to be the default mode for atomics (although it seems hard to imagine that whoever chose this name didn't want to bundle it all up)? It does sound nice (who would want to be "inconsistent"), but experience has taught that it comes with a severe performance penalty.
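Reference counting is the canonical case I have in mind (a sketch of the usual shared_ptr-style scheme): the increments need no ordering at all, and only the final decrement needs acquire-release, so defaulting everything to seq_cst buys nothing but cost here.

```cpp
#include <atomic>

// Minimal reference count: nobody acts on the value produced by an
// increment, so relaxed ordering suffices there. Only the decrement that
// might drop the last reference needs acquire-release, to make the dying
// object's final state visible to whoever destroys it.
struct ref_counted {
    std::atomic<int> refs{1};

    void add_ref() {
        refs.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns true when the last reference is gone and the caller
    // may destroy the object.
    bool release() {
        return refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```

Nothing in this pattern ever needs a globally agreed total order of operations across unrelated atomics, which is precisely what seq_cst pays for.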

"std::lock_guard _(mtx);"
Is that really a recommendable "name"? Wouldn't it be more self-documenting to name it "anonymous", and less dangerous (losing that tiny little underscore breaks the protection)?
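For comparison, here is what I would consider a self-documenting guard (a trivial sketch; the counter names are mine):

```cpp
#include <mutex>
#include <thread>

int counter = 0;
std::mutex counter_mutex;

void add_many(int n) {
    for (int i = 0; i < n; ++i) {
        // A descriptive name instead of "_": it survives refactoring and
        // cannot be silently dropped. Losing the name altogether would at
        // best fail to compile and at worst (with brace syntax) create a
        // temporary destroyed at the semicolon, releasing the lock
        // immediately.
        std::lock_guard<std::mutex> counter_guard(counter_mutex);
        ++counter;
    }
}
```

With the guard properly named and scoped, concurrent increments are serialised as intended.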

"It currently appears that the C++0x model will be followed in some other environments as well."
Bush was once re-elected. And languages have to last longer than U.S.A. presidents.

About "Some more examples":

"Hence a compiler is not allowed to transform the code as we have suggested here, in spite of the fact that hardware conditional move instructions often make the last version faster, or it might allow a containing loop to be vectorized."
I had not seen this example before, but it seems rather deceptive: first the code is deemed not to have a data race based on a specific initial state (because the model wants to condemn all data races, it has to have a minimal concept of a data race to be at all acceptable), and then the optimised code supposedly introduces a race because that same specific initial state is not taken into consideration to optimise away all the code in Thread 1. While at first it seems satisfying that the compiler is also punished, it is still the user who pays (with missed performance). So what does he get in return, really? Similarly for the loop example that follows, where the performance penalty is even greater. I'm leaning toward allowing the compiler to perform these optimisations, but this certainly merits further study (how about an example from a real program?).
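My reading of the kind of transformation at issue (a reconstruction, not necessarily the paper's exact code): the branchy original stores to x only when the condition holds, while the conditional-move version stores unconditionally, and that new store is what the model counts as a race.

```cpp
// Original: no store to x at all when the condition is false, so another
// thread may touch x concurrently without a race in that case.
void assign_original(bool cond, int& x) {
    if (cond)
        x = 42;
}

// Conditional-move form (often faster in hardware, and vectorisable):
// x is written on every call, even when !cond merely rewrites the old
// value. That unconditional store can race with a concurrent access that
// the original code never conflicted with.
void assign_with_cmov(bool cond, int& x) {
    x = cond ? 42 : x;
}
```

Single-threaded, the two are indistinguishable; the disagreement is entirely about whether the extra store matters when another thread is watching.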

About "A note on implementation of synchronization operations (optional)":

"It is included here only in the hope that we might help reconcile the presentation in the rest of this paper with the reader's preconceptions of the subject."
Much appreciated...

"(Architectures that do not require such alignment usually cannot guarantee this kind of atomicity, since a single load or store instruction may require multiple accesses to physical memory.)"
That's confusing: it seems to imply that a processor couldn't guarantee indivisible aligned accesses if it also allowed unaligned accesses, which I believe several do (at a performance penalty). But I trust it's just unfortunate wording.

"A load instruction is said to become visible at the point at which it stops being affected by later stores."
Now what could that possibly mean... in English? This also means that the meaning of the next bullet point is unclear, and the one after that an apparent contradiction (why would an architecture whose stores become visible in a single total order need a fence to order stores?).

"This description is far too sloppy to serve as the basis of a real implementation"
I'll say!

"The fence preceding the assignment to the synchronization variable x_init prevents this reordering."
Now I'm very confused, what with the earlier statement "stores become visible in a single total order". Or am I just biased by the SPARC memory-model hierarchy and PSO?

"The fence following the x_init is not necessary in this example, but is required in the general case to prevent reordering of a synchronization store with a subsequent synchronization load."
But very likely we're not interested in preventing such reordering, only in release semantics. The second fence is not the same beast at all: a #StoreLoad (in SPARC parlance) is far more expensive than a #StoreStore. In what world would this not be the case, and how doesn't that make the example mostly irrelevant? (At this point it would be very interesting to explain how fences are implemented in hardware.)
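To illustrate what I mean by "only release semantics" (a sketch using C++0x atomics, keeping the x_init name from the example; the payload variable is mine):

```cpp
#include <atomic>

std::atomic<bool> x_init{false};
int x;  // the payload being published

void publish() {
    x = 42;  // ordinary store, must become visible before the flag
    // A release store: on SPARC this needs roughly a #StoreStore (plus
    // #LoadStore) barrier before it, which is cheap. A seq_cst store would
    // additionally imply the far more expensive #StoreLoad ordering.
    x_init.store(true, std::memory_order_release);
}

bool consume(int& out) {
    // The acquire load pairs with the release store above: once the flag
    // is seen true, the payload written before it is guaranteed visible.
    if (x_init.load(std::memory_order_acquire)) {
        out = x;
        return true;
    }
    return false;
}
```

Nothing in this publish/consume pattern calls for ordering a synchronization store against a *subsequent* synchronization load, which is exactly the part that costs the most.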

"(Real implementations would typically avoid the repeated xchg operations while waiting for a lock, since they create large amounts of unnecessary communication between processor caches.)"
Is this about backoff strategies?

"Note that lock()/unlock() operations only require one fence each, not two like a synchronization store operation. This relies on an argument that the difference is not observable in the absence of data races."
I only have difficulty understanding why this might need any consideration at all. To me sequential consistency is a mostly unnecessary and costly burden, and memory_order_seq_cst as a default likely a mistake.

"When standards people say it is undefinable, they sometimes really mean I don't understand it." - Paul E. McKenney

I stumbled upon a slide presentation (dated 2008-02-01) by a certain Paul E. McKenney (IBM Distinguished Engineer). He apparently also doesn't get it (the sequential consistency thing), but then he's only a lowly practitioner ("And I have been doing parallel C for about 17 years"), not an exalted academic ("People listen to academics, even when we practitioners think that they shouldn't :-)"). He also seems to think that it's now too late to do anything anymore ("But starting in 2005 might have produced an alternative to sequential consistency"). Sigh...

Apparently some people (one of them the aforementioned Paul E. McKenney, Ph.D.) actually think it's a good thing to be... inconsistent (sounds awful, doesn't it?), and they're calling it "Relativistic Programming". I'm not sure that the opposite of something that I don't entirely trust is necessarily better, but it now seems worth having a look.

Here's my intuition: People are ill equipped to track the kind of dependencies that are involved in the make-up of a sequentially consistent program as C++0x would have it (we're really not even very good at following a single train of thought). We need not just crutches, but heavy machinery to help us there, and if the alternative is the hell and damnation of "undefined" program behaviour due to a single little innocent mistake, that's not what I would consider a happy place. Moreover, computers don't like it, and they're telling us by dragging their feet when running programs that require it.

So why exactly are we being coerced in the direction of "sequential consistency" again? Even now, on the brink of ratification, I have yet to see anything to convince me that this is it. Maybe after everything is cast in stone, somebody will condescend to our level and give a satisfactory explanation for the rest of us, but I'm not holding my breath, because I think that the simplicity of sequential consistency is merely deceptive. And of course then it may well be too late.

Instead, I'd rather drop this heavy burden, and use an environment that better serves its purpose as the interface between my human mind and the computer's mechanical brain, where brittleness should be mitigated rather than exacerbated. Let me worry about a manageable scope only, and then hand it over by doing a release on a baton variable that somebody else will then use to acquire my state, as in a relay run. If I ever care about a common timeline (which won't be very often), I'll use a dedicated device, but otherwise I shouldn't have to be bothered about it.
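In code, the relay I have in mind looks something like this (a sketch; the baton and runner names are mine):

```cpp
#include <atomic>
#include <thread>

int shared_state = 0;           // plain variable: no race, thanks to the baton
std::atomic<bool> baton{false};

int run_relay() {
    // Runner A does its leg of the race, then hands over the baton with a
    // release store: everything written before the handoff travels with it.
    std::thread runner_a([] {
        shared_state = 7;
        baton.store(true, std::memory_order_release);
    });

    // Runner B waits for the baton with an acquire load, and is then
    // entitled to read all of runner A's state.
    int seen = 0;
    std::thread runner_b([&seen] {
        while (!baton.load(std::memory_order_acquire)) {
            // spin until the handoff
        }
        seen = shared_state;  // guaranteed to observe 7
    });

    runner_a.join();
    runner_b.join();
    return seen;
}
```

Each runner only has to reason about its own leg and the handoff point; no common timeline across unrelated variables is ever consulted.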

I may well be mistaken, but now I know that this wouldn't be just because I don't have a Ph.D. to my name (although I again admit that this is just my intuition, so you should make up your own mind independently).
