store_with_release() and load_with_acquire() on VC8

On VC8, I see that __TBB_store_with_release() and __TBB_load_with_acquire() are both implemented with _ReadWriteBarrier(). Having just learned about memory barriers and such, I have a question about this. Can __TBB_store_with_release() use a _WriteBarrier() instead, and similarly _ReadBarrier() for __TBB_load_with_acquire()?

Thanks!


Quoting - e4lam
On VC8, I see that __TBB_store_with_release() and __TBB_load_with_acquire() are both implemented with _ReadWriteBarrier(). Having just learned about memory barriers and such, I have a question about this. Can __TBB_store_with_release() use a _WriteBarrier() instead, and similarly _ReadBarrier() for __TBB_load_with_acquire()?

No, they can't.
A read barrier is somewhat orthogonal to an acquire barrier. While an acquire barrier prevents all memory accesses (i.e. both reads and writes) from hoisting above the load, a read barrier prevents reads on one side of the barrier from intermixing with reads on the other side of the barrier. The same goes for a write barrier.
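
For illustration, here is a minimal sketch (with hypothetical flag/data variables, not code from TBB) of an ordering that a read barrier permits but an acquire barrier forbids:

```cpp
#include <intrin.h>

static long flag;
static int  data;

int consume()
{
    long f = flag;    // the synchronizing load
    _ReadBarrier();   // compiler fence: orders this load against later READS only
    int d = data;     // ok: kept after the load of 'flag'
    data = 0;         // NOT ordered: the compiler may hoist this WRITE
                      // above the load of 'flag'
    // An acquire barrier (e.g. _ReadWriteBarrier() after the load) would
    // keep both reads and writes from hoisting above the load.
    return f ? d : -1;
}
```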

However, IMHO, fine-grained precise compiler fences are mostly useless, because they affect only the compiler and so have basically zero run-time cost. So IMHO it's OK to put the strongest full compiler fence everywhere.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

"No, they can't."
I would say that the answer is yes, but maybe you know something that I don't (or that I have forgotten again)?

"Read barrier is a kind of orthogonal to acquire barrier. While acquire barrier prevents all memory accesses (i.e. both reads and writes) to hoist above the load, read barrier prevents reads on one side of the barrier to intermix with reads on the other side of the barrier. The same for write barrier."
Can you quote the specification for these functions (maybe _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() are all just compiler fences?), and clarify what exactly you mean by "hoist" and "intermix" (maybe "hoist" for C++ vs. execution and "intermix" for C++ vs. machine code?)?

"However, IMHO, fine-grained precise compiler fences are mostly useless, because they affect only compiler, so have basically zero run-time cost. So IMHO it's Ok to put the strongest full compiler fence everywhere."
Even if they only affect the compiler without causing any specific instruction to be emitted (on a specific architecture, notably x86!), their cost and/or effect may not be zero, because they could, at least conceivably, be preventing an optimisation reordering that would otherwise corrupt the program, so I wouldn't call them "useless" (that may be clear to you, but you have to keep your audience in mind when you write such things). By the same logic, perhaps a weaker compiler fence might allow a "partial optimisation" to still occur (subject to testing), so indiscriminately putting the strongest compiler fence everywhere might not be appropriate, even if it would be a conservative approximation (conserving correctness, I mean).

Quoting - Raf Schietekat

"No, they can't."
I would say that the answer is yes, but maybe you know something that I don't (or that I have forgotten again)?

"Read barrier is a kind of orthogonal to acquire barrier. While acquire barrier prevents all memory accesses (i.e. both reads and writes) to hoist above the load, read barrier prevents reads on one side of the barrier to intermix with reads on the other side of the barrier. The same for write barrier."
Can you quote the specification for these functions (maybe _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() are all just compiler fences?), and clarify what exactly you mean by "hoist" and "intermix" (maybe "hoist" for C++ vs. execution and "intermix" for C++ vs. machine code?)?

Of course:
http://www.google.com/search?q="_readbarrier"+"_writebarrier"

Since that link leads to the official documentation, please ignore my "hoist" and "intermix" at this point.

Quoting - Raf Schietekat
"However, IMHO, fine-grained precise compiler fences are mostly useless, because they affect only compiler, so have basically zero run-time cost. So IMHO it's Ok to put the strongest full compiler fence everywhere."
Even if they only affect the compiler without causing any specific instruction to be emitted (on a specific architecture, notably x86!), their cost and/or effect may not be zero, because they could, at least conceivably, be preventing an optimisation reordering that would otherwise corrupt the program, so I wouldn't call them "useless" (that may be clear to you, but you have to keep your audience in mind when you write such things). By the same logic, perhaps a weaker compiler fence might allow a "partial optimisation" to still occur (subject to testing), so indiscriminately putting the strongest compiler fence everywhere might not be appropriate, even if it would be a conservative approximation (conserving correctness, I mean).

I am quite skeptical regarding their practical usefulness. It would be interesting to see some (at least synthetic) showcase for fine-grained compiler fences where a finer-grained fence makes a significant difference over a coarser-grained one. Can you construct one?

The specification from Microsoft is quite unsatisfactory (so is it a compiler fence, or isn't it? and will _ReadWriteBarrier() keep a write before a read?), but the mention of specific hardware architectures at least seems to imply that on specific architectures any necessary machine instructions will be issued.

I have no ambition to demonstrate any real difference, let alone a significant one, but how are you going to prove a negative...

Quoting - Raf Schietekat

The specification from Microsoft is quite unsatisfactory (so is it a compiler fence, or isn't it? and will _ReadWriteBarrier() keep a write before a read?), but the mention of specific hardware architectures at least seems to imply that on specific architectures any necessary machine instructions will be issued.

I have no ambition to demonstrate any real difference, let alone a significant one, but how are you going to prove a negative...

Yes, the documentation is unsatisfactory.
_ReadWriteBarrier() will keep a write before a read.
_Read/_Write/_ReadWriteBarrier() are compiler-only fences (see http://msdn.microsoft.com/en-us/library/ms684208%28VS.85%29.aspx).
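
For concreteness, here is a sketch (hypothetical names, not the actual TBB source) of the VC8-style pattern under discussion: a plain store/load bracketed by a compiler-only fence, with x86's strong hardware ordering supplying the rest:

```cpp
#include <intrin.h>

// release store: keep all earlier accesses before the store
inline void store_with_release(volatile long& dst, long value)
{
    _ReadWriteBarrier();   // compiler-only fence, no instruction emitted
    dst = value;           // x86 hardware does not reorder this store
                           // with earlier accesses
}

// acquire load: keep all later accesses after the load
inline long load_with_acquire(const volatile long& src)
{
    long value = src;      // x86 hardware does not reorder this load
                           // with later accesses
    _ReadWriteBarrier();   // compiler-only fence, no instruction emitted
    return value;
}
```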

I can't prove the opposite. Proving negative things is usually more problematic because I must test ALL cases, while you need to find just one...

"_ReadWriteBarrier() will keep a write before a read."
How could that possibly be useful without a hardware fence?

"_Read/_Write/_ReadWriteBarrier() are compiler only fences (see http://msdn.microsoft.com/en-us/library/ms684208%28VS.85%29.aspx)."
Ah, look: "The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics prevent compiler re-ordering only." Of course, in the documentation about these functions/intrinsics themselves there is no such statement... assuming this one is correct, of course. So here we have the heaviest fence of all, with just the generic name MemoryBarrier() to add to the confusion, to be avoided if at all possible, but the documentation doesn't tell you that, and there's no reference in sight to a cheaper alternative for use where needed... Not very nice at all. So how should one implement __TBB_store_with_release() and __TBB_load_with_acquire() so that they don't break down on architectures other than x86/x64?

"I can't prove the opposite. Proving negative things are usually more problematic because I must test ALL cases, and you must find just one..."
If you think there's no cost anyway, then that's all the more reason to be conservative instead of avoiding the use of those functions/intrinsics.

Quoting - Raf Schietekat

"_ReadWriteBarrier() will keep a write before a read."
How could that possibly be useful without a hardware fence?

I am aware of at least 3 practical use cases:
1. Interaction between a thread and a UNIX signal handler.
2. Interaction between threads running on the same processor.
3. Interaction between arbitrary threads when hardware fences are provided by other means.

Quoting - Raf Schietekat

"_Read/_Write/_ReadWriteBarrier() are compiler only fences (see http://msdn.microsoft.com/en-us/library/ms684208%28VS.85%29.aspx)."
Ah, look: "The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics prevent compiler re-ordering only." Of course, in the documentation about these functions/intrinsics themselves there is no such statement... assuming this one is correct, of course. So here we have the heaviest fence of all, with just the generic name MemoryBarrier() to add to the confusion, to be avoided if at all possible, but the documentation doesn't tell you that, and there's no reference in sight to a cheaper alternative for use where needed... Not very nice at all. So how should one implement __TBB_store_with_release() and __TBB_load_with_acquire() so that they don't break down on architectures other than x86/x64?

Just mark the variable as volatile. That's all.
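
A sketch of what that suggestion amounts to, assuming MSVC's documented extension (VS2005 and later) that a volatile read has acquire semantics and a volatile write has release semantics:

```cpp
// Hypothetical helpers, not the actual TBB source.
template <typename T>
void store_with_release(volatile T& dst, T value)
{
    dst = value;   // MSVC extension: volatile write = release store
}

template <typename T>
T load_with_acquire(const volatile T& src)
{
    return src;    // MSVC extension: volatile read = acquire load
}
```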

Quoting - Dmitriy Vyukov

I am aware of at least 3 practical use cases:
1. Interaction between a thread and a UNIX signal handler.
2. Interaction between threads running on the same processor.
3. Interaction between arbitrary threads when hardware fences are provided by other means.

Really?
1. Maybe, but I don't know what the issues are here.
2. Can probably be disregarded because obsolete.
3. You wouldn't be able to meaningfully combine them with _ReadWriteBarrier(), is what I'm saying.

Quoting - Dmitriy Vyukov

Just mark the variable as volatile. That's all.

I'll pretend I didn't see that.

(Added) Literally: don't do that unless it's well encapsulated and won't infect the rest of the program with Microsoft-onliness.

(Added) And why would the compiler add machine instructions without applying the accompanying compiler fence? That makes no sense at all.

Quoting - Raf Schietekat

Really?
1. Maybe, but I don't know what the issues are here.
2. Can probably be disregarded because obsolete.
3. You wouldn't be able to meaningfully combine them with _ReadWriteBarrier(), is what I'm saying.

Well, what can I say... I am a bit confused... I can go into deep detail on each point... however, Raf, aren't you trolling here?

Quoting - Dmitriy Vyukov
Well, what can I say... I am a bit confused... I can go into deep detail on each point... however, Raf, aren't you trolling here?

That's an unfair assumption, but of course you're not obliged to continue this.

Quoting - Raf Schietekat

That's an unfair assumption, but of course you're not obliged to continue this.

In short:
1. You only need compiler fences here; basically you need to 'strip' the hardware part from a fence. Since the thread part and the signal part execute on a single OS/hardware thread, there is no issue of hardware ordering (see the sketch after this list).
2. It's not obsolete. You can bind two or more threads to a single processor, which is somewhat reasonable for low-level parallelism support libraries like TBB. Then, too, you need only the compiler part of the fences.
3. I am (and not only I) indeed able to combine them in a meaningful way. Check out Joe Seigh's SMR+RCU:
http://lkml.indiana.edu/hypermail/linux/kernel/0505.1/0252.html
or David Dice et al.'s Asymmetric Dekker Synchronization:
http://home.comcast.net/~pjbishop/Dave/Asymmetric-Dekker-Synchronization...
or my Asymmetric Reader-Writer Mutex:
http://groups.google.com/group/lock-free/browse_frm/thread/1efdc652571c6137
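
As an illustration of case 1, here is a minimal sketch (hypothetical names) of a thread publishing data to a UNIX signal handler; both run on the same OS/hardware thread, so only the compiler can reorder, and a compiler-only fence suffices:

```cpp
#include <signal.h>

// compiler-only fence: an empty asm with a "memory" clobber on GCC
// (_ReadWriteBarrier() would be the MSVC equivalent)
#define COMPILER_FENCE() __asm__ __volatile__("" ::: "memory")

static volatile sig_atomic_t ready = 0;
static int payload;

static void on_signal(int)
{
    if (ready)            // the handler interrupts this same thread,
        (void)payload;    // so there is no hardware ordering issue
}

void publish()
{
    payload = 42;
    COMPILER_FENCE();     // keep the payload write before the flag write
    ready = 1;
}
```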

Quoting - Raf Schietekat

I'll pretend I didn't see that.

(Added) Literally: don't do that unless it's well encapsulated and won't infect the rest of the program with Microsoft-onliness.

(Added) And why would the compiler add machine instructions without applying the accompanying compiler fence? That makes no sense at all.

Since you consider MS volatiles a replacement for MS _ReadWriteBarrier(), MS-onliness is not an issue at all. Anyway, for now (until C++0x) you will have to fall back onto a platform-specific level on every platform, so I do not see how you can do better than that anyway.

MS volatiles provide both compiler and hardware ordering. Hardware-only fences do not make any sense; the MS guys understand this.

See, how else would I have obtained those specific links without trawling the whole Internet? :-) Thanks, I'll do some reading tonight, and maybe tomorrow some more trolling.

Thanks for the replies!

Dmitriy, sorry for the delayed response.

#13 No, I don't see it. Or maybe it's a misunderstanding. I'm not aware of any bidirectional machine-level memory fences, so why would there be compiler-level ones? Isn't the real meat in the atomic operation, flanked by necessarily asymmetric fences, on one side or both? That would go for 1 and 2. I couldn't find any mention of "compiler fence" in the first two references for 3, and in your own example the uses of _ReadWriteBarrier() are even commented as either acquire or release, so why not use _ReadBarrier() and _WriteBarrier() instead?

#14 As for my reaction to the MS-specific treatment of "volatile", that's just because it's so much easier to infect code by changing the meaning of an existing keyword than by the use of a new construct that would cause a compilation error elsewhere.

I still haven't found an accessible discussion about how those operations actually work. For example, if one thread does a release-write, why would that be more costly than just a compiler fence if the read-acquire happens to be on the same core even if that wasn't known before it so happened? Well, that's just out of curiosity at this point...

Quoting - Raf Schietekat
Dmitriy, sorry for the delayed response.

#13 No, I don't see it. Or maybe it's a misunderstanding. I'm not aware of any bidirectional machine-level memory fences, so why would there be compiler-level ones?

As for hardware bidirectional fences, check out membar #LoadLoad, membar #StoreStore on SPARC RMO, and SFENCE, LFENCE on x86.
I believe they are actually just as useful and just as widespread as uni-directional fences.

Quoting - Raf Schietekat
Isn't the real meat in the atomic operation, flanked by necessarily asymmetric fences, on one side or both? That would go for 1 and 2. I couldn't find any mention of "compiler fence" in the first two references for 3, and in your own example the uses of _ReadWriteBarrier() are even commented as either acquire or release, so why not use _ReadBarrier() and _WriteBarrier() instead?

The compiler barrier must be in the same place where you would normally put a #StoreLoad fence. In my asymmetric mutex you can find that place by the "no explicit #StoreLoad" comment.
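
For readers following along, here is a minimal sketch (hypothetical flags, not Dmitriy's mutex) of the classic Dekker-style store-then-load that the #StoreLoad fence protects; the asymmetric schemes replace the hardware fence on the fast side with a compiler-only fence and compensate on the slow side:

```cpp
#include <emmintrin.h>   // _mm_mfence

static volatile long flag0 = 0, flag1 = 0;

bool thread0_try_enter()
{
    flag0 = 1;
    _mm_mfence();        // full #StoreLoad fence; without it (or an
                         // asymmetric substitute) both threads could
                         // read 0 from each other's flag
    return flag1 == 0;   // enter only if thread 1 has not announced itself
}
```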

Barriers that are commented as acquire and release are different barriers; they are not relevant to this discussion.

Quoting - Raf Schietekat
#14 As for my reaction to the MS-specific treatment of "volatile", that's just because it's so much easier to infect code by changing the meaning of an existing keyword than by the use of a new construct that would cause a compilation error elsewhere.

Agreed.
It may make porting MSVC code to other platforms quite problematic.
The better way would be to finally implement something along the lines of std::atomic<>.

Quoting - Raf Schietekat
I still haven't found an accessible discussion about how those operations actually work. For example, if one thread does a release-write, why would that be more costly than just a compiler fence if the read-acquire happens to be on the same core even if that wasn't known before it so happened? Well, that's just out of curiosity at this point...

The best description to date is "Asymmetric Dekker Synchronization" by David Dice et al.

It's not about the elimination of release/acquire fences, it's about the elimination of the #StoreLoad-style fence (MFENCE). Release/acquire fences can be eliminated too, though; that's done in Linux kernel RCU. Check out:
http://lwn.net/Articles/253651/
You can see how Paul McKenney uses asymmetric synchronization to eliminate even the release/acquire fences from the reader side; the compiler fences are still in place.
The technique basically allows you to "strip" the hardware part from some fences and leave only the compiler part, then compensate for the hardware part by other means.
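
A rough sketch (hypothetical names, modeled loosely on the LWN description) of an RCU-style reader whose "lock" is a compiler-only fence; the writer compensates by waiting for every CPU to pass through a quiescent state instead of making readers pay for hardware fences:

```cpp
#define COMPILER_FENCE() __asm__ __volatile__("" ::: "memory")

static int* volatile shared_ptr;   // pointer published by the writer

int reader()
{
    COMPILER_FENCE();              // "rcu_read_lock": no hardware fence
    int* p = shared_ptr;           // x86 honors dependent-load ordering
    int v = p ? *p : 0;
    COMPILER_FENCE();              // "rcu_read_unlock"
    return v;
}
```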

#18 Er, why did I write that? I have no idea... sorry, please ignore.

#19 Except that there is no compiler barrier where the #StoreLoad comment is, and the acquire and release barriers are the same... oh well.

#20 So, if compiler-only fences are such a good idea, why haven't I seen them in TBB or C++0x, and why haven't they come up in a discussion here before? An oversight to be corrected?

#21 Superior performance through an entirely different approach may be (highly) preferable, where feasible, but I was still wondering about the implementation of compiler-only vs. machine-level memory fences etc. Perhaps the hardware could somehow dynamically detect that everything occurs on the same core, and avoid cache-related external chatter?

But maybe I should drop the subject: my atomics proposal seems to be dead and buried, and revving up with a disengaged clutch is said to be bad for the engine...

Quoting - Raf Schietekat

#19 Except that there is no compiler barrier where the #StoreLoad comment is, and the acquire and release barriers are the same... oh well.

It's a known issue :)
I am in the process of writing a lengthy, detailed description of asymmetric synchronization; however, I do not know how long it will take... probably months... and in the process I may completely lose interest, so it may actually not appear at all :(

Quoting - Raf Schietekat

#20 So, if compiler-only fences are such a good idea, why haven't I seen them in TBB or C++0x, and why haven't they come up in a discussion here before? An oversight to be corrected?

Hmm... in order to correct this I would suggest you re-read the C++0x draft, especially the part related to std::atomic_signal_fence() (which was previously called std::compiler_fence()) :)
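
For reference, a sketch of that C++0x facility: std::atomic_signal_fence() orders operations with respect to a signal handler on the same thread, i.e. it is the standardized compiler-only fence:

```cpp
#include <atomic>

std::atomic<int> ready(0);
int payload;

void publish()
{
    payload = 42;
    // compiler-only ordering: no hardware fence instruction is emitted
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}
```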

Quoting - Raf Schietekat

#21 Superior performance through an entirely different approach may be (highly) preferable, where feasible, but I was still wondering about the implementation of compiler-only vs. machine-level memory fences etc. Perhaps the hardware could somehow dynamically detect that everything occurs on the same core, and avoid cache-related external chatter?

Hardware indeed avoids cache-coherence-related traffic for data accessed from a single core.
However, there are still overheads related to instruction ordering: pipeline dumps and store buffer dumps.
