Additions to atomic<T>

Raf Schietekat

Store and load are as they should be (you don't need an exclusive store to clear the monitor). Thanks for drawing my attention to the fact that ARM dropped the ball on clearing the monitor for abrupt control-flow changes (how about a recall?); it then falls on the code taking control (in this case the Linux kernel) to correct that, because there's nothing a user can do to absolutely guarantee correctness, is there, except perhaps do silly things to improve the odds at the expense of performance? 64-bit support can be done without a lock (I'll work on it over the weekend); if you want to try this now, reverse the strex result-flag test (I had not noticed yet that it flags failure instead of success) and disable the 64-bit test in test_atomic (TBB only needs 64-bit atomics if a pointer is that size). The sign-extension issues are because "cmp" has no width argument; it always compares entire registers. Reliance on g++ intrinsics is not a net gain for TBB.
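For reference, here is a minimal sketch of a 32-bit compare-and-swap for ARMv6+ in ARM state, assuming GCC inline-asm conventions (illustrative only, not the gcc_arm.h code); it shows the strex result convention mentioned above (0 means the exclusive store succeeded, 1 means it failed) and the whole-register cmp:

#include <stdint.h>

// Hypothetical sketch, not the actual gcc_arm.h implementation.
static inline int32_t cas32_sketch(volatile int32_t* ptr, int32_t comparand, int32_t intended) {
 int32_t observed, failed;
 do {
  __asm__ __volatile__(
   "ldrex   %0, [%2]\n\t"      // load and set the exclusive monitor
   "mov     %1, #1\n\t"        // pessimistically mark the store as failed
   "cmp     %0, %3\n\t"        // compares whole registers, hence no width suffix
   "strexeq %1, %4, [%2]\n\t"  // writes 0 on success, 1 on failure
   : "=&r"(observed), "=&r"(failed)
   : "r"(ptr), "r"(comparand), "r"(intended)
   : "cc", "memory");          // conservative "memory" clobber, for the sketch only
 } while( failed && observed==comparand );
 return observed;
}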

(Added) Please use new gcc_arm.h snapshot for testing (also disables 64-bit test in test_atomic), on top of previous diff.

Attachment: gcc_arm.h (17.82 KB)

Thank you, that makes sense if "old" compilers have to be supported.

Besides: I managed to run TBB on ARM using only the cmpxchg implementations. However, release builds tend to freeze after a couple of seconds; debug builds just need some more time. It might be related to the issue I mentioned previously. Am I right that the standard TBB does not provide a way of specifying operations for atomic stores for all sizes (1, 2, 4, 8)?

I am off to check Raf's patch now, but it needs some more work.
For instance, why do I have to specify the type and the size of atomic operations for the various types separately?:
__TBB_M(1,int8_t ,"b")
__TBB_M(2,int16_t,"h")
__TBB_M(4,int32_t,"" )
The first argument should always be sizeof() the second here, shouldn't it?

--L

Regarding cmp, you are totally right. I was misled by your code actually, because it allowed size modifiers to be appended to the cmp mnemonic. Now I suppose that was just to break compilation and catch problematic cases.

Thanks for providing an update, I will see how far I get.

Regards!

Alexey Kukanov (Intel)

Yes, you are right; in TBB 2.0 and 2.1 the stores are assumed to be atomic (except for the 64-bit store on IA-32, which is implemented explicitly), while fences are missing.

Both type and size are specified because (at least, in the vanilla TBB) the size digit becomes a part of a function name by using ## in the macro.
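A minimal sketch of that pattern (illustrative only, not the actual TBB macro; the body just delegates to a GCC builtin so that the sketch compiles) shows why the digit cannot be derived with sizeof at preprocessing time:

#include <stdint.h>

// S is pasted into the function name with ##; SUFFIX would select the
// width-specific instruction ("b", "h", "") in real inline asm and is unused here.
#define SKETCH_CAS(S,T,SUFFIX) \
 inline T sketch_compare_and_swap_##S( volatile T* ptr, T intended, T comparand ) { \
  return __sync_val_compare_and_swap( ptr, comparand, intended ); \
 }

SKETCH_CAS(1,int8_t ,"b")
SKETCH_CAS(2,int16_t,"h")
SKETCH_CAS(4,int32_t,"" )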

Raf, here comes an updated version for ARM.
- it compiles without the assembler complaining about not-allowed registers, etc.
- it allows TBB to run my example without freezing
- no 64-bit support included, but I would consider it useful (ABA-prevention things)

So in case you need it I vote for using your approach in TBB (at least for the ARM side).

--L

Attachment: gcc_arm.h (17.82 KB)
Raf Schietekat

Thanks, and sorry for the bad proofreading (good catch on the test for exclusive-store failure!); I'll try to do better for the 64-bit code still to be added. I did restore the double duty of "intended", however, and rolled back the "memory" clobber (that's what "m"(*(T*)ptr) is for, because clobbering "memory" may prevent some valid optimisations, theoretically at least).
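To illustrate the two alternatives on a dummy asm statement (a sketch, not the actual gcc_arm.h code):

// (a) "memory" clobber: the compiler must assume that any memory may have
// changed, which can inhibit otherwise valid optimisations around the asm.
static inline void touch_with_memory_clobber( volatile int* ptr ) {
 __asm__ __volatile__( "" : : "r"(ptr) : "memory" );
}

// (b) an "m" operand on *ptr: only that object is considered read/written by
// the asm, so unrelated values cached in registers remain usable.
static inline void touch_with_m_operand( volatile int* ptr ) {
 __asm__ __volatile__( "" : "+m"(*ptr) );
}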

As I do not know enough about the internals of gas+gcc, I was afraid that an unused/unreferenced output parameter ("m"(*(T*)ptr)) might be ignored. Due to the bad experience with the 'stock' version I used the memory clobber for safety. But I totally agree that the memory clobber lacks the possibility of specifying a distinct address. So if the distinct output operand really works, it is clearly the better solution.
Now going to play with Realview Compiler and TBB...

--L

Raf Schietekat

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

Some traction on PA-RISC/HP-UX/g++ (most of the tests and examples should work), added support for ARM/g++ (64-bit support ready for testing) and {ESA/390,z/Architecture}/g++, parallel_for examples now show up on PowerPC-based Mac OS X (x86-specific flags now only selectively added).

Question: why does (only) the polygon overlay example remain black on x86/Ubuntu?


Raf Schietekat

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

Scalable memory allocator now works on PA-RISC/HP-UX/aCC.

(Added) Please note that I have abandoned my goal of porting TBB itself to aCC, a poster child for open source.


Raf Schietekat

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

Some corrections and ideas for the scalable memory allocator, and tracking some changes by Intel's TBB team.

Corrections include: alignment guarantees, portability.

The ideas are about decreasing memory overhead (adding bins for sizes 2 and 4, eliminating most of the overhead for allocation sizes of about 8000 to 16000 bytes, potentially decreasing overhead because of the alignment needs for large objects, or eliminating those needs altogether). Because it is not (always) obvious how this will affect performance and whether it is even effective, preprocessor switches control which changes are active, and you are invited to evaluate the results (see __TBB_USE_... in src/tbbmalloc/MemoryAllocator.cpp). The __TBB_USE_MAGIC option is incomplete: first the safe probing should be validated for this to be a viable approach.


Raf Schietekat

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

Some new atomic functionality (compare-and-swap with spurious failure) but with a twist (different signature, pedantic orthogonalism).

Spurious failure isn't as bad as the name would indicate: it just means that the function does not have to keep trying (spinning) until it can come up with a better excuse than that the other threads got in the way, i.e., the observed value does not have to differ from the expected value. On some architectures this does not matter (x86 gets/simulates a lock and either succeeds or fails because the observed value is actually different), on others it does, and where it matters one might as well unravel the compound loop of spinning on a spinning compare_and_swap, which also should allow for a more rational backoff policy (or so I hope).

The twist is to use a signature

bool compare_and_store(value_type &comparand, value_type intended, bool wait = true);

where the observed value is exchanged for the expected value in "comparand", instead of being the return value (TBB style) or being exchanged for the intended value as is often done (it seems more user-friendly to me to keep "comparand" continuously updated, with x86 and ESA/390-z/Architecture as a precedent). The "wait" parameter determines whether spurious failure is disallowed (the default mimics compare_and_swap). This is part of an orthogonal {compare,fetch}_and_{store,add,and,or,xor}() family of operations (just for kicks, I also added {add,and,or,xor}_and_fetch() with "prefix" functionality; the one with store would probably not make sense even though it does the same as operator=), with compare_and_swap now somewhat superfluous but not yet deprecated (pending evaluation, but some uses have already been replaced with a specific expectation to improve performance).

Here's a small illustration in pseudo code, ignoring backoff, memory semantics, etc., showing a default implementation for one of the other operations:

int fetch_and_add(int addend) {
 int comparand = load(); // expected/observed, initial value not critical for correctness
 while( !compare_and_store( comparand, comparand+addend, irrelevant ) );
 return comparand;
}

As a mental exercise, try to rewrite this with the observed value either returned or replacing the intended value.
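For comparison, here is a sketch of the same loop written against a TBB-style compare_and_swap that returns the observed value (names illustrative):

int fetch_and_add(int addend) {
 int expected = load();
 for(;;) {
  int observed = compare_and_swap( expected+addend, expected ); // returns the previously observed value
  if( observed==expected ) return observed; // success
  expected = observed; // retry with the freshly observed value
 }
}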

So, what do you think?


Dmitry Vyukov

I think it's great!
IIRC, a CAS that updates the comparand is called something like 'IBM-style'.

What I am missing in the current C++0x is functions like atomic_{add,and,or,xor}, w/o any fetch at all. On the x86 architecture the 'lock' prefix can be applied to instructions like add, and, or, xor, and in some algorithms this is exactly what is needed: I don't need the previous value, I don't need a 'CAS loop', I just need an atomic operation.

Btw, if you want to make the interfaces more C++ish and have more guarantees about static dispatch, you can consider the following interfaces:

// how this is made in current C++0x (or Java):
bool compare_and_store_weak(value_type &comparand, value_type intended);
bool compare_and_store_strong(value_type &comparand, value_type intended);

or:

// or just static dispatch based on overloading:
bool compare_and_store(value_type &comparand, value_type intended); // strong version
bool compare_and_store(value_type &comparand, value_type intended, bool /*fake*/); // weak version

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Arch D. Robison (Intel)

I like the compare_and_store syntax.

On the spelling issue, I'd be inclined to follow the C++0x lead and use the _strong suffix for the strong case, and leave off the suffix for the weak case. My guess is that most algorithms can get by with the weak case, and that algorithms depending on the strong case are the unusual ones that need to call out their unusualness.

I suspect that once serious C++0x implementations start appearing, they will optimize "atomic_fetch_or(&x,c)" with an unused result to a "LOCK OR" instruction. It's too good an opportunity to pass up. Too bad the C++0x "axioms" feature isn't powerful enough to express this transform.

Raf Schietekat

I was wondering what default to use: the one that provides the current behaviour (for least opposition) with the least risk of unexpected results (no spurious failure assumed, perhaps because of blind conversion, tested on x86, executed on POWER/PowerPC etc.), or the one that arguably makes the most sense per se. I will happily change the default then (wait = false), caveat convertor.

I don't see what benefit omitting the fetch part would bring on current hardware. Maybe special hardware can use that, perhaps together with coalescing (or merging, or whatever it is called), but if the value has to be handled on the processor anyway? Or is it just about saving a register on (sparsely outfitted) x86, and how real a benefit is that? And what algorithms would it be used for? I was hesitant to push orthogonalism to the limit (ad nauseam), without seeing a real benefit myself, but now, with this validation, there's no stopping me anymore...

As for C++0x, I still have this pipe dream that, if I can show them a working implementation, they will amend their current ugly API proposal. Would this convince anyone, you think (something you couldn't do without template arguments for memory semantics):

template
int fetch_and_add(int addend) {
 int comparand = load();
 while( !compare_and_store( comparand, comparand+addend ) );
 return comparand;
}

It looks cool to me, but maybe it is inferior to specialisations that keep the fence outside the loop?

Arch D. Robison (Intel)

Raf_Schietekat:I don't see what benefit omitting the fetch part would bring on current hardware.

The benefit is that x86 has instructions that do atomic AND, XOR, and OR as long as no (user-visible) fetch is required. Once the fetch requirement is added, x86 needs a compare-and-swap loop with its attendant overheads.

spin_rw_mutex.cpp uses both atomic OR and atomic AND to manipulate some flag bits.
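A sketch of that difference (hypothetical helper, GCC inline asm, AT&T syntax assumed): without a fetch, one locked instruction suffices, whereas returning the old value requires a compare-and-swap loop (or LOCK XADD in the case of add).

#include <stdint.h>

// No-fetch atomic OR: a single locked instruction, no loop.
static inline void atomic_or_no_fetch( volatile uint32_t* flags, uint32_t mask ) {
 __asm__ __volatile__( "lock orl %1, %0"
                       : "+m"(*flags)
                       : "ir"(mask)
                       : "cc", "memory" );
}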

Raf Schietekat

I've been regretting all day that I didn't take just a little longer to think about what I wrote.

No-fetch and/or: no red-face smiley available?

With the sample of code, I was mixing up two things, one of them misstated. The first is a straight question about how costly a repeated fence operation is compared to a single fence, especially in a loop. The second is that static parameters apparently have all the flexibility required (unlike an affix), while they are also guaranteed to be "optimised", whereas with runtime parameters this seems far less certain to say the least: any runtime variation of the value is almost surely a bug that I would like the compiler to rule out for me, and how would the compiler be able to provide equally efficient compiled code, especially if it has to come up with it ASAP (in a debug build)?

Dmitry Vyukov
MADadrobiso:

I suspect that once serious C++0x implementations start appearing, they will optimize "atomic_fetch_or(&x,c)" with an unused result to a "LOCK OR" instruction. It's too good an opportunity to pass up.

But will they optimize TBB's atomics? ;)

I think that the first releases of C++0x compilers won't optimize anything. So I think we will see the first atomic-optimizing compilers only around 2011. And there are more serious things than "LOCK OR". For example, a sequentially consistent atomic RMW in C++0x is not as strong as an x86 locked RMW. So one will often have to write a sequentially consistent atomic RMW followed by a sequentially consistent fence. And w/o optimization on x86 this will result in a locked RMW followed by an mfence, whereas the intended x86 code sequence is just the locked RMW (because it effectively includes the trailing sequentially consistent fence). This can basically double the cost of synchronization primitives. The C++0x memory model and atomics library are so general that they just have to be backed up by a heavily optimizing compiler.
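A sketch of the pattern being described, in the eventual std::atomic spelling (illustrative only):

#include <atomic>

std::atomic<int> x(0);

void publish() {
    x.fetch_add(1, std::memory_order_seq_cst);           // a locked RMW on x86
    std::atomic_thread_fence(std::memory_order_seq_cst); // naive codegen emits an extra MFENCE here
    // An optimizing compiler could drop the fence, because the locked RMW
    // already acts as a full barrier on x86.
}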

Dmitry Vyukov
Raf_Schietekat:I was wondering what default to use: the one that provides the current behaviour (for least opposition) with the least risk of unexpected results (no spurious failure assumed, perhaps because of blind conversion, tested on x86, executed on POWER/PowerPC etc.), or the one that arguably makes the most sense per se. I will happily change the default then (wait = false), caveat convertor.

I think you can include into header files something like:

#ifndef I_AM_AWARE_OF_THE_CHANGED_SEMANTICS_OF_COMPARE_AND_SWAP
#error "Ah-ah-ah. It seems that you are not yet aware of the changed semantics of compare_and_swap. RTFM!"
#endif

What do you think?
:)

Dmitry Vyukov
Raf_Schietekat:
something you couldn't do without template arguments for memory semantics:

template
int fetch_and_add(int addend) {
 int comparand = load();
 while( !compare_and_store( comparand, comparand+addend ) );
 return comparand;
}

Hmmm... Why can't I write:

int fetch_add(int addend, memory_order mo) {
 int comparand = load(memory_order_relaxed);
 while( !compare_exchange( comparand, comparand+addend, mo ) );
 return comparand;
}

?

Dmitry Vyukov
Raf_Schietekat:
The first is a straight question about how costly a repeated fence operation is compared to a single fence, especially in a loop.

It depends on the architecture. On x86/Sparc it doesn't make any sense, just because you can't vary any fences.
On PPC/Itanium it makes sense, and if you don't want people looking at your implementation and saying "Oh, cr@p! I have to rewrite it manually!" :) you must take the fence out of the loop.
Some numbers running in my head: acquire is 25 cycles and release is 40 cycles on PPC, but I'm really not sure about those numbers.
It's an established practice in high-performance synchronization algorithms. For example, RCU in the Linux kernel tracks additional quiescent epochs just to remove acquire fence from mutex acquisition, and release fence from mutex release. SMR+RCU does the same thing. My asymmetric rw-mutex does the same thing (i.e. read_lock/read_unlock doesn't contain any memory barriers, not even acquire and release).
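In the pseudocode style used earlier in this thread, taking the fence out of the loop on a PPC/Itanium-like target might look as follows (hypothetical helpers, a sketch rather than an implementation):

int fetch_and_add(int addend) {
 release_fence(); // e.g. lwsync on PPC, issued once, before the loop
 int comparand = load_relaxed();
 while( !compare_and_store_relaxed( comparand, comparand+addend ) ); // no per-iteration fences
 acquire_fence(); // e.g. isync on PPC, again outside the loop
 return comparand;
}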

Raf Schietekat

One paragraph corresponds to each post from Dmitriy.

"For example, sequentially consistent atomic RMW in C++0x are not so strong as x86 locked RMW." I didn't use an mfence instruction for ordered RMW operations (like fetch_and_add), nor did TBB before me, so what's different in C++0x? Note that "Intel 64 Architecture Memory Ordering White Paper" 318147.pdf says "Locked instructions have a total order", or is this not valid for Intel-32 anyway (I think somebody said that it was despite the name)? What exactly changed from 486/Pentium onwards (see "Intel 64 and IA-32 Architectures Software Developers Manual/Volume 3A: System Programming Guide, Part 1")? What about non-Intel implementations?

It's probably overkill here.

See my correction: the technical argument is about runtime vs. static parameters, and how they're treated by the compiler (compile-time assurance that they're actually static, and optimisation). What do you think?

"On x86/Sparc it doesn't make any sense, just because you can't variate any fences." What do you mean? And what's different on PPC/Itanium? Are fence instructions fixed in duration regardless of the current state of the processor (I thought maybe the second one wouldn't have anything to do and so might be fairly cheap)? Are you saying that after a number of instructions not needing a fence it will be implicit anyway? Where is that documented, and can it be used here?

Dmitry Vyukov
Raf_Schietekat:
"For example, sequentially consistent atomic RMW in C++0x are not so strong as x86 locked RMW." I didn't use an mfence instruction for ordered RMW operations (like fetch_and_add), nor did TBB before me, so what's different in C++0x? Note that "Intel 64 Architecture Memory Ordering White Paper" 318147.pdf says "Locked instructions have a total order", or is this not valid for Intel-32 anyway (I think somebody said that it was despite the name)? What exactly changed from 486/Pentium onwards (see "Intel 64 and IA-32 Architectures Software Developers Manual/Volume 3A: System Programming Guide, Part 1")? What about non-Intel implementations?

The following code doesn't work under the current C++0x rules:

std::atomic<int> x, y; // both initially 0
void thread1() {
    x.fetch_add(1, std::memory_order_seq_cst);
    int r1 = y.load(std::memory_order_relaxed);
}
void thread2() {
    y.fetch_add(1, std::memory_order_seq_cst);
    int r2 = x.load(std::memory_order_relaxed);
}

Doesn't work in the sense that (r1 == 0) && (r2 == 0) is a possible outcome.

Because C++0x's sequentially consistent atomic RMW is not as strong as an x86 locked RMW: basically, a C++0x seq_cst atomic RMW doesn't include a full fence. So sometimes you will have to insert an additional seq_cst fence after a seq_cst atomic RMW, just to satisfy the C++0x formal rules.
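A sketch of that extra fence applied to the example above (the same addition would be needed in thread2):

void thread1() {
    x.fetch_add(1, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst); // added only to satisfy the formal rules
    int r1 = y.load(std::memory_order_relaxed);          // now (r1 == 0) && (r2 == 0) is excluded
}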

Note that C++0x's seq_cst atomic RMWs form a total order, and they are both release and acquire. But the example still doesn't work.

Dmitry Vyukov
Raf_Schietekat:
See my correction: the technical argument is about runtime vs. static parameters, and how they're treated by the compiler (compile-time assurance that they're actually static, and optimisation). What do you think?

Personally I'm not concerned with C interoperability, and I like advanced C++, so I am not objecting to memory ordering as a template parameter.
But the C++ committee is concerned with C interoperability.
Also note that the current C++0x atomics interface can be implemented in a smart way, so that it provides the guarantees of static dispatch. Something like:

typedef int memory_order;

struct memory_order_relaxed_t {
    operator int () const {
        return 1;
    }
};

struct memory_order_acquire_t {
    operator int () const {
        return 2;
    }
};

extern memory_order_relaxed_t memory_order_relaxed;
extern memory_order_acquire_t memory_order_acquire;

// general implementation that uses dynamic dispatch
int atomic_exchange(..., memory_order);

// static specialization for memory_order_relaxed_t
int atomic_exchange(..., memory_order_relaxed_t);

// static specialization for memory_order_acquire_t
int atomic_exchange(..., memory_order_acquire_t);


Now, if I specify the memory ordering statically (likely), then the implementation will be chosen statically. If I specify it dynamically, then the implementation will be chosen dynamically.
It's a way of killing two birds with one stone.
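A usage sketch (concrete first parameters assumed purely for illustration):

int x = 0;
memory_order mo = memory_order_acquire;                 // the tag converts to int at run time

int a = atomic_exchange( &x, 1, memory_order_relaxed ); // picks the relaxed_t overload statically
int b = atomic_exchange( &x, 1, mo );                   // falls back to the dynamic-dispatch overload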

Dmitry Vyukov
Raf_Schietekat:
"On x86/Sparc it doesn't make any sense, just because you can't variate any fences." What do you mean?

I mean that on x86 an atomic RMW is always sequentially consistent, a load is always acquire, and a store is always release. So one has no physical possibility of moving a fence out of the loop; the fence will always be part of the operation.

Raf_Schietekat:
And what's different on PPC/Itanium?

On PPC/Itanium one is able to separate the fence from the operation. So at least it's possible to move the fence out of the loop.

Raf_Schietekat:

Are fence instructions fixed in duration regardless of the current state of the processor (I thought maybe the second one wouldn't have anything to do and so might be fairly cheap)?

I think it's highly architecture-dependent. I can only speak about Pentium 4 processors, where I conducted some measurements. I concluded that on Pentium 4 an atomic RMW and an mfence always take the same time, no matter what state the processor is currently in.

Raf_Schietekat:

Are you saying that after a number of instructions not needing a fence it will be implicit anyway? Where is that documented, and can it be used here?

I don't get you. Can you elaborate?

Raf Schietekat

Dmitriy, again merging a few messages...

1. What exactly is the difference between C++0x acq_rel and seq_cst if not making the example work?

2. I think that C interoperability is the root of all evil (you can count on evil to have two roots!). It virtually doomed CORBA all by itself even if Microsoft had not been around to shun it (I still feel far from confident, and I wrote a working implementation...). Upward compatibility is great, downward compatibility is a giant mistake. IMHO anyway.

OK, I remember now that you wrote before about how this can still provide static dispatch. But then it's still a quality-of-implementation question, and what would be a plausible scenario to ever allow runtime dispatch for memory semantics (as in a language can give you more by allowing you to do less)?

3. SPARC: "A compare-and-swap operation does not imply any memory barrier semantics." So it's like PPC/Itanium.

BTW, in the patch I never (intentionally) put a fence operation inside a loop.

To elaborate: I was wondering what you meant by "quiescent epochs just to remove acquire fence from mutex acquisition".

Dmitry Vyukov
Raf_Schietekat:
1. What exactly is the difference between C++0x acq_rel and seq_cst if not making the example work?

seq_cst operations participate in total order over all seq_cst operations. Period.
This fact by itself doesn't make the example work. Personally, I find it quite counter-intuitive. I was discussing this issue with committee members; they consider it correct behaviour. Operations are local. Fences are global. For me it's still counter-intuitive.

Dmitry Vyukov
Raf_Schietekat:
SPARC: "A compare-and-swap operation does not imply any memory barrier semantics."

Does it relate to TSO mode?

Raf Schietekat

"seq_cst operations participate in total order over all seq_cst operations. Period." Yes, that seems to be what the proposed standard text says. So is this like Java volatile, where volatiles are mutually sequentially consistent, but where you still need a causal chain for other operations to be synchronised? Isn't Java being criticised for this (too little to be helpful, too much to be economical), or did I imagine that? What does it "do" for a program? And I thought that this example was specifically why they chose seq_cst as the default semantics? Anyway, this is not what I implemented for "ordered". So what now...

The quote is from http://www.sparc.org/standards/SPARCV9.pdf; I saw nothing in "A.9 Compare and Swap" referring to the memory model in use.

Dmitry Vyukov
Raf_Schietekat:
The quote is from http://www.sparc.org/standards/SPARCV9.pdf; I saw nothing in "A.9 Compare and Swap" referring to the memory model in use.

8.4.4.3 Total Store Order (TSO)
The rules for TSO are:
- Loads are blocking and ordered with respect to earlier loads.
- Stores are ordered with respect to stores.
- Atomic load-stores are ordered with respect to loads and stores.
Dmitry Vyukov
randomizer:
Raf_Schietekat:
The quote is from http://www.sparc.org/standards/SPARCV9.pdf; I saw nothing in "A.9 Compare and Swap" referring to the memory model in use.

8.4.4.3 Total Store Order (TSO)
The rules for TSO are:
- Loads are blocking and ordered with respect to earlier loads.
- Stores are ordered with respect to stores.
- Atomic load-stores are ordered with respect to loads and stores.

I forgot to mention that in practice Sparc nearly always runs in TSO mode. Other modes (RMO, PSO) don't even have to be supported.

Dmitry Vyukov
Raf_Schietekat:
To elaborate: I was wondering what you meant by "quiescent epochs just to remove acquire fence from mutex acquisition".

What exactly to elaborate?
RCU waits for an additional hardware-level quiescent epoch to allow RCU critical-section acquire/release operations to be implemented w/o any fences at all.

Raf Schietekat

"I forget to mention, that in practice Sparc nearly always run in TSO mode. Other modes (RMO, PMO) don't even have to be supported." http://www.linuxjournal.com/article/8212 says "Solaris on SPARC uses total-store order (TSO); however, Linux runs SPARC in relaxed-memory order (RMO) mode.".

"What exactly to elaborate?" The first "quiescent epochs" related to computing that shows up in Google is this discussion (I gave up after a few pages). With RCU or "Linux kernel" it's the only hit.

Dmitry Vyukov
Raf_Schietekat:"I forget to mention, that in practice Sparc nearly always run in TSO mode. Other modes (RMO, PMO) don't even have to be supported." http://www.linuxjournal.com/article/8212 says "Solaris on SPARC uses total-store order (TSO); however, Linux runs SPARC in relaxed-memory order (RMO) mode.".

This must be a mistake.
See /linux-2.6.25.1/include/asm-sparc/system.h:

/* XXX Change this if we ever use a PSO mode kernel. */
#define mb() __asm__ __volatile__ ("" : : : "memory")
#define rmb() mb()
#define wmb() mb()
#define read_barrier_depends() do { } while(0)
#define set_mb(__var, __value) do { __var = __value; mb(); } while(0)
#define smp_mb() __asm__ __volatile__("":::"memory")
#define smp_rmb() __asm__ __volatile__("":::"memory")
#define smp_wmb() __asm__ __volatile__("":::"memory")
#define smp_read_barrier_depends() do { } while(0)

membars are not used at all!

Dmitry Vyukov
Raf_Schietekat:
"What exactly to elaborate?" The first "quiescent epochs" related to computing that shows up in Google is this discussion (I gave up after a few pages). With RCU or "Linux kernel" it's the only hit.

Sorry, "quiescent epoch" is not precise term. Search by "quiescent period".
Or better see Paul McKenney's RCU page:
http://www.rdrop.com/users/paulmck/RCU/

Raf Schietekat

"This must be a mistake." Thanks, I've added a note to gcc_sparc.h to investigate further.

Thanks also for the link, I'll have a look. Still, the author doesn't even know which memory model is being used for Linux...

Dmitry Vyukov
Raf_Schietekat:"This must be a mistake." Thanks, I've added a note to gcc_sparc.h to investigate further.

I've consulted with Paul McKenney, and he has resolved the issue.
32-bit Linux runs Sparc in TSO mode, and 64-bit Linux runs Sparc in RMO.
The 64-bit headers (include/asm-sparc64/system.h) look like:

#define mb() membar_safe("#LoadLoad | #LoadStore | #StoreStore | #StoreLoad")
#define rmb() membar_safe("#LoadLoad")
#define wmb() membar_safe("#StoreStore")
#define membar_storeload() membar_safe("#StoreLoad")
#define membar_storeload_storestore() membar_safe("#StoreLoad | #StoreStore")
#define membar_storeload_loadload() membar_safe("#StoreLoad | #LoadLoad")
#define membar_storestore_loadstore() membar_safe("#StoreStore | #LoadStore")

So I think you can relax the fences somewhat for that case (#if __linux__ || __linux).

Raf Schietekat

Now that I would not have expected... But where is the commitment that 32-bit Linux will forever run in TSO, because 64-bit seems to have a "this is the future" ring to it? Otherwise, what would be a safe way to determine this (a macro doesn't help if a program is executed on a later kernel)? Until this is clear I think I'll just leave it as it is.

Raf Schietekat

About 8.4.4.3, can you explain why it is not as follows:
"Thus, TSO ensures that:
- Each load instruction behaves as if it were followed by a MEMBAR with a mask value of #LoadLoad|#LoadStore.
- Each store instruction behaves as if it were preceded (rather than followed) by a MEMBAR with a mask of #LoadStore|#StoreStore.
- Each atomic load-store behaves as if it were preceded (rather than followed) by a MEMBAR with a mask of #LoadStore|#StoreStore and followed by a MEMBAR with a mask value of #LoadLoad|#LoadStore."

Also, it begs the question where #StoreLoad comes into the picture, i.e., it seems that TSO implements the rel_acq policy of the patch (more or less), but it seems that "ordered" has to be explicitly added (but where exactly? all the relations both before and after the ordered instruction?). That would also mean that on SPARC it would indeed make sense to wonder about keeping a fence instruction out of the loop (see 30263829).

Dmitry Vyukov
Raf_Schietekat:Now that I would not have expected... But where is the commitment that 32-bit Linux will forever run in TSO, because 64-bit seems to have a "this is the future" ring to it? Otherwise, what would be a safe way to determine this (a macro doesn't help if a program is executed on a later kernel)? Until this is clear I think I'll just leave it as it is.

The Rule of Least Surprise is Unix philosophy :) And it would be quite a radical change...

I don't know. There is the PSTATE.MM field, which determines the memory model, but afaik it is only accessible from privileged mode.

Btw, theoretically they could also change the endianness of data. But I think that's not the kind of thing they would change...

Dmitry Vyukov
Raf_Schietekat:
- Each store instruction behaves as if it were preceded (rather than followed) by a MEMBAR with a mask of #LoadStore|#StoreStore.

If a StoreStore membar is associated with *EVERY* store, then it doesn't make any difference whether the membar follows or precedes :)

Why is there no LoadStore? I think it's not needed. A load either (1) does not need to be ordered wrt the store, or (2) will be implicitly ordered by data/control dependencies. Try to come up with a counter-example.

Dmitry Vyukov
Raf_Schietekat:
- Each atomic load-store behaves as if it were preceded (rather than followed) by a MEMBAR with a mask of #LoadStore|#StoreStore and followed by a MEMBAR with a mask value of #LoadLoad|#LoadStore."

Each atomic load-store behaves as if it were followed by a MEMBAR with a mask of #LoadStore|#StoreStore|#LoadLoad.
I think it's enough. RMW is load and store at the same time, so:
Loads can't hoist above RMW because of #LoadLoad.
Stores can't hoist above RMW because of #StoreStore.
Loads can't sink below RMW because loads are followed by #LoadLoad.
Stores can't sink below RMW because stores are followed by #StoreStore.

Dmitry Vyukov
Raf_Schietekat:
Also, it begs the question where #StoreLoad comes into the picture, i.e., it seems that TSO implements the rel_acq policy of the patch (more or less), but it seems that "ordered" has to be explicitly added (but where exactly? all the relations both before and after the ordered instruction?). That would also mean that on SPARC it would indeed make sense to wonder about keeping a fence instruction out of the loop (see 30263829).

I believe that Sparc TSO is effectively the same as x86. So an ordered store maps to XCHG, an ordered load maps to a plain load, and an ordered RMW maps to an RMW (w/o any membars). So on TSO there is just nothing that can be moved out of the loop.

On RMO membars definitely can be moved out of the loop.
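For example, the ordered store might be sketched like this (hypothetical helper, GCC inline asm, AT&T syntax; XCHG with a memory operand is implicitly locked):

static inline void store_ordered( volatile int* ptr, int value ) {
 __asm__ __volatile__( "xchgl %0, %1"
                       : "+r"(value), "+m"(*ptr)
                       :
                       : "memory" );
}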

Dmitry Vyukov
randomizer:
Each atomic load-store behaves as if it were followed by a MEMBAR with a mask of #LoadStore|#StoreStore|#LoadLoad.
I think it's enough. RMW is load and store at the same time, so:
Loads can't hoist above RMW because of #LoadLoad.
Stores can't hoist above RMW because of #StoreStore.
Loads can't sink below RMW because loads are followed by #LoadLoad.
Stores can't sink below RMW because stores are followed by #StoreStore.

However, I am not 100% sure here. There is no implied #StoreLoad membar, which is the most critical (and expensive). So can loads hoist above RMW?

The implementation of XCHG in the Linux kernel (for TSO) suggests that the answer is No: loads can't hoist above RMW.

static inline unsigned long xchg_u32(__volatile__ unsigned long *m, unsigned long val)
{
#ifdef CONFIG_SMP
    __asm__ __volatile__("swap [%2], %0"
                         : "=&r" (val)
                         : "0" (val), "r" (m)
                         : "memory");
    return val;
#else

Raf Schietekat

"If StoreStore membar associated with *EVERY* store, then it doesn't make any sense whether membar is followed or preceded :)"

It's still distracting (as evidenced by the need to prove instead of being able to just see it), and doing the same in RMO will not work: 2 reasons why this is an inappropriate way of presenting things.

"Why there is no LoadStore? I think that it's non needed. Load is either (1) non needed to be orderer wrt store, or (2) will be implicitly orderer by data/control dependencies. Try to come up with counter-example."

I always wondered why there was a #LoadStore at all, but thought that maybe there might be implementations with delayed loads that would need it, so I already replaced a comment about it with an actual use of #LoadStore (see next version), which should at the worst be harmless. Not being able to come up with a counter-example is no proof, of course, so I'll just leave it like that.

If things were presented as I suggest, it would be easier to see that SPARC RMW is only "rel_acq", not "ordered" like x86 RMW, and the question of loads "hoisting" above RMW would not even come up (sure they can "hoist", no Dekker's with just xchg_u32()...). Right or wrong?

BTW, about C++0x seq_cst: could it be that the strange semantics are inspired by the desire to have only one #StoreLoad per instruction?

(Added) Here's an example of what can go horribly wrong with "D.7.2 Dekkers Algorithm": "This code does not guarantee that any processor can enter (that requires a retry mechanism which is omitted here), but it does guarantee mutual exclusion, which means that it is impossible that each processor finds the others lock idle ( = 0) when it enters cthe ritical section." (sic).

Raf Schietekat

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README.atomic for applying the patch.

New tbb::atomic operations add, tbb_and, tbb_or, tbb_xor.

Note that the internal implementations were already in place, they were just never expressed in tbb::atomic.

(Corrected on 2008-09-21) "Spurious failure" now the default, added {compare_and_,}{subtract,increment,decrement} for orthogonality.
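A usage sketch of the no-fetch operations (only the names come from this post; the exact signatures are assumed here):

tbb::atomic<unsigned> flags;
flags = 0;
flags.tbb_or(0x4u);    // set a flag bit without needing the previous value
flags.tbb_and(~0x4u);  // clear it again
flags.add(1u);         // no-fetch add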


Raf Schietekat

So... how about relaxed<(release/acquire)

Dmitry Vyukov
Raf_Schietekat:"If StoreStore membar associated with *EVERY* store, then it doesn't make any sense whether membar is followed or preceded :)"

It's still distracting (as evidenced by the need to prove instead of being able to just see it), and doing the same in RMO will not work: 2 reasons why this is an inappropriate way of presenting things.

I agree that it's not a very intuitive definition.
I just know that on Sparc TSO a store is a store-release and a load is a load-acquire. I don't know what they were thinking about when writing those definitions. Maybe it's how TSO is implemented in hardware, maybe it's the most efficient way to achieve the guarantees required for TSO, and they don't need formal proofs of correctness, because they just know that it will work as expected...

Dmitry Vyukov
Raf_Schietekat:
I always wondered why there was a #LoadStore at all, but thought that maybe there might be implementations with delayed loads that would need it, so I already replaced a comment about it with an actual use of #LoadStore (see next version), which should at the worst be harmless. Not being able to come up with a counter-example is no proof, of course, so I'll just leave it like that.

For a mathematician it's not a proof, but for an engineer it's a proof... Joke :)
If you try hard to come up with a counter-example and see that you are not able to, then at least it will give you some degree of confidence... This is how I got confidence in this question :)

Informally: loads are invisible to other threads. Other threads can't determine whether a load happened before a store or after it. In order for other threads to be able to determine whether the load was before or after the store, there must be another store which depends on the load (whether data-dependent or control-dependent). Because this store depends on the load, the load can't sink below this store. This store in turn can't sink below the atomic store-release, because the store-release contains #StoreStore. So transitively the load can't sink below the atomic store-release... or it just doesn't matter whether the load sinks below or not (because other threads have no means of determining when the load was executed).

Consider:
r1 = g_x; // can't sink below store to g_y because of data dependency
g_y = r1;
#StoreLoad
g_flag = 1;
#StoreLoad

Dmitry Vyukov
Raf_Schietekat:
If things were presented as I suggest, it would be easier to see that SPARC RMW is only "rel_acq", not "ordered" like x86 RMW, and the question of loads "hoisting" above RMW would not even come up (sure they can "hoist", no Dekker's with just xchg_u32()...). Right or wrong?

*IMVHO* Dekker's algorithm can be implemented on Sparc TSO only with atomic RMW's, no need for additional #StoreLoad barriers.
The sources of the Linux kernel somehow confirm this point: atomic RMWs in the Linux kernel are not augmented with fences. But I believe that they want sequential consistency all the same.

Raf_Schietekat:
BTW, about C++0x seq_cst: could it be that the strange semantics are inspired by the desire to have only one #StoreLoad per instruction?

As I understand semantics of seq_cst are inspired by PPC. But I'm not sure, they are very secretive on this :)

Raf Schietekat

"Other threads can't determine whether load was before store or after." Not immediately, obviously, but why would that be a requirement? Given 2 threads A and B, globals a and b both initially clear, and lock l, if each thread acquires the lock, writes its ID into its own global, probes the other global, clears its own global and releases the lock, shouldn't both probes be clear when they compare notes later on? Yet without #LoadStore as part of a release fence, one might have seen the other thread's ID, which seems rather weird.

"*IMVHO* Dekker's algorithm can be implemented on Sparc TSO only with atomic RMW's, no need for additional #StoreLoad barriers." I'm officially confused: why then doesn't 8.4.4.3 document any implicit #StoreLoad?

"As I understand semantics of seq_cst are inspired by PPC." There I suppose it would mean only one SYNC per operation as opposed to two, as is currently implemented here for "ordered"; the other side can be just LWSYNC. "But I'm not sure, they are very secretive on this :)" Why? Am I mistaken?
