Additions to atomic<T>

Arch D. Robison (Intel):

The discussion in http://software.intel.com/en-us/forums//topic/58670 raises the point that TBB's atomic operations are lacking:



  1. A way to express an unfenced atomic load

  2. A way to express a full fence (i.e. fence with both acquire and release semantics)

(2) can be expressed pathetically as atomic::fetch_and_add(0).


I'm wondering about opinions on syntax and generality for the additions suggested above. The simplest approach would be to add:



  1. T atomic<T>::unfenced_load() const

  2. void atomic<T>::fence() const

Another approach would be to add T atomic<T>::load<M>() const, and allow M to specify acquire, release, full fence, or unfenced. But my opinion is that drags in much pointless complexity, because it adds "full_fence" and "unfenced" to the possibilities for M, which is a template parameter for other atomic ops too, and the "release" variant for a load seems of dubious value.


Would the "simplest approach" be enough?


- Arch


Raf Schietekat:

I've actually been working on just such a proposal (which compiles but needs some more work after stalling on bug 124), because the C++0x people are getting ready to converge on some far less elegant syntax than what TBB currently has (kudos for that!), and TBB is lacking in the memory-semantics department. I would stick to the templated-operations approach (no separate names, or function arguments like the current C++0x proposal, which will hopefully be optimised away?), with default memory semantics for each kind of operation, and perhaps, e.g., load specialised as only a declaration but without a definition. Let the compiler worry about the complexities, I say, as long as the programmer experience is pleasant enough. Just be careful about equating a full fence with one "with both acquire and release semantics", because it is not the same: even for IA-32 and Intel 64 (I have never quite figured out how these two relate), SFENCE+LFENCE (or LFENCE+SFENCE) is not an MFENCE (but I guess TBB targets more processors than have MFENCE as a cheaper alternative than CPUID for full serialisation?), and the current proposal for C++0x actually distinguishes acq_rel (I don't know why in that order) and seq_cst (full serialisation).

A probably major difference of opinion is that C++0x chooses a full fence as the default for load and store, because of one infamous example, whereas Intel seems totally committed to ordered loads on the one hand and ordered stores on the other, with no way to break free even if the programmer wants to (and a locked increment actually implying a full fence!), and I still have not convinced myself that I understand all the implications. It's actually a schizophrenic situation within the C++0x scene, because they now have this full-fence default, even though there are probably some who consider raw/unfenced/relaxed to be the main usage model for atomics.

Should I contribute my current version now? It's still a work in progress, though, even if it currently builds just fine. My opinion is that the main objective should be to improve what C++0x will impose on us all, and time is short.

Arch D. Robison (Intel):

You are right that a full fence is not the same as "acquire; release" or "release; acquire". The latter can be reordered into the former, and the former allows "x; acquire; release; y" to be reordered as "acquire; x; y; release". I should have been more specific and said that a "full fence" is the same as atomic execution of acquire+release.


In my opinion, LFENCE and SFENCE are regrettable beasts from the Precambrian explosion of memory consistency models, before modern acquire/release models evolved.


From the viewpoint of the C++ 200x draft, Intel IA-32 and Intel 64 processors effectively have seq_cst semantics for all LOCK-prefixed operations (per the white paper), acquire semantics for ordinary loads, and release semantics for ordinary stores. Itanium has seq_cst for its fetch-and-add, exchange, and compare-and-swap.


For increment, you can break free with atomic<T>::fetch_and_increment<acquire> or the <release> variant. We don't have a fenceless variant in TBB because it seems to have limited use. Except for unfenced/raw loads, I don't see major utility in unfenced atomic operations. Counters for statistics seem to be the oft-repeated example.


If we were to add support for fenceless variants, we could add a fenceless variant, plus template members for load and store. There's something to be said for adding named "load" and "store" member templates to allow fencing variants. My concern is that a lot of these variants are silly. E.g., what use is a store-with-acquire or a load-with-release?


I'm curious to see your proposal.

Dmitry Vyukov:

If I understand you correctly (maybe it's just my English), you are misinterpreting std::memory_model_acq_rel. It's a full fence. It's not acquire+release in the sense of SFENCE+LFENCE on x86. Yeah, the name 'acq_rel' is quite confusing.
And std::memory_model_seq_cst is not just a full fence. It's a full fence plus a global operation order. So the default memory order in C++0x is not a full fence; it's a full fence plus a global operation order.

About relaxed atomics. There is an extremely important case (besides statistics :) ): reference counting with basic thread safety:
void rc_acquire(rc_t* obj)
{
    obj->counter.increment(memory_model_relaxed);
}

Also, Paul McKenney, in his rationale for the C++0x memory model, provides the following example:

for (size_t i = 0; i != mailbox_count; ++i)
{
if (msg_t* m = atomic_xchg(&mailbox[i].head, 0, memory_model_relaxed))
{
// executing stand-alone fence only here if got message, not for every check
fence_acquire();
process(m);
}
}

This is also rationale for stand-alone fences.

About the default memory order in C++0x.
Well, yes, seq_cst is not intended for use in high-performance synchronization algorithms. Anyway, there is a rationale for making seq_cst the default. As I understand it, first you prototype the algorithm with seq_cst. When you see that it's working with seq_cst, you optionally and selectively start replacing seq_cst with weaker fences at critical points (on the fast path). On the slow path you can leave seq_cst. And seq_cst is undoubtedly far easier to reason about.


You can see Paul McKenney's proposal for memory model for C++0x:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2237.pdf
He concludes:

"This is a prioritized list of differences of this model versus ISOMM:
1. Provide standalone memory fences. This is crucial to avoid the introduction of unnecessary
ordering constraints. Our proposal has been to introduce only three forms of ordering constraints,
but an alternative is to include all possible fences (LoadLoad, LoadStore, StoreLoad & StoreStore)
to allow exploitation on hardware architectures that provide such primitives.
2. Do not require ordering on atomic operations over what is specified by their ordering constraints.
In particular, allow full reordering of raw atomic operations, and allow load-acquire operations to
be reordered ahead of preceding store-release operations.
3. Allow atomic operations to be removed if they are found to be redundant based on sequential
program analysis.
4. Define a mechanism to enable ordering of dependent memory operations, both through control
flow or data flow dependencies.
5. Define a mechanism to allow ordering of atomic operations without introducing any hardware
primitives."

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat:

About Arch's message, paragraph by paragraph:

A full fence defined this way leaves out sequential consistency, strictly speaking (memory_model_acq_rel vs. memory_model_seq_cst), which may or may not become visible on some processors (which ones?), though not on IA-32/Intel 64. Which processors might that be, and what are the performance considerations for making this (farfetched?) difference? Dmitriy?

Why regrettable, if they are hidden in locks and atomics? It seems a bigger problem that now IA-32/Intel 64 does not allow a way out to provide cheap raw/unfenced/relaxed atomic operations, especially involving locked operations, which now impose a full fence (also see below).

IA-32/Intel 64 also has seq_cst for XCHG (implicit lock signal), other than the more explicit serialisation instructions.

As long as any processors out there allow unfenced anything, it would seem presumptuous not to allow them to take advantage of it. I think that the statistics example may have been overused (it does seem like grasping at straws), because the bread-and-butter rationale would be reference-counted pointers (see another recent thread that commented on their slowness, which may or may not be caused by IA-32/Intel 64 imposing full-fence semantics?). Dmitriy also mentioned this, I see, and it seems a very strong reason.

The question is whether to allow all combinations orthogonally, or to impose any perceived wisdom on the user through undefined template specialisations. The challenge is to do this across all architectures (do they have incompatible sweet spots or can they agree?).

I've been tinkering with my code some more, but it seems that I don't have all the facts together yet (and the good weather is beckoning me to get outside), so most is still to be done. The end result should be simplicity, of course: the user gets some sensible default semantics, or chooses his own, and the code tries to get that result the cheapest way it can on the particular architecture.

That covers Arch's message. I wrote most of this before Dmitriy posted his message, so...

With a bit of elementary-particle physics, fences may exist in the following combinations: unfenced/acquire/release/release+acquire as one causality-related family, and unfenced/reload/flush/reload+flush as another sequential-consistency-related family. In addition to what their counterparts in the first family do, a reload will not move up across a flush, which gets rid of relativity (mixing my metaphors), but I don't know yet if that is all that is different. Then there's the matter of what it means to have a device that unidirectionally prevents reordering of anything (does it make sense at all?), the meaning of LoadStore, the rationale for stand-alone fences, a lot of other things and whether they are relevant or just there to confuse me... So I'm still wondering about such matters, and Dmitriy just sabotaged my attempt to limit my scope.

(Corrected) Removed some things in last paragraph.

Dmitry Vyukov:
Raf_Schietekat: About Arch's message, paragraph by paragraph:
A full fence defined this way leaves out sequential consistency, strictly speaking (memory_model_acq_rel vs. memory_model_seq_cst), which may or may not become visible on some processors (which ones?), though not on IA-32/Intel 64. Which processors might that be, and what are the performance considerations for making this (farfetched?) difference? Dmitriy?



I'm not ready to answer about processors, but there are definitely some processors without total store order (I think SPARC when not in TSO mode, maybe PPC, maybe ARM).
A full fence without sequential consistency can lead to very counter-intuitive results. I think seq_cst was added mainly to provide clear and easy-to-reason-about semantics, and to provide a tool for prototyping.

See page 11 in presentation "The Future of Concurrency in C++":
http://www.justsoftwaresolutions.co.uk/files/future_of_concurrency.pdf
You can replace release and acquire fences with acq_rel (full) fences - it will not change the output.
As you can see output is very counter-intuitive.

See page 12 where seq_cst is used. Result is intuitive.



Raf_Schietekat:
Why regrettable, if they are hidden in locks and atomics? It seems a bigger problem that now IA-32/Intel 64 does not allow a way out to provide cheap raw/unfenced/relaxed atomic operations, especially involving locked operations, which now impose a full fence (also see below).



Agree. But I don't think that we can change anything here :)


Raf_Schietekat:
The question is whether to allow all combinations orthogonally, or to impose any perceived wisdom on the user through undefined template specialisations. The challenge is to do this across all architectures (do they have incompatible sweet spots or can they agree?).




AFAIK, the current C++0x standard prohibits only store-acquire, store-acq_rel, load-release and load-acq_rel.
I don't know the rationale behind this.


Raf_Schietekat:
[...] fences may exist in the following combinations: [...] unfenced/reload/flush/reload+flush as another sequential consistency-related family.



Please describe the semantics of unfenced/reload/flush/reload+flush in more detail.

Dmitriy V'jukov
All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov:
Raf_Schietekat:
With a bit of elementary-particle physics, fences may exist in the following combinations: unfenced/acquire/release/release+acquire as one causality-related family, and unfenced/reload/flush/reload+flush as another sequential consistency-related family. In addition to what their counterparts in the first family do, a reload will not move up across a flush, which gets rid of relativity (mixing my metaphores), but I don't know yet if that is all that is different. Then there's the matter of what it means to have a device that unidirectionally prevents reordering of anything (does it make sense at all?), the meaning of LoadStore, rationale for stand-alone fences, a lot of other things and whether they are relevant or just there to confuse me... So I'm still wondering about such matters, and Dmitriy just sabotaged my attempt to limit my scope.




For now I stick to the following approach.
First of all, whenever possible I use C++0x interfaces, semantics and names - for example, relaxed rather than unfenced/naked. It seems that for now C++0x is sufficiently stable wrt the memory model and atomics. Current C++0x draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2588.pdf

In my personal opinion, C++0x provides too restrictive semantics. Luckily it's not only my opinion, so I also take as a basis the proposals for the C++0x memory model/atomics from Peter Dimov, Alex Terekhov, and Paul McKenney:
http://groups.google.com/group/comp.programming.threads/msg/b43cd6c9411c95b9
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2237.pdf
I hope that they thought enough before making their proposals, so I don't have to think :)

Initially I was going to implement atomics with templates:
atomic_store<memory_model_something>(&x, 1);
I was thinking that it was the only efficient method, and that the following approach would inevitably add unnecessary overhead:
atomic_store(&x, 1, memory_model_something);

Finally I figured out how to implement atomics with the C++0x interface (atomic_store(&x, 1, memory_model_something)) in an efficient and comfortable way. Here is a code sketch:



/*** Pay attention to inheritance ***/

// fake root
struct memory_order {};

/***** full fences *****/

// full fence + total order
struct mo_seq_cst_t : memory_order {};
// full fence
struct mo_acq_rel_t : mo_seq_cst_t {};

/***** release fences *****/

// classic release
struct mo_release_t : mo_acq_rel_t {};
// release not affecting stores
struct mo_release_load_t : mo_release_t {};
// release not affecting loads
struct mo_release_store_t : mo_release_t {};

/***** relaxed fences *****/

// does not order memory
struct mo_relaxed_t : mo_acq_rel_t {};


extern mo_seq_cst_t memory_order_seq_cst;
extern mo_acq_rel_t memory_order_acq_rel;

extern mo_release_t memory_order_release;
extern mo_release_load_t memory_order_release_load;
extern mo_release_store_t memory_order_release_store;


/*** Implementation for MSVC/x86 ***/

// Only store - other operations stripped


class atomic32
{
public:
typedef unsigned value_type;

explicit atomic32(value_type v = value_type())
: value_(v)
{
}

void store(value_type v, memory_order = memory_order_seq_cst) volatile
{
_InterlockedExchange((long*)&value_, v);
}

void store(value_type v, mo_release_t) volatile
{
_ReadWriteBarrier();
value_ = v;
}

void store(value_type v, mo_relaxed_t) volatile
{
value_ = v;
}

private:
value_type volatile value_;

atomic32(atomic32 const&);
atomic32& operator = (atomic32 const&);

/*** forbidden ***/
void store(value_type v, mo_acq_rel_t) volatile;
};

Basic rule: I always provide an implementation for seq_cst. Then I provide specialized implementations if they can be implemented in a more effective way on the current architecture (release). Then I move forbidden combinations to the private section (acq_rel).
All other things are handled by inheritance. I.e., store with mo_acquire_t is also forbidden. Store with mo_release_store_t uses the implementation for store with mo_release_t.

Also I add a bunch of compiler fences (affects only code generation):

/***** compiler fences *****/

struct co_acq_rel_t : mo_acq_rel_t {};
struct co_acquire_t : co_acq_rel_t {};
struct co_release_t : co_acq_rel_t {};

I'm not sure about bidirectional fences (store-store (sfence), load-load (lfence) ). For now I just comment them out.

What do you think?

Dmitriy V'jukov
All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat:

That p. 11 example is not to the point for illustrating acq_rel, I would say. As for allowing all combinations orthogonally vs. imposing the designer's wisdom I might be convinced by the argument that sometimes a language gives you more by the things it does not allow you to do (but I can't say now where I've heard that before). In the second family I mentioned, reload would mean "forget any cached reads" (not actually reload, but the best word I could find was what it would imply later on), and flush would mean "write all dirty entries to memory now and wait until finished"; I don't know how useful flush and reload might be by themselves, though.

Oh boy, Alexander Terekhov's list again... I tried to ignore that for lack of a legend (what do the entries all mean?), so I obviously don't know whether it's useful at all. So now it really is all back on the table. One problem is that too much detail might be too confusing and lead to bugs.

What is the context for your atomics proposal?

I dislike the C++0x proposal because it looks too much like plain C, and I remember how an active desire (driven by IBM, I think) to be compatible with plain C ruined another standard, the C++ CORBA mapping (which otherwise might have had things like a sequence template class, and not have suffered debilitating leaks by using something like auto_ptr sources and sinks instead of plain pointers). Here it's not as problematic, but it just looks ugly, and elegance is a real goal, and I would hate it if the current C++0x atomics proposal goes to press like this (it's my civic duty to oppose it).

Maybe you don't need to use the memory-order argument in a switch or anything, but it will still be part of the call (right?), whereas a template argument is completely invisible. Is there any reason not to use a templated function like TBB does, where template specialisation could do the same as having different overloads? There's a whole lot of things that can be done with template metaprogramming... I was thinking of having architecture-specific traits for the operations, and using those to select appropriate "packaging" implementations for the different memory semantics.

Arch D. Robison (Intel):

With respect to the reference-counting example, my impression is that it is typically unsafe to leave out the fence, because the increment of the reference count typically implies that a thread is acquiring rights to read a shared reference-counted object.

Dmitry Vyukov:
MADadrobiso: With respect to the reference-counting example, my impression is that it is typically unsafe to leave out the fence, because the increment of the reference count typically implies that a thread is acquiring rights to read a shared reference-counted object.




Provided so-called 'basic thread safety' (a thread is permitted to acquire a reference to an object *only* if it already has one; a very common case, for example boost::shared_ptr), it's OK to remove the fence, because the thread isn't actually acquiring anything.
Provided so-called 'strong thread safety' (a thread is permitted to acquire a reference to an object even if it doesn't have one; a very uncommon case), it's unsafe to remove the fence, because the thread is actually acquiring access to the object.


Dmitriy V'jukov
All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat:

A shared pointer cannot protect its referent, because its control ends when operator* or operator-> returns, so a shared referent needs to protect itself. The only valid consideration is how often any fence implicit to reference count manipulation is wasted, which is probably often enough to be a concern: just reading from the referent, destroying the pointer, ...

Dmitry Vyukov:
randomizer:
Provided so-called 'basic thread safety' (a thread is permitted to acquire a reference to an object *only* if it already has one; a very common case, for example boost::shared_ptr), it's OK to remove the fence, because the thread isn't actually acquiring anything.



In this case the mechanism which transfers the object between threads (e.g. a producer-consumer queue) must execute the acquire fence, because it is this mechanism that gives the thread access to the object.

Dmitriy V'jukov
All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov:
Raf_Schietekat:
That p. 11 example is not to the point for illustrating acq_rel, I would say.



It illustrates difference between acq_rel and seq_cst.

Raf_Schietekat:

As for allowing all combinations orthogonally vs. imposing the designer's wisdom I might be convinced by the argument that sometimes a language gives you more by the things it does not allow you to do (but I can't say now where I've heard that before).




I think you are right. This will eliminate at least a bit of complexity.
But I am thinking about special cases. For example, a sequence lock requires a load with memory_model_release_wrt_loads (or msync::slb in Terekhov's list).


Raf_Schietekat:

In the second family I mentioned, reload would mean "forget any cached reads" (not actually reload, but the best word I could find was what it would imply later on), and flush would mean "write all dirty entries to memory now and wait until finished"; I don't know how useful flush and reload might be by themselves, though.



It looks quite unusual and confusing...


Raf_Schietekat:
Oh boy, Alexander Terekhov's list again... I tried to ignore that for lack of a legend (what do the entries all mean?)




There is a legend. Imho the semantics are clear...


Raf_Schietekat:

, so I obviously don't know whether it's useful at all. So now it really is all back on the table. One problem is that too much detail might be too confusing and lead to bugs.



I think, that it's better to dive into as much detail as possible, and then try to climb as high as possible. It's the only way to choose the right altitude.


Raf_Schietekat:

What is the context for your atomics proposal?



None. It's just my library. Similar to TBB.


Raf_Schietekat:

I dislike the C++0x proposal because it looks too much like plain C, and I remember how an active desire (driven by IBM, I think) to be compatible with plain C ruined another standard, the C++ CORBA mapping (which otherwise might have had things like a sequence template class, and not have suffered debilitating leaks by using something like auto_ptr sources and sinks instead of plain pointers). Here it's not as problematic, but it just looks ugly, and elegance is a real goal, and I would hate it if the current C++0x atomics proposal goes to press like this (it's my civic duty to oppose it).

Maybe you don't need to use the memory-order argument
in a switch or anything, but it will still be part of the call (right?), whereas a template argument is completely invisible. Is there any reason not to use a templated function like TBB does, where template specialisation could do the same as having different overloads? There's a whole lot of things that can be done with template metaprogramming... I was thinking of having architecture-specific traits for the operations, and using those to select appropriate "packaging" implementations for the different memory semantics.



First of all, I'm not a C fan either. I like templates, template metaprogramming, etc.
The main reason to make the memory model a parameter is compliance with C++0x. It's The Principle Of Least Surprise. It doesn't matter whether it's better or worse; it's what the user expects and knows.
At present I don't see anything considerable that templates can give here. Handling the "inheritance" between fence types will be more complicated with templates.
Please elaborate on "things that can be done with template metaprogramming" wrt atomics/fence types. Can you provide some examples?


Dmitriy V'jukov
All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat:

The p. 11 example does not illustrate acq_rel, because either the acq or the rel part has nothing to do with the operations, and acq_rel would even be forbidden in C++0x. An example with RMW operations on two atomics, where sequential consistency or its absence makes a difference, would interest me; this one does not.

I don't see what's so confusing about something that might be a bit unusual (yes, I made it up myself).

I don't understand Alexander Terekhov's list. There might be a legend, but is it self-contained, with examples etc.? Or is it just me? I seriously doubt that! I think I've worked hard enough trying to understand this stuff, more than most people, and this makes no sense to me, all by itself.

Yes, get the details, but then there's a standard to be made that's not a recipe for bugs, and that starts with some level of understandability.

Well, at this point I'm not sure I want to go ahead with proposing an atomics library alternative. For starters, I'm unsure about the real cost of these arguments (can you give some insight into both the C++0x proposal, where I can imagine that thorough optimisation with constant propagation and dead-code elimination will do the trick, and into yours, where I just don't know), which is the only decisive argument for changing things this late in the game.

And if you accept that the memory semantics are passed as an argument, then there's a subset (the atomic<T> template) that actually makes sense, and is a superset of TBB's atomic<T>, which is currently lacking in functionality (bitwise operators and memory semantics), and which will soon be a squashed mosquito on the windshield of the new C++ standard. It's just that there's a whole load of redundant rubbish API thrown on top, and, given the choice, some programmers will undoubtedly use it.

As for the use of templates, all complexity should be hidden as much as possible from the normal user and even from the porting user, of course, but it will hardly be more complicated than what TBB has now. But I have not yet started doing anything remotely sophisticated on this, and I think I might just play around with this a bit without committing or even seriously considering to produce a result. What's the use, anyway: even if I can show something really elegant (which is still a big question), how could it compete? Even if I'm not setting myself up for failure, I would be setting myself up for disappointment, wouldn't I?

Arch D. Robison (Intel):

My inclination for TBB is:



  • Make the default for atomic read-modify-write operations the equivalent of C++ 200x's seq_cst. For Intel architectures, this is just a matter of changing the documentation, not the implementation.

  • Make atomic<T>::operator= retain its "release" semantics and atomic<T>::operator T retain its "acquire" semantics. As with Raf's class, we can add new member functions "load" and "store" to deal with the other cases. If I was doing TBB from scratch, I might make the defaults sequentially consistent. But changing atomic<T>::operator= to sequential consistency now risks a huge performance impact on existing code.

  • Not add the C++ 200x acq_rel semantics. They seem to offer little gain on most processors, and many hazards. My impression is that acq_rel would not even exist in C++ 200x except for the sake of the Power architecture.

  • Add the C++ 200x relaxed option. This discussion has convinced me there is sufficient justification, even if there is not currently an efficient way to implement it on Intel architectures.

So we would end up with four options: sequentially consistent, acquire, release, and relaxed.


TBB 2.1 is pretty much frozen, so these changes will have to be made after that. I'm swamped with last-minute bug fixes for 2.1, so I have not been able to give this forum much time :-(


Another example of a non-TSO processor is Itanium. Ordinary Itanium stores are not TSO. Itanium st.rel stores are TSO.

Raf Schietekat:

Of course I couldn't help myself, and now I'm well along the way with an implementation that supports those extra memory semantics (and bitwise operators), even if only for use with TBB. It seems simpler to push memory semantics and templates down into the platform-related files: to port, implement some required template specialisations, plus as many others as may benefit performance.

Dmitriy, maybe I'll study Alexander Terekhov's list again later (no promise, though); I was too quick in brushing it aside this time: there's more information now than when I first saw it (I think), plus your opinion that it is worth considering.

A general question: What might not be evil about having function arguments (C++0x) instead of template arguments (TBB) for memory semantics? Shouldn't the compiler help check that they are fixed at compile time, and make programmers jump through hoops to deviate from that, and isn't that what template arguments enable the compiler to do? Dmitriy, would you agree that this is a considerable advantage with template arguments, of higher value than The Principle Of Least Surprise (which also depends on what your reference is)? Sometimes a language gives you more by not allowing you to do some things (in this case letting the semantics vary haphazardly). Here it sits between specifically named operations (C style, difficult to customise) and complete run-time laissez-faire. Oh yes, and could the compiler really get rid of function-argument overhead as compared to template arguments?

"If I was doing TBB from scratch, I might make the de[f]aults sequentially consistent." I don't see why that would be such a good idea, as compared to education. Quite surprising, as well, coming from Intel. It is rather fool proof, I know, but with such a default, code will likely be overburdened with load and store until you won't be able to see the forest for the trees anymore, and any significant deviation from the pattern would be obscured instead of standing out, encouraging bugs again. There's another option that has not been considered yet: an atomic template argument that determines how the operations behave by default (relaxed for reference counters etc., default acquire/release for messaging, ordered for special applications); why just take over one-size-fits-all Java volatile semantics if C++ allows extra freedom (where it makes sense)?

"Not add the C++ 200x acq_rel semantics." Then I feel compelled to be the champion of the little guy (IBM)!

Raf Schietekat's picture

To my own general question: And if the choice is not required to be fixed at compile time, there can be no compile-time checking of that fixed value! Templated operations are definitely the way to go then.

Progress report: I think I'm just about finished for linux_ia32.h, except that I got a compile-time (!) error about compare_and_swap in queuing_mutex.cpp. One solution might be to assume success (with failure, a spurious StoreStore barrier might be issued, which might cause some unnecessary delay on non-TSO platforms and require some explanation on TSO platforms), another to implement double semantics for compare_and_swap (one for success, one for failure) as in C++0x (which has default failure semantics based on success semantics that are equivalent to assuming success except for potentially better performance). The easiest would be the first option, which requires just changing false->true in a handful of places and adding a comment, but it might be biased against non-TSO platforms (is there a real performance cost?), so any C++0x interest would then be biased against it.

After resolving this issue (not a strict requirement), I should be able to do some final cleanup and contribute the solution (some work will/may then be required to re-port (most of) the other platforms).

(Added) Oh yes, there's still that idea of having configurable default semantics...

Raf Schietekat's picture

Here's something that might serve as the basis of an implementation with more memory semantics choices (and bitwise operators), based on tbb20_20080408oss_src (most recent stable release). It seems to work on Linux for IA-32, but a lot of tests need to be added to validate that (I just did the normal build all and neglected unit testing for now).

Template metaprogramming buffs might want to check the syntax (I just improvised a bit, it seems more like hacking than programming anyway).

I didn't go ahead with configurable default semantics because partial template specialisation and default template arguments don't seem to mix, not on my GCC anyway.

The biggest test is how well it can be ported to other architectures. Apparently the Intel compiler for Itanium doesn't even need compiler fences with volatile, but for others the code still needs a solution for not overloading the code with fences if the basic operation is already ordered (mac_ppc.h), etc.

I also contributed this in the official way to TBB.

Worthless? Wonderful? Somewhere in-between?

Attachments: 

Dmitry Vyukov's picture

Sorry for late reply - I really don't have much time...

I think the question "memory order as templates vs function parameters" is really more a matter of personal taste.
In either case the compiler will eliminate all runtime overhead (100% for templates and 99% for parameters).
For the end user it's really equal.
For the implementor it's almost equal. With parameters I can use "inheritance" of memory orders. For templates it's a bit harder.
In either case the memory order must be fixed at compile time. For templates it's a bit easier to enforce.
Some sophisticated template metaprogramming... I don't see what it can give to the implementor in this case.
"Bad" combinations of operation/memory_order (load-release, store-acquire) must be prohibited in either case. It's equally easy to implement with templates and with parameters.
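One way to reject such bad combinations at compile time is the declared-but-undefined-template trick mentioned earlier in the thread. The following is an illustrative sketch (the enum and type names are invented, not TBB's): instantiating the load with a prohibited semantics value fails to compile because the corresponding traits specialization does not exist.

```cpp
#include <cassert>

enum memory_semantics { my_acquire, my_release, my_relaxed };

// Primary template intentionally left undefined: only the listed
// specializations are legal, so atomic_load<my_release> cannot compile.
template<memory_semantics M> struct load_allowed;
template<> struct load_allowed<my_acquire> { enum { value = 1 }; };
template<> struct load_allowed<my_relaxed> { enum { value = 1 }; };

template<memory_semantics M>
long atomic_load(const volatile long& x) {
    (void)sizeof(load_allowed<M>);  // compile-time check: incomplete for bad M
    // A real implementation would issue the fences required by M here.
    return x;
}
```

The check costs nothing at run time; the sizeof expression is discarded after instantiation succeeds.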

As for default parameters: I think that's a bad idea in this case. memory_order_seq_cst is the strongest order, so it's the only one that can be made the default. I really don't want it this way:

counter.store(value);

I want it this way:

// memory_order_release: synchronizes with acquire operation in function foo()
counter.store(value, std::memory_order_release);



Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat's picture

(Removed)

"In either case compiler will eliminate all runtime overhead (100% for templates and 99% for parameters)." Really? Also in non-optimised builds? Also in inner loops or to implement spinning locks? I think there might well be a big difference between 100% and "99%".

"With parameters I can use "inheritance" of memory orders." But that seems only useful for *not* implementing some memory semantics, whereas I found it is better to just plug in policy objects (__TBB_fence_guard in my proposal) and only use delegation in other dimensions than memory semantics, e.g., going from store to looping on compare_and_swap.

"In either case memory order must be fixed at compile time. For templates it's a bit easier to enforce." Are you using inheritance graph design to test things at compile time (not test time)? And (how) do you enforce that the semantics are fixed at compile time?

"Some sophisticated template metaprogramming... I don't see what it can give to implementor in this case." I think template metaprogramming is somewhat disgusting, because the language was not designed for that (I found it to be a constant struggle), but with memory_semantics as an enum I don't see another way to test conditions at compile time; also see previous question.

"As for default parameters. I think that it's bad idea in this case." And I think that the default should be configurable, and documented where the atomic variable was declared (like you would do with a lock, right?): the code should follow from the role, not the other way around. Also there is what I wrote about not seeing the forest for the trees anymore if for better performance nearly every use was changed to release/acquire anyway. Ordered semantics are probably the exception, and I would not be surprised if ordered/seq_cst could be easily eliminated for use of stand-alone fence instructions (design-wise, not just technically feasible). Some people, and not the least of them, think that atomics are there primarily for their own sake (raw/relaxed/unfenced), so it seems a far stretch to take sequential consistency to be the default, except to achieve shortsighted fool-proofness (shortsighted because the code will degenerate to unsightliness before it is fully optimised, inviting bugs instead of discouraging them). Of course it is always possible that I just don't have enough experience with atomics for my intuition about these things to be reliable yet, but then it should be fairly easy to refute these arguments.

Arch D. Robison (Intel)'s picture

There is the comment "// - ordered for special purposes". What is "ordered"?


There is the comment "// - rel_acq for "normal" release/acquire/rel_acq defaults (it would be the defaults' default)". What is "rel_acq"? Same as ISO C++ draft or something else?


- Arch

Raf Schietekat's picture

Sorry, these comments are about an unrealised idea for tbb::atomic: a policy for setting default memory semantics (with its own default being rel_acq). I couldn't get past the incompatibility of default template arguments and partial specialisation, on g++ 4.0.3 at least, so I'm still thinking how that could be implemented. Here's how it would apply to some atomic operations (the policy names are recycled from memory_semantics to coincide with how they would affect something like fetch_and_add):

policy->store,load,fetch_and_add
relaxed->relaxed,relaxed,relaxed
rel_acq->release,acquire,rel_acq
ordered->ordered,ordered,ordered
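The mapping above can be expressed as a small traits sketch (illustrative names only; this is not the proposal's actual code, which was blocked on mixing partial specialization with default template arguments):

```cpp
#include <cassert>

enum memory_semantics { relaxed, acquire, release, rel_acq, ordered };

// Primary template undefined: only the three policies below are valid.
template<memory_semantics Policy> struct default_semantics;

template<> struct default_semantics<relaxed> {
    static const int store = relaxed, load = relaxed, fetch_and_add = relaxed;
};
template<> struct default_semantics<rel_acq> {
    static const int store = release, load = acquire, fetch_and_add = rel_acq;
};
template<> struct default_semantics<ordered> {
    static const int store = ordered, load = ordered, fetch_and_add = ordered;
};
```

An atomic parameterized on such a policy would look up, e.g., default_semantics&lt;Policy&gt;::store to pick the semantics of an unannotated store.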

The atomic operations themselves are declared and documented in tbb_machine.h. "ordered" was used in earlier discussions, together with "raw", and I picked one of these because it looks less awkward than "seq_cst" while dumping "raw" for "relaxed" because "relaxed" has 7 characters like the others (duh) and I didn't want to totally reject the C++0x names, even though a programmer shouldn't be too relaxed about using "relaxed". I thought acq_rel is too confusing: first you release, then you do the basic operation, then you acquire, so it's best not to disturb that sequence in one's mind, by having rel_acq instead. Also, most names in TBB are different than what C++0x now proposes anyway (not having the _and_ that TBB uses).

BTW, I'm now thinking about giving the tbb/machine headers more access to __TBB_fence_guard itself, because I'm still not happy about how it would be used on other architectures than linux_ia32.h. And right now I'm having some fleeting doubts whether template specialisation does more than using macros... but that's how it grew (removing the atomic_traits layer and going straight to a tier of simple templates); I'll give it some more thought, though.

(Added) In the spirit of full disclosure, here's another problem: currently default delegation keeps the same memory semantics, but that doesn't really work for linux_itanium.h, where, e.g., store could go to store instead of to a loop of compare_and_swap by way of fetch_and_store (conceptually). The reason is that the __TBB_fence_guard policy normally takes care of everything below ordered, and, having no access to tools/generate_atomic/ipf_generate.sh, I'm assuming that it might be easier to just make it conform to this pattern (relaxed being new and all) than to break the pattern.

Arch D. Robison (Intel)'s picture

Making rel_acq the default instead of sequential consistency is quite dangerous, because realistically almost all machines (except Power) will implement rel_acq the same as sequential consistency. So unless you have a Power machine, there will be no way to truly test the code.


Thanks for your comment in our contribution about AtomicBackoff:


// TODO "Why is this often used in a compare-and-swap loop
// where the comparand is from the previous loop iteration?
// It would seem that this makes the new iteration more
// likely to fail because the underlying value has changed,
// thus costing extra iterations that are ever more
// likely to be interfered with.


I think what happened is that we timed benchmarks that had mostly very short waits, and there the reload seemed to slow things down. But after long waits, it's surely suboptimal to reuse the old value. I'll send your comment around to the group here to make sure everyone sees it.
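The reload-vs-reuse distinction can be illustrated with today's std::atomic, whose compare_exchange_weak refreshes the comparand with the current value on every failure, so each retry starts from an up-to-date comparand (a sketch, not TBB's code):

```cpp
#include <atomic>
#include <cassert>

// Atomically replace x with twice its value; returns the prior value.
int fetch_and_double(std::atomic<int>& x) {
    int old = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(old, old * 2)) {
        // On failure, 'old' now holds the value another thread stored,
        // so the next attempt does not use a stale comparand.
    }
    return old;
}
```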

Raf Schietekat's picture

"Making rel_acq the default instead of sequential consistency is quite dangerous, because realistically almost all machines (except Power) will implement rel_acq the same as sequential consistency. So unless you have a Power machine, there will be no way to truly test the code."

This assumes that the atomic is used without also using store and load, or where that use is not a problem while the operations using rel_acq (fetch_and_store etc.) are used with a fatally mistaken assumption of sequential consistency: otherwise the problems with store and load would draw the attention to the problem. And sequential consistency is very expensive as a default... but I'm preaching to the choir now, because that's what I always heard from the Intel (TBB) side, I think (C++0x is not talking about a policy release,acquire,ordered: it's seq_cst pervasively). Taking it as the default makes it even more urgent to allow a different defaults policy to be plugged in, to avoid having to sprinkle explicit semantics around the code to get performance back, which can only lead to more bugs, not fewer.

Once there's a SPARC port (Solaris now works only with IA-32/Intel 64), it would also be potentially affected if RMO (Relaxed Memory Order) is configured, I guess (although I don't know what would be normal practice).

But I think such a decision should be based on a fair set of examples (of the very specific kind described in the second paragraph), not on a desire for fool-proofness that I think is misguided/short-sighted for reasons explained above. Disclaimer: I don't have the experience to know about such examples.

(Correction) Actually, I don't know which of SPARC's RMO/PSO/TSO modes would differentiate rel_acq vs. ordered, but quite likely none of them, so I withdraw the remark.

Raf Schietekat's picture

Improved atomics proposal (also officially contributed), tested for linux_ia32.h and now also for mac_ppc.h, still based on tbb20_20080408oss_src (current stable release). It settles the __TBB_fence(_guard) problem I still had, and applies the changes to most of the supported platforms (except for verification). Some things still to do: unit tests for bitwise operations etc. (if anyone would like to do that?), configurable memory semantics defaults policy (I don't know the path to a solution yet).

(Correction) Improved but not yet successful, so only for the curious.

(Added after next post) File seems to have disappeared, but please use file in next post instead.

Raf Schietekat's picture

Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (most recent stable release), tested for linux_ia32.h and mac_ppc.h.

Apparently I had overestimated the power and reach of SFINAE (unless it's just a g++ limitation?), so I had to subspecialise some operations for ordered to use the default delegation anyway; an alternative might be to be more precise (and verbose) on the specialisations, thus excluding ordered in the first place.

I surrendered on the issue of not being able to mix partial specialisation and parameter defaults, so now I have atomic{,_relaxed,_rel_acq,_ordered}, with atomic behaving like atomic_rel_acq (anyone wanting to mix release, acquire and ordered might add atomic_mixed to the mix, but I'm still waiting for a good validating example); feel free to educate me on what I may have overlooked.

Still to do: more tests in test_atomic.cpp, finish other ports than linux_ia32.h and mac_ppc.h, but mainly I'm quite happy with this, so it should be mature enough for deeper scrutiny.

Attachments: 

Raf Schietekat's picture

Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (most recent stable release), tested for linux_ia32.h (except that I merged this with linux_emt64t.h into gcc_x86_64_32.h) and mac_ppc.h.

Mainly laid the foundations for SPARC support (see gcc_sparc.h), but I have to wait until Tuesday to validate and test it, so feel free to comment on it first (I welcome corrections and test results).

Attachments: 

Arch D. Robison (Intel)'s picture

I've only had time to read part of it so far. I like the inheritance hierarchy in the implementation classes.


One problem I see is an elegance versus practical efficiency issue with the return type on operator&=, operator|=, and operator^=. So far, atomic<T> has been designed to support only atomic operations that have efficient hardware support on typical platforms. E.g., atomic<T>::operator+= exists because most modern hardware has fetch-and-add support. (Perversely, my own employer's Itanium has perhaps the weakest support for it because of the way it restricts the addend to certain constants.) But I'm not familiar with any widely used hardware that has fetch-and-{and,or,xor} support. However, x86 does have efficient "lock {and,or,xor}" instructions for the situation where an atomic bitwise operation is desired without knowing the prior value. Alas, C++ does not allow overload resolution to choose based on whether the return value is used or not.


If there is no hardware support for fetch-and-{and,or,xor}, it might make more sense to generalize bitwise fetch-and-op to a template member fetch_and_op(binaryFunctor) that does an arbitrary fetch-and-op.
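Such a generalized operation could look roughly like the following free-function sketch built on a compare-and-swap loop (illustrative only, using C++11 std::atomic rather than tbb::atomic):

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Generic fetch-and-op: applies an arbitrary binary functor atomically
// and returns the prior value, retrying via compare-and-swap.
template<typename T, typename Op>
T fetch_and_op(std::atomic<T>& x, T operand, Op op) {
    T old = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(old, op(old, operand)))
        ;  // 'old' is refreshed with the current value on failure
    return old;
}
```

Usage would then cover all the bitwise cases with standard functors, e.g. fetch_and_op(a, mask, std::bit_and&lt;unsigned&gt;()).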


So far, the only client in TBB of bitwise atomic ops is spin_rw_mutex, and it definitely wants the efficient x86 implementations that do not return a value. I suppose we could continue to support those ops as __TBB_Atomic{AND,OR} macros.

Arch D. Robison (Intel)'s picture

The __TBB_fence_guard is a clever trick. We'll have to inspect whether it has an impact on performance. Alas, not-so-clever compilers can generate a lot of gratuitous extra code for the sake of calling a destructor in the event of an exception, even if an exception is not possible. Only inspection of the binary code will tell us if this is an issue or not. If code quality becomes an issue, we can break __TBB_fence_guard into two functions, one for entry and one for exit, which would be a slightly less elegant style but still retain genericity.
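For readers unfamiliar with the idea: a scoped fence guard is an RAII object whose constructor issues the pre-operation fence and whose destructor issues the post-operation fence, so the fences bracket whatever the scope does. This is a minimal sketch in modern C++; the actual __TBB_fence_guard in the proposal is parameterized on the operation's memory semantics, which this sketch does not attempt.

```cpp
#include <atomic>
#include <cassert>

struct fence_guard {
    fence_guard()  { std::atomic_thread_fence(std::memory_order_seq_cst); }  // pre-op fence
    ~fence_guard() { std::atomic_thread_fence(std::memory_order_seq_cst); }  // post-op fence
};

int guarded_read(const std::atomic<int>& x) {
    fence_guard guard;                        // fences bracket the access
    return x.load(std::memory_order_relaxed);
}
```

Arch's concern is about the destructor call path a compiler may emit for exception unwinding even though the destructor here can never throw.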

Dmitry Vyukov's picture
Raf_Schietekat: Latest and greatest (also officially contributed), based on
tbb20_20080408oss_src (most recent stable release), tested for
linux_ia32.h (except that I merged this with linux_emt64t.h into
gcc_x86_64_32.h) and mac_ppc.h.


Mainly laid the foundations for SPARC support (see gcc_sparc.h),
but I have to wait until Tuesday to validate and test it, so feel free
to comment on it first (I welcome corrections and test results).



Sorry for the stupid question, but how can I view this change? The latest development release is dated 'May 14, 2008', and I don't see cvs/svn access...

Dmitriy V'jukov
All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat's picture

Regarding Arch's last two posts:

Adding add/subtract/and/or/xor to atomic would be easy: __TBB_Op is already in place to support atomic::store (the same pattern can be used to link it up for the others) and has been implemented for __TBB_op_{store,add,and,or,xor} where possible to support __TBB_Atomic{AND,OR} as well as store and also because I'm a pedantic orthogonalist, so I'm definitely not against the idea.

I don't see a performance benefit for operations that don't need the existing value, only a register allocation benefit: the x86, as a CISC instruction set, is tailored to this "linguistically", and it has few (visible) registers, but internally it will still have to fetch the old value first, so in a way it is ironic that it only has instructions that discard that old value. Which (other) processors have I overlooked that do fetch-and-add better than fetch-and-{and,or,xor}? Well, there's also the CISC vs. RISC thing (instruction count), but that's a separate issue, I think; and I don't really know why more processors don't allow locking the bus (seems rather greedy, though) or an address (using cache coherence hardware), but maybe there would be insufficient benefit over looping to consider that.

Generalizing is exactly what goes on underneath (in __TBB_(Fetch)Op), but it looks a bit heavy syntax-wise to expose that to the user. I'm not sure whether doing the same for a general user-supplied function would be beneficial, because there is some syntactic overhead to a functor that may offset the benefit of not having to write a simple compare-and-swap loop, and how often would it be used anyway (and if wanted, it can always be added as an external template operation). I'm only hesitating about whether POWER/PowerPC can do anything with reservations to leverage an intrusive design.

The question seems to be whether atomic should be the basis for locks or not. If they should be, there's a design problem with platforms that don't have self-contained atomic support, although I can't immediately think of any that are not nearing end of life (PA-RISC). Otherwise it seems fine to continue to rely on a layer that is not exposed in atomic, and the (non-rhetorical) question becomes what other purposes non-fetch add/subtract/and/or/xor would serve.

I do worry a bit about performance, but for the moment only about non-optimised builds that do not optimise away calls to inline functions. GNU has __attribute__((always_inline)) or somesuch to come to the rescue if needed; I don't know about the others. But I have never considered compilers, if any, that would generate code where nothing is needed for stack unwinding: surely after inlining they would come to their senses?

A question relating to locking the bus or not: to implement various operations, RISC processors will typically loop over compare-and-swap without something like TBB's AtomicBackoff, or that is what manuals suggest anyway. Until now I took it on faith that AtomicBackoff is like what Ethernet does for collision avoidance, but how certain are you that it is beneficial for short unblockable operations? Maybe the maximum pause should be limited or even be variable instead of ever-increasing (with its clandestine biblical bias: the first shall be last and the last shall be first)? And one step further (maybe one too far?) would be the question whether the reportedly unfavorable test results for a skip list map included such a backoff on the atomics or not?
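The exponential backoff being questioned can be sketched like this (a hypothetical class, not TBB's AtomicBackoff, which uses machine pause instructions rather than a plain counter); the cap on the pause length is the limitation Raf suggests:

```cpp
#include <cassert>

class backoff {
    int count;
    static const int limit = 16;   // cap, so late arrivals aren't penalized forever
public:
    backoff() : count(1) {}
    void pause() {
        for (int i = 0; i < count; ++i) {
            // stand-in for a hardware pause instruction (e.g. x86 "pause")
        }
        if (count < limit) count *= 2;  // exponential growth up to the cap
    }
    int current() const { return count; }
};
```

Each failed compare-and-swap attempt would call pause() before retrying, so contention spreads retries out in time, Ethernet-style.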

Raf Schietekat's picture

Dmitriy (and others), after you have obtained tbb20_20080408oss_src from the download page (stable, not development), the following may help (copied from a makefile, please adapt to your situation):

cp $(FROM)/gcc_power.h $(TO)/include/tbb/machine
cp $(FROM)/gcc_sparc.h $(TO)/include/tbb/machine
cp $(FROM)/gcc_x86_64_32.h $(TO)/include/tbb/machine
cp $(FROM)/ibm_aix51.h $(TO)/include/tbb/machine
cp $(FROM)/linux_common.h $(TO)/include/tbb/machine
cp $(FROM)/linux_em64t.h $(TO)/include/tbb/machine
cp $(FROM)/linux_ia32.h $(TO)/include/tbb/machine
cp $(FROM)/linux_itanium.h $(TO)/include/tbb/machine
cp $(FROM)/mac_ppc.h $(TO)/include/tbb/machine
cp $(FROM)/windows_em64t.h $(TO)/include/tbb/machine
cp $(FROM)/windows_ia32.h $(TO)/include/tbb/machine
cp $(FROM)/windows_ia32_inline.h $(TO)/include/tbb/machine
cp $(FROM)/atomic.h $(TO)/include/tbb
cp $(FROM)/tbb_machine.h $(TO)/include/tbb
cp $(FROM)/atomic_support.c $(TO)/src/tbb/ibm_aix51
cp $(FROM)/queuing_rw_mutex.cpp $(TO)/src/tbb
cp $(FROM)/tbb_misc.h $(TO)/src/tbb
cp $(FROM)/test_atomic.cpp $(TO)/src/test
cp $(FROM)/test_compiler.cpp $(TO)/src/test
cp $(FROM)/test_rwm_upgrade_downgrade.cpp $(TO)/src/test

Actually that's from my current list after I changed gcc_x86.h back to the rather silly gcc_x86_64_32.h still used in the zip file; I hope I didn't forget anything else.

robert-reed (Intel)'s picture

There are no stupid questions, only stupid answers, like, uh, this one: we don't have cvs/svn access. That idea has been kicked around and I just sent a missive internally to kick around another idea, but for now the quickest way for you to see Raf's contribution would be if Raf could send you a copy. (Told you it was going to be a stupid answer.) Meanwhile we'll work on this and I'll see if we can come up with a better answer.

Raf Schietekat's picture

I'm sure I'm misinterpreting Robert's last post, but just in case: you only need Downloads>"Stable Release">tbb20_20080408oss>tbb20_20080408oss_src.tar.gz and "TBB Discussion Forum">"Additions to atomic">{atomic.Raf_Schietekat.20080529.zip from "05-29-2008, 12:08 PM" and patch script from item 32 which is the second post on the second page that does not have an absolute timestamp yet}.

Dmitry Vyukov's picture
Raf_Schietekat: I'm sure I'm misinterpreting Robert's last post, but just in case: you only need Downloads>"Stable Release">tbb20_20080408oss>tbb20_20080408oss_src.tar.gz and "TBB Discussion Forum">"Additions to atomic">{atomic.Raf_Schietekat.20080529.zip from "05-29-2008, 12:08 PM" and patch script from item 32 which is the second post on the second page that does not have an absolute timestamp yet}.




Oh... now I see attachment to the post :)
Ok, thank you. I will try to combine the release and the patch.

Btw, concerning fetch-and-{and,or,xor}.
In my implementation of atomics I provide 2 versions of each function:

value_t fetch_add(value_t v);
void add(value_t v);
value_t fetch_and(value_t v);
void and(value_t v);

and so on...

I'm still not sure whether it makes some sense...
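A rough sketch of this two-version idea with C++11 std::atomic ('and' is a reserved alternative token in C++, so the void version is spelled and_ here; both names are illustrative, not Dmitriy's actual API):

```cpp
#include <atomic>
#include <cassert>

// Value-returning version: needs the prior value, so a CAS loop
// (or hardware fetch-and-and where available).
unsigned fetch_and(std::atomic<unsigned>& x, unsigned mask) {
    unsigned old = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(old, old & mask))
        ;  // 'old' is refreshed on failure
    return old;
}

// Void version: the prior value is discarded, so an x86 implementation
// could map this to a single "lock and" instruction.
void and_(std::atomic<unsigned>& x, unsigned mask) {
    x.fetch_and(mask);
}
```

The point of the split is exactly Arch's observation above: C++ overload resolution cannot pick an implementation based on whether the return value is used, so the caller states it by choosing the function.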

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
robert-reed (Intel)'s picture

Sorry for the confusion, Raf. I'd seen that you had posted at least one of your atomic proposals to the Forum as well as submitting them to TBB-contrib, but had forgotten that detail when I composed my latest reply. I'm glad that Dmitriy was able to get a copy. We're working through the internal discussions now to plan how we might regularly share contributions so that our contributors don't have to.

Raf Schietekat's picture

Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (most recent stable release), tested for x86-based GNU/Linux and PowerPC-based Mac OS X, see README for applying the patch.

Better SPARC support (still to be tested), first draft for Alpha support.

Attachments: 

Raf Schietekat's picture

Latest and greatest (also officially contributed), based on tbb20_20080408oss_src (not the most recent stable release anymore, but...), tested for x86-based GNU/Linux (not yet for 32-bit PowerPC-based Mac OS X), see README for applying the patch (I forgot to include this file in the official contribution).

Locked atomics (currently only used for 64 bits on 32-bit PowerPC), hopefully more complete support for Itanium, successful test for SPARC (well, at some point during the making of this version anyway), various things.

Attachments: 

Dmitry Vyukov's picture
Raf_Schietekat:
Locked atomics (currently only used for 64 bits on 32-bit PowerPC)



And what about early x86-64 systems, which don't support the double-word cmpxchg instruction? ;)

Dmitriy V'jukov
All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat's picture

Dmitriy, feel free to contribute what I need to know (I suppose you mean quadword, because doubleword seems to be 32 bits on IA-32/Intel 64). I already merged the 32-bit and 64-bit variants of the x86 architecture (as described in Intel's documentation as IA-32/Intel 64), so it should be relatively easy to fit in an intermediate version. I just need to know the relevant predefined macros and what subset is available (everything except 64-bit cmpxchg?), I suppose (unless it's not a subset, in which case I need some more documentation).

Anyone, feel free to test anything other than g++ on 32-bit x86 and 32-bit PowerPC (those I can easily do myself, so they'll only need final validation later on, except that for the latter I could only build part of TBB). I did a test of SPARC at some point, but for the others I can only hope that I made no mistakes (it's not impossible, but...). In particular, Itanium needs some attention (totally revised), but maybe you want to use an Alpha machine a little while longer (not yet in TBB); I don't know how much interest there would be in PA-RISC (which would require a revision of some TBB assumptions because unlocked semaphores are non-zero), or in MIPS (I could not easily find documentation for it)? I have merged with tbb21_20080605oss by now, if anyone needs that instead.

P.S.: Hey, I reached a century (or how is that in cricket speak?)!

Raf Schietekat's picture

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release), tested for x86-based GNU/Linux and partially for 32-bit PowerPC-based Mac mini (single core), see README for applying the patch.

Not really interesting, just an upgrade to a new TBB stable release version, and 64-bit atomics didn't actually work yet on 32-bit PowerPC in the previous version (I was away from my Mac mini and made a few too many assumptions). The one interesting thing is that I've found that... it doesn't work for a release build on x86 with g++: test_atomic.exe fails for 64-bit atomics, and I don't know why. It does work for a debug build, and for both builds on Mac OS X, but it fails for release on Ubuntu 6.06 for real or on 8.04 in Sun xVM VirtualBox. Of course I'll keep trying to find out why (an extra problem is that sometimes the problem disappears when I add diagnostics, although it seems to be perfectly reproducible between runs if the code is the same), but if you'd like to, e.g., compile to assembler and investigate that to tell me what I missed (the relevant section should be fairly short), please go right ahead.

Oh yes, the revised Itanium code still has never been tested, and any other useful comments are welcome as well.

Attachments: 

Dmitry Vyukov's picture
Raf_Schietekat: Dmitriy, feel free to contribute what I need to know (I suppose you mean quadword, because doubleword seems to be 32 bits on IA-32/Intel 64). I already merged the 32-bit and 64-bit variants of the x86 architecture (as described in Intel's documentation as IA-32/Intel 64), so it should be relatively easy to fit in an intermediate version. I just need to know the relevant predefined macros and what subset is available (everything except 64-bit cmpxchg?), I suppose (unless it's not a subset, in which case I need some more documentation).



Sorry for acting like a$$ :)

I mean that some early x86-64 chips lack the cmpxchg16b instruction (128-bit, double-word CAS, in terms of a 64-bit platform). And you can't detect cmpxchg16b support statically; you can detect it only at run time, by analysing the values returned by the CPUID (eax=1) instruction. So you can't just define some macros...
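For reference, the run-time check Dmitriy describes can be sketched for GCC like this (CPUID leaf 1 reports CMPXCHG16B in ECX bit 13; &lt;cpuid.h&gt; is a GCC-specific header, an assumption of this sketch):

```cpp
#include <cassert>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
// Query CPUID leaf 1 and test ECX bit 13 (CMPXCHG16B).
bool has_cmpxchg16b() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & (1u << 13)) != 0;
}
#else
bool has_cmpxchg16b() { return false; }  // not x86: feature absent
#endif
```

A library could cache this result once at startup and dispatch to a lock-based fallback when the instruction is absent.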

Dmitriy V'jukov
All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat's picture

That might indeed be relevant to any support for generic atomic objects (absent built-in 128-bit integral types), but the decision would (also?) need to be made at development time (if the instruction is not available, the atomic's implementation would have to be based on a lock, doubling the size, although the lock might still be disregarded at runtime for better performance even if the memory is already wasted). It's too late in the day for me now to go check the alignment issues, but I do have an uneasy feeling about that, so it might still all be irrelevant after all.

Raf Schietekat's picture

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release), tested for x86/Ubuntu 6.06/g++ 4.0.3 (only), see README for applying the patch.

Found a simple but strange remedy for the problem with 64-bit atomics on x86, removed AtomicBackoff altogether (was that a good idea?), some new tests, tracked TBB examples/Makefile change for SPARC.

(Added) It would seem that nobody is interested, for why else would there have been not a single reaction to the removal of AtomicBackoff (which I am hereby rolling back)?

Raf Schietekat's picture

Latest and greatest (also officially contributed), based on tbb21_20080605oss (most recent stable release) or tbb21_009oss (most recent commercially aligned release) which differ only in ./CHANGES, tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3 and PowerPC 32-bit (single core)/Mac OS X 10.4.11/g++ 4.0.1, see README for applying the patch.

Improved spin lock (see "Spinning"), added support for MIPS and some promising code for PA-RISC/HP-UX/g++ (some tests were not successful).

(Added) Please note that the PA-RISC code differs from the time the tests were run and will not build. An improved version is in the works.

Attachments: 

landmann's picture

I am quite a beginner with TBB and was wondering where the mentioned atomic is actually used. Is it included in tbb21_009oss? I am doing some tests on SMP ARM to compare it with my own scheduler.
I started to port using the PowerPC header, but it only implements cmpxchg atomics with full-barrier semantics. As these sync calls may become visible on ARM, I had a look at the IA64 implementation but found no template either, only fully referenced externals... quite a lot of them (too many for a quick test, so I will stay with the ppc approach).
Has anyone thought about using libatomic_ops? It already covers a lot of architectures and provides semantics, but is plain C only.

Regards!

--L

Raf Schietekat's picture

If you would bear with me for another day or so, I will include support for ARM (after some final touches for 64-bit atomics) in a new version that will correct buildability on PA-RISC.

landmann's picture

Sounds good. Sorry for not seeing at first that the atomics work is a patch to TBB. I will use the simplest variant (just a full-barrier cas) until then, and if more advanced atomics become available via a patch I already have something for benchmarking and comparing both variants.
However, you wrote that your atomics differ from the proposal for C++0x. I don't want to discuss this here, but in case the proposal by Boehm/Crowl makes it into C++, is there a _technical_ issue which prevents it from being used in TBB?
If you are interested , feel free for some off-list discussions (also ARM specific if needed).

Regards!

--L

Raf Schietekat's picture

I'm going to mull over PA-RISC some more, but I did promise something for ARM, so here it is (on top of the 2008-08-03 patch, not yet officially contributed, so no peeking from TBB staff; no changes since the previous post, so no 64 bits yet). If you are familiar enough with the issues, please have a look at the fence-related code in gcc_arm.h and let me know about any problems (you may need to officially contribute any nontrivial suggestions).

Attachments: armdiffs20080807.zip (16.74 KB)
landmann's picture

Wow, that looks astonishing. I need some time to go through it.
I really like the scoped fence construction!

One issue already exists: the atomic store operations should not be done with a simple store, but with an ldrex/strex pair so as to clear the exclusive monitor, at least as long as the Linux kernel does not guarantee to always clear the monitor on a context switch. There have been some discussions on the lists, but honestly I do not know the current state of mainline.
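The suggestion above amounts to implementing the atomic store as a read-modify-write, so that on ARMv6+ the compiler emits an ldrex/strex loop and the strex clears the exclusive monitor. A hedged, portable sketch using `std::atomic` (the patch itself would use inline assembly, not this):

```cpp
#include <atomic>

// Hedged sketch of a "monitor-clearing" store: performing the store as an
// exchange forces a read-modify-write, which on ARMv6+ compiles to an
// ldrex/strex pair (clearing the exclusive monitor), and on x86 to xchg.
// Portable illustration only, not the patch's actual implementation.
template <typename T>
void monitor_clearing_store(std::atomic<T>& a, T value) {
    (void)a.exchange(value); // RMW instead of a plain store
}
```

On architectures without the exclusive-monitor issue this is strictly more expensive than a plain fenced store, so a real implementation would only do it on the affected targets.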
You mentioned there is no 64-bit support yet. Does TBB run without it, or is it emulated with locks?
I don't understand the sign-extension problem you raised... as long as the compare instruction is used with the proper width, it should not matter whether the upper parts of the registers are sign- or zero-extended, should it?
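For context on where sign extension can bite: when 8- or 16-bit atomics are emulated with a word-wide compare-and-swap, the comparand and the loaded value must be widened the same way, or the full-word compare fails even though the narrow values match. A minimal illustration of the widening difference (my reading of the concern, not code from the patch):

```cpp
#include <cstdint>

// Illustration of the widening pitfall behind a word-wide CAS on a narrow
// value: the same 8-bit pattern 0xFF widens differently when treated as
// signed versus unsigned.
inline std::int32_t widen_signed(std::int8_t v)    { return v; } // sign-extends
inline std::int32_t widen_unsigned(std::uint8_t v) { return v; } // zero-extends
```

A 32-bit CAS comparing `widen_signed(x)` against `widen_unsigned(x)` would spuriously fail for any value with the high bit set, even though a width-correct byte compare would succeed.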
What about using gcc's built-ins for log2? It would need only one declaration for all the gcc-based includes.

Thank you!
--L

Alexey Kukanov (Intel)'s picture
Landmann: What about using gcc's built-ins for log2 ? It would need only one declaration for the whole gcc-based includes.


In general, we avoid using compiler built-ins in TBB, because the code will be built by different versions of different compilers (gcc 3.2.x to 4.3.x, and Intel C++ Compiler 9.0 to 10.1, to name a few). That's why inline assembly is preferred. For a particular platform where a zoo of different compilers is not a problem, using intrinsics might be totally acceptable.
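The trade-off being discussed can be made concrete: the GCC-specific built-in gives a single-instruction log2 on most targets, while a portable fallback needs a loop (or per-platform inline assembly, which is TBB's choice). A sketch of both sides, assuming x != 0 in each case:

```cpp
// Portable floor(log2(x)) by repeated shifting; works with any compiler.
inline unsigned log2_portable(unsigned long x) {
    unsigned r = 0;
    while (x >>= 1) ++r; // caller must ensure x != 0
    return r;
}

#if defined(__GNUC__)
// The built-in variant landmann suggests: count-leading-zeros based,
// GCC-specific, which is exactly the portability concern raised above.
inline unsigned log2_builtin(unsigned long x) {
    return 8 * sizeof(unsigned long) - 1
           - static_cast<unsigned>(__builtin_clzl(x)); // x != 0
}
#endif
```

Guarding the built-in behind `__GNUC__` with a portable fallback is one middle ground; TBB's approach of per-platform inline assembly is another.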
