Additions to atomic<T>

Raf_Schietekat:
"Other threads can't determine whether load was before store or after." Not immediately, obviously, but why would that be a requirement? Given 2 threads A and B, globals a and b both initially clear, and lock l, if each thread acquires the lock, writes its ID into its own global, probes the other global, clears its own global and releases the lock, shouldn't both probes be clear when they compare notes later on? Yet without #LoadStore as part of a release fence, one might have seen the other thread's ID, which seems rather weird.

I was not right here. My reasoning can't be applied to a hardware memory model.
Things are simpler. In TSO every load is followed by an implicit #LoadStore, so a load can't sink below a store.
I agree that it's completely counter-intuitive. I think their definitions are rooted in how it's implemented in hardware.

Before I deliver the next patch version (already merged with TBB 2008-09-25, with new mut_ord semantics added and a tbb::atomic hostile takeover in progress), some questions:

Would it be interesting to have test_and_compare_and_* operations, or would that be sugar overdose? Alternatively, what would be the performance impact from making "test" implicit, and perhaps having notest_compare_and_* instead, i.e., the opposite default?

Regarding compare_and_store and friends, it turns out that there may be instances where the user is not interested in the new value, only in whether or not the operation succeeded, and having to provide and initialize a "comparand" variable is merely annoying. Would it be interesting to have a version compare_and_*_noupdate() which takes a comparand argument by value instead of by reference? Note that I don't (immediately) see any relevance at the assembler level, only at the C++ level.

As an example (a combination of both, actually), here's a possible before and after in task.cpp (if wait=true is indeed required, the "noupdate" can be trivially expressed in terms of test_and_compare_and_swap instead, of course), with a few other cases like it in the same file, I think:

 depth_type comparand = arena->slot[k].steal_end.load(); // acquire c/o compare_and_or below
if( comparand==-4 && arena->slot[k].steal_end.compare_and_or( comparand, 1, true ) ) {

which could be rewritten as:

 if( arena->slot[k].steal_end.test_and_compare_and_or_noupdate( -4, 1, true) ) {

The suffix "noupdate" is probably required to avoid accidents relating to const/no const overloads (anyone who has ever done battle with CORBA on C++ will probably immediately agree that const should never evermake a functional difference).So, what do you think? Too much and only occasionally useful and therefore to be avoided as creeping featurism, or just right? Note that I'm looking for input from fairly seasoned atomics programmers to make this call.Just in case anyone noticed,there's also the matterof how to deal with memory semantics related to failure. A particularly tricky aspect of that would be whether the test might float off without an associated acquire, and whether that might cause spurious spurious failure (sic, and perhaps pathologically so) or cripple the "test" step as a performance-enhancing device? Or maybe I'm just being paranoid?

Latest and greatest (also officially contributed), based on tbb21_20080925oss (most recent development release), tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3, see README.atomic for applying the patch.

Upgraded from previous base, added tbb::mut_ord, more applications of tbb::atomic (the proof of the pudding is in the eating). No elements from my previous post #102.

Well, that was the intention, but apparently I cannot add files anymore, from either Firefox or Safari. Maybe somebody from the TBB team could add the officially contributed patch?

Some news from the trench lines (not yet contributed)...
By cutting out the internal class atomic_impl2 I enabled tbb::atomic (and friends) to support pointers to incomplete types, a shortcoming that was revealed when I tried to repackage some atomic functionality as applications of tbb::atomic, with the compound purpose of rationalising the code by internally using part of the published interface, testing atomic (with the above-mentioned positive outcome), and getting rid of the curious situation of treating non-volatile variables as volatile (which turned out to be causing real problems; see under subject test_mutex.exe).
This has left atomic.h in a sorry state, however, and between having already stretched the use of macros beyond the limits of decency and now adding a new file atomic.repeated.h (#include'd repeatedly, as the name suggests), it may now need to be bailed out by explicit code generation. Anything is better than manually expanding code, though, because experience shows that this quickly becomes unmanageable, so the limited expansion that was introduced because #include cannot occur inside a macro definition should not be extended. No existing external code is affected, though.
The following step probably needs to be the support of atomic objects, e.g., to allow some implementation hiding of tricks like a pointer combined with a one-bit flag.
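The pointer-plus-one-bit-flag trick mentioned above can be sketched in plain C++, assuming the pointee's alignment guarantees a zero low bit. The class name and interface here are purely illustrative of the representation that an "atomic object" could hide:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: pack a one-bit flag into the low bit of a pointer,
// the kind of representation an atomic-object interface could encapsulate.
template<typename T>
class tagged_ptr {
    std::uintptr_t bits;
public:
    tagged_ptr(T* p, bool flag)
        : bits(reinterpret_cast<std::uintptr_t>(p) | (flag ? 1u : 0u)) {
        // Relies on alignof(T) >= 2 so the low bit is free.
        assert((reinterpret_cast<std::uintptr_t>(p) & 1) == 0);
    }
    T*   ptr()  const { return reinterpret_cast<T*>(bits & ~std::uintptr_t(1)); }
    bool flag() const { return (bits & 1) != 0; }
};
```

The whole `bits` word is what would be stored in a single machine-word atomic; `ptr()` and `flag()` are the "implementation hiding" accessors.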

Quoting - raf_schietekat

This has left atomic.h in a sorry state, however, and between already having stretched the limits of the use of macros beyond the limits of decency

Yeah :)

The last time I tried to examine the contents of atomic.h, I closed the file after several minutes of vain attempts to understand anything. In all that time I didn't spot any mention of the hardware atomic operations (lock cmpxchg) that would serve as some 'guiding star' :)

Quoting - raf_schietekat

The following step probably needs to be the support of atomic objects, e.g., to allow some implementation hiding of tricks like a pointer combined with a one-bit flag.

How would it look? Do you mean something like:

struct list_anchor
{
    uint64_t head  : 30;
    uint64_t tail  : 30;
    uint64_t flags : 4;
};

atomic<list_anchor> anchor;

anchor.compare_exchange(...);

If so, then I consider it as crucial.

But it won't work until the user defines operator==(), operator+(), operator|() etc. for list_anchor... And defining those operators is boring.

The possible solution can be:

atomic<list_anchor> anchor;

i.e. on all operations (==, +, &, |) structure is automatically cast to uint64_t, and then back to list_anchor.

Hmmm... or probably the implementation can automatically determine which type to cast to. I.e. if the user structure is POD and sizeof==4, then on operations it's cast to uint32_t; if it's POD and sizeof==8, then on operations it's cast to uint64_t, etc...

Quoting - raf_schietekat

Would it be interesting to have test_and_compare_and_* operations, or would that be sugar overdose?

I consider it a sugar overdose. You can't provide functions for all situations, so I will have to write manual loops anyway.

In my opinion, an atomics library must be (1) as functional as possible and (2) as simple and minimalistic as possible at the same time. Personally, for me it makes little sense if a library provides functions like test_and_fetch_and_compare_and_xor(). There is a well-defined "basis" for atomic operations: exchange, compare_exchange, fetch_add. It's enough.
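The "basis" claim can be illustrated: any other read-modify-write, say fetch_or, reduces to a short compare_exchange loop. This sketch uses C++11 std::atomic for concreteness; the helper name is made up for the example:

```cpp
#include <atomic>

// fetch_or expressed in terms of the compare_exchange "basis" operation.
unsigned fetch_or_via_cas(std::atomic<unsigned>& a, unsigned bits) {
    unsigned old = a.load(std::memory_order_relaxed);
    // Retry until no other thread intervened between the load and the CAS;
    // on failure, compare_exchange_weak reloads 'old' with the current value.
    while (!a.compare_exchange_weak(old, old | bits))
        ;
    return old; // the previous value, as a hardware fetch_or would return
}
```

The same loop shape yields fetch_and, fetch_xor, or any user-defined transformation, which is why the three-operation basis is enough.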

BUT it would still be nice to see support for "atomic objects":

http://software.intel.com/en-us/forums/showpost.php?p=58638

Quoting - raf_schietekat
Regarding compare_and_store and friends, it turns out that there may be instances where the user is not interested in the new value, only in whether or not the operation succeeded, and having to provide and initialize a "comparand" variable is merely annoying. Would it be interesting to have a version compare_and_*_noupdate() which takes a comparand argument by value instead of by reference?

Sugar overdose. (it's strictly my personal opinion!)

Sometimes it even hurts, because it provokes a bad programming style. I feel it when using Win32's Interlocked* functions - I constantly reload the value, even though I already have the most recent value.

Quoting - raf_schietekat

Alternatively, what would be the performance impact from making "test" implicit, and perhaps have notest_compare_and_* instead, i.e., the opposite default?

An additional test can have a significant impact on performance, especially if I haven't loaded the value yet. Consider:

cmp = 0;

xchg = my_thread_id;

lock.compare_exchange(cmp, xchg);

Note that the cache line which holds the 'lock' variable is not yet loaded into the processor's cache when compare_exchange() starts executing.

Quoting - raf_schietekat

Just in case anyone noticed,there's also the matterof how to deal with memory semantics related to failure. A particularly tricky aspect of that would be whether the test might float off without an associated acquire, and whether that might cause spurious spurious failure (sic, and perhaps pathologically so) or cripple the "test" step as a performance-enhancing device? Or maybe I'm just being paranoid?

C++0x solves it this way. If the additional parameter for the failure case is not supplied, then:

seq_cst (on success) -> seq_cst (on failure)
acq_rel (on success) -> acquire (on failure)
release (on success) -> relaxed (on failure)
acquire (on success) -> acquire (on failure)
consume (on success) -> consume (on failure)
relaxed (on success) -> relaxed (on failure)

I.e. the operation behaves as if only the load part of the RMW was executed.

Or the user can supply an additional explicit argument for the failure case, but it can't be stronger than the default (see the table above).

I think it's reasonable, though I have never used the latter form with an explicit argument for failure.
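In the eventual C++11 spelling, the two-ordering form looks as follows. The function is only an illustration; the constraint is the one described above: the failure ordering may not be stronger than the success ordering, and applies when only the load part of the RMW happens.

```cpp
#include <atomic>

// compare_exchange with distinct success/failure orderings.
bool try_claim(std::atomic<int>& slot, int expected, int id) {
    return slot.compare_exchange_strong(
        expected, id,
        std::memory_order_acq_rel,   // on success: full RMW semantics
        std::memory_order_acquire);  // on failure: load semantics only
}
```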

Btw, I would not add an automatic/implicit test before compare_and_*. A TATAS mutex is slower than a TAS mutex in many contexts. If the user wants an additional test, he can easily add it.
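The TAS/TATAS distinction referred to here, sketched with std::atomic (illustrative only; a real spinlock would add backoff and a proper RAII interface):

```cpp
#include <atomic>

std::atomic<int> lock_word(0);

// Plain TAS: always attempt the locked RMW.
void tas_lock() {
    while (lock_word.exchange(1, std::memory_order_acquire) == 1)
        ; // spin
}

// TATAS: spin on a plain load first and attempt the RMW only when the
// lock looks free. Cheaper under heavy contention (no cache-line
// ping-pong while spinning), but the extra test is pure overhead on an
// uncontended fast path -- which is the trade-off discussed above.
void tatas_lock() {
    for (;;) {
        if (lock_word.load(std::memory_order_relaxed) == 0 &&
            lock_word.exchange(1, std::memory_order_acquire) == 0)
            return;
    }
}

void unlock() { lock_word.store(0, std::memory_order_release); }
```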

Quoting - raf_schietekat

Upgraded from previous base, added tbb::mut_ord

Btw, is there any formal semantics for the memory ordering types (like mut_ord)?

Yeah, they are intuitive. But only up to some point! When I was writing a simulator for the formal C++0x memory ordering rules, I faced many caveats, subtle moments, etc., which can't be resolved without resorting to formal semantics. Most problems come from interactions between different memory ordering operations; for example, one thread executes mut_ord on a first variable and then acquire on another, and a second thread executes seq_cst on the second variable and then relaxed on the first variable - IS IT INTENDED TO WORK???

Also, if there is no formal semantics then we can find ourselves in the same troubles as with TBB's container accessors:

http://software.intel.com/en-us/forums/showpost.php?p=64902

Nobody knows how they are intended to work, what is guaranteed, or whether it is legitimate to change their behavior...

One paragraph per post from Dmitriy:

#105: Any architecture-specific code would be in include/tbb/machine/, of course.

#106: I'm still wondering how to do it, exactly. In the situation that led me to consider atomic objects, I already resorted instead to encapsulating an atomic, because it seemed more important to preserve the guarantee that fetch_and_add would be executed using LOCK XADD (otherwise it would be shot down...). I am considering accessors to provide, e.g., int-like operations with object-like signature: even a pointer atomic could expose such an accessor to sidestep referent size for addition of a flag bit, in "addition" to pointer-related addition, and this is not something you can do by just adding another template parameter. Then there are the objects that are not a power of 2, objects that are too big not to need locking, ... Not trivial.

#107: A spoonful of sugar makes the pattern go down (cf. the song), but it needs to really be a consistent pattern, of course.

#108: OK, explicit variables only it is then (BYOS: bring your own sugar).

#109: That's why I asked for an opinion about the *default* meaning (exceptions would still be possible), but I take note of the absence of enthusiasm for these ideas and withdraw the proposals.

#110: What is "consume"? Have you seen any indications to expect the opposite reaction from others even if you yourself would not miss configurable failure-related memory semantics?

#111: We're all informal here... mut_ord would be like C++0x seq_cst (which is confusing IMHO, like calling rel_acq=acq_rel a "full fence"). mut_ord would do its thing (beyond rel_acq) only relative to another mut_ord (hence the "mutual"), no other interactions should be inferred (other than (release and acquire)<=rel_acq<=mut_ord<=ordered), and I suppose the example calls for either 4 times "mut_ord" or 2 times "ordered" (there would be no seq_cst by that name). More formalism could still be nice, as long as there is a practical purpose.

Latest and greatest (also officially contributed), based on tbb21_20080925oss (most recent development release), tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3, see README.atomic for applying the patch.

Atomics now support pointers to incomplete types and objects that can be in a union and are not larger than the largest supported integer size. More applications.

Hopefully the file will be included this time, because it's a very strange upload interface.

Another attempt...

Attachments: 

Quoting - raf_schietekat
#106: I'm still wondering how to do it, exactly. In the situation that led me to consider atomic objects, I already resorted instead to encapsulating an atomic, because it seemed more important to preserve the guarantee that fetch_and_add would be executed using LOCK XADD (otherwise it would be shot down...). I am considering accessors to provide, e.g., int-like operations with object-like signature: even a pointer atomic could expose such an accessor to sidestep referent size for addition of a flag bit, in "addition" to pointer-related addition, and this is not something you can do by just adding another template parameter. Then there are the objects that are not a power of 2, objects that are too big not to need locking, ... Not trivial.

What do you think about the following implementation?

The main point is that the 'inner' type (with which hardware atomic operations work) and the 'outer' type (with which the user works) are separated. Though the user explicitly defines the 'inner' type when declaring an atomic object.

The main use case for me is when I want to work with some packed structure like {pointer+flag} or {pointer+version number}, or a packed bit-field which contains several fields. And it seems that such a simple implementation would satisfy my needs.

----

// atomic class extended with traits
template<typename T, typename traits_t>
class atomic
{
private:
    T v;

public:
    typedef typename traits_t::outer_type outer_type;

    outer_type exchange(outer_type xchg)
    {
        T x = XCHG(&v, traits_t::to_inner(xchg));
        return traits_t::to_outer(x);
    }

    bool compare_exchange(outer_type& cmp, outer_type xchg)
    {
        T cmp_ = traits_t::to_inner(cmp);
        T const xchg_ = traits_t::to_inner(xchg);
        bool r = CAS(&v, &cmp_, xchg_);
        if (false == r)
            cmp = traits_t::to_outer(cmp_);
        return r;
    }

    outer_type fetch_add(outer_type addend)
    {
        T x = FAA(&v, traits_t::to_inner(addend));
        return traits_t::to_outer(x);
    }
};

// standard traits provided by library
template<typename T>
struct std_atomic_traits
{
    typedef T outer_type;

    static T to_inner(outer_type x)
    {
        return x;
    }

    static outer_type to_outer(T x)
    {
        return x;
    }
};

// traits for marked pointer (possibly provided by library)
template<typename T>
struct marked_pointer_traits
{
    typedef std::pair<T, bool> outer_type;

    static T to_inner(outer_type x)
    {
        assert(((intptr_t)x.first & 1) == 0);
        return (T)((intptr_t)x.first | (intptr_t)x.second);
    }

    static outer_type to_outer(T x)
    {
        return std::make_pair((T)((intptr_t)x & ~1), (bool)((intptr_t)x & 1));
    }
};

// and this is how user can extend machinery
struct my_node
{
    uint64_t    head    : 31;
    uint64_t    tail    : 31;
    uint64_t    deleted : 1;
    uint64_t    first   : 1;
};

struct my_node_traits
{
    typedef my_node     outer_type;

    static uint64_t to_inner(my_node x)
    {
        return *(uint64_t*)&x;
    }

    static my_node to_outer(uint64_t x)
    {
        return *(my_node*)&x;
    }
};

// usage examples
int main()
{
    atomic<int, std_atomic_traits<int> > x;
    x.fetch_add(1);

    atomic<int*, marked_pointer_traits<int*> > y;
    int* p = 0;
    std::pair<int*, bool> xchg = y.exchange(std::make_pair(p, true));
    bool s = y.compare_exchange(xchg, std::make_pair(p, false));

    atomic<uint64_t, my_node_traits> z;
    my_node xchg2 = {};
    my_node xchg3 = z.exchange(xchg2);
    s = z.compare_exchange(xchg2, xchg3);
}

----

Quoting - Dmitriy V'jukov

struct my_node_traits
{
    typedef my_node     outer_type;

    static uint64_t to_inner(my_node x)
    {
        return *(uint64_t*)&x;
    }

    static my_node to_outer(uint64_t x)
    {
        return *(my_node*)&x;
    }
};

Hmmm....

It seems that such case can be generalized:

// PROVIDED BY LIBRARY
template<typename inner_t, typename outer_t>
struct reinterpret_cast_traits
{
    typedef outer_t outer_type;

    static inner_t to_inner(outer_t x)
    {
        return *(inner_t*)&x;
    }

    static outer_t to_outer(inner_t x)
    {
        return *(outer_t*)&x;
    }
};

Now I (as user) can write just:

atomic<uint64_t, reinterpret_cast_traits<uint64_t, my_node> > z;

Quoting - raf_schietekat
#106: I'm still wondering how to do it, exactly. In the situation that led me to consider atomic objects, I already resorted instead to encapsulating an atomic, because it seemed more important to preserve the guarantee that fetch_and_add would be executed using LOCK XADD (otherwise it would be shot down...). I am considering accessors to provide, e.g., int-like operations with object-like signature: even a pointer atomic could expose such an accessor to sidestep referent size for addition of a flag bit, in "addition" to pointer-related addition, and this is not something you can do by just adding another template parameter. Then there are the objects that are not a power of 2, objects that are too big not to need locking, ... Not trivial.

As for objects that are big and/or not a power of 2.

I think it's just a completely different case. It's at a much higher level of abstraction. It could be called something like 'atomic_register<>', and it must provide a slightly different set of operations.

The following looks a bit strange to me:

tbb::atomic cache;

cache.fetch_add(additions);

And this looks much clearer:

tbb::atomic_register cache;

cache.compare_exchange(old, additions, apply_functor);

For me, an atomic object is associated purely with low-level operations on bytes/words/double-words (mostly with hardware support).

Quoting - raf_schietekat

#109: That's why I asked for an opinion about the *default* meaning (exceptions would still be possible), but I take note of the absense of enthusiasm for these ideas and withdraw the proposals.

#110: What is "consume"? Have you seen any indications to expect the opposite reaction from others even if you yourself would not miss configurable failure-related memory semantics?

I can't speak for others; these are only my personal opinions. Maybe it's better to create a dedicated topic for such questions and conduct a mini-survey. I think that TBB developers who work with low-level things like concurrent_queue, concurrent_vector, and the task_scheduler could say something on the topic. Maybe some other 'external' people will express their opinions too.

Quoting - raf_schietekat

#110: What is "consume"?

memory_order_consume is a subset of memory_order_acquire which works only for data-dependent operations.

For example:

node* n = anchor.load(memory_order_consume);

int data = n->data;

memory_order_consume is enough here, because the address of 'n->data' is data-dependent on the address of 'n'.

Another example:

int x = g_x.load(memory_order_acquire);

int y = g_y;

memory_order_consume is not enough here, because 'g_y' is not data-dependent on 'g_x'.
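A fuller sketch of the data-dependent case, in the eventual C++11 spelling (note that most implementations simply promote consume to acquire; the names here are illustrative):

```cpp
#include <atomic>
#include <cstddef>

struct node { int data; };
std::atomic<node*> anchor(nullptr);

void publish(node* n) {
    // release makes the write to n->data visible before the pointer itself
    anchor.store(n, std::memory_order_release);
}

int read_data() {
    // consume orders only the loads that are data-dependent on 'n',
    // such as n->data below; independent loads are not ordered.
    node* n = anchor.load(std::memory_order_consume);
    return n ? n->data : -1;
}
```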

Quoting - raf_schietekat

#111: We're all informal here... mut_ord would be like C++0x seq_cst (which is confusing IMHO, like calling rel_acq=acq_rel a "full fence"). mut_ord would do its thing (beyond rel_acq) only relative to another mut_ord (hence the "mutual"), no other interactions should be inferred (other than (release and acquire)<=rel_acq<=mut_ord<=ordered), and I suppose the example calls for either 4 times "mut_ord" or 2 times "ordered" (there would be no seq_cst by that name). More formalism could still be nice, as long as there is a practical purpose.

There is a need for formal semantics as soon as one faces any non-characteristic situation.

For example, read my conversation with Alexander Terekhov about 'his list' (I remember you referring to his list as "oh, no, again his list" :) ):

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

I was unable to prove anything to him, as well as he was unable to prove anything to me. He was saying "this works", I was saying "no, it doesn't".

Also consider the following code:

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

Does it have to work? Or not? Even Hans Boehm was a bit puzzled.

Another example:

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

According to common sense it must work, but according to the formal semantics it doesn't. And as it turns out, there are objective reasons for that.

Quoting - Dmitriy V'jukov

There is a need for formal semantics as soon as one faces any non-characteristic situation.

For example, read my conversation with Alexander Terekhov about 'his list' (I remember you referring to his list as "oh, no, again his list" :) ):

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

I was unable to prove anything to him, as well as he was unable to prove anything to me. He was saying "this works", I was saying "no, it doesn't".

Also consider the following code:

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

Does it have to work? Or not? Even Hans Boehm was a bit puzzled.

Another example:

http://groups.google.com/group/comp.programming.threads/browse_frm/threa...

According to common sense it must work, but according to the formal semantics it doesn't. And as it turns out, there are objective reasons for that.

Though I am not saying that formal semantics must be priority #1. I am just saying that there is a practical purpose. Definitely.

Recall the confusion with the semantics of accessors:

http://software.intel.com/en-us/forums/showthread.php?t=61153

"What do you think about following implementation?" How would you distinguish between normal pointer operations and manipulating a bit flag, with fetch_add for both resulting in LOCK XADD at the machine level?

"As for objects that are big and/or not a power of 2." The current patch treats them like bool and void* (no addition, no bitwise operators).

Thanks for explaining "consume", but that leaves me clueless about how to implement it. Maybe it's something that only a compiler can deal with?

(Currently skipping some references that look very interesting.)

By practical purpose I meant being relevant for a programmer to be confident about what the program will do and why, not to make him feel small and incompetent.

Quoting - Raf Schietekat

"What do you think about following implementation?" How would you distinguish between normal pointer operations and manipulating a bit flag, with fetch_add for both resulting in LOCK XADD at the machine level?

I don't understand your question.

Here is how one can set the flag and leave the pointer unchanged:

atomic<int*, marked_pointer_traits<int*> > x;

x.fetch_or(std::make_pair((int*)0, true));

And here is how one can change the pointer and leave the flag unchanged:

atomic<int*, marked_pointer_traits<int*> > x;

std::pair<int*, bool> cmp = x.load();

while (!x.compare_exchange(cmp, std::make_pair(new_value, cmp.second)));

This way one can construct any operation on the flag, the pointer, or both.

Quoting - Raf Schietekat

Thanks for explaining "consume", but that leaves me clueless about how to implement it. Maybe it's something that only a compiler can deal with?

Yes, in full measure memory_order_consume can be implemented only by the compiler, because it has to watch out not to break dependencies the user may be relying on. For example:

int x = a.load(memory_order_consume);

int y = b[x - x].load(memory_order_relaxed);

Formally, y depends on x. But any current sane compiler will destroy the dependency.

But a useful approximation of memory_order_consume can be implemented without compiler support. I.e. the implementation will respect only real dependencies, and sabotage fake dependencies like the one in the example above. On platforms which respect data dependencies the implementation can look like:

T load_consume(T* volatile x)
{
    T v = x[0]; // volatile load
    compiler_acquire_fence();
    return v;
}

Quoting - Raf Schietekat

By practical purpose I meant being relevant for a programmer to be confident about what the program will do and why, not to make him feel small and incompetent.

:))

Ok, let's assume it's only me who doesn't understand the subtle moments without formal semantics :)

"I don't understand your question." ++x (to advance the pointer) vs. x.as().add(1) (to set the flag if known not to be set yet), both resulting in a locked operation on x86 (right?), things like that. The syntax is only meant to sketch the idea, I haven't thought about what would actually work yet.

"Ok, let's count that it's only me who don't unserstand subtle moments w/o formal semantics :)" I'm sure that's not the case, but I don't wish to hide that I have not yet been able to force myself to persevere trying to understand the concept of a race in C++0x, so sometimes formal semantics can stand in the way of understanding things.

Quoting - Raf Schietekat

"I don't understand your question." ++x (to advance the pointer) vs. x.as().add(1) (to set the flag if known not to be set yet), both resulting in a locked operation on x86 (right?), things like that. The syntax is only meant to sketch the idea, I haven't thought about what would actually work yet.

Personally, for me it looks like sugar overdose. I would be incredibly happy if atomic would just automatically cast my structure to the underlying type and back. This is what I usually do by hand. Or I create a class like this:

union X
{
    struct parts_t
    {
        uint32_t f1;
        uint16_t f2;
        uint8_t  f3;
        uint8_t  f4;
    };

    uint64_t whole;
    parts_t  parts;
};

And then write stupid code like this:

X x;
x.whole = ATOMIC_XCHG(...);
... x.parts.f1 ...

Quoting - Raf Schietekat

"Ok, let's count that it's only me who don't unserstand subtle moments w/o formal semantics :)" I'm sure that's not the case, but I don't wish to hide that I have not yet been able to force myself to persevere trying to understand the concept of a race in C++0x, so sometimes formal semantics can stand in the way of understanding things.

I am not saying that formal semantics must be a replacement for informal description.

Btw, you can find a number of informal descriptions of C++0x's concepts (data races, memory ordering, etc.) on Hans Boehm's page:

http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/

There are things like "Hans Boehm's attempt to explain the concurrency piece of the preceding formal proposal", "Foundations of the C++ Concurrency Memory Model", "A frequently asked questions list", etc. And they all live in harmony with the standard's formal semantics.

This is a retraction of the support for more general object types in the latest patch, which requires some actual attention before any next attempt (sorry). Use with the previously supported types, now including pointers to incomplete types, is OK.

Latest and greatest (also officially contributed), based on tbb21_20080925oss (most recent development release), not yet successfully tested (release build of test_parallel_sort.exe fails), see README.atomic for applying the patch.

atomic

Attachments: 

Obviously "latest and greatest" does not quite apply here yet, so you might prefer an earlier version.

And I made yet another copy&paste mistake, so the text should have read (with my apologies):

"Intermediate snapshot version (also officially contributed), based on tbb21_20080925oss (a fairly recent development release), not yet successfully tested (release build of test_parallel_sort.exe fails) so you may prefer an earlier version, see README.atomic for applying the patch."

Retracing my changes, I was able to isolate the failure to two versions of atomic.h (atomic.good1.h and atomic.bad3.h, to be substituted in the latest patch as include/tbb/atomic.h), and it would be more than helpful if somebody could shed some light on this. Reproducible failure is observed in the release version of test_parallel_sort.exe (in the environment specified above), so floating-point numbers, for which this is intended, are not even involved yet.

I think I'm ready to wrap up support for floating-point atomics (or atomic floating points?), so I'll do that later today or tomorrow.

What I've been struggling with on my development platform (x86 (dual core)/Ubuntu 6.06/g++ 4.0.3) can be illustrated by the following test code and its output (part of a template function with T either float or double):

/* test new snapshot */ {
    #if 1
    T l_snapshot;
    #else
    union { int m_dummy; T m_snapshot; } l_union;
    #define l_snapshot (l_union.m_snapshot) // never #undef'ed
    #endif
    l_snapshot = l_atomic;
    printf("addresses are %p and %p, value is %gn", &l_snapshot, &l_intended, l_snapshot);
    ASSERT( !memcmp(&l_snapshot, &l_intended, sizeof(T)), NULL );
}

// first version, debug, float resp. double
addresses are 0xbfb36418 and 0xbfb36428, value is 1
addresses are 0xbfb363f0 and 0xbfb36410, value is 1

// first version, release, float resp. double
addresses are 0xbff3b244 and 0xbff3b250, value is 1
addresses are (nil) and 0xbff3b240, value is 1
make[1]: *** [test_tbb_plain] Segmentation fault

// second version, debug, float resp. double
addresses are 0xbf8f1d18 and 0xbf8f1d28, value is 1
addresses are 0xbf8f1cf0 and 0xbf8f1d10, value is 1

// second version, release, float resp. double
addresses are 0xbfbe0520 and 0xbfbe0530, value is 1
addresses are 0xbfbe04f0 and 0xbfbe0520, value is 1

Any consoling words?

Quoting - Raf Schietekat

I think I'm ready to wrap up support for floating-point atomics (or atomic floating points?), so I'll do that later today or tomorrow.

What I've been struggling with on my development platform (x86 (dual core)/Ubuntu 6.06/g++ 4.0.3) can be illustrated by the following test code and its output (part of a template function with T either float or double):

/* test new snapshot */ {
    #if 1
    T l_snapshot;
    #else
    union { int m_dummy; T m_snapshot; } l_union;
    #define l_snapshot (l_union.m_snapshot) // never #undef'ed
    #endif
    l_snapshot = l_atomic;
    printf("addresses are %p and %p, value is %g\n", &l_snapshot, &l_intended, l_snapshot);
    ASSERT( !memcmp(&l_snapshot, &l_intended, sizeof(T)), NULL );
}

// first version, debug, float resp. double
addresses are 0xbfb36418 and 0xbfb36428, value is 1
addresses are 0xbfb363f0 and 0xbfb36410, value is 1

// first version, release, float resp. double
addresses are 0xbff3b244 and 0xbff3b250, value is 1
addresses are (nil) and 0xbff3b240, value is 1
make[1]: *** [test_tbb_plain] Segmentation fault

// second version, debug, float resp. double
addresses are 0xbf8f1d18 and 0xbf8f1d28, value is 1
addresses are 0xbf8f1cf0 and 0xbf8f1d10, value is 1

// second version, release, float resp. double
addresses are 0xbfbe0520 and 0xbfbe0530, value is 1
addresses are 0xbfbe04f0 and 0xbfbe0520, value is 1

Any consoling words?

It appears that in the first version's Release configuration the compiler optimized l_snapshot into a register (i.e., it has no address). This is not necessarily a problem, because when it shows up, the compiler has figured out that the variable is not used outside the scope of the code section; and if that is the case, there is no reference to the variable passed on to a different thread, and thus no requirement to have made the variable atomic in the first place.

If your application were subsequently modified to pass the address (reference) on to another function, then the compiler's optimization code would (should) recognise this and store the variable in memory (or observe what happens during potential inlining of the called code).

There may be some odd circumstances whereby a shared variable might be presumed by the compiler to be referenced only by the code in the current scope. Unfortunately, there is no "no register" attribute that you can assign to local variables.

You could use volatile, but that may have undesirable side effects.

Jim Dempsey

Shouldn't any use of a pointer to the variable be a sufficient cue for the compiler not to interfere? I do think it's a bug, although I have not yet figured out exactly what triggers it.

"and thus no requirement to have made the variable atomic in the first place" Well, l_snapshot isn't, only l_atomic. The full code is in test_atomic.cpp, coming up shortly.

Latest and greatest (also officially contributed), based on tbb21_20080925oss (a development release), tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3, see README.atomic for applying the patch.

Attachment: atomic

It appears that for more recent g++ compilers than my development platform I will have to expand the use of unions to get around the strict-aliasing police. Will this impose a performance penalty, I wonder, and does this mean that the code should be forked for atomic floating-point numbers? Or is there another suitable workaround to reinterpret a value representation?

Also, I've seen test_atomic.exe failing for wchar_t (sic!) at line test_atomic.cpp:213 on x64/Linux/g++ 4.1.1 (the test doesn't continue after that point).

So unless you (want to test floating-point atomics and (can avoid optimised builds or happen to not have problems on your platform)), you may still be better served with atomic.Raf_Schietekat.20081012.zip (stick to non-floating-point fundamental types, though).

Quoting - Raf Schietekat

It appears that for more recent g++ compilers than my development platform I will have to expand the use of unions to get around the strict-aliasing police. Will this impose a performance penalty, I wonder, and does this mean that the code should be forked for atomic floating-point numbers? Or is there another suitable workaround to reinterpret a value representation?

Also, I've seen test_atomic.exe failing for wchar_t (sic!) at line test_atomic.cpp:213 on x64/Linux/g++ 4.1.1 (the test doesn't continue after that point).

So unless you (want to test floating-point atomics and (can avoid optimised builds or happen to not have problems on your platform)), you may still be better served with atomic.Raf_Schietekat.20081012.zip (stick to non-floating-point fundamental types, though).

Well, the standard does say that reinterpret_cast functionality is compiler-dependent, so I guess it's not a bug in g++; besides, g++ has become the aliasing police as of late. I've encountered this issue way too many times since I upgraded my g++ compiler.

Kind of out of context (as it's not related to floating point, but to int32 and int64), I'll quote the g++ known bugs/non-bugs docs:

This is often caused by a violation of aliasing rules, which are part of the ISO C standard. These rules say that a program is invalid if you try to access a variable through a pointer of an incompatible type.

The aliasing rules were designed to allow compilers more aggressive optimization. Basically, a compiler can assume that all changes to variables happen through pointers or references to variables of a type compatible to the accessed variable. Dereferencing a pointer that violates the aliasing rules results in undefined behavior.

Now the solution to these problems in g++ is described here:

To fix the code above, you can use a union instead of a cast (note that this is a GCC extension which might not work with other compilers):

Which means that it's not easy to have a single code path that will work correctly on all compilers. One option is to compile with -fno-strict-aliasing, which will make g++ not cause the aliasing problems.

But did you try it, Robert?

Quoting - Raf Schietekat

But did you try it, Robert?

Not yet, got a busy dead-line this week.

I found that wchar_t failed with an unimaginable off-by-one error! Anybody who can explain *that* is welcome to do so...

This revised include/tbb/atomic.h file, making use of a new __TBB_type_punning_cast template, seems to work fine on x64/Linux/g++ 4.1.{1,2}.

Attachment: atomic.h (50.67 KB)

Latest and greatest (also officially contributed), based on tbb21_20080925oss (a development release), tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3, see README.atomic for applying the patch.

Brushed up based on insight from tbb21_20090313oss for step size, some problem fixes.


Another thing I feel is missing from atomic is what I'd call "tentative read": a completely unprotected read of the atomic's value, with no guarantee of correctness.

At least by my understanding of the performance characteristics of atomic operations, this would be valuable in conjunction with compare_and_swap:

tbb::atomic<unsigned> a;

void f() {
    unsigned previous = a;
    while(true) {
        unsigned replacement = /* ... */;
        unsigned tmp = a.compare_and_swap(replacement, previous);
        if (previous==tmp) break;
        previous = tmp;
    }
}

In that code fragment, note that the initial value of "previous" is unimportant to correctness; it's only there for performance, to improve the chances of the compare_and_swap succeeding first time round the while loop. My hunch is that the expense of a __TBB_load_with_acquire() is a poor trade-off compared with a simple load of the value.

"Another thing I feel is missing from atomic is what I'd call "tentative read": a completely unprotected read of the atomic's value, with no guarantee of correctness."
The patch offers tbb::atomic::load(), which can safely be used for this purpose.

Note that the patch itself so far entirely bypasses the atomics machinery for this purpose, which may be faster still; if generic atomics are ever supported, including on types that can have invalid states with associated misbehaviour, this may need to be revised (perhaps we're already there with signalling floating-point values?).

Quoting - Raf Schietekat
Latest and greatest (also officially contributed), based on tbb21_20080925oss (a development release), tested for x86 (dual core)/Ubuntu 6.06/g++ 4.0.3, see README.atomic for applying the patch.

Brushed up based on insight from tbb21_20090313oss for step size, some problem fixes.

Also tested successfully on x86-64 (dual core)/Red Hat 5.1/g++ 4.1.2 (both original and patched tbb21_20080925oss seem to have pop_if_present problems in release-build test_concurrent_queue_v2.exe, though).

On PA-RISC 2.0 (multi-core)/HP-UX 11i/g++ 3.4.6 it failed in debug-build ./test_concurrent_vector.exe ("terminate called after throwing an instance of 'std::bad_alloc'" and/or just being very busy without making any progress). I may have another look later at what's happening.

Please feel free to test on your favourite platform, and report back. Other than adding to atomic per se, the patch intends to port TBB to the following nine architectures (each time for GNU g++, and occasionally also for another compiler), but it may still require some polishing: Alpha, ARM, Itanium, MIPS, PA-RISC, POWER/PowerPC, SPARC, x86(-64), ESA/390-z/Architecture.

Perhaps somewhat disturbingly, this morning I awoke early thinking that very likely it is a mistake, in all of tbb21_20090313oss, by consequence the patch, but also in the N2800 draft for C++0x, to provide "T*" instead of "const T*" implicit value access to atomic<T*>, either in the form "operator value_type() const" that is common to all atomic specialisations (which makes it unnecessary to define "operator*" as for smart pointers), or in the specific form "T* operator->() const".

The problem is that the associated fence is on the wrong side for writing to the referent. It is all right to do "v1=*a1;" or "v2=a2->m;", which will acquire the referent just after loading the pointer, so that this will synchronise with code that released the referent before storing the pointer, but code like "*a1=v1;", or "a2->m=v2;" will release local memory, store the pointer, and then update the referent... only locally!

What if, instead, "operator value_type() const" were hidden (either by only providing it in other specialisations or only declaring it for pointer atomics), and both "const X& operator*() const" and "const X* operator->() const" were provided for pointer atomics? The user would be less likely to make an accidental mistake if setting up the referent were separate from storing the pointer, and having both these operators restores the similarity with smart pointers.

So what do you think?

Note that "but code like "*a1=v1;", or "a2->m=v2;" will release local memory, store the pointer, and then update the referent... only locally!" is obviously wrong: it should be "but code like "*a1=v1;", or "a2->m=v2;" will load the pointer, acquire the referent, and then locally update it, which does nothing for synchronisation.".

Quoting - Raf Schietekat
but code like "*a1=v1;", or "a2->m=v2;" will load the pointer, acquire the referent, and then locally update it, which does nothing for synchronisation.

I asked in the team, and we agreed that you probably should not worry, because the case described is covered by atomic<T*>, or atomic<atomic<T>*> (an atomic pointer to an atomic variable) if access to both the pointer itself and the referent should be synchronized.

As for atomic<T*> itself, its purpose, I feel, is to provide a consistent view of the pointer value across threads, but not of the referent's value. E.g. in the implementation of queuing_mutex, the tail of the queue is an atomic pointer to the scoped_lock node object, and the actual data in that object do not matter much when the node at the tail changes.
