Atomic floats, is this implementation safe?

I made an atomic float, and it's probably not blazingly fast (that's OK), but it's faster than wrapping a lock around a float, and it works. But I'm not sure if this is because of my good luck, or if this is actually thread safe. I think it is... but you never know, so I came to ask the experts :)

struct AtomicFloat : public tbb::atomic<size_t>
{
    float compare_and_swap(float value, float compare)
    {
        size_t value_ = tbb::atomic<size_t>::compare_and_swap((size_t&)value, (size_t&)compare);
        return reinterpret_cast<float&>(value_);
    }

    float operator=(float value)
    {
        size_t value_ = (*this).tbb::atomic<size_t>::operator=((size_t&)value);
        return reinterpret_cast<float&>(value_);
    }

    operator float()
    {
        return reinterpret_cast<float&>(*this);
    }

    float operator+=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<float&>(*this);
            new_value_ = old_value_ + value;
        } while(compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_;
    }
};

Also, as a caveat, I'm placing a static assert that sizeof(float) == sizeof(size_t).
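In code, that check can be done pre-C++11 with the negative-array-size trick; a sketch (the typedef name is made up):

typedef char AtomicFloatSizeCheck[sizeof(float) == sizeof(size_t) ? 1 : -1]; // compile error if the sizes differ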

Thanks!

I dare not look too closely, but would you pick my lottery numbers next week?

How is this used? Has anybody else missed atomic floats? What operations should be supported?

Quoting - Raf Schietekat

I dare not look too closely, but would you pick my lottery numbers next week?

How is this used? Has anybody else missed atomic floats? What operations should be supported?

Sure I'll pick your lottery numbers :)

There you go 3-7-12-23-24.

But on a more serious note, I want/plan to support all generic tbb::atomic operations with the same interface, if this is a viable (thread-safe) solution.

And their intended use is as data-members on objects.

class ThreadSafeActor
{
    atomic<float> fingers;
    atomic<float> thirst;
    atomic<float> hunger;
};

So you see, I'd rather have atomic floats than locks on the object or on each data member. And so far it amazingly works while running 10 threads concurrently on the same object. Nothing has broken so far...

----------

Edit: Performance-wise, using a locked float for 5,000,000 += operations with 100 threads on my machine takes 3.6s, while my atomic float, even with its silly do-while, takes 0.2s to do the same work. So the >30x performance boost means it's worth it, (and this is the catch) if it's correct.
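(For reference, "wrapping a lock around a float" here means roughly the following sketch; LockedFloat is a made-up name, not the actual test code:)

#include "tbb/mutex.h"

struct LockedFloat
{
    tbb::mutex mutex_;
    float value_;

    float operator+=(float addend)
    {
        tbb::mutex::scoped_lock lock(mutex_); // released when lock goes out of scope
        return value_ += addend;
    }
};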

Here is a full code listing that works and is about 30x faster than a locked float. But having someone from Intel say whether this is kosher or not would be great.

struct AtomicFloat : public tbb::atomic<uint_32>
{
    template<tbb::memory_semantics M>
    float fetch_and_add( float addend )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_add<M>((uint_32&)addend);
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_add( float addend )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_add((uint_32&)addend);
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float fetch_and_increment()
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_increment<M>();
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_increment()
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_increment();
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float fetch_and_decrement()
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_decrement<M>();
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_decrement()
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_decrement();
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float fetch_and_store( float value )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_store<M>((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_store( float value )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::fetch_and_store((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float compare_and_swap( float value, float comparand )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::compare_and_swap<M>((uint_32&)value, (uint_32&)comparand);
        return reinterpret_cast<const float&>(value_);
    }

    float compare_and_swap( float value, float comparand )
    {
        const uint_32 value_ = tbb::atomic<uint_32>::compare_and_swap((uint_32&)value, (uint_32&)comparand);
        return reinterpret_cast<const float&>(value_);
    }

    operator float() const volatile // volatile qualifier here for backwards compatibility
    {
        const uint_32 value_ = (*this);
        return reinterpret_cast<const float&>(value_);
    }

    float& _internal_reference() const
    {
        return reinterpret_cast<float&>(this->my_value);
    }

    float operator=(float value)
    {
        const uint_32 value_ = (*this).tbb::atomic<uint_32>::operator=((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    float operator+=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(*this);
            new_value_ = old_value_ + value;
        } while(compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_;
    }

    float operator-=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(*this);
            new_value_ = old_value_ - value;
        } while(compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_;
    }

    float operator++()
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(*this);
            new_value_ = old_value_++;
        } while(compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_;
    }

    float operator--()
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(*this);
            new_value_ = old_value_--;
        } while(compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_;
    }
};


I would start by not inheriting from another atomic (maybe privately), using tbb::atomic_word instead (even if uint32_t is already a big improvement over size_t), rewriting the operations involving addition (oops!), thinking about what equality really means, ...

Quoting - Raf Schietekat

I would start by not inheriting from another atomic (maybe privately), using tbb::atomic_word instead (even if uint32_t is already a big improvement over size_t), rewriting the operations involving addition (oops!), thinking about what equality really means, ...

I miss being able to edit later...

atomic_word is not in tbb, of course, which makes an assert look more attractive against the alternative of emulating or blindly trusting atomic_word.

Fixed some stuff and removed some stuff... especially the additions, oops! :)

struct AtomicFloat
{
    tbb::atomic<uint_32> atomic_value_;

    template<tbb::memory_semantics M>
    float fetch_and_store( float value )
    {
        const uint_32 value_ = atomic_value_.fetch_and_store<M>((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    float fetch_and_store( float value )
    {
        const uint_32 value_ = atomic_value_.fetch_and_store((uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    template<tbb::memory_semantics M>
    float compare_and_swap( float value, float comparand )
    {
        const uint_32 value_ = atomic_value_.compare_and_swap<M>((uint_32&)value, (uint_32&)comparand);
        return reinterpret_cast<const float&>(value_);
    }

    float compare_and_swap( float value, float comparand )
    {
        const uint_32 value_ = atomic_value_.compare_and_swap((uint_32&)value, (uint_32&)comparand);
        return reinterpret_cast<const float&>(value_);
    }

    operator float() const volatile // volatile qualifier here for backwards compatibility
    {
        const uint_32 value_ = atomic_value_;
        return reinterpret_cast<const float&>(value_);
    }

    float operator=(float value)
    {
        const uint_32 value_ = (atomic_value_ = (uint_32&)value);
        return reinterpret_cast<const float&>(value_);
    }

    float operator+=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(atomic_value_);
            new_value_ = old_value_ + value;
        } while(compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_;
    }

    float operator*=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(atomic_value_);
            new_value_ = old_value_ * value;
        } while(compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_;
    }

    float operator/=(float value)
    {
        volatile float old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const float&>(atomic_value_);
            new_value_ = old_value_ / value;
        } while(compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_;
    }

    float operator-=(float value)
    {
        return this->operator+=(-value);
    }

    float operator++()
    {
        return this->operator+=(1);
    }

    float operator--()
    {
        return this->operator+=(-1);
    }

    float fetch_and_add( float addend )
    {
        return this->operator+=(addend);
    }

    float fetch_and_increment()
    {
        return this->operator+=(1);
    }

    float fetch_and_decrement()
    {
        return this->operator+=(-1);
    }
};

I would omit the notions of increment and decrement (a normal float does not have them either), but you are still evaluating the outcome of compare_and_swap as if you didn't have to care about the binary representation of plus/minus zero.

Quoting - Raf Schietekat

I miss being able to edit later...

atomic_word is not in tbb, of course, which makes an assert look more attractive against the alternative of emulating or blindly trusting atomic_word.

The underlying compare and swap is operating on a 32-bit value when used for float, and presumably later on a 64-bit value when used for double. The values as used by the compare and swap pass through general purpose registers and not through the FPU or SSE/MMX registers. Therefore, it may be advisable in your compare_and_swap not to obtain the old value (for use as comparand) by way of C++ statements that treat the copy operation as a copy of floats, which may use FPU, SSE/MMX or other instructions on different processors. Instead, I would suggest copying the old value, for use as comparand, using a statement that casts the variable as int32 or int64 as the case may be. Using the cast as float may exhibit problems with old values containing denormalized numbers, NaNs, +/- Infinities, -0, etc.
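Something along these lines; a sketch only, with load_bits_as_float being a hypothetical helper:

#include <cstring>

typedef unsigned int uint_32;

// Read the shared cell as an integer so denormals/NaNs/infinities pass through
// untouched, then view the same bits as a float only when doing arithmetic.
inline float load_bits_as_float(const volatile uint_32* cell)
{
    const uint_32 bits = *cell;                 // plain 32-bit load, general purpose registers
    float result;
    std::memcpy(&result, &bits, sizeof result); // reinterpret the bits, no conversion
    return result;
}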

Jim Dempsey

www.quickthreadprogramming.com

Quoting - Raf Schietekat

I would omit the notions of increment and decrement (a normal float does not have them either), but you are still evaluating the outcome of compare_and_swap as if you didn't have to care about the binary representation of plus/minus zero.

Sorry, I should have checked first: increment and decrement are defined on all arithmetic and pointer types, which includes floating point. Even a bool can be incremented (which results in value true)... but not decremented; I'm happy that I didn't know this (although now I do).
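(For the curious, in code:)

bool b = false;
b++;   // legal (though deprecated): b becomes true
// b--; // error: operator-- is not defined for bool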

Here is the latest iteration of the code, any comments?

#include "tbb/atomic.h"

typedef unsigned int uint_32;
typedef __TBB_LONG_LONG uint_64;

template<typename FLOATING_POINT, typename MEMORY_BLOCK>
struct atomic_float_
{
    /* CRC Card -----------------------------------------------------
    | Class: atomic float template class
    |
    | Responsibility: handle integral atomic memory as if it were a float,
    |                 but partially bypassing the FPU, SSE/MMX, so it is
    |                 slower than a true float, but faster and smaller
    |                 than a locked float.
    |                 *Warning* If your float usage is thwarted by
    |                 the A-B-A problem this class isn't for you.
    |                 *Warning* The atomic specification says we return
    |                 values, not l-values. So (i = j) = k doesn't work.
    |
    | Collaborators: Intel's tbb::atomic handles memory atomicity
    ----------------------------------------------------------------*/
    typedef atomic_float_<FLOATING_POINT, MEMORY_BLOCK> self_t;

    tbb::atomic<MEMORY_BLOCK> atomic_value_;

    template<tbb::memory_semantics M>
    FLOATING_POINT fetch_and_store( FLOATING_POINT value )
    {
        const MEMORY_BLOCK value_ =
            atomic_value_.template fetch_and_store<M>((MEMORY_BLOCK&)value);
        // atomic specification requires returning old value, not new one
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    FLOATING_POINT fetch_and_store( FLOATING_POINT value )
    {
        const MEMORY_BLOCK value_ =
            atomic_value_.fetch_and_store((MEMORY_BLOCK&)value);
        // atomic specification requires returning old value, not new one
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    template<tbb::memory_semantics M>
    FLOATING_POINT compare_and_swap( FLOATING_POINT value, FLOATING_POINT comparand )
    {
        const MEMORY_BLOCK value_ =
            atomic_value_.template compare_and_swap<M>((MEMORY_BLOCK&)value, (MEMORY_BLOCK&)comparand);
        // atomic specification requires returning old value, not new one
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    FLOATING_POINT compare_and_swap( FLOATING_POINT value, FLOATING_POINT comparand )
    {
        const MEMORY_BLOCK value_ =
            atomic_value_.compare_and_swap((MEMORY_BLOCK&)value, (MEMORY_BLOCK&)comparand);
        // atomic specification requires returning old value, not new one
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    operator FLOATING_POINT() const volatile // volatile qualifier here for backwards compatibility
    {
        const MEMORY_BLOCK value_ = atomic_value_;
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    // Note: atomic specification says we return a copy of the base value, not an l-value
    FLOATING_POINT operator=(FLOATING_POINT rhs)
    {
        const MEMORY_BLOCK value_ = (atomic_value_ = (MEMORY_BLOCK&)rhs);
        return reinterpret_cast<const FLOATING_POINT&>(value_);
    }

    // Note: atomic specification says we return an l-value when operating among atomics
    self_t& operator=(self_t& rhs)
    {
        atomic_value_ = rhs.atomic_value_;
        return *this;
    }

    FLOATING_POINT& _internal_reference() const
    {
        return reinterpret_cast<FLOATING_POINT&>(atomic_value_._internal_reference());
    }

    FLOATING_POINT operator+=(FLOATING_POINT value)
    {
        FLOATING_POINT old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);
            new_value_ = old_value_ + value;
            // floating point binary representation is not an issue because
            // we are using our self's compare_and_swap, thus comparing floats with floats
        } while(self_t::compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_; // return resulting value
    }

    FLOATING_POINT operator*=(FLOATING_POINT value)
    {
        FLOATING_POINT old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);
            new_value_ = old_value_ * value;
            // floating point binary representation is not an issue because
            // we are using our self's compare_and_swap, thus comparing floats with floats
        } while(self_t::compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_; // return resulting value
    }

    FLOATING_POINT operator/=(FLOATING_POINT value)
    {
        FLOATING_POINT old_value_, new_value_;
        do
        {
            old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);
            new_value_ = old_value_ / value;
            // floating point binary representation is not an issue because
            // we are using our self's compare_and_swap, thus comparing floats with floats
        } while(self_t::compare_and_swap(new_value_, old_value_) != old_value_);
        return new_value_; // return resulting value
    }

    FLOATING_POINT operator-=(FLOATING_POINT value)
    {
        return this->operator+=(-value); // return resulting value
    }

    // Prefix operator
    FLOATING_POINT operator++()
    {
        return this->operator+=(1); // return resulting value
    }

    // Prefix operator
    FLOATING_POINT operator--()
    {
        return this->operator+=(-1); // return resulting value
    }

    // Postfix operator
    FLOATING_POINT operator++(int)
    {
        const FLOATING_POINT temp = *this;
        this->operator+=(1);
        return temp; // return original value
    }

    // Postfix operator
    FLOATING_POINT operator--(int)
    {
        const FLOATING_POINT temp = *this;
        this->operator+=(-1);
        return temp; // return original value
    }

    FLOATING_POINT fetch_and_add( FLOATING_POINT addend )
    {
        const FLOATING_POINT old_value_ = *this; // bit-preserving read via operator FLOATING_POINT()
        this->operator+=(addend);
        // atomic specification requires returning old value, not new one as in operator x=
        return old_value_;
    }

    FLOATING_POINT fetch_and_increment()
    {
        const FLOATING_POINT old_value_ = *this;
        this->operator+=(+1);
        // atomic specification requires returning old value, not new one as in operator x=
        return old_value_;
    }

    FLOATING_POINT fetch_and_decrement()
    {
        const FLOATING_POINT old_value_ = *this;
        this->operator+=(-1);
        // atomic specification requires returning old value, not new one as in operator x=
        return old_value_;
    }
};

typedef atomic_float_<float, uint_32> AtomicFloat;
typedef atomic_float_<double, uint_64> AtomicDouble;

???

Robert

old_value_ = reinterpret_cast<const FLOATING_POINT&>(atomic_value_);
new_value_ = old_value_ + value;
// floating point binary representation is not an issue because
// we are using our self's compare_and_swap, thus comparing floats with floats

If, in obtaining old_value_, the compiler generates FPU instructions to copy the data, AND if atomic_value_ contains a denormalized number, then old_value_ will never == atomic_value_ and therefore the compare and swap will always fail. In the case of the denormalized number, new_value_ will compute to the value you desire but can never get stored due to the binary difference between old_value_ and atomic_value_. SSE and MMX instructions may work (garbage in, garbage out). However, consider first-time use of atomic_value_ containing uninitialized data, which could potentially be any of several bit patterns that are not floating point numbers. The safest way is to copy to old_value_ using a cast on both sides as uint32 (of your flavor). Then the bit pattern that was in atomic_value_ is now in old_value_, and if the number were "not a normal number" the attempted conversion would occur at the "+" and not at the first "="; and whatever the result, it would be storable back into atomic_value_.

The main problem is not with operator +=, since, presumably with good coding practice, this will occur only after at least one operator = that initializes atomic_value_ to a valid floating point number (but NaNs and other nonsense can get stored too). If operator = uses compare_and_swap you must match the bit pattern and not use numeric equivalence (which may have a different bit pattern).

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

[...]

Thanks for the in-depth explanation, I had misunderstood your point in your previous post.

But now I can see the possible danger there. Going back to fixing :)

I'm still wondering whether this is an ad hoc exercise or an indication that I should add floating-point atomics to the "Additions to atomic" patch.

During your fixing, remember that just using bit patterns would be oblivious to the equality of plus zero and minus zero for compare_and_swap, but maybe compare_and_swap should be deprecated against a compare_and_store (see the patch: in/out comparand as first parameter, new value as second parameter, third parameter indicating whether spurious failure is disallowed, boolean success status returned). Although, it could also be another parameter: by default minus zero equals plus zero, unless the user decides to differentiate them.
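In rough code, the shape described above would be something like this sketch (the signature follows my description of the patch, so treat it as an assumption rather than the actual patch code):

template<typename T>
bool compare_and_store(tbb::atomic<T>& target, T& comparand, T new_value,
                       bool allow_spurious_failure = false)
{
    (void)allow_spurious_failure;  // a plain CAS never fails spuriously here
    const T observed = target.compare_and_swap(new_value, comparand);
    if (observed == comparand)
        return true;
    comparand = observed;          // hand the fresh snapshot back for the retry
    return false;
}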

Quoting - Raf Schietekat

I'm still wondering whether this is an ad hoc exercise or an indication that I should add floating-point atomics to the "Additions to atomic" patch.

During your fixing, remember that just using bit patterns would be oblivious to the equality of plus zero and minus zero for compare_and_swap, but maybe compare_and_swap should be deprecated against a compare_and_store (see the patch: in/out comparand as first parameter, new value as second parameter, third parameter indicating whether spurious failure is disallowed, boolean success status returned). Although, it could also be another parameter: by default minus zero equals plus zero, unless the user decides to differentiate them.

Well, I think atomic floats are a necessary feature, but then I'm the one that began the thread.

My line of thought here is that in theory, and in most text-book examples of threading, floats can be easily dismissed.

Also, when working with graphics, no one will want to use atomic floats.

But in my case, working on a multithreaded event-driven simulation (in this case a game server), being able to use atomic floats as object data members can seriously reduce the amount of locks (and a lock is fairly big when compared with a float, and even most objects). This improves memory usage (can keep more objects hot) and reduces thread convoying, overall making things snappier.

It also helps keep the code cleaner and removes lots of potential deadlocks that could happen if someone isn't careful. Now if my simulation were non-event-driven, say just calculating some complex end-state, I'd be better off using vectorization, just as with graphics. But alas it's not.

The only down-side here is that without the proper instruction set atomic floats are kind of a hack (A-B-A problem), but a useful tool if used appropriately.

Nevertheless, perhaps the real solution in the long run isn't atomics, and instead it's transactional memory; but since I'm working on a commercial product with deadlines, Intel's STM won't do me any good for now. And adoption of transactional memory, if it catches on, will probably be something that happens several years down the road. So (pseudo-)atomic floats in the meantime would probably be welcomed by many.

"a lock is are fairly big when compared with a float" I think you should test your assumptions again with tbb::spin_mutex for a lock, both about size (currently always one byte) and about performance (may be substantially better than what you've seen, especially if you keep the float and the lock in the same cache line and are careful about what else to allow there).

"removes lots of potential deadlocks" That's where scoped locks come in (assuming the thread remains active).

"A-B-A problem" Depending on your application...

"Intel's STM won't do me any good for now" What product would that be, and when is it expected?

P.S.: Actually I need either 5 out of 50 plus 2 out of 9 or 6 out of 42 for that lottery thing.

I just skimmed the thread, so maybe I missed this. What is a potential use case for an atomic float? I've thought myself about them, but I'd like to hear some ideas on how they can be useful. I did see your Actor class.

Quoting - robert.jay.gould

But in my case, working on a multithreaded event-driven simulation (in this case a game server), being able to use atomic floats as object data members can seriously reduce the amount of locks (and a lock is fairly big when compared with a float, and even most objects). This improves memory usage (can keep more objects hot) and reduces thread convoying, overall making things snappier.

A mutex can be as small as 1 bit, i.e. you can pack up to 32 mutexes into a single 32-bit word.

You can see an interesting bit-mutex implementation here:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/bbf4e0b8d1db5b4f

The interesting thing about it is that it's possible to lock/unlock an arbitrary set of mutexes with a single atomic RMW.

Or you can pack a mutex and a {pointer, number, enum, etc} into a single word, i.e. no waste of space at all.
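A minimal sketch of one such bit mutex (assumptions: tbb::atomic for the word, bare spinning, no fairness):

#include "tbb/atomic.h"

typedef unsigned int uint32;

void bit_lock(tbb::atomic<uint32>& word, unsigned bit_index)
{
    const uint32 mask = 1u << bit_index;
    for (;;)
    {
        const uint32 old_word = word;
        // try to set our bit, but only if it is currently clear
        if (!(old_word & mask)
            && word.compare_and_swap(old_word | mask, old_word) == old_word)
            return; // bit set: lock acquired
        // otherwise spin (a pause instruction would go here)
    }
}

void bit_unlock(tbb::atomic<uint32>& word, unsigned bit_index)
{
    const uint32 mask = 1u << bit_index;
    for (;;)
    {
        const uint32 old_word = word;
        if (word.compare_and_swap(old_word & ~mask, old_word) == old_word)
            return; // bit cleared: lock released
    }
}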

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - robert.jay.gould

Nevertheless, perhaps the real solution in the long run isn't atomics, and instead it's transactional memory; but since I'm working on a commercial product with deadlines, Intel's STM won't do me any good for now. And adoption of transactional memory, if it catches on, will probably be something that happens several years down the road. So (pseudo-)atomic floats in the meantime would probably be welcomed by many.

Locks will be much, much faster and more scalable. And there is no possibility of deadlock, since the transaction involves only one object.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

The September ACM Queue (http://mags.acm.org/queue/200809/) has an excellent overview of the state of Software Transactional Memory. Its subtitle is "why it is only a research toy". The article includes performance comparisons and detailed analysis of 3 TM systems. Its conclusion starts with "Based on our results, we believe that the road ahead for STM is quite challenging". It later remarks [emphasis mine]: "We observed that the TM programming model itself, whether implemented in hardware or software, introduces complexities that limit the expected productivity gains, ...".

Robert,

The atomic float operation is not an A-B-A problem. Instead, it is a read/modify/write problem whereby a competing thread can intercede between the read and the write. This adverse interaction with respect to an operator on a float can be worked around using a CAS (single-word compare and swap). The A-B-A problem typically involves a linked list operation whereby the head (or tail) pointer could point at node A on the read portion of an attempted change; then the thread stalls (interrupt or context switch); then, while the thread is stalled, the list changes state (potentially many times) and ends up with the head pointer pointing at A again. However, and this is the important part: your CAS operation may be dependent on the head pointer still holding a pointer to A, but your code may also be relying on node A not having been altered between the time you examined A and performed your CAS. If node A had been altered, your CAS would succeed (in this example), but the copied value you obtained from A just before the CAS is not consistent with the current value. The end result being you just trashed the head-of-list pointer (or tail pointer). The atomic float can be protected with the CAS; the A-B-A cannot.

To protect for A-B-A you typically use DCAS (Double word Compare And Swap) using a pointer and sequence number. This assumes your processor supports a DCAS instruction.

By the way, you are aware you could have used:

#pragma omp atomic
sum += bump;

Jim Dempsey

www.quickthreadprogramming.com

Quoting - Dmitriy V'jukov

[...]

That is a very interesting solution I hadn't thought about at all. Not sure if it's the right solution for this exact problem, but it's a very interesting concept that might help in other ways if I can get around to some refactoring there.

Quoting - jimdempseyatthecove

[...]

Thanks for the better A-B-A explanation. I had previously had the issue with a queue-based state machine (that's when I found out about the term "A-B-A problem"), and became kind of paranoid about it since it couldn't be solved with CAS (finally I just placed a lock around it since it really wasn't worth spending more time on, because the actual contention rate was very low).

As for omp atomic... well, I knew that, and use it often; I just didn't remember it when I began :)

Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible.

"Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible." So you aren't going to do a benchmark against tbb::spin_mutex?

Quoting - Raf Schietekat

"Anyways my goal was to make all the numerical members atomic, and with the same interface, since I have them bound with templates to dozens of scripting engines that have code running on them for which I'm not responsible." So you aren't going to do a benchmark against tbb::spin_mutex?

Ok, you're right, I was being lazy here :)

Anyways, the results of my lousy (simple for loop) benchmark (and no one should probably quote these numbers) were:

spin_mutex 0.0018s

atomic float 0.0036s

omp atomic 0.0036s

critical section 0.0080s

This means spin_mutex was the winner, but fatter than atomic float/omp atomic. So I'll definitely keep spin_mutex for the moment, but will conserve the atomic solutions until I can do some testing on a more realistic workload, where memory might be more critical than time, or not.

Thanks for the idea there!

So where is "the >30x performance boost" you mentioned earlier? What exactly is "critical section"? Note that I'm not sure that an atomic float might not still be a winner in a heavily oversubscribed situation (too many threads) if contention on individual atomics is low enough, depending on how much of a nuisance convoying really is (that would be interesting to know), but your implementation is still incorrect, even not taking into account what I did not mention yet, let alone what I would yet discover if I looked more closely... :-)

Quoting - Raf Schietekat

So where is "the >30x performance boost" you mentioned earlier? What exactly is "critical section"? Note that I'm not sure that an atomic float might not still be a winner in a heavily oversubscribed situation (too many threads) if contention on individual atomics is low enough, depending on how much of a nuisance convoying really is (that would be interesting to know), but your implementation is still incorrect, even not taking into account what I did not mention yet, let alone what I would yet discover if I looked more closely... :-)

The 30x performance was when using 100 threads + classes and stuff; my previous quick benchmark was with just 4 in an omp loop, a more common usage, so more representative of "lab" performance on average(?).

And as you mention, I still have a gut feeling (and some evidence) that atomic floats should easily outperform spin locks (at least in my case, not an HPC simulation), but I can trust spin_lock more than my own atomic float (for now), and it would really suck having to track down some really low-level memory bug. So I'll use spin_lock for the moment.

But now that I have a spin_locked float class, I placed it in my more realistic benchmark again, with 100 threads and a more complex scenario; this time I get:

atomic float: ~0.3 (it became a tad slower with the extra binary check in there)

spin_mutex: ~0.6

critical section: ~5.0 (omp critical)

mutex: ~8.0 (non-recursive fair)

So yes, atomic float is still a magnitude (about 25x) faster than the mutex, (15x) faster than the critical section, and (2x) faster than spin_mutex.

So far atomic float's results have been ok, no bugs (noticeable ones at least), but I prefer "guilty until proven innocent" in this case, so until I can really trust atomic float, a spinlock is a good replacement for a full mutex.

But if there were a better atomic float, you bet I'd use that instead!

P.S. On a side note about performance, the overall performance (not just the tight loops) of the simulator is much better when using atomic and spin_lock than it was with Mutex (my guess is memory locality), so in reality the performance boost overall is more than what I'm showing here.

That does look interesting. Could you get some more resolution on the comparison between atomic float and spin_mutex (run the test a bit longer)?

Quoting - Raf Schietekat

That does look interesting. Could you get some more resolution on the comparison between atomic float and spin_mutex (run the test a bit longer)?

OK, I made a larger benchmark. This time all objects are tightly aligned in a vector to reduce noise in the results (a more realistic situation). The constants are: full loops = 100, objects = 1000 (each with 32 atomic/spin floats + some other stuff); the variable is the amount of work per object.

And the results are:

---------------------------------

Low contention scenario:

x1: spin = 0.909276, atomic = 0.940132
x5: spin = 4.8788, atomic = 5.03111
x10: spin = 9.75656, atomic = 10.077
x50: spin = 48.7615, atomic = 50.3652
x100: spin = 97.5665, atomic = 100.776

----------------------------------

Moderate contention scenario: (not much contention)

x1: spin = 0.94363, atomic = 0.994446
x5: spin = 5.04714, atomic = 5.33861
x10: spin = 9.99775, atomic = 10.5859
x50: spin = 50.0049, atomic = 52.8795
x100: spin = 100.084, atomic = 105.835

---------------------------------

High contention scenario:

x1: spin = 1.37543, atomic = 1.48458
x5: spin = 7.53178, atomic = 8.06872
x10: spin = 15.0603, atomic = 16.1485
x50: spin = 75.4305, atomic = 80.8925
x100: spin = 150.738, atomic = 161.545

-----------------------

Object memory usage:

Spinlocked object: 348 bytes
Atomic object: 216 bytes

Observations:

- spin locks outperform atomics a little, but both seem to be ok; both grow linearly and with a similar slope.

- dense atomic float objects are about one third smaller than dense spin float objects (good memory gain)

Guess:

- In the larger simulation, much more random and with lots of moving parts, using atomics gains us raw memory, which was probably what affected performance, not the locality of the objects.

Everybody's a winner and everybody gets a prize!

Maybe an atomic float can still be useful as a conveniently predefined standard tool (consistency is always nice), not necessarily lock-free...

Quoting - robert.jay.gould

- dense atomic float objects are about one third smaller than dense spin float objects (good memory gain)

If it is a concern and an object contains several floats, you can use a bit-mutex per cache line:

struct my
{
    // first cache line
    uint32_t bit_mutex1; // contains 15 mutexes, each occupying 1 bit
    float data1 [15];

    // second cache line
    uint32_t bit_mutex2; // contains 15 mutexes, each occupying 1 bit
    float data2 [15];

    // third cache line
    uint32_t bit_mutex3; // contains 7 mutexes, each occupying 1 bit
    float data3 [7];
};

Or if contention is not extreme, then you can use a plain spin-mutex per cache line:

struct my
{
    // first cache line
    spin_mutex mutex1;
    float data1 [15];

    // second cache line
    spin_mutex mutex2;
    float data2 [15];

    // third cache line
    spin_mutex mutex3;
    float data3 [7];
};

This is definitely a complication, but it reduces space overhead significantly.

Or if contention is moderate, then you can use just a spin-mutex per whole object:

struct my
{
    spin_mutex mutex;
    float data [37];
};

Or if space is really a concern, then you can pack the mutex into some pointer/integer/enum field:

struct my
{
    my* next; // low bit is the spin-mutex
    float data [37];
};

Or...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - robert.jay.gould

That is a very interesting solution I hadn't thought about at all. Not sure if its the right solution for this exact problem, but its a very interesting concept that might help in other ways if I can get around to some refactoring there

Be aware of dangers, however. The main one for this case is of course false sharing. You might pack 512 single-bit mutexes into a 64-byte cache line, but the cache contention will be at least the same as if you just used one mutex residing in this line, if not even bigger.

Packing a mutex into the same cache line with the data it protects is generally a good approach; it is an example where cache line sharing between different variables is appropriate.

Packing a mutex as a single bit in some other variable that does not use this bit for another purpose might be appropriate if space is really an issue. An example could be the lowest bit in a pointer, which is usually zero due to natural alignment (unless, of course, you use this pointer to iterate over a single-byte collection such as a C string). But it adds complexities both to lock & unlock operations (you should preserve the value of the variable; and in case it could change outside of the locked scope, you must also account for such a possibility) and to reads & writes (masks have to be used).
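For illustration, a sketch of the masking a read then needs (node and tagged_ptr are hypothetical names):

#include "tbb/atomic.h"

struct node;
typedef tbb::atomic<size_t> tagged_ptr; // low bit doubles as the mutex

node* read_ptr(const tagged_ptr& p)
{
    return (node*)((size_t)p & ~(size_t)1); // strip the lock bit before dereferencing
}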

Quoting - Alexey Kukanov (Intel)

Be aware of dangers, however. The main one for this case is of course false sharing. You might pack 512 single-bit mutexes into a 64-byte cache line, but the cache contention will be at least the same as if you just used one mutex residing in this line, if not even bigger.

Good point there. It might become an over-optimization that doesn't pay.

Anyways, for the moment I'll stick with Raf's idea of placing a spinlock around each float; the performance gain in the overall system didn't change that much when compared to atomic float, and it's already a huge improvement over the serial and common-mutex solutions. Also, right now memory isn't that tight, so I'll let things be, but I'll keep in mind that there is a considerable memory saving to be gained here if memory does become an issue.

Nevertheless, if someone could add a true atomic float to tbb, I'd still be happy :)

Alexey, Dmitriy, Robert,

An alternate way is to have the float data itself be the mutex. In the IEEE 754 floating point standard there exist certain bit patterns called SNaN (Signaling Not a Number). These represent invalid operations. You could elect to declare that, should your program ever generate an SNaN, the signal routine (FPU interrupt routine if one is present) is to take appropriate action or terminate the program; where non-terminating appropriate action could be to convert the SNaN into a QNaN (Quiet Not a Number). Since your program now by design can never legitimately generate an SNaN, you are free to use SNaNs for your own purpose, one or more of which you can use to express a mutex lock, others of which you can use to propagate SNaNs if you so wish. An SNaN can have a 0/1 sign bit, all 1's for the exponent (8 bits), a 0 for the msb of the 23-bit mantissa, and anything but 0 for the remaining 22 bits of the mantissa. In binary:

s111 1111 10xx xxxx xxxx xxxx xxxx xxxx

s can be 0/1
x's can express any 22-bit binary value except 0

0x7F800001 would be an example of an SNaN. The non-0 x's could also contain a 1-based thread number or 22 bits worth of bit mask.

Now to perform an atomic float reduction (operator +=):

const int32_t LOCK = 0x7F800001;
union
{
    float f;
    int32_t i;
};
while((i = XCHG(pointer, LOCK)) == LOCK)
    _mm_pause();
*pointer = f + r;

(you can add the casts)

No additional memory is consumed for the mutex, and only one memory-locked instruction is issued (the InterlockedExchange).
You may or may not wish to follow the write-back with a _mm_sfence(). But this will unnecessarily delay the thread releasing the LOCK (at the expense of a delay to a potential other thread waiting to observe the release). If the floats have very high contention and you wish some fairness, then consider using the _mm_sfence() (or a 2nd XCHG). *** Also, depending on processor architecture, the 2nd XCHG may be required: IA32 and EM64T (x64) would not require it, Itanium might (I don't use that).

Jim Dempsey

www.quickthreadprogramming.com

"An alternate way is to have the float data itself be the mutex." This should be smaller and faster than with spin_mutex, but probably only if there are few readers (with a spin_lock the value itself can still be read and written atomically in the original sense of the word), and if a thread holding a lock becomes unavailable the value will be lost, so I would hesitate picking this implementation but it may have its uses.

Quoting - Raf Schietekat

"An alternate way is to have the float data itself be the mutex." This should be smaller and faster than with spin_mutex, but probably only if there are few readers (with a spin_lock the value itself can still be read and written atomically in the original sense of the word), and if a thread holding a lock becomes unavailable the value will be lost, so I would hesitate picking this implementation but it may have its uses.

With spin_lock the protocol permits reading/using the value (float) provided the operation being performed by the owning thread results in an atomic write with respect to the memory bus (typically the case). Also, the spin_lock protocol prohibits the writing of the value by threads not holding the lock.

With the SNaN technique the protocol inhibits using the value under change during the ownership of the lock (presumably a very few statements). Also, the SNaN protocol prohibits the writing of the value by threads not holding the lock.

In both cases the protocol can be violated by a non-owning thread writing to the variable (but this is an error in programming).

Note, mutex and SNaN can be avoided by using CAS (InterlockedCompareExchange) provided it works with your programming requirements.

The question then becomes: is it desirable to permit reading of the atomic float while an atomic update is pending?

Do not be too quick to answer yes.

Permit me to make a reasonably good argument for SNaN:

One of the main reasons to use an atomic is for the purpose of a reduction variable. Assuming the float is used for a reduction variable, then the preponderant (or only) operations until the solution is complete are RMW. Should you use a disjoint mutex (i.e. not commingled with the float), then you will be performing an interlocked RMW on a mutex (optionally preceded by a read loop waiting until the mutex is free), then performing a non-interlocked RMW followed by a fenced write of the mutex. During the reduction phase of the life of the variable, the code will never simply perform a direct read without obtaining the lock (for the purpose of the RMW). After the reduction phase is complete, code typically returns to a join and proceeds to use the variable (float), but never requires it as an atomic. The variable has a split personality during its lifetime: at times it is required to be atomic, at other times it is not.

When using a separate mutex there is an advantage in placing the mutex within the same cache line as the atomic (one cache line to mess with instead of two). This not only takes up space for the mutex but also means that the atomic plus mutex are cache-line co-resident (a 64-byte structure or so depending on your implementation, but this could be 8 bytes if you are careful). Dmitriy's suggestion of packing the atomics with bit-field mutexes is either a blessing or a curse depending on the manner in which the atomics are used. If a single thread has the preponderance of updates to multiple atomics, then blessing. If multiple threads have the preponderance of updates to a particular atomic, then curse. In the second case, placing each mutex and atomic pair into a separate cache line from the other mutex and atomic pairs would be the better technique.

The placement of mutex and atomic is application dependent.

Now for the foibles of unbridled exuberance.

If one can make a single atomic, one can make an array of atomics; that must be better. If one can make an array of atomics, then one can make an stl::vector<AtomicFloat>; that must be better yet. Then one could consider using a tbb::concurrent_vector<AtomicFloat>; that has to be best yet.

I can only say that your unbridled exuberance in using the new and improved potentially comes at a severe cost (which you must be willing to pay).

Using an array of atomics (with interspersed and separate mutexes), and worse yet using either form of vector with said atomics, inhibits your ability to use the processor's small-vector instructions (e.g. SSEn.n).

If you require the use of hardware vector instructions for performance reasons, then you cannot use arrays of a cache-line-joined mutex+atomic. To a limited extent you can with Dmitriy's bitmask mutex and a small array, and then stride your way along. But then pre-fetching becomes problematic.

When it is desirable to use hardware vector instructions, you must assure that the floats are aligned properly and compacted into a contiguous array (or sub-sections of the array) of floats that do not move for the duration of the vectored loop. In this case you (may) need a lock on the vector, but you also need to assure that the vector is contiguous. stl::vector may provide contiguous storage today but is not assured to later (under a new and improved version update); tbb::concurrent_vector today does not assure that all portions of the vector are contiguous. And all flavors of mutex + atomic guarantee the floats are discontiguous.

What provides for both individual atomic and hardware-vector exclusive use is for you to roll your own container, where you have a contiguous array with a lock on the array, optionally locks on zones of the array, and potentially individual cell locks using the SNaN technique.

For an individual atomic, SNaN may be the most efficient technique.

Also not discussed yet: if coded properly, the use of SNaN permits the programmer to code read access to a lockable variable without an explicit test of whether the variable is locked. You let the hardware perform the lock test by generating a floating point exception, at which point your exception handler notes the designated SNaN and retries. If the expected collision frequency is high, then either use CAS for the reduction, or peek and test for SNaN, or use a separate mutex. Again, the application requirements determine how best to protect the float.

The biggest problem for any of these atomic reduction operations is the annoying problem of an unnecessary interruption of your code sequence progressing from the lock through the release. If a clock tick or some other event occurs between the lock and the release, then the lock is held for a very long time as opposed to a very short time. This is bad news for any thread waiting for the lock.

In various other forum messages I suggested that Intel seriously consider adding two instructions to the instruction set:

Temporarily Inhibit Interrupts (TII)
Cancel Temporary Inhibit Interrupt (CTII)

The TII could be followed by an immediate value indicating the number of instruction fetch cycles to delay acceptance of interrupts from outside sources such as the clock, but not events such as divide by 0, page fault, etc. To avoid abusive programming practices, TII could implicitly perform a CTII (should a TII be pending) and honor pending interrupts prior to entering the TII state.

The TII state then runs for the specified number of instruction fetches, or until a fault, or until a CTII occurs.

What this buys you is, excepting for a fault, your program can guarantee short-duration locks. And with the guarantee of short-duration locks you can avoid all this unnecessary coding effort to work around the occasional circumstance of the lock-owning thread being interrupted and scheduled out. This would mean you could dispense with lock-free/wait-free programming (at least for the simple cases), you could dispense with calling SwitchToThread in your busy-wait loop, and possibly dispense with _mm_pause. Programming becomes much simpler and more efficient.

Jim Dempsey

www.quickthreadprogramming.com

"Also the spin_lock protocol prohibits the writing of the value by the threads not holding the lock." Yes, I meant (only) "read" of course, not "read and written", sorry (again).

Maybe I just have not seen enough code, but for the word count in thread "Scalable hash map algorithm" http://software.intel.com/en-us/forums/showthread.php?t=60494 it was easy to find a solution that did not involve sharing data. It's only a hunch, though, that code that almost never reads can probably be optimised similarly.

Plural of "mutex" would never be "mutii" even if it were a Latin word (is it a "portmanteau word" for "mutual exclusion" or would it be called something else?); "mutices" would be more likely. But it seems fruitless to try to use Latin inflexions in English: the plural of "fructus" is still "fructus" when written. Some words only look like Latin: the plural of "octopus" would more likely be "octopodoi" (but then why is this not just a single animal with eight "feet"?), and the word itself it transcribed from something looking more like "oktopous".

You're making unfounded assumptions about my "exuberance", but the compatibility with vector instructions seems a potentially relevant consideration. Use of signalling looks quite clever, although it probably should not be used in a basic atomic.

The idea behind "Temporarily Inhibit Interrupts" has been implemented before as mentioned elsewhere on this forum, and seems very useful indeed.

Quoting - Raf Schietekat

[...]

Plural of "mutex" would never be "mutii" even if it were a Latin word (is it a "portmanteau word" for "mutual exclusion" or would it be called something else?); "mutices" would be more likely. But it seems fruitless to try to use Latin inflexions in English: the plural of "fructus" is still "fructus" when written. Some words only look like Latin: the plural of "octopus" would more likely be "octopodoi" (but then why is this not just a single animal with eight "feet"?), and the word itself it transcribed from something looking more like "oktopous".

[...]

To avoid confusion: the plural forms are just "mutexes" (if you don't want to use "mutex variables") and "octopuses" (I wish I had not used this example, though, because "octopi" is also accepted). Now I hope this will be the end of that.

P.S.: Thanks to Robert for those lottery numbers, I have now become independently wealthy... not!

I thought I might have a go at adding atomic floats, but I got stuck on the exact behaviour.

First I decided that plus zero and minus zero are equal for operator==, and must therefore be equal for compare_and_swap, and therefore also for compare_and_store (and friends) because of functional equivalence when spurious failure is disallowed. For operations where spurious failure is not disallowed, it would be undefined whether this constitutes spurious failure (any use should be prepared to retry with the updated comparand value, but there would be no guarantee that the retry would not be internalised for not being a contention-related occurrence). So far so good?

The problem with a NaN value is that it is not equal to anything, even itself (right?). So it would never be possible to replace a NaN value (using the reasoning above for plus/minus zero), and there would be numerous opportunities for infinite loops. One solution would be to not define behaviour with NaN values, but this seems rather problematic and would not win atomic a lot of friends. Another solution is to disallow NaN values and throw an exception when they are detected or generated (divide by zero). Yet another solution is not to provide compare_and_swap and use the new freedom to redefine the other compare_and_... operations to allow success based on value representation. What would it all mean for a series of divide operations?
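For illustration, a sketch of how a naive loop (in the style of the listings earlier in this thread) gets stuck:

// hypothetical instantiation of the atomic_float_ template posted above
float add_one(atomic_float_<float, uint_32>& a)
{
    float old_value, new_value;
    do {
        old_value = a;                // suppose the stored value is NaN
        new_value = old_value + 1.0f; // NaN + 1 is still NaN
        // NaN != NaN, so this value-based test can never succeed: infinite loop
    } while (a.compare_and_swap(new_value, old_value) != old_value);
    return new_value;
}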

Any solution should probably be compatible with as yet undefined atomics on general object types, which would probably be required to have an equality operator. Maybe the unifying concept could be to throw an exception whenever a value is detected that is not equal to itself?

Any thoughts?

Quoting - Raf Schietekat

To avoid confusion: the plural forms are just "mutexes" (if you don't want to use "mutex variables") and "octopuses" (I wish I had not used this example, though, because "octopi" is also accepted). Now I hope this will be the end of that.

[...]

"Now I hope this will be the end of that." And of course I could be expected to violate that myself: "octopus", while having its origin in Greek, is indeed a Latin word, but with plural form "octopodes" (third declension, stem "octopod"), so "octopi" is just plain wrong and I was right after all. :-)

Raf

The compare_and_swap, when you drill down to the instruction set of the computer (IA32, IA64, EM64T/x64), does NOT perform a == for the compare. Instead it performs a 32-bit word (32-bit aligned) or a 64-bit word/dword (64-bit aligned) or, in the case of a CMPXCHG16B, a 128-bit-aligned BINARY compare. IEEE -0 and +0 will exhibit equality on == but not when the binary bit patterns are compared using 32-bit word, 64-bit word or 128-bit bit-by-bit compares. Further, the NaNs are perfectly valid bit patterns for binary words (of respective lengths). Uninitialized data has a significant probability of being NaNs.

As I attempted to explain earlier, the CMPXCHG instruction uses the binary pattern of the word and not an equivalent float. Therefore, the old value you obtain, for later use with CMPXCHG, must be obtained using memory-to-register/register-to-memory move instructions and not by way of floating point load and floating point store, which may modify the bit pattern in the process or throw an exception in the case of an SNaN. Once the old value is obtained you are free to use it (the copy) as a float in an expression, or simply ignore it until it comes time to be used for the CMPXCHG. In this manner even a NaN can be used for the CMPXCHG while not used in an expression (e.g. when you initialize an atomic float).

Jim Dempsey

www.quickthreadprogramming.com

Yes, the instructions work on the "value representations" (bit patterns), but I was looking for a way to make things work nicely for the user with the equality test that would be performed on the result of compare_and_swap (avoiding "oops" moments where the test says yes and the swap didn't occur anyway). NaNs in uninitialised data would not be a concern because simple assignment gets rid of them easily; the concern is that naive or generic compare_and_... loops might get stuck forever, but the assumption would still be that such loops are not used with uninitialised or unassigned data: there's a difference between forgiving and totally foolproof.

Are you saying that operations on float/double variables are not restricted to their domain by default? I thought that only intermediate values in a calculation are kept in longer registers, and that these are round-trip faithful anyway. Equality tests would have to be between variables, of course, not between a variable and a calculated result.

Anyway, I'm not set on getting somewhere with this, just considering the possibilities for the moment.

"Anyway, I'm not set on getting somewhere with this, just considering the possibilities for the moment." But let's narrow it down to my favourite (and you might have guessed because I named it last)...

If an atomic floating-point type does not get to have compare_and_swap (which should probably be deprecated everywhere anyway), there is no urgent need to let minus zero pass for plus zero or vice versa (by repeated attempts with the respective value representations). It is not quite the same as spurious failure, because there you have an assumed progress guarantee ("if at first you don't succeed, try, try again"), but would a programmer really be tempted to use the same comparand every time with a floating-point type? If an atomic is used for a spin lock, it is an obvious use case to repeatedly try to substitute 1 for 0, so it helps that there is a one-to-one mapping between values (all equal to themselves and different from any other) and value representations for integral types (assuming that the compiler has done its job of normalising "true" for "bool"). But with a floating-point type, the most prominent use case is to read a value, perform a computation on it, and write back the result, and try again with the new snapshot instead if the substitution failed; it does not seem so bad to have to do this for the occasional failure based on plus/minus zero, and, because there would be no need for comparisons, no complexities would be introduced related to plus/minus zero and NaN.

Did I overlook anything?
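
In later C++11 terms (std::atomic<float> did not exist when this thread was written), the loop described above might be sketched like this; note that compare_exchange_weak compares value representations (bit patterns), so a mismatch on plus/minus zero simply causes one more retry with the fresh snapshot:

#include <atomic>

float atomic_add(std::atomic<float>& a, float addend)
{
    float old_value = a.load();
    float new_value;
    do {
        new_value = old_value + addend;
        // on failure, compare_exchange_weak reloads old_value with the
        // current contents, so the loop retries with the new snapshot
    } while (!a.compare_exchange_weak(old_value, new_value));
    return new_value;
}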

>>But with a floating-point type, the most prominent use case is to read a value, perform a computation on it, and write back the result, and try again with the new snapshot instead if the substitution failed; it does not seem so bad to have to do this for the occasional failure based on plus/minus zero, and, because there would be no need for comparisons, no complexities would be introduced related to plus/minus zero and NaN.

Did I overlook anything?<<

If you fetch and save the old value of the floating point variable using float context, the compiler may use floating point instructions to copy the data into the "old_value" (especially since it sees you performing the reduction (+) one line later). This is bad news if you use that value for the compare and exchange .AND. if the old_value was tidied up in the process of being copied. In this case you will never pass your compare_and_exchange. Pseudo code of the correct way:

float* pCell = get_address_of_index(index); // vector cannot move
float old_value, new_value; // declared outside the loop so the DCAS test can see them
do {
    *(DWORD*)&old_value = *((DWORD*)pCell); // raw 32-bit copy preserves the bit pattern
    new_value = old_value + bump_value;     // using floating point registers
} while(!DCAS(pCell, new_value, old_value)); // compare/exchange using 32-bit registers

Notes: check your flavor of DCAS for the order and kind of its arguments.
The cell pointer must remain valid for the duration of the atomic reduction.
Caution relating to vector<atomic<float> >:
Some forms of containers might not provide persistence of vector
cells across vector expansion, due to another thread potentially expanding
the vector during insertion. To protect against this you may need additional
coding (locks or other defensive coding).
I think that tbb containers do not move a cell once the cell
is used, whereas other containers make no such guarantee.
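
For instance, a minimal sketch of the get_address_of_index() helper assumed in the pseudo code above (the helper name is taken from that pseudo code), using tbb::concurrent_vector, which, unlike std::vector, never relocates a cell once it is constructed, even while other threads grow the container:

#include "tbb/concurrent_vector.h"
#include <cstddef>

tbb::concurrent_vector<float> cells; // cells never move once constructed

float* get_address_of_index(std::size_t index)
{
    // safe even during concurrent growth: concurrent_vector allocates
    // in chunks and never copies existing elements to new storage
    return &cells[index];
}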

Jim Dempsey

www.quickthreadprogramming.com

Additionally you may need to declare the old value with volatile (depending on the optimization of your compiler).

Or if using Intel C++ consider using __ld4_acq(void* src);

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

Additionally you may need to declare the old value with volatile (depending on the optimization of your compiler).

Or if using Intel C++ consider using __ld4_acq(void* src);

Jim Dempsey

Yes, it's a good idea to add the volatile to "play it safe".

As for vector<atomic<float> >: in my case I have assurance that vectors aren't modified while work takes place (pipeline-like stages: work, tidy up, work).

"tidied up in the process of copied" So the user will probably have to be instructed to always include at least two successive compare_and_store() instructions in each iteration if a back-off is included in the loop. Would compare_and_store()'s update by reference also be compromised if the compiler can look inside the implementation? Still, that would merely be annoying (the implementation can be hidden in an object file instead of inlined). BTW, in the case of an IEEE 754 implementation there would be no "tidy up" of numbers (each has a unique value representation, except for the two related to plus/minus zero), so something like "disturb" seems more appropriate.

"Pseudo code of the correct way" That's what atomic would get rid of, or else what's the point...

In addition to -0, unnormalized numbers may get tidied up too.

An atomic by itself, as a cell in a standard array, or as a member variable in a structure are all fine, as you can safely produce the & (address of) the atomic (for use in the compare_and_swap). If you abstract this further by placing the atomic into a container (vector<atomic<float> >), and if the vector implementation does not assure that once a cell is used it won't move, or if the container does not have a "do not move this cell while I hold a reference to the cell", then it will not be safe to perform an exchange or store without additional protocol to avoid trashing un-owned memory. In the case of the atomic reduction operator, the actual reduction (atomic operation) would be best placed inside the container as an operator. The container code can be written to work around issues relating to relocating cells while a reference is pending for purposes of compare_and_swap, swap, or store.

Jim Dempsey

www.quickthreadprogramming.com

"In addition to -0, unnormalized numbers may get tidied up too." Hath not a denormalised number a specific value? If you "tidy it up", does it not change? (Sorry, couldn't resist.)

Would the following set of features be acceptable with float and double: {compare_and_,fetch_and_,}store(), operator=, load(), operator value_type(), operator{+=,-=,*=,/=}?

compare_and_swap() needs to be omitted because it would be fraught with problems, as already discussed.

Integer types have their use even for operations like compare_and_or(), but with floating points it seems better to wipe the slate clean than to add, for consistency, numerous operations referring to multiplication and division.

Still, it does not seem fully satisfactory: maybe operators ++ and -- should still be available. But then why not have ++ for bool? And *= and /= for integers? And what about >>= and <<=? But adding is always nicer than taking away, so...
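
Spelled out as a declaration-only sketch (names and signatures assumed here, with compare_and_store() doing its update by reference as discussed above):

template<typename F> // F = float or double
struct atomic_fp
{
    F load() const;
    operator F() const;                 // same as load()
    F store(F desired);
    F operator=(F desired);             // same as store()
    F fetch_and_store(F desired);       // unconditional exchange, returns previous value
    bool compare_and_store(F& expected, F desired); // on failure, updates expected
    F operator+=(F);
    F operator-=(F);
    F operator*=(F);
    F operator/=(F);
};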

Quoting - Raf Schietekat

Would the following set of features be acceptable with float and double: {compare_and_,fetch_and_,}store(), operator=, load(), operator value_type(), operator{+=,-=,*=,/=}?

compare_and_swap() needs to be omitted because it would be fraught with problems, as already discussed.

Integer types have their use even for operations like compare_and_or(), but with floating points it seems better to wipe the slate clean than to add, for consistency, numerous operations referring to multiplication and division.

Still, it does not seem fully satisfactory: maybe operators ++ and -- should still be available. But then why not have ++ for bool? And *= and /= for integers? And what about >>= and <<=? But adding is always nicer than taking away, so...

There is nothing wrong with a compare and swap on floating point data as long as the comparand is obtained using register moves (as opposed to floating point load and store). Example of an atomic add for double:

// requires <windows.h> for LONGLONG and InterlockedCompareExchange64
double AtomicAdd(double* pd, double d)
{
    union
    {
        LONGLONG iOld;
        double dOld;
    };
    union
    {
        LONGLONG iNew;
        double dNew;
    };
    while(true)
    {
        iOld = *(volatile __int64*)pd; // raw 64-bit load of the current old value
        dNew = dOld + d;               // the addition itself uses floating point registers
        if(InterlockedCompareExchange64(
            (volatile LONGLONG*)pd, // loc
            iNew,                   // xchg
            iOld)                   // cmp
            == iOld)
            return dNew;
    }
}
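
The 32-bit float analogue (not shown in the post above) would follow the same pattern; a sketch, assuming <windows.h> for LONG and InterlockedCompareExchange:

float AtomicAdd(float* pf, float f)
{
    union
    {
        LONG iOld;
        float fOld;
    };
    union
    {
        LONG iNew;
        float fNew;
    };
    while(true)
    {
        iOld = *(volatile LONG*)pf; // raw 32-bit load preserves the bit pattern
        fNew = fOld + f;
        if(InterlockedCompareExchange((volatile LONG*)pf, iNew, iOld) == iOld)
            return fNew;
    }
}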

Jim Dempsey

www.quickthreadprogramming.com
