Race in spawn/gate?

Race in spawn/gate?

I am looking at the sources of Intel TBB (20080408oss version), and I
can't understand one moment. It looks like a data race which can lead
to worker thread blocking for arbitrary amount of time even if there
is work to steal. I don't fully examine all the code, so maybe I am
wrong... I hope so.

It's about class Gate (actually eventcount) and function
GenericScheduler::spawn().

My understanding is that spawn() must be implemented this way:

// pseudo-code
void spawn(task* t)
{
make_task_available_for_other_threads(t);
store_load_memory_fence(); // crucial!
gate_state_t state = gate.get_state();
if (need_signalling(state))
gate.signal();

}

In TBB implementation I don't see store_load_memory_fence().
Here is cuts from TBB implementation with my comments:

void GenericScheduler::spawn( task& first, task*& next ) {

// a lot of irrelevant code skipped

if( d>deepest )
deepest = d;
if( dprefix().steal_begin )
tp->prefix().steal_begin = d;

// at this point task is available to other threads

// no store-load fence, only store-release
release_task_pool();

mark_pool_full();

}

inline void GenericScheduler::mark_pool_full() {
// naked load
Gate::state_t snapshot = arena->prefix().gate.get_state();
// check whether we need to signal gate
if( snapshot!=SNAPSHOT_FULL
&& snapshot!=SNAPSHOT_PERMANENTLY_OPEN )
// signalling
arena->prefix().gate.try_update(
SNAPSHOT_FULL, SNAPSHOT_PERMANENTLY_OPEN, true );
}

state_t Gate::get_state() const {
// naked load
return state;
}

So I don't see store_load_memory_fence() between making task available
to other threads and loading of gate (eventcount) state. This can lead
to situation when thread, which executes GenericScheduler::spawn(),
will incorrectly decide to *NOT* signal gate, when it actually has to.

Where I am not right?

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
20 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

I would think that a store-load fence does not work on IA-32/Intel 64 (it does not go all the way to memory and back, you need a full fence to do anything significant), and Gate::get_state() returns the value of an atomic, i.e., with acquire semantics for the load. I don't mean to imply that your post is without merit (I have not studied the issue, and maybe the store-load fence is exactly what is required even if it is redundant on Intel), but at least get_state() is not a naked load, and I see different code for mark_pool_full() in the version you are referring to ("// Double-check idiom"), which might be significant before going deeper into this.

Raf_Schietekat:I would think that a store-load fence does not work on IA-32/Intel 64 (it does not go all the way to memory and back, you need a full fence to do anything significant),

I mean that StoreLoad is what is needed here. If architecture doens't support exactly this type of fence, than one have to promote fance type to the next stronger fence type. On x86 it will be full fence (mfence).

Raf_Schietekat: and Gate::get_state() returns the value of an atomic, i.e., with acquire semantics for the load.

In my version (20080408oss) state type is std::ptrdiff_t. Anyway acquire semantics doesn't solve the problem.

Raf_Schietekat: I don't mean to imply that your post is without merit (I have not studied the issue, and maybe the store-load fence is exactly what is required even if it is redundant on Intel),

???
store-load is the only type of fence that is NEEDED on x86 (as well as on Sparc TSO).
It's load-acquire and store-release which is NOT needed on x86. But they don't substitute store-load (mfence in terms of x86).

Raf_Schietekat: but at least get_state() is not a naked load, and I see different code for mark_pool_full() in the version you are referring to ("// Double-check idiom"), which might be significant before going deeper into this.

Sorry I strip the comment when posting. Here is full version:

inline void GenericScheduler::mark_pool_full() {
// Double-check idiom
Gate::state_t snapshot = arena->prefix().gate.get_state();
if( snapshot!=SNAPSHOT_FULL && snapshot!=SNAPSHOT_PERMANENTLY_OPEN )
arena->prefix().gate.try_update( SNAPSHOT_FULL, SNAPSHOT_PERMANENTLY_OPEN, true );
}

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
Raf_Schietekat:I have not studied the issue, and maybe the store-load fence is exactly what is required...

I also cross-post it to comp.programming.threads:
http://groups.google.ru/group/comp.programming.threads/tree/browse_frm/t...

Maybe in c.p.t. someone will make situation clear.

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

I think Dmitry is right and there should be acquirefence in get_state.

The issue however is rather negligible in general case, though it might cause a performance issue in some corner cases. The general usage model of TBB assumes many tasks are spawned frequently; so the SNAPSHOT_EMPTY signal missed at spawn time will be caught during the next spawn. And relaxed sequential execution model used in TBB algorithms guarantees no work will be missed: the spawning thread will execute it itself.

I will also ask Arch Robison to comment.

Sorry, I missed that state is only atomic for "#elif __TBB_USE_FUTEX", so otherwise it's a race all right. If I would want to investigate this, I would probably try to find out why mark_pool_full() follows release_task_pool() instead of preceding it (normally the sequence is, I think: change stuff, release it, "signal" it (separately if safe or linked together with release), no more stuff unless preceded by a full fence, and store-load/release-acquire does not a full fence make on Intel).

Thanks for reviewing the code. I'm starting to be convinced that open-source and many eyeballsare the only way to deal with memory fence issues.

The two pathologies of interest are:

  1. "spurious wakeup" - workers are woken up but the task pools are empty
  2. "lost wakeup" - a task pool has a task, but workers remain asleep (i.e. blocked by the gate).

Both of these pathologies perhaps cause performance issues, but never correctness issues. Spuriously woken threads waste time looking for work that is not there. Lost wakeup misses potential parallelism. In TBB parallelism is always optional, not mandatory, so there is no correctness issue.

I won't consider spurious wakup any further. Because workers spin a bit while doing random task stealing, curing spurious wakeup is unlikely to improve performance or reduce power noticeably.

Dmitriy is right that a full fence is necessary to prevent a lost wakeup. An acquire fence is insufficient because an acquiring load in get_state() could move upwards over the releasing store in release_task_pool(). Indeed, that's the only reordering that x86 hardware does anyway.

So now I'll argue "worse is better". TBB, like Cilk, assumes that there is much "parallel slack"; i.e. thereis much morepotential parallelism than available threads. A corollary shown in the Cilk papers is that stealing becomes relatively rare compared to non-stealing, and hence it becomes important to minimize overhead for the non-stolen case, even if this incurs some penalty for the stealing case. Regrettably, fences are very expensive on x86. So we have a choice between:

  1. Occasional missed wakeups, and thus occasional missed stealing opportunities.
  2. Paying for a full fence on every spawn, whether the task is stolen or not.

My intution is that (2) is a much higher penalty than (1), but only benchmarking could really tell. Indeed, in the futex based version, we should take out the acquire fence implied by the read of the atomic state. I.e., we need to be consistent in our evil :-)

Currently full fences can be expressed in TBB only in inefficient and obtuse ways, like: atomic::fetch_and_add(0) or atomic::compare_and_swap(0,0).

This discussion does point out two things missing from atomic that should be added:

  1. A way to express an unfenced atomic load (e.g. for curren futex-based gate)
  2. A way to express a full fence (e.g. if we wanted to be strict about wakeup)

I've started a separate note on adding this functionality.

MADakukanov:

I think Dmitry is right and there should be acquirefence in get_state.

Acquire fence won't help. There must be 'mfence' instruction between making task available and signalling.

MADakukanov:

The issue however is rather negligible in general case, though it might cause a performance issue in some corner cases. The general usage model of TBB assumes many tasks are spawned frequently; so the SNAPSHOT_EMPTY signal missed at spawn time will be caught during the next spawn. And relaxed sequential execution model used in TBB algorithms guarantees no work will be missed: the spawning thread will execute it itself.

I will also ask Arch Robison to comment.

Well, ok, you are 100% right here.
I just want to know whether it's deliberate choice to omit 'mfence' here or it's just an accident.
Anyway I think it will be better if you comment this moment in the source code. Something like 'actually here must be 'mfence', but it's ommited for optimization purposes...'.

I see that Gate class is not public interface. Otherwise it will be better to provide 2 methods:

state_t Gate::get_state() const
{
__asm mfence;
return state;
}

state_t Gate::get_state_naked() const
{
return state;
}

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
Raf_Schietekat:Sorry, I missed that state is only atomic for "#elif __TBB_USE_FUTEX", so otherwise it's a race all right.

Yes, I only investigate code for USE_WINTHREAD and USE_PTHREAD. I don't see __TBB_USE_FUTEX. Sorry, that I don't mention this.

Raf_Schietekat:If I would want to investigate this, I would probably try to find out why mark_pool_full() follows release_task_pool() instead of preceding it (normally the sequence is, I think: change stuff, release it, "signal" it (separately if safe or linked together with release), no more stuff unless preceded by a full fence,

Sorry, I don't get you. Can you describe you point in more detail?

Raf_Schietekat:
and store-load/release-acquire does not a full fence make on Intel).

What you mean by 'store-load' and 'release-acquire' here?

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

With regard to Dmitriy's last two posts:

I do not want to take a position on whether a full fence is actually required here, so I'm only offering some remarks in the hope that they contribute more than they distract (?). You will probably agree that MFENCE functionality would also be provided by making either the release or reading the state become a fully fenced equivalent (see below). I believe Arch has answered the other questions, and has given a glimpse of high-level insight into these low-level details, although I have not studied the code to verify whether indeed worse is better. I would agree about having comments like the one you suggested, especially since C++0x currently leaves the result of a data race "undefined", so strictly speaking the computer may go up in a puff of smoke, or at the very least catch fire. I do not agree with the code you offer for get_state(), though: there should probably also be an acquire fence after reading the signal (or implied by the read, and in this case fetch_and_add(0) does it all in one operation), and get_state() should probably include an indication in its name about the use of a full fence.

Funny that you should have missed __TBB_USE_FUTEX in gate.h, where I at first overlooked the others. I hope you don't mind if I don't explain the intuitive reasoning you quoted next because I am still trying to really understand this stuff, so please just consider it the first step in a double-check idiom (suspicion followed by investigation), where I omitted the second step. With "store-load/release-acquire" I meant anything like SFENCE plus LFENCE, now redundant on IA-32/Intel 64 because implied by stores and loads.

Raf_Schietekat:With regard to Dmitriy's last two posts:

I do not want to take a position on whether a full fence is actually required here, so I'm only offering some remarks in the hope that they contribute more than they distract (?).

My opinion is that mfence is required. But it can be omitted. But it must be documented. It's trade-off decision.

Raf_Schietekat:You will probably agree that MFENCE functionality would also be provided by making either the release or reading the state become a fully fenced equivalent (see below).

Yes, on most x86 (including AMD) locked RMW is usially faster than plain op + mfence. But on other architectures this can be not true.

Raf_Schietekat:I believe Arch has answered the other questions, and has given a glimpse of high-level insight into these low-level details, although I have not studied the code to verify whether indeed worse is better. I would agree about having comments like the one you suggested, especially since C++0x currently leaves the result of a data race "undefined", so strictly speaking the computer may go up in a puff of smoke, or at the very least catch fire. I do not agree with the code you offer for get_state(), though: there should probably also be an acquire fence after reading the signal (or implied by the read, and in this case fetch_and_add(0) does it all in one operation), and get_state() should probably include an indication in its name about the use of a full fence.

Can you explain why acquire fence is needed when loading state?

Raf_Schietekat:Funny that you should have missed __TBB_USE_FUTEX in gate.h, where I at first overlooked the others. I hope you don't mind if I don't explain the intuitive reasoning you quoted next because I am still trying to really understand this stuff, so please just consider it the first step in a double-check idiom (suspicion followed by investigation), where I omitted the second step. With "store-load/release-acquire" I meant anything like SFENCE plus LFENCE, now redundant on IA-32/Intel 64 because implied by stores and loads.

store-load as well as acquire-release can't be implemented with SFENCE/LFENCE on x86. They only can be implemented with MFENCE.

It's interesting question - whether SFENCE is redundant on x86 or not?
Is using of non-temporal stores is prohibited in tbb::task::execute() in TBB documentation? ;)
Are you considering the case when I am using movntq for storing data for child task? It seems that for now it will work as expected.

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADadrobiso:

Thanks for reviewing the code. I'm starting to be convinced that open-source and many eyeballsare the only way to deal with memory fence issues.

Unfortunately, not. Open-source is not the way too. There are just too few expert eyeballs.
There is open-source library called 'AppCore' (http://appcore.home.comcast.net/~appcore/). Maybe you've heard about it - it's referenced by a number of resources, including articles on Intel web-site and some university lectures. It was 3 years in open-source until I've found exactly the same bug in it:
http://groups.google.ru/group/comp.programming.threads/browse_frm/thread...
AppCore provides support for general-purpose message-passing, so there the bug is considerably more important.

MADadrobiso:The two pathologies of interest are:

  1. "spurious wakeup" - workers are woken up but the task pools are empty
  2. "lost wakeup" - a task pool has a task, but workers remain asleep (i.e. blocked by the gate).

Both of these pathologies perhaps cause performance issues, but never correctness issues. Spuriously woken threads waste time looking for work that is not there. Lost wakeup misses potential parallelism. In TBB parallelism is always optional, not mandatory, so there is no correctness issue.

I can't agree with you here. In the context of TBB threading is all about performance/scalability. So if there is something which can force my code to execute only on 1 core, instead of 8. Well, my opinion is that it's error, and must be fixed.

In whole it's the only requirement for TBB - it must provide and guarantee maximum utilisation of hardware resources.

MADadrobiso:I won't consider spurious wakup any further. Because workers spin a bit while doing random task stealing, curing spurious wakeup is unlikely to improve performance or reduce power noticeably.

Dmitriy is right that a full fence is necessary to prevent a lost wakeup. An acquire fence is insufficient because an acquiring load in get_state() could move upwards over the releasing store in release_task_pool(). Indeed, that's the only reordering that x86 hardware does anyway.

So now I'll argue "worse is better". TBB, like Cilk, assumes that there is much "parallel slack"; i.e. thereis much morepotential parallelism than available threads. A corollary shown in the Cilk papers is that stealing becomes relatively rare compared to non-stealing, and hence it becomes important to minimize overhead for the non-stolen case, even if this incurs some penalty for the stealing case. Regrettably, fences are very expensive on x86. So we have a choice between:

  1. Occasional missed wakeups, and thus occasional missed stealing opportunities.
  2. Paying for a full fence on every spawn, whether the task is stolen or not.

My intution is that (2) is a much higher penalty than (1), but only
benchmarking could really tell. Indeed, in the futex based version, we should take out the acquire fence implied by the read of the atomic state. I.e., we need to be consistent in our evil :-)

Yes, according to Cilk's "work-first" principle it's better to remove this full fence.
I was going to propose to remove it again, when you would agree to insert it :)

But consider following moment. Now TBB is considered by community not only as tool for high-performance computating (as it is with Cilk). So user can use tasks as futures:

// pseudo-code

void my_root_task::execute()
{
spawn(my_first_task());
spawn(my_second_task());
}

Here is no parallel slack at all, but user still expects that those 2 task will be executed in parallel. If user uses 'scheduler bypass' for second task, there will be only one 'spawn'.

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"Can you explain why acquire fence is needed when loading state?" Otherwise maybe the code will start reading data that should be protected by the signal before it reads the signal itself? But I haven't studied the code, so I don't really know.

"store-load as well as acquire-release can't be implemented with SFENCE/LFENCE on x86. They only can be implemented with MFENCE." I guess there's a fence definition problem that has also tripped me up before, but the meaning behind this is exactly what I wrote myself.

"It's interesting question - whether SFENCE is redundant on x86 or not?" With stores always performed in order it very much seems to be.

"Is using of non-temporal stores is prohibited in tbb::task::execute() in TBB documentation? ;) Are you considering the case when I am using movntq for storing data for child task? It seems that for now it will work as expected." Do elaborate, I'm not familiar with assembler programming and working on something else.

"Well, my opinion is that it's error, and must be fixed." I'm sure that "there is no correctness issue" did not mean to imply that there was no performance problem, it was just (rightfully) the first priority in Arch's analysis.

Raf_Schietekat:
"It's interesting question - whether
SFENCE is redundant on x86 or not?" With stores always performed in
order it very much seems to be.

"Is using of non-temporal stores
is prohibited in tbb::task::execute() in TBB documentation? ;) Are you
considering the case when I am using movntq for storing data for child
task? It seems that for now it will work as expected." Do elaborate,
I'm not familiar with assembler programming and working on something
else.

Non-temporal stores CAN sink below subsequent non-temporal stores and below subsequent plain stores.
Consider following code:

// user code
my_task* my_task::create_child_task()
{
my_task* t = ...;
_mm_stream_si32(&t->data, 1); // msvc intrinsic for 'movnti' instruction
return t;
}

// user code
void my_task::execute()
{
//...
spawn(create_child_task());
//...

}

// library code

void spawn(task* t)

{

//...
compiler_barrier(); // assume compiler won't reorder

some_shared_location = t; // publication of task, plain store

//...

}

Here store to t->data in my_task::create_child_task() CAN sink below
store to some_shared_location in spawn(). Plain store to
some_shared_location doesn't have release semantics wrt non-temporal
stores.

To prevent this one must use SFENCE instead of compiler_barrier().

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Aha... Well, it should be easy enough to provide your own SFENCE calls. TBB should only have to cater to standard C++, and if you insist on breaking out of that cage, you're on your own, and deservedly so. (On the other hand it would be reassuring to have some documentation about memory semantics related to tasks, concurrent containers, etc., like Java provides.)

On that other issue, it seems you are looking for (soft) real-time guarantees from TBB. Are you? Doesn't that imply different compromises than maximal throughput, which is what TBB aims for, I believe? It's not as if TBB is going to provide sluggish behaviour for any extended amount of time, but maybe you would be interested to pay with lower overall performance to avoid hickups like this. Looks like an interesting discussion potentially in the making...

Raf_Schietekat:
[...]
On the other hand it would be reassuring to have some documentation about memory semantics related to tasks, concurrent containers, etc., like Java provides.

Indeed.

Raf_Schietekat:
On that other issue, it seems you are looking for (soft) real-time guarantees from TBB. Are you?

No. I and just looking. I am not using TBB.

Raf_Schietekat:
Doesn't that imply different compromises than maximal throughput, which is what TBB aims for, I believe? It's not as if TBB is going to provide sluggish behaviour for any extended amount of time, but maybe you would be interested to pay with lower overall performance to avoid hickups like this. Looks like an interesting discussion potentially in the making...

I'm not interested to pay with lower overall performance. Personally I would eliminate *ALL* atomic RMW and heavy memory barriers from fast-path. But not just by omitting them :)

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"I'm not interested to pay with lower overall performance." If you're not interested in real-time predictability, you don't have to, but then you should also not object to Arch working his evil ways (as long as there are apologetic comments in the code, of course!).

Raf_Schietekat:"I'm not interested to pay with lower overall performance." If you're not interested in real-time predictability, you don't have to, but then you should also not object to Arch working his evil ways (as long as there are apologetic comments in the code, of course!).

I'm not objecting. My intent is:
1. understand whether the issue has place
2. if yes, get to know whether it's intentional thing or not
3. if it's not intentional, then inform developers about the issue

Since it's intentional thing, no problem, ...

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Here's my proposed comment:

 // Double-check idiom that is deliberately sloppy about memory fences.
// Technically, to avoid missed wakups, there should be a memory fence between the point we released
// the task pool and read the gate's state. However, adding such a fence might hurt overall performance
// more than it helps, because the fence would be executed on every task pool release, even when
// stealing does not occur. Since TBB allows parallelism, but never promises parallelism, the missed
// wakeup is not a correctness problem.
Gate::state_t snapshot = arena->prefix().gate.get_state();
if( snapshot!=SNAPSHOT_FULL && snapshot!=SNAPSHOT_PERMANENTLY_OPEN )
arena->prefix().gate.try_update( SNAPSHOT_FULL, SNAPSHOT_PERMANENTLY_OPEN, true );

If it looks acceptable, I'll commit it. Our checkin process requires listing the code reviewer, so bless it or suggest an improvement and you can be immortalized in our SVN log as a code reviewer :-)

Non-temporal stores can safely be used in task::execute, but the reasoning is subtle.As Dmitriy points out,the store that releases a task pool is not a strong enough fence to stop a non-temporal store from migratingfrom before to after it. However,whenTBBpublishes a task, it firstdoes a "lock cmpxchg" instruction to acquire the task pool. A non-temporal store cannot migrate past a locked instruction, hence we are safe. The above argument is peculiar to x86 semantics, but so are non-temporaral stores, so the argument seems fair.

TBB does not promise parallelism. Indeed, this has been the root issue in debating whether to add futures to TBB. One viewpoint is that futures only allow optional parallelism (and thus are easily implementable in TBB), the other is that they promise parallelism (and thus are unimplementable in TBB). In general, mandatory parallelism hurts performance because it inevitably leads to oversubscription of a processors resources.

The TBB viewpoint on parallel slack is that it is fundamental to scalability. Programs without parallel slack are not scalable to even one additional processor.

MADadrobiso:

Here's my proposed comment:

 // Double-check idiom that is deliberately sloppy about memory fences.
// Technically, to avoid missed wakups, there should be a memory fence between the point we released
// the task pool and read the gate's state. However, adding such a fence might hurt overall performance
// more than it helps, because the fence would be executed on every task pool release, even when
// stealing does not occur. Since TBB allows parallelism, but never promises parallelism, the missed
// wakeup is not a correctness problem.
Gate::state_t snapshot = arena->prefix().gate.get_state();
if( snapshot!=SNAPSHOT_FULL && snapshot!=SNAPSHOT_PERMANENTLY_OPEN )
arena->prefix().gate.try_update( SNAPSHOT_FULL, SNAPSHOT_PERMANENTLY_OPEN, true );

If it looks acceptable, I'll commit it. Our checkin process requires listing the code reviewer, so bless it or suggest an improvement and you can be immortalized in our SVN log as a code reviewer :-)

I only suggest to add 2 little additions:
...there should be a full memory fence between the point we released the task pool (i.e spawned task) and read the gate's state...

Otherwise I bless it. If I would see such comment in the code initially, then this topic would not born :)

Dmitriy V'jukov

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Leave a Comment

Please sign in to add a comment. Not a member? Join today