Bug in concurrent_queue

Bug in concurrent_queue

TBB 2.1 (20080825)

Consider following code from concurrent_queue.cpp

void concurrent_queue_base_v3::internal_push( const void* src ) {

 [...]

 // it's sequentially consistent atomic RMW,

 // so we can consider that it's followed by #StoreLoad style memory fence

 ticket k = r.tail_counter++;

 [...]

 // it's relaxed load

 if( r.n_waiting_consumers>0 ) {

 EnterCriticalSection( &r.mtx_items_avail );

 // signal consumers

 [...]

 LeaveCriticalSection( &r.mtx_items_avail );

 }

 [...]

}


void concurrent_queue_base_v3::internal_pop( void* dst ) {

 [...]

 // it is really empty.. go to sleep FOREVER!

 EnterCriticalSection( &r.mtx_items_avail );

 // it's relaxed load, followed by relaxed store (or relaxed pseudo RMW)

 r.n_waiting_consumers++;

 // it's load-acquire

 while( r.tail_counter<=k ) {

 // block thread

 [...]

 }

 LeaveCriticalSection( &r.mtx_items_avail );

 [...]

}

Here is a bunch of problems.
I. Most serious problem. This logic can lead to DEADLOCK.
Consider following execution sequence:
1. Producer increments tail_counter
2. Producer loads n_waiting_consumers, n_waiting_consumers==0, so it doesn't signal consumers
3. Consumer increments n_waiting_consumers
4. Consumer loads tail_counter, load returns stale value, consumer goes to sleep forever
To fix this you must issue #StoreLoad style memory fence after incrementing n_waiting_consumers.

II. Because n_waiting_consumers participates in lock-free algorithm, i.e. not always protected by mutex, it's better to be declared volatile, in order to calm down compiler.

III. n_waiting_consumers and n_consumers_to_wakeup declared as uint16_t, i.e. maximum number of waiting threads is 65k. Well, it's enough for any sane situation for a couple of years. But this can be not enough even today for all those insane situations in which insane user will decide to use TBB. On my 32-bit system with 1GB of physical memory I'm able to create 30719 threads.
I don't see reasons why n_waiting_consumers and n_consumers_to_wakeup declared as uint16_t, and not as uint32_t, they doesn't used with CAS, they doesn't affect critical data layout.

And exactly the same situation with reverse signaling, i.e. when consumer signals producer about decreasing of queue's size.

Btw, Relacy Race Detector can easily reveal such errors (item I). Moreover unit-test can be as simple as: 2 threads, thread 1 enqueues single item, thread 2 dequeues single item. On this simple unit-test Relacy will be able to detect deadlock in a second. (SPIN/Promela won't detect, because it doesn't support relaxed memory models)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
15 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

I. tail_counter is declared to be atomic. I don't know how in step 4 the consumer sees the old value of tail_counter? And how does the #StoreLoad style fence after increment of n_waiting_consumers help in this case?

II. I believe accesses ton_waiting_consumers are protected in a critical section?

III.Is it not a way to keep insane users from creating too many threads? :^) No, I am joking.. I see your point. Let me review it; I may change them to uint32_t unless I find a reason against it.

MADwkim3:

I. tail_counter is declared to be atomic. I don't know how in step 4 the consumer sees the old value of tail_counter? And how does the #StoreLoad style fence after increment of n_waiting_consumers help in this case?

I don't get your point. How atomic<> itself can help with ordering? atomic<> itself only guarantees atomicity, not ordering. Memory fences guarantees ordering.

Btw, this is exactly king of ordering problem I've discovered in task scheduler:
http://software.intel.com/en-us/forums//topic/58670
So I think that Arch Robison can confirm the issue.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADwkim3:

II. I believe accesses ton_waiting_consumers are protected in a critical section?

There is a read of n_waiting_consumers outside of critical section. This fact breaks all guarantees which critical section provides. We are on 'lock-free ground' here.
Compiler can cache value of n_waiting_consumers. Compiler can reorder access to n_waiting_consumers with other variables. While all access to all variables are protected with critical section those things are invisible to programmer and can not harm. But this is not the case here.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADwkim3:

III.Is it not a way to keep insane users from creating too many threads? :^) No, I am joking.. I see your point. Let me review it; I may change them to uint32_t unless I find a reason against it.

Note that there is a cache-line pad after those variables anyway, so we are not even gaining any profit wrt memory consumption here :)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

I believe TBB atomic hasproper memory fences in place. Thus, I don't think a compiler would reorder n_waiting_consumers++ and the load of 'tail_counter' in the pop method. The same applies to the produer side. Please correct me if I am wrong.

Assuming no reordering, by the time a consumer loads 'tail_counter' in the slow path, n_waiting_consumers++ must have been executed, and all the cache lines containing n_waiting_consumers must have been invalidated. So, it the consumer read the queue is empty, the producer has not yet incremented tail_counter, which means it has not read n_waiting_consumers.

MADwkim3:

I believe TBB atomic hasproper memory fences in place.

TBB atomics by default execute store as store-release and load as load-acquire. It's enough for weak causal synchronization, but it's not enough for strong competition resolution like in the blocking/signaling algorithm.

MADwkim3:
Thus, I don't think a compiler would reorder n_waiting_consumers++ and the load of 'tail_counter' in the pop method. The same applies to the produer side. Please correct me if I am wrong.

Theoretically, store to n_waiting_consumers can sink below load-acquire of tail_counter. This can be done by compiler and/or hardware.

MADwkim3:

Assuming no reordering, by the time a consumer loads 'tail_counter' in the slow path, n_waiting_consumers++ must have been executed, and all the cache lines containing n_waiting_consumers must have been invalidated.

By the time a consumer loads 'tail_counter' in the slow path,
n_waiting_consumers++ must have been retired, and store to the cache line
containing n_waiting_consumers now situated in store buffer, no cache lines invalidated yet, and nothing is visible to other cores/processors.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

>By the time a consumer loads 'tail_counter' in the slow path, n_waiting_consumers++ must have been retired, and store to >the cache line containing n_waiting_consumers now situated in store buffer, no cache lines invalidated yet, and nothing is >visible to other cores/processors.

Then, when the producer atomically increments 'tail_counter', the store buffer is flushed and the cache line is invalidated, no?

At that time, the producer sees the change made to n_waiting_consumers?

MADwkim3:

Then, when the producer atomically increments 'tail_counter', the store buffer is flushed and the cache line is invalidated, no?

Right. Atomic RMW is assumed to have trailing StoreLoad memory fence.

MADwkim3:

At that time, the producer sees the change made to n_waiting_consumers?

No. Change to n_waiting_consumers can still be in store buffer. BOTH sides must execute StoreLoad memory fence (mfence).
If things are like you are saying, then it will be possible to create mutex implementation which doesn't issue no atomic RMW nor full fences. Basically free of charge mutex. This is impossible.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

I was wrong about my previous statement. I was mixing up two different things.

I talked to Arch and we decided to make n_waiting_consumers (and n_waiting_producers)atomic for now. This would address the other issues you raised, as well.

BTW, take a look at

J.L.W. Kessels, "Arbitration without common modifiable variables" acta informatica 17, 135-141, 1982.

I think you can implemement Kessels' mutex using sfences only. It only requires loads and stores be indivisible.

The reason we decided to make it atomic, and not use an explicit fence, is strictly as a temporary hack. Looking through the TBB machine headers, we found we did not have a clean existing way to write a full fence.

Once we have Raf's revision of atomics integrated into TBB, we'll use his notation for a full fence instead of the hack.

- Arch

MADwkim3:

BTW, take a look at

J.L.W. Kessels, "Arbitration without common modifiable variables" acta informatica 17, 135-141, 1982.

I think you can implemement Kessels' mutex using sfences only. It only requires loads and stores be indivisible.

On one hand I am sceptic on this, on the other hand I am thrilled...

Unfortunately I can't find any online copy of this paper...

Is it similar to Peterson's algorithm (which requires mfence)?

http://en.wikipedia.org/wiki/Peterson%27s_algorithm

I was a bit inaccurate in my statement. It's impossible to create practical general-purpose algorithm for mutual exclusion which doesn't use locked instruction of mfence on both sides. Though theoretically it's possible to create just mutual exclusion algorithm which doesn't use locked instruction of mfence on both sides.

I developed 2 such mutual exclusion algorithms. They are both practical and provide extreme performance for *some* situations, but they cannot be called general-purpose. First uses a kind of rendezvous technique, and second uses global quiescent state tracking. As a result in both algorithms 'one side' is free of any heavy operations and executes just few plain machine instructions (i.e. takes just a few cycles). Algorithms are suitable for reader-writer mutex provided reader mostly workload.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
MADadrobiso:

The reason we decided to make it atomic, and not use an explicit fence, is strictly as a temporary hack. Looking through the TBB machine headers, we found we did not have a clean existing way to write a full fence.

Once we have Raf's revision of atomics integrated into TBB, we'll use his notation for a full fence instead of the hack.

Atomic RMW (which includes trailing full fence) is perfectly Ok here. Modification of tail_counter is done exactly this way. And on some architectures it can be even faster than just full fence. For example I remember reading AMD documentation on optimization, and they recommended to use locked instruction instead of mfence whenever possible. On Intel processors (at least on Pentium 4) locked RMW is also faster than mfence. And on some architectures full fence is faster than atomic RMW. So ideally this must be platform dependant. But as it on slow-path it actually makes no sense here :)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net
randomizer:
MADwkim3:

BTW, take a look at

J.L.W. Kessels, "Arbitration without common modifiable variables" acta informatica 17, 135-141, 1982.

I think you can implemement Kessels' mutex using sfences only. It only requires loads and stores be indivisible.

On one hand I am sceptic on this, on the other hand I am thrilled...

Unfortunately I can't find any online copy of this paper...

Is it similar to Peterson's algorithm (which requires mfence)?

http://en.wikipedia.org/wiki/Peterson%27s_algorithm

I was a bit inaccurate in my statement. It's impossible to create practical general-purpose algorithm for mutual exclusion which doesn't use locked instruction of mfence on both sides. Though theoretically it's possible to create just mutual exclusion algorithm which doesn't use locked instruction of mfence on both sides.

I developed 2 such mutual exclusion algorithms. They are both practical and provide extreme performance for *some* situations, but they cannot be called general-purpose. First uses a kind of rendezvous technique, and second uses global quiescent state tracking. As a result in both algorithms 'one side' is free of any heavy operations and executes just few plain machine instructions (i.e. takes just a few cycles). Algorithms are suitable for reader-writer mutex provided reader mostly workload.

I don't have an online copy or link either. I got it froma library. It is similar to Peterson's. Maybe even closer to Lamport's. Basically it is binary tournament. It may or may not be used as a general-purpose mutex in its form presented in the paper. But I think it can be adapted with some reasonable assumptions on the platform. Whether it will perform and/or scale better than locked instruction based mutexes is another question.

MADwkim3:
It is similar to Peterson's. Maybe even closer to Lamport's.

But they both requires #StoreLoad style memory fence! (mfence instruction in terms of x86)
Basically if there is a sequence like:

store X
(*)
if (load Y)
 ...

And this code assumed to provide strong competition resolving (for example, mutual exclusion), then in (*) #StoreLoad style memory fence must be executed. Unconditionally!

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Leave a Comment

Please sign in to add a comment. Not a member? Join today