Questions on TBB Concurrent Queue Implementation

Questions on TBB Concurrent Queue Implementation

I went through the Concurrent Queue class in TBB and I havea fewquestions :

1. The concurrent_queue_rep class makes sure that head_counter and tail_counter are placed on seperate cachelines
so that the pushing threads and popping threads dont contendwhile accessing them.Also the consecutive operations to the queue aredirected to different microqueues so that these operations can be done quite independently. The size of each micro_queue happens to be 20 bytes. And the concurrent_queue_rep stores all the 8 microqueues consequetively
in memory(starting from the start of a cacheline).

So my question is, though the 8 concurrent operations are directed to 8 different micro queues, all the 8 threads are trying to access memorysome whereamong this 160 bytes ( 8 microqueues * 20 bytes) andI see quite some contention. So why not keep the each of the microqueues on different cachelines as well?

2. Idonot understand as to why the field 'mask' in the page struct is used to decide the success of a popping operation.
Because I see the operations SpinwaitUntilEq( head_counter, k ) and SpinwaitWhileEq( tail_counter, k ) in micro_queue::pop() method makesure that an itemis present for the popping threadto pop and hence I dont find the use of mask further to determine if the item is present.

3. The following is the code in the internal_pop methos of concurrent_queue_base
do {
k = r.head_counter++;
} while( !r.choose(k).pop(dst,k,*this) );

Why is the head_counter incremented in a do-while loop? How will a thread which doesnot find a item for pop operation N find an item for an an operation (N+1)?

4. If the threads that push into the concurrent queue and the threads that pop from the concurrent queue are different then the threads which pop from the queue will be the threads who call deallocate_page(). This again calls for a slow path in the scalable_free since the popping threads will never be the owners of the page memory. So why not have something like the LIFOQueue( used in scalable allocator code) per microqueueto place the free pages and get it back when necessary. Would this not reduce contentionsin the scalable_allocator as well?

Please correct me if my understanding on the above is wrong.

24 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Quoting - Shankar
2. Idonot understand as to why the field 'mask' in the page struct is used to decide the success of a popping operation.
Because I see the operations SpinwaitUntilEq( head_counter, k ) and SpinwaitWhileEq( tail_counter, k ) in micro_queue::pop() method makesure that an itemis present for the popping threadto pop and hence I dont find the use of mask further to determine if the item is present.

I think it has something to do with exceptions. For example assume producer reserves an item, but then exception occurs during copy constructing the element in the queue.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Shankar

4. If the threads that push into the concurrent queue and the threads that pop from the concurrent queue are different then the threads which pop from the queue will be the threads who call deallocate_page(). This again calls for a slow path in the scalable_free since the popping threads will never be the owners of the page memory. So why not have something like the LIFOQueue( used in scalable allocator code) per microqueueto place the free pages and get it back when necessary. Would this not reduce contentionsin the scalable_allocator as well?

Scalable allocator will do exactly this itself.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

I think it has something to do with exceptions. For example assume producer reserves an item, but then exception occurs during copy constructing the element in the queue.

Oh ya this makes sense. So this also answers my question no.3 as to why headcounter is incremented when a pop operation fails( because of mask not being set by a producer as some exception occured while copy constructing).

Quoting - Dmitriy Vyukov

Scalable allocator will do exactly this itself.

The difference may not be dramatic, butwouldn't one of those proverbial overzealous interns be quite likely to findat least some benefit in refurbishing fixed-size items like pages (no bin search, old contents more quickly vacating caches thus making less advantageous evictions less likely, maybe others)? I have not looked very closely at queues yet, though.

(Added after #6) Oh, it's even as clear-cut as explained in #5! But I can also see Dmitriy's positiondescribed in#6, of not wanting to be penny-wise and pound-foolish (nice result, if somewhat depressing for the rest of us!). So please ignore this rather superfluous intervention.

Quoting - Dmitriy Vyukov

Scalable allocator will do exactly this itself.

Ya thats correct. But I see a difference in trying to acquirethe lock in the scalable allocatorand in trying to acquire the lock of the micro_queue. In fact in the current concurrent_queue implementation, the page is deallocated just after the micro_queue's lock isreleased. sois it notbetter toqueuethe pagein a free list in the microqueuefor reuse and then release the lock.

So the destructor of pop_finalizer would look like

~micro_queue_pop_finalizer() {
page* p = my_page;
if( p ) {
spin_mutex::scoped_lock lock( my_queue.page_mutex );
page* q = p->next;
my_queue.head_page = q;
if( !q ) {
my_queue.tail_page = NULL;
}
micro_queue_freelist.push( p );
}// lock Released
my_queue.head_counter = my_ticket;
}

Because this would not call for the use of scalable allocator freeand hence there are no chancesthat other threads in the applicationwould contend with the popping thread(s) of concurrent queue(s)to lock the public_free_list in scalale_allocator.

Btw, if you are interested in scalable mpmc queue design you may take a look at my new queue algorithm:
http://groups.google.com/group/lock-free/browse_frm/thread/3173c1a68a49f890

The algo does not use micro-queues, but instead tries to handle contention on queue ends gracefully. In a result it achieves 23x speedup over TBB queue on quad-core. On higher number of cores difference must be higher.
Note that my queue is not a drop-in replacement for TBB queue, though. Particularly it requires user type to have special "NULL" value and does not handle exceptions during copy construction (at least in current implementation).

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov
Btw, if you are interested in scalable mpmc queue design you may take a look at my new queue algorithm:
http://groups.google.com/group/lock-free/browse_frm/thread/3173c1a68a49f890

The algo does not use micro-queues, but instead tries to handle contention on queue ends gracefully. In a result it achieves 23x speedup over TBB queue on quad-core. On higher number of cores difference must be higher.
Note that my queue is not a drop-in replacement for TBB queue, though. Particularly it requires user type to have special "NULL" value and does not handle exceptions during copy construction (at least in current implementation).

I've got a 10% performance boost from this queue (because my program isn't all queues, it uses them heavily but relatively the work load isn't all on push/pop stuff).

Quoting - robert.jay.gould
I've got a 10% performance boost from this queue (because my program isn't all queues, it uses them heavily but relatively the work load isn't all on push/pop stuff).

Cool!
Not that bad for [almost] drop-in replacement for real-life application.
Btw, did you have any problems with incorporating it into your application?
And what hardware do you use? The more hardware parallelism the more speedup you get. The queue just leaves a little bit more interconnect bandwidth for your useful work.
The queue is also "more lock-free", i.e. blocked thread will block progress of other threads with less probability. And not susceptible to the thundering herd problem (which is the problems for the TBB queue I believe), because of the fine-grained eventcount algorithm. So must handle over-subscription of threads and "near empty queue" more gracefully. Is your queue full or near empty most of the time?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

Cool!
Not that bad for [almost] drop-in replacement for real-life application.
Btw, did you have any problems with incorporating it into your application?
And what hardware do you use? The more hardware parallelism the more speedup you get. The queue just leaves a little bit more interconnect bandwidth for your useful work.
The queue is also "more lock-free", i.e. blocked thread will block progress of other threads with less probability. And not susceptible to the thundering herd problem (which is the problems for the TBB queue I believe), because of the fine-grained eventcount algorithm. So must handle over-subscription of threads and "near empty queue" more gracefully. Is your queue full or near empty most of the time?

Unfortunately I tested on a duo-core, a quad-core would probably do even better, alas I don't have one here :)

As for balance of empty / full, it's hard to say. On a macro scale the queue gets filled up and takes a while to drain out (few seconds), and then it might spend up to a few seconds with almost nothing to do, until a new flood of work arrives, but it's also almost never truly empty as there is typically a few work batches here and there to process, but not enough to keep everyone busy.

Anyways I agree 10% is a really nice boost for a rather trivial change (from the application's viewpoint).

"the thundering herd problem (which is the problems for the TBB queue I believe)"
Is this equally problematic with my "Additions to atomic" patch? I thought I had invented (or reinvented, more likely) a super-duper anti-thundering herd spinlock, so this could be a good test for that. :-) Aside from any additional benefits from designing around the issue, of course.

Quoting - Raf Schietekat

"the thundering herd problem (which is the problems for the TBB queue I believe)"
Is this equally problematic with my "Additions to atomic" patch? I thought I had invented (or reinvented, more likely) a super-duper anti-thundering herd spinlock, so this could be a good test for that. :-) Aside from any additional benefits from designing around the issue, of course.

Sorry, it's not just spinlocks, of course. (Now how can I let those builds last longer so I have more time to think...)

Different question, for my education: where exactly is the thundering herd in concurrent_queue, why is it such a difficult problem to solve that a different algorithm has to be invented to avoid it altogether, and what should be done in other situations where it might occur?

Quoting - robert.jay.gould
Unfortunately I tested on a duo-core, a quad-core would probably do even better, alas I don't have one here :)

As for balance of empty / full, it's hard to say. On a macro scale the queue gets filled up and takes a while to drain out (few seconds), and then it might spend up to a few seconds with almost nothing to do, until a new flood of work arrives, but it's also almost never truly empty as there is typically a few work batches here and there to process, but not enough to keep everyone busy.

Anyways I agree 10% is a really nice boost for a rather trivial change (from the application's viewpoint).

Robert,

Ok, thank you for information.

One more question, just to be sure, didn't you expirience any application crashes after incorporation of my queue? :)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat
Sorry, it's not just spinlocks, of course. (Now how can I let those builds last longer so I have more time to think...)

Different question, for my education: where exactly is the thundering herd in concurrent_queue, why is it such a difficult problem to solve that a different algorithm has to be invented to avoid it altogether, and what should be done in other situations where it might occur?

I meant thundering herd after queue not empty/full signaling. Assume 16 threads are blocked on queue empty condition. Producer submits single item to the queue and wakes up all the consumers. All 16 threads start fighting to death for that single item and then 15 of them block again. Then producer submits another item, and wakes up all 15 consumers... and so on.

It's not too difficult to avoid thundering herd. However one must explicitly pay care to this and has suitable flexible fine-grained instruments. For example, TBB Gate class just does not allow you to wake single thread, you may wake only all blocked threads. Or if consumers are not equivalent (i.e. you want to wake up some particular consumer), and you are using semaphore or condition variable for blocking, you will have to wake up all threads even if you want to wake up only 1 particular thread.
Blocking primitive must allow for waking up 1 particular thread or a particular set of threads.

Mutex algorithm have to wake up 1 or 0 blocked threads when owner releases ownership. Consider, thread 1 releases the mutex, mutex wakes up blocked thread 2 (there are more blocked thread, but mutex releases only one of them). But then fresh thread 3 squeezes in between and acquires the mutex. When thread 3 releases the mutex, mutex may wake up NO threads at all, because it knows that there is still thread 2 which is acquiring the mutex.

In general, only thread[s] that have to be waken up must be waken up. If blocking primitive or algorithm wakes up excessive thread then this is bad.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"I meant thundering herd after queue not empty/full signaling."
pthread_cond_broadcast()... that has to hurt under the wrong circumstances (no pun intended), but I haven't really looked very closely at the algorithm (yet), so based on the principle of presumed innocence... :-) For example, for the consumer side, wouldn't it make sense to broadcast the condition if each consumer can react very quickly after being woken up by just taking off one item, as opposed to serialising through pthread_cond_signal()? The producer side of a bounded queue that has reached its limit is an entirely different story, of course.

Quoting - Raf Schietekat

"I meant thundering herd after queue not empty/full signaling."
pthread_cond_broadcast()... that has to hurt under the wrong circumstances (no pun intended), but I haven't really looked very closely at the algorithm (yet), so based on the principle of presumed innocence... :-) For example, for the consumer side, wouldn't it make sense to broadcast the condition if each consumer can react very quickly after being woken up by just taking off one item, as opposed to serialising through pthread_cond_signal()? The producer side of a bounded queue that has reached its limit is an entirely different story, of course.

If each producer submits only 1 item, then he does not submit enough items for 16 consumers to take off one item.

If producer submits batches of items, then he may wake up a batch of consumers.
If we have 16 blocked threads and need to wake up, let's say, 4 of them. What is better: single call to broadcast() or 4 calls to signal()? I don't know.
Another moment: batch is usually formed by delaying first items, so wouldn't it be better to submit each item individually so that items will be received by consumers earlier?

Btw, why you consider producer side as different story? It can be treated as "consumers produce 'free cells' in the queue".

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"If each producer submits only 1 item, then he does not submit enough items for 16 consumers to take off one item."
I'm sorry, that was careless: I was focused on the idea of simultaneously releasing a number of threads vs. the latency in a relay mechanism, and what came out was nonsensical.

"If we have 16 blocked threads and need to wake up, let's say, 4 of them. What is better: single call to broadcast() or 4 calls to signal()? I don't know."
I wondered about that: do 4 calls to signal() translate to roughly 4 threads being released? I don't see that in the documentation. Maybe the calls could unintentionally get coalesced? In that case, would the latency in a serial relay workaround maybe be worse than the problem with a broadcast()? Perhaps that's what you are referring to?

"Btw, why you consider producer side as different story? It can be treated as "consumers produce 'free cells' in the queue"."
Yes, see above.

I guess I should first go read some code now. :-)

Quoting - Raf Schietekat

"If we have 16 blocked threads and need to wake up, let's say, 4 of them. What is better: single call to broadcast() or 4 calls to signal()? I don't know."
I wondered about that: do 4 calls to signal() translate to roughly 4 threads being released? I don't see that in the documentation. Maybe the calls could unintentionally get coalesced? In that case, would the latency in a serial relay workaround maybe be worse than the problem with a broadcast()? Perhaps that's what you are referring to?

Please, elaborate more. Maybe some example. I am not sure I understand.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

Please, elaborate more. Maybe some example. I am not sure I understand.

Let's say you add 4 elements to the queue, and you want to express "this is work for 4 threads". It does not seem illogical to me to have an API that takes an integer parameter that can be 1 (same as signal() now), -1 (same as broadcast() now), or any other strictly positive value (the kernel would guarantee to wake up at least one thread, and use the actual value as a hint to wake up roughly that many threads; I'm not sure whether it would be useful to promise to wake up at least that many threads). But can the third behaviour be synthesised from a number of signal() calls, or would "signal(); signal();" typically be coalesced into just "signal();", because the normal semantics of a "condition" is singular anyway (a test has become true), so it doesn't make inherent sense to remember how many threads should react if there is no explicit broadcast. Coalescing can be an optimisation, e.g., to forget about intermediate scrollbar positions and only update the screen for the most current one; maybe signal() works the same way. What is your take on that? I'm just wondering at this time, it may not be directly relevant here.

Quoting - Raf Schietekat

Let's say you add 4 elements to the queue, and you want to express "this is work for 4 threads". It does not seem illogical to me to have an API that takes an integer parameter that can be 1 (same as signal() now), -1 (same as broadcast() now), or any other strictly positive value (the kernel would guarantee to wake up at least one thread, and use the actual value as a hint to wake up roughly that many threads; I'm not sure whether it would be useful to promise to wake up at least that many threads).

The API is called semaphore and supported by both POSIX and Win32.
It's useful in some scenarios (queue with batch enqueue), because one can wake up 1, ALL or exactly N threads. However with semaphore one loses the ability to distinguish between consumers, i.e. you can't say "I want to wake up this and that thread". It's only suitable IFF all the threads are equal, otherwise one will have to wake up all threads all the time.

Quoting - Raf Schietekat

But can the third behaviour be synthesised from a number of signal() calls, or would "signal(); signal();" typically be coalesced into just "signal();", because the normal semantics of a "condition" is singular anyway (a test has become true), so it doesn't make inherent sense to remember how many threads should react if there is no explicit broadcast. Coalescing can be an optimisation, e.g., to forget about intermediate scrollbar positions and only update the screen for the most current one; maybe signal() works the same way. What is your take on that? I'm just wondering at this time, it may not be directly relevant here.

I don't think signal() calls can be coalesced. Condvar signals about state change rather than condition become true. And you generally don't know how many threads I need to handle that new state. I may only communicate "I need ALL the threads" (broadcast), or "One more, please" (signal).
If signal() calls are coalesced then following simple blocking queue algorithm won't work. Producer executes signal() IFF he submits an element to the empty queue. If some signal() calls are coalesced, one may end up with suboptimal parallelization or even deadlock.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

Let's say you add 4 elements to the queue, and you want to express "this is work for 4 threads". It does not seem illogical to me to have an API that takes an integer parameter that can be 1 (same as signal() now), -1 (same as broadcast() now), or any other strictly positive value (the kernel would guarantee to wake up at least one thread, and use the actual value as a hint to wake up roughly that many threads; I'm not sure whether it would be useful to promise to wake up at least that many threads). But can the third behaviour be synthesised from a number of signal() calls, or would "signal(); signal();" typically be coalesced into just "signal();", because the normal semantics of a "condition" is singular anyway (a test has become true), so it doesn't make inherent sense to remember how many threads should react if there is no explicit broadcast. Coalescing can be an optimisation, e.g., to forget about intermediate scrollbar positions and only update the screen for the most current one; maybe signal() works the same way. What is your take on that? I'm just wondering at this time, it may not be directly relevant here.

Raf,

It may help to look at this from the perspective of the waiting thread(s) as opposed to the en-queue-ing operation.

A waiting thread (on Windows) can WaitForSingleObject or WaitForMultipleObjects, and where WaitForMultipleObjects can wait for any one object to signal or all objects to signal. The question now is to how best to structure the signaling procedure into the thread scheduler. Each individual queue could have an event as well as common groupings of threads could have an event handle. The problem now devolves into a question of

a) is it more efficient to signal the desired number of each waiting thread via individual Events
or
b) is it more efficient to have a permutation of Events
e.g.
64 single events
2 x 32 x 32 32-thread events
4 x 16 x 16 x 16 x 16 16 thread events
...

On a 4 core system, permutation of events may be feasible, on a 64 core (or more) it certainly is not.
Also, the O/S overhead of filtering through all those threads "may" be time consuming.

The scheme I use in QuickThread is a variation on a)

parallel_invoke( [&]{fee(a);}, [&]{fi(b);}, [&]{fo(c);}, [&]{fum(d);});

Will, insert into the invoking thread's affinity queue, and while inserting, test for sleeping or scrounging threads, insert a back pointer to the invoking thread's queue, into a shunt in the discovered sleeping or scrounging thread, then if discovered thread was sleeping, Signal an event, if scrounging, nothing left to do.

parallel_invoke(Waiting$, [&]{fee(a);}, [&]{fi(b);}, [&]{fo(c);}, [&]{fum(d);});

Will prioritize enqueu-ing into Waiting and Scrounging thread queues, waking if need be.

parallel_distribute(L3$+Waiting$, foo, a, b, c);

Looks for a socket on the system (e.g. 4 socket system with each socket holding 4 core processor), with the most sleeping and/or scrounging threads, then en-queues slices of the same taskto those queues (threads) only, waking if need be.

Maybe I can show you this at IDF

Jim Dempsey

www.quickthreadprogramming.com

#19 "The API is called semaphore and supported by both POSIX and Win32."
Yes, but the details vary: I was thinking of using the number as a hint, non-coalescing multiple signal() calls would wake up at least the specified number of threads, and semaphores wake up exactly that number of threads (conceptually, anyway). The "exactly" part seems potentially more costly, and, not being there from the beginning (if I remember correctly), wouldn't semaphores perhaps be implemented on top of something else and subject to the thundering herd problem anyway? But it seems plausible that thinking beyond semaphores and dealing with specific threads is better, and I now think that this is what you meant with "What is better: single call to broadcast() or 4 calls to signal()? I don't know.".

#19 "I don't think signal() calls can be coalesced."
Hey, why did my program suddenly stop working on that new platform? :-)

Jim, you will find in me a captivated audience.

Quoting - Raf Schietekat

#19 "The API is called semaphore and supported by both POSIX and Win32."
Yes, but the details vary: I was thinking of using the number as a hint, non-coalescing multiple signal() calls would wake up at least the specified number of threads, and semaphores wake up exactly that number of threads (conceptually, anyway). The "exactly" part seems potentially more costly, and, not being there from the beginning (if I remember correctly) wouldn't semaphores perhaps be implemented on top of something else and subject to the thundering herd problem anyway?

Just now I do not see why I will be more costly... if we do not mean "always one" or "always zero" behind "at least N". If the promitive will be frankly trying to wake up N threads then the cost must be the same I think.
Also note that if producer wants to wake up 4 threads because he just submited 4 items to the queue, then if primitive will wake up only 3 threads this may be suboptimal in itself, because 1 item will be delayed. I.e. in some cases you want "exactly N" semantics.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - jimdempseyatthecove

a) is it more efficient to signal the desired number of each waiting thread via individual Events
or
b) is it more efficient to have a permutation of Events
e.g.
64 single events
2 x 32 x 32 32-thread events
4 x 16 x 16 x 16 x 16 16 thread events

c) all threads are waiting on single event (as it now implemented in subj)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Leave a Comment

Please sign in to add a comment. Not a member? Join today