The Concurrent Queue Container with Sleep Support

By Wooyoung Kim (Intel) (6 posts) on June 3, 2008 at 5:59 am

The TBB Concurrent Queue container is a bounded first-in-first-out data structure. Multiple threads may concurrently insert items into it and take items out of it [1]. The main operations supported are push and pop. (The names may be misleading. Nonetheless, the TBB concurrent queue guarantees that an item inserted before another will be removed before the latter when only one thread is running. For multi-threaded execution, the guarantee is provided only for two items inserted by a producer and removed by a consumer.)

The implementation in tbb20_20070927oss used a ticket-based scheme with multiple sub-queues which resulted in fast push/pop operations. Producers grab a ticket for a potentially available slot and when it is actually available, it goes ahead and inserts the item in the slot. Similarly, consumers grab a ticket for a potentially available item and when it is assured it is available, it goes ahead and removes the item from the queue. A couple of forward-marching counters with atomic fetch&add are used. Actual accesses are distributed among the internal sub-queues to reduce contention. However, if a ticket is for an occupied slot (or a yet-to-be inserted item), the producer (or consumer) spin-waits.

When the number of pushes and the number of pops are equal or the rates of items being generated and consumed are equal, everything works fine. But, if they don't match, i.e., producers (consumers) wait for available slots (items) for an extended period of time, an unintended event happens -- they hog the cpu cycles. The issue deems serious as a number of TBB users reported it.

The TBB stable release tbb20_20080408oss has fixed the issue by allowing threads to go to sleep until the slot/item for their ticket becomes available, though not before we made a couple of fumbles. We set out adding the sleep support with a goal of adding minimal performance penalty to the fast path while making the slow path not too slow. A thread takes the fast path when its slot/item is available when it grabs a ticket; otherwise, it takes the slow path which may lead it to sleep.

My first attempt was to change the order of the events and let threads wait (and possibly sleep) until a slot(an item) becomes available. Newly entering threads and threads coming out of sleep compete for the next available slot (item) using the atomic compare-and-swap. To avoid the "thundering herd' issue, departing threads wake up a single sleeping thread of the opposite kind. Also, the number of threads in sleep is maintained to send a wake-up call only when necessary. All went well with the use of futex in Linux, WaitForSingleObject in Windows, and pthread_condwait in the others. Until I compared the performance of the old and new versions, that is.

On a clovertown-based 8-core Xeon server running RHEL Nahant U4, a micro-benchmark of adding and removing 1M items concurrently gave:

total time type
1.45973 Old_ConcurrentQueue<int> (one without the sleep support)
2.3691 ConcurrentQueue<int>
4.13421 SimpleLockedQueue<int> (std::queue with a big fat lock)
1.79493 Old_ConcurrentQueue<string>
2.7716 ConcurrentQueue<string>
5.05688 SimpleLockedQueue<string>

Repeated atomic compare-and-swaps to grab a next available item seemed to be the most critical factor for the performance degradation on the fast path. Even though failed atomic compare-and-swap on modern Intel(R) processors do not lock the bus, it accesses the bus anyway and thus keeps other cores from progressing.

The first attempt to correct the issue was to replace the compare-and-swap with a more optimistic fetch-and-add: atomically grab the next item with the optimistic assumption that it will be available, and if not, return it and repeat the steps. The improvement was marginal. Worse, it totally broke down at the boundaries. For example, when the queue is full, the optimistic atomic increment invalidates the premise at the boundary to compute size() which is defined as difference of pushes and pops made to the queue.

I should have known that repeated atomic instructions were very costly and we should avoid them (in particular, on platforms built with a shared bus). So, the choice was clear: use the ticket-based scheme for the fast path. The remaining question was how to revise the slow path. Since threads got assigned a ticket up-front, we should wake sleeping threads up in order of their tickets. But such an in-order wake-up is not available in general. Instead, we tried an ingenious proposal of depositing tickets when going into sleep and grabbing them back in order when coming out of sleep. But it required an additional data structure for the 'pot' -- we used a priority queue. And, it was necessary to place a lock to protect the 'pot'. More importantly, it was much harder to properly implement synchronization requirements.

Eventually, we decided to ignore the potential thundering herd issue in the slow path and go with a scheme using the condition variables. For Linux/Unix platforms, pthread_condvar/condwait was used. On Windows, condition variables/wait was emulated with [Enter|Exit]CriticalSection and generation counters [2].

total time type
1.47468 Old_ConcurrentQueue<int>
1.57742 ConcurrentQueue<int>
3.09968 SimpleLockedQueue<int>
1.77619 Old_ConcurrentQueue<string>
1.84828 ConcurrentQueue<string>
6.02623 SimpleLockedQueue<string>

Intel's next generation microarchitecture, code named Nehalem, replaces the front-side bus with the point-to-point Intel® QuickPath Interconnect (http://www.intel.com/technology/quickpath), and has integrated on-die memory controllers. This shift in microarchitecure may change relative costs of locked instructions and we may re-evaluate the old 'repeated locked compare-and-swap' implementation on this microarchitecture.

References
[1] TBB reference manual.
[2] Strategies for Implementing POSIX Condition Variables on Win32. http://www.cs.wustl.edu/~schmidt/win32-cv-1.html

Categories: Open Source, Parallel Programming, Software Tools
Tags:

For more complete information about compiler optimizations, see our Optimization Notice.

Comments (5)

October 7, 2008 2:01 AM PDT


Zhi Gan
Is it true that TBB queue is not FIFO when multiple producers and consumers are operating at the same time?
March 20, 2009 8:35 AM PDT

yinghai
yinghaiTotal Points:
70
Green Belt
Is TBB queue lock-free?
March 20, 2009 9:10 AM PDT

Wooyoung Kim (Intel)
Wooyoung Kim (Intel)Total Points:
1,105
Brown Belt
"Is it true that TBB queue is not FIFO when multiple producers and consumers are operating at the same time?"

It depends on what reference point you use. With respect to a concurrent queue, pushes and pops are strictly FIFO because it is ticket-based. For producers and consumers, the order may not appear FIFO because the distance between the moment an item is pushed into (popped off) a queu and the moment a producer (consumer) sees the item may be arbitrary. So, if you print items that are pushed and popped, the order can be anything.

However, the observed order per thread is strictly FIFO. What that means is when a thread pushes items into a queue, and pops items off the queue concurrently, the pop order among the items that the thread pushed (i.e., excluding items that other threads pushed into the queue) is the same as the push order of the items.

March 20, 2009 9:27 AM PDT

Wooyoung Kim (Intel)
Wooyoung Kim (Intel)Total Points:
1,105
Brown Belt
"Is TBB queue lock-free?"

No. It is not lock-free.

The slow paths in internal_push()/internal_pop() need locks to implement the blocking support correclty.

The TBB concurrent queue uses an array of micro queueus to distribute push/pop operations. micro_queue::push() and micro_queue::pop()
have a spin_wait loop which is regarded as a lock. Also, page allocation/deallocation in micro_queue need protection, too.

We may provide a lock-free version of concurrent queue if users want it really really badly, but our experience so far has been that lock-freedom does not come for cheap; it usually requires more complex code which is harder to validate for correctness, and it costs you more in terms of performance.
January 27, 2011 9:00 PM PST

softarts
softartsTotal Points:
1,225
Registered User
(Question1)"Is it true that TBB queue is not FIFO when multiple producers and consumers are operating at the same time?"
i.e, A0,A1,A2 is in time order,respectively pushed to micro_queue0,micro_queue1,micro_queue2....
is it possible that consumer pop A2 earlier than A1?
(Question2)
how is the "sleep" policy? simply sleep or use pause instruction in a while loop?

Trackbacks (0)


Leave a comment  

To obtain technical support, please go to Software Support.
Name (required)*

Email (required; will not be displayed on this page)*

Your URL (optional)


Comment*