Possible concurrent_queue improvement

Hi,

I was wondering why the concurrent queue uses compare-and-swap to get the next ticket instead of fetch-and-increment (in concurrent_queue_base_v3::internal_pop_if_present and concurrent_queue_base_v3::internal_push_if_not_full).

Calling compare-and-swap may hurt the queue's performance under high contention; the reasons for this are best explained by Dave Dice here: https://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs

So if the call to compare_and_swap( tk+1, tk ) were changed to fetch_and_inc(tk), it would make the queue more scalable.
Any thoughts?

Thanks,
Elad. 


Hi Elad,

the very use of shared atomic counters (for the sake of ordering guarantees) makes the implementation inherently limited in scalability. Nevertheless, it's worth an experiment to see whether changing CAS to atomic increment helps concurrent_queue scalability on modern hardware, and by how much.

And maybe throw in an atomic_backoff?

Alexey, 
You are right that the FIFO property means scalability is limited. However, I do believe that using fetch-and-inc instead of CAS will give better performance.

Raf, 
If fetch-and-inc were used, there would be no need for atomic_backoff, since it always succeeds. This is its main advantage over CAS, and the reason it is likely to be more scalable: there is no need to retry when contention is high.

There's still a loop around it, though; as long as you're evaluating performance in that detail, you might as well have a quick look? I keep wondering how to avoid letting threads that were unsuccessful a few times remain at a disadvantage (they might even be starved!), but it still seems somehow more polite than just hammering away...

Hi Elad,

Thanks for the suggestion. As a matter of fact, tbb::concurrent_queue is already using atomic increment (based on tbb::fetch_and_inc()) in its push()/pop() routines.

Note that head_counter/tail_counter are declared as tbb::atomic<ticket>, and their values are incremented with '++' in internal_push/internal_pop.

Now, I think you found the uses of compare_and_swap in internal_pop_if_present/internal_push_if_not_full. These are for bounded queues; there, we would like the calling thread to return immediately if the condition is no longer met. With xchg, things get really messy (and I am not sure a correct implementation is even feasible), because one of the assumptions the concurrent queue is built upon is that these counters are monotonically increasing.

Thanks.

Sorry I missed that! I'm trying to concentrate on something else right now, and I've already failed by getting distracted about uses of backoff. :-)

But while on that theme, what is the view about backoff in the queue? I see there's one for pushing, but it's mainly about concurrent_monitor, which seems to go straight to a kernel-level semaphore (right?). I seem to remember that this was a change from an earlier spin-only implementation, but shouldn't there be some middle ground? Just curious...

Raf, the general idea is:

Assume you have two queues. Each queue is a ring buffer whose size (rbSize) is greater than max(producers, consumers, arbitrarySize). NULL is used to indicate a cell is not occupied. One buffer is a list of tokens (addresses of structs/nodes) that your thread must possess in order to insert into the second queue. The second queue is the DoWork queue. IOW, in order to enqueue a node into the DoWork queue, the thread must first obtain a node (into which it places the work items or references). Each DoWork queue has an exclusive free-node queue. Pseudo-code:

Enqueue:
index = mod(XCHGADD(&fillIndex, 1), rbSize)
while(rb[index]) WaitOrWorkStealOrDoNothing; // Debug only: an occupied slot should never happen
rb[index] = node
end Enqueue

Dequeue:
index = mod(XCHGADD(&emptyIndex, 1), rbSize)
while(rb[index] == NULL) WaitOrWorkStealOrDoNothing; // or after the while, use CAS to back off and return NULL
node = rb[index]
rb[index] = NULL
end Dequeue

Jim Dempsey

www.quickthreadprogramming.com

I'm afraid I don't recognise any of that in TBB's concurrent_queue.

Raf, it is not in concurrent_queue; it's used in place of concurrent_queue, though in some respects it behaves like concurrent_queue.

Using TBB currently, Raf might obtain a node from the scalable allocator or heap, insert data, then enqueue it into the current concurrent_queue. Then on the other side, extract the node from concurrent_queue, do the work, and return the node to the scalable allocator or heap.

Using the suggested system, Raf would obtain a node from a private pool used exclusively by new_concurrent_queue (you change the names to protect the innocent). The new_concurrent_queue object would be constructed to contain the private pool of free tokens (nodes). Enqueue the node to new_concurrent_queue (the work queue). Then on the other side, extract the node from the work queue, do the work, and return the node to the private queue.

Enqueue can never overflow: it will always have a free slot, whose slot number can be obtained by one XCHGADD (no failure as with CAS).
Dequeue of a free node could potentially come up empty-handed, but if coded well it will always (or at least mostly) succeed, again at the expense of one XCHGADD.

The question then becomes twofold:

a) When the queue is not under contention, are 4 XCHGADDs less overhead than: allocate node from scalable allocator or heap + concurrent_queue fill + concurrent_queue empty + return node to scalable allocator or heap?

b) When under highly contentious circumstances, are 4 XCHGADDs less overhead than numerous failures on CASes as implemented in the current scheme of things?

Code would have to be written before one could answer both questions, however, my gut feel is a) will be less overhead, and b) will be orders of magnitude less overhead.

Jim Dempsey

www.quickthreadprogramming.com

I see, you were not correcting my analysis but proposing an alternative... Well, the CAS is only for non-blocking try_push() and try_pop() with concurrent_queue, if I'm not mistaken, so b) might have to be reformulated?

(2013-04-03) Shortened.

An additional optimization would be for each thread to have a private queue of free nodes (for use in the new_concurrent_queue), where the ring buffer is greater than the sum of the sizes of the private free-node queues. Using this technique, a full cycle through new_concurrent_queue would take a hit of only 2 XCHGADDs.

Jim Dempsey

www.quickthreadprogramming.com
