Question on tbb::concurrent_queue


Hi, I have a requirement where a thread has a lot of requests (in the hundreds) to enqueue. I was wondering: if I can form an array or something and enqueue it directly, that should be faster than touching the shared resource again and again. But I don't find an API like that in the reference documentation. Even if I enqueue it as an array, it should get dequeued element-wise. Can someone explain what would be the best approach to solving a problem like this?

Thanks in advance,
Gokul.


How many producers and consumers do you have?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Producers and consumers will be in the range of 5-20.

Gokul.

> if i can form an array or something and enqueue it directly, it should
be faster than touching the shared resource again and again.

Definitely. Batch enqueue is a useful feature in some situations.
AFAICT, TBB queue does not support it currently. I think that TBB queue can be patched to support batch enqueue.
However, the real advantage of batch enqueue is revealed in node-based queues, where a thread can prepare a chain of linked nodes and pass pointers to the head and tail of the chain to the queue. I think it's quite easy to modify my node-based queue algorithm this way:
http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300

Ah, OK, I can do it right now:

void enqueue(node_t* head, node_t* tail)
{
    tail->next_ = 0;
    // Atomically swap in the new tail; 'prev' is the previous tail node.
    // _InterlockedExchangePointer is pointer-sized, so this also works on
    // 64-bit targets (casting pointers through 'long' would truncate there).
    node_t* prev = (node_t*)
        _InterlockedExchangePointer((void* volatile*)&tail_, tail);
    // Publish the whole chain by linking it after the previous tail.
    prev->next_ = head;
}


So, 100 or 1000 nodes enqueued with a single InterlockedExchange. Try to beat it! ;)


>> However real advantage of batch enqueue can be revealed in node-based queues
Can you explain what is the difference between batch node-based queues and tbb::concurrent_queue?

Thanks,
Gokul.

Quoting ksgokul
>> However real advantage of batch enqueue can be revealed in node-based queues
Can you explain what is the difference between batch node-based queues and tbb::concurrent_queue?

Batch *enqueue* (not batch queue). Well, by batch enqueue I just mean a situation where you pass a batch of nodes to the queue for enqueue, and the queue somehow optimizes it compared to a series of single-node enqueues.


Sorry, my question was: "Can you explain what is the difference between your node-based queues and tbb::concurrent_queue?" I read in a post that tbb::concurrent_queue can take more than one producer at the same time?

Thanks,
Gokul.

Quoting ksgokul
Sorry, my question was: "Can you explain what is the difference between your node-based queues and tbb::concurrent_queue?" I read in a post that tbb::concurrent_queue can take more than one producer at the same time? Thanks, Gokul.

The TBB queue is array based. This means that in order to insert a batch of N elements, you need to do N object copies and hold the mutex all that time: O(N). Not very cool for a batch insert.

My queue, in contrast, is node based. This means that in order to insert a batch of N elements, you need to update 2 pointers using 1 atomic exchange and 1 plain store: O(1).


Dmitriy Vyukov wrote:
>That my queue is node based. This means that in order to insert a batch of N elements, you need to update 2 pointers using 1 atomic exchange and 1 plain store. O(1).

But 'before' inserting the batch, don't you need to organize it in a linked-list form?

And if you organize it in a linked-list form - and insert many nodes -
that will still take O(N) to insert... no?

>[...] and hold the mutex all that time

That's true.

Amine Moulay Ramdane.

Dmitriy Vyukov wrote:
>Tbb queue is array based...[...]
>That my queue is node based.
>and hold the mutex all that time

Not all the time; you can insert them one by one in my
lockfree ParallelQueue or lockfree_mpmc...

But in a node-based queue, you still have to access the FREELIST or the
MEMORY MANAGER first, and you have to use a CAS for that, no? And that will
add to the 'contention' also, so it's the same as the array based - think about it more!

Sincerely,
Amine Moulay Ramdane.


Hello,

I know that you are smart Dmitry...

So follow with me please...

If you win $5 with the left hand, and you lose $5 with the right hand...

You have to see the big picture to be able to judge, not just the left hand...

So read carefully.

But in a node-based queue, you still have to access the FREELIST or the
MEMORY MANAGER first, and you have to use a CAS for that, no? And that will
add to the 'CONTENTION' also, so it's the same as the array based - think about it more!

So the left hand here is the node-based queue, and the right hand is the FREELIST or
the MEMORY MANAGER.

So, to be able to judge, you have to judge from the BIG picture, not just the left hand...

>That my queue is node based. This means that in order to insert a batch of N elements, you need to update 2 pointers using 1 atomic exchange and 1 plain store. O(1).

But 'before' inserting the batch, don't you need to organize it in a linked-list form?

And if you organize it in a linked-list form - and insert many nodes -
that will still take O(N) to insert... no?

Sincerely,
Amine Moulay Ramdane

Hello,

I know that you are smart Dmitry...

As I said, if you win $5 with the left hand, and you lose $5 with the right hand,

You have to see the big picture to be able to judge, not just the left hand...

So if you take a look for example at my lock-free ParallelQueue...

You will notice that I have made the following affirmation:

"Because my lock-free ParallelQueue algorithm is using a hash-based method and lock striping - with multiple lock-free queues working as one lock-free queue, see the source code - to reduce the contention more, and is using just a LockedInc(), so I am REDUCING the duration for which locks are held AND REDUCING the frequency with which locks are requested; hence I am REDUCING A LOT the contention, so it's very efficient.

So, my ParallelQueue algorithm is lock-free and is using lock striping
and also avoids false sharing; it's why it scored 17 million pop() transactions
per second (see the graphs below)... better than flqueue and RingBuffer [...]."

But there is still an important piece of information that I forgot to put on the page
(to be able to see the big picture, not just the left hand!):

If you take a look at lockfree_mpmc, it's using a

LockedIncLong(temp) + CAS(tail[0], lasttail, newtemp) on the push() side...

But if you take a look at my lock-free ParallelQueue, I am only using
a single LockedInc()

etc.

Welcome:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

Sincerely,
Amine Moulay Ramdane.


Hello,

As you have noticed Dmitry, my Thread Pool engine is also using my lock-free
ParallelQueue to be able to SCALE better in the future...

http://pages.videotron.com/aminer/threadpool.htm

And as you have noticed, with my Object Pascal Thread Pool Engine
I have implemented my Parallel Compression Library, my Parallel Sort Library, etc.

Give me a small and efficient thing and I will design a better and bigger thing!

That was my idea!

It's amazing all the things that we can do with just this efficient Object Pascal
Thread Pool engine that I have designed!

Not too bad, right Dmitry?

Welcome:

http://pages.videotron.com/aminer/

Sincerely,
Amine Moulay Ramdane.

>>>[...] and hold the mutex all that time

>That's true.

Actually I am more concerned about this. It's actually O[N] inserts without holding the mutex and O[1] with the mutex - for a batch-based insert. It's actually O[N] inserts with the mutex - if I make the batch insert get inserted one by one. I think if we take the contention factor together with the order of operations, we might get a fair picture.

Gokul.

Hello,

If you ask me a question like:

"Your Thread Pool Engine and lockfree ParallelQueue and lockfree_mpmc,
are working with Windows, Linux , Solaris and MacOSX on x86 , why
not port them also to Sparc Solaris station also ?"

My answer:

That's very easy to port my Thread Pool Engine, lockfree ParallelQueue
and lockfree_mpmc to a Sparc Solaris multicore station, etc., and I will do
it in the near future...

and FreePascal 2.4.2 is already here:

http://www.freepascal.org/download.var

http://www.freepascal.org/

Sincerely,
Amine Moulay Ramdane.

>Actually i am more concerned about this.
>Its actually O[N] inserts without holding the mutex and O[1] with the mutex
>- for a batch based insert. Its actually O[N] inserts with the mutex - if i make
>the batch insert get inserted one by one.
>I think if we take the contention factor together with the Order Of Operations,
>we might get a fair picture.
>Gokul

You still didn't get the BIG picture, Gokul...

In the lock-free array-based queue there is *NO* CONTENTION on the locks
inside a freelist or a memory manager.

In the lock-free node-based one, there is contention on the locks inside
the freelist or the memory manager...

Sincerely,
Amine Moulay Ramdane.

> But in a node-based queue, you still have to access the FREELIST or the
MEMORY MANAGER first, and you have to use a CAS for that, no? And that will
add to the 'CONTENTION' also, so it's the same as the array based - think about it more!

I'm not insane enough to use a centralized memory manager in a parallel program (most likely I wouldn't use a centralized multi-producer/multi-consumer queue either, but that's an aside for the moment). So of course the memory comes from a local cache. No contention here.

> And if you organize it in a link list form - and insert many nodes -
that still will take O(N) insert... no ?

It's a local operation conducted not under the mutex. Moreover, it should be combined with filling in the nodes, so it's basically zero overhead.

> Because my lock-free ParallelQueue algorithm is using a hash-based method and lock striping

Well, you can use it with my queue too... However the better design would be to not use a centralized queue in the first place.



Dmitriy wrote:
>> Because my lock-free ParallelQueue algorithm is using a hash-based method and lock striping

>Well, you can use it with my queue too...

Right, and keep up the good work Dmitriy...

Sincerely,
Amine.

I wrote:
>>>Because my lock-free ParallelQueue algorithm is using a hash-based method and lock striping

>>Well, you can use it with my queue too...

>Right, and keep up the good work Dmitriy..

I will add this, Dmitriy...

Even if my lock-free ParallelQueue - that I am using inside my Object Pascal Thread Pool
Engine - doesn't scale beyond 100 or 200 cores... you can still run many process
applications or threads in parallel that will use my Thread Pool Engine(s), and every
process application or thread will scale to 100 cores. For example, a parallel thread
that uses my Parallel Sort Library will scale to 100 cores, and another thread or process
that uses my Parallel Compression Library will scale to 100 cores, etc., so that your
thousand(s)-core system will still be smartly UTILIZED. That's still fun! Right, Dmitriy?

Sincerely,
Amine Moulay Ramdane.

Quoting aminer10
you can still run many process
applications or threads in parallel that will use my Thread Pool Engine(s), and every
process application or thread will scale to 100 cores.

That's pretty cool!

Unfortunately, I can't say the same about my code, I don't have a 100-core machine for tests...


Intel: 1,000-Core Processor Possible:

http://www.pcworld.com/article/211238/intel_1000core_processor_possible.html

Sincerely,
Amine Moulay Ramdane.

Dmitry wrote:
>That's pretty cool!

>Unfortunately, I can't say the same about my code, I don't have a 100-core machine for tests...

Even if every thread or process that uses my Parallel Sort Library or Parallel
Compression Library, etc., will not scale beyond 16 cores, that's still fun!

You have to *UTILIZE* your hundred or thousand cores, so run as many as possible
of your processes or threads on your system, even if they don't scale very well!

That's still cool ...

Sincerely,
Amine Moulay Ramdane.

I wrote:
>Even if my lock-free ParallelQueue - that i am using inside my Object Pascal
>Thread Pool Engine - doesn't scale beyond 100 or 200 cores...
[..]
>For example, a parallel thread that uses my Parallel Sort Library will scale
>to 100 cores, and another thread or process that uses my Parallel Compression
>Library will scale to 100 cores, etc.
>[...]

But you know very well, Dmitriy, that it was just a hypothetical example that I gave;
it was not the reality yet...

Look for example at the following application; it does not scale beyond 8 or 10 cores
on a 32-core system:
http://software.intel.com/en-us/blogs/2010/11/18/benchmarks-of-a-haskell-model-checking-application-running-on-intels-manycore-testing-lab-2/

But as I said before, even if every thread or process that uses my
Parallel Sort Library or Parallel Compression Library, etc., will not scale
beyond 16 cores, that's still fun!

You have to *UTILIZE* your hundred or thousand cores, so run as many as possible
of your processes or threads on your future 100- or 1000-core system, even if they
don't scale very well!

That's still cool...

It's good to make it scale 'better', but, in reality, near-perfect scaling is harder to realize...

Sincerely,
Amine Moulay Ramdane.

> Look for example at the following application; it does not scale beyond 8 or 10 cores
on a 32-core system

Yeah, I saw this one. The funny thing is that the type of simulation they do should easily scale to 32 cores.

> It's good to make it scale 'better', but, in reality, near-perfect scaling is harder to realize...

The problem is not sub-linear scalability; the real problem is negative scalability. Check out the graphs in the article you pointed to: what number of threads would you choose for the program - 32 or 16?


Dmitry:
>The problem is not sub-linear scalability; the real problem is negative scalability.
>Check out the graphs in the article you pointed to: what number of threads would you
>choose for the program - 32 or 16?

You are right...

And they didn't say on the webpage what factor limits scaling and even makes the scalability negative?

Do you have any idea about this simulation program, Dmitriy?

Sincerely,
Amine.

There are a lot of possible problems. I don't know how they parallelize it, and don't know details of Haskell scheduler and runtime, so I have no idea wrt negative scalability.


Dmitry wrote:
>There are a lot of possible problems. I don't know how they parallelize it,
>and don't know details of Haskell scheduler and runtime, so I have no idea
>wrt negative scalability

I understand.

They have not talked about it, and have not explained to us what
was happening exactly inside their code...

Sincerely,
Amine.

>> In the lock-free array-based queue there is *NO* CONTENTION on the locks
>> inside a freelist or a memory manager.

Can you point me to some article on this? So does that mean it wouldn't matter if I do the batch inserts as multiple single inserts?

Thanks,
Gokul.

Please sign in to add a comment. Not a member? Join today