I've parallelized a toy list-chasing benchmark using TBB's parallel_pipeline and got a surprising result. The program was split into a pipeline of two stages, where both stages do the same computation; the data passed between the stages is just a pointer, and the second stage is parallel.
What surprised me was that with this partitioning under TBB the best execution time I got was about ~5 s (and that with only one token!), whereas running the program with the same partitioning but a hand-written cache-friendly queue [1,2] takes only ~1.5 s. The serial execution of the program takes ~3 s.
I'm not very familiar with TBB (I only started using it a few days ago), so I may be doing something wrong. My questions are: am I doing something wrong here, and how can I improve the performance of this example?
It seems that the internal buffer TBB uses to pass items between stages is not well optimized for this access pattern: the best execution time I got was with only one token. It is possible that false sharing is causing the slowdown. Is there a way to make the pipeline use a more cache-friendly buffer/queue, or even a custom queue implementation?
The source code is here: http://pastebin.com/bwB9Yfzq