Performance of Parallel Pipeline - How to Improve?

Performance of Parallel Pipeline - How to Improve?

Hello there!

I've parallelized a list chasing toy benchmark using TBB parallel_pipeline and got a surprising result. The program was parallelized using a pipeline of two stages where both stages does the same computation, furthermore the communication between the stages is just a pointer and the second stage is parallel.

What surprised me was that using this partitioning with TBB the best execution time I got was about ~5s (with only one token..) but when I execute the program with this same partitioning and a "hand-written" cache-friendly queue [1,2] the execution time is only ~1,5s. The serial execution of the program is ~3s.

I'm not very familiar with TBB so I may be doing something wrong - I started using it a few days ago. So my question is: Am I doing something wrong here? How can I improve the performance of this example? 

It seems that the "buffer" used to communicate between stages is not very optimized - the best execution time I got was when using only one token. It is possible that false-sharing is causing the slowdown. Is there a way to change the pipeline to use another more cache-friendly "buffer"/queue or even a custom queue implementation?

The source code is here: http://pastebin.com/bwB9Yfzq

[1] http://www.cse.cuhk.edu.hk/~pclee/www/pubs/ancs09poster.pdf
[2] http://ce.colorado.edu/Publications/pact07_ff.pdf

 
15 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

http://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_pipeline_func.htm

See comment of the example:

The example is only for demonstrating syntactic mechanics. It is not as a practical way to do the calculation because parallel overhead would be vastly higher than useful work.

If I understood your code correctly, that's exactly what happens here

I don't know if this is just an simple example of a much more complex functionality. Anyway two simple tips to make your code perform much faster.

1. use vectors instead of linked lists

2. create 2 vectors with half the # of elements each and let both vectors be calculated on 2 differerent tasks or just use parallel_for to let the scheduler split it up

(Added after #5) Paragraph 1 was a reaction to "the second stage is parallel", which is contradicted in the actual benchmark code (see #5); paragraph 2 was purely my mistake (see correction in #5); paragraph 3 is just an aside.

I have not looked at the benchmark code, but a TBB pipeline does not use a queue before a parallel stage: the same thread that just finished executing one stage will immediately execute any subsequent parallel stage (for the same item), striking while the iron is hot (in cache). That means that queue implementation could not possibly play any role in a pipeline where all non-initial stages are parallel.

It seems a bit unfair to not mention execution time with more than one token.

Note that, with one token, a pipeline is executed purely sequentially, but still incurs scheduling overhead, which is apparently non-negligible, so perhaps TBB's implementation should bypass the scheduler for that special case?

 

I only just noticed that you mentioned that the best execution time was with one token. And, when I then look at the benchmark code, I find that do_work() uses the second stage as serial_in_order (not only not parallel, but heavier than serial_out_of_order!), and that there is rather little work per stage (emphasising overhead).

What is the result with serial_out_of_order?

 

 

I don't know if you guys got my point here. I'm try to be more clear using a list of items:

- Although the linked code used "serial_in_order" I have done experiments with all variants (in-order, O-o-O and parallel) and the result is as I explained in the first post.

- I have experimented with several number of tokens. And the results are as I said in the first post.

- I run the *same* benchmark with a hand-tunned queue (or better, inter-core/thread comm. mechanism) and the result is way better. So, yes the benchmark is simple and somewhat "lightweight", but I have a speedup of 2x (over the original serial implementation) with a custom inter-core queue.

- My main motivation to use TBB is the easy to parallelize the programs, i.e., the API. But it seems that a hand-coded Queue/Buffer *considerably* outperforms the default API.

- My question is: how is the inter-stage communication handled in TBB? Is there a way that I can change it to use a custom implementation?

I'm considering that TBB uses queue because in the literature of thread pipeline usually queues are mentioned to decouple execution. If that is not the case please explain. 

Quote:

Raf Schietekat wrote:

I only just noticed that you mentioned that the best execution time was with one token. And, when I then look at the benchmark code, I find that do_work() uses the second stage as serial_in_order (not only not parallel, but heavier than serial_out_of_order!), and that there is rather little work per stage (emphasising overhead).

What is the result with serial_out_of_order?

Serial-out-of-order and serial-in-order resulted in the same thing. The reason probably is because the dominating overhead is in the communication.

I'd say I could not believe it until I tried it for myself.

I changed code a little bit to use better parallelism:

void do_work(struct node *ptr) {
    auto NewStage1 = [&](tbb::flow_control& fc)  -> struct node * {
        if (ptr == NULL) {
            fc.stop();
            return NULL;
        }
        auto aux = ptr;
        ptr = ptr->next;
        return aux;
    };

    tbb::parallel_pipeline(
        10,
        tbb::make_filter<void, struct node*>(tbb::filter::serial_in_order, NewStage1)
        & tbb::make_filter<struct node*, void>(tbb::filter::parallel, Stage2())
    );
}

This code gives a negative speedup 2 times SLOWER than Divino code.. (Ran this on a 32 cores opteron).

CPU it quite busy though..  Here is a snapshot of the profiler:

 

Attachments: 

AttachmentSize
Downloadimage/jpeg firstrun.jpg241.53 KB

Doing the same thing with a parallel_for

Yields significant speedup (executes in 0.13s compared to 7 seconds with previous example)

    std::vector<struct node *> tmpv;
    while (ptr != nullptr) {
        tmpv.push_back(ptr);
        ptr = ptr->next;
    }

    CGL::StopWatch timer;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, tmpv.size(), 1000), [&] (const tbb::blocked_range<size_t> p_Range) {
        Stage2 stage2;
        for (size_t i = p_Range.begin(); i < p_Range.end(); ++i) {
            stage2(tmpv[i]);
        }
    });

    std::cout << timer.Elapsed() << std::endl;

 

Going back to the original problem..  Here is the hotspot:

 

Attachments: 

AttachmentSize
Downloadimage/jpeg firstrun-hotspot.jpg56.45 KB

Hello Michel, quite interesting results.

The results for the "parallel_for" implementation is not too surprising since it uses a "block" partitioning and thus is able to amortize some communication overhead. However, as I understand the parallel_for using blocked_range iterators require a priori knowledge of the number of iterations of the loop - which I'll not always have.

In recent experiments I increased the size of the list's node to 64 bytes in an attempt to circumvent false-sharing but that did not improve the performance. May be you could try this on your 32-core machine? The list's node I used was this:

struct node {
	char pad[48];
	int val1;
	int val2;
	struct node *next;
};

 

From the profiling I've done, the problem seems to be in the internal tbb task stealing.  
Maybe a pipeline is not really a good fit for ultra fast stages..

Here is a test..  If you bundle a batch of nodes per stage, you get almost all the speedup of a parallel_for..

    typedef std::pair<struct node*, struct node*> TokenType;
    auto NewStage1 = [&](tbb::flow_control& fc)  -> TokenType {
		if (ptr == NULL) {
			fc.stop();
			return TokenType(nullptr, nullptr);
		}
        int batchSize = 1000;
        TokenType tok(ptr, nullptr);
        while (batchSize > 0 && ptr != nullptr) {
            ptr = ptr->next;
            --batchSize;
        }
        tok.second = ptr;
        return tok;
    };

    auto NewStage2 = [] (TokenType tok) {
        Stage2 stage2;
        while (tok.first != tok.second) {
            stage2(tok.first);
            tok.first = tok.first->next;
        }
    };

    CGL::StopWatch timer;
	tbb::parallel_pipeline(
			32,
			tbb::make_filter<void, TokenType>(tbb::filter::serial_in_order, NewStage1)
			& tbb::make_filter<TokenType, void>(tbb::filter::parallel, NewStage2)
	);
    std::cout << timer.Elapsed() << std::endl;

 

Quote:

Divino C. wrote:
I don't know if you guys got my point here.

It was a bit confusing...

Apparently the implementation has changed somewhat since I last looked at it, unless I'm mistaken that execute() did not just jump out of execute() at the end of each stage. Why would a task at the end of the pipeline only recycle itself if there were already "idle tokens waiting for input stage" (see "Only recycle if there is one available token" in pipeline.cpp)? Perhaps this should be monitored, e.g., by counting the number of stage_task instances and the number of times they get recycled? Or maybe the task should be recycled regardless of the previous value of input_tokens? Perhaps worth a try...

(Added) Since the decision whether to recycle is not reflected in a further change to input_tokens, I'm dubious about the invariants, so I'll leave the details to whoever wants to experiment with this. Perhaps it's safer to (initially) try "ntokens_avail>=1" instead of removing the condition altogether. Again, I'm not sure that this exact change will be OK, let alone beneficial, but it seems plausible that the solution is to be found here.

> How can I improve the performance of this example? 

The reason for bad performance was correctly named in the very first response: overheads for creating and executing tasks dominate in the execution time.

Without some aggregation, you will not get much better, scalable performance. That does not mean there is nothing to improve in the TBB's pipeline implementation. But these improvements would have quantitative, not qualitative, impact on performance. It can be seen in this discussion: a hand-written queue reduced the execution time twice comparing to serial execution; but aggregation of work by portions, performed by a suboptimal pipeline implementation, reduced it to more than an order of magnitude. And the code by Michel Lemay shows how to do aggregation even without a priori knowledge of the size of the data set.

>  From the profiling I've done, the problem seems to be in the internal tbb task stealing.

No, it's not. It's a common mistake, unfortunately not yet well addressed/explained by profilers, to consider these hotspots as parallel runtime overhead for stealing or scheduling. In fact, the hotspot indicates busy waiting, which can be caused by work imbalance or, as in this case, lack of work. When you ran a pipeline with just 10 tokens on 32 threads, obviously 22 threads did nothing but looking for work at any given point in time.

> I'm considering that TBB uses queue because in the literature of thread pipeline usually queues are mentioned to decouple execution. If that is not the case please explain. 

Queues are used by the pipeline only when an item cannot be immediately processed and need to be buffered. This is typically necessary for serial filters preceded by a faster parallel or serial filter. Otherwise, a data item is not queued but processed immediately by the next filter. E.g. if the second filter in your code is parallel (as it should be, because it does not modify any shared state), no queues are used at all.

Quote:

Alexey Kukanov (Intel) wrote:

Without some aggregation, you will not get much better, scalable performance. That does not mean there is nothing to improve in the TBB's pipeline implementation. But these improvements would have quantitative, not qualitative, impact on performance.

Sure, aggregating more work per item will mask the intrinsic performance of the pipeline implementation, and that would be an excellent workaround if the application supported it, but why should it be so easy to substitute another algorithm to get better performance at any given granularity? (BTW, I'm not sure that "quantitative" is supposed to mean a weaker form of "qualitative"...)

I've wondered before why TBB's own pipeline isn't used as the input buffer for serial_out_of_order at least. And now there seems to be an alternative for serial_in_order as well?

Also, why should there be any (unamortised) overhead at all for task creation? I would expect a stage_task to cycle around to the beginning unconditionally, but instead I see it giving up a token ("size_t ntokens_avail = ++my_pipeline.input_tokens;") and then only recycling itself if "ntokens_avail>1" does not hold where ntokens_avail is ++input_tokens, and I don't immediately see what that means. Doesn't this possibly result in excess stage_task instances, and, if so, couldn't that break some code, like reading from a circular buffer? Doesn't it prevent recycling instances in a fully ramped-up pipeline, and, if so, shouldn't that immediately be fixed for better performance? What does it mean that the decision whether to recycle is not reflected in a reverse update to input_tokens?

Disclaimer: My own ramp-up in this thread has been a bit shaky, and I have not yet analysed the latest implementation beyond the above initial impressions.

(Added 2014-11-13, modified 2014-11-14) Sorry, please disregard most of what I wrote in the main paragraph above: it was written too hastily, but I don't want to give specific corrections that may themselves be wrong. Perhaps I can concentrate on this at a later time, and come up with something that does make sense. :-)

Leave a Comment

Please sign in to add a comment. Not a member? Join today