Performance issue with compact form of parallel_for

Hi, I am running a simple parallel_for over an array of floats, implemented both in the range-based form and in the compact form that loops over a consecutive range of integers, i.e.:

   tbb::parallel_for(
            tbb::blocked_range<size_t>(0,n),
            [=](const tbb::blocked_range<size_t>& r) {
                for ( size_t i = r.begin(); i != r.end(); ++i )
                    Foo(a[i]);
            }
            );

and

   tbb::parallel_for( size_t(0), n,
            [=](size_t i) {
                Foo(a[i]);
            } );

The second version always performs worse than the first, usually by a factor of 2x or more. Any idea why that is happening?


What are these?

...
[=](consttbb::blocked_rangelt;size_tgt;&r){
...

and

...
[=](size_ti){
...

Is that a result of some incorrect "Copy-and-Paste" operation?

I guess it's just some wrong conversion of less-than/greater-than symbols from within the code block to the corresponding HTML keywords.

Most likely, the compiler is able to vectorize the inner (serial) loop in the first case, but cannot vectorize its equivalent in TBB internals in the second case. Something for us TBB developers to look at.

Yes, probably. It turns out that the performance difference shows up only when compiling with -O3. With -O2, the performance is the same for both loop versions (in fact, the performance of the "compact" version does not improve at all when going from -O2 to -O3). Good to know that you are looking at it.

You could confirm Alexey's very plausible analysis, and get intermediate relief, by going into include/tbb/parallel_for.h and hoisting r.end() out of the loop in parallel_for_body::operator(), i.e., assign its value to a variable i_end and evaluate i
(Added) Hmm, better also make a local copy of my_step...
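
For anyone who wants to see what that hoisting looks like in isolation, here is a minimal sketch. It is not the actual code from include/tbb/parallel_for.h: IndexRange, PerIndexBody, run_unhoisted, run_hoisted and double_all are hypothetical names made up for this illustration, and the loop condition is an assumption. The only point is the shape of the change: the range end and the step are copied into locals before the loop, so the compiler does not have to prove that they stay constant across calls to the user functor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a blocked range; this is NOT tbb::blocked_range.
struct IndexRange {
    std::size_t first;
    std::size_t last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical adapter that turns a per-index functor into a per-range body,
// in the spirit of the wrapper used by the compact form of parallel_for.
template <typename Func>
struct PerIndexBody {
    Func func;
    std::size_t step;

    // Before: r.end() and the member step are re-read on every iteration.
    void run_unhoisted(const IndexRange& r) const {
        for (std::size_t i = r.begin(); i < r.end(); i += step)
            func(i);
    }

    // After: both values are hoisted into locals ahead of the loop.
    void run_hoisted(const IndexRange& r) const {
        assert(step > 0);                     // a zero step would never terminate
        const std::size_t i_end = r.end();    // hoisted range end
        const std::size_t local_step = step;  // local copy of the step
        for (std::size_t i = r.begin(); i < i_end; i += local_step)
            func(i);
    }
};

// Small driver: doubles every element of a through either variant.
inline void double_all(std::vector<float>& a, bool hoisted) {
    float* p = a.data();
    auto body = [p](std::size_t i) { p[i] *= 2.0f; };
    PerIndexBody<decltype(body)> wrapper{body, 1};
    IndexRange r{0, a.size()};
    if (hoisted)
        wrapper.run_hoisted(r);
    else
        wrapper.run_unhoisted(r);
}
```

Both variants compute the same result. Whether the hoisted one actually vectorizes better depends on the compiler and flags, so it is worth timing both at -O2 and -O3 before drawing conclusions.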
