parallel_for is easier with lambdas, Intel® Threading Building Blocks

Lambdas are an exciting new addition to C++ in the current draft for C++0x (see my prior post, "Hello Lambda", for my introduction to lambdas). The Intel compilers support them now, and Microsoft has support in their beta for Visual Studio 2010. I think we can expect to see support for lambdas added quickly, and a great deal of interest in using them in C++ code. Lambdas quite simply allow code to be specified inline, in ways that I find particularly useful for parallel programming constructs - notably for Intel® Threading Building Blocks (Intel® TBB).

Intel TBB supports forms of algorithms in the old style (without lambdas) and in the new style with lambdas.

To get an idea of why I expect lambdas will be very popular with Intel TBB, we can look at the "with" and "without" syntax.

The "old" syntax, before lambdas were available, involved coding your "work to be done in parallel" into an operator() within a class. It was the toughest thing to teach about using Intel TBB - and it is THE reason C programmers complained about "C++ syntax" being needed with Intel TBB.

parallel_for(range, body, optional partitioner) w/out lambdas


#include "tbb/tbb.h"
using namespace tbb;

// Foo() is assumed to be defined elsewhere.
class ApplyFoo {
  float *const my_a;
public:
  // Called by parallel_for on each sub-range of the iteration space.
  void operator()( const blocked_range<size_t>& r ) const {
    float *a = my_a;
    for( size_t i=r.begin(); i!=r.end(); ++i )
      Foo(a[i]);
  }
  ApplyFoo( float a[] ) :
    my_a(a) {}
};

void ParallelApplyFoo( float a[], size_t n ) {
  parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}

Writing the same program using lambdas is actually quite readable:

parallel_for(range, body, optional partitioner) with lambdas


#include "tbb/tbb.h"
using namespace tbb;

void ParallelApplyFoo( float* a, size_t n ) {
  parallel_for( blocked_range<size_t>(0,n),
    [=](const blocked_range<size_t>& r) {
      for(size_t i=r.begin(); i!=r.end(); ++i)
        Foo(a[i]);
    }
  );
}
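To make the "with" and "without" comparison concrete without needing TBB installed, here is a minimal sketch showing that the hand-written functor class and the lambda do the same job. Everything in it apart from the ApplyFoo shape is an assumption made for illustration: `IndexRange` and `serial_for` are hypothetical serial stand-ins for blocked_range and parallel_for (they are not TBB APIs), and this `Foo` simply doubles its argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Foo(): doubles the element in place.
inline void Foo(float& x) { x *= 2.0f; }

// Hypothetical serial stand-in for tbb::blocked_range<size_t>.
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical stand-in for parallel_for: splits the range into two
// halves and runs the body on each half, one after the other.
template <typename Body>
void serial_for(const IndexRange& r, const Body& body) {
    assert(r.begin() <= r.end());
    const std::size_t mid = r.begin() + (r.end() - r.begin()) / 2;
    body(IndexRange{r.begin(), mid});
    body(IndexRange{mid, r.end()});
}

// "Old" style: the work lives in operator() of a named class.
class ApplyFoo {
    float* const my_a;
public:
    explicit ApplyFoo(float* a) : my_a(a) {}
    void operator()(const IndexRange& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i) Foo(my_a[i]);
    }
};

void ApplyWithFunctor(float* a, std::size_t n) {
    serial_for(IndexRange{0, n}, ApplyFoo(a));
}

// "New" style: the lambda captures the pointer a by value, which is
// what the my_a member and the constructor did by hand above.
void ApplyWithLambda(float* a, std::size_t n) {
    serial_for(IndexRange{0, n}, [=](const IndexRange& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) Foo(a[i]);
    });
}

// Runs both versions on copies of the same input and compares results.
bool FunctorAndLambdaAgree(const std::vector<float>& input) {
    std::vector<float> x = input, y = input;
    ApplyWithFunctor(x.data(), x.size());
    ApplyWithLambda(y.data(), y.size());
    return x == y;
}
```

The lambda version needs no class, no data member and no constructor, which is the whole appeal.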

Starting with Intel TBB 2.2, another form of parallel_for is available too, which some consider a little easier to read:

parallel_for(first, last, step, function) with lambdas


#include "tbb/tbb.h"
using namespace tbb;

void ParallelApplyFoo(float a[], size_t n) {
  parallel_for(size_t(0), n, size_t(1), [=](size_t i) {Foo(a[i]);});
}

Final note - ridding yourself of compiler warnings.

The code above currently produces a warning with the Intel compiler. You can simply ignore it, or eliminate it using:

// The pragma turns off warnings from the compiler about "use of a local type to declare a function".
#pragma warning( disable: 588)


10 comments


Hi all,

Sorry to add to this old thread - maybe somebody still looks at it...
What about auto-vectorization with the various parallelization flavors? I just ran into a case where a TBB lambda parallel_for version wouldn't vectorize with Intel icc (assumed dependencies). Is this a general issue with TBB lambdas? Any recommendations? Should I avoid lambdas?

Thanks,
Matthias

James R.

My apologies. I was thinking about cilk_for, which allows negative stepping. TBB does not allow a negative step. The TBB reference manual says:

A parallel_for(first,last,step,f) represents parallel execution of the loop:
for( auto i=first; i<last; i+=step ) f(i);
The index type must be an integral type. The loop must not wrap around. The step value must be positive. If omitted, it is implicitly 1. There is no guarantee that the iterations run in parallel. Deadlock may occur if a lesser iteration waits for a greater iteration. The partitioning strategy is always auto_partitioner.

The solution in TBB is to reverse the initial and terminal values. The thinking is this: in a parallel for there is NO order... the purpose of parallel for is to say "do all these." So, when using TBB parallel_for - the order matters. A little inconvenient perhaps? We'd like to know. Let me know if reversing the logic will work for you, and if not, I'd like to know a bit more. Perhaps TBB should change? Of course, you could always try out cilk_for, which is really a subset of TBB put into the compiler where it can help optimize the parallelism you specify.
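One way to apply the "reverse it" suggestion to the loop for ( i = 10; i >= 0; i -= 2 ) is to iterate over an ascending counter k with a positive step and compute the descending value from it. The sketch below is only an illustration under that assumption: the helper names `descending_trip_count`, `descending_value` and `visit_descending` are made up here, and a plain serial loop stands in for parallel_for so the example does not depend on TBB.

```cpp
#include <cassert>
#include <vector>

// Number of iterations of: for (i = first; i >= last; i -= step).
// Assumes step > 0 and first >= last.
inline int descending_trip_count(int first, int last, int step) {
    assert(step > 0 && first >= last);
    return (first - last) / step + 1;
}

// Maps the ascending counter k = 0, 1, 2, ... to the value the
// descending loop variable would have had on that iteration.
inline int descending_value(int first, int step, int k) {
    return first - k * step;
}

// Serial stand-in that records every value visited. With TBB, the loop
// below would become something like:
//   tbb::parallel_for(0, n, 1, [=](int k) {
//       body(descending_value(first, step, k));
//   });
// where body is whatever work the original loop did.
std::vector<int> visit_descending(int first, int last, int step) {
    const int n = descending_trip_count(first, last, step);
    std::vector<int> visited(n);
    for (int k = 0; k < n; ++k)
        visited[k] = descending_value(first, step, k);
    return visited;
}
```

For the loop in question, visit_descending(10, 0, 2) covers 10, 8, 6, 4, 2, 0. Because the step passed to parallel_for is positive, the "Step must be positive" exception is avoided rather than caught.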

florin.d

How can I write that loop, or what libraries do I have to include? This is what I get at runtime:

terminate called after throwing an instance of 'std::invalid_argument'
what(): Step must be positive
Aborted

Is it sufficient to catch that exception?

James R.

Yes, negative steps are possible.

florin.d

Is it possible to have parallel_for with a negative step? For example, to parallelize this:

for ( i = 10; i >= 0; i -=2 )

sm345

The new parallel_for syntax (considered easier to read by some!) is in line with the Microsoft syntax for their parallel for. That's good... very good. But please don't get rid of the range concept in future releases. It's useful when I need to fine-tune the partitioning. I assume either block or auto is the default partitioner with the new syntax... and it's just syntactic sugar for the REAL classic parallel_for.

James R.

Here is an example of using lambdas that emphasizes that the capture is when the lambda is defined:

#include <cstdio>

template<typename F>
void Eval( const F& f ) { int i = 77; f(); }
void foo() {
  int i = 22;
  Eval( [=]{printf("Hello, Lambdas %d\n",i); } );
}
void bar() {
  int i = 99;
  auto f = [=]{printf("Hello, Lambdas %d\n",i); };
  f();
  i = 88; {
    int i = 66;
    f();
  }
  f();
}
void bar2() {
  int i = 99;
  auto f = [&]{printf("Hello, Lambdas %d\n",i); };
  f();
  i = 88; {
    int i = 66;
    f();
  }
  f();
}
void bar3() {
  int i = 99;
  auto f = [=]() mutable {printf("Hello, Lambdas %d\n",i); };
  f();
  i = 88; {
    int i = 66;
    f();
  }
  f();
}
int main() {
  foo();
  bar();
  bar2();
  bar3();
  return 0;
}

This prints values of 22, 99, 99, 99, 99, 88, 88, 99, 99, 99.

The first 22 is printed inside Eval(), ignoring the local value of 77 and sticking with the value of 22 captured in foo().

Next, in bar(), the capture of 99 is done and holds while i is 99, then 88, then a new local i is 66, and back to 88 - we see 99, 99, 99 from the lambda.

Next, in bar2(), we see the same code as bar() but with capture by reference - here changes to the variable i are tracked, but only for the precise i that was in scope when the lambda was created. So we see 99, 88, 88 - note that the i=66 has no effect because it is not the i referred to by the reference taken when the lambda was created.

Finally, bar3() shows that the mutable keyword has nothing to do with whether a capture by value tracks the variable captured. It only affects whether the compiler allows changes inside the lambda body to the captured value.

James R.
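To illustrate that last point about mutable, here is a minimal sketch, assuming a C++11 compiler. The function name `mutable_capture_demo` is made up for this illustration. Unlike bar3(), the lambda here actually modifies its captured copy, which is only allowed because of the mutable keyword.

```cpp
#include <utility>

// Returns {value produced by the second call to the lambda,
//          value of the outer i afterwards}.
std::pair<int, int> mutable_capture_demo() {
    int i = 99;
    // Capture by value: the lambda owns a private copy of i (99).
    // 'mutable' permits the body to modify that private copy.
    auto f = [=]() mutable { return ++i; };
    f();               // private copy becomes 100; outer i is still 99
    i = 88;            // changes the outer i only; the copy stays 100
    int second = f();  // private copy becomes 101
    return {second, i};
}
```

The lambda's copy and the outer variable evolve independently: the second call yields 101 while the outer i ends up as 88.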

No, lambda functions capture at the exact moment they are defined - in the context of the definition. That means the capture by reference grabs a pointer, and capture by value grabs the value at that instant.

anonymous

So the capture statement of C++ 0x lambda says that variables are captured from where the lambda function is called?
