parallel_for is easier with lambdas, Intel Threading Building Blocks

By James Reinders (Intel) (16 posts) on August 3, 2009 at 10:34 pm

Lambdas are an exciting new addition to C++ in the current draft of C++0x (see my prior post, "Hello Lambda", for my introduction to lambdas). The Intel compilers support them now, and Microsoft has support in their beta for Visual Studio 2010. I think we can expect to see support for lambdas added quickly, and a great deal of interest in using them in C++ code. Lambdas quite simply allow code to be specified inline, in ways that I find particularly useful for parallel programming constructs - notably for Intel Threading Building Blocks (TBB).
Intel TBB supports its algorithms in both the old style (without lambdas) and the new style (with lambdas).

To get an idea of why I expect lambdas to be very popular with Intel TBB, we can look at the "with" and "without" syntax.

The "old" syntax, before lambdas were available, involved coding your "work to be done in parallel" into an operator() within a class. It was the toughest thing to teach about using Intel TBB - and it is THE reason C programmers complained about "C++ syntax" being needed with Intel TBB.

parallel_for(range, body, optional partitioner) w/out lambdas
   #include "tbb/tbb.h"
   using namespace tbb;
   class ApplyFoo {
      float *const my_a;
   public:
      void operator()( const blocked_range<size_t>& r ) const {
         float *a = my_a;
         for( size_t i=r.begin(); i!=r.end(); ++i )
            Foo(a[i]);
      }
      ApplyFoo( float a[] ) :
         my_a(a)
      {}
   };
 
   void ParallelApplyFoo( float a[], size_t n ) {
      parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
   }

The same program, written with lambdas, is actually quite readable:

parallel_for(range, body, optional partitioner) with lambdas

   #include "tbb/tbb.h"
   using namespace tbb;
 
   void ParallelApplyFoo( float* a, size_t n ) {
      parallel_for( blocked_range<size_t>(0,n),
                    [=](const blocked_range<size_t>& r) {
                       for(size_t i=r.begin(); i!=r.end(); ++i)
                          Foo(a[i]);
                    }
      );
   }

Starting with Intel TBB 2.2, another form of parallel_for is allowed too - and some consider it a little easier to read:

parallel_for(first, last, step, function) with lambdas

   #include "tbb/tbb.h"
   using namespace tbb;

   void ParallelApplyFoo( float a[], size_t n ) {
      parallel_for( size_t(0), n, size_t(1), [=](size_t i) { Foo(a[i]); } );
   }
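
One way to picture how the compact index form relates to the range form is as a thin adapter: the per-index function is wrapped in a range body that loops over each subrange it is handed. The sketch below illustrates that idea in plain C++ with no TBB dependency. The names IndexRange, serial_for_range, and for_each_index are hypothetical and invented for this illustration, the driver runs serially, and none of this is meant to describe how TBB itself is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a half-open index range [first, last).
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical serial driver: cuts the range into chunks of at most
// `grain` indices and calls the range body once per chunk.
// A parallel library could hand these chunks to different threads.
template <typename Body>
void serial_for_range(IndexRange r, std::size_t grain, const Body& body) {
    assert(grain > 0);  // a zero grain would never advance
    for (std::size_t lo = r.begin(); lo < r.end(); lo += grain) {
        std::size_t hi = (r.end() - lo < grain) ? r.end() : lo + grain;
        body(IndexRange{lo, hi});
    }
}

// Index form layered on the range form: the per-index function f
// is wrapped in a lambda that loops over each chunk.
template <typename Func>
void for_each_index(std::size_t first, std::size_t last, const Func& f) {
    serial_for_range(IndexRange{first, last}, 4, [&f](const IndexRange& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            f(i);
    });
}

// Example use: doubles every element of a copy of v and returns it.
std::vector<float> doubled(std::vector<float> v) {
    float* a = v.data();
    for_each_index(0, v.size(), [=](std::size_t i) { a[i] *= 2.0f; });
    return v;
}
```

Calling doubled on a vector holding 1, 2, 3, 4, 5 walks two chunks (indices 0 to 3, then index 4) and returns a vector holding 2, 4, 6, 8, 10.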

Final note - ridding yourself of compiler warnings.

The code above currently produces a warning with the Intel compiler. You can just ignore it, or eliminate it using:

   //The pragma turns off warnings from the compiler about "use of a local type to declare a function".
   #pragma warning( disable: 588)

Categories: Open Source, Parallel Programming, Threading Building Blocks

Comments (4)

August 4, 2009 7:47 AM PDT


Rafael
So the capture statement of C++ 0x lambda says that variables are captured from where the lambda function is called?
August 5, 2009 9:19 AM PDT

James Reinders (Intel)
No, lambda functions capture at the exact moment they are defined - in the context of the definition.

That means the capture by reference grabs a pointer, and capture by value grabs the value at that instant.
August 5, 2009 9:26 AM PDT

James Reinders (Intel)
Here is an example of using lambdas that emphasizes that the capture is when the lambda is defined:

#include <cstdio>

template<typename F>
void Eval( const F& f ) {
   int i = 77;   // deliberately unused: the lambda does not see this i
   f();
}

void foo() {
   int i = 22;
   Eval( [=]{ printf("Hello, Lambdas %d\n",i); } );
}

void bar() {
   int i = 99;
   auto f = [=]{ printf("Hello, Lambdas %d\n",i); };   // capture by value
   f();
   i = 88;
   {
      int i = 66;   // a different i; not the one captured
      f();
   }
   f();
}

void bar2() {
   int i = 99;
   auto f = [&]{ printf("Hello, Lambdas %d\n",i); };   // capture by reference
   f();
   i = 88;
   {
      int i = 66;   // a different i; not the one referenced
      f();
   }
   f();
}

void bar3() {
   int i = 99;
   auto f = [=]() mutable { printf("Hello, Lambdas %d\n",i); };   // by value, mutable
   f();
   i = 88;
   {
      int i = 66;
      f();
   }
   f();
}

int main()
{
   foo();
   bar();
   bar2();
   bar3();

   return 0;
}


This prints values of 22, 99, 99, 99, 99, 88, 88, 99, 99, 99.
The first 22 is printed inside Eval(), ignoring the local value of 77 and sticking with the value of 22 captured in foo().
Next, in bar(), the value 99 is captured and holds while the outer i is 99, then 88, then hidden by a new local i of 66, and then visible again as 88 - we see 99, 99, 99 from the lambda.
Next, in bar2(), we see the same code as bar() but with capture by reference - here changes to the variable i are tracked, but only for the precise i that was in scope when the lambda was defined. So we see 99, 88, 88 - note that the i=66 has no effect because it is not the i pointed to by the reference when the lambda is created.
Finally, bar3() shows that the mutable keyword has nothing to do with whether a capture by value tracks the variable captured: it prints 99, 99, 99, just like bar(). It only affects whether the compiler allows changes inside the lambda body to the captured value.
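
To make the role of mutable a little more concrete, here is a minimal sketch in plain C++11 in which the lambda body actually modifies its captured copy. The names CaptureResult and mutable_capture_demo are hypothetical, invented only for this illustration.

```cpp
#include <cassert>

// Hypothetical holder for the three values observed in the demo.
struct CaptureResult {
    int first_call;
    int second_call;
    int original;
};

CaptureResult mutable_capture_demo() {
    int i = 99;
    // Capture by value: the closure gets its own copy of i (99).
    // mutable lets the body modify that copy; without it, ++i
    // would be rejected by the compiler.
    auto next = [=]() mutable { return ++i; };

    CaptureResult r;
    r.first_call = next();   // the closure's copy becomes 100
    i = 88;                  // changes the original only
    r.second_call = next();  // the closure's copy becomes 101
    r.original = i;          // the original is still 88
    return r;
}
```

The closure's copy and the original evolve independently: the two calls return 100 and 101, while the original i ends at 88.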
September 7, 2009 8:08 PM PDT

sm345
The new parallel_for syntax (considered easier to read by some!) is in line with the Microsoft syntax for their parallel for. That's good... very good. But please don't get rid of the range concept in future releases. It's useful when I need to fine-tune the partitioning. I assume either block or auto is the default partitioner with the new syntax... and it's just syntactic sugar for the REAL classic parallel_for.
