parallel_for is easier with lambdas, Intel Threading Building Blocks
By James Reinders (Intel) (46 posts) on August 3, 2009 at 10:34 pm
To see why I expect lambdas to be very popular with Intel TBB, we can compare the syntax "with" and "without" them.
The "old" syntax, before lambdas were available, required coding your "work to be done in parallel" into an operator() within a class. It was the toughest thing to teach about using Intel TBB, and it is THE reason C programmers complained about "C++ syntax" being needed with Intel TBB.
parallel_for(range, body, optional partitioner) w/out lambdas
#include "tbb/tbb.h"
using namespace tbb;

class ApplyFoo {
    float *const my_a;
public:
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
            Foo(a[i]);
    }
    ApplyFoo( float a[] ) :
        my_a(a)
    {}
};

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}
Writing the same program with lambdas is actually quite readable:
parallel_for(range, body, optional partitioner) with lambdas
#include "tbb/tbb.h"
using namespace tbb;

void ParallelApplyFoo( float* a, size_t n ) {
    parallel_for( blocked_range<size_t>(0,n),
        [=](const blocked_range<size_t>& r) {
            for(size_t i=r.begin(); i!=r.end(); ++i)
                Foo(a[i]);
        }
    );
}
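One way to see why the class-based and lambda-based versions are interchangeable is that a lambda is shorthand for a compiler-generated class with an operator(). The sketch below uses only the standard library so it runs without TBB. IndexRange, ScaleBody, run_in_two_chunks, scale_with_class and scale_with_lambda are hypothetical names invented for this illustration: IndexRange stands in for blocked_range<size_t>, and run_in_two_chunks is a serial stand-in for parallel_for(range, body), not TBB's actual scheduling.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for tbb::blocked_range<size_t>: a half-open index range.
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hand-written body class, same shape as ApplyFoo in the post.
class ScaleBody {
    float* const my_a;
public:
    explicit ScaleBody(float* a) : my_a(a) {}
    void operator()(const IndexRange& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            my_a[i] *= 2.0f;
    }
};

// Serial stand-in for parallel_for(range, body): splits the range into two
// halves and calls the body once per half, one after the other.
template <typename Body>
void run_in_two_chunks(const IndexRange& r, const Body& body) {
    assert(r.first <= r.last);
    const std::size_t mid = r.first + (r.last - r.first) / 2;
    body(IndexRange{r.first, mid});
    body(IndexRange{mid, r.last});
}

// Version 1: pass an instance of the hand-written class.
void scale_with_class(float* a, std::size_t n) {
    run_in_two_chunks(IndexRange{std::size_t(0), n}, ScaleBody(a));
}

// Version 2: pass a lambda; [=] copies the pointer a, just as the
// ScaleBody constructor stores it in my_a.
void scale_with_lambda(float* a, std::size_t n) {
    run_in_two_chunks(IndexRange{std::size_t(0), n}, [=](const IndexRange& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            a[i] *= 2.0f;
    });
}
```

Both functions leave the array in the same state. The only difference is who writes the class: you, or the compiler.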
Starting with Intel TBB 2.2, another form of parallel_for is also available, which some consider a little easier to read:
parallel_for(first, last, step, function) with lambdas
#include "tbb/tbb.h"
using namespace tbb;

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for( size_t(0), n, size_t(1), [=](size_t i) { Foo(a[i]); } );
}
Final note - ridding yourself of compiler warnings.
The code above currently produces a warning with the Intel compiler. You can simply ignore it, or eliminate it with:
// The pragma turns off warnings from the compiler about "use of a local type to declare a function".
#pragma warning( disable: 588)
Categories: Open Source, Parallel Programming, Software Tools
Tags: C++, C++ 0x, lambda expressions, lambdas, TBB, Threading Building Blocks
Comments (9)
| August 5, 2009 9:19 AM PDT
James Reinders (Intel)
|
No, lambda functions capture at the exact moment they are defined - in the context of the definition. That means the capture by reference grabs a pointer, and capture by value grabs the value at that instant. |
| August 5, 2009 9:26 AM PDT
James Reinders (Intel)
|
Here is an example of using lambdas that emphasizes that the capture happens when the lambda is defined:

    #include <cstdio>

    template<typename F>
    void Eval( const F& f ) {
        int i = 77;   // not the i that f captured
        f();
    }

    void foo() {
        int i = 22;
        Eval( [=]{ printf("Hello, Lambdas %d\n", i); } );
    }

    void bar() {
        int i = 99;
        auto f = [=]{ printf("Hello, Lambdas %d\n", i); };   // capture by value
        f();
        i = 88;
        { int i = 66; f(); }
        f();
    }

    void bar2() {
        int i = 99;
        auto f = [&]{ printf("Hello, Lambdas %d\n", i); };   // capture by reference
        f();
        i = 88;
        { int i = 66; f(); }
        f();
    }

    void bar3() {
        int i = 99;
        auto f = [=]() mutable { printf("Hello, Lambdas %d\n", i); };
        f();
        i = 88;
        { int i = 66; f(); }
        f();
    }

    int main() {
        foo(); bar(); bar2(); bar3();
        return 0;
    }

This prints the values 22, 99, 99, 99, 99, 88, 88, 99, 99, 99. The first 22 is printed inside Eval(), ignoring the local value of 77 and sticking with the value of 22 captured in foo(). Next, in bar(), 99 is captured by value and holds while i is 99, then 88, then while a new local i is 66, and back to 88: we see 99, 99, 99 from the lambda. Next, bar2() is the same code as bar() but captures by reference. Here changes to the variable i are tracked, but only for the precise i that was in scope when the lambda was created. So we see 99, 88, 88. Note that i = 66 has no effect because it is not the i the reference pointed to when the lambda was created. Finally, bar3() shows that the mutable keyword has nothing to do with whether a capture by value tracks the variable captured. It only affects whether the compiler allows changes inside the lambda body to the captured value. |
| September 7, 2009 8:08 PM PDT
sm345
|
The new parallel_for syntax (considered easier to read by some!) is in line with the Microsoft syntax for their parallel for. That's good... very good. But please don't get rid of the range concept in future releases. It's useful when I need to fine-tune the partitioning. I assume either block or auto is the default partitioner with the new syntax... and it's just syntactic sugar for the REAL classic parallel_for. |
| November 4, 2011 12:32 PM PDT
florin.d
| Is it possible to have parallel_for with a negative step? For example, to parallelize this: for ( i = 10; i >= 0; i -= 2 ) |
| November 4, 2011 1:49 PM PDT
James Reinders (Intel)
| Yes, negative steps are possible. |
| November 6, 2011 7:04 AM PST
florin.d
|
How can I write that loop, or what libraries do I have to include? This is what I get at runtime:

    terminate called after throwing an instance of 'std::invalid_argument'
    what(): Step must be positive
    Aborted

Is it sufficient to catch that exception? |
| November 7, 2011 3:43 PM PST
James Reinders (Intel)
|
My apologies. I was thinking about cilk_for, which allows negative stepping. TBB does not allow a negative step. The TBB reference manual says:

    A parallel_for(first,last,step,f) represents parallel execution of the loop:
        for( auto i=first; i<last; i+=step ) f(i);
    The index type must be an integral type. The loop must not wrap around. The step value must be positive. If omitted, it is implicitly 1. There is no guarantee that the iterations run in parallel. Deadlock may occur if a lesser iteration waits for a greater iteration. The partitioning strategy is always auto_partitioner.

The solution in TBB is to reverse the initial and terminal value. The thinking is this: in a parallel for there is NO order... the purpose of parallel for is to say "do all these." So, when using TBB parallel_for, the order matters. A little inconvenient perhaps? We'd like to know. Let me know if reversing the logic will work for you, and if not, I'd like to know a bit more. Perhaps TBB should change? Of course, you could always try out cilk_for, which is really a subset of TBB put into the compiler where it can help optimize the parallelism you specify. |
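A minimal sketch of one way to apply the workaround above, assuming the loop body does not depend on iteration order. Reversing the bounds works directly when the descending loop lands exactly on its lower bound (10, 8, ..., 0 is the same set as 0, 2, ..., 10). When it does not (for example 9, 7, ..., 1), a more general option is to run an ascending counter k and map it back to the descending index. The helpers descending_count, descending_index and for_each_descending are hypothetical names for this illustration, not part of TBB, and the loop shown is a serial stand-in so the block runs without the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of iterations executed by
//   for (long i = start; i >= stop; i -= step)
inline std::size_t descending_count(long start, long stop, long step) {
    assert(step > 0);  // the step magnitude must be positive, as TBB also requires
    return start < stop ? 0 : static_cast<std::size_t>((start - stop) / step) + 1;
}

// Hypothetical helper: the k-th value visited by that descending loop.
inline long descending_index(long start, long step, std::size_t k) {
    return start - static_cast<long>(k) * step;
}

// Serial stand-in for the parallel call. With TBB, the loop below would be
// replaced by something along the lines of:
//   tbb::parallel_for(std::size_t(0), n,
//       [=](std::size_t k) { f(descending_index(start, step, k)); });
template <typename F>
void for_each_descending(long start, long stop, long step, F f) {
    const std::size_t n = descending_count(start, stop, step);
    for (std::size_t k = 0; k < n; ++k)
        f(descending_index(start, step, k));
}
```

For the loop in the question, for_each_descending(10, 0, 2, f) calls f with 10, 8, 6, 4, 2, 0. In a parallel version the same set of indices is visited, but in no guaranteed order.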
| January 23, 2012 1:59 AM PST
gutha.raghugmail.com
| Give me an example of parallel_reduce using lambda expressions |
Trackbacks (5)
- Intel Software Network Blogs » Version 2.2, Intel Threading Building Blocks, worth a look (August 4, 2009 9:13 AM PDT)
- Intel Software Network Blogs » parallel_for is easier with lambdas … | Technology News Update (August 4, 2009 9:56 AM PDT)
- TBB 3.0: New (today) Version of Intel Threading Building Blocks – Intel Software Network Blogs (May 4, 2010 8:23 AM PDT)
- TBB 3.0: New (today) Version of Intel Threading Building Blocks – Intel Software Network Blogs (May 4, 2010 8:31 AM PDT)
- Version 2.2, Intel Threading Building Blocks, worth a look – Intel Software Network Blogs (October 28, 2010 11:24 AM PDT)



