Generic Parallel Algorithms for Intel® TBB - "They're Already in There" Part 2

Now that we've talked about some of the basic Generic Parallel Algorithms, let's go into some more in-depth ones. These *really* abstract away the low-level, intricate threading work for the underlying architecture.

parallel_invoke
What if you have a bunch of cores available and a bunch of smaller functions that individually don't warrant distribution across cores? The parallel_invoke function is a great way to evaluate all of those functions, potentially at the same time, across all available cores.

The function parallel_invoke can take anywhere from 2 to 10 function arguments and, when possible, will evaluate them all in parallel. You pass in either a function object directly or a pointer to a function. Because all of these functions are invoked and executed in parallel, any return values are ignored. So for the purposes of debugging your work, I'd recommend having them return void, just to rule out surprises from a return value you expected to be used.

In this example we will evaluate f(), g(), and h() in parallel. It's interesting to note that g and h are function objects that carry their own local state (the constructor argument) into parallel_invoke.

 #include "tbb/parallel_invoke.h"
 using namespace tbb;

 void f();
 extern void bar(int);

 class MyFunctor {
     int arg;
 public:
     MyFunctor(int a) : arg(a) {}
     void operator()() const { bar(arg); }
 };

 void RunFunctionsInParallel() {
     MyFunctor g(2);
     MyFunctor h(3);
     tbb::parallel_invoke(f, g, h);
 }


Remember the C++0x lambda expression way of writing an Intel® TBB parallel_for? You can use lambdas here too, to generate your function objects on the spot and do away with the clunky functor syntax. Just keep in mind that you need a newer compiler that supports the new C++ standard. It doesn't hurt to be using the newest version of Intel® TBB, either.

 #include "tbb/parallel_invoke.h"
 using namespace tbb;

 void f();
 extern void bar(int);

 void RunFunctionsInParallel() {
     tbb::parallel_invoke(f, []{ bar(2); }, []{ bar(3); });
 }


parallel_pipeline
First things first: the Intel® TBB tutorial has a good step-by-step walkthrough of using this one in real-world code. I'm only going to give a short description and a simple example here.

Basically, the C++0x Lambda Expressions enable you to express, build, and run your pipeline much more easily than in the past.

This example isn't practical (too much overhead per item), but it shows the proper syntax better than my summarizing all the details and restrictions would. You can look those up on your own time. Here we compute the root-mean-square of a sequence defined by [first,last). Operator & requires that the output type of its first filter_t argument match the input type of its second filter_t argument.

 #include "tbb/pipeline.h"
 #include <math.h>
 using namespace tbb;

 float RootMeanSquare( float* first, float* last ) {
     float sum = 0;
     parallel_pipeline( /*max_number_of_live_token=*/16,
         make_filter<void,float*>(
             filter::serial,
             [&](flow_control& fc)-> float* {
                 if( first < last ) {
                     return first++;
                 } else {
                     fc.stop();
                     return NULL;
                 }
             }
         ) &
         make_filter<float*,float>(
             filter::parallel,
             [](float* p){ return (*p)*(*p); }
         ) &
         make_filter<float,void>(
             filter::serial,
             [&](float x){ sum += x; }
         )
     );
     return sqrt(sum);
 }


parallel_sort
This one is pretty simple. It performs an unstable sort of sequence [begin1, end1). Unstable sorts may not preserve relative ordering of elements with equal keys. It's a deterministic sort - sorting the same sequence will produce the same result each time. It's a comparison sort with an average time complexity of O(N log (N)), where N is the number of elements in the sequence. When worker threads are available, parallel_sort creates subtasks that may be executed concurrently, improving your performance.

The parallel_sort is well covered in the Intel® TBB tutorial and also a hands-on lab. No need to go into specifics here.

parallel_scan
This one computes what is called a parallel prefix. That's a pretty advanced idea in parallel programming, one that I feel only top programmers can code up effectively using raw threads. The explanation in the reference manual has some great diagrams, a fair amount of math, and a description of how the library decides what to do based on how you call parallel_scan. I can't cover all of that in a simple manner here. But if you need something like a running sum, where the serial dependencies are very difficult to handle with hand-coded "what if" scenarios, I highly recommend spending an hour or two reading through that documentation and making sure you understand it before just calling parallel_scan and seeing what happens.