Understanding the working of parallel_do

I've been through the reference PDF, the parallel_do example provided with TBB, and a parallel_do example on the internet.
I thought I understood the concept, but my own program doesn't compile.

As I understand it, if we have a list of items (which can even be PODs) and the list can keep growing, then we pass the begin and end iterators of that list to parallel_do, and a parallel_do body class processes those items.

So I tried this:

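#include <iostream>
#include <vector>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_do.h"

using namespace tbb;
using namespace std;

class Item
{
   int i;
public:
   Item() : i(0) {}
   Item(int ii):i(ii) {}
   Item(const Item& old) {i = old.i;}
   void doSomething() {cout<<"Whoopee I'm doing something!"<<endl;}
};

// Body class as implied by the compiler error below:
// operator() takes Item& and is not const.
class Doer
{
public:
   void operator()(Item& item) { item.doSomething(); }
};

int main()
{
   vector<Item> v;
   v.push_back(Item());
   v.push_back(Item());
   
   parallel_do(v.begin(), v.end(), Doer());   
}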

And got this error while compiling.

bash$  g++ -ltbb -o parallelDo parallelDo.cpp;
In file included from parallelDo.cpp:11:
/opt/intel/tbb/tbb30_20100406oss/include/tbb/parallel_do.h: In function void tbb::parallel_do(Iterator, Iterator, const Body&) [with Iterator = __gnu_cxx::__normal_iterator<Item*, std::vector<Item, std::allocator<Item> > >, Body = Doer]:
parallelDo.cpp:43:   instantiated from here
/opt/intel/tbb/tbb30_20100406oss/include/tbb/parallel_do.h:485: error: no matching function for call to select_parallel_do(__gnu_cxx::__normal_iterator<Item*, std::vector<Item, std::allocator<Item> > >&, __gnu_cxx::__normal_iterator<Item*, std::vector<Item, std::allocator<Item> > >&, const Doer&, void (Doer::*)(Item&), tbb::task_group_context&)

The reference says we have to supply random access iterators to parallel_do. Vector has random access iterators. So why the error?

Okay, I got it. I hadn't declared operator() as const.

void operator()(Item& item) const
{
 ...
}

Works now :)
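
For anyone wondering why const matters here: the error message shows that parallel_do takes its body as const Body&, and only const member functions can be called through a const reference. Below is a minimal sketch of that rule using only the standard library. The names apply_all, Doubler and doubled_sum are hypothetical stand-ins for illustration, not TBB names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in that mirrors the shape of parallel_do's signature:
// the body arrives as a const reference.
template <typename Iterator, typename Body>
void apply_all(Iterator first, Iterator last, const Body& body) {
    for (; first != last; ++first)
        body(*first);  // compiles only if Body::operator() is const
}

struct Item {
    int value;
};

// Hypothetical body class; operator() is const, so apply_all accepts it.
struct Doubler {
    void operator()(Item& item) const { item.value *= 2; }
};

// Removing "const" from Doubler::operator() makes the call below fail to
// compile, for the same reason as the error quoted in the question.
inline int doubled_sum(std::vector<Item>& items) {
    apply_all(items.begin(), items.end(), Doubler());
    int sum = 0;
    for (std::size_t i = 0; i < items.size(); ++i)
        sum += items[i].value;
    return sum;
}
```

The body may still modify the items it is handed; const only says the body object itself is not changed by the call.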

Hello. I have written something like your code (using parallel_do), and I have a problem with the order in which the results are written. Instead of getting, e.g.,

A
B
C
D

I get

C
A
B
D

(every letter is one cout << ... statement).

In my operator() I have 2 for loops, and each of them has a nested one. Something like this:

void InputParallel::operator()(...) {
   for(....){
      for(...)
         if ... else ...
   }
   for(....){
      for(...)
         if ... else ...
   }
}

I am not good at parallel_do and at using threads in that way (I've used pipeline a lot), so can you tell me how to organize my threads and make them run properly? Do you have any idea?

I'm not using tasks explicitly, just parallel_do to reduce my execution time, because the loops above take a lot of time when I read large input files (10000 x 10000). Maybe I should use hash_map, but I don't know if that would speed up my program, because maps are often slow structures (as far as I know :)).

One more question: is it possible to optimize each of these two loops by using parallel_for inside parallel_do? Or is it smarter to put each of these nested loops in its own parallel_do and get 2 parallel_do optimizations?

Thanks for any kind of help!

There is no order guarantee with parallel_do().
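
If only the order of the final output matters, one common workaround is to let each item write its result into its own pre-allocated slot and to print the slots serially afterwards. Here is a minimal sketch of that idea. It uses plain std::thread instead of TBB (an assumption, just to keep it self-contained), and process and process_all are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-item work; stands in for whatever the body computes.
inline std::string process(const std::string& input) {
    return input + "!";
}

// Items may finish in any order, but each result goes into the slot that
// matches its input index, so the returned vector is in input order.
inline std::vector<std::string> process_all(const std::vector<std::string>& inputs) {
    std::vector<std::string> results(inputs.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        // Each thread touches only results[i], so no lock is needed.
        workers.emplace_back([&inputs, &results, i] {
            results[i] = process(inputs[i]);
        });
    }
    for (std::thread& w : workers)
        w.join();
    return results;
}
```

One thread per item is only for illustration. The same slot-per-index idea should carry over to a parallel_do body that knows the index of the item it is working on.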

Nested parallelism may help create "parallel slack" (more distribution opportunities), and 10000x10000 seems big enough that it does not mainly create more "parallel overhead" ("opportunity cost"?). I don't see how you would get 2 parallel_do loops without causing a lot of data movement (not good).

Thanks for your answer, Raf. So does that mean I can't make my program work correctly? The results are okay, but the order is wrong... Can I work with threads explicitly to control when the program accesses variables? I'm stuck now... and I've spent a lot of time learning how to handle parallel_do, parallel_for, etc. Thanks once again :)

If I understand correctly, you've used pipeline before, so why not do the heavy processing in an intermediate parallel stage (possibly with nested parallel algorithms) and use the existing support for ordered serial stages to do the output in the desired order?
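
To illustrate what an ordered output stage has to do conceptually: results tagged with a sequence number can arrive in any order, and the stage holds them back until everything before them has been emitted. The sketch below is only a conceptual model under that assumption, not how TBB implements its in-order filters, and OrderedEmitter is a hypothetical name.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical reorder buffer: accepts results tagged with the sequence
// number assigned at input time and releases them in sequence order.
class OrderedEmitter {
public:
    OrderedEmitter() : next_(0) {}

    // Called when a result is ready; results may arrive in any order.
    void push(std::size_t seq, const std::string& value) {
        pending_[seq] = value;
        // Release every result that is now contiguous with the output so far.
        std::map<std::size_t, std::string>::iterator it = pending_.find(next_);
        while (it != pending_.end()) {
            emitted_.push_back(it->second);
            pending_.erase(it);
            ++next_;
            it = pending_.find(next_);
        }
    }

    const std::vector<std::string>& emitted() const { return emitted_; }

private:
    std::size_t next_;                            // next sequence number to emit
    std::map<std::size_t, std::string> pending_;  // results waiting for their turn
    std::vector<std::string> emitted_;            // output, in input order
};
```

The class is not thread-safe as written; calls to push would have to be serialized, which is what a serial stage provides.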

Yes, I use a pipeline with 3 stages, and in the intermediate stage I call parallel_do, which significantly speeds up my program... But pipeline and parallel_do are not scalable, or am I wrong? In the Advisor tests the result is POOR, but the code is fast.

Which are the nested parallel algorithms? parallel_for? I've tried it and it really slows my code down...

You mean serial_in_order?

If I don't use parallel_do, I get good output. So I'm sure that parallel_do is the troublemaker.

"But pipeline and parallel_do are not scalable, or am I wrong?"
There is obviously a lot more communication than with parallel_for, but you also cannot manufacture scalability out of thin air: a less scalable problem (need to produce results in order) implies a less scalable algorithm. Still, if most of the work is in a parallel stage, you should see decent scalability.

"Which are the nested parallel algorithms? parallel_for? I've tried it and it really slows my code down..."
You can nest a parallel algorithm inside a task that is itself part of a parallel algorithm. With a nested parallel_for you have to make sure the grainsize is big enough and avoid calling range.end() in each iteration of an inner loop.
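
A small sketch of those two points, using only the standard library. Range and make_chunks are hypothetical stand-ins: real TBB ranges are split recursively, with grainsize as a splitting threshold, whereas this sketch uses simple fixed-size chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a blocked range: a half-open interval [begin, end).
class Range {
public:
    Range(std::size_t b, std::size_t e) : begin_(b), end_(e) {}
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
private:
    std::size_t begin_, end_;
};

// Inner-loop body: read end() once into a local instead of calling it in
// the loop condition on every iteration.
inline long sum_range(const std::vector<int>& data, const Range& r) {
    long sum = 0;
    const std::size_t last = r.end();  // hoisted out of the loop
    for (std::size_t i = r.begin(); i != last; ++i)
        sum += data[i];
    return sum;
}

// Splits [0, n) into chunks of at most grainsize elements; a larger
// grainsize means fewer, bigger chunks and less per-chunk overhead.
inline std::vector<Range> make_chunks(std::size_t n, std::size_t grainsize) {
    assert(grainsize > 0);
    std::vector<Range> chunks;
    for (std::size_t b = 0; b < n; b += grainsize)
        chunks.push_back(Range(b, b + grainsize < n ? b + grainsize : n));
    return chunks;
}
```

If a nested loop runs slower than the serial version, a grainsize that is too small for the amount of work per element is one of the first things to check.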

"You mean serial_in_order?"
For the first and third stages, yes, if it suits the problem.

"If I don't use parallel_do, I get good output. So I'm sure that parallel_do is the troublemaker."
Scalability is limited unless most of parallel_do's work is generated through a feeder.

I hope these comments are useful to you even though they were written without good knowledge of the actual situation. Perhaps others have something to add?

Thanks a lot, Raf, for your time and willingness to answer. parallel_do is not covered on the internet as well as parallel_for (for example), maybe because it is a new construct in the TBB library. I tried making a global map and then writing to the map inside parallel_do, but I got even worse results. Everything is messed up, not just the cout << ... output. I know that the correct order will produce a less scalable and slower program, too. Hm, it's not as simple as I thought... :)
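
One possible cause of the messed-up map (a guess, since the code isn't shown) is that several threads write to it at the same time; std::map is not safe for concurrent writes. A minimal sketch of guarding the map with a mutex and reading it back in key order follows. It uses std::thread rather than TBB to stay self-contained, and ResultStore and fill_concurrently are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical shared result store; every insert takes the mutex because
// std::map must not be written from several threads at once.
class ResultStore {
public:
    void put(std::size_t index, const std::string& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        results_[index] = value;
    }
    // Call only after all writers have finished.
    std::vector<std::string> in_order() const {
        std::vector<std::string> out;
        for (const auto& entry : results_)  // std::map iterates in key order
            out.push_back(entry.second);
        return out;
    }
private:
    std::mutex mutex_;
    std::map<std::size_t, std::string> results_;
};

// Writes one entry per worker thread, in whatever order the threads run.
inline std::vector<std::string> fill_concurrently(const std::vector<std::string>& names) {
    ResultStore store;
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < names.size(); ++i)
        workers.emplace_back([&store, &names, i] { store.put(i, names[i]); });
    for (std::thread& w : workers)
        w.join();
    return store.in_order();
}
```

Keying the map by the input index is what restores the order; the lock only prevents the data race.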

Finally, I figured out what's happening, and I really don't know the reason for such "crazy" behavior. This is (part of) my middle stage in the pipeline:

inBuffer = *static_cast(item);
parallel_do( inBuffer.otherSeq.begin(), inBuffer.otherSeq.end(), InputParallel(inBuffer.seqRef,inBuffer.minLength)); 

// calling parallel_do, everything OK
// inBuffer.otherSeq.begin() is the first sequence, and the one before inBuffer.otherSeq.end() is the second sequence
// Note that I have just 2 sequences

When parallel_do is called, operator() is called:

void InputParallel::operator()( pair sequencesIter) const {
       // output sequence name
      cout << sequencesIter.first << "\n";

// Instead of the first sequence (because of begin()), here I see the 2nd sequence. Even in the argument list there is the wrong sequence.

Somehow parallel_do swaps these two sequences, and that's the reason why I get the results in the wrong order. HOW? I doubt that thread behavior is the problem, and I really tried a lot of things (I added a global map, stored the results in it and then sorted it, but that is even worse).

Would anybody be able to confirm the hunch that you can get any two out of the triplet (ordered result, low latency, scalability)? And where does pipeline (serial_in_order, parallel, serial_in_order) live in that range? Should more be done to be able to tune it (in general or in a specific limited configuration), or is this problem only imaginary? Just curious...

I would really stop worrying about the order of parallel_do(). At best the statistical outcome could move towards more order (short of perfect order), but then you would have to demonstrate that this is a useful-enough outcome for the TBB team to spend any development effort on it. At present, order (as well as fairness) is readily (and in my opinion rightly) sacrificed for performance (unless otherwise indicated).

I probably should give up, because parallel_do is not a good structure for my problem. It gives a significant speed-up, but it ruins my result... But I'm curious: in which case is parallel_do the appropriate structure? What do you think, for a nested for loop, which is the best structure and algorithm to speed up execution? I have a lot of iterations... I have worked on a project (an implementation of an artificial neural network) where pipeline was enough to get almost a 3-3.5x speed-up. But here the speed-up is about 20%, so I must use some extra algorithms to speed up the hotspots.

Thanks a lot once again. I have really learned a lot about this structure, but obviously I must move on and find a more suitable algorithm for my specific case.

"I probably should give up, because parallel_do is not a good structure for my problem. It gives a significant speed-up, but it ruins my result..."
Are you certain that you aren't, e.g., passing values through that map instead of pointers?

"But I'm curious: in which case is parallel_do the appropriate structure?"
The canonical application doesn't care about order (otherwise use pipeline) and internally generates more work into the feeder (for better scalability).
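
To make the feeder idea concrete: in TBB the body can take a second parameter, a parallel_do_feeder<Item>&, and call add() on it to submit more items while the loop is running. The sketch below is a serial, standard-library-only model of that pattern. Feeder, SplitBody and run_until_empty are hypothetical stand-ins, and real parallel_do would process the items concurrently and in no particular order.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a feeder: lets a body add more work items
// while the loop is running.
template <typename T>
class Feeder {
public:
    explicit Feeder(std::deque<T>& queue) : queue_(queue) {}
    void add(const T& item) { queue_.push_back(item); }
private:
    std::deque<T>& queue_;
};

// Hypothetical body: processing n generates two smaller items, like
// visiting a tree node and feeding its children.
struct SplitBody {
    void operator()(int n, Feeder<int>& feeder) const {
        if (n > 1) {
            feeder.add(n / 2);
            feeder.add(n - n / 2);
        }
    }
};

// Serial driver: runs until the work list, which grows during processing,
// is empty. Returns how many items were processed.
inline std::size_t run_until_empty(int seed) {
    std::deque<int> queue(1, seed);
    Feeder<int> feeder(queue);
    SplitBody body;
    std::size_t processed = 0;
    while (!queue.empty()) {
        int item = queue.front();
        queue.pop_front();
        body(item, feeder);
        ++processed;
    }
    return processed;
}
```

Starting from a single item, the work list here grows to several items on its own, which is the situation where a feeder gives the scheduler more to distribute.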

"Are you certain that you aren't, e.g., passing values through that map instead of pointers?"
I mustn't pass values? What do you mean?

Sorry, it seems I mixed up two things you wrote (apprehension on performance and problems with usability), fabricating problems with performance, leading me to this wild guess. :-)
