Barrier Synchronization implementation

I need to implement a template function of this type. Functions 1 to n have to be executed in parallel on multiple cores. Any suggestions on the most efficient way to implement it?

For (i = 1 to 100)
{
    Execute Function1;
    Execute Function2;
    Execute Function3;
    Execute Function4;
    ...
    Execute Function n-3;
    Execute Function n-2;
    Execute Function n-1;
    Execute Function n;

    // Barrier synchronization
    Wait till all n functions are completed
}

I did see parallel_invoke, but it has a limit on the number of functions (10) that can be evaluated in parallel (http://www.ddj.com/go-parallel/article/showArticle.jhtml?articleID=21640...).


Quoting - knmaheshy2k

I need to implement a template function of this type. Functions 1 to n have to be executed in parallel on multiple cores. Any suggestions on the most efficient way to implement it?

For (i = 1 to 100)
{
    Execute Function1;
    ...
    Execute Function n;

    // Barrier synchronization
    Wait till all n functions are completed
}

I did see parallel_invoke.

How important is it that these functions all execute in parallel? There may be a few machines out there that can supply a hundred HW threads to run so many functions simultaneously, but not many. More likely is that you will see a serial cascade of tasks as threads complete previous function calls and go back to the scheduler for more work. Likely Function1 will complete long before Function100 is spawned.

That said, the simplest way to do this would be with a task_list that spawns all n tasks as a block. They'll get dispatched as idle HW threads use the scheduler to peel off tasks. parallel_invoke tries to avoid this serial launch semantic by having a parallel invocation hierarchy, like a phone tree, that divides the work of spawning among several tasks.


Thanks for replying, Robert.

1. How important is it that these functions all execute in parallel?



A: I need to extract parallelism by executing these functions in parallel.

2.There may be a few machines out there that can supply a hundred HW threads to run so many functions simultaneously, but not many.

A: I get the hint. I know which machines you actually mean. It's one of those green things, isn't it? Let's say I'm working on something similar, but in blue. :)

3.More likely is that you will see a serial cascade of tasks as threads complete previous function calls and go back to the scheduler for more work. Likely Function1 will complete long before Function100 is spawned.


A: I'm fine with that. I don't have much of a problem with that part.

4.That said, the simplest way to do this would be with a task_list that spawns all 101 tasks as a block. They'll get dispatched as idle HW threads use the scheduler to peel off tasks.


A: Can you explain a bit more, please?

Thanks in advance!

Quoting - knmaheshy2k

Thanks for replying, Robert.

4.That said, the simplest way to do this would be with a task_list that spawns all 101 tasks as a block. They'll get dispatched as idle HW threads use the scheduler to peel off tasks.


A: Can you explain a bit more, please?

Well, I don't have time today to fully verify this, so this out-of-my-head implementation example is guaranteed not to compile and is probably missing some filigree here or there, but this will convey the basic idea:

tbb::task_list list;
for (int i = 0; i < n ; ++i) {
list.push_back( *new( tbb::task::allocate_child() ) OrderedFunctionTask(i) );
}
set_ref_count(n + 1);
spawn_and_wait_for_all(list);

Where OrderedFunctionTask is a class derived from Task whose function object selects the appropriate function, perhaps to initialize a local function pointer within the object. That should at least get you started.
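For concreteness, here is a minimal sketch of what OrderedFunctionTask and the surrounding parent task might look like against the TBB 2.x task API; the functionTable name and the run_iteration wrapper are assumptions for illustration, not something given in the post:

#include "tbb/task.h"

typedef void (*FnPtr)();
extern FnPtr functionTable[];            // assumed table holding the n functions

class OrderedFunctionTask : public tbb::task {
    int index;                           // which function this task should run
public:
    OrderedFunctionTask(int i) : index(i) {}
    tbb::task* execute() {
        (*functionTable[index])();       // call the selected function
        return NULL;
    }
};

void run_iteration(int n) {
    // A dummy parent gives the children a reference count to decrement and
    // something to wait on, which is the barrier in the original pseudocode.
    tbb::empty_task& root = *new( tbb::task::allocate_root() ) tbb::empty_task;
    tbb::task_list list;
    for (int i = 0; i < n; ++i)
        list.push_back( *new( root.allocate_child() ) OrderedFunctionTask(i) );
    root.set_ref_count(n + 1);           // n children + the wait itself
    root.spawn_and_wait_for_all(list);   // returns only when all n functions are done
    root.destroy(root);                  // release the dummy parent
}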

Quoting - Robert Reed (Intel)

Well, I don't have time today to fully verify this, so this out-of-my-head implementation example is guaranteed not to compile and is probably missing some filigree here or there, but this will convey the basic idea:

tbb::task_list list;
for (int i = 0; i < n ; ++i) {
list.push_back( *new( tbb::task::allocate_child() ) OrderedFunctionTask(i) );
}
set_ref_count(n + 1);
spawn_and_wait_for_all(list);

Where OrderedFunctionTask is a class derived from Task whose function object selects the appropriate function, perhaps to initialize a local function pointer within the object. That should at least get you started.

Thanks, Robert. I got what you meant. I'll be looking forward to seeing if someone else has a better solution.

Is parallelism required, i.e., would the program fail if run on a single thread? How big is n? Is it fixed or variable?

Quoting - Raf Schietekat
Is parallelism required, i.e., would the program fail if run on a single thread? How big is n? Is it fixed or variable?

Hi Raf,

Yes, parallelism is required. It wouldn't fail if it runs on a single thread. I'm doing some experimentation.

n can vary from 10 to 100 or more, whereas the for-loop iteration count of 1:100 is just for illustration; it usually would be 1:100000. Do you have any insights on what numbers (n and for-loop iteration count) would actually make parallel execution faster than serial?

Best Reply

Other suggestions:
- similarly to the task list approach suggested by Robert, you may use the task_group class, which has methods to start a task and to wait for all running tasks belonging to the task_group.
- if the signatures of all functions are similar (i.e. they take the same number and types of arguments), you might put pointers to the functions into an array and run parallel_for over this array, calling your functions (via the pointers) in the parallel_for body.

The second suggestion, if possible to implement, could be somewhat more efficient, because the tasks to run will be produced by multiple threads, and stealing will be limited. In the task_group-based solution, all tasks will be produced by the same thread, so all other threads will have to constantly steal single tasks to execute.
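A minimal sketch of the task_group variant (assuming a hypothetical functionTable array; a plain functor is used so that no C++0x lambda support is required):

#include "tbb/task_group.h"

typedef void (*FnPtr)();
extern FnPtr functionTable[];       // assumed table holding the n functions

struct CallNth {                    // functor that runs the i-th function
    int i;
    CallNth(int i_) : i(i_) {}
    void operator()() const { (*functionTable[i])(); }
};

void run_iteration(int n) {
    tbb::task_group g;
    for (int i = 0; i < n; ++i)
        g.run( CallNth(i) );        // start one task per function
    g.wait();                       // barrier: returns once every task has finished
}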

But I also have a concern that parallelizing at the inner level likely has limited potential for scalability and speedup. Maybe reworking the algorithm to reduce or eliminate the need for barrier synchronization between outer-loop iterations would create additional parallelism. Of course, I do not know whether that is possible in your case.

"Yes, Parallelism is required. It wouldn't fail if it runs on single thread. I'm doing some experimentations."
The phrase "required parallelism" (or rather "required concurrency", I suppose) means specifically that the program would fail if processing would not at least preemptively switch between the different jobs (e.g., for a producer/consumer arrangement), so you actually don't require concurrency if it runs on a single thread. That's what TBB likes best (otherwise you might very easily run into trouble before you know it).

"n can vary from 10 to 100 or more. Where as the for loop iteration 1:100 is just for illustration. It usually would be 1:100000. Do you have any insights on what numbers (n and for loop iternation) would actually faster parallely than serial execution"
So it can vary at run time, and it's larger than the number of hardware/worker threads, right? You might very well go with #3, but Robert has already indicated the importance of recursive parallelism (here "parallel invocation hierarchy"), so I would instead consider making the functions accessible as range-addressable function objects (perhaps just with a numerical index, i.e., in an array), and using tbb::parallel_for instead of a tbb::task_list. To determine whether parallel is faster than serial, you have to consider both task size (relative to scheduling overhead) and available parallelism (are there enough tasks to distribute around and amortise the wait at the end?), and then I would suggest to just try different arrangements (there may also be other factors that affect the outcome), depending on how crucial optimisation is.

(Added) Hmm, this is what happens if you cannot obtain a lock before answering...

Quoting - Raf Schietekat
(Added) Hmm, this is what happens if you cannot obtain a lock before answering...

Great minds think alike :)

Thanks Alex and Raf! Good day!

I too agree that, if possible, you should use a functor list and a parallel_for or parallel_for_each for task distribution.

If you were to use a task list or spawn individual tasks, you would have one task spawn's worth of overhead per function (n spawns).

However, using parallel_for or parallel_for_each you would incur only about one task spawn's worth of overhead per participating thread.

Reduction of task scheduling overhead is important for writing efficient code.

Now your code might look something like the following.
Note: the following is written in QuickThread as opposed to TBB; however, nearly the same functionality is available in TBB, and you should be able to adapt this relatively easily. (Info on QuickThread is available at www.quickthreadprogramming.com.)

#include "QuickThread.h"
#include "parallel_for.h"

struct YourContext
{
  // define your context
  int dummyToAvoidNullStruct;
};

void YourFn1(YourContext* context) { return; }
void YourFn2(YourContext* context) { return; }
// ...


void YourFn1000(YourContext* context) { return; }

typedef void(*YourFunctorTakingYourContext)(YourContext*);

YourFunctorTakingYourContext YourFunctorList[] =
{
   YourFn1,
   YourFn2,
   //...
   YourFn1000
};

const int YourFunctorListCount = sizeof(YourFunctorList) / sizeof(YourFunctorList[0]);

void SomewhereInYourCode()
{
  for(int i=0; i < 100; ++i)
  {
     YourContext context;
     // initialize context for this iteration
     //...
     parallel_for_each(
	0,YourFunctorListCount,
	[&](int ifn) { (*YourFunctorList[ifn])(&context); });
  }
}

Jim Dempsey

www.quickthreadprogramming.com

Quoting - knmaheshy2k
Thanks Alex and Raf! Good day!

Well, poo! If I had known n could get as high as 100,000, even I would not have suggested the linear process of building a task list. ;-) A task list that large would quickly become an in-memory-only list (not cached anywhere), an anchor line dredging up tasks from the depths. Alexey's suggestion of a parallel_for seems the best in that it uses the available HW threads in concert to divide and dispatch the functions in parallel, in scalable blocks that can be worked on independently. I would not suggest parallel_for_each as someone recommended, since it has the same semantics as parallel_do and the same linear access as the task_list I originally suggested.

So whatcha building? Sounds like a possible implementation of a time step in some sort of simulator.

Quoting - Robert Reed (Intel)
So whatcha building? Sounds like a possible implementation of a time step in some sort of simulator.

Bull's eye!! You got it right, Robert!! Can't disclose much here, but I can let you know offline. mnanjund, you know what it is.. :)

Quoting - knmaheshy2k

Bull's eye!! You got it right, Robert!! Can't disclose much here, but I can let you know offline. mnanjund, you know what it is.. :)

I don't need to know more details, but I'd suggest that, with dispatch in hand, the next problem you might encounter is the potential for side effects from the function calls. Presumably each represents a functional block whose inputs have been set at the start of the parallel section, and each will have outputs to set before the end of the section. How those steps land and how they might affect cache residency among the threads would be my next area of concern. You can buffer the hell out of everything, but those buffers still need to reside somewhere in the address space and may be subject to contention themselves as the outputs from various functions are recombined to form the inputs for the next step.

Robert,

My comments suggested using parallel_for or parallel_for_each, the choice left up to the programmer. I illustrated parallel_for_each. The choice of template depends on the variance in estimated compute time for each function. When the function execution times are relatively the same, parallel_for can be used. When execution time varies greatly between functions, parallel_for_each might be in order. parallel_for_each in QuickThread may have lower overhead than parallel_for_each in TBB. The scheme in QT for parallel_for_each is to divide the iteration space by the current number of available threads (the current thread may be included in or excluded from the set), schedule an additional parallel_for_each over n-1 of the partitions, and directly call the body for one of the partitions. The task for the n-1 partitions starts (now having one less available thread) and then makes the same division choice. At some point all threads (optionally a subset of all threads on the system) are active. At this point the sub-ranges' function objects are directly called (no further task enqueue/dequeue). As the sub-ranges are processed by a shell function, the scheduler is examined for idle threads (a peek at one word). When an idle thread is noticed by any of the remaining running shell functions, that function will split its remaining range as was done at the beginning. While parallel_for_each has more overhead than parallel_for, it has the benefit of dynamic load balancing, which helps when the functions in your functor list have indeterminate run times and when the system is running applications other than yours.

The parallel_for syntax would be as follows

#include "QuickThread.h"
#include "parallel_for.h"

struct YourContext
{
  // define your context
  int dummyToAvoidNullStruct;
};

void YourFn1(YourContext* context) { return; }
void YourFn2(YourContext* context) { return; }
void YourFn1000(YourContext* context) { return; }

typedef void(*YourFunctorTakingYourContext)(YourContext*);

YourFunctorTakingYourContext YourFunctorList[] =
{
   YourFn1,
   YourFn2,
   //...
   YourFn1000
};

const int YourFunctorListCount = sizeof(YourFunctorList) / sizeof(YourFunctorList[0]);

void SomewhereInYourCode()
{
  for(int i=0; i<100; ++i)
  {
    YourContext context;
    // initialize context for this iteration
    //...
    parallel_for(
      0,YourFunctorListCount,
      [&](int ifnBegin, int ifnEnd) { 
        for(int ifn = ifnBegin; ifn < ifnEnd; ++ifn)
          (*YourFunctorList[ifn])(&context); });
  }
}

As an alternative, you can also use a chunking factor:

int iChunk = 2; // your number of functions per slice of iteration space

parallel_for(
  iChunk,
  0, YourFunctorListCount,
  [&](int ifnBegin, int ifnEnd) {
    for(int ifn = ifnBegin; ifn < ifnEnd; ++ifn)
      (*YourFunctorList[ifn])(&context); });

Jim Dempsey

www.quickthreadprogramming.com

Thanks, Jim. I will keep your comments in mind as I work.

Quoting - jimdempseyatthecove
My comments suggested using parallel_for or parallel_for_each, the choice left up to the programmer. I illustrated parallel_for_each. The choice of template depends on the variance in estimated compute time for each function. When the function execution times are relatively the same, parallel_for can be used. When execution time varies greatly between functions, parallel_for_each might be in order. parallel_for_each in QuickThread may have lower overhead than parallel_for_each in TBB. The scheme in QT for parallel_for_each is to divide the iteration space by the current number of available threads (the current thread may be included in or excluded from the set), schedule an additional parallel_for_each over n-1 of the partitions, and directly call the body for one of the partitions. The task for the n-1 partitions starts (now having one less available thread) and then makes the same division choice. At some point all threads (optionally a subset of all threads on the system) are active. At this point the sub-ranges' function objects are directly called (no further task enqueue/dequeue). As the sub-ranges are processed by a shell function, the scheduler is examined for idle threads (a peek at one word). When an idle thread is noticed by any of the remaining running shell functions, that function will split its remaining range as was done at the beginning. While parallel_for_each has more overhead than parallel_for, it has the benefit of dynamic load balancing, which helps when the functions in your functor list have indeterminate run times and when the system is running applications other than yours.

Sorry, Jim. I was going by the syntax of the TBB parallel_for_each, which matches that of the Microsoft version. This is, after all, a forum on Threading Building Blocks, not QT. For each of these interfaces, access to the list is defined by STL-style Input Iterators, which support sequential, not random, access, hence my previous comments. That thing in QT you call parallel_for_each in the example above is some different beastie altogether.

I also don't follow the logic suggested above regarding the choice of parallel_for versus parallel_for_each. TBB parallel_for does support load balancing, so if a couple of those functions take longer than the rest, the HW threads that call them get busy on those particular tasks and leave the untouched tasks for other HW threads to steal. If the tasks are all very small, I'd probably experiment with different grain sizes in the simple_partitioner to see where the balance point lives.
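For illustration, such a grain-size experiment might look roughly like this (Body and functionTable are hypothetical stand-ins, echoing the functor that appears later in this thread):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

typedef void (*FnPtr)();
extern FnPtr functionTable[];    // assumed table holding the functions
extern int   nFunctions;

struct Body {                    // same shape as the Body functor shown later in the thread
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int j = r.begin(); j != r.end(); ++j) (*functionTable[j])();
    }
};

void run_with_grainsize(int grainsize) {
    // simple_partitioner splits until sub-ranges are no larger than the grain size,
    // so varying it is an easy way to look for the balance point mentioned above.
    tbb::parallel_for(tbb::blocked_range<int>(0, nFunctions, grainsize),
                      Body(), tbb::simple_partitioner());
}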

No problem Robert.

Correct me if I am wrong... The load balancing of parallel_for partitions the iteration space into a set of tasks, with each task receiving a slice of the iteration space. The more slices, the more task enqueue/dequeue overhead. And the number of slices depends on the thread pool size and/or on the specified grain size. Because the partitioning of the iteration space occurs at the beginning (within the template), you can identify a grain size that may keep the majority of threads busy, but potentially at the expense of incurring more tasking overhead. The original post illustrated 1000 functions in the loop. While 1000 tasks would keep all threads busy, they would also incur the greatest amount of overhead for task queuing.

One technique that could be employed in TBB (as well as QuickThread) would be to time each function (the developer can do this during testing). Then, using the timing information, create n functor lists (n = size of thread pool). Then launch n tasks, one for each functor list. With some finesse, the timing can be done at run time, with functors moved from one list to another.

The end-game goal is to keep the task scheduler out of the way of performing useful work.
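As a rough sketch of that idea (all names, the greedy assignment, and the source of the timings are illustrative assumptions, not part of the suggestion itself):

#include <vector>
#include <algorithm>
#include "tbb/task_group.h"

typedef void (*FnPtr)();

// One task per functor list: run every function assigned to this list in order.
struct RunList {
    const std::vector<FnPtr>* fns;
    RunList(const std::vector<FnPtr>& f) : fns(&f) {}
    void operator()() const {
        for (int k = 0; k < int(fns->size()); ++k) (*(*fns)[k])();
    }
};

// Greedy pre-binning by measured cost: each function goes to the currently
// lightest of nThreads lists, then each list is launched as a single task.
void run_binned(const std::vector<FnPtr>& fns,
                const std::vector<double>& measuredCost,   // assumed timings from testing
                int nThreads)
{
    std::vector< std::vector<FnPtr> > lists(nThreads);
    std::vector<double> load(nThreads, 0.0);
    for (int i = 0; i < int(fns.size()); ++i) {
        int lightest = int(std::min_element(load.begin(), load.end()) - load.begin());
        lists[lightest].push_back(fns[i]);
        load[lightest] += measuredCost[i];
    }

    tbb::task_group g;
    for (int t = 0; t < nThreads; ++t)
        g.run(RunList(lists[t]));   // n tasks, one per functor list
    g.wait();                       // barrier for this outer iteration
}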

BTW, QuickThread has parallel_list for processing linked-list iteration spaces (similar to, but different from, TBB's parallel_while).

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove
Correct me if I am wrong... The load balancing of parallel_for partitions the iteration space into a set of tasks, with each task receiving a slice of the iteration space. The more slices, the more task enqueue/dequeue overhead. And the number of slices depends on the thread pool size and/or on the specified grain size. Because the partitioning of the iteration space occurs at the beginning (within the template), you can identify a grain size that may keep the majority of threads busy, but potentially at the expense of incurring more tasking overhead. The original post illustrated 1000 functions in the loop. While 1000 tasks would keep all threads busy, they would also incur the greatest amount of overhead for task queuing.

Well, load balancing is not an attribute specific to parallel_for but rather an attribute of the underlying scheduler. And parallel_for itself does not do the partitioning (there are several partitioners with different traits available) or the mapping of those partitions to the iteration spaces (handled by the blocked_range classes). The number of slices of the iteration space depends both on the blocked_range selected and the initial size of the space, though the auto_partitioner, which is the default in TBB 2.2, also uses the thread pool size as a factor in determining when to stop slicing. The slicing itself is much as you described previously for the QT parallel_for_each, a divide and conquer that splits the former iteration space in half. However, the partitioner only splits tasks on demand, not "within the template" as suggested above. Grain size is relevant only for the simple partitioner, where it results in less scheduler overhead, not more (it leaves lists of tasks that can be executed in sequence with just a brief pass through the scheduler). The original post used 100 functions (not 1000) but eventually speculated on the possibility of 100,000. While there is more scheduling overhead for 1000 (or 100,000) tasks, that overhead is distributed both among the thread pool and in time (you might call it "lazy partitioning"). It can get in the way sometimes; I know of one case where an OpenMP parallel for with the basic scheduler, operating over an array with very little work to do per element, beats the equivalent TBB parallel_for by several percent. However, in the context of calling 100 to 100,000 functions with presumably varying amounts of work, it should be a different story.

Quoting - jimdempseyatthecove
One technique that could be employed in TBB (as well as QuickThread) would be to time each function (the developer can do this during testing). Then, using the timing information, create n functor lists (n = size of thread pool). Then launch n tasks, one for each functor list. With some finesse, the timing can be done at run time, with functors moved from one list to another.


I'm not sure what's intended here. Presumably dividing into n "functor" lists is intended to assign one list per thread. Not described is how to form these lists, though my guess, based on context, is to balance the loads across the threads. Assuming that the functions in question individually execute with uniform durations (not guaranteed), it seems the goal would be to distribute the functions so as to have a combination of functions in each list that takes roughly the same amount of time. That itself seems to be a combinatorial problem, especially if executed dynamically, that would be its own source of overhead.

Robert,

OpenMP has a chunk argument for the parallel for schedule clause. When static scheduling is selected the slicing is fixed; you can elect to use a chunk size smaller than the one that evenly divides the iteration space into one piece per thread. When dynamic scheduling is selected the chunk size is reduced each time a thread completes a slice (sub-range of the iteration space). The art is in picking the correct scheduling method and (initial) chunk size such that the number of slices scheduled is small yet manages to keep all threads busy and/or finishes the complete iteration space in the shortest time. TBB partitioning (iteration space slicing) occurs up front. When a slice task starts, its iterator has a fixed slice of the original iteration space. The parallel_for template (combined with the internal functionality of your scheduler) will make an up-front decision as to the number of slices (i.e. tasks) to make. The art (for TBB) is in picking the number (size) of slices such that all slices finish in the shortest amount of time. And the programmer may use hints (schedule with/without chunk in OpenMP, choice of iterator with/without grain size in TBB). The problem, though, which is addressed to some extent with the OpenMP dynamic schedule and chunk, is that all cores/HW threads of the system are not always available and dedicated to the application. The iteration-space slice-up is purely speculative.
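For illustration only, the OpenMP hints described above might be applied to a hypothetical function-pointer table like this (functionTable and nFunctions are assumptions, not from the posts):

#include <omp.h>

typedef void (*FnPtr)();
extern FnPtr functionTable[];   // assumed table holding the functions
extern int   nFunctions;

void run_iteration_openmp() {
    // schedule(dynamic, 4): chunks of 4 iterations doled out as threads free up;
    // schedule(static) without a chunk would fix the partitioning entirely up front.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int ifn = 0; ifn < nFunctions; ++ifn)
        (*functionTable[ifn])();
}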

An alternative is not to use a general-purpose technique but instead to write a special-purpose method (with the potential of becoming generalized). A list of functor lists might be one such example for handling a large number of arbitrary functions that can be run in parallel in each iteration of a process loop.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove
OpenMP has a chunk argument for the parallel for schedule clause. When static scheduling is selected the slicing is fixed; you can elect to use a chunk size smaller than the one that evenly divides the iteration space into one piece per thread. When dynamic scheduling is selected the chunk size is reduced each time a thread completes a slice (sub-range of the iteration space). The art is in picking the correct scheduling method and (initial) chunk size such that the number of slices scheduled is small yet manages to keep all threads busy and/or finishes the complete iteration space in the shortest time. TBB partitioning (iteration space slicing) occurs up front. When a slice task starts, its iterator has a fixed slice of the original iteration space. The parallel_for template (combined with the internal functionality of your scheduler) will make an up-front decision as to the number of slices (i.e. tasks) to make. The art (for TBB) is in picking the number (size) of slices such that all slices finish in the shortest amount of time. And the programmer may use hints (schedule with/without chunk in OpenMP, choice of iterator with/without grain size in TBB). The problem, though, which is addressed to some extent with the OpenMP dynamic schedule and chunk, is that all cores/HW threads of the system are not always available and dedicated to the application. The iteration-space slice-up is purely speculative.

Jim, as much truth as there is in your statement above (certainly regarding OpenMP), I think the more that we talk, the further we get from knmaheshy2k's original problem.

TBB blocked_range partitioning occurs up front much like OpenMP partitioning occurs up front. An OpenMP schedule(static) prepartitions the iteration space and then has rules about doling out chunks to the thread team in round-robin fashion, in order of thread number, guaranteeing the same assignments occur in multiple compliant loops, etc. That is, part of the scheduling happens up front (the chunking) and part of it happens dynamically (the doling). Only if no chunk size is specified can schedule(static) get everything done up front. Likewise, when a thread gets a blocked_range as its task, it will do binary subdivision down one leg of the range, leaving the rest of the iteration space unpartitioned but scheduled as tasks. As you say, there is a certain art to picking the best policies for partitioning particular problems to maximize performance, whether you're talking about OpenMP or TBB.

Add to your comments for others to see...

Too often this partitioning occurs behind the curtain so to speak. The programmer, when tuning the application, may find it appropriate to insert performance diagnostic code to expose the partitioning and collect runtime sample data. This is not typically done with profilers as such but can be instrumented with conditional compiled code. Without the instrumentation you are "driving blind" when trying to diagnose partitioning problems. Note, what may run fine on your 4-core development system may end up running not so fine on your 8-core production system (or 6 core, 16 core, ...).
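A hypothetical example of such conditionally compiled instrumentation for a TBB body (the macro, table, and functor names are made up for illustration):

#include <cstdio>
#include "tbb/blocked_range.h"

// When PARTITION_TRACE is defined, each sub-range handed to the body is printed
// so the actual partitioning can be inspected; otherwise it compiles away.
#ifdef PARTITION_TRACE
#define TRACE_SLICE(r) std::printf("slice [%d, %d)\n", (r).begin(), (r).end())
#else
#define TRACE_SLICE(r) ((void)0)
#endif

typedef void (*FnPtr)();
extern FnPtr functionTable[];   // assumed table holding the functions

struct TracedBody {
    void operator()(const tbb::blocked_range<int>& r) const {
        TRACE_SLICE(r);         // record which slice this thread received
        for (int j = r.begin(); j != r.end(); ++j) (*functionTable[j])();
    }
};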

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove
Too often this partitioning occurs behind the curtain so to speak. The programmer, when tuning the application, may find it appropriate to insert performance diagnostic code to expose the partitioning and collect runtime sample data. This is not typically done with profilers as such but can be instrumented with conditional compiled code. Without the instrumentation you are "driving blind" when trying to diagnose partitioning problems. Note, what may run fine on your 4-core development system may end up running not so fine on your 8-core production system (or 6 core, 16 core, ...).

I agree. Though much progress has been made beyond the point of "debug via printf," in certain areas there are still dark corners that need to be plumbed, particularly as you add concurrency to the mix. I've used my blog in the past to describe exactly such efforts and hope to continue that in the future. The caution that goes with such considerations, whether it's to include hand-built instrumentation or rely on that inserted by another tool, is to use a light touch so as to minimize the influence on the experiment. Any time instrumentation is added, the resulting program is not the original and may not behave completely like the original. It's a lesson I considered, but not seriously enough, in a previous blog series on the very subject of the TBB scheduler. I made some assumptions for convenience in data collection that might have had a substantial effect on the results. Next time I'll take more caution. Our development teams at Intel continue to investigate ways to instrument and report on concurrent code to minimize the need to instrument by hand, but I don't believe that the need will ever go away completely.

>>Any time instrumentation is added, the resulting program is not the original and may not behave completely like the original.

This is quite true. In a recent stress test (two weeks ago) where the "computation code" was replaced with a no-op, with the intent of studying scheduling overheads, I noticed as much as 25% variation in overhead depending on the addition or removal of one or two lines of code and/or the addition/removal of static data. The variation was due not to functional execution (the code was the same) but rather to the alignment of code and/or data. Too often I find it a mistake to zoom in too close to make observations. Even polished glass has a rough surface when viewed with high enough magnification.

Instrumentation will work well for some situations, profiling for other situations, and in some circumstances only benchmarking with neither instrumentation nor profiling will do. For the final phase of tuning, use profiling and/or instrumentation to give you insight into potential areas of opportunity. Then consider experimenting with layout tweaks while running release code alone.

Jim Dempsey

www.quickthreadprogramming.com

Quoting -

- if the signatures of all functions are similar (i.e. they take the same number and types of arguments), you might put pointers to the functions into an array and run parallel_for over this array, calling your functions (via the pointers) in the parallel_for body.


I tried all the other ways suggested in the thread and had success with all except one. I happen to be making some mistake due to my lack of understanding of parallel_for with functors. I'm posting brief code here. Can someone let me know how exactly the main parallel_for loop should be written?

// Pointer to functions
typedef void (*pointer2Function)();

// Functions that have to be executed in parallel
void function0(){....}
void function1(){....}
void function2(){....}
void function3(){....}
void function4(){....}
void function5(){....}
void function6(){....}

int main()
{

pointer2Function function[7] = {NULL};

function[0] = &function0;
function[1] = &function1;
function[2] = &function2;
function[3] = &function3;
function[4] = &function4;
function[5] = &function5;
function[6] = &function6;

/*
I want to do this, basically.
for(i=0;i<100;i++)
{
// Execute these functions in parallel
function0();
function1();
function2();
function3();
function4();
function5();
function6();
//Wait till all functions are executed completely and then proceed
wait(all_functions_complete);
}
*/

// Parallel_for implementation for the above said
for(i=0;i<100;i++)
{
parallel_for(blocked_range<int>(0,6),
{for(int ifn = 0; ifn < 7; ifn++)
*function[ifn];}
);

}

}

I'm getting errors. I'm making some silly mistake that I'm not able to catch. Can someone point it out, please?

Quoting - knmaheshy2k

I tried all the other ways suggested in the thread and had success with all except one. I happen to be making some mistake due to my lack of understanding of parallel_for with functors. I'm posting brief code here. Can someone let me know how exactly the main parallel_for loop should be written?

// Pointer to functions
typedef void (*pointer2Function)();

// Functions that have to be executed in parallel
void function0(){....}
void function1(){....}

[snip]

// Parallel_for implementation for the above said
for(i=0;i<100;i++)
{
parallel_for(blocked_range<int>(0,6),
{for(int ifn = 0; ifn < 7; ifn++)
*function[ifn];}
);

}
}

I'm getting errors. I'm making some silly mistake that I'm not able to catch. Can someone point it out, please?

The TBB parallel_for takes as its second argument a function object. It looks like the code here attempts to create such a function object using a lambda construct, which will require the Intel V11 compiler enabled with the std=c++0x switch in order to compile, and will need correct syntax. What the code above misses in defining the lambda construct is the bit that describes how to map the local context into the closure. Off the top of my head (meaning I may not remember this exactly correctly), that parallel_for should probably look something like this.

parallel_for(blocked_range<int>(0,6), [&](const blocked_range<int> &r) {
    for (int j = r.begin(); j != r.end(); ++j) (*function[j])(); });

The missing bits that I added are [&] (declares that any local identifier is linked by reference to current context (how you get the linkage for the function table)), and a parameter for the function object which is how the blocked_range is passed to the function and partitioned for multiple threads to execute the function object. r is used in the body of the lambda rather than the constants specified in your example and a local index j then is used to select the appropriate function from the table. This may not be exactly right, but it should be pretty close.

Before I noticed #25 I prepared this (for existing C++ standard):

struct Body {
    void operator() (blocked_range<int>& range) const {
        for (int j = range.begin(), j_end = range.end(); j < j_end; ++j) {
            (*function[j])();
        }
    }
};

parallel_for(blocked_range<int>(0,6), Body());

@Robert and Raf: Thanks a bunch.

I found my error. It works perfectly fine now.

I have an observation from my experiments: I ran the same experiment with two different implementations. Can someone explain the reason for the following?

                  Task Group    Parallel_for
On single core    Slower        Faster
On multi core     Faster        Slower

How large is your function list? If it is small (2 to 10), you might experiment with parallel_invoke.
If larger, you might be able to make a nested parallel_invoke construction, although with a nested construction you would not have completely asynchronous operation of all tasks (i.e. you have an implicit barrier after each parallel_invoke). With a small number of functions, parallel_invoke might have lower overhead.
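A rough sketch of such a nesting (the function names merely echo the earlier example in this thread; the grouping and helper names are illustrative assumptions):

#include "tbb/parallel_invoke.h"

// Hypothetical functions standing in for the real work.
void function0(); void function1(); void function2();
void function3(); void function4(); void function5();

// Each inner invoke handles up to ~10 functions; the outer invoke runs the
// groups concurrently. Note the implicit join at the end of each invoke.
void run_group_A() { tbb::parallel_invoke(function0, function1, function2); }
void run_group_B() { tbb::parallel_invoke(function3, function4, function5); }

void run_all() { tbb::parallel_invoke(run_group_A, run_group_B); }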

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

How large is your function list? If it is small (2 to 10), you might experiment with parallel_invoke.
If larger, you might be able to make a nested parallel_invoke construction, although with a nested construction you would not have completely asynchronous operation of all tasks (i.e. you have an implicit barrier after each parallel_invoke). With a small number of functions, parallel_invoke might have lower overhead.

Jim Dempsey

Jim,

The function list definitely has more than 10 entries; it may be around 30-50, or sometimes even 100. What I don't understand is that, against all predictions, parallel_for is faster than task_group on a single core, but the phenomenon reverses on multicore. Can you or anyone explain it?

#30 "you have implicit barrier [after] each parallel_invoke" (my [correction])
What do you mean exactly by "barrier", and in what way is it more costly? The worker thread can still go out stealing other tasks, can't it?

#31 "on single core parallel_for is faster than [task_list]" (my [correction])
Most peculiar. What partitioners have you tried, how many cores, how much faster?

Quoting - Raf Schietekat
#31 "on single core parallel_for is faster than [task_list]" (my [correction])
Most peculiar. What partitioners have you tried, how many cores, how much faster?

I didn't specify the partitioner, so I guess it took the default grainsize of 1, which gave room for maximum parallelism but also had a lot of overhead. But when I changed from
parallel_for(blocked_range<int>(0,RANGE), Body());
to
parallel_for(blocked_range<int>(0,RANGE,4), Body());

parallel_for was faster than task_group. I have a 4-core machine with Hyper-Threading ON.

Is my understanding correct?

Also I had one more question. Are both the statements below same?

parallel_for(blocked_range<int>(0,RANGE), Body());

parallel_for(blocked_range<int>(0,RANGE,1), Body());

Quoting - knmaheshy2k

Also I had one more question. Are both the statements below same?

parallel_for(blocked_range<int>(0,RANGE), Body());

parallel_for(blocked_range<int>(0,RANGE,1), Body());

It is always safe to specify the grainsize for parallel_for, especially given the fact that default partitioners could change between versions. This will automatically take care of the number of hardware threads, and hence the code should be optimally scalable for any number of functions that you may want to call concurrently.

#33 "But when i [increased grain size to 4] Parallel for was faster than [task_list]. I have a 4 core machine with Hyper Threading ON." (my [edit]s)
That's weird. Are you running a pre-2.2 version, where simple_partitioner was the default? And you do really mean task_list, right? (Note to Intel: the open-source reference 1.16 of 2009-07-03 is self-contradictory about which partitioner is the default, even on the same page 23, because before in table 9 it indicates auto_partitioner as the default, luckily with a footnote to indicate the change, the text first implies that simple_partitioner is the default, so this should be adapted as well.)

"Are both the statements below same?"
Grain size is an optional parameter with 1 as the default, so yes.

Quoting - Raf Schietekat
#33 "But when i [increased grain size to 4] Parallel for was faster than [task_list]. I have a 4 core machine with Hyper Threading ON." (my [edit]s)
That's weird. Are you running a pre-2.2 version, where simple_partitioner was the default? And you do really mean task_list, right?

No. I actually meant task group. And I'm running TBB-2.2 August/Sept 2009 release.

Quoting - knmaheshy2k

No. I actually meant task group. And I'm running TBB-2.2 August/Sept 2009 release.

Ah, OK. All that newfangled stuff... :-) This seems like a (potentially slower) task_list, because it adds children while the task_group is assumed to be already running (allocate_additional_child_of()).

Quoting - Raf Schietekat

Ah, OK. All that newfangled stuff... :-) This seems like a (potentially slower) task_list, because it adds children while the task_group is assumed to be already running (allocate_additional_child_of()).

Well, it depends on usage, of course. If you add all tasks from a single location, then it will certainly be slower if tasks are being stolen at the same time. But if tasks are being added by recursive parallelism, even at the same depth, maybe the advantage of having tasks allocated and spawned locally becomes more important than the penalty of shared-reference-count manipulation, relative to a task_list built entirely from a single location. But that still doesn't explain why this would be faster than parallel_for, of course.
