What are the differences between parallel_for and parallel_reduce in Intel TBB?


I think parallel_reduce is better when we have to pass data between different grains or threads. Can someone explain how parallel_reduce works? I have pasted example code from the Tutorial:

A loop can do reduction, as in this summation:

    float SerialSumFoo( float a[], size_t n ) {
        float sum = 0;
        for( size_t i=0; i!=n; ++i )
            sum += Foo(a[i]);
        return sum;
    }

If the iterations are independent, you can parallelize this loop using the template function parallel_reduce as follows:

    float ParallelSumFoo( const float a[], size_t n ) {
        SumFoo sf(a);
        parallel_reduce( blocked_range<size_t>(0,n), sf );
        return sf.my_sum;
    }

The class SumFoo specifies details of the reduction, such as how to accumulate subsums and combine them. Here is the definition of class SumFoo:

    class SumFoo {
        const float* my_a;
    public:
        float my_sum;
        void operator()( const blocked_range<size_t>& r ) {
            const float *a = my_a;
            float sum = my_sum;
            size_t end = r.end();
            for( size_t i=r.begin(); i!=end; ++i )
                sum += Foo(a[i]);
            my_sum = sum;
        }
        SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {}
        void join( const SumFoo& y ) { my_sum += y.my_sum; }
        SumFoo( const float a[] ) : my_a(a), my_sum(0) {}
    };

Note the differences with class ApplyFoo from Section 3.2. First, operator() is not const, because it must update SumFoo::my_sum. Second, SumFoo has a splitting constructor and a method join that must be present for parallel_reduce to work. The splitting constructor takes as arguments a reference to the original object and a dummy argument of type split, which is defined by the library. The dummy argument distinguishes the splitting constructor from a copy constructor.

TIP: In the example, the definition of operator() uses local temporary variables (a, sum, end) for scalar values accessed inside the loop. This technique can improve performance by making it obvious to the compiler that the values can be held in registers instead of memory. If the values are too large to fit in registers, or have their address taken in a way the compiler cannot track, the technique might not help. With a typical optimizing compiler, using local temporaries for only written variables (such as sum in the example) can suffice, because then the compiler can deduce that the loop does not write to any of the other locations, and hoist the other reads outside the loop.

When a worker thread is available, as decided by the task scheduler, parallel_reduce invokes the splitting constructor to create a subtask for the worker. When the subtask completes, parallel_reduce uses method join to accumulate the result of the subtask. The graph at the top of Figure 5 shows the split-join sequence that happens when a worker is available.

[Figure 5: Graph of the Split-join Sequence (figure not reproduced here)]

An arc in Figure 5 indicates order in time. The splitting constructor might run concurrently while object x is being used for the first half of the reduction. Therefore, all actions of the splitting constructor that creates y must be made thread-safe with respect to x. So if the splitting constructor needs to increment a reference count shared with other objects, it should use an atomic increment.

If a worker is not available, the second half of the iteration is reduced using the same body object that reduced the first half. That is, the reduction of the second half starts where the reduction of the first half finished.

CAUTION: Because split/join are not used if workers are unavailable, parallel_reduce does not necessarily do recursive splitting.

CAUTION: Because the same body might be used to accumulate multiple subranges, it is critical that operator() not discard earlier accumulations. The code below shows an incorrect definition of SumFoo::operator():

    class SumFoo {
        ...
    public:
        float my_sum;
        void operator()( const blocked_range<size_t>& r ) {
            ...
            float sum = 0;   // WRONG: should be "float sum = my_sum;"
            ...
            for( ... )
                sum += Foo(a[i]);
            my_sum = sum;
        }
        ...
    };

With the mistake, the body returns a partial sum for the last subrange instead of all subranges to which parallel_reduce applies it.

The rules for partitioners and grain sizes for parallel_reduce are the same as for parallel_for. (Footnote from the original text: prior to Intel TBB 2.2, the default was simple_partitioner; compile with TBB_DEPRECATED=1 to get the old default.)

parallel_reduce generalizes to any associative operation. In general, the splitting constructor does two things:
- Copy read-only information necessary to run the loop body.
- Initialize the reduction variable(s) to the identity element of the operation(s).

The join method should do the corresponding merge(s). You can do more than one reduction at the same time: you can gather the min and max with a single parallel_reduce.

NOTE: The reduction operation can be non-commutative. The example still works if floating-point addition is replaced by string concatenation.
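As a follow-up to the last paragraph of the excerpt: here is my rough attempt at the "min and max with a single parallel_reduce" body it mentions (MinMaxFoo and FindMinMax are my own names, not from the Tutorial). Is something like this right?

    #include <algorithm>   // std::min, std::max
    #include <limits>      // std::numeric_limits
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    // Attempted body carrying two reduction variables at once.
    class MinMaxFoo {
        const float* my_a;
    public:
        float my_min, my_max;

        void operator()( const tbb::blocked_range<size_t>& r ) {
            const float* a = my_a;
            // Continue from earlier accumulations; do not reset to identity here.
            float lo = my_min, hi = my_max;
            for( size_t i = r.begin(); i != r.end(); ++i ) {
                lo = std::min(lo, a[i]);
                hi = std::max(hi, a[i]);
            }
            my_min = lo;
            my_max = hi;
        }

        // Splitting constructor: copy read-only data, reset reductions to identity.
        MinMaxFoo( MinMaxFoo& x, tbb::split )
            : my_a(x.my_a),
              my_min(std::numeric_limits<float>::max()),
              my_max(-std::numeric_limits<float>::max()) {}

        // Merge the subtask's reductions into this body.
        void join( const MinMaxFoo& y ) {
            my_min = std::min(my_min, y.my_min);
            my_max = std::max(my_max, y.my_max);
        }

        MinMaxFoo( const float a[] )
            : my_a(a),
              my_min(std::numeric_limits<float>::max()),
              my_max(-std::numeric_limits<float>::max()) {}
    };

    void FindMinMax( const float a[], size_t n, float& lo, float& hi ) {
        MinMaxFoo mm(a);
        tbb::parallel_reduce( tbb::blocked_range<size_t>(0, n), mm );
        lo = mm.my_min;
        hi = mm.my_max;
    }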


"Can someone explain how parallel_reduce work?"
Please study the Tutorial and Reference Manual, and then ask any remaining specific questions.

Where is the grain loop?
Where is the within-grain loop?

The body's operator(), as defined by you, processes a "chunk" (not "grain"), typically using a normal for loop based on the subrange's begin() and end() values.

Does that address your question, or what else did you intend?

    void operator()( const blocked_range<size_t>& r ) {
        const float *a = my_a;
        float sum = my_sum;
        size_t end = r.end();
        for( size_t i=r.begin(); i!=end; ++i )
            sum += Foo(a[i]);
        my_sum = sum;
    }

The above code is executed for each thread, isn't it? When do we differentiate between actual loop iterations and grainsize? What do r.begin() and r.end() hold? Regards, Van Houten.

Each thread will typically process multiple chunks, as a strategy for workload distribution. See Tutorial "Automatic Chunking" and "Controlling Chunking" for background about how partitioner and grainsize both affect chunk size (for parallel_reduce as well as parallel_for). r.begin() and r.end() delineate the subrange describing the chunk.
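To illustrate (a minimal sketch reusing SumFoo from your post; the grainsize of 1000 is an arbitrary choice, not a recommendation), this is how you would pin chunk sizes down with an explicit grainsize and simple_partitioner:

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    #include "tbb/partitioner.h"

    // Assumes the SumFoo class from the original post.
    float ParallelSumFooChunked( const float a[], size_t n ) {
        SumFoo sf(a);
        // simple_partitioner splits recursively until a subrange is no longer
        // divisible, so chunks end up with more than 500 and at most 1000
        // iterations here (unless the whole range is already smaller).
        tbb::parallel_reduce( tbb::blocked_range<size_t>(0, n, 1000),
                              sf,
                              tbb::simple_partitioner() );
        return sf.my_sum;
    }

With the default auto_partitioner, chunks can be much larger than grainsize.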

Sir, can you give an example of delineation? Regards, Van Houten.

At the risk of getting into a circular definition for a generic English word that doesn't even appear in the documentation: I just meant the begin/end values (for a one-dimensional range anyway), e.g., 3 and 7 delineate the range of integers i such that 3<=i<7 (inclusive/exclusive considered implied). Please have a look at the Reference Manual, which documents the "Range Concept" in the "Algorithms" section.
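If a trivial standalone illustration helps (nothing here is specific to parallel_reduce):

    #include <cstdio>
    #include "tbb/blocked_range.h"

    int main() {
        // 3 and 7 delineate the half-open range [3, 7), i.e. i = 3, 4, 5, 6.
        tbb::blocked_range<int> r(3, 7);
        std::printf( "begin=%d end=%d size=%lu divisible=%d\n",
                     r.begin(), r.end(),
                     (unsigned long)r.size(), (int)r.is_divisible() );
        return 0;
    }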

You could also get a copy of the "Intel Threading Building Blocks" book and read it at your leisure.

Sir, how do we differentiate between the number of grains and the number of loop iterations? What is n? Is n = grainsize, or n = number of grains * grainsize? Regards, Van Houten.

It's "chunks", not "grains". I'm not too fond of using the word "iterations" for parallel execution, but it's the chunks that are eventually processed. n seems to be the end of the original range in this example, that happens to begin at 0. If the range begins at m, n-m would be the range length, so in this case it's equal to n. Chunks can have widely different lengths (especially with auto_partitioner), of at least grainsize/2.0, unless the original range length is already less than that, of course. Obviously no simple multiplicative relationship applies, even with simple_partitioner.

Sir, I think chunksize and grainsize are the same; the only difference is that grainsize is a value we might give and chunksize is a value which the Intel TBB partitioner chooses between grainsize/2 and grainsize. Regards, Van Houten.

I think n is the total number of iterations and r.begin() and r.end() are the subranges for each chunk. Regards, Van Houten.

"I think chunksize and grainsize is the same, only difference is grainsize is a value we might give and chunksize is a value which the Intel TBB partitioner chooses between grainsize/2 and grainsize."
Chunk size may exceed grainsize for other partitioners than simple_partitioner.

"I think n is the total number of iterations and r.begin() and r.end() are the subranges for each chunk."
n is the end of the input range, and only equal to the number of iterations because in this example the input range happens to begin at 0. r is the subrange.
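If it helps, you can watch the chunks for yourself by instrumenting a body (a quick sketch; TraceSum and TraceSumAll are made-up names, and the printf output will of course interleave across threads):

    #include <cstdio>
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    struct TraceSum {
        const float* a;
        float sum;
        TraceSum( const float* a_ ) : a(a_), sum(0) {}
        TraceSum( TraceSum& x, tbb::split ) : a(x.a), sum(0) {}
        void operator()( const tbb::blocked_range<size_t>& r ) {
            // Each call receives one chunk [r.begin(), r.end())
            // of the input range [0, n).
            std::printf( "chunk [%lu, %lu)\n",
                         (unsigned long)r.begin(), (unsigned long)r.end() );
            for( size_t i = r.begin(); i != r.end(); ++i )
                sum += a[i];   // keep accumulating; never reset sum here
        }
        void join( const TraceSum& y ) { sum += y.sum; }
    };

    float TraceSumAll( const float a[], size_t n ) {
        TraceSum body(a);
        tbb::parallel_reduce( tbb::blocked_range<size_t>(0, n), body );
        return body.sum;
    }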

Sir, I checked the documentation of parallel_reduce; it says the iteration space is divided into two halves, which are handled by two threads. What happens next? Is each half divided into two halves again? Can someone explain this procedure? Regards, Van Houten.

I'm afraid I don't fully understand the question. The input range is dynamically divided into chunks based on partitioner, grainsize and an element of chance based on stealing. This also affects the body splits and merges. Adjacent chunks are typically executed on the same thread using the same body anyway because they were often created just to provide enough "parallel slack" for workload balancing. Just look at the documented guarantees and respect the requirements, and let TBB take care of the rest.

Sir, is it like this: does Intel TBB divide each chunk into two iteration spaces for two threads for hyper-threading, one for the main thread and one for a worker thread, since each chunk is work for one hardware thread or core? Regards, Van Houten.

Sir, so you are saying chunks 1 and 2 will be given to thread 1, chunks 3 and 4 will be given to thread 2, etc.? Is a chunk the same as a task? Regards, Van Houten.

Sir, does dividing the iteration space into two halves mean giving two adjoining chunks to the same thread? Regards, Van Houten.

A chunk is a final subrange processed in Body::operator(). TBB is currently agnostic about where threads run, so there's nothing special about hyperthreads. Chunks may be given (affinity_partitioner), taken (stealing) or processed locally. Tasks may be recycled to process more than one chunk, and I suspect that some algorithms do that (I would have to check the code). Adjacent ranges are likely to be executed on the same thread, but obviously not always or there would be no concurrency.
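For what it's worth, "given" (affinity_partitioner) looks like this in code. A sketch reusing SumFoo from the original post; note that the partitioner object has to survive across calls to be of any use:

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    #include "tbb/partitioner.h"

    // The affinity_partitioner remembers which thread ran which chunk, so
    // repeated calls over the same data tend to replay chunks on the same
    // threads (good for cache reuse).
    static tbb::affinity_partitioner ap;

    float SumWithAffinity( const float a[], size_t n ) {
        SumFoo sf(a);   // SumFoo as defined in the original post
        tbb::parallel_reduce( tbb::blocked_range<size_t>(0, n), sf, ap );
        return sf.my_sum;
    }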

"Tasks may be recycled to process more than one chunk, and I suspect that
some algorithms do that (I would have to check the code)."
Oh yes, task reuse to process more than one chunk seems to tend to occur
through the scheduler bypass mechanism, which is even more efficient
than recycling, as decided by the partitioner. Tasks also get recycled,
often together with use of a continuation, but after reminding myself of
the main mechanism I didn't stick around to analyse those uses at this
time.

However, you don't need to know any of that to use the packaged algorithms, and often you can write your programs without exposing yourself to tasks... Just look at the documented requirements and guarantees, and let TBB take care of the rest.
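For example, with a compiler that supports lambdas you can use the functional form of parallel_reduce and skip the body class entirely (a sketch of the same summation as above; lambdaSum is a made-up name):

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"

    float lambdaSum( const float a[], size_t n ) {
        return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, n),
            0.0f,                                       // identity of addition
            [=]( const tbb::blocked_range<size_t>& r, float running ) -> float {
                for( size_t i = r.begin(); i != r.end(); ++i )
                    running += a[i];                    // accumulate one chunk
                return running;
            },
            []( float x, float y ) { return x + y; }    // join two partial sums
        );
    }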

Sir, I think each chunk is divided into two equal subranges, but these subranges are not subdivided again, as that would defeat the purpose or cause repetition. Regards, Van Houten.

Please see the documentation and answers already given.

I agree with Raf's recommendation that you take another look at the documentation, and particularly the tutorial manual. The questions you pose seem to suggest a lack of basic understanding of the process. Recursive subdivision is exactly what takes place in both parallel_for and parallel_reduce, dividing the range up into enough pieces that all the available HW threads (be they hyper-threads or multiple cores) can get a piece of the action. The first thread in starts the division process and continues dividing down one branch, using one of several partitioning strategies to decide when to stop, and leaving behind chunks of subrange of varying size as it further subdivides down that branch. Those are up for being stolen by the other HW threads, who continue the subdivision process on their own branches. So check out the manual, get a good understanding of the basic parallelism strategies, and that will hopefully provide some answers.
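To make the recursive subdivision concrete, here is the same splitting machinery driven by hand, outside of any parallel algorithm (a toy, single-threaded sketch; inside parallel_for and parallel_reduce, TBB does this for you, with stealing deciding which thread works on which branch):

    #include <cstdio>
    #include "tbb/blocked_range.h"

    // Recursively split a range the way the algorithms do, printing the leaves.
    void subdivide( tbb::blocked_range<size_t> r ) {
        if( r.is_divisible() ) {
            // The splitting constructor cuts r roughly in half:
            // 'right' takes the second half, r shrinks to the first half.
            tbb::blocked_range<size_t> right( r, tbb::split() );
            subdivide(r);
            subdivide(right);
        } else {
            std::printf( "leaf chunk [%lu, %lu)\n",
                         (unsigned long)r.begin(), (unsigned long)r.end() );
        }
    }

    int main() {
        subdivide( tbb::blocked_range<size_t>(0, 16, 4) );  // grainsize 4
        return 0;
    }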
