TBB Parallel_for Grain Size Experiments

While working with some of the Threading Building Blocks example code, I decided to experiment with the grain size parameter in the parallel_for construct. I selected the sub_string_finder_extended.cpp program in the examples/GettingStarted subdirectory for my initial experimentation, because this program is fairly simple and compact: the parallelized SubStringFinder() class consists of 26 lines of basic string search and comparison operations; the entire program (including a serial version of the substring finder) consists of about 140 lines.

parallel_for and grain size

Here's the parallel_for code in the main() function of the original sub_string_finder_extended.cpp program:

 parallel_for(blocked_range<size_t>(0, to_scan.size(), 100),
              SubStringFinder( to_scan, max, pos ) );

This code calls the parallel_for template, defining the range for the problem as running from 0 to to_scan.size(), and the grain size as 100. The grain size specifies the approximate number of elements in each sub-range. In other words, if the grain size is set to 100, then the composite work task will be divided into subtasks, where each subtask performs the required work operation on about 100 array elements.

Specifying grainsize

One might wonder: why was 100 selected as the default grain size for this particular problem? And, how does one know how to select a proper grain size?

The "Threading Building Blocks Tutorial" manual has a fairly detailed discussion of grain size (see section 3.2.1). Here's its definition of grainsize:

grainsize specifies the number of iterations for a "reasonable size" chunk to deal out to a processor

That's still not very specific. Further down the page, the manual offers:

A rule of thumb is that grainsize iterations of operator() should take at least 10,000-100,000 instructions to execute.

Now this starts making some sense. The rule of thumb is essentially telling us that we want our processor to be able to chug through at least 10,000 processing instructions while performing a single subtask.

So, the selection of grain size will be dependent on the complexity of the function being applied within the parallel_for loop, and the amount of looping that occurs within the called code. The optimal grain size, then, would seem to vary with different applications. Indeed, if you go to the TBB examples directory and execute:

grep parallel_for */*/*.cpp

you get a list of parallel_for template calls with grainsize settings of 1, 10, 100, and 1000, as well as values defined by variables set elsewhere in the program.

How big an effect?

This would appear to imply that the selection of the proper grain size is critical to the performance of any application that applies Threading Building Blocks. However, a bit further down in the Tutorial manual's discussion of grainsize, we see:

TIP: You do not have to set the grainsize too precisely.

This indeed is what my own experimentation with the sub_string_finder_extended program showed. First I added a line to show me the size of the blocked range:

cout << "to_scan.size() = " << to_scan.size() << endl;

It turns out that to_scan.size() is 17711 for the default string that is created and tested by the application. So, the default grainsize setting was breaking these 17711 iterations of the parallelized SubStringFinder() work into blocks of about 100 iterations each. Hence, we'd expect about 178 total subtasks to be created by the parallel_for.

But what if the grainsize was set to 10, or to 1000, or to 1, or to 10000? I constructed a loop to test these and other possibilities:

 cout << "to_scan.size() = " << to_scan.size() << endl;

 size_t iGrainSize = 1;
 for (int jGrainSize = 0; jGrainSize < 10; jGrainSize++) {
     parallel_t0 = tick_count::now();
     parallel_for(blocked_range<size_t>(0, to_scan.size(), iGrainSize),
                  SubStringFinder( to_scan, max, pos ) );
     parallel_t1 = tick_count::now();
     cout << " Done with parallel version for Grainsize " << iGrainSize <<
         " seconds: " << (parallel_t1 - parallel_t0).seconds() << endl;

     if (iGrainSize < 10000) {
         iGrainSize *= 10;
     } else {
         iGrainSize += 2000;
     }
 }
Here are the results of running this on my AMD 64 dual-core Gentoo machine:

to_scan.size() = 17711
Done with parallel version for Grainsize 1 seconds: 5.6435
Done with parallel version for Grainsize 10 seconds: 5.64274
Done with parallel version for Grainsize 100 seconds: 5.65927
Done with parallel version for Grainsize 1000 seconds: 5.80206
Done with parallel version for Grainsize 10000 seconds: 5.80161
Done with parallel version for Grainsize 12000 seconds: 5.80175
Done with parallel version for Grainsize 14000 seconds: 5.80137
Done with parallel version for Grainsize 16000 seconds: 5.80179
Done with parallel version for Grainsize 18000 seconds: 11.2642
Done with parallel version for Grainsize 20000 seconds: 11.2644

Considering the results

What this shows us is that, indeed, the performance was not highly sensitive to the selected grainsize. A grain size of 10 had the best performance, but the calculations took only 2.8% longer when the grainsize was set to 16000.

The very noticeable discontinuity is at a grain size of 18000. Here, the parallel_for is in effect told to process the entire loop range as a single task (since the size of the range is smaller than the grain size). So, one of my processing cores sits idle while the other one chugs through the entire analysis.

A little mystery?

Looking at the output results another way: does it seem strange that performance did not notably decline for grainsize values greater than 8856 (half the size of the loop range)? The fact that it did not suggests that when the parallel_for template "sees" that only two subtasks will be created, it divides the total task into two evenly sized subtasks.

In other words, given 17711 loop elements, if the program specifies a grainsize between 8856 and 17710, the parallel_for applies an actual grain size that divides the full loop in half; that is, the loop is divided into a subtask that performs 8855 iterations and another that performs the remaining 8856 iterations. That's what the output data suggests, anyway.


I've been wondering about the TBB parallel_for grain size setting for quite a while, having read about it in many places and heard it talked about at the OSCON TBB tutorial. So, when I tired of inexplicable errors arising as I tried to link Perl and C++ using SWIG (the project I've intended to work on this week), I took a break from that and did some grainsize experimentation.

The hint in the TBB Tutorial manual that tells us we "do not have to set the grainsize too precisely" is indeed true. But it is important to know how many iterations your analysis involves. Knowing that, make sure to set your grain size low enough so that all of your processors/cores will be utilized.



Kevin Farnham:

cdarke: I believe the grain size that you specify is used as a "suggestion" rather than as a fixed value. If you specify a grain size of number_of_iterations - 1, TBB will take that value and "modify" it to try to create 2 equally-sized tasks (on a dual-core machine). The process of attempting to create two equally-sized sets of work will result in a modification of the specified grain size: the applied grain size will be number_of_iterations / 2.

So, on a dual core machine, specification of any grain size between number_of_iterations / 2 through number_of_iterations - 1 will produce an "effective" grain size of number_of_iterations / 2, as TBB attempts to equalize the amount of work divvied out to each processor.

cdarke:

Thanks for an interesting post. In my own simple experiments I found that setting grain size to any number equal to or greater than the number of iterations produced bad performance - no surprise, since it is effectively serialised. What amazed me was that reasonable parallel performance was noticeable at number_of_iterations - 1 (dual-core).

Kevin Farnham:

Hi shoumeng yan. Are you asking why a grainsize that divides the task into two ranges, each half the size of the full range, doesn't produce the most efficient performance? If I were writing this code using pthreads, that's certainly what I would do, because, as you say, each task involves overhead. So dividing the work into exactly two tasks on a two-core processor should produce the best performance, provided the work for each index within the parallel_for range takes exactly the same amount of processing time. But if some index values take different amounts of processing time, then evenly dividing the total range in half could leave one core idle for a while as the other completes its larger share of the processing. I don't know for sure whether this is what happens in my test case, but it's a possible reason why dividing a range exactly in half might not produce maximum processing speed for a given task on a dual-core system.

(name withheld):

I am wondering why the best grainsize is 10 instead of half the iterations. I think more tasks will bring more scheduling overhead.
Right? Could you explain?
