TBB Parallel_for Task Division Curiosity

I closed my "Grain Size Experiments" post with some thoughts about "a little mystery" -- the fact that when I set my parallel_for grain size to any value above 50% but below 100% of my total range, the grain size actually used is one that evenly divides the work between the two processors (on my dual-core Gentoo system).

Specifically, the sub_string_finder example in the TBB "Getting Started" manual runs a parallel_for over a range of 17711 items. If I specify a grain size anywhere greater than 8856 and less than 17711, TBB applies an effective grain size of (about) 8856, which neatly divides the full task into two equally sized subtasks. This illustrates why the documentation says:

The parallel_for subdivides the range into sub-ranges that have approximately grainsize elements.

Some parallel_for task division experiments

I was curious to find out how the parallel_for breaks up the full range and what order it chooses to process the subranges. So, I modified the SubStringFinder class in my sub_string_finder_extended.cpp file, having it print the range it works on each time it is called. I also added timing (primarily for future experiments). Here's my new SubStringFinder class, with my added lines marked by "// added" comments:

class SubStringFinder {
  const string str;
  size_t *max_array;
  size_t *pos_array;
public:
  void operator() ( const blocked_range<size_t>& r ) const {
    tick_count t0 = tick_count::now();                  // added
    for ( size_t i = r.begin(); i != r.end(); ++i ) {
      size_t max_size = 0, max_pos = 0;
      for (size_t j = 0; j < str.size(); ++j)
        if (j != i) {
          size_t limit = str.size()-( i > j ? i : j );
          for (size_t k = 0; k < limit; ++k) {
            if (str[i + k] != str[j + k]) break;
            if (k > max_size) {
              max_size = k;
              max_pos = j;
            }
          }
        }
      max_array[i] = max_size;
      pos_array[i] = max_pos;
    }
    tick_count t1 = tick_count::now();                  // added
    cout << r.begin() << " - " << r.end() << " dT "     // added
         << (t1 - t0).seconds() << endl;                // added
  }

  SubStringFinder(string &s, size_t *m, size_t *p) :
    str(s), max_array(m), pos_array(p) { }
};
So, at the end of each invocation of SubStringFinder's operator(), I'll see the element range that was processed and the number of seconds it took to process that range.

Specifying a grainsize value of 2000, here are the results (running on my dual-core system):

Done building string.
to_scan.size() = 17711
8855 - 9962 dT 0.745212
0 - 1106 dT 0.766304
9962 - 11069 dT 0.737665
1106 - 2213 dT 0.744243
11069 - 12176 dT 0.762377
2213 - 3320 dT 0.744962
3320 - 4427 dT 0.722956
12176 - 13283 dT 0.735174
4427 - 5534 dT 0.738268
13283 - 14390 dT 0.733346
14390 - 15497 dT 0.713311
5534 - 6641 dT 0.716494
15497 - 16604 dT 0.681498
6641 - 7748 dT 0.764419
16604 - 17711 dT 0.582137
7748 - 8855 dT 0.753216
Done with parallel version for Grainsize 2000 seconds: 5.95126

Judging from this, it appears that the full 17711 range is first divided in half. Then, each half-range is divided into subranges of size about 1107 elements. Then, the subranges in each half-range are processed one by one.

The implication is that TBB has recognized that my Gentoo system has two processing cores, so it divides the entire task into two approximately equal sub-tasks (elements 0-8854 and elements 8855-17710). From there, it looks at my specified grain size (2000), and selects an effective grain size that divides each half-range into an equally-sized set of subtasks. In this case, the selection of a sub-task size of 1106 or 1107 elements creates 8 calls to the SubStringFinder class for each half-range (i.e., for each processor); 16 calls to SubStringFinder in all.

My guess, then, is that the following subtasks are performed, in the listed order, by each of my processing cores (with subtasks in the same row being performed approximately simultaneously):

Core 1             Core 2
0 - 1105           8855 - 9961
1106 - 2212        9962 - 11068
2213 - 3319        11069 - 12175
3320 - 4426        12176 - 13282
4427 - 5533        13283 - 14389
5534 - 6640        14390 - 15496
6641 - 7747        15497 - 16603
7748 - 8854        16604 - 17710

Further questions

It's interesting that when I specify a grain size of 2000, the parallel_for chooses to apply an actual grain size of 1107 or so. A grain size of about 2214 would have also worked (dividing each half-task into 4 sub-tasks, rather than 8). 2214 is certainly closer to my requested grain size of 2000 than 1107 is. Yet, TBB chose to apply the smaller grain size.

Another question, or curiosity, is: does the fact that the half-tasks were divided into 8 subtasks imply that TBB prefers, or requires, a number of subtasks that is a power of 2?

I looked into the TBB source code, and found a do_split() function in file blocked_range.h. A comment describes do_split() as being an "Auxilary function used by forking constructor." do_split() returns the middle index of a blocked range. This is the only code I found, in a quick grep-based perusal of the source, that seems to actively subdivide element ranges.

Thinking about this code, and also about the TBB "task-stealing" I've read and heard about, it seems like TBB might indeed "prefer" powers of two when it splits up tasks. For tasks that are uniform -- that is, where the computations take very nearly the same amount of time for each element in the array -- the result will be a number of subtasks that is a power of 2.

But what will happen if different elements occupy different amounts of processing time? I've worked on lots of problems that have this characteristic. I'll experiment with this condition, and report my findings in an upcoming post!

Kevin Farnham
O'Reilly Media
