Threading Building Blocks Parallel_for Task Stealing Demo

To experiment with Threading Building Blocks task stealing, I further modified the sub_string_finder_extended.cpp program that I've used for my recent parallel_for testing. I had previously specified a grain size of 2000, which resulted in TBB applying an effective grain size of about 1107 and dividing the parallel_for loop into 16 subtasks. Running on my AMD dual-core system, each core processed 8 of the subtasks.

Inducing TBB task stealing


To make a program where task stealing would be required, I used the knowledge I had gained from my earlier testing about how TBB divides up the sub_string_finder parallel_for tasks. I inserted a simple loop around the main loop in the SubStringFinder() function. In cases where the index subrange is in the first half of the overall range, the new loop is executed 4 times; in the other cases, the new loop is executed 40 times. I also added a time output to the printout for each call to SubStringFinder().

Here's the modified code in the SubStringFinder class:


class SubStringFinder {
    const string str;
    size_t *max_array;
    size_t *pos_array;
public:
    void operator() ( const blocked_range<size_t>& r ) const {
        tick_count t0 = tick_count::now();

        // 8855 is the midpoint of the 17711-character string: subranges in
        // the upper half repeat the main loop 40 times, subranges in the
        // lower half only 4 times.
        int n;
        int n1;
        if (r.begin() >= 8855) {
            n1 = 40;
        } else {
            n1 = 4;
        }
        for (n = 0; n < n1; n++) {
            for ( size_t i = r.begin(); i != r.end(); ++i ) {
                size_t max_size = 0, max_pos = 0;
                for (size_t j = 0; j < str.size(); ++j)
                    if (j != i) {
                        size_t limit = str.size() - ( i > j ? i : j );
                        for (size_t k = 0; k < limit; ++k) {
                            if (str[i + k] != str[j + k]) break;
                            if (k > max_size) {
                                max_size = k;
                                max_pos = j;
                            }
                        }
                    }
                max_array[i] = max_size;
                pos_array[i] = max_pos;
            }
        }
        tick_count t1 = tick_count::now();

        time_t tNow = time(NULL);
        cout << r.begin() << " - " << r.end() <<
                " tNow " << tNow <<
                " dT " << (t1 - t0).seconds() << endl;
    }
    SubStringFinder(string &s, size_t *m, size_t *p) :
        str(s), max_array(m), pos_array(p) { }
};


In effect, I'm making the 8 subtasks for the upper index range do 10 times as much work as the 8 subtasks for the lower index range. Hence, if TBB divides the total task into subtasks and assigns them the same way as previously, all 8 lower-index subtasks will be complete before the first upper-index subtask is complete. On my dual-core system, one processor will then have no work to do, while the other processor still has 7 of its 8 assigned tasks awaiting action.

Seems like an ideal situation for some task stealing, right?

The results


Here's the printed output when I ran the program on my dual-core system:


./sub_string_finder_extended
Done building string.
to_scan.size() = 17711
0 - 1106 tNow 1186917389 dT 3.03536
1106 - 2213 tNow 1186917392 dT 2.90797
2213 - 3320 tNow 1186917395 dT 2.89654
3320 - 4427 tNow 1186917398 dT 2.85414
4427 - 5534 tNow 1186917401 dT 2.91334
5534 - 6641 tNow 1186917404 dT 2.82746
6641 - 7748 tNow 1186917406 dT 2.97567
7748 - 8855 tNow 1186917409 dT 2.92582
8855 - 9962 tNow 1186917415 dT 29.3574
13283 - 14390 tNow 1186917438 dT 28.2574
9962 - 11069 tNow 1186917445 dT 29.2353
14390 - 15497 tNow 1186917465 dT 27.4037
11069 - 12176 tNow 1186917475 dT 29.9923
15497 - 16604 tNow 1186917491 dT 26.2405
12176 - 13283 tNow 1186917504 dT 28.9419
16604 - 17711 tNow 1186917514 dT 22.2974
Done with parallel version for Grainsize 2000 seconds: 127.536


It's a bit hard to see what this data implies. The short answer is: task stealing took place. If you add up all the dTs (the time spent in each execution of SubStringFinder()), the result is about 245 seconds. Meanwhile, the last line of output tells us that the total execution time for the parallel version was about 128 seconds. Clearly, both processors were busy most of the time.

Conclusion


The experiment created a parallel_for loop with a specified grain size, resulting in Threading Building Blocks dividing the loop into 16 subtasks. The subtasks were intentionally structured so that all 8 tasks assigned to one processor in a dual-core system would complete before the first task assigned to the second processor completed, leaving 7 unstarted tasks available but assigned to the second processor.

The simple printouts show that the idle processor took over, or "stole," some of the tasks that were originally assigned to the other processor. In other words, TBB adapted its tasking in real time, adjusting subtask assignments so that the available processing power was fully utilized.

In my next post I'll look at the timing for each subtask more closely, and try to ascertain exactly what each processing core was working on at each moment during the experiment.

Kevin Farnham
O'Reilly Media
