parallel_for problem on Windows XP


I'm having a problem with TBB 2.2 on Windows XP SP3 on a dual-core machine. The same program runs fine on Ubuntu. It happens with parallel_for when there are more than 8 elements in my collection. I have simplified my code to the bare minimum:


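#include <iostream>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

class Parallel {
public:
	// Print the first index of each chunk handed to the body.
	void operator() (const tbb::blocked_range<int>& range) const {
		std::cout << range.begin() << std::endl;
	}
};

int main() {
	int SIZE = 8;

	tbb::parallel_for(tbb::blocked_range<int>(0, SIZE, 1), Parallel());
}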

Edit: The problem is that I can't force the grain size to 1. After more tests, it seems that the grain size I set to 1 is no longer honored once SIZE is greater than 8. Is this normal behavior? Is it possible to force the grain size?


If I remember correctly, at some point the default partitioner changed from simple_partitioner to auto_partitioner. The simple_partitioner will always divide down to the grainsize specified in the range; the auto_partitioner may stop dividing earlier (which is a good thing) but will also not go further than the simple_partitioner would. Supposedly both don't subdivide at all if there's only one thread to work with. I suppose I'll be corrected if necessary...

What you're observing seems consistent with auto_partitioner, so I suppose that you have a TBB version that still defaults to simple_partitioner on Ubuntu. If you really want to reach the specified grainsize, you should probably explicitly use simple_partitioner in parallel_for(), but that would probably harm performance (in most cases), so you might as well be happy that TBB now does what you've observed.

(Correction) You may also have a different number of cores on your Ubuntu machine, so that you'll see the same thing happening from a larger SIZE, even with the same TBB version.

Quoting Raf Schietekat
Supposedly both don't subdivide at all if there's only one thread to work with. I suppose I'll be corrected if necessary...

No, parallel_for is agnostic of the number of threads, and each partitioner will just behave "normally" on a single thread. That is, simple_partitioner will split chunks down to the given grainsize, and auto_partitioner will apply its strategy of combining chunks (which takes the number of worker threads into account, but does not check for the single-thread case).

Other than that, I agree with Raf.

Did I confuse this with parallel_scan() then? But isn't it an obvious, simple optimisation everywhere?

In its most generic overload, parallel_for is also agnostic of the type of "iteration space" - it was designed to work with any splittable "ranges" (not necessarily dimensional). So range splitting can have some logic that should not be omitted. You will say that such specific cases would not be able to use auto_partitioner; correct, but there is the simple_partitioner as well.

The single-thread optimization could be applied to the PPL-compatible overloads of parallel_for, but it would not buy much, because these overloads always use auto_partitioner, which will do just a couple of splits in the absence of stealing.

We changed the default partitioner in TBB 2.2 from simple_partitioner to auto_partitioner because the latter is usually the right thing to use. We don't make such breaking changes lightly, but when giving talks on TBB, I kept apologizing that tbb::simple_partitioner was the default.

As others have commented, you can force partitioning down to the grainsize by using tbb::simple_partitioner. Below is what the affected line looks like after the change:

tbb::parallel_for( tbb::blocked_range<int>(0, SIZE, 1), Parallel(), tbb::simple_partitioner() );
