parallel_for and parallel_for_each usage with STL vector

parallel_for and parallel_for_each usage with STL vector

Hi,

I am playing around with TBB's parallel_for and parallel_for_each in combination with an STL vector. I used the following test code (also in the attachement) with the corresponding questions below:

#include <iostream>
#include <algorithm>
#include <vector>
#include "tbb/parallel_for_each.h"
#include "tbb/tick_count.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

int rnd() {
return (rand() % 100);
}

int main()
{
int addition = 5;
tbb::tick_count before = tbb::tick_count::now();
std::vector<int> numbers(100000000);
std::generate(numbers.begin(), numbers.end(), rnd);
tbb::tick_count after = tbb::tick_count::now();
std::printf("time spent (generate random): t%g secondsn", (after - before).seconds());

//std::for_each version
before = tbb::tick_count::now();
std::for_each(numbers.begin(), numbers.end(), [&] (int &v) { v += addition;});
after = tbb::tick_count::now();
std::printf("time spent (std::for_each): t%g secondsn", (after - before).seconds());

//tbb::parallel_for_each version
before = tbb::tick_count::now();
tbb::parallel_for_each(numbers.begin(), numbers.end(), [&] (int &v) { v += addition;});
after = tbb::tick_count::now();
std::printf("time spent (tbb::parallel_for_each): t%g secondsn", (after - before).seconds());

//tbb::parallel_for iterator version
before = tbb::tick_count::now();
tbb::parallel_for(
tbb::blocked_range<std::vector<int>::iterator>(numbers.begin(),
numbers.end()),
[&] (tbb::blocked_range<std::vector<int>::iterator> number) {
for (std::vector<int>::iterator it = number.begin(); it != number.end(); it++) {
(*it) += addition;
}
});
after = tbb::tick_count::now();
std::printf("time spent (tbb::parallel_for iterator): t%g secondsn", (after - before).seconds());

//tbb::parallel_for index version
before = tbb::tick_count::now();
tbb::parallel_for(size_t(0), size_t(numbers.size()),
[&] (size_t index) { numbers[index] += addition; });
after = tbb::tick_count::now();
std::printf("time spent (tbb::parallel_for index): t%g secondsn",(after - before).seconds());

return 0;
}

std::for_each version:
The execution time of this version is 0.488757 seconds.

tbb::parallel_for_each version:
The measurement of adding 5 to every element using the parallel_for_each is by far the slowest with 8.59165 seconds. Does parallel_for_each execute each iteration in its own thread producing the overhead to slow down this version? Does this mean, parallel_for_each should only be used if each iteration needs a large number of CPU cycles?

tbb::parallel_for iterator version:
I am not sure whether this implementation using iterators is the simplest one. Is there a shorter / simpler implementation of using parallel_for with iterator of an STL vector? The execution time for this version is: 1.12382 seconds. This is slower than using std::for_each and gets me to the question, under which condition parallel_for using iterator is a practical solution?

parallel_for index version:
The execution time for this version is 0.452654 seconds which tops the serial std::for_each version and is what I expected. Is this the proposed way to use parallel_for with an STL vector?

Further, my assumption is that using an std::vector in this test code is safe, also the parallel writes at different indexes, as the elements are independent and the vector is not expected to grow. Is this assumption correct?

Thanks for feedback,
martin

The measurements are done on Win7 using Microsoft Visual C++ Compiler Version 16.00.30319.01, TBB 4.1, on Intel i7-2620M CPU @ 2.70 GHz

AttachmentSize
Downloadtext/x-c++src main.cpp1.92 KB
14 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

*** do not use rand inside test load code ***
Functions such as this have a critical section. Meaning your parallel test runs a serial section (plus overhead to serialize the critical section).
Consider changing your test load function to something that does not use a critical section.

Generally speaking:

Use for( or for_each when the amount of computation is small (not worth the overhead to parallelize).
Use parallel_for when you have a large number of objects and each object has equal work of small work.
Use parallel_for with appropriate grain size when you have a large number of objects and each object has un-equal work of small work.
Use parallel_for_each when each objects work is relatively large and number of objects is few and each object has un-equal work.
When number of objects is very small and known, consider using switch(nObjects) and cases using parallel_invoke in each case that you wish to impliment and parallel_for_each for the default case.

Jim Dempsey

www.quickthreadprogramming.com

#0 "tbb::parallel_for_each version:
The measurement of adding 5 to every element using the parallel_for_each is by far the slowest with 8.59165 seconds. Does parallel_for_each execute each iteration in its own thread producing the overhead to slow down this version? Does this mean, parallel_for_each should only be used if each iteration needs a large number of CPU cycles?"
Perhaps it could easily be specialised for random iterators using parallel_for() instead of parallel_do(), whose implementation for random iterators is not very sophisticated compared to parallel_for() (equivalent of simple_partitioner resp. implicit auto_partitioner)? Or maybe parallel_do() itself could be improved (also implicit auto_partitioner for random iterators, maybe some concept of grainsize)? Perhaps you could make a comparison with parallel_for() explicitly configured with simple_partitioner and grainsize 1 to see whether this fully explains the difference. It is far less clear whether eliminating just the feeder requirement from the other specialisations would allow any easy improvement, probably not. Note that using separate threads would be a great deal slower still, as parallel_do() is not exactly a slouch. Some feedback from the TBB team would be appreciated: should I contribute such a specialisation?

#0 "tbb::parallel_for iterator version:
I am not sure whether this implementation using iterators is the simplest one. Is there a shorter / simpler implementation of using parallel_for with iterator of an STL vector? The execution time for this version is: 1.12382 seconds. This is slower than using std::for_each and gets me to the question, under which condition parallel_for using iterator is a practical solution?"
You should probably "hoist" the evaluation of end() out of the loop, by defining "it_end = number.end()" next to "it" (unfortunately you cannot make it_end const without taking that variable out of the loop header), which may allow considerable optimisation from the compiler. Is the documentation not clear about that (I didn't check that myself at this time)? You can probably also pass the blocked_range as a const reference, but I'm not sure what the effect will be. Please report your findings.

#0 "parallel_for index version:
The execution time for this version is 0.452654 seconds which tops the serial std::for_each version and is what I expected. Is this the proposed way to use parallel_for with an STL vector?"
That seems abysmally bad, unless you're running this on a single thread? You should get very good scalability for this simple kernel.

#0 "Further, my assumption is that using an std::vector in this test code is safe, also the parallel writes at different indexes, as the elements are independent and the vector is not expected to grow. Is this assumption correct?"
I think the C++ specification is deficient by neglecting to explicitly provide this guarantee (unless I overlooked it), but normally it's a safe assumption.

#1 "*** do not use rand inside test load code ***
Functions such as this have a critical section. Meaning your parallel test runs a serial section (plus overhead to serialize the critical section).
Consider changing your test load function to something that does not use a critical section."
This is probably harmless in the test setup.

#1 "Generally speaking:
Use for( or for_each when the amount of computation is small (not worth the overhead to parallelize).
Use parallel_for when you have a large number of objects and each object has equal work of small work.
Use parallel_for with appropriate grain size when you have a large number of objects and each object has un-equal work of small work.
Use parallel_for_each when each objects work is relatively large and number of objects is few and each object has un-equal work.
When number of objects is very small and known, consider using switch(nObjects) and cases using parallel_invoke in each case that you wish to impliment and parallel_for_each for the default case."
Hmm, using a parallel construct instead of a serial construct allows cancellation, use grainsize is only indirectly related to equal/unequal work (extremely unequal work precludes auto_partitioner and simple_partitioner works best with a properly configured grainsize), parallel_for_each is never quicker than parallel_for (assuming random iterators) but may be more convenient than parallel_for with explicit simple_partitioner (assuming stated conditions), and I'm curious what speedup you could possibly get from switch and parallel_invoke().

Perhaps you could make a comparison with parallel_for() explicitly configured with simple_partitioner and grainsize 1 to see whether this fully explains the difference

I did this test as proposed and I get an execution time of 10.2749 seconds which is approximately 2 seconds slower than using parallel_for_each. So having grainsize 1 and using the simple_partitioner disables the automatic chunking and spawns a thread for every iteration. This is slower than parallel_for_each which let me assume that parallel_for_each works a little differently as you explained in the answer.

You should probably "hoist" the evaluation of end() out of the loop, by defining "it_end = number.end()" next to "it" (unfortunately you cannot make it_end const without taking that variable out of the loop header), which may allow considerable optimisation from the compiler

I did the proposed optimizations:


	tbb::parallel_for(

			tbb::blocked_range::iterator>(numbers.begin(),

					numbers.end()),

			[&] (const tbb::blocked_range::iterator> number) {

				const std::vector::iterator it_end = number.end();

				for (std::vector::iterator it = number.begin(); it != it_end; it++) {

					(*it) += addition;

				}

			});


The execution time did not significantly change.

Is the documentation not clear about that (I didn't check that myself at this time)?

I couldn't find anything in the documentation that explains the usage of iterators with TBB. A similar question I found here: http://software.intel.com/en-us/forums/topic/304696

The execution time for this version is 0.452654 seconds which tops the serial std::for_each version and is what I expected. Is this the proposed way to use parallel_for with an STL vector?"
That seems abysmally bad, unless you're running this on a single thread? You should get very good scalability for this simple kernel

As far as I know, the default number of thread is equals to the number of cores. Which are 4 in my case.
But I figured out, that the version written with the range_concept seems to be faster. Execution time: 0.317561 seconds


	tbb::parallel_for(

			tbb::blocked_range(size_t(0), size_t(numbers.size())),

			[&] (const tbb::blocked_range range) {

				for (int it = range.begin(); it != range.end(); it++) {

					numbers[it] += addition;

				}

			});

So having grainsize 1 and using the simple_partitioner disables the automatic chunking and spawns a thread for every iteration.

A "task", not a "thread" (essential difference...).

I did the proposed optimizations:

Oh yes, there's another level in the iterator. If you can get rid of that by iterating over pointers (vector now requires consecutive layout of elements), with boundary values evaluated at the start of the loop of course, perhaps that's what the compiler needs to work its magic.

I couldn't find anything in the documentation that explains the usage of iterators with TBB. A similar question I found here: http://software.intel.com/en-us/forums/topic/304696

I specifically meant the "hoisting" of end() from the serial loop. TBB team?

But I figured out, that the version written with the range_concept seems to be faster.

You should get near-perfect scalability with end() out of the loop and using pointers. I wouldn't be satisfied with anything noticeably slower than about 0.13 seconds (for_each timing divided by 4).

Raf,
>parallel_for_each is never quicker than parallel_for
>

Never say never (again), James Bond, 1983

When a system is heavily loaded, not only with the current application, but others as well, thread availability as well as work runtime is highly undeterministic. Therefor, when work is relatively large as compared to task management overhead, then unpredictible thread scheduling is often best served using a non-deterministice task scheduling method such as parallel_for_each. While some implimentations of parallel_for_each (e.g. TBB) may choose to impliment via one task schedule per element, others (e.g. QuickThread) task schedules once per waiting thread, then uses atomics thereafter. Once started, the overhead per iteration is that of as little as an atomic incriment.
Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:
Therefor, when work is relatively large as compared to task management overhead, then unpredictible thread scheduling is often best served using a non-deterministice task scheduling method such as parallel_for_each.

Well, I really meant more specifically that, in TBB, and in situations with random iterators, based on reasoning also about their implementations, you can always expect to at least match parallel_for_each() performance with a properly configured parallel_for(). But I concede that this may be the wrong expectation where those assumptions break down. Overloading from external causes is an unsolved problem. Still, one should probably always bundle work into units (tasks) that are significantly larger than their scheduling overhead, so I'm not sure that this is a useful exploration of something not easily solved otherwise (by said bundling). As for atomic increments, for these to be faster than the specialised implementation with recursive parallelism for random iterators, doesn't the situation first have to degrade close to the point of only having one thread, or is my assumption of central sharing (or anything else) invalid?

>>As for atomic increments, for these to be faster than the specialised implementation with recursive parallelism for random iterators, doesn't the situation first have to degrade close to the point of only having one thread, or is my assumption of central sharing (or anything else) invalid?

One cannot make a generalization that works best under all circumstances. Each technique has its own characteristics which can be used by the programmer in selecting which to use. Regardless of paradigm, each has a means of partitioning a workset (e.g. stl vector). Division generally falls into:
a) divide number of elements in vector by number of threads (accounting for 0, and remainder) to produce number of tasks.
b) divide number of elements in vector by grainsize value (accounting for 0, and remainder) (OpenMP technique) to produce number of tasks.
c) when number of elements in vector larger than grainsize split vector in two and recursively enqueue as two tasks, else perform work on porton of vector
c) enqueue one task per element in vector (TBB parallel_for_each)
d) enqueue one task per min(number of elements in vector, number of threads in threadset) where each thread performs an atomic pick of next element in vector (QuickThread parallel_for_each)
e) user hybrid of parallel_for_each, where each iteration (pick) consumes specified grainsize number of elements in vector.
f) divide number of elements in vector by number of idle threads in thread pool (QuickThead can do this)
...other
The selection decision of technique generally is often determined through testing. The code changes are trivial, running the tests can be time consuming. As you build experience, you can often choose the better candidate by understanding the problem at hand.
Jim Dempsey

www.quickthreadprogramming.com

Thanks for all the hints.

I could get a great speedup using pointers instead of the iterator. Actually this difference is based on the implementation of the vector, I think. So the following code executes in
0.126289 seconds.


tbb::parallel_for(

	tbb::blocked_range(numbers.data(),

			numbers.data() + numbers.size()),

	[&] (const tbb::blocked_range number) {

		const int* it_end = number.end();

		for (int* it = number.begin(); it != it_end; it++) {

			(*it) += addition;

		}

	});

Quote:

Actually this difference is based on the implementation of the vector, I think.

Probably.

Conceivably another compiler with a smarter optimiser may still be able to overcome the obstacles in the implementation, so perhaps somebody could try this on Windows with Intel's compiler (assuming that compiler means compiler, not the whole development environment with libraries and all as Windows people seem to tend to assume)?

Meanwhile, it would be nice to have a more compact way to do this simple thing in TBB without having to make any assumptions about the environment, so how about an ad-hoc tbb::parallel_for specialisation for std::vector::iterator-based ranges that transparently substitutes pointer-based iteration? You're on a roll, so I'll let you try this first.

And why wouldn't there even not be functions to make a TBB range directly from a container instead of explicitly going through the legacy begin()/end() interface (let alone those cases where the container is the result of a lengthy expression), emulating the new range-based for statement in C++11 for semantics but with appropriate specialisations to transparently achieve the same effect as described above?

I know this is an old thread but.....

time spent (generate random): 4.61568 seconds
time spent (std::for_each): 0.671217 seconds
time spent (tbb::parallel_for_each): 715.845 seconds
time spent (tbb::parallel_for iterator): 203.174 seconds
time spent (tbb::parallel_for index): 0.234936 seconds
 

Intel Xeon e5620 2.40ghz, 16gb Ram

 

 

 

parallel_for_each() supports input iterators or higher, and is implemented on top of parallel_do(), but has not been specialised for random-access iterators, with a more efficient implementation on top of parallel_for(). I think that the Reference Manual should at least have a warning about that, with the advice to use parallel_for where possible. A specialisation would still be beneficial for use in a template where the actual category is not known beforehand.

For parallel_for with an iterator, could you provide results for both debug and release?

(BTW, in the online Reference Manual, "Index_type last" should be "Index last", instead.)

With a few modifications (tried), parallel_for_each can be as fast as parallel_for (verified), given a random-access iterator of course. Any interest?

I'm not sure why we even need two different function templates. Is it to reassure ourselves that we're getting top scalability from parallel_for? That would be a good reason, I suppose. But I don't think we need any promise that parallel_for_each will always be (much) slower...

(2015-02-07 Added) parallel_do() already differentiates its implementation based on the category of the iterator arguments.

Leave a Comment

Please sign in to add a comment. Not a member? Join today