tbb::parallel_for and std::for_each

tbb::parallel_for and std::for_each

Usually we use tbb::parallel_for and std::for_each for the same goal. It is to applay some actions to all array elements.
But what will happaned if we would like to use it together?
Is it safe?
Here example

/*!
 * @brief Parallel class. It's functor class, that perform some action using operator().
 *  The class used by tbb::parallel_for
 */
class Parallel{
public:
    /*!
     * @brief Constructor. The class doesn't have default constructor.
     */
    Parallel(int* data) : m_data(data){}
    /*!
     * @brief this operator called by tbb::parallel_for concurently,
     *  so it should be thread safe. Otherwise it's dangerous to get thread issues.
     * @param range it's range, or as I prefer to name it diapason,
     *  each thread use the different values of it, when all thread will be executed
     *  all elements of m_data will be changed by "foo" function.
     */
    void operator()(const tbb::blocked_range& range) const{
        std::for_each(m_data + range.begin(), m_data + range.end(), &foo);
    }
private:
    int* m_data;
};

Is it thread safe?

Thanks in advance.

P.S. I've attached full file of my example.

13 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Sure (but only if it's also safe with a for loop, of course).

Thanks very much, I have once more quastion

I've tried to use tbb::parallel_for_each by the same way as std::for_each

void doParallelForEach(int* arr, size_t size){
    tbb::parallel_for_each(arr, arr + size, &foo);
}

But the resulted arrays have the different value, and time execution is more bigger then tbb::paralel_for and std::for_each.

I've updated the source file.

parallel_for_each doesn't benefit from the larger effective grainsize that auto_partitioner automatically chooses, so it will have a lot more overhead, which means that it's not suitable for small units of work like this. Besides that, other than in a little experiment like this, only use it if you can only get values one at a time from the iterator and cannot feed in more work yourself; with a random iterator, definitely use parallel_for instead.

I don't know why the outcome is different. It's very weird too: the values are off only by a few units, and when I look at parallel_for_each.h in tbb22_20091101oss (latest development release) I see that parallel_for_each_body::operator() uses "std::iterator_traits::value_type value", but for iterator "int*" that would mean just "int" rather than "int&", so I don't see how the program is able to modify the values at all. If nobody else has a suggestion, perhaps I could have another look later.

Best Reply

Thanks for reporting this issue. The root problem was that the C98 standard was ambiguous about whether std::for_each could be used to mutate a sequence. Since std::for_each is in the section about non-modifying sequence operations, we assumed that it could not modify the sequence. But 25.2.4 of the C++0x draft n3000.pdf adds a note that removes the ambiguity by saying such modifications are allowed. We need to fix parallel_for_each to agree with the note.

As Raf pointed out, the root problem is a missing &.

Index: parallel_for_each.h
===================================================================
--- parallel_for_each.h (revision 6378)
+++ parallel_for_each.h (working copy)
@@ -35,7 +35,7 @@
         parallel_for_each_body(const Function &_func) : my_func(_func) {}
         parallel_for_each_body(const parallel_for_each_body &_caller) : my_func(_caller.my_func) {}

-        void operator() ( typename std::iterator_traits::value_type value ) const {
+        void operator() ( typename std::iterator_traits::value_type& value ) const {
             my_func(value);
         }
     };

As Raf points out, the performance is poor because parallel_for_each (really the underlying parallel_do) does not use the auto_partitioner trick that parallel_for does, and thus suffers from thesmall grain sizes in theexample. We should perhaps revisit our decision to not rely on auto_partitioner for parallel_do.

The example routine "doParallel"based on parallel_for is probably the best solution.

"As Raf pointed out, the root problem is a missing &."
But then how did the program have any effect at all? I had already tried to substitute reference for value_type in parallel_for_each.h, but this does not seem to do the trick, although I haven't studied parallel_do.h to find out what might have to change there as well. And it's one thing to modify the referent of an input iterator in a serial algorithm, but doing that in a parallel algorithm goes against the arguably more important requirement of only using single-pass algorithms, where parallel equivalents would have to take copies, so now I don't think that this should even be attempted.

Indeed parallel_for_each is a single pass algorithm, so there's no problem with modifying the referent.I've added a test to src/test/test_parallel_for_each that checks that each referent is modified exactly once.

For the record, the fix did not make it into the commercial TBB 2.2 update 2 release, which came out today, but the fix should show up in subsequent releases.

"Indeed parallel_for_each is a single pass algorithm, so there's no problem with modifying the referent. I've added a test to src/test/test_parallel_for_each that checks that each referent is modified exactly once."
I think that the important thing with an input iterator is not that each referent is only used once (which would be "single use", a phrase that does not occur in C++ 1998), but rather that you only pass through the same iterator once ("single pass"). Maybe the referent is kept in a buffer that is overwritten when the iterator is incremented; if you have a copy of the previous value of the iterator, and use it later than its successor, you will see something else than with a forward iterator. This is summarised as "a==b implies "++a==++b", true for forward iterator, but not (necessarily) for an input iterator. Parallel execution requires complete freedom to reorder execution, so either parallel_do/parallel_for_each can only be used with forward iterators, or they have to take a copy of the referent if they want to work with an input iterator. How would the algorithm be allowed to behave differently based on the characteristics of the actual iterator arguments?

The tbb::parallel_do that underlies tbb::parallel_for_each uses 3 different algorithms, depending upon the kind of iterator:

input iterator: copies the referent. This case will do the wrong thing if the reference is modified, but it's a non-mutable iterator anyway. Note that N3000 (current C++0x draft) does not allow a std::for_each to modify the referent in this case.

forward iterator: copies the iterator. It's two passes by the iterator, but only the second pass dereferences the iterator. Hence user modifications of the referent are okay.

random access iterator: does recursive subdivision of the half-open interval [first,last). User modifications of the referent are okay.

#6 "Indeed parallel_for_each is a single pass algorithm, so there's no problem with modifying the referent."
Ah, but in #8 you show that you are detecting the actual category of the iterators, and disregarding this input-iterator restriction for forward iterators and up.

Are you using traits to make that distinction? How reliable are they? Are they at least conservative, i.e., no input iterator ever marked forward?

Still, I'm having trouble with the mutability thing that's referred to in N3000 for for_each. Earlier the specification only applies that characteristic to forward iterators and up (that hasn't changed in N3000 at first sight), so how does that work with mere input iterators? And will the compiler catch any misuse based on const (all that the provided function can express is its parameter type, it cannot check the iterator's category) or is the user entirely responsible?

Basically, why I'm still looking at this, when I tried to use reference or value_type& in tbb22_20091101oss, like in #4, I didn't see any improvement in the example, I think from #2. Have you verified that things work with just this change and should I therefore check my findings again?

STL libraries commonly dispatch on traits information, hence it is assumed to be reliable.

As far as I know, the compiler will not catch misue of a C++0x for_each trying to modify the reference of an input iterator. So yes, it falls on the user to get it right. Perhaps the best way is for the user to write their input iterator i so that *i is not mutable.

With the patch applied to parallel_for_each.h, Denis' example runs correctly for me. Below is the output I got on a two-socket (8 core) i7 box:

$ main.exe
It is execution time in seconds of doSerialWithFor: 0.065957
It is execution time in seconds of doSerialWithForEach: 0.065682
It is execution time in seconds of doParallel: 0.010081
It is execution time in seconds of doParallelForEach: 0.150337
$

That's a respectable 6.5x speedup between doSerialWithFor and doParallel. As previously noted, parallel_for_each suffers from not being able to automatically adjust the grain size.

Ah, I must have done something wrong when I tested the modification earlier, because now it works all right. Thanks for the tutorial info. :-)

Thanks for answers and discussion.
Raf Schietekat, sorry, I've removed your last post accidentally.

Leave a Comment

Please sign in to add a comment. Not a member? Join today