enumerable_thread_specific and nested parallel_for

enumerable_thread_specific and nested parallel_for


I'm working on the following problem: I need to iterate over a 3D grid and create 8 std::vector<TInfo>s, where TInfo instances represent a set of operations. This is kind of a coloring algorithm where each of the 8 vectors corresponds to a color, and all the TInfo instances in the same vector can be  processed concurrently without data races. This been said, I can do something like the following (not necessarelly valid c++ code):

std::vector<TInfo> tinfoVectors[8];
obtainTInfoVectors(grid,tinfoVectors); // This is a sequential procedure
for (int i = 0; i < 8; i++)
    parallel_for(blocked_range<size_t>(0,tinfoVectors[i].size()), FtorProcessTInfo(tinfoVector[i]));

I want to parallelize obtainTInfoVectors. This function goes through all the grid cells and creates 8 TInfo​ instances per cell, and pushes the instances into the right tinfoVectors. This means that in a multi-threaded implementation, different threads will try to add elements at the same time to tinfoVectors​ creating data races. Two solutions come to my mind:

  1. Replace std::vector with tbb::concurrent_vector: the problem is that obtainTInfoVectors​ very intensivelly adds elements to the vectors, and I don't really expect to have much gain using tbb::concurrent_vector. Thus, I discarded this option.
  2. To use tbb::enumerable_thread_specific to have a tinfoVectors per thread. This, I beleive, is more efficient.

Using  tbb::enumerable_thread_specific, the code can change to something like this:

typedef std::vector<TInfo> TInfoVectors[8];
enumerable_thread_specific<TInfoVectors> tinfoVecsTLS;
for (int i = 0; i < 8; i++)
    for (t = tinfoVecsTLS.begin(); t != tinfoVecsTLS.end(); t++) //LOOP1
        parallel_for(blocked_range<size_t>(0,(*t)[i].size()), FtorProcessTInfo((*t)[i]));//LOOP2

However, since parallel_for has some implicit synchronization, I'm afraid that LOOP1 can introduce unnecessary overhead. If I could flatten LOOP1 and LOOP2 into one loop and perform a parallel_for would be ideal. I'm aware of flattened2d​, but since it only supports forward iterator is not suited for parallel_for, and using parallel_do instead would result in unnecessary overhead also. So I'm thinking of make LOOP1 parallel as well, and get something like the following:

struct FtorProcessTInfoVecTLS{
    typedef enumerable_thread_specific<TInfoVectors>::const_range_type range_t;
    int i;
    FtorProcessTInfoVecTLS(int i_p)
    : i(i_p)
    void operator()(range_t & r) const {
        // the range only has one element
         parallel_for(blocked_range<size_t>(0,(*r.begin())[i].size()), FtorProcessTInfo((*r.begin())[i]));//LOOP2
for (int i = 0; i < 8; i++)
    parallel_for(tinfoVecsTLS.range(1), FtorProcessTInfoVecTLS(i),simple_partitioner());//LOOP1

However, I'm not sure how nesting the parallel_for inside another parallel_for would work, nor if it is the best way to do it. Any comment or suggestions are very welcome.

Thanks in advance!

2 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Hi, nesting works perfectly with TBB. If you are concerned about every last bit of the performance and the work is very small, use the same task_group_context for all parallel_fors

Leave a Comment

Please sign in to add a comment. Not a member? Join today