Difference in compact parallel_for and tbb::parallel_for using blocked_range according to step size

Difference in compact parallel_for and tbb::parallel_for using blocked_range according to step size

I have two parallel_for calls that have a step_size of 2. The first version uses the compact notation of parallel_for with step_size. My assumptions: On default this uses the auto_partitioner und grainsize is set to 1. This version looks as follows:

void addValueOnElementsB(int data[], int size, int const value) {
   tbb::parallel_for(0, size, 2, [data, value](const int& i) {
      data[i] = data[i + 1] * value;
   });
}

I tried to do the same in parallel_for using a blocked_range where iteration over a subrange uses a += 2 for the same step size. On default here, auto_partitioner and grain_size = 1 are used too. This version looks as follows:

void addValueOnElementsC(int data[], int size, int const value) {
   tbb::parallel_for(tbb::blocked_range<int>(0, size), [data, value](tbb::blocked_range<int> r) {
      for (int i = r.begin(); i < r.end(); i += 2) {
         data[i] = data[i + 1] * value;
      }
   });
}

I couldn't manage to get the same results. Actually with auto_partitioner I cannot deterministicly influence the chunk size, therefore an equal result can not be expected for every execution at all. E.g. a resulting chunk size of 1 results definitively in a different result compared to version 1 as every element is changed. Assuming "size" to be 20, grainsize is set to 2 and the simple_partitioner is used. Nothing prevents from having a chunksize equal to 1 which leads as analysed to a different result. If my analysis is correct, how does version 1 works? Does it make use of parallelization? Is it possible to use parallel_for using blocked_range and different step size than 1?

11 posts / novo 0
Último post
Para obter mais informações sobre otimizações de compiladores, consulte Aviso sobre otimizações.

The compact form always uses auto_partitioner and grainsize 1, those are not overridable defaults.

To do the same with a blocked_range, make sure to start with a correctly aligned i, something like ((r.begin()-start+step_size)/step_size)*step_size-step_size+start if I'm not mistaken (here start is 0 and step_size is 2 of course). Some chunks may not contain any iterations this way, so it's not the most efficient thing to do, but that can be changed by iterating over a range scaled down by step_size and using the correct offsets after scaling up again, which I'll leave to you, or using an appropriate grainsize (at least 2*step_size-1). The simple_partitioner is only used if explicitly specified.

Note that grainsize has different meanings depending on how the iteration domain is scaled: original as in the straightforward blocked_range implementation, or scaled as in the compact form or the suggested alternative.

(Expanded 2013-01-17)

(Added 2013-01-18) See correction below.

Thanks for this explanation.

I achieved the implementation with the scaled down range and the upscaling to get to the correct offset. Actually I had a look into the TBB source to find out :).So here follows the implementation using blocked_range with step size 2 and is equivalent to the first version of my first post (of course it could be more generic):

void addValueOnElementsC(int data[], int size, int const value) {
  int end = ((data + size) - data - 1) / 2 + 1; //scale down the range (last - first - 1) / step_size + 1
  tbb::parallel_for(tbb::blocked_range<int>(0, end), [data, value](const tbb::blocked_range<int>& r) {
     //scale up to access the correct offsets: k = first + range_iteration * step_size, with increment k += 2
     for(int i = r.begin(), k = 0 + i * 2; i < r.end(); i++, k = k + 2) {
        data[k] = data[k + 1] * value;
     }
  });
}

I see now too what you meant with die different interpretation of the grainsize if the step size is not one. Thanks!

Sorry, I should have suggested (((r.begin()-start)+(step_size-1))/step_size)*step_size+start instead (hopefully...).

The solution you looked at has a similar floor-to-ceil transformation (except that it's just once and for the right end), which you don't need here (size should be divisible by step_size for this elemental function) and maybe don't even want (if size is uneven you would be reading past the end of data[]), although it's probably better to check size (more informative) than to simplify ad hoc and silently ignore any widowed element (to borrow a term from typesetting).

You are aware that you have a fundamental error in your body of your loop.

At the junction of slices of the iterations space, when the first element of a following slice is updated before the computation of the last element of the prior slice, then the last element of the prior slice is incorrect.

Jim Dempsey

www.quickthreadprogramming.com

When step_size is obeyed (correctly), there's no interference (even elements are set from odd elements).

Then what happens on the second pass (not shown) for the odd elements?

What is not indicated from the post is if the array data[] represents a single state or represents two states alternateing with even's and odd's on iteration passes.

Jim Dempsey

www.quickthreadprogramming.com

Consider:

void addValueOnElementsC(int data[], int size, int const value) { 
   tbb::parallel_for(tbb::blocked_range<int>(0, size/2), [data, value](tbb::blocked_range<int> r) { 
       for (int i = r.begin(); i < r.end(); ++i) {
           int i = i * 2;
           data[j] = data[j + 1] * value;
       } 
    }); 
 }

Jim Dempsey

www.quickthreadprogramming.com

I just took step_size at face value without concerning myself with how the function gets called. Did I overlook something?

The block range gives no assurance that the resultant (sub)range is a multiple of 2 (as the body of the code assumes/requies). Should the partitioning produce an odd sizes partition, then the algorithm falls appart (odd partition indexes past end of array, following partition begins at odd cell).

Halving the range at the partitioning, then doubling in the body assures always starting at even index (of complete array for all partitions). IOW actual data partitions are forced to multiple of 2 (in this situation).

Jim Dempsey

www.quickthreadprogramming.com

In my advice above I favoured checking "size" over simply ignoring any widowed element(s) (if something like that occurs, something worse might be happening, so it would seem better to fail early and loudly), but I still wanted to answer the question in general (without the divisibility constraint related to this specific elemental function).

BTW, an obvious alternative is to pass size/step_size (or the value N from which "size" was computed as N*step_size), or to simply assume that all this is clear from the neighbouring code anyway.

Deixar um comentário

Faça login para adicionar um comentário. Não é membro? Inscreva-se hoje mesmo!