I have two parallel_for calls that use a step size of 2. The first version uses the compact form of parallel_for with a step argument. My assumption: by default this uses the auto_partitioner, and the grainsize is set to 1. This version looks as follows:

    void addValueOnElementsB(int data[], int size, int const value) {
        tbb::parallel_for(0, size, 2, [data, value](const int& i) {
            data[i] = data[i + 1] * value;
        });
    }

I tried to do the same with parallel_for and a blocked_range, where the loop over each subrange uses i += 2 to get the same step size. By default, auto_partitioner and a grainsize of 1 are used here too. This version looks as follows:

    void addValueOnElementsC(int data[], int size, int const value) {
        tbb::parallel_for(tbb::blocked_range<int>(0, size),
            [data, value](tbb::blocked_range<int> r) {
                for (int i = r.begin(); i < r.end(); i += 2) {
                    data[i] = data[i + 1] * value;
                }
            });
    }

I couldn't get the same results from both versions. With auto_partitioner I cannot deterministically influence the chunk size, so identical results cannot be expected across executions at all. For example, a chunk size of 1 definitely produces a different result from version 1, because then every element is changed instead of every second one. Even assuming "size" is 20, the grainsize is set to 2, and the simple_partitioner is used, nothing prevents a chunk size of 1, which, as analysed, leads to a different result. If my analysis is correct, how does version 1 work? Does it make use of parallelization? And is it possible to use parallel_for with a blocked_range and a step size other than 1?