Submitted by Alex Katranov (Intel) on

Computer Arithmetic has a lot of peculiarities [1]. One of these pitfalls is associativity failure in floating point arithmetic. For example, the two sums of fractions calculations below will not produce the same result when using `float`

s:

In a sequential program, it is not a big problem since the calculation order is exactly specified so the result is predictable and repeatable. The situation is not so clear in parallel programming.

To make the example parallel, I used the parallel_reduce template function from Intel® Threading Building Blocks (Intel® TBB):

[cpp:nocontrols]std::vector<float> arr( N, 1.0f/(float)N );

float sum = tbb::parallel_reduce( tbb::blocked_range( arr.begin(), arr.end() ), 0.0f,

[]( const tbb::blocked_range& r, float sum ) {

return std::accumulate( r.begin(), r.end(), sum );

},

std::plus<float>() );

std::cout << sum << std::endl;[/cpp:nocontrols]

As in the examples above, the code calculates the sum of N fractions, but it uses multiple processor cores if available. As it is well known, we face a disappointing fact of different results being possible for different orders of calculations. If we run it 10 times and N=1000 we will get something like this:

0.999991

1

0.999999

0.999996

0.999998

0.999998

0.999998

1

0.999997

0.999998

It’s worth mentioning that the result differs from run to run! In spite of the fact that the developer specifies the calculations – when it is calculated in parallel the order of calculation gets out of control.

On the other hand, it is not as bad as all that. Although the OS operates on threads and fills the application with indeterminism, it is still possible to manage the order of calculations. One of the new features of Intel TBB 4.0 is the parallel_deterministic_reduce template algorithm. The algorithm has the same interface as parallel_reduce except that it does not allow you to specify a partitioner. (For parallel_reduce it is possible to pass a partitioner as the last argument.) We will discuss why this restriction exists later. But for now, let’s replace the parallel_reduce with parallel_deterministic_reduce and look at how the result changes:

[cpp:nocontrols]std::vector<float> arr( N, 1.0f/(float)N );

float sum = tbb::parallel_deterministic_reduce( tbb::blocked_range( arr.begin(), arr.end() ), 0.0f,

[]( const tbb::blocked_range& r, float sum ) {

return std::accumulate( r.begin(), r.end(), sum );

},

std::plus<float>() );

std::cout << sum << std::endl;[/cpp:nocontrols]

Again run it 10 times:

1

1

1

1

1

1

1

1

1

1

The key point here is that the result is the same from run to run.

The sources of non-determinism in parallel_reduce derive from partitioning and body splitting. Let’s consider each of these subjects:

- Partitioning. The simple_partitioner determines exactly how many and which subranges are created. It splits the iteration range until each subrange is smaller than a given grain size. Thus the behavior only depends on the range size and grain size specified by the developer. However, other types of partitioning in Intel TBB are non-deterministic: to improve performance of the algorithms, range splitting provided by these partitioners depends on run-time stealing events, which we cannot predict.

- Body splitting. For performance reasons parallel_reduce minimizes body copies: it splits the body only when consecutive subranges are processed by different threads. Thus body splitting, like “advanced” partitioning, also depends on non-deterministic task stealing.

The example shows that parallel_reduce is really inapplicable for non-associative operations like floating point arithmetic. To achieve a repeatable result from a reduction with non-associative operations parallel_deterministic_reduce has been developed. From the considerations of partitioning (given above), it follows that only the simple_partitioner can be used for parallel_deterministic_reduce; and thus, no choice of an alternative partitioner is possible. Consequently, parallel_deterministic_reduce always challenges us with choosing an appropriate grain size. And smart body splitting has been disabled for the sake of deterministic behavior, so for each subrange a new body is created. This fact complicates the challenge of grain size selection even more: on the one hand, a small grain size increases the number of body copying and overall overhead, but on the other hand, a big grain size may lead to imbalance and underutilization. Fig. 1 shows the relative performance of parallel_deterministic_reduce (simple_partitioner with various grain sizes) in comparison with parallel_reduce (auto_partitioner with default grain size). An appropriate grain size provides the same performance of parallel_deterministic_reduce as parallel_reduce, - but an incorrectly chosen grain size may lead to significant performance degradation, as shown in Fig.1 at the extremes of the grain size axis.

Fig.1. Comparison of parallel_reduce (auto_partitioner) and parallel_deterministic_reduce (simple_partitioner) on Pi calculation example.

To demonstrate the split-join order behavior of parallel_deterministic_reduce, a small example is given with range [0, 20) and grain size = 5, similar to examples for parallel_reduce in the Intel TBB Reference manual:

A tree of subranges

For each right node a new body is created by the body split constructor. The slash marks (/) in the tree show where the body split is performed. Thus, for the current example the parallel_deterministic_reduce will always produce 4 subranges and 4 different bodies associated with them. Each of these subranges may be executed in parallel. When both children of a node finish, the corresponding bodies are merged: the right child body “added” to the left child body (in our examples via the `std::plus<float>()`

binary function).

To conclude, parallel_deterministic_reduce provides a deterministic number and deterministic sizes of subranges, and it exactly defines which pairs of subranges are merged. It’s important to note that a repeatable result obtained with parallel_deterministic_reduce may still be different from that obtained via serial execution. Moreover, the results may be different for various grain sizes, since range splitting depends on the grain size. Also, the algorithm is not targeted to improve the accuracy of calculations. The exact result of 1 in the above example of fraction sum calculation has been obtained by chance. For other examples the algorithm can cause a decrease in accuracy. Overall, parallel_deterministic_reduce is not a replacement to parallel_reduce but an alternative solution for those who need repeatability.

## Comments (1)

Topanton.pegushin said on

Hi, nice job,

though for the sake of the argument you should have put ten 0.999998-s as an output for parallel_deterministic_reduce even if you really got ten ones :). All through the post up until I got to the point "It's important to note ..." it felt like I was watching a magic trick, I kept thinking "come on, I know arriving to all ones does not depend on the way you split the range and the way tasks get stolen".

## Add a Comment

Top(For technical discussions visit our developer forums. For site or software product issues contact support.)

Please sign in to add a comment. Not a member? Join today