parallel_reduce's join gives different results

I ran an example parallel_reduce program. What I found is that when I use:
tbb::task_scheduler_init init(4);
parallel_reduce's split and join functions are not called. My workload is not that big, so this might happen.
But if I switch to:
int n = task_scheduler_init::default_num_threads();
which is 6 when I print it out, split and join are always called.

This causes a further problem. If I have this in join:

void join( const SumFoo& y ) {
    std::cout << "join " << y.my_sum << std::endl;
    my_sum += y.my_sum/2;
}

I get different results when join is called compared to when it is not called.
Why is that? Isn't this dangerous, since the logic can be wrong when the join function is skipped?


"my_sum+=y.my_sum/2" is not a valid reduction because it is not associative. With an associative operation, you would not care very much about split/join (other than for performance and differences at the limit of precision).

Even when the operations are associative, the sequence in which the reductions are made can vary when using floats or doubles. When the data being "reduced" are approximations with rounded precision, the eventual result may vary in the least significant bit(s) depending on the sequence of reduction. parallel_reduce's join will produce equivalent results only within some epsilon.

Jim Dempsey

www.quickthreadprogramming.com

Thanks, but putting my_sum += y.my_sum/2 there was just my experiment.
My question is: why, with

tbb::task_scheduler_init init(4);

are parallel_reduce's split and join functions not called,

but if I switch to:

int n = task_scheduler_init::default_num_threads();

split and join are called? The workload is the same, and I repeated the test many times; it is always the case.

Try a range that's long enough and you'll probably have split/join with 4 threads as well. If pressed I would guess that with 6 threads an auto_partitioner (the default) generates more chunks than with 4 threads, giving more parallel overhead and more opportunities for a thief, but I can't be certain that this is the explanation here. It also doesn't seem that important, if you can confirm that there's not a lot of work to begin with.

(Added 2012-02-18) And of course that's exactly what you did: "My work load is not that big, so this might happen."

hello,

did you use tbb::parallel_deterministic_reduce to get these results? The simple parallel_reduce is not deterministic. You can find it in Appendix D.3 of the Reference.

--Vladimir
