parallel_reduce strange behavior

Hello everyone,
I was testing very basic examples from the Intel TBB book, and the behavior is very strange.
I ran the following code (which just does a parallel multiply, a sum, and a minimum).
The problems:
---------------
1) On my laptop (Core 2 Duo, 2.4 GHz), the parallel_reduce in the summing is faster when I construct the scheduler with task_scheduler_init::deferred and call init.initialize(1) than with the automatic initialization, which uses both cores!
2) I ran the same code on my PC (Xeon, 8 cores, 3.0 GHz):
   a) the summing result is different!
   b) on 8 cores it is much, much slower than on 2 cores, whether with automatic, simple, or manual grain size!

Laptop -> Linux Ubuntu, 32 bits, TBB 2.2
PC -> Fedora, 64 bits, TBB 2.1

How come?
Many thanks for your help.

    #include "tbb/task_scheduler_init.h"
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    #include "tbb/parallel_reduce.h"
    #include <iostream>
    #include <cmath>

    using namespace std;
    using namespace tbb;

    const int dim = 100000000;

    void Foo( double *x , double *y , double *z );

    void Foo( double *x , double *y , double *z )
    {
        *z = *x * *y * sqrt(*x) * sqrt(*y) * log10(*x * *y) / 3 ; // any function
    }

    class VectorMultiply
    {
    private:
        double *const a , *const b , *const c ;
    public:
        VectorMultiply( double *x , double *y , double *z ) : a(x) , b(y) , c(z) {}
        void operator()( const blocked_range<size_t>& r ) const
        {
            double *m , *n , *l ;
            m = a ; n = b ; l = c ;
            for( size_t i=r.begin(); i!=r.end(); ++i )
            {
    //          l[i] = m[i] * n[i] * sqrt(m[i]) * sqrt(n[i]) * log10(m[i]*n[i]) * m[i] * n[i] * sqrt(m[i]) * sqrt(n[i]) * log10(m[i]*n[i]) / 3 ;
    //          Foo( &a[i] , &b[i] , &c[i]) ;
                c[i] = a[i] + b[i] ;
    //          c[i] = a[i] * b[i] * sqrt(a[i]) * sqrt(b[i]) * log10(a[i] * b[i]) / 3 ;
    //          c[i] = a[i] * b[i] * sqrt(a[i]) * sqrt(b[i]) * log10(a[i]*b[i]) * a[i] * b[i] * sqrt(a[i]) * sqrt(b[i]) * log10(a[i]*b[i]) / 3 ;
            }
        }
    };

    class FindSum
    {
    private:
        double * my_a ;
    public:
        double sum ;
        FindSum() : my_a(NULL) , sum(0) {}
        FindSum( double * x ) : my_a(x) , sum(0) {}
        void operator()( const blocked_range<size_t>& r )
        {
            for ( size_t i = 0 ; i != r.end() ; i++ )
                sum += my_a[i] ;
        }
        FindSum( FindSum& x , split ) : my_a(x.my_a) , sum(0) {}
        void join( const FindSum& y ) { sum += y.sum ; }
        double result() { return sum ; }
        ~FindSum() {}
    };

    class FindMin
    {
    private:
        const double * const my_a ;
    public:
        double value_of_min ;
        long index ;
        FindMin( const double * a ) : my_a(a) , value_of_min(100000000) , index(-1) {}
        FindMin( FindMin& y , split) : my_a(y.my_a) , value_of_min(100000000) , index(-1) {}
        void join( const FindMin& y )
        {
            if ( y.value_of_min < value_of_min )
            {
                value_of_min = y.value_of_min ;
                index = y.index ;
            }
        }
        void operator()( const blocked_range<size_t>& range )
        {
            const double *a = my_a ;
            double value;
            for ( size_t i = range.begin() ; i != range.end() ; i++ )
            {
                value = a[i] ;
                if ( value < value_of_min )
                {
                    value_of_min = value ;
                    index = i ;
                }
            }
        }
        double result() { return value_of_min ; }
    };

    class Vector
    {
    private:
        double *a , *b , *c ;
        double min ;
    public:
        Vector()
        {
            a = new double [dim] ;
            b = new double [dim] ;
            c = new double [dim] ;
        }
        Vector( double *x , double *y , double *z ) : a(x) , b(y) , c(z) {}
        void Fooo()
        {
            for ( int i = 0 ; i < dim ; i++ ) a[i] = i ;
            for ( int i = 0 ; i < dim ; i++ ) b[i] = i ;
        }
        void multiply()
        {
            parallel_for( blocked_range<size_t>(0,dim,1000) , VectorMultiply(a,b,c) ) ;
    //      parallel_for( blocked_range<size_t>(0,dim) , VectorMultiply(a,b,c) , auto_partitioner() ) ;
        }
        // The output expressions in the three functions below were cut off when the
        // post was rendered; the label strings and c[i] are placeholders.
        void sum()
        {
            FindSum temp(c) ;
            parallel_reduce( blocked_range<size_t>(0,dim,1000) , temp ) ;
            cout << "sum -> " << temp.result() << endl ;
        }
        void minimum()
        {
            FindMin mini(c) ;
            parallel_reduce( blocked_range<size_t>(0,dim,1000) , mini ) ;
            cout << "min -> " << mini.result() << endl ;
        }
        void print()
        {
            for ( int i = 0 ; i < 20 ; i++ )
                cout << c[i] << endl ;
        }
    };

    int main()
    {
        task_scheduler_init init ; //(task_scheduler_init::deferred) ;
    //  init.initialize(1) ;
        Vector a ;
        a.Fooo() ;
        a.multiply() ;
        a.print() ;
        a.sum() ;
        a.minimum() ;
        return 0 ;
    }

Tip for this forum: before pasting code, click on the "yellow highlighter" icon. It will bring up a window that you can paste the code into and not lose indentation. Don't forget to select "C++" before pasting the code.

The problem in the code is a tiny error in class FindSum. Here is the correction:

    for ( size_t i = r.begin() ; i != r.end() ; i++ )

By the way, a little more performance might be obtained by using a local temporary variable to accumulate the sum, like this:

    double tmp = sum;
    for ( size_t i = r.begin(); i != r.end() ; i++ )
        tmp += my_a[i];
    sum = tmp;

The reason it might help is that sometimes a compiler cannot tell that my_a[i] and this->sum are not aliases for the same location, and thus has to be conservative about optimization. By accumulating the sum in a non-address-taken local temporary, you make clear that the location being updated inside the loop is not aliased to my_a[i].
