parallel_reduce problem

parallel_reduce problem

Hi,

The following code is a bare-bones simplification of a much larger project.It calculates the mean value of a large vector expression.

When the mean_calculator class contains the vector-expression itself (instead of a ref to it), which is morre expensive because of the copy-c/tor,

the first time it is ran the result is pretty random, while the subsequent calls give the same number (as far as I can tell).

When, though it contains a (const) ref to the vector-expression, anything goes as far a result is concerned.

The mean_calculator  only operates on a part of the vector and, more importantly, only reads info.

Any idea for why this behavior ?

Thank you very much in advance,

Petros

PS: I know, that MKL has Summary Statistics, but this is not what I am after. This is only an example

that I need to understand.

20 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

If your code is using lambda functions, and you are using references (e.g. [&]), be certain that you are not writing to these. If you must write to the references then use then make these items atomic. (though this may not meant the code will be deterministic)

Jim Dempsey

www.quickthreadprogramming.com

Here is the code (btw using vs2010 on win7, 64bit, 64bit build):

#include

template < typename _E >
class mean_calculator {
public:
double dest_;
mean_calculator( _E const & e )
:e_(e)
, dest_( double(0) )
, commonOperand_( double( e.size() ) )
{}
~mean_calculator()
{}
void operator()( const tbb::blocked_range< size_t > & r ) {
double dest = dest_ ;
for( register size_t i = r.begin(); i != r.end(); ++i ) {
#if defined(_DEBUG)
const double e_i = e_[i];
#endif
dest += e_i ;
}
dest /= commonOperand_ ;
dest_ = dest ;
}
mean_calculator( mean_calculator & elr,
tbb::split & splt )
:dest_( double(0) ),
e_(elr.e_),
commonOperand_( elr.commonOperand_ )
{}
void join ( const mean_calculator & elr ) {
dest_ += elr.dest_ ;
}
private:
_E e_;
double commonOperand_ ;
};

template
double myMean( _e const & e) {
const size_t size = e.size() ;
mean_calculator< _e> elr( e );
tbb::parallel_reduce( tbb::blocked_range(0, size, 125000 ),
elr,
tbb::auto_partitioner() );
return elr.dest_;
}

struct SillyVector {

double *px ;
SillyVector(){
px = new double[1000000] ;
}
~SillyVector(){
delete [] px ;
}
SillyVector(SillyVector const & o){
px = new double[1000000] ;
copy( o.px, o.px+1000000, px ) ;
}

size_t size() const {
return 1000000 ;
}
double & operator[]( const size_t i ){
return px[i] ;
}
double const & operator[]( const size_t i ) const {
return px[i] ;
}
} ;

int main() {

SillyVector v ;
for ( size_t i= 0 ; i != v.size() ; ++i )
v[i] = double(rand()) / double( RAND_MAX ) ;

task_scheduler_init init ;
std::cout << myMean(v) << std::endl ;
std::cout << myMean(v) << std::endl ;
std::cout << myMean(v) << std::endl ;
std::cout << myMean(v) << std::endl ;
std::cout << myMean(v) << std::endl ;
std::cout << myMean(v) << std::endl ;
std::cout << myMean(v) << std::endl ;
std::cout << myMean(v) << std::endl ;

return
1 ;
}

Update:
Google-ing around I discovered that the problem is that the line : dest_ += other.dest_ is not associative since the rhs is being divided in the operator()
body !!
This is brilliant! I really learned something valuable !
Thank you (especially Jim who responded, and Raf who had solved it in the other occasion).

PS: It might not be entirely useless to mention that, it seems that the new deterministic reduce will avoid this problem rendering possible overflow issues for very big arrays harmless !!
I wonder if the performance penalty is too big. Is my understanding that, the requirement for associativity is lifted in the deterministic reduce correct ?

Non-associativity is a usual suspect for these symptoms, but I didn't want to respond without more information. The solution does not lie in requiring any specific split&join behaviour, but in assuring correct associativity (aggregate sum_ and weight_ separately in the same Body, divide only at the very end, don't bother calculating weight_ if all elements count evenly); use parallel_deterministic_reduce() only if you want to eliminate differences across different runs of the program related to the limited precision of floating-point numbers, but not to compensate for an incorrectly implemented Body.

>> It might not be entirely useless to mention that, it seems that the new deterministic reduce will avoid this problem rendering possible overflow issues for very big arrays harmless !!

Very big arrays (actually any reasonable sized array) have other issues. A simple deterministic reduce may repeatedly return a less precise result. Example, a reduction sum, will produce more accurate results when you sort the values to be summed by magnitude, then sum from small magnitude to large magnitude. You have a tradeoff between speed and precision.

Jim Dempsey

www.quickthreadprogramming.com

How about using an aggregation type that is simply more precise than the element type? It could even be simulated, by aggregating different magnitudes onto different aggregates during the loop and consolidating them at the end, which would probably be a lot faster than with a sort phase, especially over an amount of data that would not fit inside the cache.

Aggregates would be a suitable tradeoff. You you not even need to make the aggregate lists/vectors, just have each reduce object for, each branch, sum into the aggregate slots, and the last reduce sum the final aggregates (smallest to largest).

Jim Dempsey

www.quickthreadprogramming.com

I'm using aggregation and reduction as synonyms, with an aggregate something like a sum (the partial or final result), which seems to be allowed by Merriam-Webster's, and the simulated increased precision consisting of a small array of aggregates, so perhaps we really mean the same thing? The final reduction can probably be done in any order, because the long tail has already been consolidated, if I can use that term for a majority of small values that would otherwise be individually knocked out.

>>a small array of aggregates, so perhaps we really mean the same thing?

Yes

>>The final reduction can probably be done in any order,

No - final reduction is from lower magnitude to larger magnitude.
Code such that the small array of aggregates is in ascending magnitude order. Then standard sum with ++i over array will produce the desired result.

Jim Dempsey

www.quickthreadprogramming.com

JIm, Raf,
This is a bit counter-intuitive to me. I would expect that if data are unsorted, then chances for overflow will be smaller. If on the other hand the data is sorted, then I am almost guaranteed for overflow. Jim's suggestion to use, say a long double, for the partial aggregates seems very well suited for my needs.
I would like however to clarify two points :
a) is the deterministic version of reduction slower, as a rule ?
b) when I write the operator body, if I do not copy the member element destination to a local variable, there are race issues - having to do with the splitting and the body of the function run concurrently. If on the other hand I do the local copying and in the end I assign back to the class member destination, things seem to work OK. In principle, I do not see any reason for this to be the case - am I missing something ? Is it that it is simply much more unlikely that the data race will occur, or something else is happening ? The tbb documentation does the same "trick", btw.
Thank you both very much for your interest and suggestions,
Petros

Quote:

jimdempseyatthecove wrote:
No - final reduction is from lower magnitude to larger magnitude.

Could that really make a difference when adding fewer than, say, ten final values (as an outlandishly high limit)?

Quote:

Petros Mamales wrote:
This is a bit counter-intuitive to me. I would expect that if data are unsorted, then chances for overflow will be smaller. If on the other hand the data is sorted, then I am almost guaranteed for overflow.

Hmm, don't take chances: prevent or verify. There are different issues: integral aggregates can overflow, floating-point values can be lost if they're too small to matter against the current aggregate value. Not sorting might help with the first (assuming signed values), but only small-to-large sorting can benefit with the second.

Quote:

Petros Mamales wrote:
Jim's suggestion to use, say a long double, for the partial aggregates seems very well suited for my needs.

Jim's? :-)

Quote:

Petros Mamales wrote:
a) is the deterministic version of reduction slower, as a rule ?

Yes, although I can't tell by how much: you'll have to tune the grainsize carefully to get close, and there's no Body reuse (might matter for fancy coding with multiple bins).

Quote:

Petros Mamales wrote:
b) when I write the operator body, if I do not copy the member element destination to a local variable, there are race issues - having to do with the splitting and the body of the function run concurrently. If on the other hand I do the local copying and in the end I assign back to the class member destination, things seem to work OK. In principle, I do not see any reason for this to be the case - am I missing something ? Is it that it is simply much more unlikely that the data race will occur, or something else is happening ? The tbb documentation does the same "trick", btw.

Have you observed such race issues (not the same as not allowing for Body reuse, which seems to be the case here)? Normally you don't copy the current aggregate value during the splitting constructor (initialising to a neutral value instead), so you don't have to do anything special there (a counterexample in the documentation might be useful). Using a local variable often helps the optimiser make dramatic improvements, but that's a different issue, and you don't need to start with a copy of the current value as long as you include it at the end.

Raf,
Thank you for responding.
On b) Yes, I have seen data race like this - even in the example I submitted. If I don't copy to a local variable ( something -very- silently suggested in the documentation ) - I have also done the mistake to initialize the partial aggregate by a copy instead of the operation identity - I guess, in order to do something right you have to do all possible mistakes ;-)).
And I don't understand this, to be honest.
Thank you very much
Petros
PS1: I wrote "chances" because I use this in Monte-Carlo simulations . And sometimes one does not really know if the numerical method failed or
something like this happened - bitter experience.. ;-)
PS2: Apologies for the wrong assignment of credit ;-)). Everyone knows that I meant you ! ;-)

Strange. Unless I'm mistaken, either you're doing something wrong or there is something wrong with parallel_reduce()... maybe you're mistaking symptoms from an incorrect implementation for a race? Your responsibility is to initialise the Body's aggregate value(s) with a neutral element in the splitting constructor (without reading the "parent"'s aggregate value, because that could be a C++ race), build on the existing value in zero or more invocations by TBB of the operator (whether you make a copy of the current aggregate value, and, if so, at the beginning or the end, should not matter for correctness except as described below), and merge aggregates during the parent's join() operation, and the toolkit will keep those 3 things sequential within a single Body instance. Across instances, local operator invocations and joins from child Body instances are also supposedly run sequentially; child Body instances can be split-constructed concurrently because they are not supposed to read the aggregate value(s). The only mistake I see in the code you provided is that the division is performed on the existing aggregate as well as the new values, so it will be performed multiple times if the Body is reused, and various elements of the input range will have been divided by the size to various integral powers equal to... or larger than 1. Try what happens if you don't make a copy but start with a fresh local aggregate, set it from the vector elements, divide it there if you really want to (the division can safely be distributed over the sum even if it seems a bit wasteful to do so), and then add it to the member variable at the end of the operator call (you can experiment with a separate copy if you want to, as long as you don't subject it to an extra division). Does that improve matters?

(2012-10-24 Minor editing of first full sentence, after the ellipsis: removed "or", mistaken->mistaking, period->question mark.)

>>long doubles

Don't rely on "long doubles" being any different than "double".

One technique is to use an array of accumulators:

double sum_array[128]; // zero in ctor of reducer

Then your reduction function accumulates into sum_array[by magnitude]


  union {

     double d;

     char c[8];

  };

  d = x;

  sum_array[c[7]&127] += x;

The final reduction sums the elements of the sum_array

sum=0.0;
for(int i=0;i < 128; ++i) sum += sum_array[i];

Jim Dempsey

www.quickthreadprogramming.com

The above uses the most significant 7 bits of the 11 bit exponent.
You may also consider:


  union {

     double d;

     short s[4];

  };

  d = x;

  int i = (s[3] & 0x7FFF) >> 4;

  sum_array[min(i,127)] += x;

Jim Dempsey

www.quickthreadprogramming.com

That's an awful lot of doubles, and there's still at least theoretically a limit on the supported number of input elements (related to the precision of a double, not to the size of the array)... Good call on the precision of long double! How difficult would it be to implement a floating-point type of arbitrary precision, and how expensive to use it, I wonder?

(Added 2012-10-25) I guess it depends on the cost of a shift. If that's cheap enough, and I understand that there has been some improvement in Intel's upcoming Haswell microarchitecture that might be relevant (?), then I would consider as a first approach an array of 64-bit integers (32-bit digit and 32-bit carry, so to say, with each carry overlapping with a higher-value digit), with positive and negative kept separate to avoid excessive communication. Each input double will be split into 2 digit-carry pairs. Inter-digit carry communication would be kept to a minimum until final normalisation, by only overflowing to the next digit when wraparound is imminent. Perhaps it's worth exploring if you find yourself bored: it's a brave new world out there with data movement now a more dominating concern, and perhaps an idea as kooky as this might actually be practical enough to prefer over worries about loss of precision? If nothing else, it could show how big the error would otherwise be.

Every time I try to reply I get this :
The requested URL was rejected. Please consult with your administrator.

Your support ID is: 8287664565030860172
can you please help?

Raf,
at this point things get a bit hairy and one has to decide to use an honest-to-god multi-precision library.
For others' reference purposes I will mention mpir and the following site that provides with windows msvc2010 binaries (since building it from scratch is a bit of work):
http://www.holoborodko.com/pavel/mpfr/#intro

Thank you both you and Jim for your comments and help,
Petros

Couldn't you somehow assure/force the uninterrupted use of an 80-bit floating-point aggregate in your environment, which seems simpler, if you even need the precision at all? Anyway, I think we have now established (off-line) that TBB is behaving as it should when used correctly, so that's a relief.

(Added 2012-10-29) It seems even worse than I thought: Visual C++ reportedly also sets internal precision to 64 bits (down from 80 bits), although this can be overridden. Of course, an intrepid C++ programmer could always implement a more precise floating-point class... in assembler, so I suppose somebody may already have made this available in a library somewhere.

Leave a Comment

Please sign in to add a comment. Not a member? Join today