Problems using parallel_scan

devbisme:

I've been trying to use parallel_scan to create the running minimum over a vector of floating-point numbers. running_min[i] records the smallest number seen in locations 0..i of the vector.

I'm doing the obvious (a sketch of the Body interface these steps imply follows the list):

  1. The vector is divided into subranges.
  2. The pre_scan version of the () operator computes the minimum value found in each subrange.
  3. The reverse_join method communicates the minimum found in subrange k over to subrange k+1, starting from k=0 (I hope). This tells subrange k+1 what the minimum value is in the entire vector preceding it.
  4. The final_scan version of the () operator computes the running minimum of each subrange using the minimum passed in by the reverse_join as the starting minimum.
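
In TBB terms, the Body these four steps imply has roughly this interface (a sketch for orientation only, not the code under discussion; the comments map the members to the steps above):

#include "tbb/blocked_range.h"  // also brings in tbb::split

class Body {
public:
    Body( Body& b, tbb::split );      // step 1: a stolen subrange gets its own body
    template<typename Tag>
    void operator()( const tbb::blocked_range<size_t>& r, Tag );
                                      // step 2 when Tag is tbb::pre_scan_tag,
                                      // step 4 when Tag is tbb::final_scan_tag
    void reverse_join( Body& left );  // step 3: fold the left body's minimum into this one
    void assign( Body& b );           // copy the final reduction state
};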

This is not working. I run a serial version of the algorithm and then the parallel version, and their answers are different. For a ten-thousand element vector, the blocked_range partitions are [0,625), [625,1250), [1250,2500), [2500, 5000), and [5000, 10000), and the first difference always occurs at location running_min[625], right where the first two subranges meet. It appears the reverse_join is not passing the minimum from the previous subrange on to the next subrange.

I instrumented the code, but it was difficult to pick out the diagnostic messages with two physical threads running in parallel. However, I did notice that the final_scan was actually reporting completion before the pre_scan was, and reverse_join was never executing. In order to get a clearer picture, I created a task_scheduler_init(1) to limit it to a single thread. Then the diagnostics showed that only the final_scan was being called, and the pre_scan and reverse_join were never executed.
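
(One way to keep such diagnostics readable with several threads running - a sketch, not the instrumentation actually used here - is to build each message in a local buffer and emit it under a lock in a single write:)

#include <iostream>
#include <sstream>
#include "tbb/spin_mutex.h"

static tbb::spin_mutex log_mutex;

void log_phase( const char* phase, size_t begin, size_t end ) {
    std::ostringstream msg;                        // assemble the message off to the side
    msg << phase << " [" << begin << "," << end << ")\n";
    tbb::spin_mutex::scoped_lock lock(log_mutex);  // serialise only the actual write
    std::cerr << msg.str();
}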

I'm using an AMD Athlon 64 X2 and Windows XP.

So I have a lot of problems. Here are some questions:

  1. Has anyone used parallel_scan and had it work correctly? (There is no example in the TBB tutorial, and the TBB reference has an incomplete example.)
  2. Is there a better way to instrument my code so I can see what is happening when multiple threads are operating?
  3. Why is the blocked range partitioning the vector that way? Shouldn't it cut it into more-or-less equal pieces?
  4. When parallel_scan calls the reverse_join() method, does it join subranges in order starting from the first subrange and proceeding to the last?
  5. Does running with a single thread change the behavior of parallel_scan? Shouldn't it still call pre_scan and reverse_join()?

Any help is appreciated.

mikedeskevich:

devbisme:
Has anyone used parallel_scan and had it work correctly? (There is no example in the TBB tutorial, and the TBB reference has an incomplete example.)

Yes, I have, but only for a simple running sum. I agree that the documentation for parallel_scan is lacking; I still don't have a great understanding of how it works. I can't even find good explanations of the generic parallel prefix anywhere, so maybe it's something that parallel-programming researchers take for granted and don't explain to users very much. Email me at mdeskevich at the domain listed in my profile and I'll send you a working example.

devbisme:
Why is the blocked range partitioning the vector that way? Shouldn't it cut it into more-or-less equal pieces?

I don't think it matters too much with task stealing - did you use the auto_partitioner, or did you set a grainsize? If one of the shorter ranges gets done early, the larger range should be resplit and given to a free task. At least, that's how I understand things.

devbisme:
When parallel_scan calls the reverse_join() method, does it join subranges in order starting from the first subrange and proceeding to the last?

I would hope so. Think about it this way: when you join the subranges, you need to propagate the information from the left-hand subarray all the way through to the right-hand subarray, so they need to be joined in a way that minimises that propagation. Done any other way, they couldn't guarantee only 2x extra work.

devbisme:
Does running with a single thread change the behavior of parallel_scan? Shouldn't it still call pre_scan and reverse_join()?

In my experience, yes - but I don't know how. I have code that, when run with 1 thread, runs as fast as a serial scan; if it were doing the full parallel version, it would run at 1/2 speed, since that does twice as much work. So TBB does something smart there that reverts to serial-like behavior when running under 1 thread.

Mike
devbisme:

Mike:

Thanks for your reply.

I just tried a simple running sum and it fails in the same way as my running minimum. So it would help if you could email me yours; I'll see if it works on my system. My email is devb@xess.com.

devbisme:

I have been able to make the running minimum program work by merging the prescan and final scan versions of the () operator such that it works correctly no matter which scan it is. My original () operator looked like this:

template<typename Tag>
void operator() (const blocked_range<size_t> &r, Tag) {
    if(!Tag::is_final_scan()) // prescan
    {   // find the minimum within a subrange
        for(size_t i=r.begin(); i!=r.end(); i++)
            mmin = (v[i] < mmin) ? v[i] : mmin;
    }
    else // Tag::is_final_scan() == true
    {   // calculate the running minimum over the subrange
        size_t i = r.begin();
        run_min[i] = (v[i] < mmin) ? v[i] : mmin;
        for(++i; i!=r.end(); i++)
            run_min[i] = (v[i] < run_min[i-1]) ? v[i] : run_min[i-1];
    }
}

This didn't work.

Then I re-wrote the operator so it didn't depend upon the scan phase. During the prescan, the operator now computes the minimum in the subrange and it also (needlessly) sets the elements of the running minimum. And during the final scan, it sets the elements of the running minimum and (needlessly) computes the minimum in the subrange:

template<typename Tag>
void operator() (const blocked_range<size_t> &r, Tag) {
    size_t i = r.begin();
    run_min[i] = (v[i] < mmin) ? v[i] : mmin;
    for(++i; i!=r.end(); i++)
        run_min[i] = (v[i] < run_min[i-1]) ? v[i] : run_min[i-1];
    mmin = run_min[i-1]; // subrange min is in the last element of the running min
}

This works.

So I can successfully compute the running minimum as long as I don't depend upon knowing the actual scan phase.

Any comments?

Raf Schietekat:

Among the obvious things you did, you didn't list initialising mmin to FLT_MAX (or whatever the relevant identity value is) in the constructors (though I assume you did), or the exact form of the reverse_join() communication. That should functionally be this->mmin = std::min(this->mmin, arg.mmin), not just this->mmin = arg.mmin; the latter might be the naive encoding of "This tells subrange k+1 what the minimum value is in the entire vector preceding it.", but a plain copy should only be done in assign().

But all of that is immaterial if reverse_join() is never called, of course. It also seems strange that a pre-scan should finish after the last final scan, or that the partitions should be so unevenly distributed (can you reproduce all of this?). On the other hand, it would be counterproductive performance-wise for the implementation to do a pre-scan with only one thread available, so that observation can be disregarded here.

Turning to the code you did provide: in the original version you do not update mmin at the end of the final scan, as required for correctness (the rewritten version does, and not "needlessly"), and you read back from the output vector, which is probably bad for performance unless the compiler is smarter than I presume. Try this instead (mmin does not need to be maintained during the loop, but since it has to be set at the end it might as well be, for clear exposition; it would be a performance optimisation to use a temporary variable with a write-back to mmin at the end, as in the specification's example code):

template<typename Tag>
void operator() (const blocked_range<size_t> &r, Tag) {
    for(size_t i=r.begin(); i!=r.end(); i++) {
        mmin = (v[i] < mmin) ? v[i] : mmin;
        if(Tag::is_final_scan()) {
            run_min[i] = mmin;
        }
    }
}
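
Putting these pieces together, a complete running-minimum Body might look like the following sketch (the names v, run_min and mmin come from the posts above; the FLT_MAX starting value and the std::min in reverse_join() are assumptions following the description above):

#include <algorithm>
#include <cfloat>
#include "tbb/blocked_range.h"

class MinScanBody {
    const float* const v;  // input vector
    float* const run_min;  // output: running minimum
    float mmin;            // minimum over the subranges this body has seen
public:
    MinScanBody( const float v_[], float run_min_[] )
        : v(v_), run_min(run_min_), mmin(FLT_MAX) {}
    MinScanBody( MinScanBody& b, tbb::split )
        : v(b.v), run_min(b.run_min), mmin(FLT_MAX) {}
    template<typename Tag>
    void operator()( const tbb::blocked_range<size_t>& r, Tag ) {
        for( size_t i=r.begin(); i!=r.end(); i++ ) {
            mmin = (v[i] < mmin) ? v[i] : mmin;
            if( Tag::is_final_scan() )
                run_min[i] = mmin;
        }
    }
    void reverse_join( MinScanBody& a ) { mmin = std::min(a.mmin, mmin); }
    void assign( MinScanBody& b ) { mmin = b.mmin; }
};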

Then maybe, just for fun, try doing both reductions at once, to see if this improves performance (I would presume there might be competing forces at work here).

To anyone reading this who has write access to the documentation: I suggest that the semantics column in the specification be augmented with the requirements that are currently only mentioned in the example code and in the "typical patterns" below, i.e., the post-condition that both operator() versions update the aggregation value (whereas the example code's optimisation is merely optional, as is having a template method as mentioned in "typical patterns"), and perhaps exactly what is meant by "merge" for reverse_join(), or "state" for assign() (only reduction state).

Arch D. Robison (Intel):

I'm the author of the parallel_scan code and documentation. Thanks for the discussion - it's definitely pointed out where the documentation falls short. I'll revise and send a draft around.


The parallel_scan does indeed avoid doing the full two passes when only a few threads are available. It does this by inspecting stealing behavior. With one thread, there is no stealing, and hence only the "final_scan" pass is run. With two threads, only the second half (approximately) of the data will be subjected to both passes, though this will vary depending upon stealing luck.
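
(A sketch of how one could observe this from user code - the counters are hypothetical additions to the Body discussed above:)

#include <iostream>
#include "tbb/atomic.h"

tbb::atomic<int> pre_scan_count;    // zero-initialised at namespace scope
tbb::atomic<int> final_scan_count;

// Inside the Body's operator(), before the loop:
//     if( Tag::is_final_scan() ) ++final_scan_count;
//     else                       ++pre_scan_count;
//
// After parallel_scan returns, print both counters; with task_scheduler_init(1)
// pre_scan_count should remain 0, matching the behavior described above.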

Arch D. Robison (Intel):

parallel_scan can be used for more than just obvious running aggregates; I bring this up to show a complexity of documenting it. For example, it can be used to implement an APL-style "compress" operation: copy elements of array C to array B only if the corresponding elements of array A are non-zero.


int j=0;
for( int i=0; i<n; ++i )
    if( A[i] )
        B[j++] = C[i];


The parallel_scan incarnation computes the running count of non-zero A[i] in the pre_scan and final_scan, but instead of storing the count in an array, uses it as the index into B during the final_scan.
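
For concreteness, here is a sketch of what such a compress Body could look like (the class and member names are invented for illustration; the arrays A, B, C are those of the loop above):

#include "tbb/blocked_range.h"

class CompressBody {
    int count;           // running count of elements copied so far
    const int* const A;  // predicate array
    int* const B;        // output array
    const int* const C;  // input array
public:
    CompressBody( const int A_[], int B_[], const int C_[] )
        : count(0), A(A_), B(B_), C(C_) {}
    CompressBody( CompressBody& b, tbb::split )
        : count(0), A(b.A), B(b.B), C(b.C) {}
    template<typename Tag>
    void operator()( const tbb::blocked_range<int>& r, Tag ) {
        int temp = count;
        for( int i=r.begin(); i<r.end(); ++i )
            if( A[i] ) {
                if( Tag::is_final_scan() )
                    B[temp] = C[i];  // the running count doubles as the output index
                ++temp;
            }
        count = temp;
    }
    void reverse_join( CompressBody& a ) { count = a.count + count; }
    void assign( CompressBody& b ) { count = b.count; }
    int get_count() const { return count; }  // total number of elements copied
};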


I'm wondering if perhaps the best way to state the semantic requirements would be in terms of equivalences, i.e., sequences of code that must produce the same final state. For example, for a range r1 of type R and body b1 of type B, parallel_scan requires that the following sequence (items on the same line run in parallel):

    R r2(r1,split()); B b2(b1,split());
    b1(r1,pre_scan_tag()); b2(r2,pre_scan_tag());
    b2.reverse_join(b1);

put b2 in the same state as b1 would have after the following invocation over the original (unsplit) range r1:

    b1(r1,pre_scan_tag());

An alternative might be to state all the ways that the initial state for a body, before it does a final scan over a range r, can be established by pre_scans and final_scans over the prefix of range r.
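
Spelled out with a minimal running-sum body (SumBody is hypothetical, written only to make the equivalence concrete), the first formulation could be checked like this:

#include <cassert>
#include "tbb/blocked_range.h"
#include "tbb/parallel_scan.h"

struct SumBody {
    const int* x;
    int sum;
    explicit SumBody( const int* x_ ) : x(x_), sum(0) {}
    SumBody( SumBody& b, tbb::split ) : x(b.x), sum(0) {}
    template<typename Tag>
    void operator()( const tbb::blocked_range<int>& r, Tag ) {
        for( int i=r.begin(); i<r.end(); ++i ) sum += x[i];
    }
    void reverse_join( SumBody& a ) { sum = a.sum + sum; }
    void assign( SumBody& b ) { sum = b.sum; }
};

void check_equivalence( const int* x, int n ) {
    tbb::blocked_range<int> r1(0, n);
    SumBody b1(x), b_whole(x);

    tbb::blocked_range<int> r2(r1, tbb::split());  // r1 keeps the left half
    SumBody b2(b1, tbb::split());
    b1(r1, tbb::pre_scan_tag());                   // these two could run in parallel
    b2(r2, tbb::pre_scan_tag());
    b2.reverse_join(b1);

    b_whole(tbb::blocked_range<int>(0, n), tbb::pre_scan_tag());
    assert( b2.sum == b_whole.sum );               // same state required
}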


Arch D. Robison (Intel):

I'm still pondering this. The equivalences approach seems too oblique for practitioners. I'm now leaning towards explaining all the ways parallel_scan might gather "look ahead" via pre_scan and reverse_join.


- Arch

Arch D. Robison (Intel):

Attached is a revised explanation of parallel_scan and its requirements. Comments?

Attachment: parallel_scan.pdf (70.01 KB)
mikedeskevich:

Thanks, I understand the parallel_scan much better now. Much improved over the chapter in the O'Reilly book!

Singh Jasdeep:

#include <iostream>
#include <stdlib.h>
#include <time.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_scan.h"
#include "tbb/tick_count.h"
#include "tbb/compat/thread"
using namespace std;
using namespace tbb;

template <class T>
class Body
{
    T reduced_result;
    T* const y;
    const T* const x;

public:

    Body( T y_[], const T x_[] ) : reduced_result(0), y(y_), x(x_) {}

    T get_reduced_result() const {return reduced_result;}

    template<typename Tag>
    void operator()( const blocked_range<int>& r, Tag )
    {
        T temp = reduced_result;
        //cout<<"id of thread is \t"<<this_thread::get_id()<<endl;
        for( int i=r.begin(); i<r.end(); ++i )
        {
            temp = temp+x[i];
            if( Tag::is_final_scan() )
            {
                y[i] = temp;
                //cout<<i<<","<<y[i]<<endl;
            }
        }
        reduced_result = temp;
    }

    Body( Body& b, split ) : reduced_result(0), y(b.y), x(b.x)
    {
        cout<< " output of split is \t " << endl;
    }

    void reverse_join( Body& a )
    {
        reduced_result = a.reduced_result + reduced_result;
        // cout<< " output of reduced_result now is " << reduced_result << endl;
    }

    void assign( Body& b )
    {
        reduced_result = b.reduced_result;
        // cout<<"final value assigned"<<endl;
    }
};

template<class T>
float DoParallelScan( T y[], const T x[], int n)
{
    Body<T> body(y,x);
    tick_count t1,t2;
    t1=tick_count::now();
    parallel_scan( blocked_range<int>(0,n), body, auto_partitioner() );
    t2=tick_count::now();
    cout<<"Time Taken for parallel scan is \t"<<(t2-t1).seconds()<<endl;
    return body.get_reduced_result();
}

template<class T1>
float SerialScan(T1 y[], const T1 x[], int n)
{
    tick_count t3,t4;

    t3=tick_count::now();
    T1 temp = 0;

    for( int i=0; i<n; ++i )
    {
        // cout<<"id of thread is \t"<<this_thread::get_id()<<endl;
        temp = temp+x[i];
        y[i] = temp;
    }
    t4=tick_count::now();
    cout<<"Time Taken for serial scan is \t"<<(t4-t3).seconds()<<endl;
    return temp;
}

int main()
{
    task_scheduler_init init1(4);

    int y1[1000],x1[1000];

    for(int i=0;i<1000;i++)
        x1[i]=i+1;

    cout<<fixed;

    cout<<"\n serial scan output is \t"<<SerialScan(y1,x1,1000)<<endl;

    cout<<"\n parallel scan output is \t"<<DoParallelScan(y1,x1,1000)<<endl;

    return 0;
}

Please help: why is parallel_scan taking more time than the serial scan?

Raf Schietekat:

Please do not cross-post (this was also posted in "Parallel_Scan taking more time than serial").
