Parallel_Scan taking more time than serial

I am running the same computation two ways: with parallel_scan and with a plain serial loop. The serial version is actually faster than the parallel_scan version.

The code I am using is:

#include <iostream>
#include <stdlib.h>
#include <time.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_scan.h"
#include "tbb/tick_count.h"
#include "tbb/compat/thread"
using namespace std;
using namespace tbb;

template <class T>
class Body
{
    T reduced_result;
    T* const y;
    const T* const x;

public:
    Body( T y_[], const T x_[] ) : reduced_result(0), y(y_), x(x_) {}

    T get_reduced_result() const { return reduced_result; }

    // Called for both the pre-scan and the final scan of a subrange.
    template<typename Tag>
    void operator()( const blocked_range<int>& r, Tag )
    {
        T temp = reduced_result;
        //cout<<"id of thread is \t"<<this_thread::get_id()<<endl;
        for( int i=r.begin(); i<r.end(); ++i )
        {
            temp = temp+x[i];
            if( Tag::is_final_scan() )
                y[i] = temp;
        }
        reduced_result = temp;
    }

    // Splitting constructor
    Body( Body& b, split ) : reduced_result(0), y(b.y), x(b.x)
    {
        cout<< " output of split is \t " << endl;
    }

    void reverse_join( Body& a )
    {
        reduced_result = a.reduced_result + reduced_result;
        // cout<< " output of reduced_result now is " << reduced_result << endl;
    }

    void assign( Body& b )
    {
        reduced_result = b.reduced_result;
        // cout<<"final value assigned"<<endl;
    }
};

template<class T>
T DoParallelScan( T y[], const T x[], int n )
{
    Body<T> body(y,x);
    tick_count t1 = tick_count::now();
    parallel_scan( blocked_range<int>(0,n), body, auto_partitioner() );
    tick_count t2 = tick_count::now();
    cout<<"Time Taken for parallel scan is \t"<<(t2-t1).seconds()<<endl;
    return body.get_reduced_result();
}

template<class T1>
T1 SerialScan( T1 y[], const T1 x[], int n )
{
    tick_count t3 = tick_count::now();

    T1 temp = 0;

    for( int i=0; i<n; ++i )
    {
        // cout<<"id of thread is \t"<<this_thread::get_id()<<endl;
        temp = temp+x[i];
        y[i] = temp;
    }
    tick_count t4 = tick_count::now();
    cout<<"Time Taken for serial scan is \t"<<(t4-t3).seconds()<<endl;
    return temp;
}

int main()
{
    task_scheduler_init init1(4);

    int y1[1000],x1[1000];

    for(int i=0;i<1000;i++)
    {
        x1[i] = i;  // loop body was missing from the post; these input values are assumed
    }

    cout<<"\n serial scan output is \t"<<SerialScan(y1,x1,1000)<<endl;

    cout<<"\n parallel scan output is \t"<<DoParallelScan(y1,x1,1000)<<endl;

    return 0;
}
Please help me find where I am going wrong.

Have you tried different grainsize values (the third blocked_range parameter, 1 by default; it has an effect with auto_partitioner as well as with simple_partitioner)?

(Added) Do you see any difference if you don't evaluate blocked_range::end() inside the loop?

I have tried different grain sizes. The serial version takes only 3 usec, while the parallel version takes at least 703 usec. Please check whether my code is correct so that we can find what is going wrong.

The main issue here is problem size: try again with a lot more data, but don't get your hopes up too far because memory bandwidth might be a bottleneck.
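To make the problem-size point concrete, here is a minimal serial sketch of a generic two-pass blocked prefix sum. It is not TBB's implementation; the function name `two_pass_scan` and the `chunk` parameter (playing a role similar to grainsize) are made up for illustration. It shows that a scan split into chunks reads the input twice and does roughly twice the additions of the plain serial loop, so the chunks must be large enough for the parallel passes to repay that extra work plus the scheduling overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Serial illustration of a two-pass blocked prefix sum.
// `chunk` is a hypothetical parameter, similar in spirit to grainsize.
std::vector<int> two_pass_scan(const std::vector<int>& x, int chunk)
{
    const int n = static_cast<int>(x.size());
    std::vector<int> y(n);
    if (n == 0 || chunk <= 0) return y;  // nothing to do / invalid chunk

    // Pass 1 (pre-scan): total of each chunk. Chunks are independent,
    // so this is the part a parallel scan can spread over threads.
    const int nchunks = (n + chunk - 1) / chunk;
    std::vector<int> total(nchunks, 0);
    for (int c = 0; c < nchunks; ++c) {
        const int end = std::min(n, (c + 1) * chunk);
        for (int i = c * chunk; i != end; ++i)
            total[c] += x[i];
    }

    // Combine step: starting offset of each chunk (short and serial).
    std::vector<int> offset(nchunks, 0);
    for (int c = 1; c < nchunks; ++c)
        offset[c] = offset[c - 1] + total[c - 1];

    // Pass 2 (final scan): chunks are independent again, but x is read
    // a second time and every addition is repeated.
    for (int c = 0; c < nchunks; ++c) {
        int temp = offset[c];
        const int end = std::min(n, (c + 1) * chunk);
        for (int i = c * chunk; i != end; ++i) {
            temp += x[i];
            y[i] = temp;
        }
    }
    return y;
}
```

With only 1000 ints the whole serial loop finishes in a few microseconds, so there is very little work to split up in the first place.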

Thanks for replying.

I have also increased the problem size, but the result is the same as before: the serial version now takes about 0.6 sec and the parallel version 4.2 sec. I am stuck. My real algorithm is of this type and I want to use parallel_scan in it, but so far it is not proving beneficial. If you have any better code whose performance you have checked, it would be really helpful.

Did you remove end() and if() from the loop? Yes, that would mean two separate loops.
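As a sketch of what "two separate loops" could look like, the helper below stands in for the body of Body::operator(). The name `scan_chunk` and its parameters are hypothetical: `carry` corresponds to reduced_result on entry, and `is_final` plays the role of Tag::is_final_scan(). The branch is taken once per call instead of once per element, and the loop bound is a plain local value.

```cpp
#include <cassert>

// Hypothetical stand-in for Body::operator(): processes the half-open
// index range [begin, end) and returns the updated running total.
template <typename T>
T scan_chunk(const T* x, T* y, int begin, int end, T carry, bool is_final)
{
    T temp = carry;
    if (is_final) {
        // Final scan: accumulate and store every prefix sum.
        for (int i = begin; i != end; ++i) {
            temp += x[i];
            y[i] = temp;
        }
    } else {
        // Pre-scan: only the chunk total is needed, so y is not touched.
        for (int i = begin; i != end; ++i)
            temp += x[i];
    }
    return temp;
}
```

In the real Body, the returned value would be assigned back to reduced_result at the end of operator().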

I removed the if from the loop; that worked, and the time is now down to almost 2 sec. Can you please tell me how to remove end()? If I don't use end(), how do I bound the loop? If I am not wrong, you are talking about r.end(). Thanks.

for( int i=r.begin(), end = r.end(); i != end; ++i )


When I replace my for loop with this one, the program throws an exception at run time and terminates.

The exception is: "Assertion h!=small_local_task || p.origin ==this failed on line 617 of file z:\itt\branchtbb41\tbb\1.01src\tbb\scheduler.h"

Also, in the for loop you gave, i will never be equal to end, so it raised the exception. Can you please suggest an alternative?

Sorry, I was on my way out and in a hurry when I wrote that code. You should now be able to see the corrected version.

No, it is still the same code that you wrote earlier.

Please check again: "for( int i=r.begin(); i<r.end(); ++i )" (your version) -> "for( int i=r.begin(), end=i<r.end(); i!=end; ++i )" (my earlier mistake) -> "for( int i=r.begin(), end=r.end(); i!=end; ++i )" (what it should be). (You can keep "<" instead of "!=" if you prefer.)
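The difference between the mistaken and the corrected loop header is easy to miss, so here is a small sketch that isolates it. The function names are made up for illustration and `range_end` stands in for r.end(). In the mistaken header, `end` receives the result of the comparison `i < r.end()`, which converts to 0 or 1, so for a subrange that starts above index 1 the test `i != end` would presumably never stop the loop inside the arrays.

```cpp
#include <cassert>

// Value of `end` under the mistaken header:
//   for( int i=r.begin(), end=i<r.end(); i!=end; ++i )
int mistaken_end(int begin, int range_end)
{
    int i = begin, end = i < range_end;  // comparison result: 0 or 1
    (void)i;
    return end;
}

// Value of `end` under the corrected header:
//   for( int i=r.begin(), end=r.end(); i!=end; ++i )
int corrected_end(int begin, int range_end)
{
    int i = begin, end = range_end;  // the actual upper bound
    (void)i;
    return end;
}
```

For a subrange [500, 1000), the mistaken version yields an `end` of 1 while the corrected version yields 1000.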

Oops, sorry, I missed that. I have checked it now: it is correct and it works. We have finally achieved a 2x speedup. Thanks :)
