Using parallel_reduce to calculate pi: why does the run time vary?

On my computer, with num_steps = 100000000, the program takes 5.x s in most runs and 2.x s in a few runs, no matter whether GrainSize is 50000000, 10000000, or auto (auto_partitioner).
With num_steps = 1000000000, it takes about 50 s in most runs and about 20 s in a few.
Environment:
AMD5000
Windows 7
VS2008

#include <iostream>
#include <ctime>
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"

using namespace std;
using namespace tbb;

int Nthreads = 2;
int   GrainSize = 50000000;
long long num_steps =  100000000;

class CMyPi
{
	double *const my_step;
public:
	double sum;
	void operator()(const blocked_range<int>& r);
	CMyPi(CMyPi& x, split);
	void join(const CMyPi& y);
	CMyPi(double *const step);
};

CMyPi::CMyPi(double *const step):my_step(step)
{
	sum = 0.0;
}
CMyPi::CMyPi(CMyPi &x, tbb::split):my_step(x.my_step)
{
	sum = 0.0;
}
void CMyPi::join(const CMyPi &y)
{
	sum += y.sum;
}
//   step = 1.0/(double)num_steps;
//   for (i=0; i < num_steps; i++)
//   {
//      x = (i+0.5)*step;
//      sum = sum + 4.0/(1.0 + x*x);
//    }
void CMyPi::operator ()(const blocked_range<int>& r)
{
	double x = 0.0;
	for(int i = r.begin();i!=r.end();++i)
	{
		x = (i+0.5)* (*my_step);
		sum+=4.0/(1.0+x*x);
	}
}

int main(int argc, char* argv[])
{
	clock_t start, stop;
	double pi;
	double width = 1./(double)num_steps;

	CMyPi step((double *const)&width);
    task_scheduler_init init(task_scheduler_init::deferred);

	start = clock();
	init.initialize(Nthreads);   //TBB
	parallel_reduce(blocked_range<int>(0,num_steps,GrainSize), step);
   // parallel_reduce(blocked_range<int>(0,num_steps), step, auto_partitioner());
	pi = step.sum*width;
	stop = clock();

	cout << "The value of PI is " << pi << endl;
	cout << "The time to calculate PI was " << (double)(stop-start)/CLOCKS_PER_SEC << " seconds\n";
	system("pause");
	return 0;
}

//#include <stdio.h>
//static long num_steps=100000;
//double step;
//void main()
//{  int i;	
//   double x, pi, sum = 0.0;
//   step = 1.0/(double)num_steps;
//   for (i=0; i < num_steps; i++)
//   {
//      x = (i+0.5)*step;
//      sum = sum + 4.0/(1.0 + x*x);
//    }
//    pi = step * sum;
//    printf("Pi = %f\n",pi);
//}

Quoting - hengyunabc

On my computer, with num_steps = 100000000, the program takes 5.x s in most runs and 2.x s in a few runs, no matter whether GrainSize is 50000000, 10000000, or auto (auto_partitioner).

   task_scheduler_init init(task_scheduler_init::deferred);

   start = clock();
   init.initialize(Nthreads);   //TBB
   parallel_reduce(blocked_range<int>(0,num_steps,GrainSize), step);
   // parallel_reduce(blocked_range<int>(0,num_steps), step, auto_partitioner());
   pi = step.sum*width;
   stop = clock();

Is there a reason why you start the timing BEFORE the one-time creation of the TBB thread pool and associated data structures? What happens if you move the start clock after the Nthreads initialize?
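
To make the question concrete, here is a minimal sketch of timing only the work and not the one-time setup. It is a stand-in written in standard C++ (std::chrono), not TBB code, and the helper name seconds_excluding_setup is hypothetical. It uses std::chrono::steady_clock to read elapsed wall-clock time.

```cpp
#include <chrono>

// Runs setup() first, then times only work().
// Returns the elapsed wall-clock time of work() in seconds.
template <typename Setup, typename Work>
double seconds_excluding_setup(Setup setup, Work work) {
    setup();  // one-time initialization, deliberately outside the timed region
    const auto start = std::chrono::steady_clock::now();
    work();   // only this call is measured
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

In the program above, setup would correspond to the init.initialize(Nthreads) call, and work to the parallel_reduce call plus the final multiplication by width.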

Quoting - Robert Reed (Intel)

Is there a reason why you start the timing BEFORE the one-time creation of the TBB thread pool and associated data structures? What happens if you move the start clock after the Nthreads initialize?

I am remiss.
But it seems that I have found the reason.
When the thread number is 8, the run time is always 2.x s.
When the thread number is 4, the run time is a little longer than with 8.
When thread number is 2, longest.
I do not know why.
On XP, when the thread number is 2, the run time is always 5.x s.
But on Windows 7, the run time is sometimes 2.x s.
It is strange.
I have heard that the process scheduling policy on Windows 7 is better than on XP.
Maybe the two are related.

I haven't looked at the problem (got to run now), but I did notice that you don't give TBB the chance to detect and use the actual level of parallelism in your machine, and I didn't see you mentioning it. Are you aware that using too many threads can decrease performance? Have you seen non-optimal behaviour if you don't provide an argument to task_scheduler_init::initialize()?
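
As a rough illustration of not oversubscribing the machine, the sketch below caps a requested thread count at the number of hardware threads reported by the standard library. It is plain standard C++, not TBB, and the function name choose_thread_count and its second parameter are hypothetical. In TBB itself, constructing task_scheduler_init without an argument leaves the choice to the library.

```cpp
#include <thread>

// Returns the worker count to use: the requested count, capped at the number
// of hardware threads. A request of 0 means "let the machine decide".
// The `hardware` parameter exists only so the logic can be checked with fixed values.
unsigned choose_thread_count(unsigned requested,
                             unsigned hardware = std::thread::hardware_concurrency()) {
    if (hardware == 0) {
        hardware = 1;  // hardware_concurrency() may return 0 when it cannot tell
    }
    if (requested == 0 || requested > hardware) {
        return hardware;
    }
    return requested;
}
```

For example, choose_thread_count(8, 2) yields 2, so asking for 8 threads on a machine with 2 hardware threads would not create more workers than the hardware can run at once.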

Quoting - hengyunabc

I am remiss.
But it seems that I have found the reason.
When the thread number is 8, the run time is always 2.x s.
When the thread number is 4, the run time is a little longer than with 8.
When thread number is 2, longest.
I do not know why.
On XP, when the thread number is 2, the run time is always 5.x s.
But on Windows 7, the run time is sometimes 2.x s.
It is strange.
I have heard that the process scheduling policy on Windows 7 is better than on XP.
Maybe the two are related.

I think maybe Raf hit upon something. Do you know how many HW threads your machine(s) has/have? Are you running XP and Windows 7 on the same machine, or the same class of machine?

I'm confused by "when thread number is 2, longest" when the two examples of XP and Windows 7 are not "longest" compared to 4 threads.
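
One way to make such comparisons less ambiguous is to repeat each configuration several times and report the fastest and slowest run, since the original post describes run-to-run variation with unchanged settings. The sketch below is a hypothetical helper in standard C++ (the names TimingSpread and measure_spread are assumptions, not part of TBB).

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Fastest and slowest elapsed time, in seconds, over a series of runs.
struct TimingSpread {
    double fastest;
    double slowest;
};

// Calls work() `repeats` times and records the wall-clock time of each call.
// Returns {0.0, 0.0} if repeats is not positive.
template <typename Work>
TimingSpread measure_spread(Work work, int repeats) {
    std::vector<double> seconds;
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    TimingSpread spread = {0.0, 0.0};
    if (!seconds.empty()) {
        spread.fastest = *std::min_element(seconds.begin(), seconds.end());
        spread.slowest = *std::max_element(seconds.begin(), seconds.end());
    }
    return spread;
}
```

Running this once per thread count (for example 2, 4 and 8) on each operating system would give one fastest/slowest pair per configuration, which is easier to compare than single runs.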
