New to Intel TBB - Question regarding performance.

Hi, I'm new to Intel TBB and am trying to get more performance out of my app.

My application runs a simulation of gravity between particles of different mass in a 2D field. The simplest way to calculate this is to take each particle and calculate the forces exerted on it by every one of the other particles.

Anyway, I have a loop that used to look like this:

for(int i = 0; i < CANTIDAD; i++){
	for(int j = 0; j < CANTIDAD; j++){
		if(oM[i] != oM[j]){
			oM[i]->calculaPhys(oM[j]);
			veces++;
		}
	}
}

I modified that code to use a parallel_for like this:

class ApplyPhys {
	objectMass *const my_Obj;
public:
	ApplyPhys( objectMass obj[]): my_Obj(obj){}
	void operator()( const blocked_range<size_t>& r ) const{
		objectMass *obj = my_Obj;
		for( size_t i = r.begin(); i != r.end(); i++){
			funcTest(&obj[i]);
		}
	}

};

void funcTest(objectMass * obj){
	for (int i=0; i < CANTIDAD; i++){
		if(obj != oM[i]){
			if(obj->mass != 0){
					veces++;
				obj->calculaPhys(oM[i]);
			}
			if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){
					obj->mass += oM[i]->mass;
					oM[i]->mass =0;
			}
		}
	}
}

parallel_for(blocked_range<size_t>(0,CANTIDAD,1), ApplyPhys(*oM), auto_partitioner());

Anyway, I'm getting worse performance when using parallel_for than when using a serial approach.

What could I do to increase performance?


Actually I can see an improvement of 2X when using:

task_scheduler_init init(2);

compared to

task_scheduler_init init(1);

So why is there so much overhead when using parallel_for compared to a simple nested for?

Quoting - dreamthtr
[snip]

robert-reed (Intel):

Quoting - dreamthtr
So why is there so much overhead when using parallel_for compared to a simple nested for?

The trick with parallel programming is isolating activities between the threads so as to keep the workers from interfering with each other. It's hard to tell what interactions may be going on inside the physics function:

obj->calculaPhys(oM[i]); 

But clearly when the code collapses the sticky masses:

if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){ 
   obj->mass += oM[i]->mass; 
   oM[i]->mass =0; 
} 

A thread that may have been limited by the blocked_range to a subset of objects has the potential for touching any object, even if that object is itself being operated upon by another worker thread. Such interference with each other's data means that the caches for the various workers are probably thrashing about, inducing a lot more memory pressure than might be exerted in the single-threaded run. Perhaps you might invest some time learning about dependence analysis.
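
To make the data-interference point concrete, here is a minimal sketch of one restructuring in which each task writes only to the particle it owns and the sticky-mass merge is deferred to a serial pass. The ObjectMass layout, the fx/fy/absorbedBy fields, and the extern declarations are hypothetical placeholders, not the original objectMass or calculaPhys.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <cmath>

struct ObjectMass {            // hypothetical stand-in for objectMass
	double x, y;
	double mass;
	double fx, fy;             // force accumulated during this step
	int    absorbedBy;         // -1, or index of the particle that absorbs this one
};

extern ObjectMass oM[];        // assumed particle array
extern const int  CANTIDAD;    // assumed particle count

class AccumulateForces {
public:
	void operator()( const tbb::blocked_range<int>& r ) const {
		for( int i = r.begin(); i != r.end(); ++i ) {
			ObjectMass& self = oM[i];
			self.fx = self.fy = 0.0;
			self.absorbedBy = -1;
			for( int j = 0; j < CANTIDAD; ++j ) {
				if( j == i || oM[j].mass == 0.0 ) continue;
				// oM[j] is only read here; every write goes to oM[i]
				double dx = oM[j].x - self.x;
				double dy = oM[j].y - self.y;
				double r2 = dx*dx + dy*dy + 1e-9;   // softened
				double d  = std::sqrt(r2);
				double f  = self.mass * oM[j].mass / r2;
				self.fx += f * dx / d;
				self.fy += f * dy / d;
				// remember the sticky-mass merge instead of doing it now
				if( std::fabs(dx) < 0.5 && std::fabs(dy) < 0.5 && j < i )
					self.absorbedBy = j;
			}
		}
	}
};

// usage: forces in parallel, then apply the deferred merges serially
//   tbb::parallel_for( tbb::blocked_range<int>(0, CANTIDAD), AccumulateForces() );
//   for( int i = 0; i < CANTIDAD; ++i )
//       if( oM[i].absorbedBy >= 0 ) { oM[oM[i].absorbedBy].mass += oM[i].mass; oM[i].mass = 0.0; }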

Yes, you are right; in the physics function I'm accessing the variables of the object, and probably that same object is being calculated in another thread.

Quoting - Robert Reed (Intel)

[snip]

Alexey Kukanov (Intel):

Quoting - dreamthtr

Actually I can see an improvement of 2X when using:

task_scheduler_init init(2);

compared to

task_scheduler_init init(1);

So why is there so much overhead when using parallel_for compared to a simple nested for?

So you see a 2x speedup with the TBB version using 2 threads over the TBB version using 1 thread, and the TBB version with 1 thread is much slower than your serial version, right? Then data sharing might be less of a problem at the moment; rather, some compiler optimizations might be getting missed.
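
One simple way to separate scheduler overhead from a missed optimization is to time the plain serial loop against the TBB version restricted to a single worker. A small timing sketch, where runSerial() and runParallel() are hypothetical wrappers around the two code paths shown earlier:

#include "tbb/tick_count.h"
#include "tbb/task_scheduler_init.h"
#include <cstdio>

void runSerial();    // assumed: the original nested-for loop
void runParallel();  // assumed: the parallel_for(ApplyPhys) version

void compareOneThread() {
	tbb::tick_count t0 = tbb::tick_count::now();
	runSerial();
	tbb::tick_count t1 = tbb::tick_count::now();

	// force a single worker (assuming no scheduler was created earlier)
	tbb::task_scheduler_init init(1);
	tbb::tick_count t2 = tbb::tick_count::now();
	runParallel();
	tbb::tick_count t3 = tbb::tick_count::now();

	std::printf("serial: %.3f s, TBB(1 thread): %.3f s\n",
	            (t1 - t0).seconds(), (t3 - t2).seconds());
	// If TBB with one thread is much slower than the serial loop, the extra
	// time is not parallel contention; look at inlining/optimization of the
	// loop body (e.g. funcTest) instead.
}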

robert-reed (Intel):

Quoting - dreamthtr
Yes, you are right; in the physics function I'm accessing the variables of the object, and probably that same object is being calculated in another thread.

FYI, I was poking around on another task and ran across this bit on the Wikipedia page regarding parallel algorithms:

Most of the available algorithms to compute Pi, on the other hand, can not be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three body problem, are also algorithms which are inherently serial.

If as claimed the three body problem is inherently serial, restricting it to 2 dimensions is probably not sufficient to overcome that limitation. Thinking about the problem for a minute, while it is true at any particular instant that you can compute the forces on each mass as separate calculations that could be assigned to separate HW threads, at some point the system state needs to be updated and every particle's position needs to be changed. Time step size needs to be small enough to meet some error bound in the approximations, meaning there's some serious serialization required. The solution steps would permit the force computations to occur in parallel and the object position updates to occur in parallel, but switching between these two states would require all the threads to synchronize and discard any old state about position and velocity. And if you're considering real masses with volumes, densities and individual angular momenta (English?), state update gets even more complicated.
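
A minimal sketch of that phase structure, assuming hypothetical helpers computeForces() and integrate() that write only to particle i: each time step runs a parallel force pass over a read-only snapshot and then a parallel update pass, and each parallel_for returning is the synchronization point between the phases.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void computeForces(int i);        // assumed: fills particle i's force from a snapshot
void integrate(int i, double dt); // assumed: advances particle i using its force

class ForcePass {
public:
	void operator()( const tbb::blocked_range<int>& r ) const {
		for( int i = r.begin(); i != r.end(); ++i )
			computeForces(i);          // reads everyone, writes only particle i
	}
};

class UpdatePass {
	double dt;
public:
	explicit UpdatePass(double dt_) : dt(dt_) {}
	void operator()( const tbb::blocked_range<int>& r ) const {
		for( int i = r.begin(); i != r.end(); ++i )
			integrate(i, dt);          // touches only particle i
	}
};

void timeStep(int n, double dt) {
	tbb::parallel_for(tbb::blocked_range<int>(0, n), ForcePass());     // phase 1
	// implicit join: the force pass has finished before we continue
	tbb::parallel_for(tbb::blocked_range<int>(0, n), UpdatePass(dt));  // phase 2
}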

robert.jay.gould:

Quoting - Robert Reed (Intel)

[snip]

I'm assuming your goal here is accuracy, in which case, as Robert Reed points out, true parallelization will be hard/impossible(?). However, we do this sort of calculation all the time in games (where we can be quite sloppy with our simulations), and the standard solution is to break space up into a grid/hash and calculate a step for each sub-system independently. When calculated in this fashion, parallelization is trivial. Of course it's not scientifically accurate (at all), but good enough for entertainment purposes.
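
A rough sketch of that grid idea, under several assumptions: a hypothetical Particle layout, uniform cells, and, for brevity, only interactions within a cell (a real version would also visit the neighbouring cells). Particles are binned serially; a parallel_for then owns whole cells, so no two tasks write to the same particle.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>
#include <algorithm>
#include <cmath>

struct Particle { double x, y, mass, fx, fy; };   // hypothetical layout

// Body that processes whole cells; every particle lives in exactly one cell,
// so no two tasks ever write to the same particle.
class CellPass {
	std::vector<Particle>& p;
	const std::vector<std::vector<int> >& cells;
public:
	CellPass(std::vector<Particle>& p_, const std::vector<std::vector<int> >& c)
		: p(p_), cells(c) {}
	void operator()( const tbb::blocked_range<size_t>& r ) const {
		for( size_t c = r.begin(); c != r.end(); ++c )
			for( size_t a = 0; a < cells[c].size(); ++a )
				for( size_t b = 0; b < cells[c].size(); ++b ) {
					if( a == b ) continue;
					Particle& pa = p[cells[c][a]];
					const Particle& pb = p[cells[c][b]];
					double dx = pb.x - pa.x, dy = pb.y - pa.y;
					double r2 = dx*dx + dy*dy + 1e-9;   // softened
					double f  = pa.mass * pb.mass / r2;
					pa.fx += f * dx / std::sqrt(r2);
					pa.fy += f * dy / std::sqrt(r2);
				}
	}
};

void gridStep(std::vector<Particle>& p, double cellSize, int cellsPerSide) {
	// serial binning pass
	std::vector<std::vector<int> > cells(cellsPerSide * cellsPerSide);
	for( int i = 0; i < (int)p.size(); ++i ) {
		int cx = std::min(cellsPerSide - 1, std::max(0, (int)(p[i].x / cellSize)));
		int cy = std::min(cellsPerSide - 1, std::max(0, (int)(p[i].y / cellSize)));
		cells[cy * cellsPerSide + cx].push_back(i);
	}
	// parallel pass over the cells
	tbb::parallel_for(tbb::blocked_range<size_t>(0, cells.size()), CellPass(p, cells));
}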

Raf Schietekat:

"it's not scientifically accurate"
Since forces decrease with inverse square distance, I suppose that even more scientifically oriented computations would likely either ignore far-away objects (if there are far-away objects all around to largely cancel each other out), or only communicate aggregated information assumed to remain constant (or at best to follow extrapolated paths) over a number of steps. A scientific simulation also has the luxury to go back in time and use the previous approximation to get a better one. Well, that's how I would do it, anyway. :-)

robert-reed (Intel):

Quoting - Raf Schietekat
"it's not scientifically accurate"
Since forces decrease with inverse square distance, I suppose that even more scientifically oriented computations would likely either ignore far-away objects (if there are far-away objects all around to largely cancel each other out), or only communicate aggregated information assumed to remain constant (or at best to follow extrapolated paths) over a number of steps. A scientific simulation also has the luxury to go back in time and use the previous approximation to get a better one. Well, that's how I would do it, anyway. :-)

You could also identify clusters, work out the forces within each cluster, then treat the clusters as composite objects with centers of mass and compute composite forces. Inverse-squared attenuation would reduce the forces between the clusters and also their impact as suggested by Raf, so perhaps you could get by with computing the inter-cluster forces less frequently than the intra-cluster forces.
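
A small sketch of that composite-object idea, with hypothetical Body and Composite types and the gravitational constant omitted: collapse a cluster to its total mass and centre of mass, then apply a single far-field force from the composite instead of one force per member.

#include <vector>
#include <cmath>

struct Body      { double x, y, mass; };   // hypothetical layout
struct Composite { double x, y, mass; };   // a cluster treated as a point mass

Composite summarize(const std::vector<Body>& cluster) {
	Composite c = { 0.0, 0.0, 0.0 };
	for( size_t i = 0; i < cluster.size(); ++i ) {
		c.mass += cluster[i].mass;
		c.x    += cluster[i].mass * cluster[i].x;
		c.y    += cluster[i].mass * cluster[i].y;
	}
	if( c.mass > 0.0 ) { c.x /= c.mass; c.y /= c.mass; }   // centre of mass
	return c;
}

// Far-field force of a whole cluster on one body: one evaluation instead of
// cluster.size() evaluations, acceptable when the cluster is far away.
void addCompositeForce(const Body& b, const Composite& c, double& fx, double& fy) {
	double dx = c.x - b.x, dy = c.y - b.y;
	double r2 = dx*dx + dy*dy;
	double f  = b.mass * c.mass / r2;      // G omitted in this sketch
	double d  = std::sqrt(r2);
	fx += f * dx / d;
	fy += f * dy / d;
}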

jimdempseyatthecove:

dreamthtr,

The following is a potential solution using QuickThread; I will leave it as an exercise for you to rework it into TBB.

/*
// single threaded code
for(int i = 0; i < CANTIDAD; i++){
  for(int j = 0; j < CANTIDAD; j++)
    {
       if(oM[i] != oM[j]){
         oM[i]->calculaPhys(oM[j]);
           veces++;
        }
    }
}
*/

// parallel task performing independent slices of volume
// no atomic operations required
void CalcPhysTask(size_t i, size_t j, size_t nVolumes)
{
	size_t	stride = CANTIDAD / nVolumes;
	size_t	iBegin = stride * i;
	size_t	iEnd = iBegin + stride;
	if((i+1) == nVolumes) iEnd = CANTIDAD;
	size_t	jBegin = stride * j;
	size_t	jEnd = jBegin + stride;
	if((j+1) == nVolumes) jEnd = CANTIDAD;
	for(i = iBegin; i < iEnd; i++)
	{
		for(size_t j = jBegin; j < jEnd; j++)
		{
			if(oM[i] != oM[j]){
				oM[i]->calculaPhys(oM[j]);
				veces++;
			}
		}
	}
}

void CalcPhys()
{
	size_t	nThreads = qt_get_num_threads();
	// nVolumes = arbitrary tuning parameter
	// Determine an optimal number of volumes
	// The processing loops are
	// for(i=0; i < nVolumes; ++i)
	//		for(j=i; j < nVolumes; j++)
	//
	// When nVolumes-i becomes less than the number of threads
	// then some cores become under-utilized
	// However, if you create too many volumes
	// then you incur excessive task enqueuing overhead
	// This will have to be experimented with, depending on the system
	size_t	nVolumes = CANTIDAD / nThreads / 128;
	if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads / 16;
	if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads / 4;
	if(nVolumes < nThreads) nVolumes = CANTIDAD / nThreads;
	if(nVolumes < nThreads)
	{
		if(CANTIDAD <= 1)
			return;
		nVolumes = nThreads = CANTIDAD;
	}
	size_t	i, j;
	for(i=0; i < nVolumes; ++i)
	{
		{ // scope of qtControl
			qtControl	qtControl;
			for(j=i; j < nVolumes; j++)
			{
				parallel_task(&qtControl, CalcPhysTask, i, (i+j)%nVolumes, nVolumes);
			}
		} // end scope of qtControl (synchronization point)
	}
}

The forum's paste-C++-code feature has problems with the left angle bracket.

The general idea is to partition the iteration space into "volumes" (not geometric volumes, rather subscript-range volumes), then permute over the volumes in a manner that avoids collision with other threads. By using this technique you can avoid the need for atomic additions into the objects (i.e. the gravitational acceleration/force components of the object).
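
For readers who want to attempt the suggested TBB rework, here is one hedged sketch of the wave-of-volumes idea: each wave runs a set of volume pairs with a parallel_for, which also supplies the per-wave synchronization point. calcPhysVolumePair is a hypothetical helper with the same index-range loops as CalcPhysTask above, and whether the schedule is truly race-free depends on what calculaPhys writes to, which is exactly what the replies below question.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// assumed: same body as CalcPhysTask above (iterate oM over the two index ranges)
void calcPhysVolumePair(size_t iVol, size_t jVol, size_t nVolumes);

class WaveBody {
	size_t wave, nVolumes;
public:
	WaveBody(size_t w, size_t n) : wave(w), nVolumes(n) {}
	void operator()( const tbb::blocked_range<size_t>& r ) const {
		for( size_t i = r.begin(); i != r.end(); ++i )
			calcPhysVolumePair(i, (i + wave) % nVolumes, nVolumes);
	}
};

void calcPhysTBB(size_t nVolumes) {
	for( size_t wave = 0; wave < nVolumes; ++wave ) {
		// all pairs of one wave run in parallel; parallel_for returns only
		// when the wave is complete, mirroring the qtControl scope above
		tbb::parallel_for(tbb::blocked_range<size_t>(0, nVolumes),
		                  WaveBody(wave, nVolumes));
	}
}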

Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove:

Please excuse the typos (sorry about that).

Jim

www.quickthreadprogramming.com
robert-reed (Intel):

Quoting - jimdempseyatthecove
The following is a potential solution using QuickThread; I will leave it as an exercise for you to rework it into TBB.

[snip]


The general idea is to partition the iteration space into "volumes" (not geometric volumes, rather subscript-range volumes), then permute over the volumes in a manner that avoids collision with other threads. By using this technique you can avoid the need for atomic additions into the objects (i.e. the gravitational acceleration/force components of the object).

As I understand this suggestion, the idea here is to band the interactions into regions of the array to provide some separation of the dependencies. However, what's going on inside calculaPhys? As I interpret this "qt" stuff, it appears to me that the "parallel_task" call is launching multiple workers, all working on a particular i-band, each managing the calculations for a separate j-band. The j-bands may all be read-only but multiple threads all working on the same i-band are going to have the same problems with update that the original parallel implementation would have. Some sort of locking or atomic update will still be required to manage the accesses of the workers operating on a particular oM[i].

I also think that the use of QuickThread to describe a solution on the TBB forum (leaving the TBB solution as an exercise to the reader), where that QT implementation offers no unique advantage (over, say, a TBB parallel_invoke), is gratuitous.

robert-reed (Intel):

Quoting - Robert Reed (Intel)
You could also identify clusters, work out the forces within each cluster, then treat the clusters as composite objects with centers of mass and compute composite forces. Inverse-squared attenuation would reduce the forces between the clusters and also their impact as suggested by Raf, so perhaps you could get by with computing the inter-cluster forces less frequently than the intra-cluster forces.

Turns out there is some literature on the subject, and several of the methods use ideas similar to the ones we've been kicking around here. For example, the Barnes-Hut algorithm uses an octree subdivision process to localize the force interactions. However, most of these methods seem to be focused on reducing the n² complexity rather than partitioning for parallelization. Barnes-Hut, for example, subdivides until there's only a single mass in each partition and repartitions for every time step.
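
A rough sketch of the Barnes-Hut force evaluation for the 2D case (a quadtree rather than the octree mentioned above), assuming a tree has already been built; tree construction, the per-step rebuild, and the node field names here are hypothetical. A node far enough away, judged by the usual opening-angle test size/distance < theta, is treated as a point mass at its centre of mass; otherwise we descend into its children.

#include <cmath>

struct BHNode {                // hypothetical quadtree node
	double cmx, cmy;       // centre of mass of everything under this node
	double mass;           // total mass under this node
	double size;           // side length of this node's square region
	BHNode* child[4];      // null children for empty quadrants / leaves
};

void addForceFrom(const BHNode* node, double x, double y, double m,
                  double theta, double& fx, double& fy)
{
	if( !node || node->mass == 0.0 ) return;
	double dx = node->cmx - x, dy = node->cmy - y;
	double d  = std::sqrt(dx*dx + dy*dy) + 1e-9;
	bool leaf = !node->child[0] && !node->child[1] &&
	            !node->child[2] && !node->child[3];
	if( leaf || node->size / d < theta ) {
		// far enough away (or a single body): treat the whole node as a
		// point mass at its centre of mass; a real version would also skip
		// the leaf containing the query body itself
		double f = m * node->mass / (d * d);
		fx += f * dx / d;
		fy += f * dy / d;
	} else {
		for( int c = 0; c < 4; ++c )
			addForceFrom(node->child[c], x, y, m, theta, fx, fy);
	}
}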

jimdempseyatthecove:

Robert,

Some cleaned-up code and additional comments.
(BTW, someone should fix this forum's paste-C++-code feature so that it handles less-than signs.)

// parallel task performing independent slices of volume
// no atomic operations required
void CalcPhysTask(size_t i, size_t j, size_t nVolumes)
{
	size_t	stride = CANTIDAD / nVolumes;
	size_t	iBegin = stride * i;
	size_t	iEnd = iBegin + stride;
	if((i+1) == nVolumes) iEnd = CANTIDAD;
	size_t	jBegin = stride * j;
	size_t	jEnd = jBegin + stride;
	if((j+1) == nVolumes) jEnd = CANTIDAD;
	for(i = iBegin; i < iEnd; i++)
	{
		for(size_t j = jBegin; j < jEnd; j++)
		{
			if(oM[i] != oM[j]){
				oM[i]->calculaPhys(oM[j]);
				veces++;
			}
		}
	}
}

void CalcPhys()
{
	size_t	nThreads = qt_get_num_threads();
	// Decide on optimal number of volumes
	//
	// 4 threads, 4 fixed partitions
	// p1   p2   p3   p4
	// 0,0  0,1  0,2  0,3
	// 1,1  1,2  1,3
	// 2,2  2,3
	// 3,3
	// utilization 10/16 (62.5% - 10 x task schedules)
	//
	// 4 threads, 8 fixed partitions
	// p1   p2   p3   p4   p5   p6   p7   p8
	// 0,0  0,1  0,2  0,3  0,4  0,5  0,6  0,7
	// 1,1  1,2  1,3  1,4  1,5  1,6  1,7
	// 2,2  2,3  2,4  2,5  2,6  2,7
	// 3,3  3,4  3,5  3,6  3,7
	// utilization 26/32 (81.25% - 24 x task schedules)
	//
	// 4 threads, 16 fixed partitions
	// ...
	// utilization 58/64 (90.625% - 58 x task schedules)
	// 4 threads, 32 fixed partitions
	// ...
	// utilization 122/128 (95.3125% - 122 x task schedules)
	//
	// An alternate way is to use variable sized partitions
	// 4 threads, 4 variable partitions (3 splits)
	// p1             p2              p3              p4   p5   p6   p7
	// 0,0  split 0:2 A,a  split 0:1  A,a  split 0    A,a  A,b  A,c  A,d
	// 1,1  into  A:D B,b  into  A:D  B,b  into  A:D  B,b  B,c  B,d
	// 2,2  split 1:3 C,c  split 2:3  C,c  split 3    C,c  C,d
	// 3,3  into  a:d D,d  into  a:d  D,d  into  a:d  D,d
	// utilization 58/64 (90.625% - 22 x task schedules)
	// Note, reduction in proportion of task schedules
	// below technique uses fixed partitions
	// Experiment with this
	size_t	nVolumes = nThreads * 8;
	size_t	i,j;
	for(i=0; i < nVolumes; ++i)
	{
		{ // scope of qtControl
			qtControl	qtControl;
			for(j=0; ((j < nThreads) && ((i+j) < nVolumes)); j++)
			{
				parallel_task(&qtControl, CalcPhysTask, j, i+j, nVolumes);
			}
		} // end scope of qtControl (synchronization point)
	}
}

Additional notes:

calculaPhys is (should be) an interaction between two particles (could be Coulomb or gravitational).

The above technique produces all particle-to-particle interactions. You can add optimizations such that when the particles are far away you do something different (to reduce the number of interaction calculations).

When the particles are sparsely distributed, such as gravitational bodies, you can collect nearby objects into barycenters and then use the effective aggregate mass at the barycenter; e.g. using the barycenter of Jupiter plus its moons may be sufficient for calculating its gravitational effects on an Earth-orbiting satellite. For evenly distributed particles (charged particles in a gas), a different technique can be applied to reduce the number of interaction calculations and yet maintain a specific level of accuracy in the calculations.

Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove:

dreamthtr,

Here is an OpenMP version using the diminishing-volumes technique (read the comments following the code sample); as always, test the code before assuming it is correct. You can rework this into TBB.

size_t	skew = 0;	// shared: all threads must see thread 0's updates
#pragma omp parallel reduction(+: veces)
{
	int nThreads = omp_get_num_threads();
	int myThreadNum = omp_get_thread_num();
	while(skew < CANTIDAD)
	{
		size_t	iBegin;
		size_t	iEnd;
		size_t	jBegin;
		size_t	jEnd;
		size_t	range = (CANTIDAD - skew) / nThreads;
		if(range == 0) range = 1;
		iBegin = range * myThreadNum;
		iEnd = iBegin + range;
		if(iEnd > CANTIDAD) iEnd = CANTIDAD;
		jBegin = iBegin + skew;
		jEnd = iEnd + skew;
		if(jEnd > CANTIDAD)
		{
			iEnd -= jEnd - CANTIDAD;
			jEnd = CANTIDAD;
		}
		for(size_t i = iBegin; i < iEnd; i++)
		{
			for(size_t j = jBegin; j < jEnd; j++)
			{
				if(oM[i] != oM[j]){
					oM[i]->calculaPhys(oM[j]);
					veces++;
				}
			}
		}
#pragma omp barrier
		if(myThreadNum == 0) skew += range;
#pragma omp barrier
	} // while(skew < CANTIDAD)
}

An important note to make is that the above will work well when all the threads scheduled for execution are available.
There are several factors that can interfere with this.

a) Your application (process) is sharing the system with other applications.
b) Your application may have additional threads competing for available cores (HW threads)
c) You are on a NUMA system and some of the objects are in near memory while others are in distant memory
d) Thermal throttling of CPU (or parts thereof) may vary the runtime of some of the threads

When any interference occurs, the blockage time at the barriers will increase.

This can be mitigated by adding code to increase the partitioning from nThreads to some larger number (e.g. nThreads * 128). This is a trade-off between increasing the number of iterations and reducing the time wasted at the barriers.
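
As a starting point for the suggested TBB rework, here is a hedged sketch of the same diminishing-skew schedule: one parallel_for per skew step replaces the barrier pair, since parallel_for only returns when every band of that step has finished. processBand is a hypothetical helper containing the nested i/j loops above, and the clipping of the last band to CANTIDAD is omitted for brevity.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// assumed: the nested i/j loops above, with jBegin = iBegin + skew, jEnd = iEnd + skew
void processBand(size_t iBegin, size_t iEnd, size_t skew);

class SkewStep {
	size_t range, skew;
public:
	SkewStep(size_t r, size_t s) : range(r), skew(s) {}
	void operator()( const tbb::blocked_range<size_t>& r ) const {
		for( size_t band = r.begin(); band != r.end(); ++band )
			processBand(band * range, (band + 1) * range, skew);
	}
};

void calcPhysSkewTBB(size_t cantidad, size_t nBands) {
	size_t skew = 0;
	while( skew < cantidad ) {
		size_t range = (cantidad - skew) / nBands;
		if( range == 0 ) range = 1;
		// implicit join of parallel_for plays the role of the barrier pair
		tbb::parallel_for(tbb::blocked_range<size_t>(0, nBands),
		                  SkewStep(range, skew));
		skew += range;
	}
}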

Jim Dempsey

www.quickthreadprogramming.com
robert-reed (Intel):

Quoting - jimdempseyatthecove
Some cleaned up code and additional comments

[snip]

calculaPhys is (should be) an interaction between two particles (could be Coulomb or gravitational).

The above technique produces all particle-to-particle interactions. You can add optimizations such that when the particles are far away you do something different (to reduce the number of interaction calculations).

When the particles are sparsely distributed, such as gravitational bodies, you can collect nearby objects into barycenters and then use the effective aggregate mass at the barycenter; e.g. using the barycenter of Jupiter plus its moons may be sufficient for calculating its gravitational effects on an Earth-orbiting satellite. For evenly distributed particles (charged particles in a gas), a different technique can be applied to reduce the number of interaction calculations and yet maintain a specific level of accuracy in the calculations.

Close, but no cigar. This latest reformulation of the blocking does reduce the potential races but it does not eliminate them.

Quoting - jimdempseyatthecove

// 4 threads, 8 fixed partitions
// p1   p2   p3   p4   p5   p6   p7   p8
// 0,0  0,1  0,2  0,3  0,4  0,5  0,6  0,7
// 1,1  1,2  1,3  1,4  1,5  1,6  1,7
// 2,2  2,3  2,4  2,5  2,6  2,7
// 3,3  3,4  3,5  3,6  3,7
// utilization 26/32 (81.25% - 24 x task schedules)

In the first pass of this modified algorithm, the individual partitions are truly disjoint and concurrent calculations can proceed without any cross-thread interactions. However, the same is not true of the second pass, ((0,1),(1,2),(2,3),(3,4)). In this case we have the potential for threads competing for access to masses on either side of the calculaPhys call. Likewise in passes 3 and 4: given Jim's assumption that calculaPhys updates status on both masses, there's still a lot of potential for races even with this organization.

There's also the underlying concern about how these updates get staged so that early updates on one time step avoid affecting later updates on the same time step as position/velocity/acceleration changes propagate through the array. But this is an issue with the original calculaPhys, whose details have not been shared in this thread but whose behavior can be inferred from the context.

Quoting - Robert Reed (Intel)

Close, but no cigar. This latest reformulation of the blocking does reduce the potential races but it does not eliminate them.

There exists in fact a cute way to compute the N^2 interactions that avoids races, but the solution is not completely trivial. I have written a short article on our blog at Cilk Arts that explains this technique. See
http://www.cilk.com/multicore-blog/bid/9328/A-cute-technique-for-avoidin...

robert-reed (Intel):

Quoting - matteocilk.com
There exists in fact a cute way to compute the N^2 interactions that avoids races, but the solution is not completely trivial. I have written a short article at our blog at Cilk Arts that explains this technique.

Ah, yes! That's a very interesting approach, with lots of recursive parallelism. There's just enough synchronization to avoid the races. It seems you should be able to do roughly the same thing in TBB using a two-function parallel_invoke in interact() and rect() (which would impose a few more synchronization points so might not scale quite as well). There is a bit of overkill in rect(). Since it starts with a square and always partitions equally in x and y, all the triangles should be isosceles right triangles and all the rectangles should be square, meaning that the else condition should only occur when di == dj == 1, thus obviating the need for the nested fors.
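
To make the parallel_invoke idea concrete, here is a hedged sketch of the recursive decomposition, written from the description above rather than taken from the Cilk Arts post. The interact()/rect() names follow the discussion, C++11 lambdas are used for brevity, and basecase(i, j) is assumed to update force[i] and force[j] for one pair, as in the code quoted later in this thread. The invariant is that the two calls launched by each parallel_invoke touch disjoint index ranges, so no locking is needed.

#include "tbb/parallel_invoke.h"

void basecase(int i, int j);        // assumed pair interaction (updates i and j)
const int GRAIN = 32;               // stop recursing below this size

// all pairs (i, j) with i in [i0, i1) and j in [j0, j1); the ranges are disjoint
void rect(int i0, int i1, int j0, int j1) {
	int di = i1 - i0, dj = j1 - j0;
	if( di > GRAIN && dj > GRAIN ) {   // a fuller version would also split when only one side is large
		int im = i0 + di / 2, jm = j0 + dj / 2;
		// the two "diagonal" quadrants touch disjoint particles: run in parallel
		tbb::parallel_invoke([=] { rect(i0, im, j0, jm); },
		                     [=] { rect(im, i1, jm, j1); });
		// then the two "off-diagonal" quadrants, also mutually disjoint
		tbb::parallel_invoke([=] { rect(i0, im, jm, j1); },
		                     [=] { rect(im, i1, j0, jm); });
	} else {
		for( int i = i0; i < i1; ++i )
			for( int j = j0; j < j1; ++j )
				basecase(i, j);
	}
}

// all pairs (i, j) with i0 <= i <= j < i1
void interact(int i0, int i1) {
	int d = i1 - i0;
	if( d > GRAIN ) {
		int im = i0 + d / 2;
		// the two sub-triangles touch disjoint particles: run in parallel
		tbb::parallel_invoke([=] { interact(i0, im); },
		                     [=] { interact(im, i1); });
		// the cross interactions between the two halves
		rect(i0, im, im, i1);
	} else {
		for( int i = i0; i < i1; ++i )
			for( int j = i; j < i1; ++j )
				basecase(i, j);
	}
}

// usage: interact(0, N) computes every unordered pair exactly once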

You're right. It's a very cute technique. Thanks.

Quoting - Robert Reed (Intel)

Since it starts with a square and always partitions equally in x and y, all the triangles should be isosceles right triangles and all the rectangles should be square, meaning that the else condition should only occur when di == dj == 1, thus obviating the need for the nested fors.

Actually, rect() starts with a square, but things become rectangular after the first cut if the square has an odd size. So you need to check both di and dj. (E.g., the base case may end up being 1x2 or 2x1.)

robert-reed (Intel):

Quoting - matteocilk.com
Actually, rect() starts with a square, but things become rectangular after the first cut if the square has an odd size. So you need to check both di and dj. (E.g., the base case may end up being 1x2 or 2x1.)

Oh, of course! You end up with two (on-diagonal) squares, one smaller than the other and two (off-diagonal) rectangles whose shapes are transposes of each other. Thanks for clarifying that.

Raf Schietekat:

I would just use a parallel_for(blocked_range2d) without first folding the domain along the diagonal, speculating that it is cheaper to do the work twice than to suffer all the synchronisation barriers implied by the fancy algorithm shown above. Has that algorithm's performance been explored for different combinations of problem size, number of cores, amount of data/work per pair? Or is that superfluous and can reasoning alone decide? Should I be embarrassed for not seeing the light as well as jealous for not having discovered it myself, or can I breathe easy? :-)

Should my suspicions have any merit, I would expect the main concern to instead be how to recycle data between tasks in the same thread, i.e., perhaps just subdividing down the middle of the longest side in blocked_range2d might not be optimal in this case, maybe it should be a blocked_range on only one dimension anyway, with a carefully chosen strip width (based on cache size, not number of elements and cores, i.e., using a simple_partitioner), and evaluated by sweeping a perpendicular row/column of accumulators along its length.

So, am I mistaken, or not really, or really not? :-)
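
A hedged sketch of that unfolded variant, using hypothetical flat arrays posx/posy/mass/fx/fy: each task owns a strip of i values, reads every j, and writes only to its own rows, so every pair is evaluated twice but no two tasks ever write to the same element. The explicit grain size plus simple_partitioner stands in for the cache-sized strip width mentioned above.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
#include <cmath>

extern double posx[], posy[], mass[], fx[], fy[];   // assumed arrays
extern const int N;                                  // assumed particle count

class StripBody {
public:
	void operator()( const tbb::blocked_range<int>& r ) const {
		for( int i = r.begin(); i != r.end(); ++i ) {
			double ax = 0.0, ay = 0.0;
			for( int j = 0; j < N; ++j ) {
				if( j == i ) continue;
				double dx = posx[j] - posx[i], dy = posy[j] - posy[i];
				double r2 = dx*dx + dy*dy + 1e-9;
				double f  = mass[i] * mass[j] / r2;
				double d  = std::sqrt(r2);
				ax += f * dx / d;
				ay += f * dy / d;
			}
			fx[i] = ax;                 // the only writes are to row i
			fy[i] = ay;
		}
	}
};

void forcesTwiceTheWork(int stripWidth) {
	tbb::parallel_for(tbb::blocked_range<int>(0, N, stripWidth),
	                  StripBody(), tbb::simple_partitioner());
}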

Quoting - Raf Schietekat
[snip]

Raf,

You are making it more complicated than it is, I think. Of course there is synchronization overhead; the way you minimize it is by stopping the recursion at a larger base case. For example, I have modified rect() to recur only when (i > 32 && j > 32) instead of (i > 1 && j > 1) and used this more realistic force:

#define N 30000

double force[N];
double pos[N][3];
double mass[N];

const double G = 6.673e-11; // gravitational constant

static inline double f(int i, int j)
{
	double dx = pos[i][0] - pos[j][0];
	double dy = pos[i][1] - pos[j][1];
	double dz = pos[i][2] - pos[j][2];
	double r2 = dx * dx + dy * dy + dz * dz;
	return G * (mass[i] * mass[j]) / r2;
}

static inline void basecase(int i, int j)
{
	if (i != j) {
		double t = f(i, j);
		force[i] += t;
		force[j] -= t;
	}
}

I then compared interact() against this simple sequential routine:

void sequential(int n)
{
	for (int i = 0; i < N; ++i)
		for (int j = i; j < N; ++j)
			basecase(i, j);
}

For the given value of N = 30000, both sequential() and interact() take about 4.3 seconds on a machine built by an Intel competitor, using gcc/cilk++ -O3. interact() scales linearly up to 16 cores (the maximum I have at hand).

While I have not coded the same example using TBB, I have no doubt that a straightforward TBB implementation would solve the problem just as well. The TBB overheads are pretty low, and if they turned out to be too high, just make 32 bigger. A larger base case forgives many sins (at the expense of parallelism).

(Your concerns would probably be justified if you were to code this example using pthreads. I agree that life would suck in this case.)

Raf Schietekat:

"interact() scales linearly up to 16 cores (the maximum I have at hand)."
That would seem like an acceptable outcome... sorry, I had missed any previous statement regarding scalability, and I am a bit surprised by this.

robert-reed (Intel):

About two months ago I suggested that Matteo's algorithm might also work using TBB parallel_invoke, a new parallel construct in the open source release. I've been working on an example and have some great results that I intend to share in a blog series I've started. My goal is to build a case study using the tools of Intel Parallel Studio in the development and discovery process.
