New to Intel Tbb - Question regarding performance.
Hi I'm new to Intel TBB and am trying to get more performance out of my app.
My applications runs a simulation of gravity between particles of different mass in a 2d field, anyway the simplest way to calculate this is to take each particle and calculate the forces that are exterted in every one of the other particles.
Any way i have a cicle that used to look like this:
for(int i = 0; i<CANTIDAD; i++) for(int j=0; j<CANTIDAD; j++) { if(oM[i] != oM[j]){ oM[i]->calculaPhys(oM[j]); veces++; } } }
I modified that code to use a parallel_for like this:
class ApplyPhys { objectMass *const my_Obj; public: ApplyPhys( objectMass obj[]): my_Obj(obj){} void operator()( const blocked_range<size_t>& r ) const{ objectMass *obj = my_Obj; for( size_t i = r.begin(); i != r.end(); i++){ funcTest(&obj[i]); } } };
void funcTest(objectMass * obj){ for (int i=0; i<CANTIDAD; i++){ if(obj != oM[i]){ if(obj->mass != 0){ veces++; obj->calculaPhys(oM[i]); } if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){ obj->mass += oM[i]->mass; oM[i]->mass =0; } } } }
parallel_for(blocked_range<size_t>(0,CANTIDAD,1),ApplyPhys(*oM),auto_partitioner());
Anyway I'm getting worse performance on using parallel_for that when using a serial approach.
What could I do to increase performance?
| |
Re: New to Intel Tbb
Actually I can see an improvement of 2X by using: task_scheduler_init init (2); and task_scheduler_init init (1); So why is it that there is so much overhead when using parallel_for than when using a simple nested for. Quoting - dreamthtr
Hi I'm new to Intel TBB and am trying to get more performance out of my app.
My applications runs a simulation of gravity between particles of different mass in a 2d field, anyway the simplest way to calculate this is to take each particle and calculate the forces that are exterted in every one of the other particles.
Any way i have a cicle that used to look like this:
for(int i = 0; i<CANTIDAD; i++) for(int j=0; j<CANTIDAD; j++) { if(oM[i] != oM[j]){ oM[i]->calculaPhys(oM[j]); veces++; } } }
I modified that code to use a parallel_for like this:
class ApplyPhys { objectMass *const my_Obj; public: ApplyPhys( objectMass obj[]): my_Obj(obj){} void operator()( const blocked_range<size_t>& r ) const{ objectMass *obj = my_Obj; for( size_t i = r.begin(); i != r.end(); i++){ funcTest(&obj[i]); } } };
void funcTest(objectMass * obj){ for (int i=0; i<CANTIDAD; i++){ if(obj != oM[i]){ if(obj->mass != 0){ veces++; obj->calculaPhys(oM[i]); } if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){ obj->mass += oM[i]->mass; oM[i]->mass =0; } } } }
parallel_for(blocked_range<size_t>(0,CANTIDAD,1),ApplyPhys(*oM),auto_partitioner());
Anyway I'm getting worse performance on using parallel_for that when using a serial approach.
What could I do to increase performance?
| |
Re: New to Intel Tbb - Question regarding performance.
So why is it that there is so much overhead when using parallel_for than when using a simple nested for.
The trick with parallel programming is isolating activities between the threads so as to keep the workers from interfering with each other. It's hard to tell what interactions may be going on inside the physics function:
obj->calculaPhys(oM[i]);
But clearly when the code collapses the sticky masses:
if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){
obj->mass += oM[i]->mass;
oM[i]->mass =0;
}
A thread that may have been limited by the blocked_range to a subset of objects has the potential for touching any object. Even if that object is itself being operated upon by another worker thread. Such interference with each other's data means that the caches for the various workers are probably thrashing about, inducing a lot more memory pressure than might be exerted in the single threaded run. Perhaps you might invest some time learning about dependence analysis.
| |
Re: New to Intel Tbb - Question regarding performance.
Yes you are right in the physics function I'm accessing the variables of the object and probably that same object is being calculated in another thread Quoting - Robert Reed (Intel)
The trick with parallel programming is isolating activities between the threads so as to keep the workers from interfering with each other. It's hard to tell what interactions may be going on inside the physics function:
obj->calculaPhys(oM[i]);
But clearly when the code collapses the sticky masses:
if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){ obj->mass += oM[i]->mass; oM[i]->mass =0; }
A thread that may have been limited by the blocked_range to a subset of objects has the potential for touching any object. Even if that object is itself being operated upon by another worker thread. Such interference with each other's data means that the caches for the various workers are probably thrashing about, inducing a lot more memory pressure than might be exerted in the single threaded run. Perhaps you might invest some time learning about dependence analysis.
| |
Re: New to Intel Tbb - Question regarding performance.
Actually I can see an improvement of 2X by using:
task_scheduler_init init (2);
and
task_scheduler_init init (1);
So why is it that there is so much overhead when using parallel_for than when using a simple nested for.
So you see 2x speedup by TBB version with 2 threads over TBB version with 1 thread; And the TBB version with 1thread is much slower than your serial version, right? Then might be data sharing is less of a problem at the moment; rather some compiler optimizations might get missed.
| |
Re: New to Intel Tbb - Question regarding performance.
Yes you are right in the physics function I'm accessing the variables of the object and probably that same object is being calculated in another thread
FYI, I was poking around on another task and ran across this bit on the Wikipedia page regarding parallel algorithms:
Most of the available algorithms to compute Pi, on the other hand, can not be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three body problem, are also algorithms which are inherently serial.
If as claimed the three body problem is inherently serial, restricting it to 2 dimensions is probably not sufficient to overcome that limitation. Thinking about the problem for a minute, while it is true at any particular instant that you can compute the forces on each mass as separate calculations that could be assigned to separate HW threads, at some point the system state needs to be updated and every particle's position needs to be changed. Time step size needs to be small enough to meet some error bound in the approximations, meaning there's some serious serialization required. The solution steps would permit the force computations to occur in parallel and the object position updates to occur in parallel, but switching between these two states would require all the threads to synchronize and discard any old state about position and velocity. And if you're considering real masses with volumes, densities and individual angular momenta (English?), state update gets even more complicated.
| |
Re: New to Intel Tbb - Question regarding performance.
Yes you are right in the physics function I'm accessing the variables of the object and probably that same object is being calculated in another thread
FYI, I was poking around on another task and ran across this bit on the Wikipedia page regarding parallel algorithms:
Most of the available algorithms to compute Pi, on the other hand, can not be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three body problem, are also algorithms which are inherently serial.
If as claimed the three body problem is inherently serial, restricting it to 2 dimensions is probably not sufficient to overcome that limitation. Thinking about the problem for a minute, while it is true at any particular instant that you can compute the forces on each mass as separate calculations that could be assigned to separate HW threads, at some point the system state needs to be updated and every particle's position needs to be changed. Time step size needs to be small enough to meet some error bound in the approximations, meaning there's some serious serialization required. The solution steps would permit the force computations to occur in parallel and the object position updates to occur in parallel, but switching between these two states would require all the threads to synchronize and discard any old state about position and velocity. And if you're considering real masses with volumes, densities and individual angular momenta (English?), state update gets even more complicated.
I'm assuming your goal here is accuracy, in which case as Robert Reed points out true parallelization will be hard/impossible(?). However we do this sort of calculation all the time in games (where we can be quite sloppy with our simulations), and the standard solution is to break space up into a grid/hash and calculate a step for each sub-system independently. When calculated in this fashion parallelization is trivial. Of course it's not scientifically accurate (at all), but good enough for entertainment purposes .
| |
Re: New to Intel Tbb - Question regarding performance.
"it's not scientifically accurate" Since forces decrease with inverse square distance, I suppose that even more scientifically oriented computations would likely either ignore far-away objects (if there are far-away objects all around to largely cancel each other out), or only communicate aggregated information assumed to remain constant (or at best to follow extrapolated paths) over a number of steps. A scientific simulation also has the luxury to go back in time and use the previous approximation to get a better one. Well, that's how I would do it, anyway. :-)
| |
Re: New to Intel Tbb - Question regarding performance.
"it's not scientifically accurate" Since forces decrease with inverse square distance, I suppose that even more scientifically oriented computations would likely either ignore far-away objects (if there are far-away objects all around to largely cancel each other out), or only communicate aggregated information assumed to remain constant (or at best to follow extrapolated paths) over a number of steps. A scientific simulation also has the luxury to go back in time and use the previous approximation to get a better one. Well, that's how I would do it, anyway. :-)
You could also identify clusters, work out the forces within each cluster, then treat the clusters as composite objects with centers of mass and compute composite forces. Inverse-squared attenuation would reduce the forces between the clusters and also their impact as suggested by Raf, so perhaps you could get by with computing the inter-cluster forces less frequently than the intra-cluster forces.
| |
Re: New to Intel Tbb - Question regarding performance.
dreamthtr,
The following is a potential solution using QuickThread, I will leave it as an exercize for you to rework it into TBB
/*
// single threaded code
for(int i = 0; i(lessThan)CANTIDAD; i++)
for(int j=0; j(lessThan)CANTIDAD; j++)
{
if(oM[i] != oM[j]){
oM[i]->calculaPhys(oM[j]);
veces++;
}
}
}
*/
// parallel task performing independent slices of volume
// no atomic operations required
void CalcPhysTask(size_t i, size_t j, size_t nVolumes)
{
size_t stride = CANTIDAD / nVolumes;
size_t iBegin = stride * i;
size_t iEnd = iBegin + stride;
if((i+1) == nVolumns) iEnd = CANTIDAD;
size_t jBegin = stride * j;
size_t jEnd = jBegin + stride;
if((j+1) == nVolumes) jEnd = CANTIDAD;
for(i = iBegin; i(lessThan)iEnd; i++)
for(int j=jBegin; j(lessThan)jEnd; j++)
{
if(oM[i] != oM[j]){
oM[i]->calculaPhys(oM[j]);
veces++;
}
}
}
}
void CalcPhys()
{
size_t nThreads = qt_get_num_threads();
// nVolumns = Arbitrary tuning parameter
// Determine an optimal number of volumes
// The processing loops are
// for(i=0; i(lessThan)nVolumes; ++i)
// for(j=i; j(lessThan)nVolumes; j++)
//
// When nVolumes-i becomes less than number of threads
// then some cores become under utilized
// However, if you create too many volumes
// then you incure excessive task enqueuing overhead
// This will have to be experimented depending on system
size_t nVolumes = CANTIDAD / nThreads / 128;
if(nVolumes (lessThan) nThreads) nVolumes = CANTIDAD / nThreads / 16;
if(nVolumes (lessThan) nThreads) nVolumes = CANTIDAD / nThreads / 4;
if(nVolumes (lessThan) nThreads) nVolumes = CANTIDAD / nThreads;
if(nVolumes (lessThan) nThreads)
{
if(CANTIDAD (lessThanOrEquql) 1)
return;
nVolumes = nThreads = CANTIDAD;
}
size_t i, j;
for(i=0; i(lessThan)nVolumes; ++i)
{
{ // scope of qtControl
qtControl qtControl;
for(j=i; j(lessThan)nVolumes; j++)
{
parallel_task(&qtControl, CalcPhysTask, i, (i+j)%nVolumes, nVolumes);
}
} // end scope of qtControl (synchronization point)
}
}
Replace the (lessThan) and (lessThanOrEqual) with the appropriate glyphs. The paste C++ code is having problems with left angle bracket.
The general idea is to partition the iteration space into "volumes" (not geometric volumes, rather subscript range volumes). Then permutate on the volumes in a manner that avoids collision with other threads. By using this technique you can avoid a requirement of using atomic additions into objects (i.e. gravitational acceleration/force component of the object).
Jim Dempsey
Blog: The Parallel Void
www.quickthreadprogramming.com | |
Re: New to Intel Tbb - Question regarding performance.
Please excuse the typo's (sorry about that)
Jim
Blog: The Parallel Void
www.quickthreadprogramming.com | | |