Loading...
You are not logged-in Login/Register





  • Posts   Search Threads
  • dreamthtrApril 29, 2009 12:26 PM PDT   
    New to Intel Tbb - Question regarding performance.

    Hi I'm new to Intel TBB and am trying to get more performance out of my app.

    My applications runs a simulation of gravity between particles of different mass in a 2d field, anyway the simplest way to calculate this is to take each particle and calculate the forces that are exterted in every one of the other particles.

    Any way i have a cicle that used to look like this:

    for(int i = 0; i<CANTIDAD; i++)
    for(int j=0; j<CANTIDAD; j++)
    {
    if(oM[i] != oM[j]){
    oM[i]->calculaPhys(oM[j]);
    veces++;
    }
    }
    }


    I modified that code to use a parallel_for like this:

    class ApplyPhys {
    objectMass *const my_Obj;
    public:
    ApplyPhys( objectMass obj[]): my_Obj(obj){}
    void operator()( const blocked_range<size_t>& r ) const{
    objectMass *obj = my_Obj;
    for( size_t i = r.begin(); i != r.end(); i++){
    funcTest(&obj[i]);
    }
    }

    };

    void funcTest(objectMass * obj){
    for (int i=0; i<CANTIDAD; i++){
    if(obj != oM[i]){
    if(obj->mass != 0){
    veces++;
    obj->calculaPhys(oM[i]);
    }
    if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){
    obj->mass += oM[i]->mass;
    oM[i]->mass =0;
    }
    }
    }
    }



    parallel_for(blocked_range<size_t>(0,CANTIDAD,1),ApplyPhys(*oM),auto_partitioner());



    Anyway I'm getting worse performance on using parallel_for that when using a serial approach.

    What could I do to increase performance?


    dreamthtrApril 29, 2009 12:48 PM PDT
    Rate
     
    Re: New to Intel Tbb

    Actually I can see an improvement of 2X by using:

    task_scheduler_init init (2);

    and

    task_scheduler_init init (1);

    So why is it that there is so much overhead when using parallel_for than when using a simple nested for.

    Quoting - dreamthtr
    Hi I'm new to Intel TBB and am trying to get more performance out of my app.

    My applications runs a simulation of gravity between particles of different mass in a 2d field, anyway the simplest way to calculate this is to take each particle and calculate the forces that are exterted in every one of the other particles.

    Any way i have a cicle that used to look like this:

    for(int i = 0; i<CANTIDAD; i++)
    for(int j=0; j<CANTIDAD; j++)
    {
    if(oM[i] != oM[j]){
    oM[i]->calculaPhys(oM[j]);
    veces++;
    }
    }
    }


    I modified that code to use a parallel_for like this:

    class ApplyPhys {
    objectMass *const my_Obj;
    public:
    ApplyPhys( objectMass obj[]): my_Obj(obj){}
    void operator()( const blocked_range<size_t>& r ) const{
    objectMass *obj = my_Obj;
    for( size_t i = r.begin(); i != r.end(); i++){
    funcTest(&obj[i]);
    }
    }

    };

    void funcTest(objectMass * obj){
    for (int i=0; i<CANTIDAD; i++){
    if(obj != oM[i]){
    if(obj->mass != 0){
    veces++;
    obj->calculaPhys(oM[i]);
    }
    if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){
    obj->mass += oM[i]->mass;
    oM[i]->mass =0;
    }
    }
    }
    }



    parallel_for(blocked_range<size_t>(0,CANTIDAD,1),ApplyPhys(*oM),auto_partitioner());



    Anyway I'm getting worse performance on using parallel_for that when using a serial approach.

    What could I do to increase performance?



    Robert Reed (Intel)April 29, 2009 2:02 PM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.

    Quoting - dreamthtr
    So why is it that there is so much overhead when using parallel_for than when using a simple nested for.

    The trick with parallel programming is isolating activities between the threads so as to keep the workers from interfering with each other.  It's hard to tell what interactions may be going on inside the physics function:

    obj->calculaPhys(oM[i]); 

    But clearly when the code collapses the sticky masses:

    if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){ 
       obj->mass += oM[i]->mass; 
       oM[i]->mass =0; 
    } 

    A thread that may have been limited by the blocked_range to a subset of objects has  the potential for touching any object.  Even if that object is itself being operated upon by another worker thread.  Such interference with each other's data means that the caches for the various workers are probably thrashing about, inducing a lot more memory pressure than might be exerted in the single threaded run.  Perhaps you might invest some time learning about dependence analysis.



    dreamthtrApril 29, 2009 2:19 PM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.

    Yes you are right in the physics function I'm accessing the variables of the object and probably that same object is being calculated in another thread

    Quoting - Robert Reed (Intel)

    The trick with parallel programming is isolating activities between the threads so as to keep the workers from interfering with each other.  It's hard to tell what interactions may be going on inside the physics function:

    obj->calculaPhys(oM[i]); 

    But clearly when the code collapses the sticky masses:

    if( abs(obj->x - oM[i]->x) < 0.5 && abs(obj->y - oM[i]->y) < 0.5){ 
    obj->mass += oM[i]->mass;
    oM[i]->mass =0;
    }

    A thread that may have been limited by the blocked_range to a subset of objects has  the potential for touching any object.  Even if that object is itself being operated upon by another worker thread.  Such interference with each other's data means that the caches for the various workers are probably thrashing about, inducing a lot more memory pressure than might be exerted in the single threaded run.  Perhaps you might invest some time learning about dependence analysis.




    Alexey Kukanov (Intel)April 30, 2009 1:26 PM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.

    Quoting - dreamthtr
    Actually I can see an improvement of 2X by using:

    task_scheduler_init init (2);

    and

    task_scheduler_init init (1);

    So why is it that there is so much overhead when using parallel_for than when using a simple nested for.


    So you see 2x speedup by TBB version with 2 threads over TBB version with 1 thread; And the TBB version with 1thread is much slower than your serial version, right? Then might be data sharing is less of a problem at the moment; rather some compiler optimizations might get missed.

    Robert Reed (Intel)April 30, 2009 2:05 PM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.

    Quoting - dreamthtr
    Yes you are right in the physics function I'm accessing the variables of the object and probably that same object is being calculated in another thread

    FYI, I was poking around on another task and ran across this bit on the Wikipedia page regarding parallel algorithms:

    Most of the available algorithms to compute Pi, on the other hand, can not be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three body problem, are also algorithms which are inherently serial.

    If as claimed the three body problem is inherently serial, restricting it to 2 dimensions is probably not sufficient to overcome that limitation.  Thinking about the problem for a minute, while it is true at any particular instant that you can compute the forces on each mass as separate calculations that could be assigned to separate HW threads, at some point the system state needs to be updated and every particle's position needs to be changed.  Time step size needs to be small enough to meet some error bound in the approximations, meaning there's some serious serialization required.  The solution steps would permit the force computations to occur in parallel and the object position updates to occur in parallel, but switching between these two states would require all the threads to synchronize and discard any old state about position and velocity.  And if you're considering real masses with volumes, densities and individual angular momenta (English?), state update gets even more complicated.



    robert.jay.gouldApril 30, 2009 6:58 PM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.

    Quoting - dreamthtr
    Yes you are right in the physics function I'm accessing the variables of the object and probably that same object is being calculated in another thread

    FYI, I was poking around on another task and ran across this bit on the Wikipedia page regarding parallel algorithms:

    Most of the available algorithms to compute Pi, on the other hand, can not be easily split up into parallel portions. They require the results from a preceding step to effectively carry on with the next step. Such problems are called inherently serial problems. Iterative numerical methods, such as Newton's method or the three body problem, are also algorithms which are inherently serial.

    If as claimed the three body problem is inherently serial, restricting it to 2 dimensions is probably not sufficient to overcome that limitation.  Thinking about the problem for a minute, while it is true at any particular instant that you can compute the forces on each mass as separate calculations that could be assigned to separate HW threads, at some point the system state needs to be updated and every particle's position needs to be changed.  Time step size needs to be small enough to meet some error bound in the approximations, meaning there's some serious serialization required.  The solution steps would permit the force computations to occur in parallel and the object position updates to occur in parallel, but switching between these two states would require all the threads to synchronize and discard any old state about position and velocity.  And if you're considering real masses with volumes, densities and individual angular momenta (English?), state update gets even more complicated.


    I'm assuming your goal here is accuracy, in which case as Robert Reed points out true parallelization will be hard/impossible(?). However we do this sort of calculation all the time in games (where we can be quite sloppy with our simulations), and the standard solution is to break space up into a grid/hash and calculate a step for each sub-system independently. When calculated in this fashion parallelization is trivial. Of course it's not scientifically accurate (at all), but good enough for entertainment purposes .
     



    Raf SchietekatApril 30, 2009 11:34 PM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.

    "it's not scientifically accurate"
    Since forces decrease with inverse square distance, I suppose that even more scientifically oriented computations would likely either ignore far-away objects (if there are far-away objects all around to largely cancel each other out), or only communicate aggregated information assumed to remain constant (or at best to follow extrapolated paths) over a number of steps. A scientific simulation also has the luxury to go back in time and use the previous approximation to get a better one. Well, that's how I would do it, anyway. :-)


    Robert Reed (Intel)May 1, 2009 12:34 AM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.

    Quoting - Raf Schietekat
    "it's not scientifically accurate"
    Since forces decrease with inverse square distance, I suppose that even more scientifically oriented computations would likely either ignore far-away objects (if there are far-away objects all around to largely cancel each other out), or only communicate aggregated information assumed to remain constant (or at best to follow extrapolated paths) over a number of steps. A scientific simulation also has the luxury to go back in time and use the previous approximation to get a better one. Well, that's how I would do it, anyway. :-)

    You could also identify clusters, work out the forces within each cluster, then treat the clusters as composite objects with centers of mass and compute composite forces.  Inverse-squared attenuation would reduce the forces between the clusters and also their impact as suggested by Raf, so perhaps you could get by with computing the inter-cluster forces less frequently than the intra-cluster forces.



    jimdempseyatthecoveMay 1, 2009 10:20 AM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.


    dreamthtr,

    The following is a potential solution using QuickThread, I will leave it as an exercize for you to rework it into TBB

    /*
    // single threaded code
    for(int i = 0; i(lessThan)CANTIDAD; i++) 
      for(int j=0; j(lessThan)CANTIDAD; j++) 
        { 
           if(oM[i] != oM[j]){ 
             oM[i]->calculaPhys(oM[j]); 
               veces++; 
            } 
       } 
    }  
    */
    
    // parallel task performing independent slices of volume
    // no atomic operations required
    void CalcPhysTask(size_t i, size_t j, size_t nVolumes)
    {
    	size_t	stride = CANTIDAD / nVolumes;
    	size_t	iBegin = stride * i;
    	size_t	iEnd = iBegin + stride;
    	if((i+1) == nVolumns) iEnd = CANTIDAD;
    	size_t	jBegin = stride * j;
    	size_t	jEnd = jBegin + stride;
    	if((j+1) == nVolumes) jEnd = CANTIDAD;
    	for(i = iBegin; i(lessThan)iEnd; i++) 
    	  for(int j=jBegin; j(lessThan)jEnd; j++) 
    		{ 
    		   if(oM[i] != oM[j]){ 
    			 oM[i]->calculaPhys(oM[j]); 
    			   veces++; 
    			} 
    	   } 
    	}  
    }
    
    void CalcPhys()
    {
    	size_t	nThreads = qt_get_num_threads();
    	// nVolumns = Arbitrary tuning parameter
    	// Determine an optimal number of volumes
    	// The processing loops are
    	// for(i=0; i(lessThan)nVolumes; ++i)
    	//		for(j=i; j(lessThan)nVolumes; j++)
    	//
    	// When nVolumes-i becomes less than number of threads
    	// then some cores become under utilized
    	// However, if you create too many volumes
    	// then you incure excessive task enqueuing overhead
    	// This will have to be experimented depending on system
    	size_t	nVolumes = CANTIDAD / nThreads / 128;
    	if(nVolumes (lessThan) nThreads) nVolumes = CANTIDAD / nThreads / 16;
    	if(nVolumes (lessThan) nThreads) nVolumes = CANTIDAD / nThreads / 4;
    	if(nVolumes (lessThan) nThreads) nVolumes = CANTIDAD / nThreads;
    	if(nVolumes (lessThan) nThreads)
    	{
    		if(CANTIDAD (lessThanOrEquql) 1)
    			return;
    		nVolumes = nThreads = CANTIDAD;
    	}
    	size_t	i, j;
    	for(i=0; i(lessThan)nVolumes; ++i)
    	{
    		{ // scope of qtControl
    			qtControl	qtControl;
    			for(j=i; j(lessThan)nVolumes; j++)
    			{
    				parallel_task(&qtControl, CalcPhysTask, i, (i+j)%nVolumes, nVolumes);
    			}
    		} // end scope of qtControl (synchronization point)
    	}
    }
    
    

    Replace the (lessThan) and (lessThanOrEqual) with the appropriate glyphs. The paste C++ code is having problems with left angle bracket.

    The general idea is to partition the iteration space into "volumes" (not geometric volumes, rather subscript range volumes). Then permutate on the volumes in a manner that avoids collision with other threads. By using this technique you can avoid a requirement of using atomic additions into objects (i.e. gravitational acceleration/force component of the object).

    Jim Dempsey

    Blog: The Parallel Void
    www.quickthreadprogramming.com

    jimdempseyatthecoveMay 1, 2009 10:23 AM PDT
    Rate
     
    Re: New to Intel Tbb - Question regarding performance.


    Please excuse the typo's (sorry about that)

    Jim

    Blog: The Parallel Void
    www.quickthreadprogramming.com

Forum jump:  

Intel Software Network Forums Statistics

17,025 users have contributed to 48,321 threads and 172,753 posts to date.

In the past 24 hours, we have 16 new thread(s) 57 new posts(s), and 54 new user(s).

In the past 3 days, the most popular thread for everyone has been How to manage rounding by IVF ?? The most posts were made to Most likely, the issue is that The post with the most views is Optimalization of sine function\'s taylor expansion

Please welcome our newest member redfruit83


For more complete information about compiler optimizations, see our Optimization Notice.