How do I use TBB to eliminate a data race?

Hello everyone!

When I use parallel_for to parallelize my program, Intel Inspector detects a data race (a read/write race). But I cannot find a detailed method for eliminating the race. Can anyone give me a suggestion? Thank you very much!


Hi Frank!

It's hard to help you with that little info. It's better if you post the code you have and maybe a screenshot of the Inspector when the race is detected.

 

Cheers!

Read http://en.wikipedia.org/wiki/Race_condition for an overview of race conditions.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

Adri C. S. wrote:

Hi Frank!

It's hard to help you with that little info. It's better if you post the code you have and maybe a screenshot of the Inspector when the race is detected.

 

It is a read/write data race. As the picture shows, the variable jy has a read and write race. What can I do? Would an atomic work, or should I use a mutex or lock?

 

Attachments:

Attachment | Size
QQ截图20131227101043.png | 128.84 KB

The question we need you to answer is:

a) Do all worker threads use the same [iter->number] range?
or
b) Does each worker have its own (different) [iter->number] range?

In case a), consider using a reducer (each worker accumulates += values independently, and the partial results are reduced once later).
In case b), the only conflict is between one worker thread issuing its first [iter->number] and the adjacent worker thread issuing its last [iter->number]. In that case you only need a mutex (or a reducer) at the first and last iter->number.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

The question we need you to answer is:

a) Do all worker threads use the same [iter->number] range?
or
b) Does each worker have its own (different) [iter->number] range?

In case a), consider using a reducer (each worker accumulates += values independently, and the partial results are reduced once later).
In case b), the only conflict is between one worker thread issuing its first [iter->number] and the adjacent worker thread issuing its last [iter->number]. In that case you only need a mutex (or a reducer) at the first and last iter->number.

Jim Dempsey

Can you give me some information about case a), using a reducer? I am a novice.

You can look at the TBB parallel_reduce examples, or consult a recursion example such as: C:\Program Files (x86)\Intel\Composer XE 2011 SP1\tbb\examples\task\tree_sum (or wherever you installed the samples on Linux), or use something like the following (untested code):

double sum(double A[], size_t n)
{
  if(n < MIN_N_FOR_SUM)   // small enough: sum serially
  {
    double ret = 0.0;
    for(size_t i=0; i<n; ++i)
      ret += A[i];
    return ret;
  }
  else
  {
    // split point rounded down to a cache-line boundary (64 bytes)
    size_t n1 = (n / 2) - ((n / 2) % (64 / sizeof(double)));
    size_t n2 = n - n1;
    double sum1, sum2;
    tbb::parallel_invoke(               // run both halves in parallel;
      [&]() { sum1 = sum(A, n1); },     // capture by reference so the
      [&]() { sum2 = sum(A+n1, n2); }); // lambdas can see A, n1, n2, sum1, sum2
    return sum1 + sum2;
  }
}

Jim Dempsey

 

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

You can look at the TBB parallel_reduce examples or you might want to consult a recursion example such as: C:\Program Files (x86)\Intel\Composer XE 2011 SP1\tbb\examples\task\tree_sum

(or wherever you installed the samples on Linux.

or (untested code)

double sum(double A[], size_t n)
{
  if(n < MIN_N_FOR_SUM)   // small enough: sum serially
  {
    double ret = 0.0;
    for(size_t i=0; i<n; ++i)
      ret += A[i];
    return ret;
  }
  else
  {
    // split point rounded down to a cache-line boundary (64 bytes)
    size_t n1 = (n / 2) - ((n / 2) % (64 / sizeof(double)));
    size_t n2 = n - n1;
    double sum1, sum2;
    tbb::parallel_invoke(               // run both halves in parallel;
      [&]() { sum1 = sum(A, n1); },     // capture by reference so the
      [&]() { sum2 = sum(A+n1, n2); }); // lambdas can see A, n1, n2, sum1, sum2
    return sum1 + sum2;
  }
}

Jim Dempsey

 

Thank you very much, my dear friend! My code is shown below; when I turned it into a parallel_for, the problem appeared.

for( iter=cell_n+3; iter!=iter_end-3; iter++ )
{                        
    iter->fp = (iter-1)->fp_old1 - pidt1 * (iter-1)->jy;    
    iter->gm = (iter-1)->gm_old1 - pidt1 * (iter-1)->jz;
    iter->fm = (iter+1)->fm - pidt1 * iter->jy;
    iter->gp = (iter+1)->gp - pidt1 * iter->jz;
    iter->ey = iter->fp + iter->fm;   iter->bz = iter->fp - iter->fm;
    iter->ez = iter->gp + iter->gm;   iter->by = iter->gp - iter->gm;
    iter->ex -= 2.0 * pidt1 * iter->jx;
}

In the parallel_for version, each thread accesses fm and gp on the 5th and 6th lines of the loop, so there is a data race. How do I remove this race? Thank you!

(Note that Jim's code sample emphasises clarity, not performance.)

At first sight fm and gp are the target of a "scan" operation over jy and jz from right to left, so you would have to move them into a separate loop running in that direction. Use tbb::parallel_scan() for large intervals (one forward for fp and gm, one backward for fm and gp). You could parallel_invoke() those scans and then use tbb::parallel_for to compute e{x,y,z} and b{y,z} (clearest), or run one of the scans first and then everything else in the other scan (tbb::parallel_for() stuff integrated in the final pass of the second scan, which might make better use of the cache).

Quote:

Raf Schietekat wrote:

(Note that Jim's code sample emphasises clarity, not performance.)

At first sight fm and gp are the target of a "scan" operation over jy and jz from right to left, so you would have to move them into a separate loop running in that direction. Use tbb::parallel_scan() for large intervals (one forward for fp and gm, one backward for fm and gp). You could parallel_invoke() those scans and then use tbb::parallel_for to compute e{x,y,z} and b{y,z} (clearest), or run one of the scans first and then everything else in the other scan (tbb::parallel_for() stuff integrated in the final pass of the second scan, which might make better use of the cache).

Thank you! Could you recommend some material for learning TBB, such as books or websites? I also have a question about the following code:

for( iter = part_n, iter_end = part_n + (vector_part.end() - vector_part.begin()); iter != iter_end; iter++ )
{
    has_to_change_cell( cell_n, iter );   // put particles on stack
}

inline void propagate::has_to_change_cell( struct cell *cell, struct particle *part )
{
  if ( part->x < cell[part->number].x )
  {
    cell[part->number].np[part->species]--;
    cell[part->number-1].np[part->species]++;
    cell[part->number].npart--;
    cell[part->number-1].npart++;
    part->number--;
  }
  else if ( part->x >= cell[part->number].x + dx )
  {
    cell[part->number].np[part->species]--;
    cell[part->number+1].np[part->species]++;
    cell[part->number].npart--;
    cell[part->number+1].npart++;
    part->number++;
  }
}

Then, between the threads, "cell[part->number].np[part->species]" will have a data race. What can I do? Thank you very much!

At TBB's own website you will find the online "User Guide and Design Patterns", and a picture of the book "Intel Threading Building Blocks" with a link to a large online shop for books and other stuff. The book might not be up to date anymore, but it's still a good read.

Unless the problem has some relevant structure that I missed, you could make np and npart atomic.

Quote:

Raf Schietekat wrote:

At TBB's own website you will find the online "User Guide and Design Patterns", and a picture of the book "Intel Threading Building Blocks" with a link to a large online shop for books and other stuff. The book might not be up to date anymore, but it's still a good read.

Unless the problem has some relevant structure that I missed, you could make np and npart atomic.

Thank you! I tried to use atomic operations before; maybe I wrote the wrong statement. Could you tell me which statement I should use?

++ and -- are implemented by tbb::atomic (at least for integers).
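As a sketch of that usage, here is a counter made race-free with std::atomic, the standard C++11 equivalent of tbb::atomic (the counter name np and the loop bounds are illustrative, not from the poster's code):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several threads increment the same counter concurrently.
// std::atomic<int> supports the same ++/-- usage as tbb::atomic<int>,
// so each increment is indivisible and no data race occurs.
int race_free_count(int nthreads, int iters)
{
    std::atomic<int> np{0};                  // the shared tally, made atomic
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&]{
            for (int i = 0; i < iters; ++i)
                ++np;                        // atomic increment: no data race
        });
    for (auto& th : pool) th.join();
    return np.load();
}
```

With a plain int the final count would usually come up short because concurrent increments would be lost.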

While you can solve this with atomics, the problem is performance. I would suggest something along the following lines:

inline void propagate::has_to_change_cell( struct cell *cell, struct particle *part )
{
  if ( part->x < cell[part->number].x )
  {
    cell[part->number].np[part->species]--;
    cell[part->number].np_prior[part->species]++;
    cell[part->number].npart--;
    part->number--;
  }
  else if ( part->x >= cell[part->number].x + dx )
  {
    cell[part->number].np[part->species]--;
    cell[part->number].np_following[part->species]++;
    cell[part->number].npart--;
    part->number++;
  }
}

inline int propagate::get_np( struct cell *cell, struct particle *part )
{
  return cell[part->number].np[part->species]
       + cell[part->number-1].np_following[part->species]
       + cell[part->number+1].np_prior[part->species];
}

Jim Dempsey

www.quickthreadprogramming.com

It goes without saying that you can occasionally run a parallel_for to reconcile (collapse) np_following and np_prior into np.

Jim Dempsey

www.quickthreadprogramming.com

I like the idea, but I'm afraid it won't work here, because the -1 and +1 are on "part->number", not on "part".

A simple situation that would fail is if all "number" and "species" values are the same for each *part.

In both the original code and my suggested code there is a requirement that all part->number's are unique. Though, all part->species need not be unique.

My code suggestion removes the +1 and -1 on the cell subscript, and adds two arrays of size of the np array. The tally of the adjacent cells will contain the proper tallies for the changes.

These tallies are what would have been added/subtracted from the +1 and -1 entries (which can be summed when fetching the new counts).

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

In both the original code and my suggested code there is a requirement that all part->number's are unique. Though, all part->species need not be unique.

My code suggestion removes the +1 and -1 on the cell subscript, and adds two arrays of size of the np array. The tally of the adjacent cells will contain the proper tallies for the changes.

These tallies are what would have been added/subtracted from the +1 and -1 entries (which can be summed when fetching the new counts).

Jim Dempsey

Thank you very much! The situation is the one Raf Schietekat describes: all "number" and "species" values can be the same for each *part. Within one cell there are several particles, and those particles' number is the same as the cell's number.

My assumption is:

You have a parallel_for that iterates over part numbers [0:nParts) and passes the part object pointer, such as pParts[i].

If this is the case, then no two threads will have the same part numbers, and thus will not use the same indexes of cell.

*** However, you have an issue with the +1 and -1 at the boundaries of the thread subranges.

Note, you can insert into your code a test to see whether the index of the current cell is the first or last iteration. In those cases you can perform an atomic operation (__sync_fetch_and_add); otherwise, on interior locations, simply ++ or -- as you do now.

Jim Dempsey

www.quickthreadprogramming.com

And my interpretation is that "number" means "cell number", with possibly multiple parts belonging to the same cell. The iteration starts at part_n, which seems to be a random-access iterator, not an index.

Quote:

Raf Schietekat wrote:

And my interpretation is that "number" means "cell number", with possibly multiple parts belonging to the same cell. The iteration starts at part_n, which seems to be a random-access iterator, not an index.

The code showed:

cell[part->number]....
not
cell[i]...
or
cell[itr]...

Jim Dempsey

 

www.quickthreadprogramming.com
