Reuse openMP threads

Hi, I have designed another algorithm which I think is fast enough, but I am more of an algorithms guy and I have a problem parallelizing my solution with OpenMP: it runs a little faster than the serial version, but no matter how many threads I use, the real time stays the same (while the user time and CPU consumption grow in proportion to the number of threads). We don't get to the big benchmarks (with 40 cores), so my teammate suggested that the problem is the time lost on thread creation. Our algorithm is something like:

while (condition) {
	#pragma omp parallel for private(i)
	for (i=0;...) {...}

	#pragma omp parallel for private(i)
	for (i=0;...) {...}

	#pragma omp parallel
	{
		tid = omp_get_thread_num();
		for (i=....; i+=tid) {...}
	}
}

We believe that each time a #pragma omp parallel directive is encountered, new threads are created, and we lose a lot of time on this. Is this true? We couldn't find information about it online.
Assuming our supposition is true, we tried to solve this by using the omp single directive, like this:
#pragma omp parallel
#pragma omp single
{
	while (condition) {
		#pragma omp for private(i)
		for (i=0;...) {...}

		#pragma omp for private(i)
		for (i=0;...) {...}

		#pragma omp parallel
		{
			tid = omp_get_thread_num();
			for (i=....; i+=tid) {...}
		}
	}
}

But it doesn't seem to work. Can anybody help us use OpenMP without recreating threads at each iteration or directive encounter?
Thank you!


First: you don't need to explicitly say that the i of a parallel for is private; the loop iteration variable of a worksharing loop is made private automatically. Second: the single directive makes the code inside it execute on a single thread, so I think its use there is slowing down your code. To reuse the threads, you should try to put all the code in one parallel region. This would also increase the grain size each thread works on, so you would be using them to do more.
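One way to read that advice, sketched below: open a single parallel region and put both worksharing loops inside it, so the thread team is created once and reused. The arrays `a`, `b` and the per-element work are illustrative assumptions, not the poster's actual code.

```c
#include <omp.h>

/* Sketch only: one parallel region encloses both loops, so the thread
   team is created once and reused for each worksharing construct. */
void iterate_once(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        #pragma omp for        /* the loop variable is implicitly private */
        for (int i = 0; i < n; i++)
            a[i] = a[i] * 2.0;

        #pragma omp for        /* same team, no new thread creation */
        for (int i = 0; i < n; i++)
            b[i] = a[i] + 1.0;
    }
}
```

In a real program the surrounding while loop can also live inside the parallel region, but then every thread executes the loop control, so the condition must evaluate identically on all threads.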

About the overhead of creating new threads, look here: https://fs.hlrs.de/projects/par/par_prog_ws/pdf/openmp_performance_1.pdf... on the page titled "Performance with OpenMP: Avoid thread creation". It also talks about reusing threads...

Depending on your code, I think that instead of:

#pragma omp for private(i)
for (i=0;...) { work1 }

#pragma omp for private(i)
for (i=0;...) { work2 }
You could write:
#pragma omp for private(i)
for (i=0;...) {
	work1
	#pragma omp barrier
	work2
}
Or even remove the barrier, if you know that work1 and work2 are independent. Good luck.
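A note of caution on the barrier variant: OpenMP does not allow a barrier region to be nested inside a worksharing loop, so only the barrier-free fusion is sketched here, assuming work2 of iteration i depends only on work1 of the same i. The arrays and the trivial bodies are hypothetical stand-ins for work1 and work2.

```c
#include <omp.h>

/* Fused version: work1 and work2 for index i run back to back in the
   same iteration. Valid as long as iteration i touches only its own
   data (arrays are illustrative assumptions). */
void fused(double *a, double *b, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = a[i] * 2.0;     /* work1 */
        b[i] = a[i] + 1.0;     /* work2, reads only this iteration's a[i] */
    }
}
```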

As grilomoto told you, OpenMP basically creates threads when it encounters the first #pragma omp parallel directive.

Note that at the end of a worksharing directive there is an implicit barrier, so in your code you are currently waiting three times for all your threads to finish (with 40 cores, you wait three times for all 40 to finish; if the load balancing is good, this can be negligible).

I see that you have two loops in your parallel region and a loop you want executed in each thread at the end. I don't know your code, but you may be interested in looking at these options:

#pragma omp for schedule(static) nowait

The schedule(static) clause, if you use it in both of your #pragma omp for directives, will ensure that the same thread gets the same i range across the loops.

The nowait clause removes the implicit synchronization at the end of the #pragma omp for directive.

A very simple example which works:

#pragma omp parallel
{
	#pragma omp for schedule(static) nowait
	for (i=0;...) {...}
}
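A fuller, self-contained version of the schedule(static) + nowait pattern; the arrays and loop bodies are assumptions. Because both loops use schedule(static) over the same iteration count, each thread gets the same i range in both, so reading a[i] in the second loop is safe even though nowait removed the barrier after the first.

```c
#include <omp.h>

/* Sketch: two worksharing loops in one parallel region. The nowait on
   the first loop lets a thread proceed immediately to its chunk of the
   second loop, where it only reads a[i] values it wrote itself. */
void static_nowait(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++)
            a[i] = i * 2.0;

        #pragma omp for schedule(static)   /* same chunking as above */
        for (int i = 0; i < n; i++)
            b[i] = a[i] + 1.0;
    }
}
```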

Maybe this will help you get better performance.

I agree with what was said above, but schedule(static) is not exactly a universal option. It is often the default, but if your iteration workload varies, you should also try schedule(dynamic) and schedule(guided) and see which one best suits your needs.

Best regards,
Nenad
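To illustrate the varying-workload case Nenad mentions, here is a minimal sketch with a made-up workload that grows with i; schedule(dynamic, 16) hands out chunks of 16 iterations on demand, so threads that finish their cheap early chunks pick up more work.

```c
#include <omp.h>

/* Sketch: the inner loop is a stand-in for per-iteration cost that
   grows with i. With a static schedule the thread owning the high-i
   range would become the bottleneck; dynamic balances this at the
   cost of some scheduling overhead. */
double uneven_sum(int n)
{
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        double x = 0.0;
        for (int k = 0; k < i; k++)
            x += 1.0;                 /* cost proportional to i */
        total += x;
    }
    return total;
}
```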

Thank you all for your replies. We didn't succeed in optimizing the thread use yet, but your responses (especially grilomoto's) gave us new ideas for approaching the problem.
