Do parallel sections use persistent tasks?

Hi,

I'm concerned about thread creation overhead in some code that I'm writing.

I have a bit of code like this:

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(ROW,COL)
!$OMP DO SCHEDULE(STATIC,1)
DO ROW = 1,N
DO COL = 1,M
!set global foo(row,col)
END DO
END DO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

This code is in a Fortran subroutine that is called many times. What I think is happening is (1): N threads are created the first time that this PARALLEL section is encountered, and the threads stick around so that thread creation overhead is minimal.

However, I am worried that I'm wrong and that what is happening is (2): N threads are created each time the parallel section is encountered, which is overhead that I would have to work to eliminate.

Can someone verify whether (1) or (2) is what happens?

Thanks for your time.

If I'm guessing correctly what you mean, it's not covered in the standard.  Individual implementations have a scheme to keep the thread pool alive for a time interval so as to shorten the time required to re-enter the parallel region, such as the Kuck-Intel KMP_BLOCKTIME.

Actually, the code provided has only one parallel region, whereas you are asking about multiple parallel regions.  So we have to guess that the complete code does have multiple parallel regions.  The specification has a side note on thread persistence: threadprivate data should persist between parallel regions if a number of conditions hold (see the OpenMP specification for the complete list of conditions).  That usually means the same threads execute subsequent parallel regions.
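
To make the threadprivate point concrete, here is a minimal sketch, not a definitive test.  The module tp_demo, the variable stamp and the function count_persisting are names made up for this illustration.  Each thread writes its thread number into its own threadprivate copy in a first parallel region, and a second region counts how many threads still see the value they wrote.  It assumes dynamic thread adjustment is off, both regions request the same team size, and neither region is nested; build it with OpenMP enabled (for example -fopenmp or -qopenmp).

```fortran
! Sketch: threadprivate data surviving between two parallel regions.
! All names here are hypothetical and chosen for illustration only.
module tp_demo
  use omp_lib
  implicit none
  integer, save :: stamp = -1          ! each thread gets its own copy
  !$omp threadprivate(stamp)
contains
  ! Returns how many threads in the second region still see the value
  ! they stored in the first region.
  function count_persisting(nthreads) result(nkept)
    integer, intent(in) :: nthreads
    integer :: nkept
    integer :: kept

    ! Persistence is only specified when dynamic adjustment is off and
    ! both regions run with the same number of threads.
    call omp_set_dynamic(.false.)

    !$omp parallel num_threads(nthreads)
    stamp = omp_get_thread_num()       ! first region: mark this thread's copy
    !$omp end parallel

    kept = 0
    !$omp parallel num_threads(nthreads) reduction(+:kept)
    if (stamp == omp_get_thread_num()) kept = kept + 1
    !$omp end parallel

    nkept = kept
  end function count_persisting
end module tp_demo

program tp_main
  use tp_demo
  implicit none
  print *, 'threads that kept their value:', count_persisting(4)
end program tp_main
```

If the count equals the team size, the threadprivate copies survived from one region to the next.  This shows data persistence only; the runtime re-using the same threads is the usual way that is achieved, not something the sketch proves directly.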

If you are asking about the Intel(R) OpenMP runtime library, then the behavior is similar to (1) in your description.  I would also note that you mention N threads while the code has a loop with N iterations.  Actually, the number of threads created is not related to the number of loop iterations.  So if you explicitly ask for N threads, the library will create them; otherwise the default number of threads will be created once and then re-used in subsequent parallel regions.
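
As a rough illustration of the point that team size and iteration count are unrelated, the sketch below reports how many threads executed a loop of a given length.  The module team_size_demo and the function team_size_for_loop are hypothetical names; only the omp_lib routines and the NUM_THREADS clause are standard OpenMP.  A non-positive requested value is taken here to mean "use the default", which is a convention of this sketch.

```fortran
! Sketch: the team size comes from the runtime default or from an explicit
! request, not from the number of loop iterations.  Names are hypothetical.
module team_size_demo
  use omp_lib
  implicit none
contains
  ! Runs a loop of n iterations and returns the size of the team that ran it.
  ! requested > 0 is passed to NUM_THREADS; otherwise the default is used.
  function team_size_for_loop(n, requested) result(nteam)
    integer, intent(in) :: n, requested
    integer :: nteam
    integer :: i, nreq, total

    nreq = requested
    if (nreq <= 0) nreq = omp_get_max_threads()   ! default team size

    nteam = 0
    total = 0
    !$omp parallel num_threads(nreq) default(shared) private(i)
    !$omp single
    nteam = omp_get_num_threads()      ! recorded by exactly one thread
    !$omp end single nowait
    !$omp do reduction(+:total)
    do i = 1, n
      total = total + 1                ! stand-in for real loop work
    end do
    !$omp end do
    !$omp end parallel
  end function team_size_for_loop
end module team_size_demo

program team_size_main
  use team_size_demo
  implicit none
  print *, 'default team,    3 iterations:', team_size_for_loop(3, 0)
  print *, 'default team, 1000 iterations:', team_size_for_loop(1000, 0)
  print *, 'requested 2,  1000 iterations:', team_size_for_loop(1000, 2)
end program team_size_main
```

The default can be changed with the OMP_NUM_THREADS environment variable or omp_set_num_threads; the runtime may still grant fewer threads than requested.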

One more note.  Regardless of thread re-use, the overhead of entering and exiting a parallel region is high.  For example, one parallel region enclosing a million parallel loops should run much faster than a million parallel regions, each with a single loop inside.  So the general rule of parallelization is to parallelize at the outermost possible level in order to get better performance.
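
The two shapes being compared can be sketched as follows.  This is an illustration under assumptions, not a benchmark: the module region_cost_demo, the subroutines many_regions and one_region, and the array size and sweep count are all made up for the example.  Both variants compute the same result; only the placement of the parallel region differs.

```fortran
! Sketch: many small parallel regions versus one enclosing region.
! Names and problem sizes are hypothetical.
module region_cost_demo
  use omp_lib
  implicit none
contains
  ! Variant A: the parallel region is entered and left on every sweep.
  subroutine many_regions(a, nsweeps)
    real, intent(inout) :: a(:)
    integer, intent(in) :: nsweeps
    integer :: s, i
    do s = 1, nsweeps
      !$omp parallel do default(shared) private(i)
      do i = 1, size(a)
        a(i) = a(i) + 1.0
      end do
      !$omp end parallel do
    end do
  end subroutine many_regions

  ! Variant B: one parallel region encloses all sweeps; every thread runs
  ! the sweep loop and the inner loop is shared out by the DO construct.
  subroutine one_region(a, nsweeps)
    real, intent(inout) :: a(:)
    integer, intent(in) :: nsweeps
    integer :: s, i
    !$omp parallel default(shared) private(s, i)
    do s = 1, nsweeps
      !$omp do
      do i = 1, size(a)
        a(i) = a(i) + 1.0
      end do
      !$omp end do
    end do
    !$omp end parallel
  end subroutine one_region
end module region_cost_demo

program region_cost_main
  use region_cost_demo
  implicit none
  integer, parameter :: n = 1000, nsweeps = 100000
  real :: a(n), b(n)
  double precision :: t0, t1, t2

  a = 0.0
  b = 0.0
  t0 = omp_get_wtime()
  call many_regions(a, nsweeps)
  t1 = omp_get_wtime()
  call one_region(b, nsweeps)
  t2 = omp_get_wtime()

  print *, 'many regions (s):', t1 - t0
  print *, 'one region   (s):', t2 - t1
  print *, 'results match   :', all(a == b)
end program region_cost_main
```

Timings depend on the compiler, runtime and machine.  Variant B still pays for a barrier at the end of each inner loop, so the gap comes only from avoiding repeated region entry and exit.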

Regards,
Andrey
