OpenMP parallelization of a Subroutine call

Hello, and congratulations on your community.

My problem is as follows...

I have a model that calculates several parameters on a parent grid and on several (5) nested grids. The program runs in timesteps, which are controlled by a DO WHILE loop. Within each iteration of the loop, the program calls a subroutine and decides whether it is time to calculate the parent grid or the nested ones. The general pseudo-code is as follows:

DO WHILE (simulation has not ended)
   IF (iG == 1) THEN                ! time for the parent-grid timestep
      CALL TIMESTEP(iG)
   ELSE                             ! time for the nested-grid timesteps
      DO i = 2, 5
         iG = i
         CALL TIMESTEP(iG)
      END DO
   END IF
   CALL NESTING()                   ! decides whether the parent or the nests run
                                    ! next: returns iG = 1 (parent) or iG = 2 (nests)
END DO

So, as the code stands, it first runs one large timestep for the parent grid, then some timesteps for the nested grids, then again the parent, and so on. The goal is to run the nest timesteps in parallel, so that when it is time to run the 5 nests, each thread can run one nest (from 2 to 6). My attempts to parallelize the inner DO loop either cause segmentation faults or make the program run in a weird fashion (for example, one nest proceeds further than the others). The problem is that the TIMESTEP subroutine is pretty complex and all the handling is done within an inner DO WHILE loop. However, each nest's timestep calculation is completely independent of the other nests.


Please show more code.  Your example, as presented, doesn't even show a single OpenMP directive!

A common (the most common?) error is to overlook a dependency between supposedly parallel executions: you think that the nested iterations are completely independent, but technically (versus conceptually) they are not. Correctly designating the right data-sharing attribute (private versus shared, etc.) is critical.
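For example, a dependency can hide in a module (or COMMON, or SAVE) variable that every nest touches. A minimal sketch of such a hidden dependency, with made-up names rather than the actual model code:

MODULE work_mod
   REAL :: scratch                        ! one copy, shared by every thread
END MODULE work_mod

SUBROUTINE TIMESTEP(iG)
   USE work_mod
   INTEGER, INTENT(IN) :: iG
   scratch = REAL(iG)                     ! every nest writes the same variable...
   PRINT *, 'nest', iG, ' scratch =', scratch  ! ...so another thread may have
                                               ! overwritten it before this read
END SUBROUTINE TIMESTEP

Each iteration looks independent because it only receives its own iG, yet all iterations share scratch, so the outcome depends on thread timing.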

Thank you for the quick response, IanH.

One of my first approaches was:

DO WHILE (simulation has not ended)
   IF (iG == 1) THEN                ! parent-grid timestep
      CALL TIMESTEP(iG)
   ELSE                             ! nested-grid timesteps
!$omp parallel default(private)
!$omp do ordered
      DO i = 2, 5
         iG = i
         CALL TIMESTEP(iG)
      END DO
!$omp end do
!$omp end parallel
   END IF
   CALL NESTING()                   ! returns iG = 1 (parent) or iG = 2 (nests)
END DO

In fact, TIMESTEP calls a series of subroutines that must be executed for each nest and for the parent.

If "call timestep (IG)" can operate independently for your 4 grids 2:5 then OpenMP should work for them. IG must be private.

You would need to consider which variables in TIMESTEP need to be shared versus private. By default, local variables in TIMESTEP allocated on the stack should be private, but any variables in COMMON blocks or MODULEs would be shared unless you manage them differently. Any status variables, DO indices, or counters that are not dynamically allocated could cause problems; managing different indices for different values of iG may be required.
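To illustrate, one way to handle such module state, assuming each nest really does need its own copy, is THREADPRIVATE, or dimensioning the state by grid and indexing it with iG. A sketch with illustrative names, not the actual model's variables:

MODULE grid_state
   ! Option 1: one copy of the counter per thread
   INTEGER :: step_count = 0
!$omp threadprivate(step_count)
   ! Option 2: one slot per grid, always indexed by iG
   INTEGER :: steps_done(6) = 0
END MODULE grid_state

With THREADPRIVATE each thread keeps its own step_count, while steps_done(iG) isolates the grids from one another regardless of which thread runs them.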

John

Best Reply

In the above pseudo-code example you have no code in the parallel region either preceding or following the ordered DO, which makes the parallel region effectively thread-by-thread sequential.

A better technique to use would be:

a) Break TIMESTEP into two subroutines: TIMESTEPcompute and TIMESTEPoutput.
b) Assure that TIMESTEPcompute is thread-safe.
c) Use OMP DO without ORDERED to run TIMESTEPcompute (unless there are dependencies on a prior section being run).
d) Use OMP DO with ORDERED to run TIMESTEPoutput. Note: if you have no parallel code following TIMESTEPoutput, then move the output to after the parallel region.
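A minimal sketch of points (c) and (d), assuming TIMESTEP has been split as in (a); only the names TIMESTEPcompute and TIMESTEPoutput come from the suggestion above, the rest is illustrative:

!$omp parallel private(i)
!$omp do
      DO i = 2, 5
         CALL TIMESTEPcompute(i)          ! (c) thread-safe compute phase, any order
      END DO
!$omp end do
!$omp do ordered
      DO i = 2, 5
!$omp ordered
         CALL TIMESTEPoutput(i)           ! (d) output phase, forced into nest order
!$omp end ordered
      END DO
!$omp end do
!$omp end parallel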

Jim Dempsey

www.quickthreadprogramming.com

Thank you for the responses, jimdempseyatthecove and John.

I must admit that it was not a good idea to use the ORDERED clause, meaning that I don't want the threads to run as if the loop were serial. Maybe I should go for something like SCHEDULE(STATIC) with a chunk size of 1. That being said, I am not sure I quite understand what you propose, Jim.
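For reference, dropping ORDERED and handing one nest to each thread might look like the sketch below; note that this only changes the scheduling and does not by itself cure the data races described next:

!$omp parallel do private(iG) schedule(static, 1)
      DO i = 2, 5
         iG = i
         CALL TIMESTEP(iG)
      END DO
!$omp end parallel do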

What I do understand, though, is that it all comes down to private vs. shared in the end. If I set DEFAULT(PRIVATE) (which is clearly not the right choice), I get a segmentation fault in the first iteration. If I set DEFAULT(SHARED) and keep iG and i private, I get irregular execution of the threads: for example, the parent grid does not proceed the way it should, and neither do the nests (some proceed further than others). The behavior I am hoping for would produce output similar to the listing below, which is the output of the serial version. The first column is the parent grid and the others are the nests, respectively.

Model Time: 320.00 320.00 320.00 320.00 320.00 320.00 
Model Time: 340.00 320.00 320.00 320.00 320.00 320.00
Model Time: 340.00 324.00 324.00 324.00 324.00 324.00 
Model Time: 340.00 328.00 328.00 328.00 328.00 328.00 
Model Time: 340.00 332.00 332.00 332.00 332.00 332.00 
Model Time: 340.00 336.00 336.00 336.00 336.00 336.00 
Model Time: 340.00 340.00 340.00 340.00 340.00 340.00
Model Time: 360.00 340.00 340.00 340.00 340.00 340.00 
Model Time: 360.00 344.00 344.00 344.00 344.00 344.00 
Model Time: 360.00 348.00 348.00 348.00 348.00 348.00 
Model Time: 360.00 352.00 352.00 352.00 352.00 352.00 
Model Time: 360.00 356.00 356.00 356.00 356.00 356.00 
Model Time: 360.00 360.00 360.00 360.00 360.00 360.00 
Model Time: 380.00 360.00 360.00 360.00 360.00 360.00 


Apostolos,

You should not go about twiddling parameters until something seems to work; you should understand the programming requirements of parallel code before you choose parameters. Also, it is not always just a matter of private versus shared: you may also have sequencing of operations that matters. For example, if your program were manufacturing automobile parts in parallel, you might not want to place the first door finished (red) on the first chassis finished (blue); rather, you might want to build in parallel but assemble in a defined sequence. Your program may have a similar requirement.

I think you need to chart out the flow with respect to time and dependencies on completed work. Decide what can be done in parallel, and what needs to be sequenced.

Jim Dempsey

www.quickthreadprogramming.com

Thank you for your response, Jim.

I guess I need to examine the code better so I can design and debug properly. I will use a profiler/debugger so I can see what adjustments I should make.

Best Regards,

Apostolos A.

Apostolos,

The effectiveness of using OpenMP depends on how much interaction there is between the 5 secondary grids in the timestep calculation.
Potentially there is:
1) an initialisation phase, common to all grids (run as a single thread),
2) then hopefully some calculation for each grid, as multiple threads that can operate independently,
3) then a finalisation phase where the updated calculated state is applied.
The hope is that phase 2 is the dominant phase, so that OpenMP can provide a substantial benefit.
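In OpenMP terms, that three-phase structure might look like the skeleton below; the phase subroutines are placeholders, not names from the actual model:

!$omp parallel private(i)
!$omp single
      CALL INITIALISE_STEP()              ! phase 1: common set-up, one thread
!$omp end single                          ! implicit barrier before phase 2
!$omp do
      DO i = 2, 5
         CALL COMPUTE_GRID(i)             ! phase 2: independent per-grid work
      END DO
!$omp end do                              ! implicit barrier before phase 3
!$omp single
      CALL APPLY_UPDATED_STATE()          ! phase 3: apply the new state, one thread
!$omp end single
!$omp end parallel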

John


Then, after even partial success with the 1, 2, 3 that John states (meaning step 2 gains some benefit from parallelization), examine your code to see if it can take advantage of (parallel) pipelining. The current code runs as follows (notation: step.majorLoopIteration):

1.1
2.1
3.1
1.2
2.2
3.2
...

Pipelined becomes: (lane 1 | lane 2 | lane 3)

1.1 | .... | ..
2.1 | 1.2 | ..
3.1 | 2.2 | 1.3
1.4 | 3.2 | 2.3
2.4 | 1.5 | 3.3
3.4 | 2.5 | 1.6

Essentially, when the input phase is independent of the processing and output phases, you start your next input phase immediately after starting the processing phase. This does require separate work spaces for each concurrent, overlapping pipeline.

In OpenMP this is a little harder to do, at least for your first time. After you finish John's suggestion, you can come back and ask how this can be pipelined in OpenMP.

Jim Dempsey

www.quickthreadprogramming.com

Greetings again, John and Jim.

Actually, there is no need for interaction between the secondary grids, at least not for now. Maybe sometime later we could turn this into something like domain decomposition, but for now there is no need for that. It is not as if the grids have to interact with each other by updating halo regions: one massive parent-grid timestep, then 5 smaller timesteps for the 5 nested grids, then again one massive parent timestep, and so on. The only interaction is between the parent and the nest grids, not between the nest grids.
