Hybrid MPI/OpenMP : program seems to stall in non blocking communications

I have an MPI Fortran 90 CFD application, parallelized in X-Y (2D Cartesian topology), that works well, and I decided to also parallelize it in Z using OpenMP.
With the 2D MPI topology, each subdomain may have up to 8 neighbours; there is no periodicity. That is:
with the convention that NW is North-West, SE is South-East, and so on.
ME is equal to my_MPI_Rank2d, the MPI rank of the current process.
my_OMP_Thd holds the OpenMP thread number within the thread team of each MPI process.

A call to MPI_Init_thread gives me back the MPI_THREAD_MULTIPLE level of thread support in MPI.
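Since everything below depends on the thread-support level actually granted, it is worth asserting on the `provided` value rather than assuming it. A minimal sketch (variable names are illustrative, not from the original code):

```fortran
! Request full thread support and abort if the MPI library cannot provide it.
integer :: required, provided, ierr
required = MPI_THREAD_MULTIPLE
call MPI_Init_thread (required, provided, ierr)
if (provided < required) then
   print *, 'MPI library provides only thread level ', provided
   call MPI_Abort (MPI_COMM_WORLD, 1, ierr)
end if
```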

The MPI communications are non-blocking (MPI_Isend, MPI_Irecv) and are all placed in a SECTIONS ... END SECTIONS construct, with only one call per SECTION. So, for each MPI process, the communications with the 8 potential neighbours are distributed among the thread team. A call to MPI_Waitall is then made by the MASTER thread. Each thread keeps the information about its requests in private storage. That is:

computing stuff

nb_requests_local = 0   ! private per-thread request counter

!$OMP SECTIONS
!$OMP SECTION
write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'before IRecv WW'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)
nb_requests_local = nb_requests_local+1
CALL MPI_IRecv ( data, type, array_requests_local(nb_requests_local) )
write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after IRecv WW'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

!$OMP SECTION
write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'before ISend EE'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)
nb_requests_local = nb_requests_local+1
CALL MPI_ISend ( data, type, array_requests_local(nb_requests_local) )
write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after ISend EE'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

! ... one SECTION per remaining potential neighbour ...
!$OMP END SECTIONS NOWAIT

write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after SECTIONS NOWAIT'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

!$OMP CRITICAL
! update and filling of a shared array with the different requests
! held by each thread in private storage
!$OMP END CRITICAL

write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after CRITICAL'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

!$OMP MASTER
CALL MPI_WaitAll ()
!$OMP END MASTER

write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after WaitAll'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)


The write / flush calls are there for debugging and will of course be removed afterwards. Here, though, they help me show what goes wrong.
I run this code on an SGI Altix machine, using 2 nodes, each having 2 processors with 6 cores.
I run it with 12 MPI processes, 6 on each node. Each MPI process creates a team of 2 threads.

What is strange is that the OpenMP threads seem to be blocked inside the non-blocking MPI calls. In the fort.4xx files, I get outputs like:
==> fort.400 <==
before IRecv WW
after IRecv WW
before ISend EE
after ISend EE
before IRecv EE <<<< end of this file

==> fort.401 <==
before IRecv SW
after IRecv SW
before ISend NE <<<< end of this file


And all 24 threads behave like this: they enter the communication routine and make some MPI calls (with real neighbours, not only MPI_PROC_NULL ones); the number of calls may differ from thread to thread. None of them reaches the write statement after the END SECTIONS directive.

The data exchanged between the MPI processes are the ghost cells of a 4D array (5,Nx,Ny,Nz), i.e. faces or 'corner columns' with a depth of at least 3 layers. Send buffers may overlap, but receive buffers do not. Typically, Nx=112, Ny=204, Nz=32.
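The original post does not say how these faces are described to MPI, but ghost layers of a 4D array are often expressed with a derived datatype rather than copy buffers. A hedged sketch, assuming the array sizes above and a 3-layer face in X (names and sizes are illustrative only):

```fortran
! Illustrative: a derived datatype selecting a 3-layer-deep X face
! of the (5,Nx,Ny,Nz) array, using the sizes quoted in the post.
integer :: face_type, ierr
integer :: sizes(4), subsizes(4), starts(4)
sizes    = (/ 5, 112, 204, 32 /)
subsizes = (/ 5,   3, 204, 32 /)   ! 3 ghost layers in X
starts   = (/ 0,   0,   0,  0 /)   ! starts are zero-based by the standard
call MPI_Type_create_subarray (4, sizes, subsizes, starts, &
                               MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, &
                               face_type, ierr)
call MPI_Type_commit (face_type, ierr)
```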

I use ifort (IFORT) 12.1.0 20111011 and Intel MPI.

1. I checked the topology.
2. I checked the data-scope attributes of the different variables.
3. I tried replacing the SECTIONS construct by a set of SINGLE / END SINGLE NOWAIT constructs, but it misbehaves too.
4. I used ITAC with the -mpi_check option but got nothing interesting.
5. I ran the code with 12 cores and only 1 thread per MPI process: it works like the pure MPI code.

But I don't understand why it freezes.

Any help would be appreciated.

If you need further information, please let me know.



I tried something:
I replaced all the calls to non-blocking MPI communications by calls to MPI_Sendrecv, like

isendtag = 1
irecvtag = 1
CALL MPI_SendRecv (data, type, ...)
isendtag = 2
irecvtag = 2
CALL MPI_SendRecv (data, type, ...)


and it works: the application runs and ends correctly after the right number of time iterations.
The results are not all correct yet, but the code no longer hangs.
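For reference, the calls above are abbreviated; a full MPI_Sendrecv call has the shape sketched below. Buffer, count, and neighbour-rank names are illustrative, not taken from the original code:

```fortran
! Blocking combined send/receive with one neighbour (illustrative names).
call MPI_Sendrecv (send_buf, icount, MPI_DOUBLE_PRECISION, rank_WW, isendtag, &
                   recv_buf, icount, MPI_DOUBLE_PRECISION, rank_WW, irecvtag, &
                   comm2d, MPI_STATUS_IGNORE, ierr)
```

Because MPI_Sendrecv both posts and completes the exchange in one call, it avoids keeping per-thread request arrays alive across the SECTIONS construct, which may be why this variant no longer hangs.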

Are there any special settings to think about when using non-blocking communications inside an OpenMP parallel region?

This may be more likely to get a reply on the HPC/clustering forum where experts in Intel MPI participate.
