Hybrid MPI/OpenMP : program seems to stall in non blocking communications

mguy44:

Hello,

I have an MPI Fortran 90 CFD application, parallelized in X-Y (Cartesian 2D topology), that works well, and I decided to also parallelize it in Z using OpenMP.
With the MPI 2D topology, each subdomain may have up to 8 neighbours; there is no periodicity. That is:
NW NN NE
WW ME EE
SW SS SE
with the convention that NW is North West, SE is South East, and so on.
ME corresponds to my_MPI_Rank2d, the MPI rank of the current process.
my_OMP_Thd holds the OpenMP rank of each thread within the thread team of its MPI process.

A call to MPI_Init_thread gives me back the MPI_THREAD_MULTIPLE level of thread support in MPI.
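Written out, the initialization looks like this (a minimal sketch; the variable names are illustrative). Checking `provided` matters because an MPI library may silently downgrade the requested level:

```fortran
integer :: provided, ierr

call MPI_Init_thread ( MPI_THREAD_MULTIPLE, provided, ierr )
if ( provided < MPI_THREAD_MULTIPLE ) then
   print *, 'MPI_THREAD_MULTIPLE not available, provided = ', provided
   call MPI_Abort ( MPI_COMM_WORLD, 1, ierr )
end if
```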

The MPI communications are non-blocking (MPI_Isend, MPI_Irecv) and are all placed in a SECTIONS ... END SECTIONS construct, one call per SECTION, so that for each MPI process the communications with the 8 potential neighbours are distributed among the team of threads. A call to MPI_Waitall is made after them by the MASTER thread. Each thread keeps the information about its own requests in private storage. That is:

...
computing stuff

!$OMP BARRIER

nb_requests_local = 0
!$OMP SECTIONS
!$OMP SECTION
write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'before IRecv WW'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)
nb_requests_local = nb_requests_local+1
CALL MPI_IRecv ( data, type, array_requests_local(nb_requests_local) )
write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after IRecv WW'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

!$OMP SECTION
write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'before ISend EE'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)
nb_requests_local = nb_requests_local+1
CALL MPI_ISend (data, type, array_requests_local(nb_requests_local))
write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after ISend EE'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

!$OMP SECTION
...
!$OMP END SECTIONS NOWAIT

write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after SECTIONS NOWAIT'
call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

!$OMP CRITICAL
update and fill a shared array with the requests held by each thread in private storage
!$OMP END CRITICAL

write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after CRITICAL'

call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

!$OMP BARRIER

!$OMP MASTER
CALL MPI_WaitAll ()
!$OMP END MASTER

write (400+my_MPI_Rank2d*10+my_OMP_Thd,*) 'after WaitAll'

call flush (400+my_MPI_Rank2d*10+my_OMP_Thd)

!$OMP BARRIER

The write / flush calls are there for debugging only and will of course be removed afterwards; for now, they help show what goes wrong.
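For completeness, here is what one receive section and the request-gathering step look like when written out in full; the buffer, count, tag, neighbour-rank, communicator and shared-array names are placeholders, not the real ones:

```fortran
!$OMP SECTION
      nb_requests_local = nb_requests_local + 1
      CALL MPI_Irecv ( recv_buf_WW, n_WW, MPI_DOUBLE_PRECISION, rank_WW, &
                       tag_WW, comm2d,                                   &
                       array_requests_local(nb_requests_local), ierr )
...
!$OMP CRITICAL
      ! copy this thread's private requests into the shared array
      do i = 1, nb_requests_local
         nb_requests_shared = nb_requests_shared + 1
         array_requests_shared(nb_requests_shared) = array_requests_local(i)
      end do
!$OMP END CRITICAL
```

The MASTER thread would then pass array_requests_shared(1:nb_requests_shared) to MPI_Waitall.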
I run this code on an SGI Altix machine, using 2 nodes, each with 2 processors of 6 cores.
I use 12 MPI processes, 6 per node, and each MPI process creates a team of 2 threads.

What is strange is that the OpenMP threads seem to be blocked inside the non-blocking MPI calls. In the fort.4xx files, I get outputs like:
==> fort.400 <==
before IRecv WW
after IRecv WW
before ISend EE
after ISend EE
before IRecv EE <<<< end of this file

==> fort.401 <==
before IRecv SW
after IRecv SW
before ISend NE <<<<<< end of this file

....

All 24 threads behave like this: they enter the communication routine and make some MPI calls (with real neighbours, not only MPI_PROC_NULL ones), though not necessarily the same number for each thread. None reaches the write statement after the END SECTIONS directive.

The data exchanged between the MPI processes are ghost cells of a 4D array (5,Nx,Ny,Nz): faces or corner columns with a depth of at least 3 layers. Send buffers may overlap, but receive buffers do not. Typically, Nx=112, Ny=204, Nz=32.
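As an illustration of the kind of face being exchanged (hypothetical names; the real code may pack its buffers differently), a depth-3 face in X of the array f(5,Nx,Ny,Nz) could be described with a subarray datatype:

```fortran
integer :: face_type, ierr
integer, dimension(4) :: sizes, subsizes, starts

sizes    = (/ 5, Nx, Ny, Nz /)
subsizes = (/ 5, 3,  Ny, Nz /)   ! 3 ghost layers in X, full extent elsewhere
starts   = (/ 0, 0,  0,  0  /)   ! zero-based start of the face

call MPI_Type_create_subarray ( 4, sizes, subsizes, starts,              &
                                MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, &
                                face_type, ierr )
call MPI_Type_commit ( face_type, ierr )
```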

I use ifort (IFORT) 12.1.0 20111011 and intel-mpi 4.0.0.028

1. I checked the topology.
2. I checked the data-scope attributes of the different variables.
3. I tried replacing the SECTIONS construct by a set of SINGLE / END SINGLE NOWAIT ones, but it behaves badly too.
4. I used ITAC with the -mpi_check option, but got nothing interesting.
5. I ran the code with 12 cores and only 1 thread per MPI process: it works like the pure MPI code.
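For reference, the SINGLE-based variant mentioned in point 3 looks like this (same elided argument lists as in the listing above):

```fortran
!$OMP SINGLE
      CALL MPI_IRecv ( data, type, ... )
!$OMP END SINGLE NOWAIT
!$OMP SINGLE
      CALL MPI_ISend ( data, type, ... )
!$OMP END SINGLE NOWAIT
```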

But I don't understand why it freezes.

Any help will be appreciated.

If you need further information, please let me know.

Regards

mguy44:

I tried something:
I replaced all the calls to non-blocking MPI communications by calls to MPI_Sendrecv, like this:

!$OMP SECTIONS
!
!$OMP SECTION
isendtag = 1
irecvtag = 1
CALL MPI_SendRecv (data, type, ...)
!
!$OMP SECTION
isendtag = 2
irecvtag = 2
CALL MPI_SendRecv (data, type, ...)

...
!$OMP END SECTIONS

and it works: the application runs and ends correctly after the right number of time iterations.
The results are not all correct yet, but the code no longer hangs.

Are there any special settings one has to think about when using non-blocking communications inside an OpenMP parallel region?

Tim Prince:

This may be more likely to get a reply on the HPC/clustering forum where experts in Intel MPI participate.
