Using MPI in parallel OpenMP regions

Hi all,

I am trying to call MPI from within OpenMP parallel regions, but I cannot get it to work properly. My program compiles fine with mpiicc (4.1.1.036) and icc (13.1.2 20130514). I checked that it is linked against the thread-safe libraries (libmpi_mt.so appears when I run ldd).

But when I try to run it (2 Ivy Bridge nodes x 2 MPI tasks x 12 OpenMP threads), I get a SIGSEGV without any backtrace:

/opt/softs/intel/impi/4.1.1.036/intel64/bin/mpirun -np 4 -ppn 2 ./mpitest.x

APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

Or, with the debug level set to 5 (I_MPI_DEBUG=5):

/opt/softs/intel/impi/4.1.1.036/intel64/bin/mpirun -np 4 -ppn 2 ./mpitest.x
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): shm and dapl data transfer modes
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[2] MPI startup(): shm and dapl data transfer modes
[3] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[3] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       90871    beaufix522  {0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,30,31,32,33,34,35}
[0] MPI startup(): 1       90872    beaufix522  {12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[0] MPI startup(): 2       37690    beaufix523  {0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,30,31,32,33,34,35}
[0] MPI startup(): 3       37691    beaufix523  {12,13,14,15,16,17,18,19,20,21,22,23,36,37,38,39,40,41,42,43,44,45,46,47}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,15,15,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx4_0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 12
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

Of course, if I use a single OpenMP thread, everything works fine. I also tried wrapping the MPI calls in critical regions, which works but is not what I want.

My program is just a small test case to find out whether I can use this pattern in a bigger program. In each MPI task, all OpenMP threads first send messages to the other tasks; afterwards, all OpenMP threads receive messages from the other tasks.

My questions are:

  • Does my program conform to the MPI_THREAD_MULTIPLE thread level (which, by the way, is what MPI_Init_thread returns)?
  • Is Intel MPI supposed to run it correctly?
  • If not, will it work someday?
  • What can I do now (extra tests, etc.)?

Best regards,

Philippe

Attachment: mpitest.c (text/x-csrc, 2.22 KB)

One of the simpler possibilities is that you need to allow more stack, either in the shell, via KMP_STACKSIZE, or both.

Hello Tim,

I already had OMP_STACKSIZE=20000M and ulimit -s unlimited; I added KMP_STACKSIZE=20000M and got this:

Fatal error in MPI_Bsend: Internal MPI error!, error stack:
MPI_Bsend(195)..............: MPI_Bsend(buf=0x2ae0f1fff3ec, count=2, MPI_INT, dest=2, tag=0, MPI_COMM_WORLD) failed
MPIR_Bsend_isend(226).......:
MPIR_Bsend_check_active(456):
MPIR_Test_impl(63)..........:
MPIR_Request_complete(227)..: INTERNAL ERROR: unexpected value in case statement (value=0)
APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)

Regards,

Philippe

Best Reply

Hi Philippe,

One solution is to use MPI_Send or MPI_Isend instead of MPI_Bsend. Will either of these work in your program?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

I have never seen KMP_STACKSIZE set successfully to more than 40M. OMP_STACKSIZE would be preferable, I guess, but it means the same thing. With 24 threads per node, each with a KMP_STACKSIZE of 20 GB, you would need 480 GB per node just for the thread stacks. I haven't seen a system where ulimit -s unlimited could give you that much.
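
The arithmetic behind that estimate can be spelled out in a few lines of C. This is only a sketch: the function name total_stack_mib is made up for the example, and it assumes the worst case in which all 24 threads on a node (2 MPI tasks x 12 OpenMP threads) get the full per-thread setting.

```c
#include <stdio.h>

/* Hypothetical helper: total stack reservation in MiB when every thread
 * is given the same per-thread stack size (also in MiB). */
unsigned long long total_stack_mib(unsigned threads,
                                   unsigned long long per_thread_mib)
{
    return (unsigned long long)threads * per_thread_mib;
}

/* Prints the per-node total for a given setting. */
void print_stack_estimate(unsigned threads, unsigned long long per_thread_mib)
{
    printf("%u threads x %lluM = %lluM per node\n",
           threads, per_thread_mib, total_stack_mib(threads, per_thread_mib));
}
```

With the 20000M setting from this thread, total_stack_mib(24, 20000) gives 480000M, roughly the 480 GB mentioned above; with 40M per thread it gives 960M.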
