Application crashes when run on 2 nodes (caused collective abort of all ranks, killed by signal 9)

Hi, we have a large HPC application compiled with the Intel compiler that uses the Intel MPI Library. It works fine when run on a single node (with multiple processes) but crashes when run on 2 nodes (with multiple processes) with the following message:

-------------
rank 63 in job 1 blade4_34649 caused collective abort of all ranks
exit status of rank 63: killed by signal 9

-------------

I'm not sure whether this is an Intel MPI related error or an error in the application. Here is some information about the Intel MPI version we are using and the mpd ring consisting of 2 nodes:

-------------------
[kunal@GPUBlade exp]$ which mpirun
/opt/intel/impi/4.0.1.007/intel64/bin/mpirun

[kunal@GPUBlade exp]$ mpirun --version
Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910
Copyright (C) 2003-2010 Intel Corporation. All rights reserved.

[kunal@GPUBlade exp]$ mpdtrace -l
GPUBlade_37085 (GPUBlade)
blade4_57372 (192.168.1.102)

-------------------

Any suggestions on how I should go about debugging this error?

Thanks & Regards,
Kunal


Hi Kunal,

With only information about the MPI library, it's hardly possible to say anything about this issue.
It could be an incorrect buffer allocation, lack of memory, an unstable connection... anything.
As a first step, could you run your application with the "-check_mpi" option? Just run: "mpirun -check_mpi ...."
Do you see the same issue using fewer cores? Is your issue reliably reproducible?
BTW: when using "mpirun" you don't need to set up an mpd ring yourself: "mpirun" creates a new mpd ring, starts the application, and then stops the mpd ring it created.
Also, by compiling your application with '-g' and running with I_MPI_DEBUG=5 (or higher), you'll get additional information which may help you understand the issue.
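Putting those suggestions together, a debugging session might look like the following sketch. The application name, process count, and the use of the `mpiifort` wrapper are illustrative assumptions, not details from the original post:

```shell
# Recompile with debug symbols and no optimization (illustrative flags;
# mpiifort is Intel MPI's wrapper for the Intel Fortran compiler)
mpiifort -g -O0 -o myapp myapp.f90

# Run under the MPI correctness checker to catch invalid arguments,
# buffer errors, and similar problems at the failing call site
mpirun -check_mpi -n 64 ./myapp

# Run with verbose Intel MPI debug output (higher values print more)
I_MPI_DEBUG=5 mpirun -n 64 ./myapp
```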

Regards!
---Dmitry

Thanks Dmitry for your reply. Your suggestions were helpful. I ran the application with those extra debugging flags and got some more insight into the problem. It crashes in an mpi_comm_dup MPI call with the following message:

----------
[0] ERROR: LOCAL:MPI:CALL_FAILED: error
[0] ERROR: Invalid communicator.
[0] ERROR: Error occurred at:
[0] ERROR: mpi_comm_dup_(comm=0xffffffffc4000000 <>, *newcomm=0x3d930e0, *ierr=0x7fffd2afbddc)
[0] ERROR: LOCAL:MPI:CALL_FAILED: error
[0] ERROR: Invalid communicator.
[0] ERROR: Error occurred at:
[0] ERROR: mpi_comm_dup_(comm=0xffffffffc4000000 <>, *newcomm=0x3d930e0, *ierr=0x7fffd2afbddc)
----------

I'll look more into it. Let me know if you have further suggestions.

Thanks & Regards,
Kunal

Kunal,

It looks like the first argument of MPI_COMM_DUP is incorrect.
As an example: MPI_COMM_DUP(MPI_COMM_WORLD, new_comm, ierr)
In the Fortran binding, the communicator argument should be an INTEGER holding a valid communicator handle.
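A minimal Fortran sketch of correct usage, assuming the application uses the Fortran bindings (as the `mpi_comm_dup_` symbol in the trace suggests); the program and variable names here are illustrative:

```fortran
program dup_example
  implicit none
  include 'mpif.h'
  integer :: new_comm, rank, ierr

  call MPI_INIT(ierr)
  ! The first argument must be a valid INTEGER communicator handle such as
  ! MPI_COMM_WORLD. Passing an uninitialized variable, or one of the wrong
  ! kind, produces "Invalid communicator" errors like the one in the trace.
  call MPI_COMM_DUP(MPI_COMM_WORLD, new_comm, ierr)
  call MPI_COMM_RANK(new_comm, rank, ierr)
  call MPI_COMM_FREE(new_comm, ierr)
  call MPI_FINALIZE(ierr)
end program dup_example
```

A garbage value like the `comm=0xffffffffc4000000` in the trace typically points to an uninitialized or mistyped communicator variable on the application side.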

Regards!
Dmitry

Hi,

I have compiled espresso with Intel MPI and the MKL library, but I am getting a "Failure during collective" error, whereas it works fine with OpenMPI.

Is there a problem with Intel MPI?

Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x516f460, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x5300310, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x6b295c0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x67183d0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x4f794c0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
[0:n125] unexpected disconnect completion event from [22:n122]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 0
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x56bfe30, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
/var/spool/PBS/mom_priv/epilogue: line 30: kill: (5089) - No such process

Kindly help us resolve this.

Thanks,
sanjiv
