Application crashes when run on 2 nodes (caused collective abort of all ranks, killed by signal 9)

Kunal Rao

Hi, we have a large HPC application that is compiled with the Intel compiler and uses the Intel MPI Library. It works fine when run on a single node (with multiple processes) but crashes when run on 2 nodes (with multiple processes) with the following message:

-------------
rank 63 in job 1 blade4_34649 caused collective abort of all ranks
exit status of rank 63: killed by signal 9

-------------

I'm not sure whether this is an Intel MPI error or an error in the application. Here is some information about the Intel MPI version we are using and the mpd ring, which consists of 2 nodes.

-------------------
[kunal@GPUBlade exp]$ which mpirun
/opt/intel/impi/4.0.1.007/intel64/bin/mpirun

[kunal@GPUBlade exp]$ mpirun --version
Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910
Copyright (C) 2003-2010 Intel Corporation. All rights reserved.

[kunal@GPUBlade exp]$ mpdtrace -l
GPUBlade_37085 (GPUBlade)
blade4_57372 (192.168.1.102)

-------------------

Any suggestions on how to go about debugging this error?

Thanks & Regards,
Kunal

Dmitry Kuzmin (Intel)

Hi Kunal,

With only information about the MPI library, it is hardly possible to say anything about this issue.
It could be incorrect buffer allocation, lack of memory, an unstable connection... anything.
As a first step, could you run your application with the "-check_mpi" option? Just run: "mpirun -check_mpi ...."
Do you see the same issue when using fewer cores? Is the issue fully reproducible?
BTW: when using "mpirun" you don't need an existing mpd ring. "mpirun" creates a new mpd ring, starts the application, and then stops the ring it created.
Also, if you compile your application with '-g' and run it with I_MPI_DEBUG=5 (or higher), you'll get additional information that may help you understand the issue.

Regards!
---Dmitry
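
For anyone following along, the two runs suggested above could be scripted roughly as below. This is only a sketch: `NP` and `APP` are placeholder values that do not come from the thread, and `-check_mpi` assumes the Intel correctness-checking library is installed alongside Intel MPI. The script prints the command lines and runs them only when `RUN=1` is set.

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from the thread).
NP=64          # number of MPI ranks
APP=./my_app   # hypothetical application binary, ideally built with -g

# 1. Run under the MPI correctness checker.
CHECK_CMD="mpirun -check_mpi -n $NP $APP"

# 2. Run with verbose Intel MPI debug output (level 5 or higher).
DEBUG_CMD="mpirun -genv I_MPI_DEBUG 5 -n $NP $APP"

# Print each command; execute it only when RUN=1 is set in the environment.
for cmd in "$CHECK_CMD" "$DEBUG_CMD"; do
  echo "$cmd"
  if [ "${RUN:-0}" = "1" ]; then
    $cmd
  fi
done
```

Lowering `NP` and rerunning is a simple way to check whether the crash also appears with fewer cores.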

Kunal Rao

Thanks, Dmitry, for your reply. Your suggestions were helpful. I was able to do a run with those extra debugging flags and got some more insight into the problem. The application crashes with the following message in the mpi_comm_dup_ MPI call in the application:

----------
[0] ERROR: LOCAL:MPI:CALL_FAILED: error
[0] ERROR: Invalid communicator.
[0] ERROR: Error occurred at:
[0] ERROR: mpi_comm_dup_(comm=0xffffffffc4000000 <>, *newcomm=0x3d930e0, *ierr=0x7fffd2afbddc)
---------

I'll look more into it. Let me know if you have any further suggestions.

Thanks & Regards,
Kunal

Dmitry Kuzmin (Intel)

Kunal,

It looks like the first argument passed to MPI_COMM_DUP is incorrect.
As an example: MPI_COMM_DUP(MPI_COMM_WORLD, new_comm, ierr)
The communicator argument should be of type INTEGER.

Regards!
Dmitry
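
One way to confirm that MPI_COMM_DUP itself works on both nodes is a tiny standalone test with correctly declared INTEGER handles. The sketch below is an assumption-laden illustration, not the original application's code: the file name `comm_dup_check.f90`, the program name, and the rank count of 4 are all made up. It writes the Fortran source and builds and runs it only if the `mpiifort` wrapper is on the PATH.

```shell
#!/bin/sh
# Write a minimal Fortran test program (file name is arbitrary).
cat > comm_dup_check.f90 <<'EOF'
program comm_dup_check
  implicit none
  include 'mpif.h'
  ! Communicator handles and error codes are plain INTEGERs in Fortran.
  integer :: new_comm, ierr, rank

  call MPI_INIT(ierr)
  ! Duplicate a communicator that is known to be valid.
  call MPI_COMM_DUP(MPI_COMM_WORLD, new_comm, ierr)
  call MPI_COMM_RANK(new_comm, rank, ierr)
  print *, 'rank', rank, 'duplicated MPI_COMM_WORLD'
  call MPI_COMM_FREE(new_comm, ierr)
  call MPI_FINALIZE(ierr)
end program comm_dup_check
EOF

# Build and run only if the Intel MPI Fortran wrapper is available.
if command -v mpiifort >/dev/null 2>&1; then
  mpiifort -g -o comm_dup_check comm_dup_check.f90 &&
  mpirun -n 4 ./comm_dup_check
else
  echo "mpiifort not found; wrote comm_dup_check.f90 only"
fi
```

If this small program runs cleanly across both nodes, the invalid handle most likely comes from how the application declares or passes its communicator variable, for example a handle that is not an INTEGER or is used before being set.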
