Hi Kunal,
With only the MPI library information to go on, it is hard to say anything definite about this issue.
It could be incorrect buffer allocation, lack of memory, an unstable connection... anything.
As a first step, could you run your application with the "-check_mpi" option? Just run: "mpirun -check_mpi ...."
Do you see the same issue when using fewer cores? Is the issue fully reproducible?
BTW: with "mpirun" you don't need an existing mpd ring. "mpirun" creates a new mpd ring, starts the application, and then shuts down the ring it created.
Also, if you compile your application with '-g' and run it with I_MPI_DEBUG=5 (or higher), you'll get additional information that may help you understand the issue.
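The steps above can be sketched as a small script. This is only an illustration: the rank counts and the binary name "./my_app" are placeholders that do not come from the original post, and host selection (for example a hosts file) is left out. The script prints the commands by default, so they can be reviewed before anything is launched. As far as I know, "-check_mpi" also needs the Intel checking library (shipped with Intel Trace Analyzer and Collector) to be installed.

```shell
#!/bin/sh
# Placeholder values (assumptions, not from the original post).
NP=64            # rank count that triggers the crash
NP_SMALL=8       # smaller rank count for comparison
APP=./my_app     # hypothetical application binary

# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 to execute it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Run with the MPI correctness checker enabled.
run mpirun -check_mpi -n "$NP" "$APP"

# 2. Repeat with fewer ranks to see whether the crash depends on the process count.
run mpirun -check_mpi -n "$NP_SMALL" "$APP"

# 3. Request Intel MPI debug output (level 5) for all ranks.
run mpirun -genv I_MPI_DEBUG 5 -n "$NP" "$APP"
```

Once the printed commands look right for your cluster, run the script with DRY_RUN=0 to actually launch them.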
Regards!
---Dmitry




Application crashes when run on 2 nodes (caused collective abort of all ranks, killed by signal 9)
Hi, we have a huge HPC application that is compiled with the Intel compiler and uses the Intel MPI Library. It works fine when run on a single node (with multiple processes) but crashes when run on 2 nodes (with multiple processes) with the following message:
-------------
rank 63 in job 1 blade4_34649 caused collective abort of all ranks
exit status of rank 63: killed by signal 9
---
---------------
I'm not sure whether this is an Intel MPI error or an error in the application. Here is some information about the Intel MPI installation we are using and the mpd ring, which consists of 2 nodes.
-------------------
[kunal@GPUBlade exp]$ which mpirun
/opt/intel/impi/4.0.1.007/intel64/bin/mpirun
[kunal@GPUBlade exp]$ mpirun --version
Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910
Copyright (C) 2003-2010 Intel Corporation. All rights reserved.
[kunal@GPUBlade exp]$ mpdtrace -l
GPUBlade_37085 (GPUBlade)
blade4_57372 (192.168.1.102)
-------------------
Any suggestions on how to go about debugging this error? Thanks & Regards,
Kunal