Failure to launch (ssh becomes zombie)

I have a problem where in about 1-2% of mpiruns (0.1-0.2% of the ssh processes launched by mpiexec.hydra) one of the ssh processes fails to launch and becomes a zombie. As a consequence the overall job hangs forever.

With setenv I_MPI_DEBUG 1 and -verbose added to the mpirun command I get some information (see Bug.txt attached). The node that in this case failed to start is qnode0708, and if you wade through the file you will see no "Start PMI_proxy 5".

At this moment I do not know whether this is an impi issue (version 4.1 is being used), an ssh race condition (this appears to be possible), something to do with the large cluster I am using, or something else. Two specific questions:

a) Has anyone seen anything like this?

b) Is there a way to launch with "ssh -v"? That might be informative, but I cannot find anything about how to do this.

N.B., I am 99.99% certain that this has nothing to do with the code being run, the compilation or anything else. In fact the failure occurs equally for three very different mpi executables.

Attachment: bug.txt (text/plain, 162.35 KB)

You can launch with ssh -v by putting this into a script and setting I_MPI_HYDRA_BOOTSTRAP_EXEC to point to this script.

Try using I_MPI_DEBUG=5 instead of I_MPI_DEBUG=1.  This will provide additional information.
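A minimal sketch of what such a wrapper could look like. The file name ssh_verbose.sh is made up for illustration, and the export lines assume an sh-style shell (under csh, use setenv as in the original post). The wrapper only adds -v and passes every other argument through unchanged, so the ssh debug output should end up mixed into the mpirun output.

```shell
# Create a wrapper that can be called in place of plain ssh.
# (ssh_verbose.sh is a hypothetical name, not something shipped with Intel MPI.)
cat > ./ssh_verbose.sh <<'EOF'
#!/bin/sh
# Forward all arguments to ssh unchanged, with -v added for debug output.
exec ssh -v "$@"
EOF
chmod +x ./ssh_verbose.sh

# Point hydra at the wrapper and raise the MPI debug level.
export I_MPI_HYDRA_BOOTSTRAP_EXEC="$PWD/ssh_verbose.sh"
export I_MPI_DEBUG=5
```

After this, run mpirun as before from the same shell.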

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Thanks for the comment. For reference, in case others run across the same problem please see the thread starting at http://lists.mindrot.org/pipermail/openssh-unix-dev/2013-July/031518.html and http://lists.mindrot.org/pipermail/openssh-unix-dev/2013-July/031527.html.

I was not able to trace the fault beyond localizing it to ssh/sshd on that system and I came to the conclusion that ssh is just not robust enough for some reason on the Quest computer at Northwestern. Since I don't have any rights to see any of the log files on that cluster, I gave up and replaced ssh by openmpi/rmpirun as the bootstrap. While this is an ugly hack, it has proved to be 100% reliable.

N.B., for the future I suggest that hydra should check whether the ssh process it has launched has become a zombie.
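Until something like that exists, a check of this kind can be approximated by hand on the launch node. The sketch below is only an illustration, not anything hydra does: filter_zombie_ssh is a made-up helper that reads "pid ppid stat comm" lines and prints the pid and parent pid of every ssh process whose state starts with Z.

```shell
# Hypothetical helper: read "pid ppid stat comm" lines on stdin and print
# "pid ppid" for each ssh process in the zombie (Z) state.
filter_zombie_ssh() {
    awk '$3 ~ /^Z/ && $4 == "ssh" { print $1, $2 }'
}
```

Assuming a Linux procps-style ps, it could be fed with "ps -eo pid=,ppid=,stat=,comm= | filter_zombie_ssh"; any output means an ssh launched on that node has become a zombie, and the parent pid shows which process has not reaped it.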

As an addendum, I now have a way to reproduce this issue and it is a "bug" since the end result is highly undesirable (mpi tasks running forever is the consequence). Curing the bug may not be trivial, and there are probably many ways to reproduce it which are not particularly user friendly.

To reproduce, arrange so that an impi task is being run on more than one node where a secondary node has a cooling problem and as a consequence oom-killer gets invoked to terminate the mpi task. For whatever reason (beyond my pay grade) this leaves the ssh connection as a zombie. The other nodes/cores do not know and will continue to run forever, probably sending requests to send/receive data which go into a black hole.

I have a similar issue. I can still see the mpiexec.hydra process running after my job completes. I am using version 5.1.2.150. If I use "mpd" I do not see this happen.

This only happens when I run our script in the background. The script then launches mpirun with:

$MPI_ROOT/bin64/mpirun  --rsh=rsh -f $hostfile $DEBUG -configfile $cmdfile

Command File:

-genv LD_LIBRARY_PATH /scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8/lib:/scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8:/lib64:/scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8/lib:/scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8/nCode/bin -genv npath /scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8/nCode

-n 1 -host sudev604.na.mscsoftware.com /scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8/nastran -c0 -d0 jid=./bc2.dat version=2017.0 sdir=/scratch mode=i8 out=./bc2.t0 rc=./bc2.T15787_28.rc intelmpi=yes dmp=2 nhosts=1

-n 1 -host sudev604.na.mscsoftware.com /scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8/nastran -c1 -d0 jid=./bc2.dat version=2017.0 sdir=/scratch mode=i8 out=./bc2.t1 rc=./bc2.T15787_28.rc intelmpi=yes dmp=2 nhosts=1

If I add a "-v" I see

[mpiexec@sudev604] [pgid: 0] got aggregated PMI command (part of it): cmd=put kvsname=kvs_12634_0 key=P1-businesscard-0 value=fabrics_list#shm$

[mpiexec@sudev604] reply: cmd=put_result rc=0 msg=success

[proxy:0:0@sudev604] got pmi command (from 10): barrier_in

[proxy:0:0@sudev604] got pmi command (from 12): barrier_in

[proxy:0:0@sudev604] forwarding command (cmd=barrier_in) upstream

[mpiexec@sudev604] [pgid: 0] got PMI command: cmd=barrier_in

[mpiexec@sudev604] PMI response to fd 8 pid 12: cmd=barrier_out

[proxy:0:0@sudev604] PMI response: cmd=barrier_out

[proxy:0:0@sudev604] PMI response: cmd=barrier_out

[proxy:0:0@sudev604] got pmi command (from 10): finalize

[proxy:0:0@sudev604] PMI response: cmd=finalize_ack

[proxy:0:0@sudev604] got pmi command (from 12): finalize

[proxy:0:0@sudev604] PMI response: cmd=finalize_ack

[1]  + Suspended (tty input)         script
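A note on that last line: "Suspended (tty input)" is what a csh-style shell reports when a background job has been stopped for trying to read from the terminal. The sketch below assumes that terminal input is what stops the job here, which nothing in this thread confirms. It shows the usual workaround of giving the background job /dev/null as its stdin, using a made-up stand-in function (job) in place of the real script.

```shell
# Stand-in for the real job script (hypothetical): it reads stdin until EOF,
# as a launcher that forwards stdin might, and then reports completion.
job() {
    cat > /dev/null
    echo "job finished"
}

# Run it in the background with stdin taken from /dev/null so that it
# never tries to read from the terminal.
job < /dev/null > job.out 2>&1 &
wait $!
cat job.out
```

For the real case the equivalent would be to start the script as "script < /dev/null &".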

FWIW – Here is output from I_MPI_DEBUG=5

sudev604 <147> grep MPI bc2.*log

bc2.t0.log:MPI Shared Object "/scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8/intel/lib64/libmpi.so" has been loaded.

bc2.t0.log:[0] MPI startup(): Multi-threaded optimized library

bc2.t0.log:[0] MPI startup(): shm data transfer mode

bc2.t0.log:[0] MPI startup(): Rank    Pid      Node name  Pin cpu

bc2.t0.log:[0] MPI startup(): 0       16770    sudev604   {0,1,2,3}

bc2.t0.log:[0] MPI startup(): 1       16772    sudev604   {4,5,6,7}

bc2.t0.log:[0] MPI startup(): I_MPI_DEBUG=5

bc2.t0.log:[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10

bc2.t0.log:[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=1

bc2.t0.log:[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 4

bc2.t1.log:MPI Shared Object "/scratch/jjg/xxx_bugfix_i8/msc/nastran/msc20170/linux64_rhe71i8/intel/lib64/libmpi.so" has been loaded.

bc2.t1.log:[1] MPI startup(): shm data transfer mode

Do you have any suggestions for how I can avoid the hanging mpiexec.hydra?

Regards,

Joe

Does this occur in the latest version (Intel® MPI Library 2017)?
