MPI program behavior on node crash

Na Na's picture

In our production environment, it happens that some nodes crash once in a while. What is the behavior of Intel MPI when an MPI program loses contact with some of its processes? Would there be any difference if the crashed node contains rank 0? Is there any Intel MPI option to control the behavior in such a situation, so that the program is cleaned up in case one of the MPI processes is lost?

Thank you very much,
Tofu

James Tullos (Intel)'s picture

Hi Tofu,

If a node containing a process crashes, the entire job will end. You can use the -cleanup option (or I_MPI_HYDRA_CLEANUP) to create a temporary file that will list the PID of each process, and the mpicleanup utility will use this file to clean the environment if the job does not end correctly. You can also use I_MPI_MPIRUN_CLEANUP if you are using MPD instead of Hydra.
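In command form, the cleanup workflow described above looks roughly like the sketch below. The hostnames are taken from later posts in this thread for illustration; check the Intel MPI Library Reference Manual for the exact option spellings in your version:

```
# Ask Hydra to record the PID of every launched process in a
# temporary file (equivalently: export I_MPI_HYDRA_CLEANUP=1)
mpirun -cleanup -hosts p01,p02 -n 32 ./a.out

# If the job ends abnormally, run the mpicleanup utility; it reads
# the file written above and kills any leftover processes
mpicleanup
```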

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

YY C.'s picture

Hi,

I found a similar situation in which the mpirun command does not terminate, even when some of the processes fail to start up properly. Here I have two nodes, p01 and p02, running the test program /opt/intel/impi/4.1.0.024/test/test.f90. Here is what I've done:

cp /opt/intel/impi/4.1.0.024/test/test.f90 /path/to/shared/storage

cd /path/to/shared/storage

mpiifort test.f90

mpirun -hosts p01,p02 -n 32 ./a.out

 Hello world: rank  0 of 32 running on p01
 Hello world: rank  1 of 32 running on p01
 Hello world: rank  2 of 32 running on p01
 Hello world: rank  3 of 32 running on p01
 Hello world: rank  4 of 32 running on p01
 Hello world: rank  5 of 32 running on p01
 Hello world: rank  6 of 32 running on p01
 Hello world: rank  7 of 32 running on p01
 Hello world: rank  8 of 32 running on p01
 Hello world: rank  9 of 32 running on p01
 Hello world: rank 10 of 32 running on p01
 Hello world: rank 11 of 32 running on p01
 Hello world: rank 12 of 32 running on p01
 Hello world: rank 13 of 32 running on p01
 Hello world: rank 14 of 32 running on p01
 Hello world: rank 15 of 32 running on p01
 Hello world: rank 16 of 32 running on p02
 Hello world: rank 17 of 32 running on p02
 Hello world: rank 18 of 32 running on p02
 Hello world: rank 19 of 32 running on p02
 Hello world: rank 20 of 32 running on p02
 Hello world: rank 21 of 32 running on p02
 Hello world: rank 22 of 32 running on p02
 Hello world: rank 23 of 32 running on p02
 Hello world: rank 24 of 32 running on p02
 Hello world: rank 25 of 32 running on p02
 Hello world: rank 26 of 32 running on p02
 Hello world: rank 27 of 32 running on p02
 Hello world: rank 28 of 32 running on p02
 Hello world: rank 29 of 32 running on p02
 Hello world: rank 30 of 32 running on p02
 Hello world: rank 31 of 32 running on p02

Now, on p02, I unmount the shared storage and then issue the command again:

mpirun -hosts p01,p02 -n 32 ./a.out

[proxy:0:1@p02] HYDU_create_process (./utils/launch/launch.c:111): execvp error on file ./a.out (No such file or directory)
(the same execvp error is repeated 16 times, once for each rank assigned to p02)

However, the mpirun process does not terminate, and the process tree from ps shows the following:

100776 pts/14   S      0:00      \_ /bin/sh /opt/intel/impi/4.1.0.024/intel64/bin/mpirun -hosts p01,p02 -ppn 1 -n 2 ./a.out
100781 pts/14   S      0:00      |   \_ mpiexec.hydra -hosts p01 p02 -ppn 1 -n 2 ./a.out
100782 pts/14   S      0:00      |       \_ /usr/bin/ssh -x -q p01 /opt/intel/impi/4.1.0.024/intel64/bin/pmi_proxy --control-port metro:36671 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --proxy-id 0
100783 pts/14   Z      0:00      |       \_ [ssh] <defunct>

 

Just wondering if there is any option that would help in this situation, so that mpirun terminates properly instead of hanging.
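Pending a proper fix, one generic stopgap (plain coreutils, not an Intel MPI feature) is to bound the launcher's wall time with `timeout`, so a hung mpirun is eventually killed. In the sketch below, `sleep 1000` stands in for the hung launcher command:

```shell
#!/bin/sh
# Watchdog sketch: give the whole job a wall-clock budget.
# "sleep 1000" stands in for a launcher that hangs; in practice
# the wrapped command would be e.g.
#   timeout 7200 mpirun -hosts p01,p02 -n 32 ./a.out
timeout 2 sleep 1000
status=$?
# coreutils timeout exits with status 124 when it had to kill the command
echo "watchdog exit status: $status"
```

The 7200-second budget above is an arbitrary example; it should be set comfortably above the job's expected runtime.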

 

regards,

C. Bean


Tim Prince's picture

Could you use mpdallexit?

James Tullos (Intel)'s picture

We have corrected some problems related to ranks not exiting correctly.  Please try with Version 4.1 Update 2 and see if this resolves the problem.

Na Na's picture

Hi,

Our situation is slightly different, but we encountered a similar problem, even though we're using version 4.1 update 2. We started an HPL benchmark run and one of the nodes crashed in the middle. However, mpiexec.hydra does not terminate:

 28351 pts/4    Ss     0:00  \_ /bin/bash
 32295 pts/4    S+     0:00      \_ /bin/sh /opt/intel/impi/4.1.2.040/intel64/bin/mpirun -hosts node107,node213 -n 32 ./xhpl_intel64_dynamic
 32300 pts/4    S+     0:00          \_ mpiexec.hydra -hosts node107 node213 -n 32 ./xhpl_intel64_dynamic
 32301 pts/4    Z      0:00              \_ [ssh] <defunct>
 32302 pts/4    S      0:00              \_ /usr/bin/ssh -x -q node213 /opt/intel/impi/4.1.2.040/intel64/bin/pmi_proxy --control-port master:49817 --pmi-connect lazy-cache --pmi-aggregate -s 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1138594473 --proxy-id 1

 

Any clue?

regards,

Tofu

 

YY C.'s picture

The 4.1 update 2 Intel MPI works fine for the initial start-up issue; i.e., if a node has not mounted the shared storage, mpirun now terminates properly.

We also tried unplugging a compute node in the middle of a run and found that mpirun hangs with an [ssh] <defunct> child. Is there any way to make mpirun terminate in this situation?

regards,

 

C. Bean

Na Na's picture

Any update on this issue? We tried compiling the application with MVAPICH2, and with its mpiexec.hydra the whole application is terminated whenever a compute node goes down.

regards,

tofu
