MPI: Prevent mpirun from terminating on SIGTERM


Hi,

I'm using Intel MPI with PBS.
When I send SIGTERM to my job using qdel, mpirun exits immediately, and the program launched by mpirun has no time to finish its cleanup work.
(I'm using
if [ x$PBS_ENVIRONMENT != x ]; then
trap "" SIGTERM
fi
in my ~/.profile to prevent any shell from exiting when it gets the SIGTERM)
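
As a side note on the trap above: a signal that the shell ignores with trap "" stays ignored in the processes it starts, but only until a program installs its own handler. Whether mpirun does so is not stated in this thread. The following is a minimal sketch for checking how a given command behaves; survives_term is a hypothetical helper name, not part of PBS or Intel MPI, and the one-second delays are arbitrary.

```shell
#!/bin/sh
# Sketch: check whether a command started from a shell that ignores
# SIGTERM survives a SIGTERM sent directly to it.
# survives_term is a hypothetical helper, not a PBS or Intel MPI tool.
survives_term() {
    (
        trap "" TERM            # ignore SIGTERM, as the ~/.profile snippet does
        "$@" &                  # the child inherits the ignored disposition
        child=$!
        sleep 1                 # give the child time to start
        kill -TERM "$child"     # deliver SIGTERM to the child itself
        sleep 1
        if kill -0 "$child" 2>/dev/null; then
            echo "survived"     # the child kept SIGTERM ignored
            kill -KILL "$child" # clean up the test process
        else
            echo "terminated"   # the child handled or reset SIGTERM
        fi
        wait "$child" 2>/dev/null || true
    )
}
```

For example, `survives_term sleep 30` reports whether a plain sleep survives; replacing the sleep with the real launch command would show whether that command overrides the ignored signal.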

How can I tell IntelMPI's mpirun not to exit on SIGTERM?

Cheers,
Manuel


Hi Manuel,

Thanks for posting here.
Personally, I don't understand why you need to send SIGTERM and then execute cleanup code.
Anyway, I tried killing mpirun (I actually used SIGKILL rather than SIGTERM, but I don't think the difference matters here):

[user1@mpiserver100 spawn1]$ mpirun -r ssh -f mpd.hosts -n 2 IMB-MPI1 > out_IMB
Killed

From another console:
[user1@mpiserver100 spawn1]$ ps xf
  PID TTY      STAT   TIME COMMAND
20989 pts/0    Ss     0:00 -bash
23276 pts/0    R+     0:00  \_ ps xf
14865 pts/6    Ss+    0:00 -bash
23269 pts/0    S      0:00 python /user1/intel/impi/4.0/intel64/bin/mpiexec -n 2 IMB-MPI1
23270 pts/0    Z      0:00  \_ [sh]
23255 ?        S      0:00 python /user1/intel/impi/4.0/intel64/bin/mpd.py --ncpus=1 --myhost=mpiserver100 -e -d -s 2
23271 ?        S      0:00  \_ python /user1/intel/impi/4.0/intel64/bin/mpd.py --ncpus=1 --myhost=mpiserver100 -e -d -s 2
23274 ?        R      0:09  |   \_ IMB-MPI1
23272 ?        S      0:00  \_ python /user1/intel/impi/4.0/intel64/bin/mpd.py --ncpus=1 --myhost=mpiserver100 -e -d -s 2
23273 ?        R      0:09      \_ IMB-MPI1

So you can see that mpiexec and the application itself are still running: mpirun does not pass the signal on. PBS is probably responsible for the problem you mentioned; it seems PBS can kill not only the parent process but all of its children as well. Could you tell me your PBS version so that I can try to reproduce the problem?

Best wishes,
Dmitry

I have the same problem.

If I send SIGUSR1, it gets passed to the subprocesses, and they can save their state and shut down cleanly.

If I send SIGINT (Ctrl-C), mpirun exits and my processes get killed without being able to save state. How do I make mpirun ignore all signals and pass them on to the subprocesses?
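
One possible workaround, given the observation above that SIGUSR1 does reach the subprocesses, is to start the launcher from a small wrapper shell that catches SIGTERM and SIGINT and sends SIGUSR1 to the launcher instead. The sketch below only illustrates that idea. run_with_usr1 is a hypothetical helper name, it assumes the signal is delivered to the wrapper shell rather than directly to the launcher, and it has not been tested against Intel MPI or PBS.

```shell
#!/bin/sh
# Sketch: run a command in the background and translate SIGTERM/SIGINT
# received by this shell into SIGUSR1 for that command.
# run_with_usr1 is a hypothetical helper; child and status are plain
# (global) shell variables.
run_with_usr1() {
    "$@" &
    child=$!
    # On TERM or INT, do not exit; ask the child to shut down via USR1.
    trap 'kill -USR1 "$child" 2>/dev/null' TERM INT
    # wait returns early with a status above 128 when a trapped signal
    # arrives, so keep waiting until the child has really exited.
    while :; do
        status=0
        wait "$child" || status=$?
        kill -0 "$child" 2>/dev/null || break
    done
    trap - TERM INT
    return "$status"
}
```

In a job script this could be called as `run_with_usr1 mpirun -n 2 ./my_program`, where ./my_program is a placeholder for an application that saves its state on SIGUSR1. Note that Ctrl-C in a terminal signals the whole foreground process group, so the launcher may still receive SIGINT directly.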
