Detecting process failure using Intel MPI

Hi,

I'm trying to use MPI_Errhandler_set on the intercommunicator to a process created by MPI_Comm_spawn. I would like to detect the failure of that process and react to it.

My testing code is:

------------------
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char ** argv){

    MPI_Comm comm_parent, intercomm;
    int err, errRecv;
    int v = 0;
    MPI_Status status;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&comm_parent);

    if(comm_parent == MPI_COMM_NULL){

        /* Parent: spawn one child on the given host */
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "192.168.0.2");

        printf("Parent creates child...\n");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, info, 0, MPI_COMM_SELF, &intercomm, &err);

        /* Ask for error codes on the intercommunicator instead of an abort */
        MPI_Errhandler_set(intercomm, MPI_ERRORS_RETURN);

        printf("Waiting...\n");
        errRecv = MPI_Recv(&v, 1, MPI_INT, 0, 0, intercomm, &status);

        if(errRecv != MPI_SUCCESS){
            printf("Error detected!\n");
            fflush(stdout);
        }

    }
    else{

        /* Child: sleep long enough to be killed by hand, then send */
        sleep(60);
        MPI_Send(&v, 1, MPI_INT, 0, 0, comm_parent);

    }

    printf("Finalize\n");
    MPI_Finalize();
    return(0);

}
------------------

I typed in a terminal:

$ export I_MPI_FAULT_CONTINUE=on
$ mpicc test.c -o test -Wall
$ mpirun -np 1 ./test 1

In another terminal, I killed the child process, and the parent process stopped without printing the remaining messages (from printf). The only output is:

$ mpirun -np 1 ./test 1
Parent creates child...
Waiting...
$

I was expecting the program to continue and print "Error detected!" and "Finalize".
Why doesn't that happen?

Thanks,
Fernanda Oliveira


Hi Fernanda,

According to the documentation, fault tolerance works only for master-slave applications, and only for failures of processes whose rank is not 0. You also need to set the error handler on MPI_COMM_WORLD. It may not be obvious, but this means the fault tolerance feature won't work for spawned processes (note that both processes in your case have rank 0).
You just need to modify your program:
MPI_Init(&argc, &argv);
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

And start 2 processes: mpirun -np 2 ...
So you also need to change
if(comm_parent == MPI_COMM_NULL){
to
if(rank == 0){

Working with spawned processes is a very difficult task, and I'd recommend avoiding this scheme of MPI programming.

Regards!
Dmitry

Our testing has shown that, in addition to the restrictions on fault tolerance mentioned in the MPI reference guide, it also only works when the slave processes only send and the master receives with MPI_Waitany on a vector of receive requests. When an error is returned for a request, that request must be removed from the vector and never waited on again. All other scenarios we tried resulted in various system failures.
