MPI MPMD fault tolerance support

John Gilmore

Hi,

I would really appreciate some help. Does Intel MPI support fault tolerance (run-through stabilisation) for multiple program, multiple data (MPMD) applications?

I have read the Intel MPI fault tolerance documentation. I am running a master-worker application in which the master and worker code are separate executables and the workers do not communicate with each other. My launch command looks like this:

mpirun -perhost 10 -f /home/john/Application/src/hostfile_intel \
-n 1 -env I_MPI_FABRICS shm:ofa -env I_MPI_OFA_NUM_ADAPTERS 2 \
-env I_MPI_OFA_RAIL_SCHEDULER ROUND_ROBIN -env I_MPI_FAULT_CONTINUE on ./Master : \
-n 9 -env I_MPI_FABRICS shm:ofa -env I_MPI_OFA_NUM_ADAPTERS 2 \
-env I_MPI_OFA_RAIL_SCHEDULER ROUND_ROBIN -env I_MPI_FAULT_CONTINUE on ./Worker

Does Intel MPI support this type of fault tolerance in terms of run-through stabilisation? I don't want the whole MPI job to crash if a single process crashes. Currently, it doesn't seem to be working. If I kill a process, the entire MPI job terminates with the error:
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)

Your help would be appreciated.

James Tullos (Intel)

Hi John,

Have you set the error handler to MPI_ERRORS_RETURN in your program? Are you handling errors within your program appropriately to ensure that communications with a failed worker do not continue?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
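
The advice above about not continuing to communicate with a failed worker could be sketched as a small bookkeeping table kept by the master. This is only an illustration under stated assumptions: `worker_table`, `worker_mark_failed`, `worker_next_alive`, and `MAX_RANKS` are hypothetical names, not part of MPI or Intel MPI, and the sketch assumes rank 0 is the master and that the master knows which rank a failed operation targeted (for example, a send to a specific worker rather than a receive on MPI_ANY_SOURCE).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 64 /* hypothetical upper bound on the job size */

/* Hypothetical master-side bookkeeping; not an MPI data structure. */
typedef struct {
    bool failed[MAX_RANKS]; /* failed[r] is true once rank r is considered dead */
    int  size;              /* total ranks, e.g. the value from MPI_Comm_size */
    int  next;              /* round-robin cursor over worker ranks 1..size-1 */
} worker_table;

/* Start with every worker marked alive. Rank 0 is assumed to be the master. */
void worker_table_init(worker_table *t, int size)
{
    assert(size >= 1 && size <= MAX_RANKS);
    t->size = size;
    t->next = 1;
    for (int r = 0; r < MAX_RANKS; r++)
        t->failed[r] = false;
}

/* Call this when an MPI operation aimed at `rank` returned an error. */
void worker_mark_failed(worker_table *t, int rank)
{
    if (rank > 0 && rank < t->size)
        t->failed[rank] = true;
}

/* Number of workers not yet marked as failed. */
int worker_alive_count(const worker_table *t)
{
    int alive = 0;
    for (int r = 1; r < t->size; r++)
        if (!t->failed[r])
            alive++;
    return alive;
}

/* Next live worker in round-robin order, or -1 if every worker has failed. */
int worker_next_alive(worker_table *t)
{
    for (int tried = 1; tried < t->size; tried++) {
        int rank = t->next;
        t->next = (t->next % (t->size - 1)) + 1; /* cycles through 1..size-1 */
        if (!t->failed[rank])
            return rank;
    }
    return -1;
}
```

In a master loop, the idea would be to call `worker_mark_failed` with the target rank whenever an MPI call to that worker returns something other than MPI_SUCCESS, and to choose the next destination with `worker_next_alive` so that no further work is sent to a rank already known to have failed.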

John Gilmore

Hi James,

Yes, I set the error handler right after calling MPI_Init. I'm not sure what you mean by "appropriately" handling errors. Currently, whenever I perform a send or receive, I have the following piece of code:

err = MPI_Recv(data, BUFFER_SIZE, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
if (err != MPI_SUCCESS)
{
    /* Translate the error code into its class and a readable message. */
    MPI_Error_class(err, &err_class);
    MPI_Error_string(err, err_str, &err_len);
    printf("Receive error %d: %s\n", err_class, err_str);
    fflush(stdout);
}

So I just print the error if there is one. I never see this error printout before the MPI job fails. After a receive returns an error, my application may call the same function again, but shouldn't that call also just return an error?

Also, is it possible to reuse MPI_ANY_SOURCE after a process in MPI_COMM_WORLD has failed?

Your help is greatly appreciated!
John 

James Tullos (Intel)

Hi John,

It looks like you are doing what you need to be doing.  I'll see if I can reproduce the behavior here and let you know the results.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

James Tullos (Intel)

Hi John,

Can you please send a reproducer program?  I am unable to reproduce this behavior.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
