I would really appreciate some help. I would like to know whether Intel MPI supports fault tolerance (run-through stabilisation) for multiple program, multiple data (MPMD) applications.
I have read the Intel MPI fault tolerance documentation. I am running a master-worker application, where the master and worker code are separate and there is no communication among the workers. My launch command looks like this:
mpirun -perhost 10 -f /home/john/Application/src/hostfile_intel \
-n 1 -env I_MPI_FABRICS shm:ofa -env I_MPI_OFA_NUM_ADAPTERS 2 \
-env I_MPI_OFA_RAIL_SCHEDULER ROUND_ROBIN -env I_MPI_FAULT_CONTINUE on ./Master : \
-n 9 -env I_MPI_FABRICS shm:ofa -env I_MPI_OFA_NUM_ADAPTERS 2 \
-env I_MPI_OFA_RAIL_SCHEDULER ROUND_ROBIN -env I_MPI_FAULT_CONTINUE on ./Worker
Does Intel MPI support this type of fault tolerance in terms of run-through stabilisation? I don't want the whole MPI job to crash if a single process crashes. Currently, it doesn't seem to be working. If I kill a process, the entire MPI job terminates with the error:
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Your help will be appreciated.