I am new to the forum and have a question regarding network failures and MPI applications (specifically using the Intel MPI binding).
What happens if I have a number of processes running on a cluster and someone unplugs a network cable? As far as I have read, the MPI processes get terminated immediately. How can I circumvent this, say by using some sort of WAIT or TIMEOUT command when a network fault is detected, so that the processes can check whether they can recover after a set number of seconds?
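To make the idea concrete, here is a rough sketch in plain C of the behaviour I am after. It does not use MPI at all. `retry_with_timeout`, `comm_op` and `flaky_op` are names I made up for illustration, and `flaky_op` is only a stand-in for a real communication call. I assume I would first have to switch the communicator's error handler to `MPI_ERRORS_RETURN` with `MPI_Comm_set_errhandler` so that a failed call returns an error code instead of aborting, but I do not know whether Intel MPI actually reports an unplugged cable that way.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, non-zero on failure.
   In a real program this would wrap an MPI call such as MPI_Send. */
typedef int (*comm_op)(void *ctx);

/* Call op repeatedly until it succeeds or timeout_s seconds have elapsed,
   sleeping pause_ms milliseconds between attempts.
   Returns 0 on success, or the last error code once the timeout expires. */
int retry_with_timeout(comm_op op, void *ctx, double timeout_s, long pause_ms)
{
    assert(op != NULL);
    time_t start = time(NULL);

    for (;;) {
        int rc = op(ctx);
        if (rc == 0)
            return 0;                      /* operation went through */
        if (difftime(time(NULL), start) >= timeout_s)
            return rc;                     /* give up, report last error */

        struct timespec ts = { pause_ms / 1000, (pause_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);              /* wait before the next attempt */
    }
}

/* Stand-in for a network operation (an assumption, not an MPI call):
   fails while the counter behind ctx is positive, then succeeds. */
int flaky_op(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return 1;
    }
    return 0;
}
```

For example, `retry_with_timeout(flaky_op, &n, 30.0, 500)` would keep trying for up to about 30 seconds, pausing half a second between attempts. What I cannot tell is whether an MPI library leaves the communicator in a usable state after such a failure, which is really the heart of my question.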
Any help would be very much appreciated!