How does Intel MPI handle network failures

Hi all,

I am new to the forum and have a question regarding network failures and MPI applications (specifically using the Intel MPI binding).

What happens if I have a number of processes running on a cluster and someone unplugs a network cable? As far as I have read, the MPI processes get terminated immediately. How can I circumvent this, say by using some sort of WAIT or TIMEOUT command when a network fault is detected, so that the processes can check whether they can recover after a set number of seconds?

Any help would be very much appreciated!

Hi Dludick,

You are quite right: there is no way to restore a connection after an unexpected network problem.
However, you could try to implement a form of fault tolerance by setting the error handler to MPI_ERRORS_RETURN. The application should then test the return code of each MPI call and execute suitable recovery code when a call is unsuccessful. Whether this works depends on the device driver.
For example, if you call MPI_Recv and there is no connection, the application is more likely to hang than to return an error.

Intel MPI Library 4.0 will include a fault tolerance implementation, but you need to design your application so that it is able to recover after a network fault.

Best wishes,
Dmitry

Dmitry Kuzmin,

I'm working on a fault-tolerant MPI Program.

Suppose this situation: you have two nodes, A and B, and A sends messages to B. B fails at some point, after a message from A to B has already been sent. In this case, TCP retransmits the message up to a certain number of times (15 by default). I had to increase this number because, once the 15 retries are exhausted, TCP raises an error and passes it to the MPI layer, and I noticed that Intel MPI then aborts the application. When I increased the tcp_retries2 variable on Linux to 60000, my fault-tolerance mechanism worked until the end of the application. This works for a small number of processes. For 128 processes (divided over 16 nodes) and 256 processes (divided over 32 nodes).
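
To get a feel for how long TCP keeps retrying before it reports the error, here is a small sketch that estimates the give-up time for a given tcp_retries2 value. It assumes the model described in the Linux kernel documentation: exponential backoff starting at the minimum retransmission timeout of 200 ms, capped at 120 s, over N + 1 intervals. The function name is made up for this example, and real connections derive their timeout from the measured round-trip time, so treat the result as a lower bound.

```c
/* Hypothetical helper: estimated seconds before TCP gives up on an
 * established connection for a given net.ipv4.tcp_retries2 value. */
static double tcp_retries2_timeout_s(int retries)
{
    const double rto_min = 0.2;    /* assumed Linux minimum RTO, seconds */
    const double rto_max = 120.0;  /* assumed Linux maximum RTO, seconds */
    double rto = rto_min;
    double total = 0.0;

    /* N retries means N + 1 waiting intervals before the error is raised. */
    for (int i = 0; i <= retries; i++) {
        total += rto;
        rto *= 2.0;                /* exponential backoff */
        if (rto > rto_max)
            rto = rto_max;         /* backoff is capped */
    }
    return total;
}
```

With the default of 15 this gives about 924.6 seconds, the figure quoted in the kernel documentation. Once the cap is reached, each additional retry adds roughly two more minutes.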

So my question is: do you know of any way to keep Intel MPI from aborting when it receives a TCP error because a connection was closed?

The error is: Assertion failed in file ../../socksm.c at line 2573: (it_plfd->revents & 0x008) == 0
internal ABORT - process 170

Another question: do you know a good way to clean out all the communication channels in MPI? I wrote a cleanup procedure that calls MPI_Probe and receives the messages that were not received earlier because of a failed node.

Hugs,
matheusbersot.

Hi matheusbersot,

Have you tried using '-env I_MPI_FAULT_CONTINUE=on'?
If you set this environment variable and use the MPI_ERRORS_RETURN error handler, you should see this message:
Assertion failed in file ../../socksm.c at line 2573: (it_plfd->revents & 0x008) == 0
internal ABORT - process 170
The error state is returned in error_string, and program execution can continue.

Regards!
Dmitry
