I am writing because I am currently implementing a failure recovery system for a cluster with Intel OmniPath that will be designated for handling computations in a physical experiment. What I want to implement is a mechanism to detect a node that failed and to notify rest of the nodes. I tried to check the node failure by invoking psm2_poll. Unfortunately, as I saw in the Intel ® Performance ScaledMessaging 2 (PSM2) Programmer’s Guide, this function does not return errors (values) other than OK or OK_NO_PROGRESS (this is at least what I have observed in my application - the poll on a dead node behaves as if the node did not fail/disconnect and did not send any message).
So the question is: What are the methods of notifying other nodes after node failure ? Is there a lightweight function that I can invoke along with poll to check if the node from whom I am trying to get messages exists ?
In worst case, I can implement this using a counter and a timeout, but if there is a mechanism supported by the API, I am wide open.