We're using Intel MPI on an SGE cluster (tight integration). On some nodes, jobs consistently fail with messages similar to these:
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 24 of 24 nodes
node26-05_46554 (handle_rhs_input 2425): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring
node26-22_42619: connection error in connect_lhs call: Connection refused
node26-22_42619 (connect_lhs 777): failed to connect to the left neighboring daemon at node26-23 40826
Is there any promising way to debug this and find out where the actual problem is? There seems to be some communications problem, but I do not know where.
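One check I could imagine is a plain TCP connect test between the affected nodes, run while a job's mpd ring is still up. Below is a minimal sketch of that idea, not a known-good diagnostic: the function name `can_connect` is made up for illustration, and the host and port in the comment are simply the values from the log above (the mpd port changes with every job, so it would have to be read from the current job's output).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try to open a TCP connection to host:port.

    Returns (True, None) on success, or (False, exc) where exc is the
    OSError raised (connection refused, timeout, name lookup failure, ...).
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, exc


# Hypothetical usage, with the host and port taken from the log above:
#   ok, err = can_connect("node26-23", 40826)
#   print("reachable" if ok else "failed: %s" % err)
```

If this is refused from node26-22 but works from other nodes, that would at least point to something between those two hosts (firewall, name resolution, wrong interface) rather than to mpd itself. If it is refused from everywhere, the daemon on node26-23 has probably already exited.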
Thanks for any insight,