We have a new cluster with a Mellanox FDR InfiniBand interconnect, and we sometimes get the following error when running Intel MPI:
[15] Abort: Error code in polled desc!at line 2346 in file ../../ofa_init.c
[16] Abort: Error code in polled desc!at line 2346 in file ../../ofa_init.c
[16] Abort: Got FATAL event 3at line 1010 in file ../../ofa_utility.c
[159] Abort: Error code in polled desc!at line 2346 in file ../../ofa_init.c
[0] Abort: Error code in polled desc!at line 2346 in file ../../ofa_init.c
We have also seen this error when running across a very large set of nodes:
send desc error[400] Abort: Got completion with error 9, vendor code=8a, dest rank=at line 870 in file ../../ofa_poll.c
I do not see this type of error at all with Open MPI. The cluster is using OFED (not the Mellanox vendor-supplied version), and we use Torque as our resource manager.
Any help diagnosing this would be appreciated.
Bernie



