Random fabric errors on Red Hat Enterprise Linux 5.4

Problem:

The Intel® MPI Library fails intermittently when run over the RDSSM or RDMA devices. Approximately 5-10% of runs fail on RHEL (Red Hat Enterprise Linux) 5.4, but this problem does not occur on earlier versions of RHEL.

The debug output shows the following error during Intel MPI Library operations:

setup_listener Cannot assign requested address

Environment:

Red Hat Enterprise Linux 5.4 only

Root Cause:

This error occurs when the Intel MPI Library is used with the version of OFED (OpenFabrics Enterprise Distribution) included with RHEL 5.4. A port space conflict with RDS (Reliable Datagram Sockets) is possible, and when it occurs, uDAPL does not resolve it correctly.

By default, the Intel MPI Library uses its process ID to define its port number. On RHEL 5.4, the process ID can occasionally match a port number that the RDS driver has already allocated, which creates a port space conflict. When that happens, uDAPL currently returns the wrong return code to the Intel MPI Library, and communication fails.

Resolution:

As a temporary workaround, set the following environment variable on all nodes:

$ export I_MPI_RDMA_CREATE_CONN_QUAL=0

With this variable set, the Intel MPI Library no longer derives its port number from its process ID.
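
The variable has to reach every MPI rank, not only the shell that launches the job. The following is a minimal sketch of one way to set and verify it, assuming a Bourne-style shell. The application name ./my_mpi_app and the rank count are placeholders, and the launch line relies on the Intel MPI launcher's -genv option to pass the variable to all ranks; check the launcher documentation for your version before relying on it.

```shell
# Set the workaround in the current shell.
# Note: the shell does not allow spaces around '='.
export I_MPI_RDMA_CREATE_CONN_QUAL=0

# Confirm the value that child processes will inherit.
env | grep '^I_MPI_RDMA_CREATE_CONN_QUAL='

# Hypothetical launch line: ./my_mpi_app and "-n 4" are placeholders.
# -genv passes the variable explicitly to every rank.
# The guard skips the launch when no launcher or application is present.
if command -v mpiexec >/dev/null 2>&1 && [ -x ./my_mpi_app ]; then
    mpiexec -genv I_MPI_RDMA_CREATE_CONN_QUAL 0 -n 4 ./my_mpi_app
fi
```

Exporting the variable in a shell startup file on each node is an alternative when jobs are launched from many different sessions.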

This error is resolved in DAPL 2.0.25, which is to be included in OFED 1.5. For the status of the fix, see the latest OFED release notes.
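
To decide whether the workaround is still needed on a given node, the installed DAPL version can be compared against 2.0.25. The sketch below is one possible check, not an official procedure. It assumes the library is installed as an RPM package named dapl (the package name may differ on your system), and version_ge is a helper defined here for illustration only.

```shell
# Helper (illustrative): succeeds if dotted version $1 >= dotted version $2.
# Components are compared numerically, left to right; missing ones count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Assumption: DAPL is installed as an RPM package named "dapl".
if command -v rpm >/dev/null 2>&1 &&
   out=$(rpm -q --queryformat '%{VERSION}\n' dapl 2>/dev/null); then
    # Several architectures may be installed; inspect the first entry only.
    installed=$(printf '%s\n' "$out" | head -n 1)
    if version_ge "$installed" 2.0.25; then
        echo "dapl $installed: 2.0.25 or later, the fix should be present"
    else
        echo "dapl $installed: older than 2.0.25, keep the workaround"
    fi
else
    echo "dapl package not found via rpm; check your OFED installation"
fi
```

A distribution may also backport the fix into a package with an older version number, so the release notes remain the authoritative source.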
