Hi,
I am running a heterogeneous cluster: half the nodes have Gbit Ethernet and the other half InfiniBand. For a year or so everything went well, but recently the Gbit nodes have started complaining about the lack of InfiniBand (see below). This phenomenon is limited to impi code; GNU MPI still runs fine.
The problem appears unrelated to the queuing system: a direct launch fails in the same way as an SGE-submitted one.
Any help would be greatly appreciated.
...
compute-0-15.local:19848: open_hca: rdma_bind ERR No such device. Is eth0 configured?
compute-0-15.local:19847: open_hca: rdma_bind ERR No such device. Is eth0 configured?
compute-0-15.local:19845: open_hca: getaddr_netdev ERROR: No such device. Is ib1 configured?
compute-0-15.local:19845: open_hca: device mthca0 not found
compute-0-15.local:19845: open_hca: device mthca0 not found
compute-0-15.local:19845: open_hca: device mlx4_0 not found
compute-0-15.local:19845: open_hca: device mlx4_0 not found
compute-0-15.local:19845: open_hca: device ipath0 not found
compute-0-15.local:19845: open_hca: device ipath0 not found
compute-0-15.local:19845: open_hca: device ehca0 not found
compute-0-15.local:19845: open_hca: rdma_bind ERR No such device. Is eth0 configured?
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
rank 0 in job 1 xxxxl_51508 caused collective abort of all ranks
exit status of rank 0: return code 13


