bad filename - /etc/dat.conf

Hello,
We have an HPC cluster using InfiniBand (Mellanox ConnectX-2) with OFED 1.5 and Intel MPI 4.0.
While running an MPI binary, we got:
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
DAT Registry: default, bad filename - /etc/dat.conf, aborting
[10] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[15] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[9] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[12] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[13] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[11] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[8] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[14] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[8] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 8:wn2
[9] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 9:wn2
[10] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 10:wn2
[11] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 11:wn2
[14] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 14:wn2
[15] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 15:wn2
[12] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 12:wn2
[13] MPI startup(): DAPL provider  on rank 0:wn1 differs from ofa-v2-mlx4_0-1u(v2.0) on rank 13:wn2
[0] dapl fabric is not available and fallback fabric is not enabled
[1] dapl fabric is not available and fallback fabric is not enabled
[2] dapl fabric is not available and fallback fabric is not enabled
[3] dapl fabric is not available and fallback fabric is not enabled
[6] dapl fabric is not available and fallback fabric is not enabled
[7] dapl fabric is not available and fallback fabric is not enabled
rank 7 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 7: return code 254 
[8] MPI startup(): shm and dapl data transfer modes
[15] MPI startup(): shm and dapl data transfer modes
rank 3 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 3: return code 254 
rank 2 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 2: return code 254 
rank 1 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 1: return code 254 
rank 0 in job 1  wn1_34032   caused collective abort of all ranks
  exit status of rank 0: return code 254 


dat.conf is the same on all compute nodes because it is on a shared NFS filesystem (the servers are diskless), and there was nothing wrong with NFS at the time.

Here is my dat.conf:
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
ofa-v2-cma u2.0 nonthreadsafe default libdaplcma.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""

What could be the reason for this problem?
There is a new syntax for I_MPI_DEVICE in Intel MPI 4.0, and I was still using the old syntax (I_MPI_DEVICE="rdssm:ofa-v2-mlx4_0-1u"). Should it work the same as in Intel MPI 3? Probably we should use the new syntax, but which fabrics would you suggest in this case, and why? How does "ofa" differ in practice from the various DAPL providers? As far as I understand, ofa doesn't use DAPL at all?


Hi Rafal,

In Intel MPI Library 4.0 you can still use I_MPI_DEVICE, but only for the rdma and rdssm fabrics. ofa-v2-mlx4_0-1u should work with the ofa fabric.

To use the ofa fabric you need to set I_MPI_FABRICS=ofa or I_MPI_FABRICS=shm:ofa.

To use DAPL you need to set I_MPI_FABRICS=shm:dapl.

If your dat.conf file is not located in the /etc directory, please use the DAT_OVERRIDE environment variable.
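For example, these settings could go into a job script like this (the dat.conf path, process count, and binary name below are placeholders, not values from this cluster):

```shell
# Use shared memory within a node and the OFA (verbs) fabric between nodes:
export I_MPI_FABRICS=shm:ofa

# Alternatively, use shared memory plus DAPL:
# export I_MPI_FABRICS=shm:dapl

# If dat.conf is not in /etc, tell the DAT registry where to find it:
export DAT_OVERRIDE=/path/to/dat.conf

mpirun -n 16 ./your_mpi_app
```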

I hope this helps.

Regards!
Dmitry

Hi Dmitry, thanks for clearing things up. The ofa interface is not well documented, but AFAIK it has multi-rail support, which DAPL doesn't. I believe that if DAPL works OK, there is no reason to switch to ofa.
Regards

"If your dat.conf file is not located in /etc directory, please use DAT_OVERRIDE env variable."My dat.conf was located in /etc everywhere but I was using binary compiled with MPI 3.2 with mpirun from MPI 4.Could that be a reason of some problems? Is there something like binary compability between MPI 3.2 and MPI 4?

Hi Rafal,

Yes, 4.0 should be binary compatible with the 3.2 library, but I'd recommend recompiling if possible. You can also check which libraries the binary links against with the ldd command.
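As a quick sanity check (the binary name here is a placeholder), ldd shows which MPI shared libraries the executable will actually load at run time:

```shell
# List the dynamic libraries the binary links against and filter for
# the MPI runtime; the reported paths reveal whether the 3.2 or 4.0
# libmpi is being picked up.
ldd ./your_mpi_app | grep -i mpi
```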

OFA supports multi-rail, and you can use the I_MPI_OFA_NUM_ADAPTERS variable to set the number of interconnects on your nodes.
If you have multi-port cards, you also need to set the I_MPI_OFA_NUM_PORTS env variable.
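A sketch of the multi-rail settings described above, assuming two single-port HCAs per node (the counts are illustrative, not a recommendation for this cluster):

```shell
export I_MPI_FABRICS=shm:ofa
# Number of InfiniBand adapters (rails) to use per node:
export I_MPI_OFA_NUM_ADAPTERS=2
# Only needed for multi-port cards:
# export I_MPI_OFA_NUM_PORTS=2
```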

Please let me know if the issue still persists.

Regards!
Dmitry
