Error DAPL with Intel MPI library

Dear HPC forum,
I'm trying Intel MPI 4.0 on a Linux cluster, but I'm having some problems. My PBS commands are:

export I_MPI_MPD_RSH=ssh

mpirun -genv I_MPI_USE_DYNAMIC_CONNECTIONS 0 -n 32 -env I_MPI_DEVICE rdssm:OpenIB-cma  $CINECA_SCRATCH/skampi_intel_mpi -i $CINECA_SCRATCH/skampi-5.0.4-r0355/ski/skampi_coll.ski -o $CINECA_SCRATCH/intel_coll.ski_32

along with, obviously, other flags related to the queue, number of processors, and so on.

But when I run my PBS script I get the following error:

[5:node151ib0] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4045: 0
internal ABORT - process 0
[0:node151ib0] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4045: 0
internal ABORT - process 0
[2:node151ib0] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4045: 0
internal ABORT - process 0
[4:node151ib0] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4045: 0
internal ABORT - process 0
[1:node151ib0] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4045: 0
internal ABORT - process 0
[7:node151ib0] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4045: 0
internal ABORT - process 0
[6:node151ib0] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4045: 0
internal ABORT - process 0
[3:node151ib0] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 4045: 0
internal ABORT - process 0

I have read that this type of error is related to dynamic process connections, and that setting "-genv I_MPI_USE_DYNAMIC_CONNECTIONS 0" can solve it. But that turned out not to be true in my case.
Adding -env I_MPI_DEBUG 5, the output is:

[21] MPI startup(): RDMA, shared memory, and socket data transfer modes
[17] MPI startup(): RDMA, shared memory, and socket data transfer modes
......
[27] MPI startup(): DAPL provider OpenIB-cma
[24] MPI startup(): DAPL provider OpenIB-cma
.....
[28] MPI startup(): shm and dapl data transfer modes
[31] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): static connections storm algo
node151:16665: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.14.102.68,27804
node151:16664: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.14.102.68,27804
node151:16663: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.14.102.68,27804
node151:16662: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.14.102.68,27804
rank 23 in job 1 node151ib0_36911 caused collective abort of all ranks
exit status of rank 23: return code 1
rank 17 in job 1 node151ib0_36911 caused collective abort of all ranks
exit status of rank 17: return code 1
rank 1 in job 1 node151ib0_36911 caused collective abort of all ranks
exit status of rank 1: return code 1

Can someone help me? Thanks in advance.


Hi unrue,

Welcome to the Intel HPC forums!

While it's true that this issue is sometimes resolved by disabling dynamic connection establishment, the 4008 error code here comes from the OFED layer and means "destination unreachable". There are several possible causes for that.

Looking at your command line, though, the most likely culprit is the use of the OpenIB-cma DAPL device. The CMA driver is fairly old (in relation to how fast technology moves :) and does not scale out. We recommend using the newer SCM driver instead, which can be found in later versions of the OFED stack. What version do you have installed right now? You can find out by running the ofed_info utility on your cluster. We recommend you upgrade to the latest OFED 1.5.1 release.

Finally, it would be great if you could provide the contents of your /etc/dat.conf file, and the output of the ibstat utility (usually located in /usr/sbin).

Looking forward to hearing back from you.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Dear Gergana,

thanks for your reply! Unfortunately, I can't install a new OFED driver because I don't have root access on the machine. At the moment, I have solved my problem with these settings:

export I_MPI_MPD_RSH=ssh
export I_MPI_USE_DYNAMIC_CONNECTIONS=0
export I_MPI_FABRICS_LIST="ofa,dapl,tcp,tmi"
export I_MPI_DEBUG=5
export I_MPI_FALLBACK_DEVICE=1

mpirun -n 512 $CINECA_SCRATCH/skampi_intel_mpi -i $CINECA_SCRATCH/skampi-5.0.4-r0355/ski/skampi_coll.ski -o $CINECA_SCRATCH/intel_coll.ski_512

To answer your question:

My current OFED version is 1.4 (not too old :) ).

Output of ibstat is:

CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.6.648
Hardware version: a0
Node GUID: 0x0002c9030004b056
System image GUID: 0x0002c9030004b059
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 105
LMC: 0
SM lid: 207
Capability mask: 0x02510868
Port GUID: 0x0002c9030004b057

And my /etc/dat.conf is:
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
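
To cross-reference the ibstat output against dat.conf, a small sketch like the following pulls out the DAPL 2.0 providers for a given HCA (hedged: the heredoc and the /tmp path stand in for the real /etc/dat.conf; the awk logic is mine, not an OFED tool):

```shell
# Sketch: list the DAPL 2.0 (u2.0) providers in dat.conf for a given HCA.
# /tmp/dat.conf.sample and its contents stand in for the real /etc/dat.conf.
hca=mlx4_0
cat > /tmp/dat.conf.sample <<'EOF'
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
EOF
# Field 2 is the DAPL version tag; keep u2.0 lines whose entry mentions $hca.
awk -v hca="$hca" '$2 == "u2.0" && index($0, hca) {print $1}' /tmp/dat.conf.sample
```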

Running the SKaMPI benchmark, the test is very slow and appears to lock up at a particular communication buffer size (8192). With OpenMPI it works correctly. Do you have any idea about this behaviour?

Thanks in advance!

Can nobody help me? :(

I'm not an expert on the subject, but after searching a bit I have found some potential alternatives at http://kerneltrap.org/mailarchive/openfabrics-general/2008/2/1/685234. I've appended the relevant part at the end.
Also, since your issue seems to be related to OFED, which follows a typical open-source process, I would suggest checking http://www.openfabrics.org/resources_linux.htm for documentation/support resources.
Hope it helps. -- Andres

When using rdma_cm to establish end-to-end connections we incur a 3-step process, each with various tunable knobs: ARP, path resolution, and CM req/reply. Any one of these could cause the 4008 timeout error.

Here are tunable parameters that may help:

1. ARP:

ARP cache entries for ib0 can be increased from the default of 30:

sysctl -w net.ipv4.neigh.ib0.base_reachable_time=14400

2. PATH RESOLUTION:

ib_sa.ko provides path record caching, no timer controls,
auto refresh with new device notification events from SM/SA,
manual refresh control for administrators,
default == SA caching is OFF.

To enable: add following to /etc/modprobe.conf -

options ib_sa paths_per_dest=0x7f
or
echo 0x7f > /sys/module/ib_sa/paths_per_dest

To manually refresh:
echo 1 > /sys/module/ib_sa/refresh

To monitor:
cat /sys/module/ib_sa/lookup_method
* 0 round robin
1 random

cat /sys/module/ib_sa/paths_per_dest

You can also increase the uDAPL PR timeout with the following
environment variable (if you don't have SA caching):

export DAPL_CM_ROUTE_TIMEOUT_MS=20000 (default=4000)

3. CM PROTOCOL:

OFED 1.2.5 provides the following module parameters to increase
the IB cm response timeout from default of 21:

To increase timeout: add following to /etc/modprobe.conf -
options rdma_cm cma_response_timeout=23
options ib_cm max_timeout=23

Hi unrue,

I actually think you might be running over the good ol' sockets interface instead of one of the faster interconnects, which might explain the slowness. To check, could you provide the output of your mpirun command (since you're already using I_MPI_DEBUG)?

Also, thanks for providing all that info. Instead of setting all of those env variables, could you try this:

mpirun -r ssh -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1 -n 512 $CINECA_SCRATCH/skampi_intel_mpi -i $CINECA_SCRATCH/skampi-5.0.4-r0355/ski/skampi_coll.ski -o $CINECA_SCRATCH/intel_coll.ski_512

Let me know how it goes.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Dear Gergana,

thanks to your suggestion, it now works very well! But what exactly is a DAPL provider? An interface to the InfiniBand driver?

Hi unrue,

I'm glad to hear you have it working. Enjoy the Intel MPI Library :)

Quoting unrue: "But what exactly is a DAPL provider? An interface to the InfiniBand driver?"

Yes, that's one way to think of it. If you start at the highest level, the Intel MPI Library supports several interfaces:

  • sockets - just regular TCP/IP for GigE and 10GigE clusters
  • shared memory - via the /dev/shm device
  • TMI - a new interface to allow for native tag matching support - for Myrinet* MX* and Qlogic* PSM* drivers
  • OFA - direct support of the OFED verbs - allows for multi-rail capabilities
  • DAPL (Direct Access Programming Library) - formally defined, this is an RDMA (remote direct memory access) API. Informally defined, it's just a way for the various interconnect vendors out there (e.g. Mellanox, Qlogic, etc) to interact with the higher level libraries (such as Intel MPI) in some standard fashion. And because there are multiple IHVs making network cards (plus different versions of the DAPL interface - 1.2 and 2.0), you need multiple DAPL providers to support them all.
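
As a sketch, each of these interfaces can be selected explicitly before mpirun (hedged: the variable names are the I_MPI_FABRICS / I_MPI_DAPL_PROVIDER syntax used elsewhere in this thread; adjust the provider name to your own /etc/dat.conf):

```shell
# Sketch: selecting a fabric explicitly via environment variables.
# The commented lines show the alternative fabric choices.
export I_MPI_FABRICS=shm:dapl               # shared memory intra-node, DAPL inter-node
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1  # provider name from /etc/dat.conf
# export I_MPI_FABRICS=shm:ofa              # OFA verbs (multi-rail capable)
# export I_MPI_FABRICS=shm:tmi              # TMI (Myrinet MX / QLogic PSM)
# export I_MPI_FABRICS=shm:tcp              # plain sockets (GigE / 10GigE)
echo "$I_MPI_FABRICS with provider $I_MPI_DAPL_PROVIDER"
```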

In your case, you were using Mellanox cards, I believe, so you had to pick the one DAPL provider that provides support for those: ofa-v2-mlx4_0-1. The "ofa-v2" prefix is OFED's way of designating DAPL providers that support the new DAPL 2.0 standard (vs. DAPL 1.2, marked by the OpenIB prefix), "mlx4_0" is taken from the ibstat output and identifies which vendor card you have, and the "-1" means only your port 1 is active (also from ibstat).
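
That naming scheme can be illustrated by picking the provider string apart with plain shell parameter expansion (a sketch; the splitting logic below is mine for illustration, not an OFED convention):

```shell
# Sketch: decompose a provider name like ofa-v2-mlx4_0-1 into its parts.
provider=ofa-v2-mlx4_0-1
gen=${provider%%-mlx*}   # "ofa-v2" -> DAPL 2.0 ("OpenIB" would mean DAPL 1.2)
rest=${provider#ofa-v2-} # "mlx4_0-1"
port=${rest##*-}         # trailing "1" -> the active HCA port from ibstat
hca=${rest%-*}           # "mlx4_0"    -> the CA name from ibstat
echo "$gen $hca $port"
```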

That's probably a lot more info than you wanted but I hope it was helpful nonetheless.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Hi Gergana,

thanks very much for your help. Details are never too much :) But I'm confused. You suggested I use ofa-v2-mlx4_0-1.

I don't understand whether, in that way, I'm using the InfiniBand network. In my /etc/dat.conf I also have:

ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""

And in my ifconfig output the ib0 interface is defined:

ib0 Link encap:InfiniBand
..
...

So I have a doubt: are the Mellanox cards what is used to interact with the InfiniBand network?

And how can I switch from InfiniBand to Gigabit Ethernet, for example?

Hi unrue,

This is getting a bit outside my expertise, since it's a question about OFED and how they name their devices. My guess would be that you can probably use the ofa-v2-ib0 device, but that it would default to the mlx4_0 one. My suggestion would be: go ahead and try it both ways with I_MPI_DEBUG=5. That should print out some information so we can see which providers it's accessing.
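
As a quick sketch of that check (hedged: the heredoc mimics the I_MPI_DEBUG startup lines quoted earlier in this thread; in practice the input would be the real mpirun output):

```shell
# Sketch: grep the I_MPI_DEBUG=5 startup lines to see which fabric and
# DAPL provider were actually chosen. The heredoc mimics real output.
cat > /tmp/mpi_debug.sample <<'EOF'
[0] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] MPI startup(): shm and dapl data transfer modes
EOF
grep -E 'data transfer modes|DAPL provider' /tmp/mpi_debug.sample
```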

But, just to clarify: you are using InfiniBand if you're selecting the ofa-v2-mlx4_0-1 provider. That's the purpose of the /etc/dat.conf file - it lists the DAPL providers used to access your IB cards.

If you want to switch between IB and Ethernet, you need to do this:

  • For IB:
    mpirun -r ssh -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1 -n 512 $CINECA_SCRATCH/skampi_intel_mpi -i $CINECA_SCRATCH/skampi-5.0.4-r0355/ski/skampi_coll.ski -o $CINECA_SCRATCH/intel_coll.ski_512
  • For Ethernet:
    mpirun -r ssh -genv I_MPI_FABRICS shm:tcp -n 512 $CINECA_SCRATCH/skampi_intel_mpi -i $CINECA_SCRATCH/skampi-5.0.4-r0355/ski/skampi_coll.ski -o $CINECA_SCRATCH/intel_coll.ski_512

Let me know how it goes.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Dear Gergana,

now, thanks to you, my doubt is cleared up.

Regards.
