Enabling Connectionless DAPL UD in the Intel® MPI Library

What is DAPL UD?

Traditional InfiniBand* support involves MPI message transfer over the Reliable Connection (RC) protocol. While RC is long-standing and rich in functionality, it does have certain drawbacks: it requires that each pair of communicating processes set up a one-to-one connection at the start of the execution, so per-process memory consumption can, in the worst case, grow linearly with the number of MPI ranks, while the total number of pairwise connections across the job grows quadratically.

In recent years, the Unreliable Datagram (UD) protocol has emerged as a more memory-efficient alternative to the standard RC transfer. UD implements a connectionless model: a single UD endpoint can exchange messages with many peers, so each rank uses a fixed amount of connection state even as more MPI ranks are started.
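
As a rough back-of-the-envelope illustration (not from the article), the number of distinct RC connection pairs in a fully connected job grows quadratically with the rank count:

```shell
# Each pair of ranks needs its own RC connection, so a fully
# connected N-rank job requires N*(N-1)/2 connections cluster-wide.
for n in 64 512 4096; do
  echo "$n ranks -> $(( n * (n - 1) / 2 )) RC connection pairs"
done
```

UD, by contrast, keeps a small fixed number of endpoints per rank regardless of job size, which is where the memory savings come from.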


There are two aspects to DAPL UD support: availability in the InfiniBand* software stack, and support in the MPI implementation.

The Open Fabrics Enterprise Distribution (OFED™) stack is open source software for high-performance networking applications offering low latencies and high bandwidth. It is developed, distributed, and tested by the Open Fabrics Alliance (OFA) – a committee of industry, academic, and government organizations working to improve and influence RDMA fabric technologies. Support for the DAPL UD extensions is part of OFED 1.4.2 and later. Make sure you have the latest OFED installed on your cluster.

Alternatively, contact your InfiniBand* provider and ask if your cluster’s IB software stack supports DAPL UD.

On the MPI side, the Intel® MPI Library has supported execution over DAPL UD since version 4.0. Make sure you have the latest Intel MPI version installed on your cluster. To download the latest release, log into the Intel® Registration Center or check our website.

Enabling DAPL UD

To enable usage of DAPL UD with your Intel MPI application, you need to set the following environment variables:

$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_UD=enable

Note that shm:dapl is the default setting for the I_MPI_FABRICS environment variable. It selects the shm device for intra-node communication and the dapl device for communication between nodes.
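
Equivalently, the settings can be passed on the mpirun command line with -genv, scoped to a single run (the binary name ./my_app and the rank count are placeholders):

```shell
# Same effect as the exports above, but only for this launch.
mpirun -n 16 -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_UD enable ./my_app
```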

Selecting the DAPL UD provider

Finally, select a DAPL provider that supports the UD InfiniBand* extensions. While several providers (e.g., scm, ucm) offer this functionality, we recommend the ucm device, as it offers better scalability and is more suitable for many-core machines. For example, given the following /etc/dat.conf entries:

$ cat /etc/dat.conf
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""

The UD-capable providers are the ucm entries, i.e. those backed by libdaploucm, whose names end in "u". To use the ucm-specific provider, set:

$ export I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx4_0-1u
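
If you are unsure which entries in your dat.conf are UD-capable, a quick filter helps. The sketch below operates on a sample fragment; on a real system, point it at /etc/dat.conf instead:

```shell
# Providers backed by libdaploucm are the UD-capable (ucm) ones;
# by convention their names end in "u".
cat > /tmp/dat.conf.sample <<'EOF'
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
EOF
# Print just the provider names of the UD-capable entries.
grep 'libdaploucm' /tmp/dat.conf.sample | awk '{print $1}'
```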

Your Intel MPI application will now utilize connectionless communication at runtime.
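
To confirm at runtime which fabric and provider were actually selected, you can raise the Intel MPI debug level. A minimal sketch, assuming a placeholder binary ./my_app and two nodes; the exact wording of the startup banner varies between Intel MPI versions:

```shell
# I_MPI_DEBUG=2 makes the library print its fabric selection at startup.
export I_MPI_DEBUG=2
mpirun -n 16 -hosts node1,node2 ./my_app
# Look for the dapl data transfer mode and the chosen UD provider
# (e.g. ofa-v2-mlx4_0-1u) in the startup messages.
```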



Thomas Clune:

Gergana - thanks. We isolated this a bit more and found that the problem disappears if we use mpi_init_thread(MPI_THREAD_MULTIPLE) instead of just mpi_init(). The only reason we discovered this is that the next layer does some OpenMP work, and we traced back why that bit was behaving oddly. (We did not try other options for the REQUIRED argument.) So now we can use the settings advertised on this page, but we are bewildered as to why the behavior depends on how we initialized MPI. I think I can create a relatively modestly sized reproducer if there is interest. Certainly we can toggle the initialization and reliably hang in the layer that uses the one-sided communication to set up the buffer. (No OpenMP is used in that layer.)

Sorry - just saw the bit about submitting a report. Will do so as soon as I create a simpler reproducer than our 1-million-line weather model. :-)


Gergana S. (Intel):

Thomas, DAPL UD is meant to be run across multiple nodes, and it is not expected to hang when your distributed MPI job runs beyond a single machine. The culprit is likely some setting in your network setup, but the comments section of this article is not the best place to address that. :) Can you submit a ticket via our online service center and we'll be able to help you out?


Thomas Clune:

We have what we believe to be a standard-conforming MPI program that hangs with the settings discussed on this page. The code is essentially trying to use passive MPI one-sided communication to establish a "coordination" buffer on the root process. The code works correctly when I_MPI_DAPL_UD is not set, but when it is set to "enable", the code only works when all processes are on the same node. If the job spans multiple nodes, it hangs. Is this a bug in the MPI layer, or is this an intended limitation of UD support?




dingjun.chencmgl.ca:

I am trying to test the Intel MPI Benchmarks (IMB) 4.0 beta on our Windows PC cluster. Both Intel MPI 5.0 and WinOFED 3.2 are installed on the cluster. When I ran the tests, the following errors always occurred:

C:\Users\dingjun\mpi5tests>mpiexec -configfile config_file

dapls_ib_init() NdStartup failed with NTStatus: The specified module could not be found.

The above config_file contains the following content:

-host drmswc4-1 -n 8 -genv I_MPI_FABRICS shm:dapl IMB-MPI1 Exchange

-host drmswc4-2 -n 8 -genv I_MPI_FABRICS shm:dapl IMB-MPI1 Exchange


C:\Users\dingjun\mpi5tests>mpiexec -n 4 -env I_MPI_FABRICS shm:dapl IMB-MPI1
dapls_ib_init() NdStartup failed with NTStatus: The specified module could not be found.

dapls_ib_init() NdStartup failed with NTStatus: The specified module could not be found.

dapls_ib_init() NdStartup failed with NTStatus: The specified module could not be found.

dapls_ib_init() NdStartup failed with NTStatus: The specified module could not be found.

job aborted:
rank: node: exit code[: error message]
0: drmswc4-1.cgy.cmgl.ca: 291: process 0 exited without calling finalize
1: drmswc4-1.cgy.cmgl.ca: 291: process 1 exited without calling finalize
2: drmswc4-1.cgy.cmgl.ca: 291: process 2 exited without calling finalize
3: drmswc4-1.cgy.cmgl.ca: 291: process 3 exited without calling finalize

Could you tell me why the above errors occurred? If you are not able to answer this question, could you point me to someone at Intel who can?

By the way, on our Linux PC cluster the Intel MPI DAPL option works very well; the above problem occurred only on our Windows PC cluster. In addition, what kind of hardware was used to pass the DAPL-over-InfiniBand tests? We need hardware information such as vendor, model, and driver details (vendor-provided or open source; if open source, the download link).

I am looking forward to hearing from you and your early response is highly appreciated.

Have a good day.

Dingjun Chen

Office #150, 3553-31 Street NW

Calgary, AB T2L 2K7, Canada
