Unexpected DAPL event 0x4003

Hello,

I am trying to start an MPI job with the following settings.

I have two nodes, workstation1 and workstation2.
I can ssh from workstation1 (10.0.0.1) to workstation2 (10.0.0.2) without a password; I have already set up RSA keys.
I can ssh from both workstation1 and workstation2 to themselves without a password.
I can ping from 10.0.0.1 to 10.0.0.2 and from 10.0.0.2 to 10.0.0.1.
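
For reference, these checks can be scripted non-interactively (hostnames and IPs as above; BatchMode makes ssh fail immediately instead of prompting if key authentication is not picked up):

# Run from workstation1; mirror the commands from workstation2 as well.
# BatchMode disables password fallback, so a failure here means the RSA
# keys are not being used.
ssh -o BatchMode=yes 10.0.0.2 hostname
ssh -o BatchMode=yes 10.0.0.1 hostname

# Basic IP reachability.
ping -c 3 10.0.0.2
ping -c 3 10.0.0.1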

workstation1 and workstation2 are connected via Mellanox InfiniBand.
I'm running Intel(R) MPI Library, Version 2017 Update 2  Build 20170125
I've installed MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64
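
For reference, the state of the Mellanox HCA can be confirmed with the standard OFED tools (the device name mlx4_0 below is assumed from the DAPL provider name used later; adjust it if your adapter is reported differently):

# The port should show State: Active and Physical state: LinkUp; anything
# else points at a link or subnet-manager problem rather than MPI.
ibstat mlx4_0

# Cross-check the verbs devices and ports visible to user space.
ibv_devinfo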

workstation1 /etc/hosts :

127.0.0.1    localhost
10.0.0.1    workstation1

# The following lines are desirable for IPv6 capable hosts
#::1     ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters

# mpi nodes
10.0.0.2 workstation2

-------------------------------------------------------------
workstation2 /etc/hosts :

127.0.0.1    localhost
10.0.0.2    workstation2

# The following lines are desirable for IPv6 capable hosts
#::1     ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters

#mpi nodes
10.0.0.1 workstation1

--------------------------------------------------------------
Here's my application start command (app names and params simplified):

#!/bin/bash
export PATH=$PATH:$PWD:/opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$I_MPI_ROOT/intel64/lib:../program1/bin:../program2/bin
export I_MPI_FABRICS=dapl:dapl
export I_MPI_DEBUG=6
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1

# Due to a bug in Intel MPI, the -genv I_MPI_ADJUST_BCAST "9" flag has been added.
# More detailed information is available: https://software.intel.com/en-us/articles/intel-mpi-library-2017-known-issue-mpi-bcast-hang-on-large-user-defined-datatypes

mpirun -l -genv I_MPI_ADJUST_BCAST "9" -genv I_MPI_PIN_DOMAIN=omp \
: -n 1 -host 10.0.0.1 ../program1/bin/program1 master stitching stitching \
: -n 1 -host 10.0.0.2 ../program1/bin/program1 slave dissemination \
: -n 1 -host 10.0.0.1 ../program1/bin/program2 param1 param2
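
For reference, the provider named in I_MPI_DAPL_PROVIDER_LIST has to match an entry in the DAT registry; a quick sanity check (assuming the default /etc/dat.conf location used by the OFED dapl packages) is:

# An empty result would mean the ofa-v2-mlx4_0-1 provider is not registered
# on this node (default dat.conf path assumed; yours may differ).
grep ofa-v2-mlx4_0-1 /etc/dat.conf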

-------------------------------------------

I can start my application on both nodes with export I_MPI_FABRICS=tcp:tcp, but when I start it with dapl:dapl I get the following error:

OUTPUT :

[0] [0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 2  Build 20170125 (id: 16752)
[0] [0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation.  All rights reserved.
[0] [0] MPI startup(): Multi-threaded optimized library
[0] [0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[1] [1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[2] [2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[0] [0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[0] [0] MPI startup(): dapl data transfer mode
[1] [1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[2] [2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] [1] MPI startup(): dapl data transfer mode
[2] [2] MPI startup(): dapl data transfer mode
[0] [0:10.0.0.1] unexpected DAPL event 0x4003
[0] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[0] MPIR_Init_thread(805): fail failed
[0] MPID_Init(1831)......: channel initialization failed
[0] MPIDI_CH3_Init(147)..: fail failed
[0] (unknown)(): Internal MPI error!
[1] [1:10.0.0.2] unexpected DAPL event 0x4003
[1] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[1] MPIR_Init_thread(805): fail failed
[1] MPID_Init(1831)......: channel initialization failed
[1] MPIDI_CH3_Init(147)..: fail failed
[1] (unknown)(): Internal MPI error!

Do you have any idea what could be the cause? By the way, with dapl on a single node I can start my application on each computer separately (i.e. using -host 10.0.0.1 for every process on workstation1, never including the 10.0.0.2 processes).
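
For what it is worth, the raw RDMA path between the two nodes can also be exercised outside of MPI with the verbs ping-pong example shipped with MLNX_OFED (assuming the ibverbs example binaries are installed):

# On workstation2: start the ping-pong responder on the Mellanox device.
ibv_rc_pingpong -d mlx4_0

# On workstation1: connect to workstation2. The connection setup uses the
# normal IP address; the data transfer itself goes over InfiniBand verbs.
ibv_rc_pingpong -d mlx4_0 10.0.0.2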

Hi,

This could be an internal MPI library issue, or something completely different. I'd like to see the output of a few tests to help isolate the issue:

1) Run a simple "mpirun -n 1 -host 10.0.0.1 hostname : -n 1 -host 10.0.0.2 hostname"

2) Build the "test.c" example provided with Intel MPI (in the installation directory under the test directory) and run that:

$ mpicc test.c -o impi_test

$ mpirun -n 1 -host 10.0.0.1 ./impi_test : -n 1 -host 10.0.0.2 ./impi_test

This will help me determine whether this is a startup issue, as it appears to be, or something more related to the MPMD setup you seem to be running.
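
A minimal sketch of that second test, assuming the Intel MPI 2017 installation path from the start script above (adjust if your install directory differs):

# Load the Intel MPI environment if mpicc/mpirun are not already on the PATH.
source /opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/mpivars.sh

# Build the bundled test program (it ships in the test directory of the
# installation) and run one rank on each node.
mpicc $I_MPI_ROOT/test/test.c -o impi_test
mpirun -n 1 -host 10.0.0.1 ./impi_test : -n 1 -host 10.0.0.2 ./impi_test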

Also, is your system configured for IPv4 or IPv6?

Regards,
Carlos
