tmi

Hello everyone,

I am trying to use the tmi fabric with the Intel MPI Library, but when I run my application with dynamic process management via MPI_Comm_spawn, it fails to run. If I run without any I_MPI_FABRICS argument, it works fine. Could someone please suggest what I might be doing wrong? The lines marked with "-->" are debugging statements from my program.
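For context, the parent side of such a spawn is typically shaped like the C sketch below. This is not the original ./parent source (which isn't posted); "./child" and the single-process spawn count are placeholders.

```c
/* Minimal MPI_Comm_spawn parent, sketched from the debug output in this
 * thread. Build with mpicc and launch with: mpirun -n 1 ./parent */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    int errcodes[1];
    MPI_Comm intercomm;

    /* Corresponds to the "-->provided: MPI_THREAD_MULTIPLE" debug line. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* The spawned child calls MPI_Init_thread and must connect back to
     * the parent's port -- the step that fails here over shm:tmi. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

The "Invalid port" stack below is raised inside the child's MPI_Init_thread, when it tries to connect back to the port the parent published.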

//***************************

/opt/intel/impi/4.1.1.036/intel64/bin/mpirun -n 1 -perhost 1 -f ./mpd.hosts -env I_MPI_DEBUG 2 -env I_MPI_FABRICS shm:tmi ./parent

[0] MPI startup(): shm and tmi data transfer modes
-->provided: MPI_THREAD_MULTIPLE
-->Initializing MPI environment...
-->Finished initializing MPI environment.
-->Spawning child binary ...

Fatal error in PMPI_Init_thread: Invalid port, error stack:
MPIR_Init_thread(658)............................:
MPID_Init(320)...................................: spawned process group was unable to connect back to the parent on port <tag#0$epaddr_size#16$epaddr#0C00000000000000030A0C0000000000$>
MPID_Comm_connect(206)...........................:
MPIDI_Comm_connect(579)..........................: Named port tag#0$epaddr_size#16$epaddr#0C00000000000000030A0C0000000000$ does not exist
MPIDI_Comm_connect(380)..........................:
MPIDI_Create_inter_root_communicator_connect(134):
MPIDI_CH3_Connect_to_root(309)...................:
MPID_nem_tcp_connect_to_root(1082)...............:
MPID_nem_tcp_get_addr_port_from_bc(1236).........: Missing port or invalid host/port description in business card

//***************************

$ cat mpd.hosts

10.20.xx.xx
10.20.xx.xx

//***************************

// without any I_MPI_FABRICS arguments

/opt/intel/impi/4.1.1.036/intel64/bin/mpirun -n 1 -perhost 1 -f ./mpd.hosts -env I_MPI_DEBUG 2 ./parent

[0] MPI startup(): shm data transfer mode

-->provided: MPI_THREAD_MULTIPLE
-->Initializing MPI environment...
-->Finished initializing MPI environment.
-->Spawning child binary ...
[0] MPI startup(): cannot open dynamic library libdat2.so.2
[0] MPI startup(): cannot open dynamic library libdat2.so
[0] MPI startup(): cannot open dynamic library libdat.so.1
[0] MPI startup(): cannot open dynamic library libdat.so
[0] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): reinitialization: shm and tcp data transfer modes

libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device

-->Finished.
-->fine

//***************************


Hey Robo,

Thanks for getting in touch.  We have a nice article explaining how to enable the TMI fabric for the Intel® MPI Library.  You're welcome to check it out.

When you don't set any value for the I_MPI_FABRICS env variable, it'll try to run over the default shm:dapl fabric.  In your case, that's actually failing as well, as indicated by these messages:

librdmacm: Fatal: unable to open RDMA device

And because Intel MPI doesn't want you to fail, it falls back to running just over regular Ethernet:

[0] MPI startup(): shm and tcp data transfer modes

Can you read through the article and make sure you have everything in place, such as a tmi.conf file?  If it still doesn't work, it would be great to know what network cards you're running, and what software stack is controlling them (e.g. QLogic PSM).
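For reference, fabric selection and fallback can also be pinned explicitly. The sketch below uses the I_MPI_FABRICS and I_MPI_FALLBACK variables from the Intel MPI reference manual; the exact default fallback behavior depends on the release.

```shell
# Pin the fabrics explicitly: shared memory within a node, TMI between
# nodes. When I_MPI_FABRICS is set, fallback to another fabric is
# disabled unless re-enabled with I_MPI_FALLBACK.
export I_MPI_FABRICS=shm:tmi
export I_MPI_FALLBACK=1

# Then launch as before (shown as a comment; requires the cluster):
# mpirun -n 1 -perhost 1 -f ./mpd.hosts -env I_MPI_DEBUG 2 ./parent
echo "I_MPI_FABRICS=$I_MPI_FABRICS I_MPI_FALLBACK=$I_MPI_FALLBACK"
```

With fallback enabled, a failed TMI initialization drops back to tcp instead of aborting, which matches the "fallback fabric is not enabled" message later in this thread.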

Looking forward to hearing back.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Thanks Gergana for your reply. We are using a QLogic "QLE7342-CK dual port IB card". I set up tmi.conf following the article you mentioned. I am not sure why I am getting "librdmacm: Fatal: unable to open RDMA device" and "[0] MPI startup(): cannot open dynamic library libdat2.so.2".

$ ldconfig -p | grep "librdmacm"
librdmacm.so.1 (libc6,x86-64) => /usr/lib64/librdmacm.so.1

$ ldconfig -p | grep libdat
libdat2.so.2 (libc6,x86-64) => /usr/lib64/libdat2.so.2
libdat.so.1 (libc6,x86-64) => /usr/lib64/libdat.so.1

$ which mpirun
/opt/intel/impi/4.1.1.036/intel64/bin/mpirun

$ env | grep I_MPI
I_MPI_ROOT=/opt/intel/impi/4.1.1.036

$ cat /etc/tmi.conf


# TMI provider configuration
#
# format of each line:
# <name> <version> <path/to/library> <string-arguments>
#
# Notice: the string arguments must have at least one character inside
#

psm 1.1 /opt/intel/impi/4.1.1.036/intel64/lib/libtmip_psm.so " " # comments ok
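As a quick sanity check, the provider entry can be parsed and the library path verified. This is a generic shell sketch: it writes a sample file mirroring the entry above to a temp path, so it runs anywhere; on the cluster, point CONF at /etc/tmi.conf instead.

```shell
# Validate the first provider entry in a tmi.conf-style file.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
# TMI provider configuration
psm 1.1 /opt/intel/impi/4.1.1.036/intel64/lib/libtmip_psm.so " "
EOF

# Take the first non-comment, non-blank line and split its fields:
# <name> <version> <path/to/library> <string-arguments>
entry=$(grep -v '^#' "$CONF" | grep -v '^$' | head -n 1)
provider=$(echo "$entry" | awk '{print $1}')
library=$(echo "$entry" | awk '{print $3}')
echo "provider=$provider library=$library"

# On the real system, also confirm the shared object is readable:
# test -r "$library" || echo "TMI provider library missing: $library"
```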

$ cat ~/.bash_profile


# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin

export PATH

export PSM_SHAREDCONTEXTS_MAX=20

I_MPI_ROOT=/opt/intel/impi/4.1.1.036; export I_MPI_ROOT
export PATH=/opt/intel/impi/4.1.1.036/intel64/bin:$PATH

export LD_LIBRARY_PATH=/opt/intel/impi/4.1.1.036/intel64/lib:/usr/local/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/intel/impi/4.1.1.036/lib64:$LD_LIBRARY_PATH

export PATH="$PATH":/bin:/usr/lib:/usr:/usr/local/lib:/usr/lib64

Any feedback on this? When I run my MPI program with the tmi option, it hangs for some time and then errors out. It hangs on the MPI_Comm_spawn call in the code, but when I use the shm:tcp option the program works fine.

$ /opt/intel/impi/4.1.1.036/intel64/bin/mpirun -n 1 -perhost 1 -f ./mpd.hosts -env I_MPI_DEBUG 2 -env I_MPI_FABRICS shm:tmi ./parent

[0] MPI startup(): shm and tmi data transfer modes

-->provided: MPI_THREAD_MULTIPLE
-->Initializing MPI environment...
-->Finished initializing MPI environment.
-->Spawning child binary ...
Fatal error in PMPI_Init_thread: Invalid port, error stack:
MPIR_Init_thread(658)............................:
MPID_Init(320)...................................: spawned process group was unable to connect back to the parent on port <tag#0$epaddr_size#16$epaddr#24000000000000000302240000000000$>
MPID_Comm_connect(206)...........................:
MPIDI_Comm_connect(579)..........................: Named port tag#0$epaddr_size#16$epaddr#24000000000000000302240000000000$ does not exist
MPIDI_Comm_connect(380)..........................:
MPIDI_Create_inter_root_communicator_connect(134):
MPIDI_CH3_Connect_to_root(309)...................:
MPID_nem_tmi_connect_to_root(1043)...............:
(unknown)(): Other MPI error

If I use the shm:dapl option, it is unable to open the RDMA device and I encounter the following issue:

$ /opt/intel/impi/4.1.1.036/intel64/bin/mpirun -n 1 -perhost 1 -f ./mpd.hosts -env I_MPI_DEBUG 2 -env I_MPI_FABRICS shm:dapl ./parent

librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib1
librdmacm: Fatal: unable to open RDMA device
librdmacm: Fatal: unable to open RDMA device
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ipath0-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ipath0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ehca0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-iwarp
librdmacm: Fatal: unable to open RDMA device
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1u
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2u
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-1u
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2u
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-cma-roe-eth2
librdmacm: Fatal: unable to open RDMA device
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-cma-roe-eth3
librdmacm: Fatal: unable to open RDMA device
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-scm-roe-mlx4_0-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-scm-roe-mlx4_0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-cma
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-cma-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-mthca0-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-mthca0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-mlx4_0-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-mlx4_0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-ipath0-1
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-ipath0-2
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-ehca0-2
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-iwarp
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-cma-roe-eth2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-cma-roe-eth3
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-scm-roe-mlx4_0-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: OpenIB-scm-roe-mlx4_0-2
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled

My dat.conf file looks like this:

$ cat /etc/dat.conf


OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
OpenIB-ehca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
OpenIB-cma-roe-eth2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
OpenIB-cma-roe-eth3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth3 0" ""
OpenIB-scm-roe-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-scm-roe-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
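Each provider name that the DAPL startup log cycles through corresponds to the first field of one of these lines, and the fifth field names the shared object that must resolve at run time. A small parsing sketch using one entry from the file above:

```shell
# dat.conf fields: <ia_name> <api_version> <threadsafety> <default>
#                  <library> <provider_version> "<ia_params>" "<platform_params>"
line='ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""'
set -- $line
ia_name=$1
library=$5
echo "ia_name=$ia_name library=$library"

# To see whether that library resolves on the node:
# ldconfig -p | grep "$library"
```

A "DAT: library load failure" message means the library column of the matching entry did not resolve; a "librdmacm: Fatal" message means the library loaded but could not open an RDMA device.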

My tmi.conf file looks like this:

$ cat /etc/tmi.conf


# TMI provider configuration
#
# format of each line:
# <name> <version> <path/to/library> <string-arguments>
#
# Notice: the string arguments must have at least one character inside
#

psm 1.1 libtmip_psm.so " " # comments ok

And my .bash_profile looks like this:

$ cat ~/.bash_profile


# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin

export PATH
export I_MPI_TMI_PROVIDER=psm
export PSM_SHAREDCONTEXTS_MAX=20
export TMI_CONFIG=/etc/tmi.conf
export I_MPI_FABRICS=shm:tmi
DAT_OVERIDE=/etc/dat.conf; export DAT_OVERIDE
I_MPI_ROOT=/opt/intel/impi/4.1.1.036; export I_MPI_ROOT

export PATH=/opt/intel/impi/4.1.1.036/intel64/bin:$PATH

export LD_LIBRARY_PATH=/usr/lib64:/opt/intel/impi/4.1.1.036/intel64/lib:/opt/intel/impi/4.1.1.036/lib64:/usr/local/lib:$LD_LIBRARY_PATH

export PATH="$PATH":/bin:/usr/lib:/usr:/usr/local/lib:/usr/lib64

$ cat mpd.hosts

10.20.xx.xx
