Able to use fabric dapl but not ofa

I'm able to use I_MPI_FABRICS=dapl but not I_MPI_FABRICS=ofa on my system.

For example, I'm using IMB to test performance with this command:

mpiexec.hydra -genv I_MPI_FABRICS=shm:tcp -n 1 -host bio-xinyi ~/tmp/imb/imb/3.2.4/src/IMB-MPI1 -off_cache 12,64 -npmin 64 -msglog 24:28 -time 10 -mem 1 PingPong Exchange : -n 1 -host mic0 /tmp/IMB-MPI1.mic

When using I_MPI_FABRICS=ofa, it shows:

Max MV2_DEFAULT_MAX_SG_LIST is 0, set to 1
Max MV2_SRQ_SIZE is 0, set to 512
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
Max MV2_DEFAULT_MAX_SG_LIST is 0, set to 1
Max MV2_SRQ_SIZE is 0, set to 512
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

When using I_MPI_FABRICS=dapl, it prints some errors and warnings, but the program runs correctly and fast:

bio-xinyi-mic0:140a:b77f4700: 3436 us(3436 us): open_hca: device mlx4_0 not found
bio-xinyi-mic0:140a:b77f4700: 4278 us(842 us): open_hca: device mlx4_0 not found
DAT: library load failure: libdaplofa.so.2: cannot open shared object file: No such file or directory
DAT: library load failure: libdaplofa.so.2: cannot open shared object file: No such file or directory
bio-xinyi-mic0:140a:b77f4700: 6154 us(1876 us): open_hca: device mthca0 not found
bio-xinyi-mic0:140a:b77f4700: 6849 us(695 us): open_hca: device mthca0 not found
bio-xinyi-mic0:140a:b77f4700: 7515 us(666 us): open_hca: device ipath0 not found
bio-xinyi-mic0:140a:b77f4700: 8223 us(708 us): open_hca: device ipath0 not found
bio-xinyi-mic0:140a:b77f4700: 8925 us(702 us): open_hca: device ehca0 not found
DAT: library load failure: libdaplofa.so.2: cannot open shared object file: No such file or directory
bio-xinyi-mic0:140a:b77f4700: 93 us(93 us): open_hca: device mlx4_0 not found
bio-xinyi-mic0:140a:b77f4700: 337 us(244 us): open_hca: device mlx4_0 not found
bio-xinyi-mic0:140a:b77f4700: 637 us(300 us): open_hca: device mthca0 not found
bio-xinyi-mic0:140a:b77f4700: 873 us(236 us): open_hca: device mthca0 not found
DAT: library load failure: libdaplofa.so.2: cannot open shared object file: No such file or directory
DAT: library load failure: libdaplofa.so.2: cannot open shared object file: No such file or directory
bio-xinyi-mic0:140a:b77f4700: 13650 us(4725 us): open_hca: device mlx4_0 not found
bio-xinyi-mic0:140a:b77f4700: 14402 us(752 us): open_hca: device mlx4_0 not found
bio-xinyi-mic0:140a:b77f4700: 95 us(95 us): open_hca: device mlx4_0 not found
bio-xinyi-mic0:140a:b77f4700: 392 us(297 us): open_hca: device mlx4_0 not found
bio-xinyi:c8e:b1bae700: 43523 us(43523 us): open_hca: device mlx4_0 not found
bio-xinyi:c8e:b1bae700: 43574 us(51 us): open_hca: device mlx4_0 not found
bio-xinyi:c8e:b1bae700: 1023 us(1023 us): open_hca: getaddr_netdev ERROR: No such device. Is ib0 configured?
bio-xinyi:c8e:b1bae700: 1474 us(451 us): open_hca: getaddr_netdev ERROR: No such device. Is ib1 configured?
bio-xinyi:c8e:b1bae700: 55106 us(11532 us): open_hca: device mthca0 not found
bio-xinyi:c8e:b1bae700: 55133 us(27 us): open_hca: device mthca0 not found
bio-xinyi:c8e:b1bae700: 55161 us(28 us): open_hca: device ipath0 not found
bio-xinyi:c8e:b1bae700: 55185 us(24 us): open_hca: device ipath0 not found
bio-xinyi:c8e:b1bae700: 55211 us(26 us): open_hca: device ehca0 not found
bio-xinyi:c8e:b1bae700: 1671 us(197 us): open_hca: rdma_bind ERR No such file or directory. Is eth2 configured?
bio-xinyi:c8e:b1bae700: 35 us(35 us): open_hca: device mlx4_0 not found
bio-xinyi:c8e:b1bae700: 61 us(26 us): open_hca: device mlx4_0 not found
bio-xinyi:c8e:b1bae700: 81 us(20 us): open_hca: device mthca0 not found
bio-xinyi:c8e:b1bae700: 99 us(18 us): open_hca: device mthca0 not found
bio-xinyi:c8e:b1bae700: 10752 us(9081 us): open_hca: rdma_bind ERR No such file or directory. Is eth2 configured?
bio-xinyi:c8e:b1bae700: 11216 us(464 us): open_hca: getaddr_netdev ERROR: No such device. Is eth3 configured?
bio-xinyi:c8e:b1bae700: 64833 us(9622 us): open_hca: device mlx4_0 not found
bio-xinyi:c8e:b1bae700: 64863 us(30 us): open_hca: device mlx4_0 not found
bio-xinyi:c8e:b1bae700: 12 us(12 us): open_hca: device mlx4_0 not found
bio-xinyi:c8e:b1bae700: 38 us(26 us): open_hca: device mlx4_0 not found

I've installed OFED-1.5.4.1 and mic-ofed (RPMs rebuilt for my kernel) according to the installation instructions for MPSS version 5889.

 

Hi Ruibang,

Let me take a look at your problem and I will get back to you. Thank you.

Hi Ruibang,

The issue with the 'ofa' fabric is probably caused by these two messages:

Max MV2_DEFAULT_MAX_SG_LIST is 0, set to 1
Max MV2_SRQ_SIZE is 0, set to 512

I'm checking with one of the OFED developers to see what those environment variables control and what they should be set to.  Let me ask you this: have you changed any environment settings after installing OFED on your host and Xeon Phi cards?

Also, it'll be great to know what the contents of your /etc/dat.conf file look like, as well as the output of the ibstat command.

I assume you've verified IB runs correctly on the machine?  The output you provide from the 'dapl' run actually contains messages about some of the IB devices not being available.  So I want to make sure that running over 'dapl' actually used IB and it did not fall back on using sockets.  Perhaps you can also provide me with the full output of your run over 'dapl'.  Feel free to send that to my e-mail address below (without the underscores '_').

Regards,
~Gergana

Thanks in advance loc-nguyen.

Thanks in advance for your help.

I haven't changed the MV2_DEFAULT_MAX_SG_LIST and MV2_SRQ_SIZE variables manually.

My machine runs CentOS 6.3 with an updated kernel; running "uname -a" gives:

Linux bio-xinyi 2.6.32-279.5.2.el6.x86_64 #1 SMP Fri Aug 24 01:07:11 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

The dat.conf is attached below.

Running ibstat prints nothing, but running ibv_stat gives:

device node GUID
------ ----------------
scif0 000000ffff000000

and running ibv_devinfo gives:

hca_id: scif0
transport: iWARP (1)
fw_ver: 0.0.1
node_guid: 0000:00ff:ff00:0000
sys_image_guid: 0000:00ff:ff00:0000
vendor_id: 0x8086
vendor_part_id: 0
hw_ver: 0x1
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1000
port_lmc: 0x00
link_layer: IB

ibstatus gives:

Infiniband device 'scif0' port 1 status:
default gid: fe80:0000:0000:0000:4e79:baff:fe2c:0519
base lid: 0x3e8
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: Ethernet

While installing OFED-1.5.4.1, I installed all packages except ib-kernel{,-devel}, comp-dapl and dapl, according to Intel's v.5889 installation guide (all RPMs rebuilt). I then installed all the RPMs in the ofed folder that comes with the MPSS stack (also rebuilt from the src folder). Most of the services started correctly, including openibd, mpss and ofed-mic, but opensmd failed to start.

$ service opensmd start
Starting IB Subnet Manager...... [FAILED]

I can confirm that dapl is working (even with the bunch of error messages), since it provides up to 6.5G/s bandwidth and 12G/s duplex. With tcp I get no error messages, but the maximum is only ~450M/s.

dat.conf:

---

# DAT v2.0, v1.2 configuration file
#
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
# <provider_version> <ia_params> <platform_params>
#
# For uDAPL cma provder, <ia_params> is one of the following:
# network address, network hostname, or netdev name and 0 for port
#
# For uDAPL scm provider, <ia_params> is device name and port
# For uDAPL ucm provider, <ia_params> is device name and port
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL RoCE provider, <ia_params> is device name and 0
#
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-2 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""

---
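The `open_hca: device ... not found` lines in the dapl run most likely come from DAPL trying dat.conf entries for devices this machine does not have (ibv_devinfo above lists only scif0). As a rough sketch, and not an Intel or OFED tool, a small helper can list the entries whose device is absent. The function name `missing_dat_devices` is made up for illustration, and it assumes the device is the first word of `<ia_params>`, as in the file above. It does not check whether the provider library (for example libdaplofa.so.2) can be loaded.

```shell
# Hypothetical helper: print "<ia_name> <device>" for every dat.conf entry
# whose device (first word of <ia_params>) is not found in any of the
# directories passed after the config file.
missing_dat_devices() {
    conf="$1"; shift
    # Skip comments and lines that are too short to be provider entries.
    awk '$1 ~ /^#/ || NF < 7 { next }
         { dev = $7; gsub(/"/, "", dev); print $1, dev }' "$conf" |
    while read -r name dev; do
        found=no
        for dir in "$@"; do
            if [ -e "$dir/$dev" ]; then
                found=yes
            fi
        done
        [ "$found" = yes ] || printf '%s %s\n' "$name" "$dev"
    done
}

# Usage (IB devices live under /sys/class/infiniband, netdevs such as ib0
# or eth2 under /sys/class/net):
#   missing_dat_devices /etc/dat.conf /sys/class/infiniband /sys/class/net
```

Entries that show up in the output are candidates for commenting out in dat.conf, or for skipping by naming a working provider explicitly.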

 

Hi there,

We were getting these error messages as well:

Max MV2_DEFAULT_MAX_SG_LIST is 0, set to 1
Max MV2_SRQ_SIZE is 0, set to 512

We found that the problem could be solved by adding the variable below to our environment. We noticed that if we brought the virtual scif0 interface down on a MIC-containing node, the error messages went away and the ofa fabric worked. Our guess is that, with the ofa fabric selected, MPI was trying to connect to scif0 instead of the physical adapter. Setting this variable fixed the problem.

export I_MPI_OFA_ADAPTER_NAME=mlx4_0
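If the adapter name differs between nodes, the same idea can be scripted. This is only a sketch under the assumption that the virtual interface is always named scif* and that the first remaining entry under /sys/class/infiniband is the adapter you want; `pick_ofa_adapter` is a made-up name, not part of Intel MPI.

```shell
# Hypothetical helper: print the first RDMA device in a sysfs directory whose
# name does not start with "scif" (the virtual SCIF interface).
# The directory defaults to /sys/class/infiniband; pass another one for testing.
pick_ofa_adapter() {
    dir="${1:-/sys/class/infiniband}"
    for dev in "$dir"/*; do
        [ -e "$dev" ] || continue      # directory empty or missing
        name="${dev##*/}"
        case "$name" in
            scif*) continue ;;         # skip the virtual interface
        esac
        printf '%s\n' "$name"
        return 0
    done
    return 1                           # no physical adapter found
}

# Usage:
#   I_MPI_OFA_ADAPTER_NAME="$(pick_ofa_adapter)" && export I_MPI_OFA_ADAPTER_NAME
```

On a node with several physical adapters this picks whichever sorts first, so setting the name by hand is still the safer choice there.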

Jason,

Thanks for feeding this back into the community.

Regards
---
Taylor

 

Hello,

I followed this thread to try to solve my issue, but unfortunately I was not able to resolve it.

Neither DAPL nor OFA works for me.

Software Versions:

  • MLNX_OFED_LINUX-2.3-1.0.1-rhel6.5-x86_64
  • Intel parallel cluster 2015
  • Intel MPSS 3.4.3
  • Mellanox ConnectX-3 InfiniBand adapter

With OFA:

export I_MPI_MIC=1
export I_MPI_FABRICS=shm:ofa
export I_MPI_DEVICE=rdssm
export I_MPI_OFA_ADAPTER_NAME=mlx4_0
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u ,ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto
 

Error Messages: [export I_MPI_DEBUG=2]

[42] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[19] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[43] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[26] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[27] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

Error Messages: [export I_MPI_DEBUG=100]

[0] MPI startup(): Intel(R) MPI Library, Version 5.0 Update 1  Build 20140709
[0] MPI startup(): Copyright (C) 2003-2014 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[1] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[2] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[5] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[9] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[10] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[3] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[4] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[6] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[7] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[8] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[11] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[9] MPI startup(): Found 2 IB devices
[10] MPI startup(): Found 2 IB devices
[6] MPI startup(): Found 2 IB devices
[8] MPI startup(): Found 2 IB devices
[7] MPI startup(): Found 2 IB devices
[11] MPI startup(): Found 2 IB devices
[0] MPI startup(): Found 2 IB devices
[1] MPI startup(): Found 2 IB devices
[3] MPI startup(): Found 2 IB devices
[2] MPI startup(): Found 2 IB devices
[4] MPI startup(): Found 2 IB devices
[5] MPI startup(): Found 2 IB devices
[10] MPI startup(): Open 0 IB device: mlx4_0
[6] MPI startup(): Open 0 IB device: mlx4_0
[9] MPI startup(): Open 0 IB device: mlx4_0
[8] MPI startup(): Open 0 IB device: mlx4_0
[5] MPI startup(): Open 0 IB device: mlx4_0
[7] MPI startup(): Open 0 IB device: mlx4_0
[3] MPI startup(): Open 0 IB device: mlx4_0
[1] MPI startup(): Open 0 IB device: mlx4_0
[4] MPI startup(): Open 0 IB device: mlx4_0
[0] MPI startup(): Open 0 IB device: mlx4_0
[11] MPI startup(): Open 0 IB device: mlx4_0
[42] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[2] MPI startup(): Open 0 IB device: mlx4_0
[36] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[31] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[37] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[37] MPI startup(): Found 0 IB devices
[31] MPI startup(): Found 0 IB devices
[38] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[38] MPI startup(): Found 0 IB devices
[40] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[40] MPI startup(): Found 0 IB devices
[20] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[33] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[33] MPI startup(): Found 0 IB devices
[20] MPI startup(): Found 0 IB devices
[17] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[43] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[43] MPI startup(): Found 0 IB devices
[13] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[25] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[25] MPI startup(): Found 0 IB devices
[27] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[30] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[30] MPI startup(): Found 0 IB devices
[17] MPI startup(): Found 0 IB devices
[23] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[23] MPI startup(): Found 0 IB devices
[27] MPI startup(): Found 0 IB devices
[12] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[12] MPI startup(): Found 0 IB devices
[13] MPI startup(): Found 0 IB devices
[29] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[29] MPI startup(): Found 0 IB devices
[15] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[15] MPI startup(): Found 0 IB devices
[35] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[35] MPI startup(): Found 0 IB devices
[36] MPI startup(): Found 0 IB devices
[39] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[39] MPI startup(): Found 0 IB devices
[22] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[22] MPI startup(): Found 0 IB devices
[41] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[41] MPI startup(): Found 0 IB devices
[42] MPI startup(): Found 0 IB devices
[24] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[24] MPI startup(): Found 0 IB devices
[26] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[26] MPI startup(): Found 0 IB devices
[14] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[14] MPI startup(): Found 0 IB devices
[16] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[16] MPI startup(): Found 0 IB devices
[18] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[18] MPI startup(): Found 0 IB devices
[28] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[28] MPI startup(): Found 0 IB devices
[32] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[32] MPI startup(): Found 0 IB devices
[19] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[19] MPI startup(): Found 0 IB devices
[36] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[34] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[34] MPI startup(): Found 0 IB devices
[31] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[37] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[38] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[21] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[21] MPI startup(): Found 0 IB devices
[30] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[20] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[39] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[13] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[33] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[25] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[40] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[17] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[28] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[23] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[41] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[29] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[42] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[12] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[24] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[43] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[32] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[27] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[14] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[15] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[34] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[21] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[35] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[16] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[22] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[26] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[18] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[19] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): Start 1 ports per adapter
[11] MPI startup(): Start 1 ports per adapter
[0] MPI startup(): Start 1 ports per adapter
[2] MPI startup(): Start 1 ports per adapter
[5] MPI startup(): Start 1 ports per adapter
[3] MPI startup(): Start 1 ports per adapter
[1] MPI startup(): Start 1 ports per adapter
[7] MPI startup(): Start 1 ports per adapter
[8] MPI startup(): Start 1 ports per adapter
[6] MPI startup(): Start 1 ports per adapter
[9] MPI startup(): Start 1 ports per adapter
[4] MPI startup(): Start 1 ports per adapter
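In the log above, ranks 0 to 11 report "Found 2 IB devices" and open mlx4_0, while the ranks that abort are the ones reporting "Found 0 IB devices". With interleaved output that is hard to see at a glance, so here is a small sketch that tallies it. `summarize_ib_devices` is a made-up name, and it assumes the log line format shown above.

```shell
# Hypothetical helper: read I_MPI_DEBUG output on stdin and print, for each
# value N seen in "Found N IB devices", how many ranks reported it.
summarize_ib_devices() {
    awk '/MPI startup\(\): Found [0-9]+ IB devices/ {
             count[$5]++               # $5 is the device count N
         }
         END {
             for (n in count) printf "found=%s ranks=%d\n", n, count[n]
         }' | sort
}

# Usage, assuming the debug output was saved to a file named run.log:
#   summarize_ib_devices < run.log
```

A non-zero count for `found=0` points at the nodes (here presumably the MIC cards) where the verbs layer sees no device at all.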

  • While installing MPSS and starting the openibd service, I noticed that "Setting up InfiniBand network interfaces" does not report OK:

[root@tbx-node07 MLNX_OFED_LINUX-2.3-1.0.1-rhel6.5-x86_64]# service openibd start
Loading HCA driver and Access Layer:                       [  OK  ]
Setting up InfiniBand network interfaces:
No configuration found for ib0
Setting up service network . . .                           [  done  ]

[root@node07 ~]# ibv_devinfo [-On host]
Failed to query device propshca_id:     mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.32.5100
        node_guid:                      f452:1403:006a:9050
        sys_image_guid:                 f452:1403:006a:9053
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x1
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               3
                        port_lmc:               0x00
                        link_layer:             InfiniBand

[On one of the MIC]

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.32.5100
        node_guid:                      f452:1403:006a:9050
        sys_image_guid:                 f452:1403:006a:9053
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               3
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: scif0
        transport:                      SCIF (2)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe57:02a8
        sys_image_guid:                 4c79:baff:fe57:02a8
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1001
                        port_lmc:               0x00
                        link_layer:             SCI

  • On host

[root@node07 ~]# ls /sys/class/infiniband
mlx4_0  scif0

  • On mic

[root@node07 ~]# ssh mic0 ls /sys/class/infiniband
mlx4_0
scif0

  • I_MPI_ROOT is set to /opt/intel/impi/5.0.1.035.

 

My setup has 4 MIC cards in a server with 2 processors. Could you please help me get ofa and dapl working with the Intel MICs?

Please let me know if you need any additional information.

 

Hi Elena,

Could you please check the status of the ofed-mic service on the host side? It seems it isn't running.
See the Intel MPSS User Guide for details about its usage.

Regarding correct OFA/DAPL settings for Intel MPI Library:

OFA:
export I_MPI_MIC=1
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_ADAPTER_NAME=mlx4_0

DAPL:
export I_MPI_MIC=1
export I_MPI_FABRICS=shm:dapl

I_MPI_DEVICE is an obsolete analogue of I_MPI_FABRICS.
There should be one DAPL provider in I_MPI_DAPL_PROVIDER variable. For multiple DAPL providers there's I_MPI_DAPL_PROVIDER_LIST. See the Intel MPI Library Reference Manual for details about its usage.
Try to run without I_MPI_PIN_* variables.

Moreover you may need mpxyd (CCL-proxy) service for DAPL - see the Intel MPSS User Guide or OFED documentation for details.
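Before launching, it can also help to confirm that each provider name you pass actually matches an entry in dat.conf, since a stray space or typo in the list gives a name that no entry matches. The following is a rough sketch, not part of Intel MPI; `check_dapl_providers` is a made-up name and the default path /etc/dat.conf is an assumption taken from earlier in this thread.

```shell
# Hypothetical helper: check that every name in a comma-separated provider
# list appears as an <ia_name> (first field) in the given dat.conf file.
# Prints each name that is not found and returns non-zero if any is missing.
check_dapl_providers() {
    list="$1"
    conf="${2:-/etc/dat.conf}"
    status=0
    old_ifs="$IFS"; IFS=','            # split the list on commas only
    for name in $list; do
        if ! awk -v n="$name" '$1 == n { ok = 1 } END { exit !ok }' "$conf"; then
            printf 'not in %s: "%s"\n' "$conf" "$name"
            status=1
        fi
    done
    IFS="$old_ifs"
    return "$status"
}

# Usage:
#   check_dapl_providers "$I_MPI_DAPL_PROVIDER_LIST" /etc/dat.conf
```

Because the split is on commas only, a name such as "ofa-v2-mlx4_0-1u " with a trailing space is reported as missing, which makes that kind of mistake visible.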
 

[root@tbx-node05 conus2.5km]# service openibd status

  HCA driver loaded

Configured IPoIB devices:
ib0

Currently active IPoIB devices:

The following OFED modules are loaded:

  rdma_ucm
  rdma_cm
  ib_addr
  ib_ipoib
  mlx4_core
  mlx4_ib
  mlx4_en
  mlx5_core
  mlx5_ib
  ib_uverbs
  ib_umad
  ib_ucm
  ib_sa
  ib_cm
  ib_mad
  ib_core

[root@node05 conus2.5km]# service opensmd status
opensm (pid 8086) is running...
[root@node05 conus2.5km]# service mpxyd status
mpxyd (pid  7884) is running...

[root@node05 conus2.5km]# service ofed-mic status
Status of OFED Stack:
host                                                       [  OK  ]
mic0                                                       [  OK  ]
mic1                                                       [  OK  ]
mic2                                                       [  OK  ]
mic3                                                       [  OK  ]

WITH DAPL:
export I_MPI_MIC=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0

Error messages:

[7] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[11] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[5] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[6] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[8] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[2] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[4] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[10] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[9] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[1] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[3] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[0] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[20] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[30] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[22] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[28] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[29] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[15] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[21] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[12] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[18] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[31] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[16] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[34] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[14] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[35] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[13] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[32] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[17] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[33] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[19] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[40] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[37] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[25] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[41] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[36] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[38] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[39] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[42] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[24] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[43] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[23] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[26] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
[27] DAPL startup(): trying to open first DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-mlx4_0-1u
tbx-node05-mic2:UCM:1619:8ea879c0: 176 us(176 us):  open_hca: ibv_get_device_list() failed
tbx-node05-mic0:UCM:161f:6e0f9c0: 180 us(180 us):  open_hca: ibv_get_device_list() failed
tbx-node05-mic2:UCM:1617:270249c0: 177 us(177 us):  open_hca: ibv_get_device_list() failed
tbx-node05-mic2:UCM:1615:fcfa29c0: 176 us(176 us):  open_hca: ibv_get_device_list() failed
[30] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[15] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[15] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[30] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[28] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[29] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[28] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05-mic1:UCM:1614:590469c0: 204 us(204 us):  open_hca: ibv_get_device_list() failed
[29] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05-mic0:UCM:161d:b9b039c0: 263 us(263 us):  open_hca: ibv_get_device_list() failed
[20] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
tbx-node05-mic1:UCM:1617:31ec9c0: 205 us(205 us):  open_hca: ibv_get_device_list() failed
[7] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[7] DAPL startup(): trying to open secondary (2) DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-scif0
tbx-node05-mic1:UCM:1618:942999c0: 246 us(246 us):  open_hca: ibv_get_device_list() failed
[22] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[21] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
tbx-node05-mic2:UCM:1613:14d099c0: 255 us(255 us):  open_hca: ibv_get_device_list() failed
[20] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[12] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
tbx-node05-mic2:UCM:1618:74d3c9c0: 176 us(176 us):  open_hca: ibv_get_device_list() failed
[12] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[21] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[22] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05-mic1:UCM:1613:fedfa9c0: 208 us(208 us):  open_hca: ibv_get_device_list() failed
tbx-node05-mic2:UCM:1616:2ce609c0: 173 us(173 us):  open_hca: ibv_get_device_list() failed
tbx-node05:SCM:20f8:adebdd20: 23 us(23 us):  open_hca: device scif0 not found
[7] DAPL startup(): failed to open DAPL provider ofa-v2-scif0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
[25] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[31] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
tbx-node05-mic0:UCM:1621:e08969c0: 179 us(179 us):  open_hca: ibv_get_device_list() failed
[25] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[18] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
tbx-node05-mic0:UCM:161e:59bd49c0: 173 us(173 us):  open_hca: ibv_get_device_list() failed
tbx-node05-mic2:UCM:161a:62fee9c0: 175 us(175 us):  open_hca: ibv_get_device_list() failed
[32] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
tbx-node05-mic2:UCM:1614:1159c0: 179 us(179 us):  open_hca: ibv_get_device_list() failed
[33] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[18] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05-mic1:UCM:1616:9638d9c0: 181 us(181 us):  open_hca: ibv_get_device_list() failed
[4] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
[34] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[4] DAPL startup(): trying to open secondary (2) DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-scif0
[31] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05:SCM:20f5:adebdd20: 22 us(22 us):  open_hca: device scif0 not found
[4] DAPL startup(): failed to open DAPL provider ofa-v2-scif0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
tbx-node05-mic0:UCM:161a:63dff9c0: 174 us(174 us):  open_hca: ibv_get_device_list() failed
[16] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1u
tbx-node05-mic0:UCM:161c:6a0139c0: 167 us(167 us):  open_hca: ibv_get_device_list() failed
[17] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[14] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[2] DAPL startup(): trying to open secondary (2) DAPL provider from I_MPI_DAPL_PROVIDER_LIST: ofa-v2-scif0
[14] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[24] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[34] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05:SCM:20f3:adebdd20: 20 us(20 us):  open_hca: device scif0 not found
[16] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05-mic1:UCM:161a:9dcb49c0: 173 us(173 us):  open_hca: ibv_get_device_list() failed
[26] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[35] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[2] DAPL startup(): failed to open DAPL provider ofa-v2-scif0
tbx-node05-mic1:UCM:1615:3a0819c0: 211 us(211 us):  open_hca: ibv_get_device_list() failed
[32] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 765: 0
internal ABORT - process 0
[24] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[33] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[17] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05-mic1:UCM:1619:7c3af9c0: 205 us(205 us):  open_hca: ibv_get_device_list() failed
[35] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
tbx-node05-mic0:UCM:1620:28cc39c0: 193 us(193 us):  open_hca: ibv_get_device_list() failed
[27] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
tbx-node05-mic0:UCM:161b:a7ce39c0: 258 us(258 us):  open_hca: ibv_get_device_list() failed
[26] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[13] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[27] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[19] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[23] DAPL startup(): failed to open DAPL provider ofa-v2-mlx4_0-1u
[23] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[13] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
[19] MPI startup(): dapl fabric is not available and fallback fabric is not enabled

With OFA:
export I_MPI_MIC=1
export I_MPI_FABRICS=shm:ofa
export I_MPI_OFA_ADAPTER_NAME=mlx4_0
 
[0] MPI startup(): Multi-threaded optimized library
[28] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[29] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[12] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[30] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[36] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[31] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

All the services are up and running, but I still get the same error messages for both OFA and DAPL.

Hi Elena,

Have you installed Mellanox* OFED according to the Intel MPSS User Guide (the chapter "Steps to Install Intel MPSS using Mellanox* OFED 2.1/2.2/2.3")?

Hello Artem,

Yes. I followed the steps given in the MPSS User Guide to install Mellanox OFED.

Thanks,

Elena

Hi Elena,

Unfortunately, I'm unable to reproduce your issue with a similar configuration.

According to the debug information you provided (I_MPI_DEBUG=100), it looks like no IB devices are visible on some nodes (possibly the MIC cards):
[37] MPI startup(): Found 0 IB devices
[31] MPI startup(): Found 0 IB devices

But you wrote that the ofed-mic service was up, and the ibv_devinfo output you provided looks correct.
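To see exactly which ranks report no IB devices, the saved debug output can be filtered. This is only a rough sketch: it assumes the output was captured to a file, and the function name ranks_without_ib is made up for illustration. The message format is the one in the two lines quoted above.

```shell
# Print the MPI ranks that reported zero IB devices at startup.
# Reads saved I_MPI_DEBUG output on stdin; ranks_without_ib is a hypothetical helper name.
ranks_without_ib() {
    # Keep only "[N] MPI startup(): Found 0 IB devices" lines and print N,
    # sorted numerically with duplicates removed.
    sed -n 's/^\[\([0-9][0-9]*\)] MPI startup(): Found 0 IB devices[[:space:]]*$/\1/p' | sort -n -u
}
```

For example, `ranks_without_ib < debug.log` (debug.log being whatever file the output was saved to) prints one rank per line, which can then be matched against the hosts those ranks ran on.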

Could you please simplify your test scenario to something like the following (IMB-MPI1 is in the Intel MPI installation):
export I_MPI_MIC=1
export I_MPI_MIC_PREFIX=$I_MPI_ROOT/mic/bin/
export I_MPI_FABRICS=shm:dapl   # or shm:ofa
mpirun -ppn 1 -n 2 -hosts node,node-mic0 IMB-MPI1 pingpong

Try different MIC nodes, and check that the MIC hostnames you specify refer to valid MIC cards (the ones where you checked ibv_devinfo).
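The hostname check could be scripted roughly as follows. This is a sketch under assumptions, not a verified procedure: it assumes passwordless ssh to each card, that ibv_devinfo is on the PATH there, and that ibv_devinfo exits non-zero when it finds no device. The function name check_ib_devices, the REMOTE_SHELL override, and the hostnames in the example call are placeholders.

```shell
# Report, per host, whether ibv_devinfo succeeds there.
# REMOTE_SHELL is a hypothetical override for the remote command (defaults to ssh).
check_ib_devices() {
    for host in "$@"; do
        # stdin is redirected so the remote command cannot swallow input meant for the caller
        if "${REMOTE_SHELL:-ssh}" "$host" ibv_devinfo </dev/null >/dev/null 2>&1; then
            echo "$host: IB device visible"
        else
            echo "$host: ibv_devinfo failed"
        fi
    done
}

# Example call with placeholder hostnames:
# check_ib_devices node node-mic0 node-mic1
```

Any host reported as failed is worth checking by hand before it goes into the -hosts list of the mpirun line.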
