open_hca: device mlx4_0 not found

Praveen k. wrote:

[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[12] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[16] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[26] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[30] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
malaga.ncl.res.in:20e0:3f1f6a20: 2052 us(2052 us):  open_hca: device mlx4_0 not found
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:20e0:3f1f6a20: 2209 us(157 us):  open_hca: device mlx4_0 not found
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
malaga.ncl.res.in:20e7:df6dba20: 1967 us(1967 us):  open_hca: device mlx4_0 not found
[16] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:20e7:df6dba20: 2130 us(163 us):  open_hca: device mlx4_0 not found
[16] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[22] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[4] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[14] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
malaga.ncl.res.in:20ec:99e6aa20: 3857 us(3857 us):  open_hca: device mlx4_0 not found
[26] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:20ee:9340ba20: 3929 us(3929 us):  open_hca: device mlx4_0 not found
[30] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:20ec:99e6aa20: 3972 us(115 us):  open_hca: device mlx4_0 not found
[26] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[10] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[2] MPI startup(): DAPL provider ofa-v2-ib0
[16] MPI startup(): DAPL provider ofa-v2-ib0
malaga.ncl.res.in:20ee:9340ba20: 4095 us(166 us):  open_hca: device mlx4_0 not found
[30] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[26] MPI startup(): DAPL provider ofa-v2-ib0
[30] MPI startup(): DAPL provider ofa-v2-ib0
[18] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[23] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[13] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[7] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[17] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1

Praveen k. wrote:

Can anyone please help me with this problem?

James Tullos (Intel) wrote:

Hi Praveen,

Please attach your /etc/dat.conf file.  What is the output from ibstat?  Please run with I_MPI_DEBUG=2 and send the output.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Praveen k. wrote:

Hi James,

Here are the files and output:

$ ibstat
CA 'qib0'
    CA type: InfiniPath_QLE7340
    Number of ports: 1
    Firmware version:
    Hardware version: 2
    Node GUID: 0x001175000070a728
    System image GUID: 0x001175000070a728
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x0761086a
        Port GUID: 0x001175000070a728
        Link layer: InfiniBand

$ cat /etc/dat.conf

ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
OpenIB-cma-roe-eth2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
OpenIB-cma-roe-eth3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth3 0" ""
OpenIB-scm-roe-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-scm-roe-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""

$ cat matmul.o213

[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[14] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[26] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[4] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[12] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[16] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[24] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
malaga.ncl.res.in:21be:f82aa20: 826 us(826 us):  open_hca: device mlx4_0 not found
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:21be:f82aa20: 902 us(76 us):  open_hca: device mlx4_0 not found
malaga.ncl.res.in:21bf:a6c55a20: 824 us(824 us):  open_hca: device mlx4_0 not found
[4] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:21bf:a6c55a20: 898 us(74 us):  open_hca: device mlx4_0 not found
[4] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
malaga.ncl.res.in:21c0:247eba20: 810 us(810 us):  open_hca: device mlx4_0 not found
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:21c3:971c0a20: 833 us(833 us):  open_hca: device mlx4_0 not found
[12] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:21c3:971c0a20: 909 us(76 us):  open_hca: device mlx4_0 not found
[12] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
malaga.ncl.res.in:21c4:dcb51a20: 1042 us(1042 us):  open_hca: device mlx4_0 not found
[14] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:21c4:dcb51a20: 1118 us(76 us):  open_hca: device mlx4_0 not found
[14] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
malaga.ncl.res.in:21ca:29e59a20: 954 us(954 us):  open_hca: device mlx4_0 not found
[26] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:21ca:29e59a20: 1029 us(75 us):  open_hca: device mlx4_0 not found
[26] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
malaga.ncl.res.in:21c0:247eba20: 887 us(77 us):  open_hca: device mlx4_0 not found
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
malaga.ncl.res.in:21c5:db6f8a20: 826 us(826 us):  open_hca: device mlx4_0 not found
[16] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
malaga.ncl.res.in:21c5:db6f8a20: 904 us(78 us):  open_hca: device mlx4_0 not found
[16] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[22] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
malaga.ncl.res.in:21c9:96faaa20: 812 us(812 us):  open_hca: device mlx4_0 not found

These are the files and output...

James Tullos (Intel) wrote:

Hi Praveen,

Those messages indicate that the DAPL* provider mlx4_0 is not available, but do not indicate why.  Please send the output from ibstat.  Also, what command are you using to run your program?  Setting I_MPI_DEBUG=2 should have given additional output.
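
With 32 ranks the startup messages interleave, so it can help to aggregate them before reading. The following is a minimal sketch, not part of the original reply: the function name `summarize_dapl_log` is hypothetical, and it matches only the two message formats that appear in the output posted in this thread.

```shell
# Summarize an Intel MPI debug log: how often each DAPL provider was tried,
# and which provider the ranks finally reported using.
# Only the two message formats visible in this thread are matched.
summarize_dapl_log() {
    log="$1"
    echo "open attempts per provider:"
    # The provider name is the last whitespace-separated field of the line.
    grep 'trying to open default DAPL provider' "$log" \
        | awk '{ print $NF }' | sort | uniq -c
    echo "providers selected by ranks:"
    grep 'MPI startup(): DAPL provider' "$log" \
        | awk '{ print $NF }' | sort | uniq -c
}
```

Running `summarize_dapl_log matmul.o213` would show at a glance whether every rank ends up on the same provider after the failed attempts.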

James.

Praveen k. wrote:

Hi James,

Thanks for the reply.

I am using this command:

mpiexec.hydra -machinefile ./NODE -np 32 -genv I_MPI_DEBUG=2 ./matmul.bin

$ ibstat
CA 'qib0'
    CA type: InfiniPath_QLE7340
    Number of ports: 1
    Firmware version:
    Hardware version: 2
    Node GUID: 0x001175000070a728
    System image GUID: 0x001175000070a728
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x0761086a
        Port GUID: 0x001175000070a728
        Link layer: InfiniBand

That is the ibstat output.

I have also attached the output file.

Attachment: matmul-output.txt.gz (1.95 MB)

James Tullos (Intel) wrote:

Hi Praveen,

Everything is working as expected.  As I said, the messages are due to the DAPL* provider mlx4_0 not being available.  That is because you are using ib0 instead.  By default, the Intel® MPI Library tries the entries in /etc/dat.conf in order.

I would suggest modifying your /etc/dat.conf file and putting the ofa-v2-ib0 line first, as this is the provider you are using.  I would recommend either commenting out the ofa-v2-mlx4_0-1 and ofa-v2-mlx4_0-2 lines or moving them to the bottom of the file.

You can also set I_MPI_DAPL_PROVIDER=ofa-v2-ib0 and this will skip to the appropriate provider.
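
The reordering suggested above can be sketched as a small shell helper. This is an illustration only: the function name `promote_provider` is hypothetical, and it writes the reordered file to standard output instead of editing /etc/dat.conf in place, so the result can be reviewed first.

```shell
# Print a dat.conf-style file with the chosen provider's line moved to the top.
# Writes to stdout only; the original file is not modified.
promote_provider() {
    provider="$1"
    conf="$2"
    # First pass: the line whose first field is the provider name.
    awk -v p="$provider" '$1 == p' "$conf"
    # Second pass: every other line, in its original order.
    awk -v p="$provider" '$1 != p' "$conf"
}
```

For example, `promote_provider ofa-v2-ib0 /etc/dat.conf > dat.conf.new` produces a candidate file that an administrator could inspect and then install. For the environment-variable route, one option (assuming Hydra's `-genv <name> <value>` form) would be to add `-genv I_MPI_DAPL_PROVIDER ofa-v2-ib0` to the mpiexec.hydra command line.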

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Praveen k. wrote:

Hi James,

Thanks for the support.

It is working perfectly.

Is there any way to check the performance of the InfiniBand link?

James Tullos (Intel) wrote:

Hi Praveen,

You can check directly using

ib_read_bw -d qib0 &
ib_read_bw -d qib0 localhost

Or you can use the Intel® MPI Benchmarks to test MPI performance over the fabric.  A binary is included with the Intel® MPI Library installation, or you can download the source at http://software.intel.com/en-us/articles/intel-mpi-benchmarks/.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Praveen k. wrote:

My team has run some GROMACS jobs.

The performance has dropped.

Earlier, the performance on a single node was 937.9 ns/day.
Today, when I tried again, it was 832.1 ns/day.

So that's more than a 10% difference.

For the multi-node jobs:
Earlier performance: 1401.8 ns/day
(expected: 1876 ns/day, i.e. double the single-node performance)

Today's performance: 1317.2 ns/day

Any idea why the single-node performance has gone down?

(The jobs were run with mpiexec.hydra on both days)
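
The figures quoted above can be put on a common scale with a little arithmetic. The sketch below is an illustration, not something from the original post: the helper names `pct_drop` and `scaling_efficiency` are hypothetical, and the node count of 2 is an assumption taken from the remark that the expected value is double the single-node performance.

```shell
# Percentage drop from an earlier value to a later one.
pct_drop() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Parallel efficiency in percent: measured multi-node throughput divided by
# (nodes x single-node throughput).
scaling_efficiency() {
    awk -v single="$1" -v multi="$2" -v nodes="$3" \
        'BEGIN { printf "%.1f\n", multi / (nodes * single) * 100 }'
}

pct_drop 937.9 832.1              # single node, earlier vs. today
pct_drop 1401.8 1317.2            # multi-node, earlier vs. today
scaling_efficiency 937.9 1401.8 2 # earlier run, assuming 2 nodes
scaling_efficiency 832.1 1317.2 2 # today's run, assuming 2 nodes
```

On these numbers the single-node drop is about 11.3% and the multi-node drop about 6.0%, while the two-node efficiency is roughly 74.7% for the earlier run and 79.1% for today's run. That suggests the slowdown is mostly on the single-node side and not in the interconnect.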

Praveen k. wrote:

How can I fine-tune this setup for better performance?

James Tullos (Intel) wrote:

Hi Praveen,

I don't have a solid answer as to why the performance would be different on a different day.  Was a different job running as well on the lower performing day that could have used up some system resources?  Did anything on the system change?

As to how to improve the performance, we offer quite a few options.  The simplest is the automatic tuner, mpitune.  Please see http://software.intel.com/en-us/articles/increase-cluster-mpi-application-performance-with-a-mpi-tune-up for more information about mpitune.

You can also use the Intel® Trace Analyzer and Collector to help locate MPI bottlenecks.  Go to http://software.intel.com/en-us/intel-trace-analyzer/ for more information.

We have a performance and threading analysis tool, Intel® VTune™ Amplifier XE, which can provide information about hotspots and threading performance problems within your program as well.  Visit http://software.intel.com/en-us/intel-vtune-amplifier-xe/ for more information.

If you decide one or more of these tools can help, look through the articles to find specific usage information, or feel free to ask and I can help point you in the correct direction or answer specific questions.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
