Intel Cluster Studio problem with infiniband

Hi,

On a cluster at my university we have Intel Cluster Studio (2011 I think; I'm not the admin). The distribution is Red Hat, and the OFED drivers are installed.

We have 2 problems:
- the first one is with I_MPI_FABRICS=shm:ofa we get:

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=f95030

[0] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so: cannot open shared object file: No such file or directory
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled


- the second one is with I_MPI_FABRICS=shm:dapl: it seems to work with IMB-MPI1:
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=11ac030

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=1787030

[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1

[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1

[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1

[0] MPI startup(): dapl data transfer mode

[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1

[1] MPI startup(): dapl data transfer mode

[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000

[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000

[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000

[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000

then the benchmark runs.
But with our own programs we get:
[0] dapl fabric is not available and fallback fabric is not enabled

What can I do to understand the problem?

Thanks a lot,
Best regards,
Guillaume

Hi Guillaume,

The version of Intel Cluster Studio is less important than the versions of the individual components. Please send me the output from the following commands:

mpirun -V

icc -V

env | grep I_MPI

For the first problem, check that libibverbs.so is available and correctly linked on each of the nodes. It should be a symlink to libibverbs.so.1.0.0; if not, you should reinstall OFED.
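A quick way to run this check on each node is a small sketch like the following (the /usr/lib64 paths are the typical locations on RHEL x86_64 and are assumptions; adjust to your system):

```shell
# check_devlink <path>: print where a library symlink resolves, or flag it as missing.
check_devlink() {
  if [ -e "$1" ]; then
    printf '%s -> %s\n' "$1" "$(readlink -f "$1")"
  else
    printf 'MISSING: %s\n' "$1"
  fi
}

# OFED/RHEL installs the versioned .so.1.* files; the unversioned .so
# (which the MPI runtime dlopens) may be absent without the -devel package.
check_devlink /usr/lib64/libibverbs.so
check_devlink /usr/lib64/libibverbs.so.1
```

If the first line reports MISSING, creating the symlink by hand (or installing the libibverbs-devel package, which ships it on Red Hat systems) should let load_iblibrary() succeed.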

For the second problem, I'll need some more detail. Is IMB-MPI1 the only program that works with DAPL? What if you recompile the benchmark, will the newly compiled version run? What are the contents of your /etc/dat.conf file?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Studio

Hi,

[13:05:28] denayer@frontend ~ $ mpirun -V

Intel MPI Library for Linux Version 4.0 Update 2

Build 20110330 Platform Intel 64 64-bit applications

Copyright (C) 2003-2011 Intel Corporation. All rights reserved


[13:06:17] denayer@frontend ~ $ icc -V

Intel C Intel 64 Compiler XE for applications running on Intel 64, Version 12.1.0.233 Build 20110811

Copyright (C) 1985-2011 Intel Corporation.  All rights reserved.

[13:06:19] denayer@frontend ~ $ env | grep I_MPI

I_MPI_PIN=0

I_MPI_F77=ifort

I_MPI_FABRICS=shm:dapl

I_MPI_PATH=/appl/intel/impi/4.0.3.008

I_MPI_TUNER_DATA_DIR=/appl/intel/impi/4.0.3.008/etc64/

I_MPI_F90=ifort

I_MPI_CC=icc

I_MPI_CXX=icpc

I_MPI_MPD_RSH=ssh

I_MPI_FC=ifort

I_MPI_ROOT=/appl/intel/impi/4.0.3.008


I don't have any libibverbs.so, just:
/usr/lib64/libibverbs.so.1
/usr/lib64/libibverbs.so.1.0.0
These files come with the package libibverbs-1.1.4-2.el6.x86_64.

At the moment it is the only one. We have 4 in-house programs (which work without problems on other clusters with Intel MPI). These 4 programs do not work on the present cluster.

There is no /etc/dat.conf... I found one under /etc/rdma/dat.conf:
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""

Thanks a lot,

best regards,
Guillaume

Hi,

I have created the link "libibverbs.so -> libibverbs.so.1.0.0" by hand, and the problem with OFA has disappeared. Strange... the official Red Hat packages do not create this link.

Perhaps it is the same problem with DAPL. Which DAPL library does Intel MPI look for?

Should we install the compat-dapl (DAPL 1.2 interface) package?

Thanks for your tip.
Best regards,
Guillaume

Hi Guillaume,

First, there is an odd discrepancy in your MPI versions. I_MPI_ROOT shows that you should be running 4.0 Update 3, but mpirun claims to be 4.0 Update 2. That shouldn't be the cause of any of these problems, but let's try to get it straightened out. What do you get from running

which mpirun

which icc

My guess is that you're getting mpirun from a different location than I_MPI_ROOT. To correct this, make sure /appl/intel/impi/4.0.3.008/bin64/mpivars.sh is sourced after any other scripts that would add an MPI implementation to your path. You might want to log out and log in again, just to clear out any environment variables that could be causing a problem.

It appears that you've solved the problem with the OFA fabric. As long as the missing symlink was the only problem, you should be all set there. If other problems arise, I would recommend reinstalling OFED.

Now, for the DAPL fabric. Please try compiling and running a test program (pick one of the files in /appl/intel/impi/4.0.3.008/test/) with I_MPI_DEBUG=5. Try running with a different provider (for example, I_MPI_DAPL_PROVIDER=ofa-v2-ib0). It is possible (though unlikely) that the dat.conf file is not being found by programs other than the benchmark. Try setting DAT_OVERRIDE=/etc/rdma/dat.conf and see if that helps. Or you could try creating a symlink /etc/dat.conf -> /etc/rdma/dat.conf instead.
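For the DAT_OVERRIDE suggestion, a minimal sketch (the /etc/rdma/dat.conf path is the one found on this cluster; your location may differ):

```shell
# Point uDAPL at the non-standard dat.conf location before launching the job.
export DAT_OVERRIDE=/etc/rdma/dat.conf
echo "DAT_OVERRIDE=$DAT_OVERRIDE"   # prints DAT_OVERRIDE=/etc/rdma/dat.conf

# Alternative (as root, once per node): create the standard-path symlink instead.
#   ln -s /etc/rdma/dat.conf /etc/dat.conf
```

Exporting the variable in the shell (or job script) that launches mpirun is enough, since the DAPL provider library reads it at startup.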

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi James,

[17:00:42] denayer@frontend ~ $ which mpirun

/appl/intel/composer_xe_2011_sp1.6.233/mpirt/bin/intel64/mpirun

[17:00:45] denayer@frontend ~ $ which icc

/appl/intel/composer_xe_2011_sp1.6.233/bin/intel64/icc


The problem is: I'm not the admin or the person who installed Intel MPI; I'm the one who wants to use it :) So I don't know exactly what the admin did...

Should I install compat-dapl ?

Thx for your help.

Guillaume

Hi Guillaume,

You should not need the compat-dapl package. Try changing your .bash_login to have

. /appl/intel/impi/4.0.3.008/intel64/bin/mpivars.sh

after any references to compilervars.sh and that should correct the version mismatch.

Have you tried the test programs?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi,

I have tested with . /appl/intel/impi/4.0.3.008/intel64/bin/mpivars.sh, but the output of env | grep I_MPI is the same.

What is the problem with /appl/intel/impi/4.0.3.008/ and my version of mpirun?

I have not tried the test programs... not yet :)

Thanks for your support!
Guillaume

Grrrr! I have understood part of the problem! The mpirun version mismatch was a problem in my .bashrc... sorry. Now I get:
[17:40:19] denayer@frontend ~ $ mpirun -V
Intel MPI Library for Linux* OS, Version 4.0 Update 3 Build 20110824
Copyright (C) 2003-2011, Intel Corporation. All rights reserved.

Hi Guillaume,

That should avoid any issues with the versions being different. Let me know once you've tried the test programs and we'll go from there.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

It was a little bit more complicated: our administrator had created a script under /etc/profile.d/intel.sh with:
. /appl/intel/bin/compilervars.sh intel64

Why is this line not correct? Is it deprecated?

Thx a lot,
Best regards,

Hi Guillaume,

That line should work just fine. It sets up the paths for the compilers and their libraries. However, it does not set up the correct path for MPI development: it uses a slightly older MPI version (4.0.2 instead of 4.0.3) and only the runtime libraries. The mpivars.sh script sets up the paths for the current MPI version (assuming you use the one from that version's directory, which you are doing) and includes all of the development libraries, rather than just the runtime libraries. You will want to run both of these scripts, but make sure the mpivars.sh script is run after the compilervars.sh script.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi Guillaume,

I need to make a correction to my last statement. Between these two, it should not matter which is run first, as the compilervars.sh script checks for I_MPI_ROOT, and if this variable is set (mpivars.sh sets it), then it will use that to set the paths. If not, then it will use the runtime version by default. So as long as you run the mpivars.sh script, there should be no problem at all, as long as it is not overwritten later by something else.
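Putting the pieces together, a sketch of the resulting setup (paths taken from earlier in this thread; this is a config fragment, not a runnable script):

```shell
# /etc/profile.d/intel.sh (system-wide, created by the admin):
# sets up compiler paths, and uses I_MPI_ROOT for MPI paths if it is already set.
. /appl/intel/bin/compilervars.sh intel64

# ~/.bash_login (per user): sources mpivars.sh so that I_MPI_ROOT and the
# development paths point at Intel MPI 4.0 Update 3.
. /appl/intel/impi/4.0.3.008/intel64/bin/mpivars.sh
```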

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

OK, thanks a lot. Both the DAPL and OFA problems seem solved!
