Intel MPI fails with InfiniBand on new nodes of our cluster (Got FATAL event 3)


Guillaume De Nayer

Hi,

We got new nodes on our cluster. On the 12 old nodes, Intel MPI (Intel Cluster Studio 2010, 2011, 2012) works without any problems. The 4 new nodes have exactly the same OS as the 12 old ones and the same installation (node image). The hardware is the same too.

We have I_MPI_FABRICS=shm:ofa

If I start mpirun on the 12 old nodes, it works without problems.
If I try to start a parallel job that includes one of the new nodes, I get:

send desc error [1] Abort: Got FATAL event 3 at line 861 in file ../../ofa_utility.c 

If I start a local job on one of the new nodes, it works.
So the problem is linked to InfiniBand.

Strange, because a run with Open MPI over InfiniBand works with the new nodes.

If I use I_MPI_FABRICS=shm:dapl with the new nodes, it works.

Any ideas?

Best regards,
Guillaume
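
Editor's note: the comparison described in this post (shm:ofa fails, shm:dapl works) can be reproduced side by side with a small helper that prints one mpirun command line per fabric. This is only a sketch. The function name fabric_test_cmd, the old-node name n01, and the program ./mpi_app are hypothetical; the mpirun options are the ones used elsewhere in this thread. The program has to be a real MPI application, since a plain command such as hostname never initializes the fabric.

```shell
# Print (do not run) an mpirun command that places one rank on each of two
# hosts and forces a specific fabric.
#   $1 = fabric, e.g. shm:ofa or shm:dapl
#   $2 = first host, $3 = second host
#   $4 = MPI program to launch (hypothetical placeholder, e.g. ./mpi_app)
fabric_test_cmd() {
    printf '%s\n' "mpirun -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS $1 -n 1 -host $2 $4 : -n 1 -host $3 $4"
}
```

For example, `fabric_test_cmd shm:ofa n01 n13 ./mpi_app` and `fabric_test_cmd shm:dapl n01 n13 ./mpi_app` print the two commands to paste into a terminal, so that the I_MPI_DEBUG output of the failing and the working fabric can be compared.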

James Tullos (Intel)

Hi Guillaume,

What happens if you try to run with a non-parallel command?

mpirun -genv I_MPI_FABRICS shm:ofa -n 1 -host old_node hostname : -n 1 -host new_node hostname

Also, on the parallel job, what is the output with -verbose and I_MPI_DEBUG=5?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Guillaume De Nayer

Sorry for the delay...

I tried your test. It does not work:
- under PBS/Torque I get: "-host (or -ghost) and -machinefile are incompatible"

- in a terminal I get:
mpiexec: unable to start all procs; may have invalid machine names
remaining specified hosts:
192.168.0.13 (n13.blabla)
192.168.0.14 (n14.blabla)

It does that on all the nodes, but the machine names are correct, so I don't understand.

Best regards

James Tullos (Intel)

Hi Guillaume,

The first error message is likely due to a lack of tight integration with Torque*. Could you please send me the output from running the same command with -verbose added? Are you able to ssh from an old node to a new node, or the reverse?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Guillaume De Nayer

Hi,

- You mean --verbose, don't you? Here is the output with --verbose, run directly on n13:

[16:20:41] denayer@n13 ~ $ mpirun --verbose -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm -n 1 -host n13 hostname : -n 1 -host n14 hostname

WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.
running mpdallexit on n13
LAUNCHED mpd on n13  via
RUNNING: mpd on n13
mpiexec: unable to start all procs; may have invalid machine names
    remaining specified hosts:
        192.168.0.14 (n14.marvin)

Here is the output with --verbose from master:

[16:20:17] denayer@master ~ $ mpirun --verbose -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm -n 1 -host n13 hostname : -n 1 -host n14 hostname

WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.
running mpdallexit on master
LAUNCHED mpd on master  via
RUNNING: mpd on master
mpiexec: unable to start all procs; may have invalid machine names
    remaining specified hosts:
        192.168.0.13 (n13.marvin)
        192.168.0.14 (n14.marvin)

- For your ssh question:
from master to n13: ok.
from master to n14: ok.
from n13 to master: ok
from n14 to master: ok
from n13 to n14: ok
from n14 to n13: ok
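
Editor's note: the pairwise ssh check listed above can be scripted so that it is easy to repeat for all 16 nodes. The following is a minimal sketch, not something posted in the thread. The function names host_pairs and check_ssh_matrix are hypothetical, and the sketch assumes passwordless ssh from the machine running it to every host and from every host to every other host (for example via keys in a shared home directory).

```shell
# Print every ordered pair "src dst" of distinct hosts, one pair per line.
host_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            if [ "$src" != "$dst" ]; then
                echo "$src $dst"
            fi
        done
    done
}

# For each ordered pair, log in to src and from there try to reach dst.
# BatchMode=yes makes ssh fail instead of prompting for a password;
# -n keeps the outer ssh from consuming the pair list on stdin.
check_ssh_matrix() {
    host_pairs "$@" | while read -r src dst; do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$src" \
               ssh -o BatchMode=yes -o ConnectTimeout=5 "$dst" true; then
            echo "from $src to $dst: ok"
        else
            echo "from $src to $dst: FAILED"
        fi
    done
}
```

Calling `check_ssh_matrix master n13 n14` would walk through the same six combinations as the list above.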

Regards

James Tullos (Intel)

Hi Guillaume,

What is the value of I_MPI_PROCESS_MANAGER? Which version of the Intel MPI Library are you using?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Guillaume De Nayer

We have 3 different ones:
Intel Cluster Toolkit 2010
Intel Cluster Studio 2011
Intel Cluster Studio 2012.

The errors above are with Intel Cluster Studio 2011:

[13:44:33] denayer@master ~ $ mpirun -version
Intel MPI Library for Linux Version 4.0 Update 1
Build 20100910 Platform Intel 64 64-bit applications
Copyright (C) 2003-2010 Intel Corporation. All rights reserved

I_MPI_PROCESS_MANAGER has no value in my shell.

Regards
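
Editor's note: a quick way to report whether a variable such as I_MPI_PROCESS_MANAGER is really exported to child processes (and therefore visible to mpirun) is sketched below. The helper name show_var is hypothetical; printenv is the standard coreutils command and returns a non-zero status when the variable is not in the environment.

```shell
# Print NAME=value if NAME is exported, or NAME=(unset) otherwise.
# A shell variable that was assigned but never exported counts as unset here,
# which matches what a launched mpirun process would see.
show_var() {
    if value=$(printenv "$1"); then
        echo "$1=$value"
    else
        echo "$1=(unset)"
    fi
}
```

Running `show_var I_MPI_PROCESS_MANAGER` and `show_var I_MPI_FABRICS` on the master and on a compute node shows whether the two environments differ.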

James Tullos (Intel)

Hi Guillaume,

What happens with Intel Cluster Studio 2012 (which contains Intel MPI Library 4.0 Update 3)?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Guillaume De Nayer

With Intel Cluster Studio 2012:
[15:52:56] denayer@master ~ $ mpirun -version
Intel MPI Library for Linux* OS, Version 4.0 Update 3 Build 20110824
Copyright (C) 2003-2011, Intel Corporation. All rights reserved.

Your command works:

[15:53:48] denayer@master ~ $ mpirun -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm -n 1 -host n13 hostname : -n 1 -host n14 hostname

n14

n13


With --verbose:

[15:52:58] denayer@master ~ $ mpirun --verbose -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm -n 1 -host n13 hostname : -n 1 -host n14 hostname
==================================================================================================

mpiexec options:

----------------

  Base path: /opt/intel/ics_2012/impi/4.0.3.008/intel64/bin/

  Bootstrap server: ssh

  Debug level: 1

  Enable X: -1
  Global environment:

  -------------------

    I_MPI_PERHOST=allcores

    MODULE_VERSION_STACK=3.2.5

    MKLROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl

    MANPATH=/opt/intel/ics_2012/itac/8.0.3.007/man:/opt/intel/ics_2012/impi/4.0.3.008/man:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/man/en_US:/opt/intel/ics_2012/vtune_amplifier_xe_2011/man:/opt/modules/Modules/default/share/man:/opt/pbs/man:/opt/env-switcher/man:/usr/man:/usr/share/man:/usr/local/man:/usr/local/share/man:/usr/X11R6/man:/opt/c3-4/man

    HOSTNAME=master

    VT_MPI=impi4

    I_MPI_PIN=0

    INTEL_LICENSE_FILE=/opt/intel/licenses

    IPPROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp

    I_MPI_F77=ifort

    SHELL=/bin/bash

    TERM=xterm

    HISTSIZE=200000

    I_MPI_FABRICS=shm:dapl

    SSH_CLIENT=139.11.215.121 5290 22

    LIBRARY_PATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/../compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21

    CVSROOT=:ext:fhpout@laplace.lstm.uni-erlangen.de:/data/linux/proj_tape/LSTM/fhpdev

    MODULE_SHELL=sh

    FPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include

    SSH_TTY=/dev/pts/5

    USER=denayer

    MODULE_OSCAR_USER=denayer

    LD_LIBRARY_PATH=/opt/intel/ics_2012/itac/8.0.3.007/itac/slib_impi4:/opt/intel/ics_2012/impi/4.0.3.008/intel64/lib:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mpirt/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/../compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21:/home/denayer/FSI_new/FSI/Software/carat20/libraries/rlog-1.4/lib/:/home/denayer/FSI_new/FSI/Software/carat20/libraries/atlas/lib/:/opt/maui/lib:/opt/tecplot/tec360_2010/lib

    LS_COLORS=no=00:fi=00:di=01;35:ln=01;36:pi=40;33:so=01;33:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:

    ENV=/home/denayer/.bashrc

    CPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/include

    TMOUT=36000

    MSM_PRODUCT=MSM

    NLSPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/debugger/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64/locale/en_US

    PATH=/opt/intel/ics_2012/itac/8.0.3.007/bin:/opt/intel/ics_2012/impi/4.0.3.008/intel64/bin:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/bin/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/opt/intel/ics_2012/vtune_amplifier_xe_2011/bin64:/usr/kerberos/bin:/opt/maui/bin:/opt/tecplot/tec360_2010/bin:/usr/local/bin:/bin:/usr/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/env-switcher/bin:/opt/ansys_inc/shared_files/licensing/lic_admin:/opt/ansys_inc/v130/icemcfd/linux64_amd/bin:/opt/ansys_inc/v130/Framework/bin/Linux64:/opt/ansys_inc/v130/CFX/bin:/opt/c3-4/:/home/denayer/bin:.:/opt/gid/gid_9:/opt/matlab/r2011a/bin

    MAIL=/var/spool/mail/denayer

    MODULE_VERSION=3.2.5

    VT_ADD_LIBS=-ldwarf -lelf -lvtunwind -lnsl -lm -ldl -lpthread

    I_MPI_TUNER_DATA_DIR=/opt/intel/ics_2012/impi/4.0.3.008/etc64/

    TBBROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb

    PWD=/home/denayer

    _LMFILES_=/opt/modules/oscar-modulefiles/torque-oscar/2.1.10:/opt/env-switcher/share/env-switcher/ansys/ansys-13.0:/opt/env-switcher/share/env-switcher/tecplot/tec360-2010:/opt/modules/oscar-modulefiles/switcher/1.0.13:/opt/modules/oscar-modulefiles/default-manpath/1.0.1:/opt/modules/oscar-modulefiles/maui/3.2.6:/opt/modules/modulefiles/oscar-modules/1.0.5:/opt/modules/Modules/3.2.5/modulefiles/dot:/opt/env-switcher/share/env-switcher/tools/intel-vtune-2011:/opt/env-switcher/share/env-switcher/gid/gid-9.0.6:/opt/env-switcher/share/env-switcher/matlab/matlab-r2011a:/opt/env-switcher/share/env-switcher/compiler/intel-compiler-12.1:/opt/env-switcher/share/env-switcher/mpi/intel-cluster-toolkit-2012.0.032

    CARAT_LIC_PATH=/home/denayer/FSI_new/FSI/Software/carat20/exe

    EDITOR=/usr/bin/emacs

    LANG=en_US.UTF-8

    MODULEPATH=/opt/env-switcher/share/env-switcher:/opt/modules/oscar-modulefiles:/opt/modules/version:/opt/modules/Modules/$MODULE_VERSION/modulefiles:/opt/modules/modulefiles:

    LOADEDMODULES=torque-oscar/2.1.10:ansys/ansys-13.0:tecplot/tec360-2010:switcher/1.0.13:default-manpath/1.0.1:maui/3.2.6:oscar-modules/1.0.5:dot:tools/intel-vtune-2011:gid/gid-9.0.6:matlab/matlab-r2011a:compiler/intel-compiler-12.1:mpi/intel-cluster-toolkit-2012.0.032

    VT_LIB_DIR=/opt/intel/ics_2012/itac/8.0.3.007/itac/lib_impi4

    I_MPI_F90=ifort

    MPIROOTDIR=/opt/intel/impi/4.0.1/intel64/lib

    I_MPI_CC=icc

    VT_ROOT=/opt/intel/ics_2012/itac/8.0.3.007

    SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass

    HOME=/home/denayer

    SHLVL=2

    I_MPI_HYDRA_BOOTSTRAP_EXEC=ssh

    I_MPI_CXX=icpc

    I_MPI_MPD_RSH=ssh

    MSM_HOME=/usr/local/MegaRAID Storage Manager

    FHPSYSTEM=INTEL64

    VT_SLIB_DIR=/opt/intel/ics_2012/itac/8.0.3.007/itac/slib_impi4

    I_MPI_FC=ifort

    LOGNAME=denayer

    CVS_RSH=ssh

    SSH_CONNECTION=139.11.215.121 5290 139.11.215.117 22

    CLASSPATH=/opt/intel/ics_2012/itac/8.0.3.007/itac/lib_impi4

    MODULESHOME=/opt/modules/Modules/3.2.5

    CPRO_PATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233

    LESSOPEN=|/usr/bin/lesspipe.sh %s

    CVSEDITOR=emacs

    FHPTARGET=parallel

    INCLUDE=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/include

    G_BROKEN_FILENAMES=1

    I_MPI_ROOT=/opt/intel/ics_2012/impi/4.0.3.008

    _=/opt/intel/ics_2012/impi/4.0.3.008/intel64/bin/mpiexec.hydra
  User set environment:

  ---------------------

    I_MPI_DEBUG=5

    I_MPI_FABRICS=shm
    Proxy information:

    *********************

      Proxy ID:  1

      -----------------

        Proxy name: n13

        Process count: 1

        Start PID: 0
        Proxy exec list:

        ....................

          Exec: hostname; Process count: 1

      Proxy ID:  2

      -----------------

        Proxy name: n14

        Process count: 1

        Start PID: 1
        Proxy exec list:

        ....................

          Exec: hostname; Process count: 1
==================================================================================================
[mpiexec@master] Timeout set to -1 (-1 means infinite)

[mpiexec@master] Got a control port string of master:47174
Proxy launch args: /opt/intel/ics_2012/impi/4.0.3.008/intel64/bin/pmi_proxy --control-port master:47174 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --bootstrap ssh --bootstrap-exec ssh --demux poll --pgid 0 --enable-stdin 1 --proxy-id
[mpiexec@master] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1

Arguments being passed to proxy 0:

--version 1.3 --interface-env-name MPICH_INTERFACE_HOSTNAME --hostname n13 --global-core-count 2 --global-process-count 2 --auto-cleanup 1 --pmi-rank -1 --pmi-kvsname kvs_21039_0 --pmi-process-mapping (vector,(0,2,1)) --binding mode=off --bindlib ipl --ckpoint-num -1 --global-inherited-env 70 'I_MPI_PERHOST=allcores' 'MODULE_VERSION_STACK=3.2.5' 'MKLROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl' 'MANPATH=/opt/intel/ics_2012/itac/8.0.3.007/man:/opt/intel/ics_2012/impi/4.0.3.008/man:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/man/en_US:/opt/intel/ics_2012/vtune_amplifier_xe_2011/man:/opt/modules/Modules/default/share/man:/opt/pbs/man:/opt/env-switcher/man:/usr/man:/usr/share/man:/usr/local/man:/usr/local/share/man:/usr/X11R6/man:/opt/c3-4/man' 'HOSTNAME=master' 'VT_MPI=impi4' 'I_MPI_PIN=0' 'INTEL_LICENSE_FILE=/opt/intel/licenses' 'IPPROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp' 'I_MPI_F77=ifort' 'SHELL=/bin/bash' 'TERM=xterm' 'HISTSIZE=200000' 'I_MPI_FABRICS=shm:dapl' 'SSH_CLIENT=139.11.215.121 5290 22' 'LIBRARY_PATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/../compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21' 'CVSROOT=:ext:fhpout@laplace.lstm.uni-erlangen.de:/data/linux/proj_tape/LSTM/fhpdev' 'MODULE_SHELL=sh' 'FPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include' 'SSH_TTY=/dev/pts/5' 'USER=denayer' 'MODULE_OSCAR_USER=denayer' 
'LD_LIBRARY_PATH=/opt/intel/ics_2012/itac/8.0.3.007/itac/slib_impi4:/opt/intel/ics_2012/impi/4.0.3.008/intel64/lib:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mpirt/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/../compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21:/home/denayer/FSI_new/FSI/Software/carat20/libraries/rlog-1.4/lib/:/home/denayer/FSI_new/FSI/Software/carat20/libraries/atlas/lib/:/opt/maui/lib:/opt/tecplot/tec360_2010/lib' 'LS_COLORS=no=00:fi=00:di=01;35:ln=01;36:pi=40;33:so=01;33:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:' 'ENV=/home/denayer/.bashrc' 'CPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/include' 'TMOUT=36000' 'MSM_PRODUCT=MSM' 'NLSPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/debugger/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64/locale/en_US' 
'PATH=/opt/intel/ics_2012/itac/8.0.3.007/bin:/opt/intel/ics_2012/impi/4.0.3.008/intel64/bin:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/bin/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/opt/intel/ics_2012/vtune_amplifier_xe_2011/bin64:/usr/kerberos/bin:/opt/maui/bin:/opt/tecplot/tec360_2010/bin:/usr/local/bin:/bin:/usr/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/env-switcher/bin:/opt/ansys_inc/shared_files/licensing/lic_admin:/opt/ansys_inc/v130/icemcfd/linux64_amd/bin:/opt/ansys_inc/v130/Framework/bin/Linux64:/opt/ansys_inc/v130/CFX/bin:/opt/c3-4/:/home/denayer/bin:.:/opt/gid/gid_9:/opt/matlab/r2011a/bin' 'MAIL=/var/spool/mail/denayer' 'MODULE_VERSION=3.2.5' 'VT_ADD_LIBS=-ldwarf -lelf -lvtunwind -lnsl -lm -ldl -lpthread' 'I_MPI_TUNER_DATA_DIR=/opt/intel/ics_2012/impi/4.0.3.008/etc64/' 'TBBROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb' 'PWD=/home/denayer' '_LMFILES_=/opt/modules/oscar-modulefiles/torque-oscar/2.1.10:/opt/env-switcher/share/env-switcher/ansys/ansys-13.0:/opt/env-switcher/share/env-switcher/tecplot/tec360-2010:/opt/modules/oscar-modulefiles/switcher/1.0.13:/opt/modules/oscar-modulefiles/default-manpath/1.0.1:/opt/modules/oscar-modulefiles/maui/3.2.6:/opt/modules/modulefiles/oscar-modules/1.0.5:/opt/modules/Modules/3.2.5/modulefiles/dot:/opt/env-switcher/share/env-switcher/tools/intel-vtune-2011:/opt/env-switcher/share/env-switcher/gid/gid-9.0.6:/opt/env-switcher/share/env-switcher/matlab/matlab-r2011a:/opt/env-switcher/share/env-switcher/compiler/intel-compiler-12.1:/opt/env-switcher/share/env-switcher/mpi/intel-cluster-toolkit-2012.0.032' 'CARAT_LIC_PATH=/home/denayer/FSI_new/FSI/Software/carat20/exe' 'EDITOR=/usr/bin/emacs' 'LANG=en_US.UTF-8' 'MODULEPATH=/opt/env-switcher/share/env-switcher:/opt/modules/oscar-modulefiles:/opt/modules/version:/opt/modules/Modules/$MODULE_VERSION/modulefiles:/opt/modules/modulefiles:' 
'LOADEDMODULES=torque-oscar/2.1.10:ansys/ansys-13.0:tecplot/tec360-2010:switcher/1.0.13:default-manpath/1.0.1:maui/3.2.6:oscar-modules/1.0.5:dot:tools/intel-vtune-2011:gid/gid-9.0.6:matlab/matlab-r2011a:compiler/intel-compiler-12.1:mpi/intel-cluster-toolkit-2012.0.032' 'VT_LIB_DIR=/opt/intel/ics_2012/itac/8.0.3.007/itac/lib_impi4' 'I_MPI_F90=ifort' 'MPIROOTDIR=/opt/intel/impi/4.0.1/intel64/lib' 'I_MPI_CC=icc' 'VT_ROOT=/opt/intel/ics_2012/itac/8.0.3.007' 'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 'HOME=/home/denayer' 'SHLVL=2' 'I_MPI_HYDRA_BOOTSTRAP_EXEC=ssh' 'I_MPI_CXX=icpc' 'I_MPI_MPD_RSH=ssh' 'MSM_HOME=/usr/local/MegaRAID Storage Manager' 'FHPSYSTEM=INTEL64' 'VT_SLIB_DIR=/opt/intel/ics_2012/itac/8.0.3.007/itac/slib_impi4' 'I_MPI_FC=ifort' 'LOGNAME=denayer' 'CVS_RSH=ssh' 'SSH_CONNECTION=139.11.215.121 5290 139.11.215.117 22' 'CLASSPATH=/opt/intel/ics_2012/itac/8.0.3.007/itac/lib_impi4' 'MODULESHOME=/opt/modules/Modules/3.2.5' 'CPRO_PATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233' 'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'CVSEDITOR=emacs' 'FHPTARGET=parallel' 'INCLUDE=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/include' 'G_BROKEN_FILENAMES=1' 'I_MPI_ROOT=/opt/intel/ics_2012/impi/4.0.3.008' '_=/opt/intel/ics_2012/impi/4.0.3.008/intel64/bin/mpiexec.hydra' --global-user-env 2 'I_MPI_DEBUG=5' 'I_MPI_FABRICS=shm' --global-system-env 0 --start-pid 0 --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /home/denayer --exec-args 1 hostname
[mpiexec@master] PMI FD: (null); PMI PORT: (null); PMI ID/RANK: -1

Arguments being passed to proxy 1:

--version 1.3 --interface-env-name MPICH_INTERFACE_HOSTNAME --hostname n14 --global-core-count 2 --global-process-count 2 --auto-cleanup 1 --pmi-rank -1 --pmi-kvsname kvs_21039_0 --pmi-process-mapping (vector,(0,2,1)) --binding mode=off --bindlib ipl --ckpoint-num -1 --global-inherited-env 70 'I_MPI_PERHOST=allcores' 'MODULE_VERSION_STACK=3.2.5' 'MKLROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl' 'MANPATH=/opt/intel/ics_2012/itac/8.0.3.007/man:/opt/intel/ics_2012/impi/4.0.3.008/man:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/man/en_US:/opt/intel/ics_2012/vtune_amplifier_xe_2011/man:/opt/modules/Modules/default/share/man:/opt/pbs/man:/opt/env-switcher/man:/usr/man:/usr/share/man:/usr/local/man:/usr/local/share/man:/usr/X11R6/man:/opt/c3-4/man' 'HOSTNAME=master' 'VT_MPI=impi4' 'I_MPI_PIN=0' 'INTEL_LICENSE_FILE=/opt/intel/licenses' 'IPPROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp' 'I_MPI_F77=ifort' 'SHELL=/bin/bash' 'TERM=xterm' 'HISTSIZE=200000' 'I_MPI_FABRICS=shm:dapl' 'SSH_CLIENT=139.11.215.121 5290 22' 'LIBRARY_PATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/../compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21' 'CVSROOT=:ext:fhpout@laplace.lstm.uni-erlangen.de:/data/linux/proj_tape/LSTM/fhpdev' 'MODULE_SHELL=sh' 'FPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include' 'SSH_TTY=/dev/pts/5' 'USER=denayer' 'MODULE_OSCAR_USER=denayer' 
'LD_LIBRARY_PATH=/opt/intel/ics_2012/itac/8.0.3.007/itac/slib_impi4:/opt/intel/ics_2012/impi/4.0.3.008/intel64/lib:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mpirt/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/../compiler/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21:/home/denayer/FSI_new/FSI/Software/carat20/libraries/rlog-1.4/lib/:/home/denayer/FSI_new/FSI/Software/carat20/libraries/atlas/lib/:/opt/maui/lib:/opt/tecplot/tec360_2010/lib' 'LS_COLORS=no=00:fi=00:di=01;35:ln=01;36:pi=40;33:so=01;33:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:' 'ENV=/home/denayer/.bashrc' 'CPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb/include' 'TMOUT=36000' 'MSM_PRODUCT=MSM' 'NLSPATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/debugger/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/compiler/lib/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/lib/intel64/locale/en_US:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/lib/intel64/locale/en_US' 
'PATH=/opt/intel/ics_2012/itac/8.0.3.007/bin:/opt/intel/ics_2012/impi/4.0.3.008/intel64/bin:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/bin/intel64:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/opt/intel/ics_2012/vtune_amplifier_xe_2011/bin64:/usr/kerberos/bin:/opt/maui/bin:/opt/tecplot/tec360_2010/bin:/usr/local/bin:/bin:/usr/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/env-switcher/bin:/opt/ansys_inc/shared_files/licensing/lic_admin:/opt/ansys_inc/v130/icemcfd/linux64_amd/bin:/opt/ansys_inc/v130/Framework/bin/Linux64:/opt/ansys_inc/v130/CFX/bin:/opt/c3-4/:/home/denayer/bin:.:/opt/gid/gid_9:/opt/matlab/r2011a/bin' 'MAIL=/var/spool/mail/denayer' 'MODULE_VERSION=3.2.5' 'VT_ADD_LIBS=-ldwarf -lelf -lvtunwind -lnsl -lm -ldl -lpthread' 'I_MPI_TUNER_DATA_DIR=/opt/intel/ics_2012/impi/4.0.3.008/etc64/' 'TBBROOT=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/tbb' 'PWD=/home/denayer' '_LMFILES_=/opt/modules/oscar-modulefiles/torque-oscar/2.1.10:/opt/env-switcher/share/env-switcher/ansys/ansys-13.0:/opt/env-switcher/share/env-switcher/tecplot/tec360-2010:/opt/modules/oscar-modulefiles/switcher/1.0.13:/opt/modules/oscar-modulefiles/default-manpath/1.0.1:/opt/modules/oscar-modulefiles/maui/3.2.6:/opt/modules/modulefiles/oscar-modules/1.0.5:/opt/modules/Modules/3.2.5/modulefiles/dot:/opt/env-switcher/share/env-switcher/tools/intel-vtune-2011:/opt/env-switcher/share/env-switcher/gid/gid-9.0.6:/opt/env-switcher/share/env-switcher/matlab/matlab-r2011a:/opt/env-switcher/share/env-switcher/compiler/intel-compiler-12.1:/opt/env-switcher/share/env-switcher/mpi/intel-cluster-toolkit-2012.0.032' 'CARAT_LIC_PATH=/home/denayer/FSI_new/FSI/Software/carat20/exe' 'EDITOR=/usr/bin/emacs' 'LANG=en_US.UTF-8' 'MODULEPATH=/opt/env-switcher/share/env-switcher:/opt/modules/oscar-modulefiles:/opt/modules/version:/opt/modules/Modules/$MODULE_VERSION/modulefiles:/opt/modules/modulefiles:' 
'LOADEDMODULES=torque-oscar/2.1.10:ansys/ansys-13.0:tecplot/tec360-2010:switcher/1.0.13:default-manpath/1.0.1:maui/3.2.6:oscar-modules/1.0.5:dot:tools/intel-vtune-2011:gid/gid-9.0.6:matlab/matlab-r2011a:compiler/intel-compiler-12.1:mpi/intel-cluster-toolkit-2012.0.032' 'VT_LIB_DIR=/opt/intel/ics_2012/itac/8.0.3.007/itac/lib_impi4' 'I_MPI_F90=ifort' 'MPIROOTDIR=/opt/intel/impi/4.0.1/intel64/lib' 'I_MPI_CC=icc' 'VT_ROOT=/opt/intel/ics_2012/itac/8.0.3.007' 'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 'HOME=/home/denayer' 'SHLVL=2' 'I_MPI_HYDRA_BOOTSTRAP_EXEC=ssh' 'I_MPI_CXX=icpc' 'I_MPI_MPD_RSH=ssh' 'MSM_HOME=/usr/local/MegaRAID Storage Manager' 'FHPSYSTEM=INTEL64' 'VT_SLIB_DIR=/opt/intel/ics_2012/itac/8.0.3.007/itac/slib_impi4' 'I_MPI_FC=ifort' 'LOGNAME=denayer' 'CVS_RSH=ssh' 'SSH_CONNECTION=139.11.215.121 5290 139.11.215.117 22' 'CLASSPATH=/opt/intel/ics_2012/itac/8.0.3.007/itac/lib_impi4' 'MODULESHOME=/opt/modules/Modules/3.2.5' 'CPRO_PATH=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233' 'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'CVSEDITOR=emacs' 'FHPTARGET=parallel' 'INCLUDE=/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/mkl/include:/opt/intel/ics_2012/composer_xe_2011_sp1.6.233/ipp/include' 'G_BROKEN_FILENAMES=1' 'I_MPI_ROOT=/opt/intel/ics_2012/impi/4.0.3.008' '_=/opt/intel/ics_2012/impi/4.0.3.008/intel64/bin/mpiexec.hydra' --global-user-env 2 'I_MPI_DEBUG=5' 'I_MPI_FABRICS=shm' --global-system-env 0 --start-pid 1 --proxy-core-count 1 --exec --exec-appnum 1 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /home/denayer --exec-args 1 hostname
[mpiexec@master] Launch arguments: ssh -x -q n13 /opt/intel/ics_2012/impi/4.0.3.008/intel64/bin/pmi_proxy --control-port master:47174 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --bootstrap ssh --bootstrap-exec ssh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0

[mpiexec@master] Launch arguments: ssh -x -q n14 /opt/intel/ics_2012/impi/4.0.3.008/intel64/bin/pmi_proxy --control-port master:47174 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --bootstrap ssh --bootstrap-exec ssh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 1

[mpiexec@master] STDIN will be redirected to 1 fd(s): 7

[proxy:0:0@n13] Start PMI_proxy 0

[proxy:0:0@n13] STDIN will be redirected to 1 fd(s): 7

[proxy:0:1@n14] Start PMI_proxy 1

[proxy:0:0@n13] got crush from 4, 0

n13

[proxy:0:1@n14] got crush from 4, 0

n14


I also ran the tests with -genv I_MPI_FABRICS shm:ofa, and that works too.

Do you see anything useful for solving our original problem?

Thanks a lot

James Tullos (Intel)

Hi Guillaume,

Do you have a systemwide mpd.hosts file? Make certain it contains the old nodes and the new nodes.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Guillaume De Nayer

No, there is no mpd.hosts file. Both find and locate return no entries.

Where is this file normally located?

Regards

James Tullos (Intel)

Hi Guillaume,

Generally, there wouldn't be one; I just wanted to make certain that there wasn't one. Back to the original error: did you get that error with all versions of the Intel MPI Library?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
