MPI 4.1.0.024

Hi

I've got Intel MPI v4.1.0.024 installed. If I run a "Hello world" test program on one node with 12 CPUs, it works, but if I run it on two nodes (24 CPUs), it fails and I get this message:

[mpiexec@red0044] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:221): assert (!closed) failed
[mpiexec@red0044] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to send SIGUSR1 downstream
[mpiexec@red0044] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@red0044] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@red0044] main (./ui/mpich/mpiexec.c:745): process manager error waiting for completion

My qsub script is below:

module load intel/mpi/4.1.0.024
nprocs=`wc -l $PBS_NODEFILE | awk '{ print $1 }'`
echo $PBS_NODEFILE
mpirun -n $nprocs  ./test > output_file
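For context, a minimal sketch of what that nprocs line computes, using a mock nodefile with hypothetical node names: PBS writes $PBS_NODEFILE with one line per allocated CPU slot, so counting its lines yields the total MPI process count.

```shell
# Sketch of the nprocs calculation against a mock nodefile (node names are
# made up). PBS writes one line per allocated CPU slot, so two 12-core
# nodes produce 24 lines.
PBS_NODEFILE=$(mktemp)
for i in $(seq 1 12); do echo red0044; done >> "$PBS_NODEFILE"
for i in $(seq 1 12); do echo red0045; done >> "$PBS_NODEFILE"

# Same result as the `wc -l ... | awk` pipeline in the qsub script:
nprocs=$(wc -l < "$PBS_NODEFILE")
echo "$nprocs"    # 24

rm -f "$PBS_NODEFILE"
```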

---------------------------------------

The module I am loading.

setenv              MPIROOT /local/software/rh53/intel/mpi/4.1.0
prepend-path     PATH /local/software/rh53/intel/mpi/4.1.0/bin64
prepend-path     LD_LIBRARY_PATH /local/software/rh53/intel/mpi/4.1.0/lib64

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Could you please point out if I am doing something wrong?

Regards,

Max

Hey Max,

Thanks for posting.  Do you have InfiniBand (IB) set up properly on your cluster?  I have a feeling that's where this is breaking down.

Let's try it first over regular TCP/IP and circumvent IB for now.  Can you try re-running this way:

mpirun -genv I_MPI_FABRICS shm:tcp -n $nprocs  ./test > output_file
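(For reference, a sketch of the equivalent using an environment variable instead of the -genv flag; I_MPI_FABRICS selects the intra-node and inter-node fabric — here shared memory within a node and TCP/IP between nodes, bypassing the DAPL/IB path.)

```shell
# Equivalent to passing "-genv I_MPI_FABRICS shm:tcp" on the mpirun line:
# export the variable in the job script before calling mpirun.
# shm = shared memory within a node, tcp = TCP/IP between nodes.
export I_MPI_FABRICS=shm:tcp
echo "$I_MPI_FABRICS"
```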

If that does work, it's very likely your IB setup is the culprit.  Can you provide details on which software stack you're running (OFED, or something else), which versions, what networking cards you have, and, if possible, the contents of your /etc/dat.conf file?

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Hi,

Yes, we do have IB.

And I did try -genv I_MPI_FABRICS shm:tcp, but I am still getting this in my output:

[mpiexec@red0077] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:221): assert (!closed) failed
[mpiexec@red0077] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to send SIGUSR1 downstream
[mpiexec@red0077] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@red0077] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@red0077] main (./ui/mpich/mpiexec.c:745): process manager error waiting for completion

Our /etc/dat.conf file is below:

ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""

Network

  • IO & inter-node communication are via an InfiniBand network. Management functions are controlled via a GigE network. The InfiniBand network is composed of groups of 32 nodes connected by DDR links to a 48-port QDR leaf switch. The leaf switches then have 4 trunked QDR connections to 4 QDR 48-port core switches (giving 4 redundant pathways for extra bandwidth and resilience).

Hi Gergana

I figured out the problem. We had MPI 2011 installed on our cluster and mpirun worked without any problem, so I was looking for the reason why MPI 2013 does not work. I came across the fact that the "rsh" protocol was changed to ssh in the new MPI 2013. Since communication on our cluster is via rsh, MPI did not work.

If I run my test MPI program using [ mpirun -bootstrap rsh ... ], then I don't have any errors and the program works.

My question is: is it possible to install MPI so that it will use rsh by default?

Thank you.

Hey Max,

You're correct.  We changed the Intel MPI connectivity default from rsh to ssh back in version 4.0 Update 1 (which actually came out at the end of 2010).  You probably had a previous version of the Cluster Toolkit/Studio, which contained an older Intel MPI.

Since we ship pre-built libraries and binaries, there are no build settings at install time that define the shell connectivity.  But we do have an environment variable corresponding to the "-bootstrap rsh" runtime option that you can set globally on your cluster.  Just do:

$ export I_MPI_HYDRA_BOOTSTRAP=rsh

and that will use rsh for all your mpirun jobs.
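To make this stick cluster-wide, one option (a sketch, assuming you'd add it to the Tcl modulefile quoted earlier) is to set the variable in the Intel MPI modulefile itself, so every user who loads the module gets rsh by default:

```
setenv              MPIROOT /local/software/rh53/intel/mpi/4.1.0
prepend-path     PATH /local/software/rh53/intel/mpi/4.1.0/bin64
prepend-path     LD_LIBRARY_PATH /local/software/rh53/intel/mpi/4.1.0/lib64
# added: make mpirun use rsh for process launching by default
setenv              I_MPI_HYDRA_BOOTSTRAP rsh
```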

Let me know if this helps.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Hi Gergana

Yes, it did help and solved the problem.

Thank you for your help and time.

Regards,

Max

Excellent!  Let us know if we can help with anything else.  And enjoy the library :)

~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com
