problem to start mpd ring

This is the information I got:

yukai@hc-abs:/home_sas/yukai => mpdboot -d -v -r ssh -f mpd.hosts -n 7
debug: starting
running mpdallexit on hc-abs
LAUNCHED mpd on hc-abs via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=hc-abs -e -d -s 7
debug: mpd on hc-abs on port 40529
RUNNING: mpd on hc-abs
debug: info for running mpd: {'ip': '', 'ncpus': 1, 'list_port': 40529, 'entry_port': '', 'host': 'hc-abs', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on n10 via hc-abs
debug: launch cmd= ssh -x -n n10 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.160 --ncpus=1 --myhost=n10 --myip=192.168.0.160 -e -d -s 7
LAUNCHED mpd on n11 via hc-abs
debug: launch cmd= ssh -x -n n11 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.161 --ncpus=1 --myhost=n11 --myip=192.168.0.161 -e -d -s 7
LAUNCHED mpd on n12 via hc-abs
debug: launch cmd= ssh -x -n n12 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.162 --ncpus=1 --myhost=n12 --myip=192.168.0.162 -e -d -s 7
LAUNCHED mpd on n13 via hc-abs
debug: launch cmd= ssh -x -n n13 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.163 --ncpus=1 --myhost=n13 --myip=192.168.0.163 -e -d -s 7
debug: mpd on n10 on port 32896
mpdboot_hc-abs (handle_mpd_output 886): failed to ping mpd on n10; received output={}

I am sure ssh works perfectly (passwordless).
mpd.hosts:
n10
n11
n12
n13
n14
n15
n16

mpirun works fine for each node.

yukai@hc-abs:/home_sas/yukai => cpuinfo
Intel Xeon Processor (Intel64 Dunnington)
===== Processor composition =====
Processors(CPUs) : 16
Packages(sockets) : 4
Cores per package : 4
Threads per core : 1
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 0 1
2 0 0 2
3 0 0 3
4 0 2 0
5 0 2 1
6 0 2 2
7 0 2 3
8 0 1 0
9 0 1 1
10 0 1 2
11 0 1 3
12 0 3 0
13 0 3 1
14 0 3 2
15 0 3 3
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,2,1,3 0,4,8,12
1 0,2,1,3 1,5,9,13
2 0,2,1,3 2,6,10,14
3 0,2,1,3 3,7,11,15
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 3 MB (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L3 8 MB (0,4,8,12)(1,5,9,13)(2,6,10,14)(3,7,11,15)

/etc/hosts looks fine.
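Since a stale hosts entry is a common cause of "failed to ping mpd" errors, one quick sanity check is to confirm that every name in mpd.hosts actually resolves. A minimal sketch (`resolve_all` is a hypothetical helper of mine, not part of Intel MPI):

```python
# Minimal sketch: check that each host in mpd.hosts resolves.
# A stale /etc/hosts or DNS entry is a common cause of mpdboot
# "failed to ping mpd" failures. resolve_all() is a hypothetical helper.
import socket

def resolve_all(hosts):
    """Map each hostname to its resolved IP, or None if resolution fails."""
    results = {}
    for h in hosts:
        try:
            results[h] = socket.gethostbyname(h)
        except socket.gaierror:
            results[h] = None
    return results

# e.g.: with open("mpd.hosts") as f: print(resolve_all(f.read().split()))
```

Any host that maps to None, or to an address different from the one in the mpdboot debug output, is worth investigating first.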

Any help and suggestions will be greatly appreciated!


Problem solved!

Now I have a question about mpd.hosts.

The ring doesn't work if I don't put the head node in the first line.

Now the question is how I can avoid this, because I don't want to use the head node (I'd like to leave it for other system programs).

Either by starting the ring without the head node, or by submitting jobs only to specified nodes in the ring?

Hi,

Have you tried the '-nolocal' option?

Regards!
Dmitry

Hi Tamuer,

Could you please tell me how you resolved that? I'm having the same problem.

Thanks,
Tuan

Hi Tuan,

Could you clarify what problem you have?
What library version do you use?
Could you post the commands and error messages here? I'll try to help you.

Regards!
Dmitry

Hi Tuan, I just did what Dmitry told me. He is a wonderful expert.

Thanks, Tamuer.

What was the fix? I've just begun experiencing the problem on a cluster that was working previously. Thanks.

Hi Daniel,

Could you clarify what the problem is? What version of the Intel MPI Library do you use?
Usually there are some log files in the /tmp directory. Try 'ls /tmp | grep mpd'.

Please provide as much information as possible and I'll try to help you.

Regards!
Dmitry

Hi, I'm using ICT 3.2.2. Everything works fine on all nodes except 2. The install is on a shared filesystem. The logfile is empty. If I run with -d I get (identical output on two consecutive attempts):

[root@test1 ~]# mpdboot -n 2 -r ssh -f machines -d
debug: starting
running mpdallexit on test1
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=test1 -e -d -s 2
debug: mpd on test1 on port 37556
debug: info for running mpd: {'ip': '10.11.178.192', 'ncpus': 1, 'list_port': 37556, 'entry_port': '', 'host': 'test1', 'entry_host': '', 'ifhn': ''}
debug: launch cmd= ssh -x -n test2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/impi/3.2.2.006/bin64/mpd.py -h test1 -p 37556 --ifhn=10.11.179.27 --ncpus=1 --myhost=test2 --myip=10.11.179.27 -e -d -s 2
debug: mpd on test2 on port 50042
mpdboot_test1 (handle_mpd_output 886): failed to ping mpd on test2; received output={}

Daniel,

Could you check that you can log in without entering a password (or passphrase) from test1 to test2 and vice versa?
[root@test1 ~] ssh test1

A passwordless ssh connection is one of the requirements.

Regards!
Dmitry

Passwordless ssh is working properly.

Daniel,

It looks like there are some limitations on the network ports. Do you use a firewall? Or maybe some ports are restricted? Could you please check with your system administrator?
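To rule firewalling in or out quickly, one can also try connecting directly to the port mpd reports in the debug output (50042 on test2 in the trace above). A hedged sketch; `port_open` is my name for the helper, not an Intel MPI tool:

```python
# Minimal sketch: test TCP reachability of a remote mpd listening port,
# e.g. the "debug: mpd on test2 on port 50042" line above.
# port_open() is a hypothetical helper, not part of Intel MPI.
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("test2", 50042) from test1 -- and the reverse direction,
# since the ring needs connectivity both ways
```

If this returns False while the mpd process is running on the remote node, something between the nodes (firewall, restricted port range) is dropping the connection.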

Regards!
Dmitry

Hi Dmitry,

No firewall rules are defined, and selinux is disabled. I can use ibping to ping between the machines and get replies, but I still cannot create a ring. MPDs can start locally. Passwordless ssh works perfectly. Authentication is from the same NIS server as all the other nodes in the cluster that do work. It is an odd problem, IMHO! Any more suggestions?

Thanks,
Dan

Hi Daniel,

Let's compare ssh versions! I'm using:
[root@cluster1002 ~]$ ssh -V
ssh: Reflection for Secure IT 6.1.2.1 (build 3005) on x86_64-redhat-linux-gnu (64-bit)

Could you check for mpd processes on both nodes?
[root@cluster1002 ~] ps ux
[root@cluster1002 ~] ssh -x test2 ps ux
If there is an mpd process please kill it.

[root@cluster1002 ~] echo test1 > mpd.hosts
[root@cluster1002 ~] echo test2 >> mpd.hosts
[root@cluster1002 ~] mpdboot -r ssh -n 2 -d
Check the ring:
[root@cluster1002 ~] mpdtrace
If there is no ring, let's try to create it by hand:

[root@test1 ~] env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=test1 -e -d -s 2
You'll get a port number (port_number), which will be used in the next command:

[root@test1 ~] ssh -x -n test2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/impi/3.2.2.006/bin64/mpd.py -h test1 -p port_number --ifhn=10.11.179.27 --ncpus=1 --myhost=test2 --myip=10.11.179.27 -e -d -s 2

If ssh works correctly, a new mpd ring will be created:
[root@test1 ~] mpdtrace
test1
test2

If it doesn't work, it means you have some configuration issues. If it works, send me the output; your ssh probably prints the information in a different format.
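For the manual two-step procedure above, the port number can also be pulled out of the first command's output programmatically. A small sketch based on the "debug: mpd on test1 on port 37556" line format shown in the traces (`mpd_port` is my name for the helper):

```python
# Minimal sketch: extract the listening port from an mpd debug line,
# matching the "debug: mpd on <host> on port <N>" format in the traces above.
# mpd_port() is a hypothetical helper, not part of Intel MPI.
import re

def mpd_port(debug_line):
    """Return the port number from an mpd debug line, or None if absent."""
    m = re.search(r"on port (\d+)", debug_line)
    return int(m.group(1)) if m else None

# e.g. mpd_port("debug: mpd on test1 on port 37556")
```

Note that I_MPI_JOB_TAGGED_PORT_OUTPUT=1 affects how mpd tags its port output, so if your ssh or mpd prints in a different format, the pattern would need adjusting.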

Regards!
Dmitry
