mpdboot gives python error for more than one node

Hi everybody!

I just installed ICTCE on my test machine (1 PC, 2 VMs as nodes). When I try to get an MPI ring up and running, this happens:
> mpdboot -n 3 -f mpd.hosts
LAUNCHED mpd on istanbul via
RUNNING: mpd on istanbul
LAUNCHED mpd on cnode1 via istanbul
Traceback (most recent call last):
File "", line 918, in
File "", line 669, in mpdboot
File "", line 758, in launch_one_mpd
File "/usr/lib/python2.6/subprocess.py", line 595, in __init__
errread, errwrite)
File "/usr/lib/python2.6/subprocess.py", line 1106, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory

where mpd.hosts looks like this:
istanbul
cnode1
cnode2

mpdcheck -f mpd.hosts -v gives
obtaining hostname via gethostname and getfqdn
gethostname gives istanbul
getfqdn gives istanbul.site
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames; make sure other than 127.0.0.1
gethostbyname_ex: ('istanbul.site', ['istanbul'], ['192.168.220.105'])
gethostbyname_ex: ('istanbul.site', ['istanbul'], ['192.168.220.105'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
checking gethostbyXXX for unqualified istanbul
gethostbyname_ex: ('istanbul.site', ['istanbul'], ['192.168.220.105'])
checking gethostbyXXX for qualified istanbul
gethostbyname_ex: ('istanbul.site', ['istanbul'], ['192.168.220.105'])
checking gethostbyXXX for unqualified cnode1
gethostbyname_ex: ('cnode1.site', ['cnode1'], ['192.168.220.118'])
checking gethostbyXXX for qualified cnode1
gethostbyname_ex: ('cnode1.site', ['cnode1'], ['192.168.220.118'])
checking gethostbyXXX for unqualified cnode2
gethostbyname_ex: ('cnode2.site', ['cnode2'], ['192.168.220.119'])
checking gethostbyXXX for qualified cnode2
gethostbyname_ex: ('cnode2.site', ['cnode2'], ['192.168.220.119'])
obtain IP addrs via localhost name; make sure that it equal to 127.0.0.1
gethostbyname_ex: ('localhost', ['ipv6-localhost', 'ipv6-loopback'], ['127.0.0.1'])

ssh cnode1 and so on works perfectly well. lamboot mpd.hosts also works, so I'm pretty sure that establishing connections to the other nodes is not the problem.

Any ideas?

Thanks in advance.


Quoting - ictceeval
ssh cnode1 and so on works perfectly well. lamboot mpd.hosts also works, so I'm pretty sure that establishing connections to the other nodes is not the problem.

Hi ictceeval,

Thanks for posting. Since you're using ssh for remote shell access, you need to specify this on the mpdboot command line:

$ mpdboot -r ssh -n 3 -f mpd.hosts

The default for the Intel MPI Library is rsh.

Let us know how it goes.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com
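
A note on the traceback in the first post: `OSError: [Errno 2] No such file or directory` raised from `subprocess.py` is what Python reports when the program it was asked to start cannot be found. With rsh as the default remote shell, one plausible reading (an assumption, the thread does not confirm it) is that no `rsh` binary was installed on istanbul. The sketch below reproduces that error class with a deliberately nonexistent command. It is written for Python 3, not the Python 2.6 shown in the traceback, and the helper names are made up for illustration; they are not part of mpdboot.

```python
import errno
import shutil
import subprocess


def remote_shell_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def try_launch(argv):
    """Start argv and wait for it; return None if it started,
    or the errno if the program itself could not be started."""
    try:
        proc = subprocess.Popen(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        proc.wait()
        return None
    except OSError as exc:
        # A missing executable surfaces here as errno.ENOENT (Errno 2).
        return exc.errno


if __name__ == "__main__":
    # Hypothetical command name, chosen so that it does not exist.
    missing = "no-such-remote-shell-for-demo"
    print("on PATH:", remote_shell_available(missing))
    print("ENOENT:", try_launch([missing, "cnode1", "true"]) == errno.ENOENT)
```

Running `remote_shell_available("rsh")` and `remote_shell_available("ssh")` on the head node is a quick way to see which remote shell is actually installed before choosing a value for `-r`.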

Thanks for your quick help! I didn't know that. Unfortunately this seems to lead to another issue:

> mpdboot -n 3 -f mpd.hosts -r ssh -v
running mpdallexit on istanbul
LAUNCHED mpd on istanbul via
RUNNING: mpd on istanbul
LAUNCHED mpd on cnode1 via istanbul
LAUNCHED mpd on cnode2 via istanbul
mpdboot_istanbul (handle_mpd_output 828): Failed to establish a socket connection with cnode1:41650 : [Errno 111] Connection refused
mpdboot_istanbul (handle_mpd_output 845): failed to connect to mpd on cnode1

How do I interpret that output? It first reports "LAUNCHED mpd on cnode1" and then "Failed to establish a socket connection with cnode1".

UPDATE:
Somehow, things seem to get out of hand. Now, I'm getting this message:
> mpdboot -n 3 -f mpd.hosts -r ssh -v --chkup
checking cnode1
checking cnode2
there are 3 hosts up (counting local)
running mpdallexit on istanbul
LAUNCHED mpd on istanbul via
RUNNING: mpd on istanbul
LAUNCHED mpd on cnode1 via istanbul
LAUNCHED mpd on cnode2 via istanbul
mpdboot_istanbul (handle_mpd_output 837): failed to ping mpd on cnode1; received output={}

Hi Gergana!

I'm glad to report: problem solved. The other issue was that the slave nodes couldn't talk back to the master node because of missing entries in /etc/hosts and missing SSH keys. Having fixed that, I am now able to set up an MPI ring.

Thank you very much for your help!

ictceeval
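
For anyone hitting the same "failed to connect" or "failed to ping mpd" messages: the fix above suggests checking name resolution from every node, not only from the head node. Below is a minimal sketch of such a check, assuming Python 3 and the host names from this thread. `check_hosts` is a hypothetical helper, not part of the Intel MPI tools, and it only covers the /etc/hosts half of the fix; passwordless ssh in both directions has to be tested separately.

```python
import socket


def check_hosts(hosts, resolve=socket.gethostbyname_ex):
    """Return a dict mapping each problematic host name to a short reason.

    `resolve` can be replaced for testing; by default the system
    resolver is used, which consults /etc/hosts and DNS.
    """
    problems = {}
    for host in hosts:
        try:
            _name, _aliases, addrs = resolve(host)
        except OSError as exc:
            # socket.gaierror and socket.herror are subclasses of OSError.
            problems[host] = "does not resolve: %s" % exc
            continue
        if any(addr.startswith("127.") for addr in addrs):
            # A remote node must not map to a loopback address.
            problems[host] = "resolves to loopback: %s" % addrs
    return problems


if __name__ == "__main__":
    # Host names taken from mpd.hosts in this thread.
    # When run on istanbul itself, drop it from the list if it is
    # intentionally mapped to a loopback address there.
    for host, reason in check_hosts(["istanbul", "cnode1", "cnode2"]).items():
        print(host, "->", reason)
```

Run it on cnode1 and cnode2 as well as on istanbul; an empty result on all three means every node can resolve every other node to a non-loopback address.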

Quoting - ictceeval
I'm glad to report: problem solved. The other issue was that the slave nodes couldn't talk back to the master node because of missing entries in /etc/hosts and missing SSH keys. Having fixed that, I am now able to set up an MPI ring.

Thanks for letting me know, ictceeval. I'm glad things are working for you now. Have fun using the Intel Cluster Tools!

Regards,
~Gergana