"failed to ping mpd" with intel MPI

"failed to ping mpd" with intel MPI

Hi,

I am sometimes able to run parallel jobs, but very often they fail with errors - most often with:

mpdboot_cl1n052 (handle_mpd_output 575): failed to ping mpd on cl1n038; recvd output={}

but sometimes the error is:

mpdboot_cl1n003 (handle_mpd_output 583): failed to connect to mpd on cl1n040

The node names (cl1nNNN) in the error messages are not always the same, so I suspect it is something systemic.

The mpd commands I use are:

mpdallexit
mpdboot -n 64 -r ssh -f ${NODEFILE}
mpdtrace
mpiexec -np 64 ./a.out
mpdallexit
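Before the mpdboot step above, it can be worth confirming that every node in the hostfile answers a passwordless ssh command, since one unreachable node is enough to produce this kind of ping failure. A minimal sketch (check_nodes is a hypothetical helper, not part of the mpd tools; ${NODEFILE} is the same hostfile passed to mpdboot -f):

```shell
# Hypothetical pre-flight check: try a trivial passwordless ssh command
# on each node listed (one hostname per line) in the file given as $1,
# and report which nodes answer.
check_nodes() {
    while read -r node; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null
        then
            echo "$node ok"
        else
            echo "$node UNREACHABLE"
        fi
    done < "$1"
}

# Run it against the same hostfile used for mpdboot -f, if set.
if [ -n "${NODEFILE:-}" ]; then
    check_nodes "${NODEFILE}"
fi
```

Any node printed as UNREACHABLE is a candidate for the host named in the "failed to ping mpd" message.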

Can anyone give a suggestion? I should say that we have both TCP and InfiniBand, but our InfiniBand is broken at the moment. Typically Intel MPI doesn't mind that very much and fails over to TCP. In case it helps, /etc/hosts is appended to this email.

Thanks,
Sean

127.0.0.1 localhost.localdomain localhost
10.11.12.7 files.tae.mysite.com files.mysite.com files

# special IPv6 addresses
::1 localhost ipv6-localhost ipv6-loopback

fe00::0 ipv6-localnet

ff00::0 ipv6-mcastprefix
ff02::1 ipv6-allnodes
ff02::2 ipv6-allrouters
ff02::3 ipv6-allhosts
#The following was added by scance. Do not remove:
10.0.1.1 cl1n001
10.0.1.10 cl1n010
10.0.1.11 cl1n011
10.0.1.12 cl1n012
10.0.1.13 cl1n013
10.0.1.14 cl1n014
10.0.1.15 cl1n015
10.0.1.16 cl1n016
10.0.1.17 cl1n017
10.0.1.18 cl1n018
10.0.1.19 cl1n019
10.0.1.2 cl1n002
10.0.1.20 cl1n020
10.0.1.21 cl1n021
10.0.1.22 cl1n022
10.0.1.23 cl1n023
10.0.1.24 cl1n024
10.0.1.25 cl1n025
10.0.1.26 cl1n026
10.0.1.27 cl1n027
10.0.1.28 cl1n028
10.0.1.29 cl1n029
10.0.1.3 cl1n003
10.0.1.30 cl1n030
10.0.1.31 cl1n031
10.0.1.32 cl1n032
10.0.1.33 cl1n033
10.0.1.34 cl1n034
10.0.1.35 cl1n035
10.0.1.36 cl1n036
10.0.1.37 cl1n037
10.0.1.38 cl1n038
10.0.1.39 cl1n039
10.0.1.4 cl1n004
10.0.1.40 cl1n040
10.0.1.41 cl1n041
10.0.1.42 cl1n042
10.0.1.43 cl1n043
10.0.1.44 cl1n044
10.0.1.45 cl1n045
10.0.1.46 cl1n046
10.0.1.47 cl1n047
10.0.1.48 cl1n048
10.0.1.49 cl1n049
10.0.1.5 cl1n005
10.0.1.50 cl1n050
10.0.1.51 cl1n051
10.0.1.52 cl1n052
10.0.1.53 cl1n053
10.0.1.54 cl1n054
10.0.1.55 cl1n055
10.0.1.56 cl1n056
10.0.1.57 cl1n057
10.0.1.58 cl1n058
10.0.1.59 cl1n059
10.0.1.6 cl1n006
10.0.1.60 cl1n060
10.0.1.61 cl1n061
10.0.1.62 cl1n062
10.0.1.63 cl1n063
10.0.1.64 cl1n064
10.0.1.7 cl1n007
10.0.1.8 cl1n008
10.0.1.9 cl1n009
10.0.10.1 taz3.americas.sgi.com taz3
10.0.40.1 cl1n001-bmc
10.0.40.10 cl1n010-bmc
10.0.40.11 cl1n011-bmc
10.0.40.12 cl1n012-bmc
10.0.40.13 cl1n013-bmc
10.0.40.14 cl1n014-bmc
10.0.40.15 cl1n015-bmc
10.0.40.16 cl1n016-bmc
10.0.40.17 cl1n017-bmc
10.0.40.18 cl1n018-bmc
10.0.40.19 cl1n019-bmc
10.0.40.2 cl1n002-bmc
10.0.40.20 cl1n020-bmc
10.0.40.21 cl1n021-bmc
10.0.40.22 cl1n022-bmc
10.0.40.23 cl1n023-bmc
10.0.40.24 cl1n024-bmc
10.0.40.25 cl1n025-bmc
10.0.40.26 cl1n026-bmc
10.0.40.27 cl1n027-bmc
10.0.40.28 cl1n028-bmc
10.0.40.29 cl1n029-bmc
10.0.40.3 cl1n003-bmc
10.0.40.30 cl1n030-bmc
10.0.40.31 cl1n031-bmc
10.0.40.32 cl1n032-bmc
10.0.40.33 cl1n033-bmc
10.0.40.34 cl1n034-bmc
10.0.40.35 cl1n035-bmc
10.0.40.36 cl1n036-bmc
10.0.40.37 cl1n037-bmc
10.0.40.38 cl1n038-bmc
10.0.40.39 cl1n039-bmc
10.0.40.4 cl1n004-bmc
10.0.40.40 cl1n040-bmc
10.0.40.41 cl1n041-bmc
10.0.40.42 cl1n042-bmc
10.0.40.43 cl1n043-bmc
10.0.40.44 cl1n044-bmc
10.0.40.45 cl1n045-bmc
10.0.40.46 cl1n046-bmc
10.0.40.47 cl1n047-bmc
10.0.40.48 cl1n048-bmc
10.0.40.49 cl1n049-bmc
10.0.40.5 cl1n005-bmc
10.0.40.50 cl1n050-bmc
10.0.40.51 cl1n051-bmc
10.0.40.52 cl1n052-bmc
10.0.40.53 cl1n053-bmc
10.0.40.54 cl1n054-bmc
10.0.40.55 cl1n055-bmc
10.0.40.56 cl1n056-bmc
10.0.40.57 cl1n057-bmc
10.0.40.58 cl1n058-bmc
10.0.40.59 cl1n059-bmc
10.0.40.6 cl1n006-bmc
10.0.40.60 cl1n060-bmc
10.0.40.61 cl1n061-bmc
10.0.40.62 cl1n062-bmc
10.0.40.63 cl1n063-bmc
10.0.40.64 cl1n064-bmc
10.0.40.7 cl1n007-bmc
10.0.40.8 cl1n008-bmc
10.0.40.9 cl1n009-bmc
10.11.12.9 taz.mysite.com taz
192.168.10.1 linux.site linux
#End scance-section

If you're expecting the default device to fail, why not specify the device you want? I've run into ssm performing better than fail over.

Quoting - tim18
If you're expecting the default device to fail, why not specify the device you want? I've run into ssm performing better than fail over.

OK, so how should I specify tcp/ip? I tried this:

export I_MPI_DEVICE=rdssm:sock

It failed to ping again as before. Is my syntax wrong?

Quoting - sdettrick

OK, so how should I specify tcp/ip? I tried this:

export I_MPI_DEVICE=rdssm:sock

It failed to ping again as before. Is my syntax wrong?

Not being an expert, I thought if you expect rdssm to fail, you would set
I_MPI_DEVICE=ssm

Hi Sean,

To specify using TCP/IP, you need to set I_MPI_DEVICE=ssm. This will run over sockets across nodes and use the shm device within a node.
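With that setting, the launch sequence from the first post stays the same and only gains the device variable (ssm forces sockets between nodes instead of the default rdssm, which tries InfiniBand first):

```shell
# Use sockets (TCP) between nodes and shared memory within a node.
export I_MPI_DEVICE=ssm

mpdallexit
mpdboot -n 64 -r ssh -f ${NODEFILE}
mpdtrace
mpiexec -np 64 ./a.out
mpdallexit
```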

Additionally, the error you provide could be due to a failed connection to the node, an inability to start the mpd daemon on the remote node, etc. Can you verify that you're using the latest version, Intel MPI Library 3.2 Update 1? You can do so by running "mpiexec -V".

Also, make sure no leftover mpd Python processes exist on the nodes. You can check by running "ps aux | grep mpd". Go ahead and kill any leftover mpd.py processes you find.
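A sketch of that cleanup across all nodes (a hypothetical loop, not an Intel MPI tool; it assumes passwordless ssh, that pkill is available on the nodes, and that ${NODEFILE} is the same hostfile used for mpdboot -f):

```shell
# On each node in the hostfile, kill any stray mpd Python processes
# left behind by a crashed run (pkill -f matches "mpd.py" anywhere
# in the process command line).
while read -r node; do
    ssh "$node" 'pkill -f mpd.py'
done < "${NODEFILE}"
```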

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Quoting - Gergana Slavova (Intel)

Also, make sure no leftover mpd python processes exist on the nodes. You can do so by running "ps aux | grep mpd". Go ahead and kill any left over mpd.py procs you find.

You may simplify the cleanup task by running mpdallexit to close down your mpd, before looking for the rogue python processes.

Maybe it is related to SELinux or the firewall. You can stop those services and try again.
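On RHEL-style nodes that might look like the following (run as root on each node; this assumes SysV init scripts, and setenforce 0 only lasts until the next reboot, so it is safe for a quick test):

```shell
# Temporarily put SELinux in permissive mode and stop the firewall,
# then retry the mpdboot sequence. Revert once the test is done.
setenforce 0
service iptables stop
getenforce          # should now report "Permissive"
```

If mpdboot succeeds with these off, the firewall rules on the compute nodes are likely blocking the mpd ports.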
