Error when executing mpirun on 2 nodes

Hi!

I have two compute nodes and a head node.

OS: CentOS 5.5, with mpi-rt-4.0.0.028 installed.

The mpd ring starts normally:

[root@head ~]# mpdboot -d -v -r ssh -f /root/mpd.hosts -n 3
debug: starting
running mpdallexit on head.kazntu.local
LAUNCHED mpd on head.kazntu.local via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/mpi-rt/4.0.0/bin64/mpd.py --ncpus=1 --myhost=head.kazntu.local -e -d -s 3
debug: mpd on head.kazntu.local on port 45035
RUNNING: mpd on head.kazntu.local
debug: info for running mpd: {'ip': '', 'ncpus': 1, 'list_port': 45035, 'entry_port': '', 'host': 'head.kazntu.local', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on node01 via head.kazntu.local
debug: launch cmd= ssh -x -n -q node01 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/mpi-rt/4.0.0/bin64/mpd.py -h head.kazntu.local -p 45035 --ifhn=192.168.192.21 --ncpus=1 --myhost=node01 --myip=192.168.192.21 -e -d -s 3
LAUNCHED mpd on node02 via head.kazntu.local
debug: launch cmd= ssh -x -n -q node02 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/mpi-rt/4.0.0/bin64/mpd.py -h head.kazntu.local -p 45035 --ifhn=192.168.192.22 --ncpus=1 --myhost=node02 --myip=192.168.192.22 -e -d -s 3
debug: mpd on node01 on port 43150
RUNNING: mpd on node01
debug: info for running mpd: {'ip': '192.168.192.21', 'ncpus': 1, 'list_port': 43150, 'entry_port': 45035, 'host': 'node01', 'entry_host': 'head.kazntu.local', 'ifhn': '', 'pid': 6272}
debug: mpd on node02 on port 43164
RUNNING: mpd on node02
debug: info for running mpd: {'ip': '192.168.192.22', 'ncpus': 1, 'list_port': 43164, 'entry_port': 45035, 'host': 'node02', 'entry_host': 'head.kazntu.local', 'ifhn': '', 'pid': 6273}
[root@head ~]#

but mpirun crashes:

[root@head ~]# mpirun -n 8 -wdir /linpack/ -host node01 /linpack/xhpl_em64t : -host node02 /linpack/xhpl_em64t
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
rank 0 in job 1 head.kazntu.local_53559 caused collective abort of all ranks
exit status of rank 0: return code 13
[root@head ~]#

How can I get mpirun to work?

Any help would be appreciated!

P.S. Sorry for my English.


When I run mpirun with the I_MPI_DEBUG option:

[root@head ~]# mpirun -n 16 -env I_MPI_DEBUG 2 -wdir /linpack/ /linpack/xhpl_em64t
[1] MPI startup(): cannot open dynamic library libdat.so
[2] MPI startup(): cannot open dynamic library libdat.so
[2] MPI startup(): cannot open dynamic library libdat2.so

[3] MPI startup(): cannot open dynamic library libdat.so
[3] MPI startup(): cannot open dynamic library libdat2.so
[1] MPI startup(): cannot open dynamic library libdat2.so
[0] MPI startup(): cannot open dynamic library libdat.so
[5] MPI startup(): cannot open dynamic library libdat.so
[5] MPI startup(): cannot open dynamic library libdat2.so
[0] MPI startup(): cannot open dynamic library libdat2.so
[8] MPI startup(): cannot open dynamic library libdat.so
[8] MPI startup(): cannot open dynamic library libdat2.so
[7] MPI startup(): cannot open dynamic library libdat.so
[6] MPI startup(): cannot open dynamic library libdat.so
[6] MPI startup(): cannot open dynamic library libdat2.so
[7] MPI startup(): cannot open dynamic library libdat2.so
[4] MPI startup(): cannot open dynamic library libdat.so
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
[4] MPI startup(): cannot open dynamic library libdat2.so
[12] MPI startup(): cannot open dynamic library libdat.so
[12] MPI startup(): cannot open dynamic library libdat2.so
[14] MPI startup(): cannot open dynamic library libdat.so
[14] MPI startup(): cannot open dynamic library libdat2.so
[10] MPI startup(): cannot open dynamic library libdat.so
[10] MPI startup(): cannot open dynamic library libdat2.so
[15] MPI startup(): cannot open dynamic library libdat.so
[15] MPI startup(): cannot open dynamic library libdat2.so
[9] MPI startup(): cannot open dynamic library libdat.so
[9] MPI startup(): cannot open dynamic library libdat2.so
[11] MPI startup(): cannot open dynamic library libdat.so
[11] MPI startup(): cannot open dynamic library libdat2.so
[13] MPI startup(): cannot open dynamic library libdat.so
[13] MPI startup(): cannot open dynamic library libdat2.so
rank 0 in job 1 head.kazntu.local_57440 caused collective abort of all ranks
exit status of rank 0: return code 13

Where can I get these libraries? What are they needed for?

Hi Jabuin,

libdat2.so is part of the OFED stack. You can download it from http://www.openfabrics.org/downloads/OFED/ofed-1.5.2/
If you configure and install this package, everything should be fine.

The package may already be installed, just not in the default directory. Could you please check that this library is reachable either via the paths listed in /etc/ld.so.conf or via $LD_LIBRARY_PATH?
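As a quick sketch of that check (library names taken from the MPI startup messages above; nothing else is assumed about your install), you could run on each node:

```shell
# Check whether the dynamic linker can find the DAPL libraries.
# libdat.so / libdat2.so are the names from the "cannot open dynamic
# library" messages in the I_MPI_DEBUG output.
ldconfig -p | grep -i 'libdat' || echo "libdat not in the ldconfig cache"

# Also show any extra search paths set in the current shell.
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<empty>}"
```

If the first command prints nothing from the cache, adding the OFED library directory to /etc/ld.so.conf and re-running ldconfig should make it visible.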

If your DAPL library is not properly configured, you can try a socket connection:
'mpirun -n 16 -nolocal -env I_MPI_FABRICS shm:tcp /linpack/xhpl_em64t'
Please try out this command line and let me know the result.

If you are using 'mpirun', you don't need to run 'mpdboot' first. If you are using 'mpdboot', please use 'mpiexec' instead.

Regards!
Dmitry

Hi Dmitry,

Without the OFED stack:

[root@head ~]# mpirun -n 16 -env I_MPI_FABRICS shm:tcp /linpack/xhpl_em64t
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
rank 0 in job 1 head.kazntu.local_53450 caused collective abort of all ranks
exit status of rank 0: return code 13
[root@head ~]#

After installation:

[root@head ~]# mpirun -n 16 -wdir /linpack/ -env I_MPI_FABRICS shm:tcp -host node01 /linpack/xhpl_em64t : -host node02 /linpack/xhpl_em64t
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
CMA: unable to get RDMA device list

Debugging:

[root@head ~]# mpirun -n 16 -env I_MPI_DEBUG 2 -wdir /linpack/ /linpack/xhpl_em64t
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list

Your name is Eugeny, isn't it?
Is your e-mail address real? Perhaps it would be better to communicate by e-mail.

Can I get access to your cluster?

Did you recompile xhpl against the Intel MPI library? Could you provide the output of 'ldd xhpl'?
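To illustrate what that check looks like (using /bin/ls here only as a stand-in binary, since I don't have your /linpack/xhpl_em64t):

```shell
# Print the shared libraries a binary is linked against.
# On the cluster, replace /bin/ls with /linpack/xhpl_em64t;
# the Intel MPI runtime should show up as libmpi.so in the list,
# each entry resolved to a real path rather than "not found".
ldd /bin/ls
```

Any "not found" entry in the real xhpl output would point directly at the missing library.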

The error looks very strange: mpdman.py cannot parse a message. This means the message is either malformed or contains unexpected characters. We have never seen such a case before.

Regards!
Dmitry

Yes, it's my real e-mail. Can we speak Russian?

Further communication continued via e-mail.

After editing /etc/dat.conf and running /sbin/modinfo rdma_ucm on the compute nodes, I now get the following errors:

[root@head ~]# mpirun -n 16 -env I_MPI_DEBUG 2 -wdir /linpack/ /linpack/xhpl_em64t
node02.kazntu.local:9961: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9960: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9955: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9959: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9957: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10329: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10327: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9962: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10324: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10323: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10325: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9958: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10328: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10326: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9956: open_hca: rdma_bind ERR No such device. Is eth0 configured?
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
node01.kazntu.local:10330: open_hca: rdma_bind ERR No such device. Is eth0 configured?
rank 0 in job 1 head.kazntu.local_38570 caused collective abort of all ranks
exit status of rank 0: return code 13
[root@head ~]#
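The rdma_bind failure suggests no RDMA-capable device is actually visible on the nodes. As a hedged sketch (the sysfs path is an assumption about a standard OFED install on Linux), one could check on each node:

```shell
# List RDMA devices registered with the kernel. An empty result means
# DAPL over eth0 (RoCE) cannot work, and shm:tcp is the safe fallback.
devs=$(ls /sys/class/infiniband 2>/dev/null)
if [ -n "$devs" ]; then
    echo "$devs"
else
    echo "no RDMA devices found"
fi
```

With plain Gigabit Ethernet NICs like the ones in the ifconfig output below, "no RDMA devices found" would be the expected result, and I_MPI_FABRICS shm:tcp the appropriate setting.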

/etc/dat.conf:

[root@node01 ~]# cat /etc/dat.conf
#OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
#OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
#OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
#OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
#OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
#OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
#OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
#OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
#OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
#OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth0 0" ""
OpenIB-cma-roe-eth0 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth0 0" ""
#OpenIB-cma-roe-eth2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
#OpenIB-cma-roe-eth3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth3 0" ""
#OpenIB-scm-roe-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
#OpenIB-scm-roe-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
#ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
#ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
#ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
#ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
#ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
#ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
#ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
#ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
#ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
#ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
#ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
#ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
#ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
#ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth0 0" ""
#ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
#ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
#ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
#ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
[root@node01 ~]#
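For reference, each non-comment dat.conf entry above follows the usual DAT static-registry field layout (field names here are my gloss of the common documentation, not from this file itself):

```
# <IA name> <API version> <thread safety> <default> <provider library> \
#   <provider version> "<device port>" "<platform params>"
OpenIB-cma-roe-eth0 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth0 0" ""
```

So the two active entries bind the uDAPL 1.2 and 2.0 CMA providers to eth0, which only works if eth0 is backed by an RDMA-capable (RoCE) device.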

ifconfig -a:

[root@node01 ~]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:23:8B:BD:5F:D2
inet addr:192.168.192.21 Bcast:192.168.195.255 Mask:255.255.252.0
inet6 addr: fe80::223:8bff:febd:5fd2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:415397 errors:0 dropped:0 overruns:0 frame:0
TX packets:406833 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:120265102 (114.6 MiB) TX bytes:116642856 (111.2 MiB)
Memory:fa9e0000-faa00000

eth1 Link encap:Ethernet HWaddr 00:23:8B:BD:5F:D3
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:fa960000-fa980000

eth2 Link encap:Ethernet HWaddr 00:23:8B:BD:5F:D4
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:faae0000-fab00000

eth3 Link encap:Ethernet HWaddr 00:23:8B:BD:5F:D5
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:faa60000-faa80000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:62800 errors:0 dropped:0 overruns:0 frame:0
TX packets:62800 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:8157453 (7.7 MiB) TX bytes:8157453 (7.7 MiB)

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

[root@node01 ~]#

P.S. Progress is slow, since the hardware is still under test.

Please answer via e-mail :)
