Intel MPI does not work on diskless machines?

Posted by mityh

I have built a diskless cluster whose nodes mount their root image over NFS through the Ethernet interfaces,
and I want to use Intel MPI 4.0.2.003 on it.

However, it cannot be installed. When I run ./install.sh -s cfg.txt -t /scratch/work/tmp,
it hangs permanently.

I also tried installing it on a machine with a local disk, with the target directory set to
an NFS-mounted path, and then mounting that path on my diskless cluster. This time it
fails with the following messages:

[root@c07b03 work]# /apps/intel/impi/4.0.2.003/bin64/mpdtrace;/apps/intel/impi/4.0.2.003/bin64/mpiexec -machinefile ./nodes -n 24 ./xhpl
ibc07b03
ibc07b04
c07b03:3590: open_hca: rdma_create_id ERR Invalid argument
c07b03:3588: open_hca: rdma_create_id ERR Invalid argument
c07b03:3585: open_hca: rdma_create_id ERR Invalid argument
c07b03:3594: open_hca: rdma_create_id ERR Invalid argument
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
c07b03:3591: open_hca: rdma_create_id ERR Invalid argument
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
c07b03:3584: open_hca: rdma_create_id ERR Invalid argument
c07b03:3583: open_hca: rdma_create_id ERR Invalid argument
c07b03:3589: open_hca: rdma_create_id ERR Invalid argument
c07b03:3592: open_hca: rdma_create_id ERR Invalid argument
c07b03:3593: open_hca: rdma_create_id ERR Invalid argument
c07b03:3587: open_hca: rdma_create_id ERR Invalid argument
c07b03:3586: open_hca: rdma_create_id ERR Invalid argument
c07b04:3626: open_hca: rdma_create_id ERR Invalid argument
c07b04:3621: open_hca: rdma_create_id ERR Invalid argument
c07b04:3624: open_hca: rdma_create_id ERR Invalid argument
c07b04:3622: open_hca: rdma_create_id ERR Invalid argument
rank 0 in job 1 ibc07b03_38813 caused collective abort of all ranks
exit status of rank 0: return code 13

The following is the output of the mount command:
[root@c07b03 work]# mount
rootfs on / type rootfs (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev type tmpfs (rw)
none on /dev/pts type devpts (rw)
172.16.38.8:/share/apps/hgadmin/hpcgateway/plugins/clusteros/rootimage-rhel5u5-ib1531-lustre185_allinone_mds01 on / type nfs (rw,vers=3,rsize=32768,wsize=32768,soft,intr,nolock,proto=udp,timeo=20,retrans=3,sec=sys,addr=172.16.38.8)
/dev/ram on /ram type tmpfs (rw)
/proc on /proc type proc (rw)
sunpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/proc/bus/usb on /proc/bus/usb type usbfs (rw)
devpts on /dev/pts type devpts (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
none on /ipathfs type ipathfs (rw)
tmpfs on /dev/shm type tmpfs (rw)
sysfs on /sys type sysfs (rw)
/dev/sda5 on /scratch type ext3 (rw,data=ordered)
/etc/auto.misc on /misc type autofs (rw,fd=7,pgrp=3160,timeout=300,minproto=5,maxproto=5,indirect)
-hosts on /net type autofs (rw,fd=13,pgrp=3160,timeout=300,minproto=5,maxproto=5,indirect)
/etc/auto.home on /home type autofs (rw,fd=19,pgrp=3160,timeout=30,minproto=5,maxproto=5,indirect)
/etc/auto.job on /jobmgr type autofs (rw,fd=25,pgrp=3160,timeout=30,minproto=5,maxproto=5,indirect)
/etc/auto.app on /apps type autofs (rw,fd=31,pgrp=3160,timeout=30,minproto=5,maxproto=5,indirect)
appserver:/apps/intel on /apps/intel type nfs (ro,vers=3,rsize=32768,wsize=32768,soft,intr,proto=udp,timeo=11,retrans=2,sec=sys,addr=appserver)

When I mount appserver:/apps/intel on a machine with a local disk, the application runs correctly.

Posted by Dmitry Kuzmin (Intel)

Hello,

About the installation: could you attach cfg.txt and the
/scratch/work/tmp/intel.*.log file?

About the mpiexec error in the diskless configuration:
The default path to the DAPL configuration file is /etc/dat.conf, and the dynamic libraries need to be found in the standard search path. It's not clear what your configuration is on the diskless nodes.
You can add '-env I_MPI_DEBUG 100' to your mpiexec command line and attach the output. I'll take a look.

Regards!
Dmitry
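
One way to check the two conditions mentioned in this reply (a readable /etc/dat.conf and provider libraries that the dynamic linker can find) is sketched below. It assumes the usual dat.conf layout, in which the first field of each entry is the provider name and the fifth is the library. The function names are made up for illustration, and ldconfig may have to be called as /sbin/ldconfig when not running as root.

```shell
#!/bin/sh
# Print "provider library" pairs from a dat.conf-style file,
# skipping blank lines and comment lines.
list_dapl_providers() {
    conf="${1:-/etc/dat.conf}"
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    awk '!/^[ \t]*(#|$)/ { print $1, $5 }' "$conf"
}

# For each provider, report whether its library appears in the
# dynamic linker cache (ldconfig -p), as in the ldconfig check
# shown later in this thread.
check_dapl_libs() {
    list_dapl_providers "$1" | while read -r name lib; do
        if ldconfig -p 2>/dev/null | grep -q -F "$lib"; then
            echo "$name: $lib found"
        else
            echo "$name: $lib NOT found"
        fi
    done
}

# Typical use on a diskless node (not executed here):
#   check_dapl_libs /etc/dat.conf
#   mpiexec -machinefile ./nodes -n 24 -env I_MPI_DEBUG 100 ./xhpl
```

Running the same check on a node with a local disk and on a diskless node, and comparing the two outputs, is a quick way to spot a difference between the two root images.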

Posted by mityh

Thank you very much for your response.
The installation problem seems to be solved. Some time ago I had removed the /var/lib/rpm/__db* files
because I noticed that the rpmq daemon kept starting and taking up CPU time. After I restored
the /var/lib/rpm/__db* files, the installation completed successfully.
The cfg.txt file is as follows:
PSET_LICENSE_FILE=/apps/tool/intel/intel.lic
ACTIVATION=license_file
AUTOMOUNTED_CLUSTER=yes
UPDATE_LDSOCONF=no
REGISTER_IN_SELECTOR=no
CONTINUE_WITH_INSTALLDIR_OVERWRITE=yes
CONTINUE_WITH_OPTIONAL_ERROR=yes
PSET_INSTALL_DIR=/opt/intel/impi/4.0.2.003
INSTALL_MODE=RPM
ACCEPT_EULA=accept
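
Since INSTALL_MODE=RPM makes the installer depend on the RPM database, a small pre-flight check for the symptom described above could look like the sketch below. It only covers the situation reported in this thread (missing /var/lib/rpm/__db* files); on many systems rpm recreates these files on demand, so their absence is not necessarily an error. The function names are hypothetical.

```shell
#!/bin/sh
# Count the Berkeley DB environment files (__db*) in an RPM
# database directory.
count_rpmdb_env_files() {
    dir="${1:-/var/lib/rpm}"
    n=0
    for f in "$dir"/__db*; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Return non-zero when no __db* files are present, so the caller
# can decide whether to look at the RPM database before installing.
preflight_rpmdb() {
    dir="${1:-/var/lib/rpm}"
    if [ "$(count_rpmdb_env_files "$dir")" -eq 0 ]; then
        echo "no __db* files in $dir"
        return 1
    fi
    echo "rpm database environment files present in $dir"
}

# Typical use before the silent install (not executed here):
#   preflight_rpmdb && ./install.sh -s cfg.txt -t /scratch/work/tmp
```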

------------------------------
mpiexec debug info and /etc/dat.conf are attached.
------------------------------
some system information:
[root@c07b03 zws_work]# uname -a
Linux c07b03 2.6.18-194.17.1.0.1.el5_lustre.1.8.5 #1 SMP Mon May 23 22:48:21 CST 2011 x86_64 x86_64 x86_64 GNU/Linux
Infiniband Driver OFED-1.5.3.1
[root@c07b03 lib64]# service openibd status

HCA driver loaded

Configured IPoIB devices:
ib0

Currently active IPoIB devices:

The following OFED modules are loaded:

rdma_ucm
ib_sdp
rdma_cm
ib_addr
ib_ipoib
mlx4_core
mlx4_ib
mlx4_en
ib_mthca
ib_uverbs
ib_umad
ib_sa
ib_cm
ib_mad
ib_core
iw_cxgb3
iw_nes
ib_qib

[root@c07b03 lib64]# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.7.200
Hardware version: a0
Node GUID: 0x002590ffff071b78
System image GUID: 0x002590ffff071b7b
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 311
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x002590ffff071b79
Link layer: IB

[root@c07b03 l_mpi_p_4.0.2.003]# ldconfig -p| grep 'libdaplcma.so'
libdaplcma.so.1 (libc6,x86-64) => /usr/lib64/libdaplcma.so.1
libdaplcma.so (libc6,x86-64) => /usr/lib64/libdaplcma.so

[root@c07b03 lib64]# lsmod
Module Size Used by
autofs4 63240 6
ib_qib 521492 1
dm_mirror 54928 0
dm_log 45312 1 dm_mirror
dm_multipath 57112 0
scsi_dh 42368 1 dm_multipath
dm_mod 102096 3 dm_mirror,dm_log,dm_multipath
video 53260 0
backlight 40064 1 video
sbs 50112 0
power_meter 47244 0
hwmon 36744 1 power_meter
i2c_ec 38784 1 sbs
i2c_core 56832 1 i2c_ec
dell_wmi 37664 0
wmi 42176 1 dell_wmi
button 40736 0
battery 44040 0
asus_acpi 50980 0
acpi_memhotplug 40708 0
ac 38920 0
parport_pc 62504 0
lp 47312 0
parport 73356 2 parport_pc,lp
nfs 296652 1
nfs_acl 36864 1 nfs
fscache 52576 1 nfs
lockd 101744 1 nfs
sunrpc 200264 8 nfs,nfs_acl,lockd
iw_nes 213160 0
iw_cxgb3 111316 0
cxgb3 214896 1 iw_cxgb3
serio_raw 40708 0
pcspkr 36480 0
shpchp 71084 0
sg 70568 0
joydev 44032 0
ib_ipoib 115040 0
ipoib_helper 35728 2 ib_ipoib
ib_mthca 157092 0
mlx4_en 113164 0
mlx4_ib 110140 0
mlx4_core 150472 2 mlx4_en,mlx4_ib
ib_sdp 206588 0
rdma_ucm 49152 0
rdma_cm 73492 2 ib_sdp,rdma_ucm
iw_cm 43656 1 rdma_cm
ib_umad 50600 0
ib_uverbs 75696 1 rdma_ucm
ib_cm 71592 2 ib_ipoib,rdma_cm
ib_sa 76424 4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
ib_mad 72100 6 ib_qib,ib_mthca,mlx4_ib,ib_umad,ib_cm,ib_sa
ib_core 109440 15 ib_qib,iw_nes,iw_cxgb3,ib_ipoib,ib_mthca,mlx4_ib,ib_sdp,rdma_ucm,rdma_cm,iw_cm,ib_umad,ib_uverbs,ib_cm,ib_sa,ib_mad
ib_addr 43016 1 rdma_cm
ipv6 435680 74 ib_ipoib,ib_sdp,rdma_cm,ib_addr
xfrm_nalgo 43524 1 ipv6
crypto_api 43136 1 xfrm_nalgo
ata_piix 57220 0
ext3 169744 1
jbd 104048 1 ext3
uhci_hcd 57624 0
ehci_hcd 66444 0
ohci_hcd 56500 0
ahci 69896 1
libata 209936 2 ata_piix,ahci
sd_mod 61704 2
scsi_mod 198040 4 scsi_dh,sg,libata,sd_mod
igb 123416 0
dca 41412 2 ib_qib,igb
8021q 57616 1 igb

[root@c07b03 lib64]# chkconfig --list
NetworkManager 0:off 1:off 2:off 3:off 4:off 5:off 6:off
acpid 0:off 1:off 2:off 3:off 4:off 5:off 6:off
amd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
anacron 0:off 1:off 2:off 3:off 4:off 5:off 6:off
arptables_jf 0:off 1:off 2:off 3:off 4:off 5:off 6:off
arpwatch 0:off 1:off 2:off 3:off 4:off 5:off 6:off
atd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
auditd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
autofs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
avahi-daemon 0:off 1:off 2:off 3:off 4:off 5:off 6:off
avahi-dnsconfd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
bgpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
bluetooth 0:off 1:off 2:off 3:off 4:off 5:off 6:off
bootparamd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
capi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
conman 0:off 1:off 2:off 3:off 4:off 5:off 6:off
cpuspeed 0:off 1:off 2:off 3:on 4:off 5:on 6:off
crond 0:off 1:off 2:off 3:on 4:off 5:on 6:off
cups 0:off 1:off 2:off 3:off 4:off 5:off 6:off
cyrus-imapd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dc_client 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dc_server 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dhcp6r 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dhcp6s 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dhcpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dhcrelay 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dnsmasq 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dovecot 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dund 0:off 1:off 2:off 3:off 4:off 5:off 6:off
edac 0:off 1:off 2:off 3:off 4:off 5:off 6:off
fcoe 0:off 1:off 2:off 3:off 4:off 5:off 6:off
firstboot 0:off 1:off 2:off 3:off 4:off 5:off 6:off
gpm 0:off 1:off 2:off 3:off 4:off 5:off 6:off
haldaemon 0:off 1:off 2:off 3:off 4:off 5:off 6:off
hidd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
hplip 0:off 1:off 2:off 3:off 4:off 5:off 6:off
httpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
innd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ip6tables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ipmi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ipmievd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ipsec 0:off 1:off 2:off 3:off 4:off 5:off 6:off
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
irda 0:off 1:off 2:off 3:off 4:off 5:off 6:off
irqbalance 0:off 1:off 2:off 3:off 4:off 5:off 6:off
iscsi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
iscsid 0:off 1:off 2:off 3:off 4:off 5:off 6:off
isdn 0:off 1:off 2:off 3:off 4:off 5:off 6:off
kadmin 0:off 1:off 2:off 3:off 4:off 5:off 6:off
kdump 0:off 1:off 2:off 3:off 4:off 5:off 6:off
kprop 0:off 1:off 2:off 3:off 4:off 5:off 6:off
krb524 0:off 1:off 2:off 3:off 4:off 5:off 6:off
krb5kdc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ktune 0:off 1:off 2:off 3:off 4:off 5:off 6:off
kudzu 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ldap 0:off 1:off 2:off 3:off 4:off 5:off 6:off
lisa 0:off 1:off 2:off 3:off 4:off 5:off 6:off
lm_sensors 0:off 1:off 2:off 3:off 4:off 5:off 6:off
lsf 0:off 1:off 2:off 3:off 4:off 5:off 6:off
lvm2-monitor 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mailman 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mcstrans 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mdmonitor 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mdmpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
messagebus 0:off 1:off 2:off 3:off 4:off 5:off 6:off
microcode_ctl 0:off 1:off 2:off 3:off 4:off 5:off 6:off
multipathd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mysqld 0:off 1:off 2:off 3:off 4:off 5:off 6:off
named 0:off 1:off 2:off 3:off 4:off 5:off 6:off
netconsole 0:off 1:off 2:off 3:off 4:off 5:off 6:off
netfs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
netplugd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
network 0:off 1:off 2:on 3:on 4:on 5:on 6:off
nfs 0:off 1:off 2:off 3:off 4:off 5:off 6:off
nfslock 0:off 1:off 2:off 3:on 4:on 5:on 6:off
nscd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ntpd 0:off 1:off 2:off 3:on 4:off 5:on 6:off
openibd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
opensmd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ospf6d 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ospfd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
pand 0:off 1:off 2:off 3:off 4:off 5:off 6:off
pcscd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
portmap 0:off 1:off 2:off 3:on 4:on 5:on 6:off
postgresql 0:off 1:off 2:off 3:off 4:off 5:off 6:off
privoxy 0:off 1:off 2:off 3:off 4:off 5:off 6:off
psacct 0:off 1:off 2:off 3:off 4:off 5:off 6:off
radiusd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
radvd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rarpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rawdevices 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rdisc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
readahead_early 0:off 1:off 2:off 3:off 4:off 5:off 6:off
readahead_later 0:off 1:off 2:off 3:off 4:off 5:off 6:off
restorecond 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rhnsd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ripd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ripngd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rpcgssd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rpcidmapd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rpcsvcgssd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rstatd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rusersd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rwhod 0:off 1:off 2:off 3:off 4:off 5:off 6:off
saslauthd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
sendmail 0:off 1:off 2:off 3:off 4:off 5:off 6:off
setroubleshoot 0:off 1:off 2:off 3:off 4:off 5:off 6:off
smartd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
smb 0:off 1:off 2:off 3:off 4:off 5:off 6:off
snmpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
snmptrapd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
spamassassin 0:off 1:off 2:off 3:off 4:off 5:off 6:off
squid 0:off 1:off 2:off 3:off 4:off 5:off 6:off
sshd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
syslog 0:off 1:off 2:on 3:on 4:on 5:on 6:off
sysstat 0:off 1:off 2:on 3:on 4:off 5:on 6:off
tcsd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
tog-pegasus 0:off 1:off 2:off 3:off 4:off 5:off 6:off
tomcat5 0:off 1:off 2:off 3:off 4:off 5:off 6:off
tux 0:off 1:off 2:off 3:off 4:off 5:off 6:off
uuidd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
vncserver 0:off 1:off 2:off 3:off 4:off 5:off 6:off
vsftpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
watchdog 0:off 1:off 2:off 3:off 4:off 5:off 6:off
wdaemon 0:off 1:off 2:off 3:off 4:off 5:off 6:off
winbind 0:off 1:off 2:off 3:off 4:off 5:off 6:off
wpa_supplicant 0:off 1:off 2:off 3:off 4:off 5:off 6:off
xfs 0:off 1:off 2:on 3:on 4:on 5:on 6:off
xinetd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
ypbind 0:off 1:off 2:off 3:off 4:off 5:off 6:off
yppasswdd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ypserv 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ypxfrd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
yum-updatesd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
zebra 0:off 1:off 2:off 3:off 4:off 5:off 6:off

xinetd based services:
amanda: off
amandaidx: off
amidxtape: off
auth: off
chargen-dgram: off
chargen-stream: off
cvs: off
daytime-dgram: off
daytime-stream: off
discard-dgram: off
discard-stream: off
echo-dgram: off
echo-stream: off
eklogin: off
ekrb5-telnet: off
gssftp: off
klogin: off
krb5-telnet: off
kshell: off
ktalk: off
ntalk: off
rexec: off
rlogin: off
rmcp: off
rsh: off
rsync: off
talk: off
tcpmux-server: off
telnet: off
tftp: off
time-dgram: off
time-stream: off
uucp: off

Attachments:

debuginfo.txt (51.13 KB)
dat.conf (2.58 KB)
Posted by Dmitry Kuzmin (Intel)

First of all, you are running an xhpl binary compiled with Intel MPI 3.2.2:
[0] MPI startup(): Intel MPI Library, Version 3.2.2 Build 20090827
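
The version line quoted above comes from the I_MPI_DEBUG output. A small sketch for pulling it out of a saved log, so that the runtime version can be compared with the installed 4.0.2.003, might look like this. The function name is hypothetical, and the pattern assumes the startup line has exactly the form shown in this thread.

```shell
#!/bin/sh
# Print the distinct Intel MPI versions reported in I_MPI_DEBUG
# startup lines. Reads the named files, or stdin when no file is given.
mpi_runtime_version() {
    sed -n 's/.*MPI startup(): Intel MPI Library, Version \([0-9.]*\).*/\1/p' "$@" | sort -u
}

# Typical use on the attached log (not executed here):
#   mpi_runtime_version debuginfo.txt
```

If the reported version differs from the installed one, the binary was probably built against, or is picking up at run time, a different Intel MPI installation.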

It's not clear why it hangs.

There is a hello_world sample in the test directory. Could you compile it and try to run it?
Your default provider will be OpenIB-mlx4_0-1 (you can set it explicitly with '-env I_MPI_DAPL_PROVIDER OpenIB-mlx4_0-1'). It might be better to use 'ofa-v2-mlx4_0-1', but you need to compare the performance.

You may also try another fabric: -env I_MPI_FABRICS shm:ofa

Regards!
Dmitry
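
The three settings suggested in this reply can be tried one after another. The sketch below only prints the corresponding mpiexec command lines so they can be reviewed first. The machine file and process count are taken from the original post, while the binary name ./hello_world and the helper names are assumptions for illustration.

```shell
#!/bin/sh
# Defaults can be overridden from the environment.
MPIEXEC="${MPIEXEC:-mpiexec}"
MACHINEFILE="${MACHINEFILE:-./nodes}"
NPROCS="${NPROCS:-24}"
PROGRAM="${PROGRAM:-./hello_world}"   # assumed name of the compiled sample

# Print one mpiexec command line.
#   $1 = Intel MPI environment variable, $2 = its value
mpi_cmd() {
    echo "$MPIEXEC -machinefile $MACHINEFILE -n $NPROCS -env I_MPI_DEBUG 100 -env $1 $2 $PROGRAM"
}

# Print the three trials: two DAPL providers and the OFA fabric.
print_trials() {
    mpi_cmd I_MPI_DAPL_PROVIDER OpenIB-mlx4_0-1
    mpi_cmd I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1
    mpi_cmd I_MPI_FABRICS shm:ofa
}

# Typical use (not executed here):
#   print_trials          # review the command lines
#   print_trials | sh     # run them in sequence on a cluster node
```

Keeping I_MPI_DEBUG enabled in every trial makes it possible to confirm from the startup output which provider or fabric was actually selected.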

Posted by mityh

Thanks for your comments.

It works fine after rebuilding the diskless image and reinstalling Intel MPI 4.0.2.

However, I have no idea what was wrong with the old environment.
