Hi!
I have a problem with Altair PBS Pro and Intel MPI. I can launch a task with the mpiexec command on several nodes, but when I try to launch the same task on several nodes under PBS, I get an error.
What I am doing:
1) Starting mpd on the nodes:
qwer@mgr:/mnt/share/piex> cat mpd.hosts
ib-mgr:10
ib-cn01:16
ib-cn02:16
ib-cn03:16
ib-cn04:16
ib-cn05:16
qwer@mgr:/mnt/share/piex> mpdboot -n 6 -f mpd.hosts -r ssh
2) Checking:
qwer@mgr:/mnt/share/piex> mpdtrace
ib-mgr
ib-cn04
ib-cn03
ib-cn02
ib-cn01
ib-cn05
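On a larger ring, comparing the two lists by eye gets error-prone. A minimal sketch of an automated check, assuming the mpdtrace output has been saved to a file first (the function name missing_from_ring and the file ring.txt are only illustrative, not part of Intel MPI):

```shell
#!/bin/bash
# List hosts that appear in an mpd.hosts-style file ("host:ncpus" per line)
# but not in saved `mpdtrace` output (one host name per line).
missing_from_ring() {
    local hosts_file=$1 trace_file=$2
    # Strip the ":ncpus" suffix, then keep only names absent from the trace.
    # -x matches whole lines, -F treats names as fixed strings.
    cut -d: -f1 "$hosts_file" | grep -v -x -F -f "$trace_file"
}
```

For example: mpdtrace > ring.txt; missing_from_ring mpd.hosts ring.txt. No output means every host listed in mpd.hosts is in the ring.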
3) Start the MPI program without PBS:
qwer@mgr:/mnt/share/piex> mpiexec -ppn 10 -n 50 /mnt/share/piex/pi -nolocal
Process 24 on ib-cn04
Process 22 on ib-cn04
Process 13 on ib-mgr [Why is -nolocal ignored?]
Process 29 on ib-cn04
Process 21 on ib-cn04
...
Process 25 on ib-cn04
Process 26 on ib-cn04
Process 36 on ib-cn03
pi = 3.1415926535897931
time = 0.435737 sec.
OK. The task was launched correctly on all nodes.
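Since part of the output is elided above, a per-host tally is a quicker way to see where the ranks landed. A small sketch, assuming the program prints lines of the form "Process N on host" as shown (the name count_per_host is made up for illustration):

```shell
#!/bin/bash
# Read lines such as "Process 24 on ib-cn04" on stdin and print
# "<host> <count>" for each host, sorted by host name.
count_per_host() {
    awk '$1 == "Process" && $3 == "on" { n[$4]++ }
         END { for (h in n) print h, n[h] }' | LC_ALL=C sort
}
```

For example: mpiexec -ppn 10 -n 50 /mnt/share/piex/pi -nolocal | count_per_host. Lines that do not match the pattern, such as the pi and time lines, are ignored.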
4) Make a job file for PBS:
qwer@mgr:/mnt/share/piex> cat test.job
#!/bin/bash
#PBS -q long
#PBS -l nodes=5:ppn=10,mem=100mb,walltime=1:30:00
#PBS -S /bin/bash
#PBS -N piex
echo " Start date:`/bin/date`"
mpiexec -ppn 10 -n 50 /mnt/share/piex/pi -nolocal
echo " End date:`/bin/date`"
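The job file requests nodes=5:ppn=10 and the mpiexec line asks for -n 50, so the two agree (5 x 10 = 50). A small sketch of that arithmetic, assuming the nodes=N:ppn=M syntax used in the #PBS -l line above (ranks_from_resource is a hypothetical helper, not a PBS command):

```shell
#!/bin/bash
# Compute nodes * ppn from a resource string such as
# "nodes=5:ppn=10,mem=100mb,walltime=1:30:00".
ranks_from_resource() {
    local spec=$1 nodes ppn
    # Keep only the first comma-separated field, e.g. "nodes=5:ppn=10".
    spec=${spec%%,*}
    nodes=$(printf '%s\n' "$spec" | sed -n 's/.*nodes=\([0-9][0-9]*\).*/\1/p')
    ppn=$(printf '%s\n' "$spec" | sed -n 's/.*ppn=\([0-9][0-9]*\).*/\1/p')
    # Fail if either value is missing from the string.
    if [ -z "$nodes" ] || [ -z "$ppn" ]; then
        return 1
    fi
    echo $(( nodes * ppn ))
}
```

For example: ranks_from_resource "nodes=5:ppn=10,mem=100mb,walltime=1:30:00" prints the rank count to pass to mpiexec -n.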
5) Start the MPI program with PBS:
qwer@mgr:/mnt/share/piex> qsub test.job
673.mgr
6) Where is my job?
qwer@mgr:/mnt/share/piex> qstat
7) What happened?
qwer@mgr:/mnt/share/piex> cat piex.o673
Start date:Tue Oct 27 13:55:47 VLAT 2009
mpiexec_mgr: cannot connect to local mpd (/tmp/pbs.673.mgr/mpd2.console_mgr_qwer); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
End date:Tue Oct 27 13:55:47 VLAT 2009
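The error names a console file under /tmp/pbs.673.mgr, which looks like a per-job temporary directory rather than the place where the ring started in step 1 would have put its console. That is only a guess from the path. A sketch to check it, assuming the console file name follows the pattern seen in the error message (<tmpdir>/mpd2.console_<host>_<user>) and that the interactively started ring uses /tmp (both are assumptions; the function names are illustrative):

```shell
#!/bin/bash
# Build the console path in the form seen in the error message:
#   <tmpdir>/mpd2.console_<host>_<user>
mpd_console_path() {
    local tmpdir=$1 host=$2 user=$3
    printf '%s/mpd2.console_%s_%s\n' "$tmpdir" "$host" "$user"
}

# Succeed if a console file exists at that path.
console_exists() {
    [ -e "$(mpd_console_path "$1" "$2" "$3")" ]
}
```

Inside the job, something like this would show which directory actually holds a console: for d in "${TMPDIR:-/tmp}" /tmp; do console_exists "$d" mgr qwer && echo "found in $d" || echo "missing in $d"; done. If the console only shows up in /tmp, the job and the ring disagree about the temporary directory.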
8) Is mpd really not running?
qwer@mgr:/mnt/share/piex> mpdtrace -l
ib-mgr_60696 (10.10.0.1)
ib-cn04_41952 (10.10.0.14)
ib-cn03_43736 (10.10.0.13)
ib-cn02_45542 (10.10.0.12)
ib-cn01_52394 (10.10.0.11)
ib-cn05_44083 (10.10.0.15)
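The trailing number in each entry appears to be the port that mpd listens on, with the address in parentheses (treat that reading of the mpdtrace -l layout as an assumption). A small sketch that splits these lines into columns so the ports are easier to compare, with parse_mpdtrace_l as an illustrative name:

```shell
#!/bin/bash
# Convert lines such as "ib-mgr_60696 (10.10.0.1)" read from stdin
# into "<host> <port> <address>". Lines in any other form are dropped.
parse_mpdtrace_l() {
    sed -n 's/^\(.*\)_\([0-9][0-9]*\) (\(.*\))$/\1 \2 \3/p'
}
```

For example: mpdtrace -l | parse_mpdtrace_l.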
What else I have tried:
a) Set an environment variable:
qwer@mgr: I_MPI_CPUINFO=/proc/cpuinfo
Result: no change.
b) Tried to find the port that PBS uses to connect to mpd. I think PBS is looking for the mpd daemon on the wrong port.
What is the cause of my problem?
About my system:
mgr:~ # cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 1
qwer@mgr:/mnt/share/piex> mpiexec -V
Intel(R) MPI Library for Linux, 64-bit applications, Version 3.2.1 Build 20090312
Copyright (C) 2003-2009 Intel Corporation. All rights reserved.
mgr:~ # qstat -Bf
Server: mgr
server_state = Active
server_host = extmgr.hp
scheduling = True
total_jobs = 1
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0 Begun:0
acl_roots = foo,root@mgr
default_queue = workq
log_events = 511
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
resources_assigned.mem = 0kb
resources_assigned.ncpus = 1
resources_assigned.nodect = 1
scheduler_iteration = 600
FLicenses = 95
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_file_location = 7788@mgr
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 3600
license_count = Avail_Global:95 Avail_Local:0 Used:1 High_Use:96
pbs_version = PBSPro_10.0.0.82981
eligible_time_enable = False
qwer@mgr:/mnt/share/piex> cpuinfo
Architecture : x86_64
Hyperthreading: disabled
Packages : 4
Cores : 16
Processors : 16
===== Processor identification =====
Processor Thread Core Package
0 0 0 0
1 0 0 2
2 0 0 4
3 0 0 6
4 0 1 0
5 0 1 2
6 0 1 4
7 0 1 6
8 0 2 0
9 0 2 2
10 0 2 4
11 0 2 6
12 0 3 0
13 0 3 2
14 0 3 4
15 0 3 6
===== Processor placement =====
Package Cores Processors
0 0,1,2,3 0,4,8,12
2 0,1,2,3 1,5,9,13
4 0,1,2,3 2,6,10,14
6 0,1,2,3 3,7,11,15
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 4 MB (0,4)(1,5)(2,6)(3,7)(8,12)(9,13)(10,14)(11,15)