Intel IMB 3.2.3 benchmarks stall with Intel MPI v4.0.3.008

Hello,

I am trying to run the IMB v3.2.3 MPI benchmarks on an x86_64 cluster, but when I use the Intel MPI 4.0.3.008 stack, IMB-MPI1 stalls whenever it runs on more than one node.

Specifically, I built the IMB-* binaries using the Intel MPI 4.0.3.008 stack and then use mpiexec.hydra to launch them on the requested nodes. The binaries launch on all nodes and seem to be consuming CPU time, but there is no progress. After logging on to one of the nodes I ran "strace -f -p PID", and this is what it returned:

[pid 8624] sched_yield() = 0
[pid 8624] sched_yield() = 0
[pid 8624] sched_yield() = 0
[pid 8624] sched_yield() = 0
[pid 8624] sched_yield() = 0
....

Eventually the program is killed by the system after it exceeds the requested wall-clock time. Is there any known issue with Intel MPI 4.0.3.008 that may cause MPI code to stall?
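For reference, this is roughly how I attached strace to all the benchmark ranks on a node at once (the process name match and log file names here are just illustrative):

```shell
# Attach strace to every IMB-MPI1 rank running on this node
# (log file names are illustrative; adjust the pattern for other binaries).
for pid in $(pgrep IMB-MPI1); do
    strace -f -p "$pid" -o "strace_rank_${pid}.log" &
done
wait
```

Every log showed the same endless sched_yield() loop as above.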

Details

System: iDataPlex, mixed Nehalem/Westmere dx360-M2/-M3 nodes, QDR Voltaire switch, 24 GiB/node

OFED v1.4.2

$ lsb_release -a
LSB Version: :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 5.4 (Tikanga)
Release: 5.4
Codename: Tikanga

$ mpiicc -v
mpiicc for the Intel MPI Library 4.0 Update 3 for Linux*
Copyright(C) 2003-2011, Intel Corporation. All rights reserved.
Version 11.1

Snippet of execution script

This is passed to PBS after requesting the appropriate number of nodes/cores, etc.

module load intel/compilers
module load mtintel-mpi_4.0.3.008

export I_MPI_DEBUG=3
export I_MPI_TIMER_KIND=rdtsc

export I_MPI_FABRICS=shm:ofa

export I_MPI_PIN=1
export I_MPI_MODE=lib

export TESTS1="IMB-MPI1";

## NOTE: M = 2 or 4 or 8 (Num of Nodes)
## N = cores/node x M

# if M = 1 : single node MPI experiments
if [ $M -eq 1 ]; then
## skipped the single-node experiments logic
....
else
# M >= 2 : multi node MPI experiments
# Scatter to nodes

export I_MPI_PIN_PROCESSOR_LIST="all:map=scatter"

# 2 < MPI ranks
for osu_bin in $TESTS1; do
echo "## Starting TESTS1 $osu_bin at ........ $(date)" ;
mpiexec.hydra -print-rank-map -l -rr -np $N \
    -genvlist I_MPI_PIN_PROCESSOR_LIST,I_MPI_FABRICS \
    $R/impi_4.0.3.008/$B/$osu_bin $OPT > ${J}_impi_4.0.3.008_scatter_${osu_bin}.out ;
done

fi
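For reference, N above is just cores/node times the node count M, as the comments say; a trivial sketch of that arithmetic:

```shell
# Total MPI ranks N = number of nodes (M) x cores per node (PPN),
# e.g. 2 Westmere nodes with 12 cores each give 24 ranks.
M=2
PPN=12
N=$((M * PPN))
echo "$N"
```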

R/D High-Performance Computing and Engineering

Hey Mike,

To answer your question: no, there is no known issue that would cause the library to stall. This is probably a problem with Intel MPI trying to connect to the other node.

A few things:

  • If you do "export I_MPI_FABRICS=shm:tcp", does this work?
  • You set I_MPI_DEBUG to 3. What's the debug output from that? It'll most likely be saved in your .out file.
  • Finally, can I correctly assume that running on a single node is fine?
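For the first check, a minimal sketch (the rank count, binary path, and output file name are placeholders for your setup):

```shell
# Quick fallback check: run over TCP instead of the OFA/InfiniBand path.
# If this completes, the hang is specific to the OFA fabric setup.
export I_MPI_FABRICS=shm:tcp
export I_MPI_DEBUG=3
mpiexec.hydra -np 24 ./IMB-MPI1 > imb_shm_tcp.out 2>&1
```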

Thanks and regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com

Hi Gergana,

long time no see... :)

Single node runs are fine.

Output from I_MPI_DEBUG=3:

1) A 2-node run with 12 cores/node and 24 ranks:

(node318:0,2,4,6,8,10,12,14,16,18,20,22)
(node328:1,3,5,7,9,11,13,15,17,19,21,23)
[4] [4] MPI startup(): shm and ofa data transfer modes
[6] [6] MPI startup(): shm and ofa data transfer modes
[16] [16] MPI startup(): shm and ofa data transfer modes
[10] [10] MPI startup(): shm and ofa data transfer modes
[0] [0] MPI startup(): shm and ofa data transfer modes
[22] [22] MPI startup(): shm and ofa data transfer modes
[8] [8] MPI startup(): shm and ofa data transfer modes
[12] [12] MPI startup(): shm and ofa data transfer modes
[14] [14] MPI startup(): shm and ofa data transfer modes
[2] [2] MPI startup(): shm and ofa data transfer modes
[18] [18] MPI startup(): shm and ofa data transfer modes
[20] [20] MPI startup(): shm and ofa data transfer modes

and stops there until PBS kills it when it hits the wallclock limit.

2) An 8-node run with 12 cores/node and 96 ranks:

(node342:0,8,16,24,32,40,48,56,64,72,80,88)
(node343:1,9,17,25,33,41,49,57,65,73,81,89)
(node347:2,10,18,26,34,42,50,58,66,74,82,90)
(node348:3,11,19,27,35,43,51,59,67,75,83,91)
(node351:4,12,20,28,36,44,52,60,68,76,84,92)
(node352:5,13,21,29,37,45,53,61,69,77,85,93)
(node355:6,14,22,30,38,46,54,62,70,78,86,94)
(node356:7,15,23,31,39,47,55,63,71,79,87,95)
[8] [8] MPI startup(): shm and ofa data transfer modes
[64] [64] MPI startup(): shm and ofa data transfer modes
[56] [56] MPI startup(): shm and ofa data transfer modes
[80] [80] MPI startup(): shm and ofa data transfer modes
[72] [72] MPI startup(): shm and ofa data transfer modes
[88] [88] MPI startup(): shm and ofa data transfer modes
[32] [32] MPI startup(): shm and ofa data transfer modes
[48] [48] MPI startup(): shm and ofa data transfer modes
[16] [16] MPI startup(): shm and ofa data transfer modes
[40] [40] MPI startup(): shm and ofa data transfer modes
[24] [24] MPI startup(): shm and ofa data transfer modes
[0] [0] MPI startup(): shm and ofa data transfer modes
Ctrl-C caught... cleaning up processes

I am waiting for the shm:tcp runs to see what happens there...

thanks for the reply
mike

R/D High-Performance Computing and Engineering

With export I_MPI_FABRICS="shm:tcp" I got

$ cat IMB_2x12--24_407895_impi_4.0.3.008_scatter_IMB-MPI1.out
(node329:0,2,4,6,8,10,12,14,16,18,20,22)
(node330:1,3,5,7,9,11,13,15,17,19,21,23)
[12] [12] MPI startup(): shm and tcp data transfer modes
[16] [16] MPI startup(): shm and tcp data transfer modes
[20] [20] MPI startup(): shm and tcp data transfer modes
[4] [4] MPI startup(): shm and tcp data transfer modes
[8] [8] MPI startup(): shm and tcp data transfer modes
[18] [18] MPI startup(): shm and tcp data transfer modes
[22] [22] MPI startup(): shm and tcp data transfer modes
[6] [6] MPI startup(): shm and tcp data transfer modes
[14] [14] MPI startup(): shm and tcp data transfer modes
[0] [0] MPI startup(): shm and tcp data transfer modes
[2] [2] MPI startup(): shm and tcp data transfer modes
[10] [10] MPI startup(): shm and tcp data transfer modes

and it is still hanging there.

R/D High-Performance Computing and Engineering

I was trying to attach this as a file but it failed, so I am pasting the output here showing what is going on on the two nodes in the above experiment. Basically, it shows that the code has launched on both nodes but is spinning idly.

$ jrpt -d 407895.login006

## jrpt $Revision: 1.4 $ : a resource usage analysis tool for PBS/Torque-Maui Jobs on RHEL Linux
## (C) 2010 Michael E. Thomadakis for Texas A&M University
## Note: Time unit is second, Memory unit KiB (1024 B);
## a '-1' or '?' means user did not provide.

## ====================================================================================================== #
## Section 1 : Parsed Torque/PBS Section for job 407895.login006 #
## ====================================================================================================== #
## euser= miket , Job_Name= IMB , job_state= R , currently on queue= medium
# --- Requested Resources (Resource_List) -------------
# walltime = 12300 (wall-clock time, sec)
# nodect = 2 (cluster nodes)
# ppn = 1 (processes per node)
# ncpus = -1 (processors)
# mem = 46137344 (physical memory, KiB)
# neednodes = ? (PBS nspec)
# nodes = 2:ppn=1:westmere (PBS spec)
# x = ? (special PBS conditions)
# --- Allocated Resources (resources_used) ------------
# cput = 36838 (total CPU time, sec)
# walltime = 2694 (wall-clock time elapsed, sec)
# mem = 5878584 (total physical memory, KiB)
# vmem = 10250096 (total virtual memory, KiB)
# Walltime.Remaining = 9566 (PBS, 10 sec ?)
# Remaining Walltime = 9606 (calculated, sec)
## Host node list = [ node329 node330 ]

## ====================================================================================================== #
## Section 2 : Per Node Report for job 407895.login006 #
## ====================================================================================================== #
# Total Nthreads VM_total SW_total RSS_total Pmem_total Total_cput Pcpu_total
## ---------------------------------------------------------------------------- #
#+S node329 SysLoad=[ 12.00 11.98 11.33 13/467 21545 ]
# ............................................................................. #
## PhysMem (KiB) [total= 24659680 free= 23371684 used= 1287996]
## PhysCores :[0,(0, 0)], [1,(0, 1)], [2,(0, 2)], [3,(0, 8)], [4,(0, 9)], [5,(0, 10)], [6,(1, 0)], [7,(1, 1)], [8,(1, 2)], [9,(1, 8)], [10,(1, 9)], [11,(1, 10)], NumCores= 12
sched_getaffinity: No such process
failed to get pid 21548's affinity
## ..................... Resources Consumed by Process ........................ #
## Host PPID PID NLWP TID PRI NI S SCH F CLS SZ VSZ SZ RSS %MEM TIME %CPU COMMAND CPUID AffSet ## D level= 1
node329 21458 21460 1 21460 24 0 S 0 5 TS 23560 94240 804 1816 0.0 0:00 0.0 sshd 6 0-11
node329 21460 21461 1 21461 21 0 S 0 0 TS 16480 65920 276 1328 0.0 0:00 0.0 bash 0 0-11
node329 21461 21543 1 21543 20 0 S 0 0 TS 16480 65920 276 1136 0.0 0:00 0.0 sysjstate.s 0 0-11
node329 21543 21548 1 21548 19 0 R 0 0 TS 16396 65584 596 904 0.0 0:00 0.0 ps 0 0-11
node329 5162 21031 1 21031 14 0 S 0 4 TS 17132 68528 792 1428 0.0 0:00 0.0 bash 11 0-11
node329 21031 21032 1 21032 24 0 S 0 0 TS 3340 13360 1312 816 0.0 0:00 0.0 pbs_demux 2 0-11
node329 21031 21118 1 21118 14 0 S 0 0 TS 16480 65920 276 1132 0.0 0:00 0.0 407895.login0 1 0-11
node329 21118 21119 1 21119 21 0 S 0 0 TS 16608 66432 788 1240 0.0 0:00 0.0 IMB_3.2.3.s 1 0-11
node329 21119 21421 1 21421 24 0 S 0 0 TS 3935 15740 2824 1264 0.0 0:00 0.0 mpiexec.h 3 0-11
node329 21421 21422 1 21422 22 0 S 0 0 TS 3407 13628 796 1128 0.0 0:00 0.0 pmi_pro 6 0-11
node329 21422 21424 1 21424 14 0 R 0 0 TS 32473 129892 616 2760 0.0 11:59 99.9 IMB-M 0 0
node329 21422 21425 1 21425 14 0 S 0 0 TS 32473 129892 616 2752 0.0 11:59 99.9 IMB-M 6 6
node329 21422 21426 1 21426 14 0 R 0 0 TS 32473 129892 616 2756 0.0 11:59 99.9 IMB-M 1 1
node329 21422 21427 1 21427 14 0 R 0 0 TS 32473 129892 616 2760 0.0 11:59 99.9 IMB-M 7 7
node329 21422 21428 1 21428 14 0 R 0 0 TS 32473 129892 616 2760 0.0 11:59 99.9 IMB-M 2 2
node329 21422 21429 1 21429 14 0 R 0 0 TS 32473 129892 616 2760 0.0 11:59 99.9 IMB-M 8 8
node329 21422 21430 1 21430 14 0 R 0 0 TS 32473 129892 616 2756 0.0 11:59 99.9 IMB-M 3 3
node329 21422 21431 1 21431 14 0 S 0 0 TS 32473 129892 616 2756 0.0 11:59 99.9 IMB-M 9 9
node329 21422 21432 1 21432 14 0 R 0 0 TS 32473 129892 616 2756 0.0 11:59 99.9 IMB-M 4 4
node329 21422 21433 1 21433 14 0 S 0 0 TS 32473 129892 616 2756 0.0 11:59 99.9 IMB-M 10 10
node329 21422 21434 1 21434 14 0 R 0 0 TS 32473 129892 616 2756 0.0 11:59 99.9 IMB-M 5 5
node329 21422 21435 1 21435 14 0 R 0 0 TS 32473 129892 616 2756 0.0 11:59 99.9 IMB-M 11 11
node329 21421 21423 1 21423 22 0 S 0 0 TS 14588 58352 712 3236 0.0 0:00 0.0 ssh 2 0-11
#+U node329 20 2152328 16844 48512 0.00 8628.00 1198.80
# .................... Logical Process Tree ................................... #
bash,21031
407895.login006,21118 /var/spool/torque/mom_priv/jobs/407895.login006.sc.cluster.tamu.SC
IMB_3.2.3.sh,21119 ./IMB_3.2.3.sh 2 12 24
mpiexec.hydra,21421 -print-rank-map -l -rr -np 24 -genvlist I_MPI_PIN_PROCESSOR_LIST,I_MPI_FABRICS/scratch/miket/cs691/performance/benchm
pmi_proxy,21422 --control-port node329:57756 --pmi-connect lazy-cache--pmi-agg
IMB-MPI1,21424 -includePingPon
IMB-MPI1,21425 -includePingPon
IMB-MPI1,21426 -includePingPon
IMB-MPI1,21427 -includePingPon
IMB-MPI1,21428 -includePingPon
IMB-MPI1,21429 -includePingPon
IMB-MPI1,21430 -includePingPon
IMB-MPI1,21431 -includePingPon
IMB-MPI1,21432 -includePingPon
IMB-MPI1,21433 -includePingPon
IMB-MPI1,21434 -includePingPon
IMB-MPI1,21435 -includePingPon
ssh,21423 -x -q node330 /scratch/miket/software/intelXE/impi/4.0.3.008/intel64/bin/pmi_proxy --control-port node329:57756--pmi-c
pbs_demux,21032

sshd,21460
bash,21461 -c cd\040/g/home/miket/SC/cluster/PBS/JR;\040export\040D=1;\040./sysjstate.sh\040miket\040-U\040miket\040-H\040\040-ww
sysjstate.sh,21543 ./sysjstate.sh miket -U miket -H -ww
pstree,21573 -a -U -c -h -l -p miket -G
## ---------------------------------------------------------------------------- #
#+S node330 SysLoad=[ 11.99 11.97 11.61 13/502 20879 ]
# ............................................................................. #
## PhysMem (KiB) [total= 24659680 free= 23165080 used= 1494600]
## PhysCores :[0,(0, 0)], [1,(0, 1)], [2,(0, 2)], [3,(0, 8)], [4,(0, 9)], [5,(0, 10)], [6,(1, 0)], [7,(1, 1)], [8,(1, 2)], [9,(1, 8)], [10,(1, 9)], [11,(1, 10)], NumCores= 12
sched_getaffinity: No such process
failed to get pid 20882's affinity
## ..................... Resources Consumed by Process ........................ #
## Host PPID PID NLWP TID PRI NI S SCH F CLS SZ VSZ SZ RSS %MEM TIME %CPU COMMAND CPUID AffSet ## D level= 1
node330 20792 20794 1 20794 21 0 S 0 5 TS 23560 94240 804 1820 0.0 0:00 0.0 sshd 6 0-11
node330 20794 20795 1 20795 14 0 S 0 0 TS 16485 65940 276 1324 0.0 0:00 0.0 bash 6 0-11
node330 20795 20877 1 20877 14 0 S 0 0 TS 16485 65940 276 1140 0.0 0:00 0.0 sysjstate.s 6 0-11
node330 20877 20882 1 20882 14 0 R 0 0 TS 16401 65604 596 896 0.0 0:00 0.0 ps 0 0-11
node330 20877 20883 1 20883 14 0 S 0 0 TS 15979 63916 280 836 0.0 0:00 0.0 gawk 7 0-11
node330 20617 20619 1 20619 21 0 S 0 5 TS 23560 94240 804 1816 0.0 0:00 0.0 sshd 3 0-11
node330 20619 20620 1 20620 24 0 S 0 0 TS 4966 19864 748 1340 0.0 0:00 0.0 pmi_proxy 5 0-11
node330 20620 20702 1 20702 14 0 R 0 0 TS 32472 129888 612 2780 0.0 11:59 99.9 IMB-MPI1 0 0
node330 20620 20703 1 20703 14 0 S 0 0 TS 32472 129888 612 2772 0.0 11:59 99.9 IMB-MPI1 6 6
node330 20620 20704 1 20704 14 0 S 0 0 TS 32472 129888 612 2776 0.0 11:59 99.9 IMB-MPI1 1 1
node330 20620 20705 1 20705 14 0 S 0 0 TS 32472 129888 612 2776 0.0 11:59 99.9 IMB-MPI1 7 7
node330 20620 20706 1 20706 14 0 S 0 0 TS 32472 129888 612 2780 0.0 11:59 99.9 IMB-MPI1 2 2
node330 20620 20707 1 20707 14 0 R 0 0 TS 32472 129888 612 2776 0.0 11:59 99.9 IMB-MPI1 8 8
node330 20620 20708 1 20708 14 0 R 0 0 TS 32472 129888 612 2776 0.0 11:59 99.9 IMB-MPI1 3 3
node330 20620 20709 1 20709 14 0 S 0 0 TS 32472 129888 612 2776 0.0 11:59 99.9 IMB-MPI1 9 9
node330 20620 20710 1 20710 14 0 R 0 0 TS 32472 129888 612 2776 0.0 11:59 99.9 IMB-MPI1 4 4
node330 20620 20711 1 20711 14 0 S 0 0 TS 32472 129888 612 2780 0.0 11:59 99.9 IMB-MPI1 10 10
node330 20620 20712 1 20712 14 0 S 0 0 TS 32472 129888 612 2776 0.0 11:59 99.9 IMB-MPI1 5 5
node330 20620 20713 1 20713 14 0 R 0 0 TS 32472 129888 612 2764 0.0 11:59 99.9 IMB-MPI1 11 11
#+U node330 15 2028400 11128 42480 0.00 8628.00 1198.80
# .................... Logical Process Tree ................................... #
sshd,20619
pmi_proxy,20620 --control-port node329:57756 --pmi-connect lazy-cache--pmi-agg
IMB-MPI1,20702 -includePingPon
IMB-MPI1,20703 -includePingPon
IMB-MPI1,20704 -includePingPon
IMB-MPI1,20705 -includePingPon
IMB-MPI1,20706 -includePingPon
IMB-MPI1,20707 -includePingPon
IMB-MPI1,20708 -includePingPon
IMB-MPI1,20709 -includePingPon
IMB-MPI1,20710 -includePingPon
IMB-MPI1,20711 -includePingPon
IMB-MPI1,20712 -includePingPon
IMB-MPI1,20713 -includePingPon

sshd,20794
bash,20795 -c cd\040/g/home/miket/SC/cluster/PBS/JR;\040export\040D=1;\040./sysjstate.sh\040miket\040-U\040miket\040-H\040\040-ww
sysjstate.sh,20877 ./sysjstate.sh miket -U miket -H -ww
pstree,20903 -a -U -c -h -l -p miket -G

## =========================================================================================================== #
## Section 3 : Aggregate Report for job 407895.login006 across all 2 nodes #
## =========================================================================================================== #
# Total Nthreads VM_total SW_total RSS_total Pmem_total Total_cput Pcpu_total
# --------------------------------------------------------------------------------------------------------
#++ Measured JR 35 4180728 27972 90992 0.00 17256.00 2397.60
#++ Requested 2 4180728 90992 46137344 200.00 5388.00 200.00
#++ Utilization % 1750.000 100.000 30.741 0.197 0.000 320.267 1198.800
#++ Measured PBS -1 10250096 -1 5878584 -1.00 36838.00 -1.00
# --------------------------------------------------------------------------------------------------------
miket@login003[pts/8]IMB_3.2.3_results $

R/D High-Performance Computing and Engineering

When PBS killed the job for exceeding its wall-clock limit, this is where mpiexec.hydra was:

[proxy:0:0@node212] [mpiexec@node212] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@node212] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@node212] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@node212] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion

I can tell that the processes associated with the first node are responding, but none of the processes on the other nodes responds during initialization.

R/D High-Performance Computing and Engineering

Hello,

Finally, the initialization stall was resolved after I removed the following part

-genvlist I_MPI_PIN_PROCESSOR_LIST,I_MPI_FABRICS

from

	mpiexec.hydra -print-rank-map -l -np $N \
	    -genvlist I_MPI_PIN_PROCESSOR_LIST,I_MPI_FABRICS \
	    $R/impi_4.0.3.008/$B/$osu_bin $OPT > ${J}_impi_4.0.3.008_scatter_${osu_bin}.out ;

to look like

	mpiexec.hydra -print-rank-map -l -rr -np $N \
	    $R/impi_4.0.3.008/$B/$osu_bin $OPT > ${J}_impi_4.0.3.008_scatter_${osu_bin}.out ;
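If the pinning and fabric settings are still wanted without -genvlist, passing each variable individually with -genv may work as an alternative (untested sketch; values match the script above):

```shell
mpiexec.hydra -print-rank-map -l -rr -np $N \
    -genv I_MPI_FABRICS shm:ofa \
    -genv I_MPI_PIN_PROCESSOR_LIST all:map=scatter \
    $R/impi_4.0.3.008/$B/$osu_bin $OPT > ${J}_impi_4.0.3.008_scatter_${osu_bin}.out
```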

R/D High-Performance Computing and Engineering

The problem was resolved using the workaround mentioned in the previous post.

Michael

R/D High-Performance Computing and Engineering
