MPI - core assignment/core utilization/process manager difficulties

I am having trouble running my Fortran MPI program on a cluster using PBSPro_11.3.0.121723.

When I execute my script (selected lines shown here):

#PBS -l select=1:ncpus=8

mpirun -n 8 -env I_MPI_DEBUG 5 ./xhpl_intel64

The scheduler allocates 8 cores for my program. However, if I ssh into the node and use top, I can see that 4 MPI processes get a core each and the last 4 processes share a single core, which gives very bad performance.
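A minimal sketch, assuming a Linux compute node where /proc is readable, of how the cores each MPI process is allowed to run on could be listed directly, since top shows CPU usage but not the affinity mask. The helper names `parse_cpu_list` and `allowed_cpus` are hypothetical, and reading `Cpus_allowed_list` from `/proc/<pid>/status` is an assumption about the node, not something taken from the original job script.

```python
import sys
from pathlib import Path


def parse_cpu_list(text):
    """Turn a Linux CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(pid):
    """Read the affinity mask of one process from /proc (Linux only)."""
    status = Path(f"/proc/{pid}/status").read_text()
    for line in status.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpu_list(line.split(":", 1)[1])
    return set()


if __name__ == "__main__":
    # Hypothetical usage: pass the PIDs of the xhpl_intel64 ranks seen in top.
    for pid in sys.argv[1:]:
        print(pid, sorted(allowed_cpus(int(pid))))
```

If several ranks report the same allowed set while other allocated cores appear in no rank's set, that would be consistent with the sharing seen in top.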

The weird thing is that this does not happen with the Intel MPI Library 4.0.0.028 runtime.

It also does not happen when the program is executed outside the batch queue.

It does happen with Intel MPI Library 4.0.1 and later.

I notice that the < 4.0.1 runtime does not complain about

Setting I_MPI_PROCESS_MANAGER=mpd and adding a machinefile ($PBS_NODEFILE) lets the previously mentioned 8-core run reach close to 100% CPU. However, if I run one 8-core job, then submit a 16-core job (thus placing its first 8 cores on the same node as the first job and its next 8 cores on another node), and follow up with another 8-core job, then the last job and the last 8 cores of the 16-core job are placed on the same cores.

Best regards

Jesper


Hi Jesper,

Are you using cpuset?  What is the most recent version of the Intel® MPI Library that you've tested?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

As far as I can see, the PBS system does not use cpuset (it launches pbs_mom, not pbs_mom.cpuset).

I have tried the following library versions.

These worked (however, the software I intend to run is compiled against a newer MPI library):

4.0.0.027

4.0.0.028

-----------------

These fail:

4.0.1.007

4.0.2.003

4.1.0.024

I have tried with both mpd and hydra.

Hi Jesper,

Please send me the debug output (I_MPI_DEBUG=5) from 4.1.0.024.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

I submitted the three jobs as described above and got the following output: first qstat, then top on the two nodes, then the job output.

qstat

489606.pbsserve s2874169 gccm     linp_8     124701   1   8    --  48:00 R 00:00 n041[0]/0*8

489607.pbsserve s2874169 gccm     linp_8     124837   2  16    --  48:00 R 00:00 n041[0]/1*8+n042[0]/0*8

489608.pbsserve s2874169 gccm     linp_8      77214   1   8    --  48:00 R 00:00 n042[0]/1*8

 

top

[s2874169@n041 ~]$ top

top - 18:46:17 up 38 days,  4:17,  1 user,  load average: 8.12, 2.12, 0.71

Tasks: 486 total,  17 running, 467 sleeping,   0 stopped,   2 zombie

Cpu(s): 11.6%us,  0.2%sy,  0.0%ni, 88.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Mem:  32826988k total, 29043736k used,  3783252k free,   167704k buffers

Swap:  8191992k total,      496k used,  8191496k free, 25257036k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

124804 s2874169  20   0  112m  12m 2388 R 98.9  0.0   0:45.31 xhpl_intel64_1

124953 s2874169  20   0  113m  13m 2420 R 98.9  0.0   0:37.90 xhpl_intel64_2

124802 s2874169  20   0  112m  12m 2380 R 97.0  0.0   0:45.23 xhpl_intel64_1

124805 s2874169  20   0  112m  12m 2416 R 97.0  0.0   0:45.23 xhpl_intel64_1

124806 s2874169  20   0  112m  12m 2384 R 97.0  0.0   0:45.14 xhpl_intel64_1

124808 s2874169  20   0  112m  12m 2408 R 97.0  0.0   0:45.27 xhpl_intel64_1

124954 s2874169  20   0  113m  13m 2436 R 97.0  0.0   0:37.87 xhpl_intel64_2

124955 s2874169  20   0  113m  13m 2456 R 97.0  0.0   0:37.84 xhpl_intel64_2

124956 s2874169  20   0  113m  13m 2436 R 97.0  0.0   0:37.84 xhpl_intel64_2

124957 s2874169  20   0  113m  13m 2424 R 97.0  0.0   0:37.85 xhpl_intel64_2

124958 s2874169  20   0  113m  13m 2444 R 97.0  0.0   0:37.89 xhpl_intel64_2

124959 s2874169  20   0  113m  13m 2456 R 97.0  0.0   0:37.86 xhpl_intel64_2

124960 s2874169  20   0 1222m 1.1g 2824 R 97.0  3.5   0:37.88 xhpl_intel64_2

124803 s2874169  20   0  112m  12m 2404 R 95.1  0.0   0:45.32 xhpl_intel64_1

124807 s2874169  20   0  112m  12m 2420 R 95.1  0.0   0:45.26 xhpl_intel64_1

124809 s2874169  20   0 1221m 1.1g 2784 R 95.1  3.5   0:45.15 xhpl_intel64_1

125220 s2874169  20   0 17360 1528  904 R  3.8  0.0   0:00.02 top

[s2874169@n042 ~]$ top

top - 18:46:34 up 38 days,  3:37,  0 users,  load average: 9.61, 2.69, 0.92

Tasks: 480 total,  17 running, 462 sleeping,   0 stopped,   1 zombie

Cpu(s):  5.7%us,  0.1%sy,  0.0%ni, 94.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Mem:  32826988k total, 27297112k used,  5529876k free,   163004k buffers

Swap:  8191992k total,        0k used,  8191992k free, 24774616k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

 77314 s2874169  20   0  121m  13m 2428 R 83.7  0.0   0:54.96 xhpl_intel64_2

 77320 s2874169  20   0  121m  13m 2452 R 83.7  0.0   0:49.40 xhpl_intel64_2

 77313 s2874169  20   0  122m  13m 2456 R 82.1  0.0   0:54.95 xhpl_intel64_2

 77315 s2874169  20   0  121m  13m 2444 R 82.1  0.0   0:54.99 xhpl_intel64_2

 77394 s2874169  20   0  112m  12m 2408 R 82.1  0.0   0:51.32 xhpl_intel64_3

 77396 s2874169  20   0 1221m 1.1g 2776 R 82.1  3.5   0:43.57 xhpl_intel64_3

 77316 s2874169  20   0  121m  13m 2436 R 80.5  0.0   0:54.94 xhpl_intel64_2

 77391 s2874169  20   0  112m  12m 2424 R 66.0  0.0   0:40.98 xhpl_intel64_3

 77390 s2874169  20   0  112m  12m 2416 R 56.3  0.0   0:38.42 xhpl_intel64_3

 77395 s2874169  20   0  112m  12m 2396 R 41.8  0.0   0:27.30 xhpl_intel64_3

 77317 s2874169  20   0  121m  13m 2444 R 40.2  0.0   0:29.60 xhpl_intel64_2

 77318 s2874169  20   0  121m  13m 2432 R 40.2  0.0   0:27.78 xhpl_intel64_2

 77319 s2874169  20   0  121m  13m 2444 R 40.2  0.0   0:27.77 xhpl_intel64_2

 77389 s2874169  20   0  112m  12m 2384 R 40.2  0.0   0:44.89 xhpl_intel64_3

 77392 s2874169  20   0  112m  12m 2404 R 40.2  0.0   0:29.10 xhpl_intel64_3

 77393 s2874169  20   0  112m  12m 2384 R 38.6  0.0   0:27.18 xhpl_intel64_3

 77675 s2874169  20   0 17360 1536  904 R  1.6  0.0   0:00.01 top

 

xhpl_intel64_1

/export/home/s2874169/intel/impi/4.1.0.024/intel64/bin/mpirun

[2] MPI startup(): shm and ofa data transfer modes

[4] MPI startup(): shm and ofa data transfer modes

[6] MPI startup(): shm and ofa data transfer modes

[3] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): shm and ofa data transfer modes

[7] MPI startup(): shm and ofa data transfer modes

[1] MPI startup(): shm and ofa data transfer modes

[5] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       124809   n041       {0,1}

[0] MPI startup(): 1       124802   n041       {2,3}

[0] MPI startup(): 2       124803   n041       {4,5}

[0] MPI startup(): 3       124804   n041       {6,7}

[0] MPI startup(): 4       124805   n041       {8,9}

[0] MPI startup(): 5       124806   n041       {10,11}

[0] MPI startup(): 6       124807   n041       {12,13}

[0] MPI startup(): 7       124808   n041       {14,15}

[0] MPI startup(): I_MPI_DEBUG=5

[0] MPI startup(): I_MPI_FABRICS=shm:ofa

 

xhpl_intel64_2

/export/home/s2874169/intel/impi/4.1.0.024/intel64/bin/mpirun

[4] MPI startup(): shm and ofa data transfer modes

[10] MPI startup(): shm and ofa data transfer modes

[5] MPI startup(): shm and ofa data transfer modes

[12] MPI startup(): shm and ofa data transfer modes

[6] MPI startup(): shm and ofa data transfer modes

[1] MPI startup(): shm and ofa data transfer modes

[3] MPI startup(): shm and ofa data transfer modes

[7] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): shm and ofa data transfer modes

[15] MPI startup(): shm and ofa data transfer modes

[9] MPI startup(): shm and ofa data transfer modes

[2] MPI startup(): shm and ofa data transfer modes

[13] MPI startup(): shm and ofa data transfer modes

[8] MPI startup(): shm and ofa data transfer modes

[11] MPI startup(): shm and ofa data transfer modes

[14] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       124960   n041       {0,1}

[0] MPI startup(): 1       124953   n041       {2,3}

[0] MPI startup(): 2       124954   n041       {4,5}

[0] MPI startup(): 3       124955   n041       {6,7}

[0] MPI startup(): 4       124956   n041       {8,9}

[0] MPI startup(): 5       124957   n041       {10,11}

[0] MPI startup(): 6       124958   n041       {12,13}

[0] MPI startup(): 7       124959   n041       {14,15}

[0] MPI startup(): 8       77313    n042       {0,1}

[0] MPI startup(): 9       77314    n042       {2,3}

[0] MPI startup(): 10      77315    n042       {4,5}

[0] MPI startup(): 11      77316    n042       {6,7}

[0] MPI startup(): 12      77317    n042       {8,9}

[0] MPI startup(): 13      77318    n042       {10,11}

[0] MPI startup(): 14      77320    n042       {12,13}

[0] MPI startup(): 15      77319    n042       {14,15}

[0] MPI startup(): I_MPI_DEBUG=5

[0] MPI startup(): I_MPI_FABRICS=shm:ofa

 

xhpl_intel64_3

/export/home/s2874169/intel/impi/4.1.0.024/intel64/bin/mpirun

[2] MPI startup(): shm and ofa data transfer modes

[3] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): shm and ofa data transfer modes

[1] MPI startup(): shm and ofa data transfer modes

[6] MPI startup(): shm and ofa data transfer modes

[7] MPI startup(): shm and ofa data transfer modes

[5] MPI startup(): shm and ofa data transfer modes

[4] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       77396    n042       {0,1}

[0] MPI startup(): 1       77389    n042       {2,3}

[0] MPI startup(): 2       77390    n042       {4,5}

[0] MPI startup(): 3       77391    n042       {6,7}

[0] MPI startup(): 4       77392    n042       {8,9}

[0] MPI startup(): 5       77393    n042       {10,11}

[0] MPI startup(): 6       77394    n042       {12,13}

[0] MPI startup(): 7       77395    n042       {14,15}

[0] MPI startup(): I_MPI_DEBUG=5

[0] MPI startup(): I_MPI_FABRICS=shm:ofa
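In the three pin tables above, xhpl_intel64_1 and the first eight ranks of xhpl_intel64_2 both list {0,1} through {14,15} on n041, and the same holds for xhpl_intel64_2 and xhpl_intel64_3 on n042. The following is a rough sketch, not part of the original post, of how such overlaps between jobs could be picked out of saved I_MPI_DEBUG output. The function names `parse_pin_table` and `shared_cpus` are hypothetical, and the regular expression assumes the row layout shown in this thread.

```python
import re
from collections import defaultdict

# Matches rows like: [0] MPI startup(): 0   124809   n041   {0,1}
# Header rows and rows with "n/a" in the pin column do not match.
PIN_ROW = re.compile(
    r"\[\d+\] MPI startup\(\):\s+(\d+)\s+(\d+)\s+(\S+)\s+\{([\d,]+)\}"
)


def parse_pin_table(debug_text):
    """Return (rank, pid, node, cpus) tuples from I_MPI_DEBUG pin rows."""
    rows = []
    for match in PIN_ROW.finditer(debug_text):
        rank, pid, node, cpus = match.groups()
        cpu_set = frozenset(int(c) for c in cpus.split(",") if c)
        rows.append((int(rank), int(pid), node, cpu_set))
    return rows


def shared_cpus(jobs):
    """Map (node, cpu) to the sorted job names pinned there, if 2 or more."""
    users = defaultdict(set)
    for job_name, debug_text in jobs.items():
        for _rank, _pid, node, cpus in parse_pin_table(debug_text):
            for cpu in cpus:
                users[(node, cpu)].add(job_name)
    return {key: sorted(names) for key, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    # Two rows copied from the output above; a real check would read the
    # full saved output of each job instead.
    job1 = "[0] MPI startup(): 0       124809   n041       {0,1}"
    job2 = ("[0] MPI startup(): 0       124960   n041       {0,1}\n"
            "[0] MPI startup(): 8       77313    n042       {0,1}")
    for key, names in shared_cpus(
            {"xhpl_intel64_1": job1, "xhpl_intel64_2": job2}).items():
        print(key, names)
```

The sketch only compares what each job reports about itself; it says nothing about why the library chose those pins.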

 

 

Hi Jesper,

Please try setting I_MPI_PIN_MODE=mpd and see if that helps.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

I ran a test with I_MPI_PIN_MODE=mpd, and one of the 8-core runs did not run at all.

--------------------------

First 8-core run, PBS error file:

########################### Execution Started #############################

JobId:492050.pbsserver

UserName:s2874169

GroupName:12874169

ExecutionHost:n041

WorkingDir:/var/spool/PBS/mom_priv

###############################################################################

sched_setaffinity: Invalid argument failed to set pid 0's affinity.

sched_setaffinity: Invalid argument failed to set pid 0's affinity.

sched_setaffinity: Invalid argument failed to set pid 0's affinity.

sched_setaffinity: Invalid argument failed to set pid 0's affinity.

=>> PBS: job killed: walltime 135 exceeded limit 120

/export/home/s2874169/intel/impi/4.1.0.024/intel64/bin/mpirun: line 1: kill: SIGTERM: invalid signal specification

id: cannot find name for group ID 12874169

id: cannot find name for group ID 12874169

id: cannot find name for group ID 12874169

id: cannot find name for group ID 12874169

id: cannot find name for group ID 12874169

id: cannot find name for group ID 12874169

id: cannot find name for group ID 12874169

id: cannot find name for group ID 12874169

########################### Job Execution History #############################

PBS output file:

Starting job Mon Feb 18 19:51:19 EST 2013

/export/home/s2874169/Variable_Runs/LINPACK

n041

n041

n041

n041

n041

n041

n041

n041

/export/home/s2874169/intel/impi/4.1.0.024/intel64/bin/mpirun

 job aborted; reason = mpd disappeared

----------------------------------------------------------------

The 16-core run

The following was added to the I_MPI_DEBUG output (only the new or different parts are shown):

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c74030

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c74030

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c74030

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c74030

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c74030

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c74030

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c74030

[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=c74030

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       36663    n041       n/a

[0] MPI startup(): 1       36656    n041       n/a

[0] MPI startup(): 2       36657    n041       n/a

[0] MPI startup(): 3       36658    n041       n/a

[0] MPI startup(): 4       36659    n041       n/a

[0] MPI startup(): 5       36660    n041       n/a

[0] MPI startup(): 6       36661    n041       n/a

[0] MPI startup(): 7       36662    n041       n/a

---------------------------------------------------------------

The last 8 core execution

 

 

If I use I_MPI_PIN_MODE=mpd but remove I_MPI_PROCESS_MANAGER=mpd, I get:

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       37907    n041       {10,12}

[0] MPI startup(): 1       37908    n041       {11,13}

[0] MPI startup(): 2       37909    n041       {0,1}

[0] MPI startup(): 3       37910    n041       {2,3}

[0] MPI startup(): 4       37911    n041       {4,5}

[0] MPI startup(): 5       37912    n041       {6,7}

[0] MPI startup(): 6       37913    n041       {8,9}

[0] MPI startup(): 7       37914    n041       {10,12}

[0] MPI startup(): I_MPI_DEBUG=5

[0] MPI startup(): I_MPI_FABRICS=shm:ofa

[0] MPI startup(): I_MPI_PIN_MAPPING=8:0 0,1 2,2 4,3 6,4 8,5 10,6 11,7 0

----------

[2] MPI startup(): shm and ofa data transfer modes

[10] MPI startup(): shm and ofa data transfer modes

[9] MPI startup(): shm and ofa data transfer modes

[8] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): shm and ofa data transfer modes

[5] MPI startup(): shm and ofa data transfer modes

[4] MPI startup(): shm and ofa data transfer modes

[1] MPI startup(): shm and ofa data transfer modes

[7] MPI startup(): shm and ofa data transfer modes

[3] MPI startup(): shm and ofa data transfer modes

[6] MPI startup(): shm and ofa data transfer modes

[14] MPI startup(): shm and ofa data transfer modes

[11] MPI startup(): shm and ofa data transfer modes

[15] MPI startup(): shm and ofa data transfer modes

[12] MPI startup(): shm and ofa data transfer modes

[13] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       37984    n041       {0,1}

[0] MPI startup(): 1       37985    n041       {4,5}

[0] MPI startup(): 2       37986    n041       {8,9}

[0] MPI startup(): 3       37987    n041       {10,12}

[0] MPI startup(): 4       37988    n041       {2,3}

[0] MPI startup(): 5       37989    n041       {6,13}

[0] MPI startup(): 6       37990    n041       {7,11}

[0] MPI startup(): 7       37991    n041       {0,1}

[0] MPI startup(): 8       113141   n042       {0,1}

[0] MPI startup(): 9       113142   n042       {2,3}

[0] MPI startup(): 10      113143   n042       {4,5}

[0] MPI startup(): 11      113144   n042       {6,7}

[0] MPI startup(): 12      113145   n042       {8,9}

[0] MPI startup(): 13      113146   n042       {10,11}

[0] MPI startup(): 14      113147   n042       {12,13}

[0] MPI startup(): 15      113148   n042       {14,15}

[0] MPI startup(): I_MPI_DEBUG=5

[0] MPI startup(): I_MPI_FABRICS=shm:ofa

[0] MPI startup(): I_MPI_PIN_MAPPING=8:0 0,1 2,2 4,3 6,4 7,5 8,6 10,7 0

-----------

[1] MPI startup(): shm and ofa data transfer modes

[3] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): shm and ofa data transfer modes

[7] MPI startup(): shm and ofa data transfer modes

[5] MPI startup(): shm and ofa data transfer modes

[6] MPI startup(): shm and ofa data transfer modes

[4] MPI startup(): shm and ofa data transfer modes

[2] MPI startup(): shm and ofa data transfer modes

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       113218   n042       {0,1}

[0] MPI startup(): 1       113219   n042       {4,5}

[0] MPI startup(): 2       113220   n042       {8,9}

[0] MPI startup(): 3       113221   n042       {10,12}

[0] MPI startup(): 4       113222   n042       {2,3}

[0] MPI startup(): 5       113223   n042       {6,13}

[0] MPI startup(): 6       113224   n042       {7,11}

[0] MPI startup(): 7       113225   n042       {0,1}

[0] MPI startup(): I_MPI_DEBUG=5

[0] MPI startup(): I_MPI_FABRICS=shm:ofa

[0] MPI startup(): I_MPI_PIN_MAPPING=8:0 0,1 2,2 4,3 6,4 7,5 8,6 10,7 0

 

However, the processes then seem to share cores: in top, processes that are normally close to 100 %CPU are now close to 50 or 25.
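In the first pin table above, ranks 0 and 7 are both listed with {10,12} on n041, which would fit the reduced %CPU seen in top. As a hedged illustration only, the sketch below groups the rows of one job's pin table by pin set and reports sets that appear for more than one rank. The name `ranks_sharing_pins` is hypothetical, the regular expression assumes the row layout shown in this thread, and only identical pin sets are flagged (partially overlapping sets such as {6,13} and {10,12} are not compared).

```python
import re
from collections import defaultdict

# Matches rows like: [0] MPI startup(): 0   37907   n041   {10,12}
PIN_ROW = re.compile(r"MPI startup\(\):\s+(\d+)\s+\d+\s+(\S+)\s+\{([\d,]+)\}")

# Rows copied from the first pin table above.
SAMPLE = """\
[0] MPI startup(): 0       37907    n041       {10,12}
[0] MPI startup(): 1       37908    n041       {11,13}
[0] MPI startup(): 2       37909    n041       {0,1}
[0] MPI startup(): 3       37910    n041       {2,3}
[0] MPI startup(): 4       37911    n041       {4,5}
[0] MPI startup(): 5       37912    n041       {6,7}
[0] MPI startup(): 6       37913    n041       {8,9}
[0] MPI startup(): 7       37914    n041       {10,12}
"""


def ranks_sharing_pins(debug_text):
    """Map (node, pin set) to ranks, for pin sets listed for 2 or more ranks."""
    by_pin = defaultdict(list)
    for rank, node, cpus in PIN_ROW.findall(debug_text):
        pin_set = tuple(sorted(int(c) for c in cpus.split(",") if c))
        by_pin[(node, pin_set)].append(int(rank))
    return {key: ranks for key, ranks in by_pin.items() if len(ranks) > 1}


if __name__ == "__main__":
    for (node, pin_set), ranks in ranks_sharing_pins(SAMPLE).items():
        print(f"{node}: ranks {ranks} share pin set {pin_set}")
```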

 

Hi Jesper,

Please send the output from the following commands:

/export/home/s2874169/intel/impi/4.1.0.024/intel64/bin/cpuinfo

env | grep I_MPI

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

I am sorry for the late reply; somehow the update email got lost from my Outlook view (error 40).

The cpuinfo was:

Intel(R) Processor information utility, Version 4.1.0 Build 20120831

Copyright (C) 2005-2012 Intel Corporation.  All rights reserved.

=====  Processor composition  =====

Processor name    : Intel(R) Xeon(R)  E5-2670 0

Packages(sockets) : 0

Cores             : 0

Processors(CPUs)  : 0

=====  Cache sharing  =====

Cache Size  Processors

L1 32  KB  no sharing

L2 256 KB  no sharing

L3 20  MB  no sharing

----------------

I tried env | grep I_MPI with a few different mpirun options (changing some I_MPI variables to match the different things we have already tried).

Initially:

I_MPI_FABRICS=shm:ofa

I_MPI_RDMA_CREATE_CONN_QUAL=0

I_MPI_SUBSTITUTE_INSTALLDIR=/sw/dhi/intel/impi/4.0.3.008

I_MPI_ROOT=/export/home/s2874169/intel/impi/4.1.0.024

----------------

Adding I_MPI_PROCESS_MANAGER=mpd made it show up in the env | grep I_MPI list.

Adding I_MPI_PIN_MODE=mpd made it show up in the env | grep I_MPI list.

Adding I_MPI_SUBSTITUTE_INSTALLDIR=/export/home/s2874169/intel/impi/4.1.0.024 made it show up as well, but it did not change the performance.

 

Hi James,

An update: I forgot to mention that the cpuinfo and env | grep I_MPI output in my previous post was produced inside the PBS script.

Using ssh to log in to one of the nodes and then running cpuinfo, I got:

Intel(R) Processor information utility, Version 4.1.0 Build 20120831

Copyright (C) 2005-2012 Intel Corporation.  All rights reserved.

=====  Processor composition  =====

Processor name    : Intel(R) Xeon(R)  E5-2670 0

Packages(sockets) : 2

Cores             : 16

Processors(CPUs)  : 16

Cores per package : 8

Threads per core  : 1

=====  Processor identification  =====

Processor       Thread Id.      Core Id.        Package Id.

0               0               0               0

1               0               1               0

2               0               2               0

3               0               3               0

4               0               4               0

5               0               5               0

6               0               6               0

7               0               7               0

8               0               0               1

9               0               1               1

10              0               2               1

11              0               3               1

12              0               4               1

13              0               5               1

14              0               6               1

15              0               7               1

=====  Placement on packages  =====

Package Id.     Core Id.        Processors

0               0,1,2,3,4,5,6,7         0,1,2,3,4,5,6,7

1               0,1,2,3,4,5,6,7         8,9,10,11,12,13,14,15

=====  Cache sharing  =====

Cache   Size            Processors

L1      32  KB          no sharing

L2      256 KB          no sharing

L3      20  MB          (0,1,2,3,4,5,6,7)(8,9,10,11,12,13,14,15)

----------------------------------------------------------------------

And while logged in via ssh, I got:

[s2874169@n041 ~]$ env | grep I_MPI

I_MPI_FABRICS=shm:ofa

I_MPI_RDMA_CREATE_CONN_QUAL=0

I_MPI_SUBSTITUTE_INSTALLDIR=/sw/dhi/intel/impi/4.0.3.008

I_MPI_ROOT=/sw/dhi/intel/impi/4.0.3.008

Hi Jesper,

The "env" command I asked you to run simply displays the environment, using "grep I_MPI" reduces it to the Intel® MPI Library specific environment variables.  So setting a different environment variable would show in the output.  Everything there looks fine.  However, the cpuinfo output from within the batch script is very wrong.  The output when you are logged in directly is correct.  I'll investigate that and see if I can find out why cpuinfo isn't reporting anything from a script.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
