Large jobs with Hydra won't start

Hello

I have a simple test MPI job that I'm having trouble running on Intel MPI 4.1.1.036 at large process counts (>1500 processes).  This is using Hydra as the process manager.  It gets stuck at the following place in the verbose debug output:

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 39): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 42): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): get_maxes

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 7): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 12): get_maxes

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 15): get_maxes

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 45): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 12): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 48): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 15): barrier_in

[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): get_maxes

[proxy:0:160@cf-sb-cpc-223] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:160@cf-sb-cpc-223] got pmi command (from 30): barrier_in

[proxy:0:160@cf-sb-cpc-223] forwarding command (cmd=barrier_in) upstream

It runs consistently for process counts up to approximately 1536, but jobs with 2k or 3k processes hang as above.  I initially thought it might be ulimit related, but I have fixed those limits, as shown by the output from the same script:

core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515125
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 9316
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 515125
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

If one sets I_MPI_PROCESS_MANAGER to mpd, the large jobs work faultlessly.  So, my question is how to go about debugging this further?

Many thanks

Ade

Hi Ade,

Try setting I_MPI_HYDRA_BRANCH_COUNT=128.  You could also try using a uDAPL provider if you aren't already doing that.  To do so, set I_MPI_DAPL_UD=1 and I_MPI_DAPL_UD_PROVIDER to the uDAPL provider on your system.
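As a rough sketch, the settings above could be exported in the job script before the launch line. The provider name used as the default here is the one that appears later in this thread; it is only a placeholder, and the right value depends on the DAPL providers configured on your own nodes (typically listed in dat.conf). `DAPL_UD_PROVIDER` is a hypothetical helper variable introduced for this example.

```shell
#!/bin/bash
# Sketch only: export the suggested Hydra and DAPL UD settings before launching.
export I_MPI_HYDRA_BRANCH_COUNT=128
export I_MPI_DAPL_UD=1

# Placeholder default; override DAPL_UD_PROVIDER (hypothetical helper variable)
# with a provider name that exists on your system.
export I_MPI_DAPL_UD_PROVIDER="${DAPL_UD_PROVIDER:-ofa-v2-mlx4_0-1}"

# Show what the launcher will actually inherit.
env | grep '^I_MPI_' | sort
```

Printing the variables from inside the job script, rather than from a login shell, shows what the scheduler really passes to the job.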

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James

Many thanks for your assistance. It has helped, although there is still a limit.  We are using DAPL (although we default to SHM:OFA) and fiddling with that has had no effect.  Setting I_MPI_HYDRA_BRANCH_COUNT has, however, enabled jobs up to 3000 procs to work 8 out of 10 times.  Larger jobs still fail, and the 2/10 that fail at that size do make me wonder what on earth could be causing this troublesome issue.

Would you have any further advice?

Many thanks

Ade

Hi Ade,

What DAPL provider are you using?  What is the output from

env | grep I_MPI

James.

Hi James:

I_MPI_MPD_RSH
I_MPI_PROCESS_MANAGER hydra
I_MPI_HYDRA_BOOTSTRAP lsf
I_MPI_HYDRA_BRANCH_COUNT 128
I_MPI_DAPL_UD 1
I_MPI_DAPL_UD_PROVIDER ofa-v2-mlx4_0-2
I_MPI_FABRICS
I_MPI_HYDRA_BRANCH_COUNT=128
I_MPI_DAPL_UD=1
I_MPI_F77=ifort
I_MPI_ADJUST_ALLTOALLV=2
I_MPI_FALLBACK=disable
I_MPI_DAT_LIBRARY=/usr/lib64/libdat2.so.2
I_MPI_F90=ifort
I_MPI_CC=icc
I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx4_0-2
I_MPI_CXX=icpc
I_MPI_FC=ifort
I_MPI_HYDRA_BOOTSTRAP=lsf
I_MPI_PROCESS_MANAGER=hydra
I_MPI_ROOT=/app/libraries/impi/4.1.1.036

~~
Ade

Hi again James

It's funny how you only see these things after you post.....

[14] MPI startup(): dapl fabric is not available and fallback fabric is not enabled

Let me get that sorted before I waste any more of your time.

Many thanks

Ade

Dear James

Having now resolved the DAPL issue, and with the following evidence from a 1024 proc job to suggest it's working:

[948] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_UD_PROVIDER: ofa-v2-mlx4_0-1

[103] MPI startup(): DAPL provider ofa-v2-mlx4_0-1 with IB UD extension

[967] MPI startup(): shm and dapl data transfer modes

The larger jobs still don't work.  The env variables you asked for are as follows:

I_MPI_HYDRA_BRANCH_COUNT=128
I_MPI_DAPL_UD=1
I_MPI_F77=ifort
I_MPI_FALLBACK=disable
I_MPI_F90=ifort
I_MPI_CC=icc
I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx4_0-1
I_MPI_CXX=icpc
I_MPI_DEBUG=2
I_MPI_FC=ifort
I_MPI_HYDRA_BOOTSTRAP=lsf
I_MPI_PROCESS_MANAGER=hydra
I_MPI_ROOT=/app/libraries/impi/4.1.1.036

Really appreciate your time & advice.

Thanks

Ade

Hi Ade,

Ok, try raising ulimit -n to somewhere on the order of 15000 to 20000.  I don't know if this will help in this case, but the open file limit has sometimes led to problems.  Also, set

I_MPI_DYNAMIC_CONNECTION=0

This will increase the startup time, but having all connections opened at the start could avoid the hang you're seeing.
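One way these two suggestions might look in a bash job script is sketched below. The `raise_nofile` function and the target of 16384 are illustrative choices, not part of Intel MPI. Note the assumption that limits set here only apply to processes started from this shell; processes on other nodes inherit their limits from whatever starts them there, so the limit may also need raising in the scheduler or system configuration.

```shell
#!/bin/bash
# Sketch: raise the soft open-file limit toward a target without exceeding
# the hard limit, and never lower a limit that is already high enough.
raise_nofile() {
    local target=$1 soft hard
    soft=$(ulimit -Sn)
    hard=$(ulimit -Hn)
    # Nothing to do if the soft limit already meets the target.
    if [ "$soft" = "unlimited" ] || [ "$soft" -ge "$target" ]; then
        return 0
    fi
    # Cap the request at the hard limit; only an administrator can go higher.
    if [ "$hard" != "unlimited" ] && [ "$hard" -lt "$target" ]; then
        echo "hard limit $hard is below requested $target" >&2
        target=$hard
    fi
    ulimit -Sn "$target"
}

raise_nofile 16384   # illustrative value in the suggested 15000 to 20000 range
echo "open files (soft): $(ulimit -Sn)"

# Open all connections at startup instead of on demand.
export I_MPI_DYNAMIC_CONNECTION=0
```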

Is there any chance you can provide a reproducer?  I can run 2000 ranks on one of our clusters (though getting the reservation could take some time) and try to work with it here.

James.

Hi James

Apologies for the delay responding to this - it has proven difficult to get access to enough of the cluster at this end as well!

I have tweaked the environment so that there is now a ulimit of 65k to remove that as a possible issue.  I've also tried your suggestion regarding I_MPI_DYNAMIC_CONNECTION, but also to no avail.  I tried both the shm:ofa and shm:dapl options.  The 'mpd' process manager works fine all the way up to as many processors as we have.

The test case is actually any MPI program - I am testing it with a hello world variant.  In this case it is being run via LSF using Mellanox IB.  Is there anything more than that I can provide you as a test case?

Many thanks

Ade

Hi Ade,

Please send me the full output with

 -verbose -genv I_MPI_DEBUG 5
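For illustration, a complete invocation with these options might be assembled as below. The rank count, binary name, and log file name are placeholders (the thread does not give them), and the script only prints the command if `mpirun` is not on the PATH.

```shell
#!/bin/bash
# Sketch with placeholder values: adjust NP, APP and LOG for your own job.
NP=${NP:-2048}
APP=${APP:-./hello}
LOG=${LOG:-hydra_debug.log}

# Build the command as an array so arguments are passed through unchanged.
cmd=(mpirun -verbose -genv I_MPI_DEBUG 5 -n "$NP" "$APP")
echo "command: ${cmd[*]}"

if command -v mpirun >/dev/null 2>&1; then
    # Capture stdout and stderr in one file so proxy and rank messages
    # appear in the order they were emitted.
    "${cmd[@]}" > "$LOG" 2>&1
    echo "mpirun exit status: $? (log in $LOG)"
else
    echo "mpirun not found in PATH; load the Intel MPI environment first" >&2
fi
```

Writing both streams to a single file keeps the Hydra proxy output and the per-rank startup messages together, which makes it easier to see where the launch stops.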

James.

Hi James

Just catching up on things....I sent you over the requested logs, did you get them OK?  Is it likely this is something that may be picked up in SPs?

Many thanks

Ade
