pmi_proxy stalls the HPC job

Hi HPC enthusiasts,
We have an 8-node Sandy Bridge cluster with the following configuration:

Hardware:
1U rackmount enclosure
Intel S2400SC2 board
2 x Xeon E5-2450 processor
96GB ECC DDR3 RDIMM
Intel True Scale QLE7340-CK HCA
500GB Enterprise SATA
36 port QLogic switch
24-port 1GbE switch

Software:
CentOS 6.2 x64
Intel MPI Library 4.1.1.036
Intel Fortran Composer XE 2013.3.163
NetCDF 4.0
FFTW 3.3.3
Open Grid Engine 2011.11.p1
NFS share
Passphraseless SSH from any machine to any machine (meshed)

Of late, whenever we submit the job (home-grown code), either directly via mpirun or through Grid Engine qsub, it fails to start executing about 90% of the time; it just appears to stay stalled. On inspecting the running processes, we find that a few nodes, seemingly at random, show 'pmi_proxy' with status 'D' (uninterruptible sleep).
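
For readers who want to repeat that inspection, here is a minimal sketch of how the 'D'-state processes could be listed. The helper name `filter_d_state` and the host names in the comments are hypothetical; the `wchan` column, which shows the kernel function a process is sleeping in, may give a hint about what pmi_proxy is waiting on (state 'D' usually means a wait on I/O).

```shell
# Print the ps header plus every process whose state starts with "D"
# (uninterruptible sleep). Reads `ps` output on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use on one node:
#   ps -eo pid,stat,wchan:32,comm | filter_d_state
#
# Across nodes, assuming passphraseless SSH and hypothetical host names:
#   for h in node01 node02; do
#       echo "== $h"
#       ssh "$h" 'ps -eo pid,stat,wchan:32,comm' | filter_d_state
#   done
```

Running this on each node while the job is stalled should show which hosts have a blocked pmi_proxy.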

We have tested IMB (Intel MPI Benchmarks) and the test codes that come with Grid Engine and Intel MPI on the cluster, both via mpirun and through qsub, and they all run fine.

What is the pmi_proxy process, and how can we stop the job from stalling? The non-functioning job is driving me crazy. Please excuse me if this has already been discussed somewhere, or if this is not the correct forum. I'm a novice HPC user.

Any guidance would be appreciated.

Thanks in advance for any early and valuable suggestions.

With regards
Girish Nair
+91 98457 36460
girishnairisonline <at> gmail <dot> com

Hi Girish,

pmi_proxy is part of Hydra, the Intel MPI process manager. It seems the test codes are running as expected. Can you provide more details on the program you are attempting to run when it hangs? Also, if you can provide the output with

I_MPI_DEBUG=5
I_MPI_HYDRA_DEBUG=1

when it hangs, that could help.
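
A minimal sketch of one way to capture that output: the wrapper below sets both variables for a single command and saves stdout and stderr to a log file. The function name `run_with_debug`, the `LOG` variable, the process count, and the binary name are all placeholders, not part of Intel MPI.

```shell
# Run a command with the two debug variables set, saving stdout and
# stderr to a log file as well as showing them on the terminal.
# LOG (hypothetical) names the log file; it defaults to mpi_debug.log.
run_with_debug() {
    I_MPI_DEBUG=5 I_MPI_HYDRA_DEBUG=1 "$@" 2>&1 | tee "${LOG:-mpi_debug.log}"
}

# Example (process count and binary name are placeholders):
#   run_with_debug mpirun -n 16 ./mhd_code
```

When the job is submitted through qsub instead, the same two variables would need to be exported in the job script before the mpirun line.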

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,
Thanks for your support.

The code is an MHD (magnetohydrodynamics) code used to study the magnetic forces of the Earth. It reads compressed input data through NetCDF and writes compressed results that are also read through NetCDF. The code is home-grown, written in Fortran 90, and uses libraries such as FFTW and MKL.

The closest open-source equivalent, in my opinion, would be the 'Pencil Code' (http://pencil-code.nordita.org/).

As you've suggested, I'll run the code with the following parameters:

I_MPI_DEBUG=5  I_MPI_HYDRA_DEBUG=1

And I'll publish the output.

Additionally, do you think using MPD (mpdboot) instead of Hydra might help?

With regards
Girish Nair
+91 98457 36460
girishnairisonline <at> gmail <dot> com

If it does run under MPD and not under Hydra, we need to know about that so we can get it corrected.  We are trying to move to Hydra and away from MPD.
