Hi HPC enthusiasts,
We have an 8-node Sandy Bridge cluster with the following configuration:
1U rackmount enclosure
Intel S2400SC2 board
2 x Xeon E5-2450 processor
96GB ECC DDR3 RDIMM
Intel True Scale QLE7340-CK HCA
500GB Enterprise SATA
36-port QLogic switch
24-port 1GbE switch
CentOS 6.2 x64
Intel MPI Library 4.1.1.036
Intel Fortran Composer XE 2013.3.163
Open Grid Engine 2011.11.p1
Passphraseless SSH from any machine to any machine (meshed)
Of late, whenever we submit a job (home-grown code), either directly via mpirun or through Grid Engine qsub, the job fails to start executing about 90% of the time; it just appears to stall. On inspecting the running processes, we find that a few nodes, seemingly at random, show 'pmi_proxy' with status 'D' (uninterruptible sleep).
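In case it helps anyone reproduce the check, here is a rough sketch of how the per-node inspection could be scripted. It assumes Python 3 and the procps `ps` found on Linux; the helper names `find_d_state` and `local_ps` are made up for this post and are not part of Intel MPI or Grid Engine. It only reports which pmi_proxy processes are in state 'D' and which kernel wait channel they are sleeping in; it does not fix anything.

```python
import subprocess


def find_d_state(ps_output, name="pmi_proxy"):
    """Return (pid, stat, wchan, comm) tuples for processes called `name`
    whose state starts with 'D' (uninterruptible sleep).

    `ps_output` is expected to be the text printed by
    `ps -eo pid,stat,wchan:32,comm`.
    """
    hits = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip the header line and anything that is not a 4-column row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        pid, stat, wchan, comm = fields
        comm = comm.strip()
        if stat.startswith("D") and comm == name:
            hits.append((int(pid), stat, wchan, comm))
    return hits


def local_ps():
    """Run ps on the local node and return its output as text."""
    return subprocess.check_output(
        ["ps", "-eo", "pid,stat,wchan:32,comm"],
        universal_newlines=True,
    )
```

Calling `find_d_state(local_ps())` on each node (for example over SSH) would list the stuck proxies. The wait channel column may hint at what the process is blocked on, such as a filesystem or a device driver, but that interpretation is a guess on my part.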
We have run IMB (Intel MPI Benchmarks) and the test codes that come with Grid Engine and Intel MPI on the cluster, both via mpirun and through qsub, and they work fine.
What is the pmi_proxy process, and how can we eliminate the stalling of jobs? The non-functioning job is driving me crazy. Please excuse me if this has already been discussed somewhere, or if this is not the correct forum. I'm a novice HPC user.
Any guidance would be appreciated.
Thanks in advance for your early and valuable suggestions.
+91 98457 36460
girishnairisonline <at> gmail <dot> com