mpiexec fails under SGE

Hi everyone,

I'm trying to run Intel MPI 3.2.1 on an SGI Altix Linux cluster under SGE 6.2. It fails with the following error:

cat output.32.Hello
/var/sge/default/spool/r1i0n12/active_jobs/32.1/pe_hostfile
r1i0n12
r1i0n12
r1i0n12
r1i0n12
r1i0n12
r1i0n12
r1i0n12
r1i0n12
mpdroot: cannot connect to local mpd at: /tmp/32.1.all.q/mpd2.console_root_r1i0n12
probable cause: no mpd daemon on this machine
possible cause: unix socket /tmp/32.1.all.q/mpd2.console_root_r1i0n12 has been removed
mpiexec_r1i0n12 (__init__ 1162): forked process failed; status=255

However, if the job is submitted without SGE (i.e. directly from the command line), it works fine on the same set of nodes.

The MPI job is launched with the mpiexec command. The mpds are already booted by root, and the user has MPD_USE_ROOT_MPD=1 in the .mpd.conf file in his home directory.

What could be the reason for failure here?

Thanks

Hi San,

It seems to me that SGE changes the TMPDIR environment variable, and after that mpdroot cannot find the console file.
Could you set I_MPI_MPD_TMPDIR=/tmp before you create the mpd ring and give it a try? Don't forget to set this variable for the user as well.
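
If you want to confirm this first, a small check like the one below could be run both from the SGE job script and from an ordinary login shell on the same node. It is only a sketch: the console file name pattern (mpd2.console_root_<host>) is copied from your error message rather than from the documentation, the helper names console_path and check_console are made up for this example, and it assumes that `hostname` prints the same short name that appears in the socket path.

```shell
#!/bin/sh
# Sketch: compare the console socket location under the current TMPDIR
# with the location under plain /tmp.
# Assumption: the socket is named mpd2.console_root_<host>, as seen in
# the error output in this thread.

console_path() {
    # $1 = temp directory, $2 = host name
    printf '%s/mpd2.console_root_%s\n' "$1" "$2"
}

check_console() {
    # Report whether a unix socket exists at the given path.
    if [ -S "$1" ]; then
        echo "found:   $1"
    else
        echo "missing: $1"
    fi
}

host=$(hostname)

# Location derived from the current TMPDIR (SGE sets a per-job directory).
check_console "$(console_path "${TMPDIR:-/tmp}" "$host")"

# Location under /tmp, where a root mpd started outside SGE would
# normally have created its socket.
check_console "$(console_path /tmp "$host")"
```

If the first line reports "missing" inside the job while the second reports "found", that would be consistent with the TMPDIR explanation above.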

Please let me know if it doesn't help.

Regards!
Dmitry
