Using Intel MPI with PBSPro and Kerberos

Hello,

We are having trouble using Intel MPI with PBSPro on our cluster in a Kerberized environment.

The problem is that PBSPro doesn't forward Kerberos tickets, which prevents us from having password-less ssh. Our security officers reject ssh keys without a passphrase; besides, we are expected to rely on Kerberos to connect through ssh.

As you can expect, a simple

mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh" # that simply calls bash builtin echo

fails because pmi_proxy hangs until the walltime is exceeded, and we observe:

[...]
[mpiexec@node028.sis.cnes.fr] Launch arguments: /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 0
[mpiexec@node028.sis.cnes.fr] Launch arguments: /bin/ssh -x -q node029.sis.cnes.fr /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 1
[proxy:0:0@node028.sis.cnes.fr] Start PMI_proxy 0
[proxy:0:0@node028.sis.cnes.fr] STDIN will be redirected to 1 fd(s): 17
[0] node: 0 /  /
=>> PBS: job killed: walltime 23 exceeded limit 15
[mpiexec@node028.sis.cnes.fr] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@node028.sis.cnes.fr] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@node028.sis.cnes.fr] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@node028.sis.cnes.fr] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@node028.sis.cnes.fr] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@node028.sis.cnes.fr] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

If we instead log onto the master node, execute kinit, and then run mpirun, everything works fine. However, this isn't an acceptable workaround.

I've tried playing with the fabrics, as the nodes are also connected with InfiniBand, but I had no luck there. If I'm not mistaken, pmi_proxy requires password-less ssh whatever fabric we use. Am I right?

BTW, I've also tried Altair PBSPro's pbsdsh. I've observed that the parameters it expects are not compatible with the ones passed by mpirun. Besides, even if I wrap pbsdsh, pmi_proxy still fails with

[proxy:0:0@node028.sis.cnes.fr] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "node028.sis.cnes.fr" to "node028.sis.cnes.fr" (Connection refused)
[proxy:0:0@node028.sis.cnes.fr] main (../../pm/pmiserv/pmip.c:461): unable to connect to server node028.sis.cnes.fr at port 49813 (check for firewalls!)

So, my question: is there a workaround? Something that I've missed? Every clue I can gather from googling and experimenting points me towards "password-less ssh". So far the only workaround we've found consists of using another MPI framework :(

Regards,

Answering my own question: it appears the solution lies in the PBSPro User Guide, §6.2.6.1.

Setting

export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh

fixed my issue. I was misled by the fact that we don't have `rsh` installed.
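
For anyone landing here later, this is a minimal sketch of how the two settings could be wrapped for a job script. The function name `use_pbs_tmrsh` is my own invention, and the sketch assumes `pbs_tmrsh` is on the PATH of the compute nodes, which may not hold on every installation.

```shell
#!/bin/bash
# Hypothetical helper (the name is illustrative, not part of PBSPro or Intel MPI).
# Meant to be defined or sourced in a PBS job script before calling mpirun.
use_pbs_tmrsh() {
    # Fail early if PBSPro's pbs_tmrsh launcher cannot be found.
    if ! command -v pbs_tmrsh >/dev/null 2>&1; then
        echo "pbs_tmrsh not found in PATH" >&2
        return 1
    fi
    # Select Hydra's rsh-style bootstrap, but run pbs_tmrsh in place of rsh,
    # so neither rsh nor password-less ssh is needed.
    export I_MPI_HYDRA_BOOTSTRAP=rsh
    export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh
}
```

In the job script this would be followed by something like `use_pbs_tmrsh && mpirun -l -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh"`. As far as I understand it, the `rsh` value only tells Hydra which style of launcher to invoke, and the executable that actually runs is the one named in `I_MPI_HYDRA_BOOTSTRAP_EXEC`, which would explain why a missing `rsh` binary doesn't matter.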
