Problems when trying to run symmetric MPI jobs with MPSS 3.2, MLNX HCA and ofed-3.5.1-mic-beta1

Hi,

We have been struggling to get symmetric MPI jobs running on our cluster. Host-to-host MPI works fine, and native MIC-to-MIC MPI between compute nodes also works. Intra-node host <-> mic communication works as well, but inter-node host <-> mic communication just hangs: it never receives "PMI response: cmd=barrier_out". Is this supposed to work at all with this HW/SW combination?

CentOS 6.5, MPSS 3.2, Slurm 2.6.7, and OFED 3.5.1.MIC.beta1, with a Mellanox ConnectX-3 HCA; mpxyd is running.

I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u

I_MPI_FABRICS=shm:dapl

    Network:       Static bridge br0
        MIC IP:    10.10.5.X
        Host IP:   10.10.4.X
        Net Bits:  16
        NetMask:   255.255.0.0
        MtuSize:   1500

net.ipv4.ip_forward = 1
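
For completeness, this is roughly how the environment is set on the host before launching (a sketch; the sysctl line is just the check that forwarding is on, not part of the job script):

    # verify that IP forwarding is enabled on the host (needed to route host <-> mic traffic)
    sysctl net.ipv4.ip_forward          # expect: net.ipv4.ip_forward = 1

    # environment set before mpirun
    export I_MPI_FABRICS=shm:dapl
    export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u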

Here are the last lines of the debug output:

[mpiexec@m41] Launch arguments: /usr/bin/ssh -x -q m42-mic0 sh -c 'export I_MPI_ROOT="/appl/opt/cluster_studio_xe2013/impi/4.1.3.045" ; export PATH="/appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/bin//../../mic/bin:${I_MPI_ROOT}:${I_MPI_ROOT}/mic/bin:${PATH}" ; exec "$0" "$@"' pmi_proxy --control-port 10.10.4.41:33072 --debug --pmi-connect lazy-cache --pmi-aggregate -s 0 --enable-mic --i_mpi_base_path /appl/opt/cluster_studio_xe2013/impi/4.1.3.045/intel64/bin/ --i_mpi_base_arch 0 --rmk slurm --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1532590129 --proxy-id 3

[mpiexec@m41] STDIN will be redirected to 1 fd(s): 9
[proxy:0:0@m41] Start PMI_proxy 0
[proxy:0:0@m41] STDIN will be redirected to 1 fd(s): 15
[proxy:0:0@m41] got pmi command (from 10): init
pmi_version=1 pmi_subversion=1
[proxy:0:0@m41] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@m41] got pmi command (from 10): get_maxes

[proxy:0:0@m41] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0:0@m41] got pmi command (from 10): barrier_in

[proxy:0:0@m41] forwarding command (cmd=barrier_in) upstream
[mpiexec@m41] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:2@m41-mic0] Start PMI_proxy 2
[proxy:0:1@m42] Start PMI_proxy 1
[proxy:0:2@m41-mic0] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:2@m41-mic0] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:2@m41-mic0] got pmi command (from 6): get_maxes

[proxy:0:2@m41-mic0] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[mpiexec@m41] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:2@m41-mic0] got pmi command (from 6): barrier_in

[proxy:0:2@m41-mic0] forwarding command (cmd=barrier_in) upstream
[proxy:0:3@m42-mic0] Start PMI_proxy 3
[proxy:0:3@m42-mic0] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:3@m42-mic0] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:3@m42-mic0] got pmi command (from 6): get_maxes

[proxy:0:3@m42-mic0] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=1024
[mpiexec@m41] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:3@m42-mic0] got pmi command (from 6): barrier_in

[proxy:0:3@m42-mic0] forwarding command (cmd=barrier_in) upstream

Hangs here.

Hey Tommi,

Let's see if I understand correctly. You have a cluster of Xeon hosts (e.g. node0, node1) with attached Phis (e.g. node0-mic0, node1-mic0).

What does work is Xeon-to-Xeon communication (node0 <-> node1) and Phi-to-Phi (node0-mic0 <-> node1-mic0). But it does not work if you're going from a remote Xeon to a remote Phi (node0 <-> node1-mic0). Do I have that right?

Of course, I have to ask that you have your hosts files set up so that *all* Xeons and *all* Phis can communicate with each other. In your case, I'd make sure that host m42-mic0 can ssh directly (without being prompted for a password) to host m41 and all other Xeon hosts on the system. So in the /etc/hosts file on m42-mic0, you'll see an entry for m41. You should also make sure that the /etc/dat.conf file is available on all hosts (Xeons and Phis).
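
For example, a quick sanity check along these lines, run from m42-mic0 (a sketch; the 10.10.4.41 address for m41 is taken from the debug output above):

    # on m42-mic0: confirm m41 resolves and passwordless ssh works
    grep m41 /etc/hosts         # expect an entry such as: 10.10.4.41  m41
    ssh m41 hostname            # should print "m41" with no password prompt

    # confirm the DAPL configuration is present on every Xeon and Phi
    ls -l /etc/dat.conf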

Let's not set I_MPI_DAPL_PROVIDER explicitly. Intel MPI automatically selects which provider to use based on whether the communication is local or remote (between Xeons and Phis). I would only set I_MPI_FABRICS=shm:dapl.
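
In other words, something like this in the job environment (a sketch of the suggested change):

    # let Intel MPI choose the DAPL provider automatically
    unset I_MPI_DAPL_PROVIDER
    export I_MPI_FABRICS=shm:dapl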

Finally, can you let me know how large your job is and what your mpirun/mpiexec command looks like? The software stack looks fine, but I'd like to see how you run this under Slurm.
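
For comparison, a typical symmetric launch with Intel MPI 4.1 looks roughly like this (a sketch, not your actual command; app.host and app.mic are hypothetical binary names, and the node names are from your post):

    # enable coprocessor support and keep the suggested fabric setting
    export I_MPI_MIC=1
    export I_MPI_FABRICS=shm:dapl

    # MPMD-style symmetric launch: 2 ranks on the host, 2 on its coprocessor
    mpirun -n 2 -host m41      ./app.host : \
           -n 2 -host m41-mic0 ./app.mic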

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com
