Issue with MPI 2019U6 and MLX provider

Hi

We have two clusters that are almost identical except that one is now running Mellanox OFED 4.6 and the other 4.5.

With MPI 2019U6 from the Studio 2020 distribution, one cluster (OFED 4.5) works OK; the other (OFED 4.6) does not and throws UCX errors:

]$ cat slurm-151351.out
I_MPI_F77=ifort
I_MPI_PORT_RANGE=60001:61000
I_MPI_F90=ifort
I_MPI_CC=icc
I_MPI_CXX=icpc
I_MPI_DEBUG=999
I_MPI_FC=ifort
I_MPI_HYDRA_BOOTSTRAP=slurm
I_MPI_ROOT=/apps/compilers/intel/2020.0/compilers_and_libraries_2020.0.166/linux/mpi
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrname_len: 512, addrname_firstlen: 512
[0] MPI startup(): val_max: 4096, part_len: 4095, bc_len: 1030, num_parts: 1
[1578327353.181131] [scs0027:247642:0]         select.c:410  UCX  ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
[1578327353.180508] [scs0088:378614:0]         select.c:410  UCX  ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
Abort(1091471) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed

 

Is this possibly an Intel MPI issue, or something at our end (given that the 2018 and early 2019 versions worked OK)?
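
For reference, a quick way to compare what UCX can actually see on each cluster is ucx_info, which ships with the UCX bundled in MLNX_OFED (the grep is just a convenience):

# List the transports UCX detects on this node; a healthy IB node should
# show rc/ud transports, not just the mm and self ones from the error above
ucx_info -d | grep -i transport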

Thanks
A


Hi Ade,

Thanks for reaching out to us. We are working on your issue and will get back to you soon.

-Shubham


Are you encountering this error with every program you are running, or only with certain programs?

Also, if you have installed Intel® Cluster Checker, please run

clck -f ./<nodefile> -F mpi_prereq_user

This will run diagnostic checks related to Intel® MPI Library functionality and help verify that the cluster is configured as expected.
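
For example, assuming a plain-text nodefile with one hostname per line (the node names below are placeholders):

# Hypothetical two-node nodefile; substitute your real hostnames
printf 'node1\nnode2\n' > nodefile
clck -f ./nodefile -F mpi_prereq_user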


It seems to be with every program, although admittedly I'm only trying trivial examples: a 'hello world' and a prime-counting example.

All of them work on the OFED 4.5 cluster but fail on the OFED 4.6 cluster when Studio 2020 is used.
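
For reference, the reproducer is essentially the following, assuming hello.c is the stock MPI hello world (mpiicc is the Intel MPI C compiler wrapper, and the rank count is arbitrary):

# Build and run under the Studio 2020 toolchain
mpiicc hello.c -o hello
I_MPI_DEBUG=999 mpirun -n 4 ./hello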

Cluster Checker is happy except for the logical processor count: hyper-threading is enabled in the BIOS, but the extra logical cores are taken offline at boot on all our systems:

SUMMARY
  Command-line:   clck -F mpi_prereq_user
  Tests Run:      mpi_prereq_user
  ERROR:          2 tests encountered errors. Information may be incomplete. See
                  clck_results.log and search for "ERROR" for more information.
  Overall Result: 1 issue found - FUNCTIONALITY (1)
--------------------------------------------------------------------------------
2 nodes tested:         cdcs[0003-0004]
0 nodes with no issues:
2 nodes with issues:    cdcs[0003-0004]
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
  1. There is a mismatch between number of available logical cores and maximum
     logical cores. Cores '40-79' are offline.
       2 nodes: cdcs[0003-0004]

HARDWARE UNIFORMITY
No issues detected.

PERFORMANCE
No issues detected.

SOFTWARE UNIFORMITY
No issues detected.

See clck_results.log for more information.
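
If it helps to cross-check the core-count mismatch, the kernel reports offline logical CPUs through sysfs (a standard path, nothing clck-specific):

# Prints the offline CPU ranges, e.g. 40-79
cat /sys/devices/system/cpu/offline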


Hello Ade,

 

Have you tried to measure the performance of the "mlx" provider with MOFED 4.5? Can you run the standard IMB or OSU benchmarks?

Have you tried any other MPI stacks? OpenMPI is available with the MOFED distributions, and these benchmarks come prebuilt, so you can try them quickly (see the sketch below).
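
As a sketch, a two-node check with the Intel MPI Benchmarks could look like the following; IMB-MPI1 ships with Intel MPI, and the hostnames here are placeholders:

# One rank per node, PingPong latency/bandwidth between the pair
mpirun -np 2 -ppn 1 -hosts node1,node2 IMB-MPI1 PingPong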

 

regards

Michael

 


Hi Michael et al.

We only have this problem with 2020; 2019, 2018, OpenMPI, MPICH, and Mellanox's HPC-X OpenMPI are all OK.

I have now, I think, isolated it to an interaction between the mlx FI_PROVIDER and the MLNX_OFED 4.6 we have. Setting the provider to verbs appears to cure the problem, although that is perhaps less than ideal. Equally, the mlx provider has no issue on the MLNX_OFED 4.5 deployments we have.
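
For anyone hitting the same thing, the workaround amounts to forcing the libfabric verbs provider before launching; FI_PROVIDER is the standard libfabric selection variable, and the job line itself is only an example:

# Force the verbs provider instead of mlx
export FI_PROVIDER=verbs
# I_MPI_DEBUG makes startup print the selected provider, so you can
# confirm verbs was actually picked
I_MPI_DEBUG=4 mpirun -n 4 ./hello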

Michael, if you are interested in performance separately (rather than just making it work), I can provide some IMB output.

Cheers

Ade


Ade, 

In my tests, the verbs provider offers 2-3 GB/s at best, which is really not good (around 6x below line speed for EDR).

Is your CPU Zen 2 or Intel based?

Sure, I'd be glad to see some numbers :)

regards

Michael

 

 
