MPI errors on large OPA fabric

Hello,

We're getting MPI communication errors with Intel MPI on our Omni-Path (OPA) cluster. The failing job uses 931 nodes; smaller runs on 600 nodes execute properly.

Other details:

We're using Intel Parallel Studio 2017 update 4 (compilers_and_libraries_2017.4.196).

There are 1024 nodes in total on the fabric, and we would like to run jobs that use the entire cluster.

This is an HPL run using Intel l_mklb_p_2017.3.017.

Below is an example of the errors we see. What is interesting is that the number of bytes received and the reported buffer size are the same, yet the error states that the message was truncated. Does the receive buffer normally need extra space for a header?

Fatal error in MPI_Recv: Message truncated, error stack:
MPI_Recv(224)................: MPI_Recv(buf=0x2b1ee8401840, count=1455, MPI_DOUBLE, src=17, tag=10001, comm=0x84000002, status=0x7ffef5ddfe50) failed
MPID_nem_tmi_handle_rreq(738): Message from rank 17 and tag 10001 truncated; 11640 bytes received but buffer size is 11640
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(259)............: MPI_Sendrecv(sbuf=0x2b93ba000000, scount=1164, MPI_DOUBLE, dest=13, stag=10001, rbuf=0x2b93ba002460, rcount=1746, MPI_DOUBLE, src=13, rtag=10001, comm=0x84000002, status=0x7ffcec3f3f50) failed
MPID_nem_tmi_handle_rreq(738): Message from rank 13 and tag 10001 truncated; 13968 bytes received but buffer size is 13968
Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(259)............: MPI_Sendrecv(sbuf=0x2b30f5880808, scount=24576, MPI_DOUBLE, dest=16, stag=10001, rbuf=0x2b30ef400000, rcount=1164, MPI_DOUBLE, src=16, rtag=10001, comm=0x84000002, status=0x7ffc4278ec10) failed
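
As a sanity check on the numbers in the error stacks above, the receive count can be multiplied by the size of MPI_DOUBLE and compared with the reported buffer size. The sketch below assumes MPI_DOUBLE is 8 bytes; the names SIZEOF_DOUBLE, REPORTS, and summarize are illustrative and not part of any MPI API. It only covers the two errors whose byte counts appear in the log.

```python
SIZEOF_DOUBLE = 8  # assumption: MPI_DOUBLE occupies 8 bytes on this platform

# (call, receive count, bytes received, reported buffer size in bytes),
# copied from the two complete error stacks quoted in this post.
REPORTS = [
    ("MPI_Recv", 1455, 11640, 11640),
    ("MPI_Sendrecv", 1746, 13968, 13968),
]


def summarize(rcount, received, reported):
    """Compare the posted receive count with the sizes in the error text."""
    expected = rcount * SIZEOF_DOUBLE
    return {
        "expected_buffer_bytes": expected,
        # Does count * sizeof(double) equal the buffer size MPI reported?
        "matches_reported_buffer": expected == reported,
        # Would the received message fit into the reported buffer?
        "message_fits": received <= reported,
    }


for call, rcount, received, reported in REPORTS:
    print(call, summarize(rcount, received, reported))
```

In both cases the count times 8 equals the reported buffer size, and the received byte count does not exceed it, so the sizes quoted in the message alone do not explain the truncation.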

 

Thread Topic: 

Help Me
2 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

This question seems more appropriate for the cluster HPC forum. It would help if you could quote the diagnoses from Intel Cluster Checker there.
