Intel MPI error

bernieb@yahoo.com wrote:

We have a new cluster with Mellanox FDR Infiniband interconnect and sometimes get the following error when running Intel MPI :

[15] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[16] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[16] Abort: Got FATAL event 3 at line 1010 in file ../../ofa_utility.c
[159] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c
[0] Abort: Error code in polled desc! at line 2346 in file ../../ofa_init.c

We have also seen this error when running over a very large node set:

send desc error
[400] Abort: Got completion with error 9, vendor code=8a, dest rank= at line 870 in file ../../ofa_poll.c

I am not seeing this type of error at all using Open MPI. The cluster is using OFED (not the Mellanox vendor-supplied stack). We are using Torque as our resource manager.

Any help diagnosing this would be appreciated.

Bernie

Gergana Slavova (Intel) wrote:

Hey Bernie,

Thanks for posting.

The errors you're seeing are coming from the OFED software stack. It's very likely you're not using a suitable provider when running your Intel MPI jobs. Can you provide a couple of pieces of information?

It'll be good to know what Intel MPI Library version you're running, as well as your full command line and if you're setting any Intel MPI-specific environment variables. Also, please provide your /etc/dat.conf file. I should be able to tell you which provider you'd need to use based on that.
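If it's easier, something like this will gather everything in one go (a sketch; it assumes the usual install locations, so adjust paths for your cluster):

```shell
# Collect Intel MPI version, I_MPI_* environment, and dat.conf contents.
MPI_VER=$(mpirun -V 2>/dev/null | head -n 1)
[ -n "$MPI_VER" ] || MPI_VER="mpirun not in PATH"

I_MPI_VARS=$(env | grep '^I_MPI_')
[ -n "$I_MPI_VARS" ] || I_MPI_VARS="none set"

if [ -f /etc/dat.conf ]; then DAT_CONF=$(cat /etc/dat.conf); else DAT_CONF="not present"; fi

echo "Intel MPI version: $MPI_VER"
echo "I_MPI_* variables: $I_MPI_VARS"
echo "/etc/dat.conf: $DAT_CONF"
```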

I look forward to hearing back soon.

Regards,
~Gergana

Gergana Slavova
Technical Consulting Engineer
Intel® Cluster Tools
E-mail: gergana.s.slavova_at_intel.com
bernieb@yahoo.com wrote:

Gergana, thanks for the quick reply.

I am running the latest Intel MPI, 4.1.0.027.
I am setting the following environment variables:

I_MPI_FABRICS=shm:ofa
I_MPI_DEBUG=2
I_MPI_ROOT=/fltapps/boeing/mpi/intel/impi/4.1.0.027
I_MPI_EXTRA_FILESYSTEM=1
I_MPI_EXTRA_FILESYSTEM_LIST=panfs

We have a panasas file system.

I was under the impression that this version does not require /etc/dat.conf, and that Intel MPI supports InfiniBand natively without a DAPL layer.
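To double-check that, I can switch fabrics explicitly and compare (a sketch of a job-script fragment; `$NPROCS` and `$OVEREXE` are our own variables, and the `a.out` fallback is just a placeholder). `I_MPI_DEBUG=5` makes Intel MPI print which fabric/provider it actually selected:

```shell
# Fabric cross-check: if shm:ofa fails but shm:dapl or shm:tcp works,
# that narrows the fault to the OFA verbs path.
export I_MPI_DEBUG=5               # prints the selected fabric/provider at startup
export I_MPI_FABRICS=shm:ofa       # native OFA verbs path (no dat.conf needed)
# export I_MPI_FABRICS=shm:dapl    # DAPL path -- this one does read /etc/dat.conf
# export I_MPI_FABRICS=shm:tcp     # IPoIB/TCP fallback, for comparison

command -v mpirun >/dev/null 2>&1 &&
  mpirun -np "${NPROCS:-4}" "${OVEREXE:-./a.out}" ||
  echo "mpirun not available on this host"
```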

Here is the command line:

/usr/bin/time mpirun -np $NPROCS $OVEREXE >& over.1.out

$OVEREXE is the program we are running, and $NPROCS is the number of processors to use.

This job is run under Torque and should be able to pick up the node list from the queuing system.
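A quick sanity check that the node list really is visible to the job (a sketch; inside a real job Torque exports `PBS_NODEFILE`, and the mock file here only exists so the snippet runs outside one):

```shell
# Count the unique hosts Torque handed to this job.
NODEFILE="${PBS_NODEFILE:-mock_nodes.txt}"
# Outside a Torque job, fabricate a small nodefile so the check still runs.
[ -f "$NODEFILE" ] || printf 'node01\nnode01\nnode02\n' > "$NODEFILE"

UNIQUE_NODES=$(sort -u "$NODEFILE" | wc -l)
echo "node list spans $UNIQUE_NODES unique node(s)"
```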

To clarify, we don't see the error every time, only on some job submissions.

So it sounds like we have an OFED problem on one or more of the nodes, since it works most of the time.

Any other clues to help diagnose what is wrong would be appreciated.

Bernie

John Gilmore wrote:

Hi guys, I seem to have the same issue. When I run my Intel MPI job (that runs on both MVAPICH2 and Open MPI), I receive the following error:

[6] Abort: Got FATAL event 3
at line 1010 in file ../../ofa_utility.c
recv desc error, 128, 0x61b880
[1] Abort: Got completion with error 9, vendor code=8a, dest rank=
at line 870 in file ../../ofa_poll.c

After this, the application just blocks. I'm running Intel MPI 4.1.0.024.

It's an MPMD application. My command line is:
mpirun -perhost 3 -f /home/john/App/src/hostfile_intel \
-n 3 -env I_MPI_FABRICS shm:ofa ./AppA : \
-n 9 -env I_MPI_FABRICS shm:ofa ./AppB

I don't have an /etc/dat.conf file. My OS and architecture is: Linux hostname 3.3.8-1.fc16.x86_64 #1 SMP Mon Jun 4 20:49:02 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Does anyone have any ideas? I've not installed any OFA libraries explicitly, but I didn't have to for the other MPI implementations either. We're using Mellanox InfiniBand adapters. Any help would be greatly appreciated.

Regards
John

pankajd wrote:

We are getting the same error while running VASP 5.3.3 on CentOS 6.3 with Intel composer_xe_2013.1.117, Intel MPI 4.1.0, and Mellanox OFED, on 56 Gb/s IB-connected nodes (Sandy Bridge processors).

Error:

send desc error
[8] Abort: Got completion with error 12, vendor code?, dest rank= at line 870 in file ../../ofa_poll.c
[13] Abort: Got completion with error 12, vendor code?, dest rank= at line 870 in file ../../ofa_poll.c
[14] Abort: Got completion with error 12, vendor code?, dest rank= at line 870 in file ../../ofa_poll.c
[9] Abort: Got completion with error 12, vendor code?, dest rank= at line 870 in file ../../ofa_poll.c
[10] Abort: Got completion with error 12, vendor code?, dest rank= at line 870 in file ../../ofa_poll.c
[11] Abort: Got completion with error 12, vendor code?, dest rank= at line 870 in file ../../ofa_poll.c
[12] Abort: Got completion with error 12, vendor code?, dest rank= at line 870 in file ../../ofa_poll.c
send desc error
[30] Abort: Got completion with error 12, vendor code?, dest rank= at line 870 in file ../../ofa_poll.c

bernieb@yahoo.com wrote:

We were eventually able to track this down to bad cables that caused the IB links to fail. So it had nothing to do with the software or the OFED stack at all.

Bernie
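For anyone hitting this later: failing cables usually show up in the fabric's port error counters. A sketch of the hardware-side checks, assuming the standard OFED infiniband-diags tools are installed on the nodes:

```shell
# Walk the common IB diagnostic tools, skipping any that aren't installed.
for tool in ibstat ibqueryerrors; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "== $tool =="
    # ibstat shows per-port state and rate; ibqueryerrors scans the fabric
    # and reports ports with nonzero error counters (SymbolErrorCounter,
    # LinkDownedCounter, ...) -- the typical signature of a marginal cable.
    "$tool" 2>&1 | head -n 20
  else
    echo "$tool not installed on this host"
  fi
done
CHECKED=yes
```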
