Error message from dapl_module_poll.c while running MPI job

Dear experts,

 

I am hitting an error while running a parallel code compiled with Intel MPI. I start my jobs through an environment variable, $DO_PARALLEL, which contains:

mpiexec -machinefile /tmp/user/mpihosts-22 -np 16 -env I_MPI_DEBUG 5

Our cluster uses PBS to submit jobs.

The behavior is rather unpredictable: sometimes my code runs without problems, while at other times it fails with the following error:

OS: Scientific Linux SL release 5.5 (Boron)
[0:n010106] unexpected DAPL connection event 0x4008 from 34
Assertion failed in file ../../dapl_module_poll.c at line 4287: 0
internal ABORT - process 0
[9:n010404] unexpected disconnect completion event from [0:n010106]
[11:n010404] unexpected disconnect completion event from [0:n010106]
[22:n010312] unexpected disconnect completion event from [0:n010106]
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
[7:n010106] unexpected disconnect completion event from [15:n010404]
Assertion failed in file ../../dapl_module_util.c at line 1593: 0
internal ABORT - process 7

I guess that this is a communication problem.

Each node is equipped with 8 Quad-Core Intel® Xeon® Processor 5400 Series processors and has 16 GB of memory.

I did some research online and came to the conclusion that this might be a communication issue. The errors started appearing when I began sending large arrays to the slave processes. I would appreciate any ideas or explanations for this rather strange behavior.

 

Thanks,

Alex

Hi Alex,

What are the contents of your /etc/dat.conf file?  What version of the Intel® MPI Library are you using?  Can you run with I_MPI_DEBUG=5 and provide the output?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

 

Actually, the output I provided was generated with I_MPI_DEBUG=5. It is included by default in the environment variable $DO_PARALLEL that I use to start my MPI run. My first message shows its contents:

echo $DO_PARALLEL =mpiexec -machinefile /tmp/user/mpihosts-22 -np 16 -env I_MPI_DEBUG 5

I am using version 11.1 of the Intel mpif90 compiler. I have no /etc/dat.conf file.

 

Thanks,

Alex

Hi Alex,

Strange.  You should have additional output with I_MPI_DEBUG=5.  Can you send the output from

env | grep I_MPI

11.1 is a compiler version.  I need the version of the Intel® MPI Library, which can be found by using

mpirun -v

James.

Hi James,

mpirun --version
Intel(R) MPI Library for Linux, 64-bit applications, Version 4.0  Build 20100422
env | grep I_MPI
I_MPI_F77=ifort
I_MPI_F90=ifort
I_MPI_CC=icc
I_MPI_CXX=icpc
I_MPI_FC=ifort

Alex

Hi Alex,

Is there any chance you could try with the latest version of the Intel® MPI Library, Version 4.1 Update 1?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

Unfortunately, I can’t do that. The only Intel(R) MPI Library available on this machine is 4.0. Searching the forum here, I found a topic that gave me some clues about what to do. I have now included the following options in my PBS script:

export I_MPI_MPD_RSH=ssh
export I_MPI_USE_DYNAMIC_CONNECTIONS=0
export I_MPI_FABRICS_LIST="ofa,dapl,tcp,tmi"
export I_MPI_FALLBACK_DEVICE=1

I was able to run two jobs with this modification. I can’t conclude that it solves the problem, though, because PBS did not assign any of the problematic nodes to those jobs. Also, is it possible to see the source code of dapl_module_poll.c?
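To check whether the failures really cluster on particular nodes, the `[rank:node]` prefixes in the job output can be tallied. The following is only a rough sketch: the function name `tally_reporting_nodes` is hypothetical, and it assumes the error lines keep the `[rank:node] unexpected ...` shape seen in the log from the first post.

```shell
# Hypothetical helper (not part of Intel MPI): reads job output on stdin and
# prints "count node" for every node that reported an "unexpected ..." event.
# Assumes lines shaped like: [9:n010404] unexpected disconnect completion event ...
tally_reporting_nodes() {
    grep 'unexpected' \
        | sed -n 's/^\[[0-9]*:\([^]]*\)].*/\1/p' \
        | sort | uniq -c \
        | awk '{print $1, $2}' \
        | sort -k1,1nr -k2,2   # highest count first, then by node name
}
```

It could be used as `cat job1.out job2.out | tally_reporting_nodes`, where the file names stand in for the saved output of several failed runs. A node that dominates the tally across many runs would be a candidate for a closer look, although a single run says little on its own.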

 

Thanks,

Alex

Hi Alex,

I'm glad you were able to find a workaround.  If you do run into further problems, let us know.

Unless you have the source code already, then no.  We typically don't share the source code for our proprietary software products.

You could also get an evaluation of the latest version and install it into your user folder.

James.

Hi James,

Well, as I indicated in my previous message, the workaround I found is not a real solution to the problem. My calculation failed once again today. In general, isn’t it possible to tell what is going on by looking at the line where the error occurred?

Thanks,

Alex

Hi Alex,

Unfortunately, not really.  That error simply indicates that something caused the DAPL connection to fail.  Can you try to get the output from a failing run with I_MPI_HYDRA_DEBUG=1?

James.

Hi James,

I will try this. The situation is really strange: my code failed, then 2 minutes later I resubmitted the job and now it runs. It looks like a quite random pattern.

Thanks,

Alex

Hi Alex,

I'd recommend checking your IB setup then.  You can try using I_MPI_FABRICS=shm:tcp to bypass it, and if this runs consistently, then I'd definitely suspect an IB problem.
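Because the failure is intermittent, a single clean run over TCP proves little; repeating the same job several times under each fabric setting gives a fairer comparison. The sketch below is one possible way to do that. The function name `count_failures` is hypothetical, and it assumes that a failed run exits with a non-zero status and that the launcher forwards `I_MPI_FABRICS` from its environment to the ranks (if it does not, the variable can be passed with `-env`, as is done for `I_MPI_DEBUG` in the original command).

```shell
# Hypothetical helper: run a command N times with I_MPI_FABRICS set to the
# given value and print how many of those runs exited with a non-zero status.
# Usage: count_failures <fabrics> <runs> <command> [args...]
count_failures() {
    fabrics=$1
    runs=$2
    shift 2
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # The assignment prefix sets the variable only for this one command.
        I_MPI_FABRICS="$fabrics" "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_failures shm:tcp 10 $DO_PARALLEL ./my_app` and `count_failures shm:dapl 10 $DO_PARALLEL ./my_app` could be compared, where `./my_app` is a placeholder for the real executable. Zero failures over TCP alongside repeated failures over DAPL would point toward the InfiniBand setup.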

James.
