INTEL MPI Hydra Crash

ptsouts wrote:

I am getting the following message at seemingly arbitrary times when running a parallel job using the latest Intel Fortran compiler and Intel MPI:

[proxy:0:12@n020] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP)) failed
[proxy:0:12@n020] main (./pm/pmiserv/pmip.c:387): demux engine error waiting for event
[mpiexec@n032] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@n032] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@n032] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@n032] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion

I am currently using the following command:

mpirun -np N ./a.exe

Should I specify anything else to ensure this error will not happen again?

James Tullos (Intel) wrote:

Hi ptsouts,

The error you are seeing occurs because one of the processes in your job ended abnormally. By itself, however, the information you have provided isn't sufficient to pin down the cause. Can you try running this command:

mpirun -np N -check_mpi ./a.exe

That will give additional information regarding the MPI calls being made. Please post the output of this command, preferably from one of the failed runs.

Can you provide any details of the program you are attempting to run? It would be best if you could provide a small snippet of the program that shows this behavior, so I can attempt to reproduce it here. Or if it is a publicly available code, a link to the source would work as well.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

James Tullos (Intel) wrote:

Hi ptsouts,

Have you tried running your program with the -check_mpi option? Are you able to provide any of the source code for the program, or another that can reproduce this behavior?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

gryghash wrote:

I seem to be having a similar problem. I either get the same error messages as in the original post, or I get "APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)".

I am running this example program: http://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program

It has worked with OpenMPI. It also works with Intel MPI on a single node, multiple cores. However, the multi-node runs all crash.
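For reference, a minimal MPI hello-world along the lines of that Wikipedia example might look like the following. This is a sketch, not the exact listing from the page, and it needs an MPI toolchain (e.g. mpiicc or mpicc) to build:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                   /* start the MPI runtime      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process's rank        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of processes  */
    MPI_Get_processor_name(name, &name_len);  /* host this rank runs on     */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();                           /* shut down the MPI runtime  */
    return 0;
}
```

Because each rank prints its hostname, a successful multi-node run should show output from more than one node, which makes it easy to see whether the crash happens only when a second node joins in.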

I am using mpiexec.hydra's Torque/PBS integration, and it works: it finds all the assigned nodes, and knows how many cores per node are to be used.

Here's the job:

cd ${PBS_O_WORKDIR}

mpiexec.hydra -verbose -rmk pbs -tmpdir /scratch/${PBS_JOBID} ./hello_mpi
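For context, the full submission script might look roughly like this (the resource requests and scratch path are hypothetical examples; adapt them to your site):

```shell
#!/bin/bash
#PBS -N hello_mpi
#PBS -l nodes=2:ppn=8        # example: two nodes, eight cores each
#PBS -l walltime=00:10:00
#PBS -j oe

cd ${PBS_O_WORKDIR}

# -rmk pbs lets Hydra read the node list and core counts from Torque/PBS;
# -tmpdir points Hydra's scratch files at a per-job directory.
mpiexec.hydra -verbose -rmk pbs -tmpdir /scratch/${PBS_JOBID} ./hello_mpi
```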

I am attaching the verbose output, redacted.

I have tried turning off the firewall on the compute nodes, but it didn't help; the errors remained the same.

Thanks for your attention,
--Dave Chin

Attachment: output_redacted.txt (59.91 KB)
James Tullos (Intel) wrote:

Hi Dave,

What version of OFED are you using? What does your /etc/dat.conf file look like?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
