Intel MPI with LSF got stdoe_cb assert (!closed) failed.

Dear all,

I am trying to run an application with Intel MPI and LSF on our cluster, but I am still having trouble with it. I have installed Intel Cluster Studio XE 2013 for Linux and Platform LSF 7.

The application is an extension of RAMS - High Resolution Forecast Europe, Greece, Athens, compiled with HDF5, Intel Fortran, and Intel MPI. The application normally runs for 6 hours, but sometimes we get errors like the ones below:

[mpiexec@cn104] stdoe_cb (./ui/utils/uiu.c:385): assert (!closed) failed
[mpiexec@cn104] control_cb (./pm/pmiserv/pmiserv_cb.c:831): error in the UI defined callback
[mpiexec@cn104] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@cn104] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:430): error waiting for event
[mpiexec@cn104] main (./ui/mpich/mpiexec.c:847): process manager error waiting for completion

The error happens very often but is not reproducible: rerunning the failed job with the same settings succeeds.

The bsub command:

$ bsub -x -n 144 -oo ini.log -eo error.log -K 'mpirun -np 144 ./iclams_opt -f ICLAMSIN'

Do you have any idea?

Thanks in advance,

Tingyang Xu

James Tullos (Intel):

Being an intermittent error, this will obviously be more difficult to debug.  What fabric are you using?  Does this occur with I_MPI_FABRICS=shm:tcp as well?

Hello James,

Thank you for your reply. I did not specify I_MPI_FABRICS when using mpirun. But since we are using InfiniBand with Mellanox switches, I think the fabric is ofa.
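
One way to confirm which fabric Intel MPI actually selected is the I_MPI_DEBUG environment variable, which makes the library report its startup configuration. A minimal sketch, reusing the iclams_opt command from the original post (the small -n 16 sizing is just an illustrative test run, not a recommendation):

```shell
# Sketch: run a small job with I_MPI_DEBUG=2 so Intel MPI reports the
# fabric it selected; check ini.log for a "data transfer modes" line.
bsub -x -n 16 -oo ini.log -eo error.log -K \
  'mpirun -genv I_MPI_DEBUG 2 -np 16 ./iclams_opt -f ICLAMSIN'

# The startup output should contain a line such as:
#   [0] MPI startup(): shm and ofa data transfer modes
```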

I will try I_MPI_FABRICS=shm:tcp with mpirun. By the way, if I switch the fabric to tcp, will it lower the performance of the software? We hope the software can finish computing in 6-7 hours.
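
For reference, a sketch of the original bsub command with the fabric forced to shm:tcp for this debugging test; only the -genv option is new, everything else is unchanged from the first post:

```shell
# Same job as before, but forcing Intel MPI onto shared memory + TCP
# instead of letting it auto-select the InfiniBand fabric.
bsub -x -n 144 -oo ini.log -eo error.log -K \
  'mpirun -genv I_MPI_FABRICS shm:tcp -np 144 ./iclams_opt -f ICLAMSIN'
```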

 

Thanks,

Tingyang Xu

James Tullos (Intel):

Using TCP will almost certainly lower the performance.  However, for debugging purposes, we're trying to isolate the cause of the problem, and whether or not it fails under TCP helps to do that.

Thank you for your explanation. Let me try tcp first.

Hello James,

I just found that this issue has not appeared for at least 5 days, since I changed the number of cores from 144 to 160. Before that, I was hitting the issue almost every day. We have 16 cores per node. So do you think an odd number of nodes could cause the issue?

 

Thanks,

Tingyang Xu

James Tullos (Intel):

It's possible.  If you think that's the concern, try running with different rank placement options: for example, run the 144 ranks across 10 nodes instead of 9, and decrease the number of ranks per node using -ppn.  Have you seen this error with other applications?
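
A hedged sketch of such a placement test, assuming 16-core nodes and the bsub flags from the original post (the -ppn 15 value is just one illustrative choice):

```shell
# Sketch: reserve 10 exclusive 16-core nodes (160 slots) but launch only
# 144 ranks, capped at 15 ranks per node via -ppn, so the job no longer
# fills an odd number of nodes edge to edge.
bsub -x -n 160 -oo ini.log -eo error.log -K \
  'mpirun -np 144 -ppn 15 ./iclams_opt -f ICLAMSIN'
```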

I see. I had never encountered that issue before because this was the first time I tried an odd number of nodes. Thank you for your help.
