I am trying to run an application with Intel MPI and LSF on our cluster, but I am still having trouble with it. I have installed Intel Cluster Studio XE 2013 for Linux and Platform LSF 7.
The application is an extension of RAMS (High Resolution Forecast Europe, Greece, Athens), compiled with HDF5, Intel Fortran, and Intel MPI. A run normally takes 6 hours, but sometimes we get errors like the ones below:
[mpiexec@cn104] stdoe_cb (./ui/utils/uiu.c:385): assert (!closed) failed
[mpiexec@cn104] control_cb (./pm/pmiserv/pmiserv_cb.c:831): error in the UI defined callback
[mpiexec@cn104] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@cn104] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:430): error waiting for event
[mpiexec@cn104] main (./ui/mpich/mpiexec.c:847): process manager error waiting for completion
The error occurs often but is not reproducible: rerunning a failed job with the same settings succeeds.
The bsub command:
$ bsub -x -n 144 -oo ini.log -eo error.log -K 'mpirun -np 144 ./iclams_opt -f ICLAMSIN'
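Since a rerun with the same settings passes, one possible stopgap is to wrap the command in a retry loop. The following is only a minimal sketch, not something validated on our cluster: the `retry` function and the attempt count are hypothetical names introduced for illustration, and it assumes mpirun exits with a non-zero status when this error occurs. Note that each retry would restart the 6-hour run from the beginning.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to MAX_ATTEMPTS times until it succeeds.
# Usage: retry MAX_ATTEMPTS command [args...]
retry() {
    max_attempts=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the command; stop as soon as it exits with status 0.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    # Every attempt failed.
    return 1
}

# Example (commented out so that sourcing this file does not launch a job):
# retry 3 mpirun -np 144 ./iclams_opt -f ICLAMSIN
```

This does not address the root cause; it would only hide the intermittent failure while the cause is investigated.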
Do you have any idea what could be causing this?
Thanks in advance,