HYDT_bscu_wait_for_completion

Eh C.

I am getting the following message intermittently when running a parallel job with an OpenFOAM application compiled with the Intel compiler (icc) and Intel MPI. When only one job is running, everything is fine, but the jobs crash when several are running at the same time.

lsb_launch(): Failed while waiting for tasks to finish.
[mpiexec@ys0271] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@ys0271] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@ys0271] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@ys0271] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion

James Tullos (Intel)

Hi,

Unfortunately, this error alone gives very little information.  Does the crash happen at the beginning, or during execution?  Is this on a single node, or multiple nodes?  When you say job, do you mean MPI ranks/processes, or do you mean separate sets of linked processes?

Also, please run with I_MPI_DEBUG=5 and send the output.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C.

Thanks. I attached the output of the job. Several different MPI jobs are being run simultaneously by the same user, and I think there is a conflict between them. All of the jobs crashed except one, which keeps running.

Attachments:

output.txt (1.59 KB)

James Tullos (Intel)

Hi,

It is certainly possible that there is a conflict between the jobs, if the first job holds a resource that the other jobs also need but cannot access.  The output you sent has no MPI debug output; it appears to be LSF output only.  Are the jobs being run in the same folder?  If so, and all are using the same launch script, the host file is being overwritten.  I would recommend letting mpirun detect the host file provided by LSF rather than building a separate one if possible.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
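
If a launch script does have to build its own host file, one way to avoid the overwrite described above is to put the LSF job ID in the file name. The following is a minimal sketch, not a tested recipe: `write_job_hostfile` is a hypothetical helper name, and it assumes `LSB_JOBID` and `LSB_HOSTS` are set by LSF inside the job, which may not hold on every site.

```shell
# Sketch: write a host file whose name is unique to this LSF job.
# write_job_hostfile is a hypothetical helper. LSB_JOBID and LSB_HOSTS are
# assumed to be provided by LSF; outside LSF the shell PID is used instead.
write_job_hostfile() {
    dir="${1:-.}"
    hostfile="${dir}/hosts.${LSB_JOBID:-$$}"
    # LSB_HOSTS is assumed to list one host name per slot; keep each once.
    # The unquoted expansion is deliberate: it splits on whitespace.
    for h in ${LSB_HOSTS}; do
        echo "$h"
    done | sort -u > "$hostfile"
    # Print the path so the caller can hand it to the launcher.
    echo "$hostfile"
}
```

The printed path can then be passed to whatever host file option the launch script already uses, so concurrent jobs started from the same folder no longer write to the same file.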

Eh C.

All the jobs are in the same folder, but I used different names for the host files. How can I remove the conflict between them?

James Tullos (Intel)

Hi,

I would recommend putting them in different folders.  OpenFOAM* could be using a file in that folder that is common between the jobs.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
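
Giving each job its own folder can be scripted so that no two jobs ever share a case directory. This is a rough sketch under assumptions: `prepare_case_dir` is a hypothetical helper name, and `LSB_JOBID` is assumed to be set by LSF inside the job.

```shell
# Sketch: copy an OpenFOAM case into a directory private to this job.
# prepare_case_dir is a hypothetical helper, not part of OpenFOAM or LSF.
prepare_case_dir() {
    # Strip a trailing slash so the copy is created next to the source,
    # not inside it.
    src="${1%/}"
    # Fall back to the shell PID when not running under LSF.
    dest="${src}_job${LSB_JOBID:-$$}"
    cp -r "$src" "$dest" || return 1
    # Print the new directory so the caller can cd into it.
    echo "$dest"
}
```

A job script would change into the printed directory before launching the solver, so any files the solver writes stay separate from those of the other jobs.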

Eh C.

I put them in different folders, and the jobs crashed again.

James Tullos (Intel)

Hi,

Are the jobs running on different hosts?  Please try adding

-genv I_MPI_DEBUG 5

to the mpirun options and send the mpirun output.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
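
For reference, the request above could look like the following in a job script. This is only a sketch: `run_with_debug` is a hypothetical wrapper name, the log file name is an arbitrary choice, and `LSB_JOBID` is assumed to be set by LSF inside the job.

```shell
# Sketch: launch a solver with Intel MPI debug output captured per job.
# run_with_debug is a hypothetical helper; the rank count and the solver
# command are supplied by the caller.
run_with_debug() {
    np="$1"
    shift
    # One log per job, so concurrent jobs do not mix their output.
    log="mpi_debug.${LSB_JOBID:-$$}.log"
    # -genv sets I_MPI_DEBUG for the launched processes. stderr is merged
    # into the log so error messages land next to the debug output.
    mpirun -genv I_MPI_DEBUG 5 -n "$np" "$@" > "$log" 2>&1
    status=$?
    echo "mpirun exit status: $status (output in $log)"
    return "$status"
}
```

A call such as `run_with_debug 16 simpleFoam -parallel` (the solver and rank count here are placeholders) leaves a log file that can be attached to the thread.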

Eh C.

I checked: all the jobs are running on different hosts. I attached the log file, the error dump, and the output.

Attachments:

log.txt (23.04 MB)
dam.txt (600 bytes)
out.txt (2.26 KB)

James Tullos (Intel)

Hi,

I'm still not seeing anything really indicative of the problem in the output.  Does the crash ever occur with only a single job running?  Are there common files used by OpenFOAM that could be locked by one job and thus inaccessible to a different job?  Does this crash happen with any other programs?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C.

I have used OpenFOAM on many different clusters without problems. What is different now is that the platform uses LSF and I compiled with ICC and Intel MPI. All the jobs except one crash, and only that one keeps running. The crash may happen after 2 or 3 hours of running.

James Tullos (Intel)

Hi,

On this cluster, can you run outside of LSF?  Have you been able to run with ICC and IMPI on a different cluster?  Let's try to isolate one change at a time.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C.

I do not think it is possible to change the platform, but I can compile OpenFOAM with gcc and mpi_cc.

I also strongly suspect that this may be a memory problem, but I do not know how to configure it.

James Tullos (Intel)

If the jobs are running on separate nodes, there should be no conflicts in memory usage.  Can you send me the case you are using with OpenFOAM, along with the OpenFOAM configuration you're using?  I'll see if I can replicate it here as well.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C.

James, now I am sure that the problem is a consequence of running several jobs on the cluster simultaneously. There are two scenarios: 1. Memory for the user is limited, and it is a problem with the stack. 2. The jobs are conflicting, and it is a problem with the job scheduler. I have no idea how to isolate this problem. I think it would take a long time to replicate it on your cluster.
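
One way to start testing the first scenario is to record the limits a job actually runs under and compare a lone job against several concurrent ones. The following is a minimal sketch, assuming a shell whose `ulimit` supports `-s` and `-v` (bash does); `report_limits` is a hypothetical helper name.

```shell
# Sketch: print the resource limits in effect for the current shell.
# report_limits is a hypothetical helper; values are shown in whatever
# units the shell's ulimit builtin reports, or as "unlimited".
report_limits() {
    echo "host: $(uname -n)"
    echo "stack size limit: $(ulimit -s)"
    echo "virtual memory limit: $(ulimit -v)"
}
```

Called at the top of the job script, this only reports the limits on the host where the script itself runs; the same commands would have to be run on the other hosts to see their limits.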

James Tullos (Intel)

Hi,

Does this occur with any other programs?  One of the clusters I use has LSF, and is not showing this problem, but I don't use OpenFOAM.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
