HYDT_bscu_wait_for_completion

Eh C. wrote:

I am getting the following message at seemingly arbitrary times when running a parallel job with an OpenFOAM application compiled with ICC and Intel MPI. When only one job is running, it is fine, but when multiple jobs run at once, they all crash.

lsb_launch(): Failed while waiting for tasks to finish.
[mpiexec@ys0271] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@ys0271] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@ys0271] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@ys0271] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion

James Tullos (Intel) wrote:

Hi,

Unfortunately, this error alone gives very little information.  Does the crash happen at the beginning, or during execution?  Is this on a single node, or multiple nodes?  When you say job, do you mean MPI ranks/processes, or do you mean separate sets of linked processes?

Also, please run with I_MPI_DEBUG=5 and send the output.
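For reference, a minimal way to enable this in an LSF job script might look like the sketch below (the rank count and solver name are placeholders, not taken from this thread):

# Enable Intel MPI debug output; level 5 prints process pinning,
# fabric selection, and startup details.
export I_MPI_DEBUG=5
mpirun -n 64 ./simpleFoam -parallel > debug_output.txt 2>&1

# Equivalently, the variable can be passed on the command line:
# mpirun -genv I_MPI_DEBUG 5 -n 64 ./simpleFoam -parallel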

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

Thanks. I attached the output of the job. There are several MPI jobs running simultaneously under the same user. I think there is a conflict between them. All of the jobs crashed and only one of them keeps executing.

Attachments:

output.txt (1.59 KB)
James Tullos (Intel) wrote:

Hi,

It is certainly possible that there is a conflict between the jobs, if there is a resource that is used by the first job but unavailable and needed by other jobs.  The output you sent has no MPI debug output, it appears to be LSF output only.  Are the jobs being run in the same folder?  If so, and all are using the same launch script, the host file is being overwritten.  I would recommend letting mpirun detect the host file provided by LSF rather than building a separate one if possible.
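As a rough illustration of that suggestion (the host-file name, rank count, and solver are placeholders, and the exact LSF integration depends on the Intel MPI version), the idea is to stop writing a shared host file per job:

# Before: every job regenerates the same host file, so concurrent jobs overwrite it.
# echo $LSB_HOSTS | tr ' ' '\n' > my_hosts
# mpirun -f my_hosts -n 64 ./solver -parallel

# After: launch without an explicit host file and let mpirun use the hosts
# that LSF allocated to this particular job.
mpirun -n 64 ./solver -parallel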

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

All the jobs are in the same folder, but I used different names for the host files. I wonder how it is possible to remove the conflict between them.

James Tullos (Intel) wrote:

Hi,

I would recommend putting them in different folders.  OpenFOAM* could be using a file in that folder that is common between the jobs.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

I put them in different folders, and again the jobs crashed.

James Tullos (Intel) wrote:

Hi,

Are the jobs running on different hosts?  Please try adding

-genv I_MPI_DEBUG 5

to the mpirun options and send the mpirun output.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

I checked; all the jobs are running on different hosts. I attached the log file, the damping error, and the output.

Attachments:

log.txt (23.04 MB)
dam.txt (600 Bytes)
out.txt (2.26 KB)
James Tullos (Intel) wrote:

Hi,

I'm still not seeing anything really indicative of the problem in the output.  Does the crash ever occur with only a single job running?  Are there common files used by OpenFOAM that could be locked by one job and thus inaccessible to a different job?  Does this crash happen with any other programs?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

I have used OpenFOAM on many different clusters and it was fine. What is different now is that the platform is LSF and I compiled with ICC and Intel MPI. All the jobs except one crash, and only that one keeps running. It may happen after 2 or 3 hours of running.

James Tullos (Intel) wrote:

Hi,

On this cluster, can you run outside of LSF?  Have you been able to run with ICC and IMPI on a different cluster?  Let's try to isolate one change at a time.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

I do not think it is possible to change the platform, but I can compile OpenFOAM with gcc and mpi_cc.

I also strongly suspect that this may be a memory problem, but I do not know how to configure it.
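One way to check whether a stack limit is involved, sketched here under the assumption that the site allows raising it in the job script (limits may also need to take effect on every compute node, not just the submission host), is:

# Inspect the current per-process limits on the execution host.
ulimit -s          # stack size
ulimit -a          # all limits
# Raise the stack limit for this job (if permitted) before launching.
ulimit -s unlimited
mpirun -n 64 ./solver -parallel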

James Tullos (Intel) wrote:

If the jobs are running on separate nodes, there should be no conflicts in memory usage.  Can you send me the case you are using with OpenFOAM, along with the OpenFOAM configuration you're using?  I'll see if I can replicate it here as well.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

James, now I am sure that the problem is a consequence of running several jobs on the cluster simultaneously. There are two scenarios: 1. memory for the user is limited and it is a stack problem; 2. the jobs are conflicting and it is a problem with the job scheduler. I have no idea how to isolate this problem. I think it would take a long time to replicate it on your cluster.

James Tullos (Intel) wrote:

Hi,

Does this occur with any other programs?  One of the clusters I use has LSF, and is not showing this problem, but I don't use OpenFOAM.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

Hi,

Our setup does not support launching jobs using Hydra; essentially, many of the ports Hydra needs are not open.

How is it possible to launch a parallel job without Hydra?

James Tullos (Intel) wrote:

Hi,

To use MPD instead, set I_MPI_PROCESS_MANAGER=mpd and use mpirun.  This will start an MPD ring on the hosts involved in the job, run the job, and stop the MPD ring.  Or, if you set up the MPD ring across your entire cluster ahead of time, you can use mpiexec instead.
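A short sketch of both approaches (host-file name and rank count are placeholders):

# Option 1: have mpirun create a temporary MPD ring for this job only.
export I_MPI_PROCESS_MANAGER=mpd
mpirun -n 64 ./a.out

# Option 2: start a persistent MPD ring first, then launch with mpiexec.
mpdboot -n <number_of_hosts> -f mpd.hosts   # mpd.hosts lists one host per line
mpdtrace                                    # verify the ring is up
mpiexec -n 64 ./a.out
mpdallexit                                  # shut the ring down when finished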

James.

Eh C. wrote:

Thanks, James. I ran mpdallexit, and no mpd was running on the host, and I exported I_MPI_PROCESS_MANAGER=mpd. But mpirun was still running and there was no error, which suggests it was still using Hydra. How can I check whether it is running with the MPD ring?

James Tullos (Intel) wrote:

Hi,

If you see "mpiexec.hydra" in the output from ps, then you are using Hydra.  If you just see "mpiexec", then you are using MPD.  Also, as I previously said, you can launch the MPD ring ahead of time, and then use mpiexec to launch using the MPD ring.
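For example, while a job is running, a quick (if rough) check is:

ps -ef | grep -E 'mpiexec|mpd' | grep -v grep
# "mpiexec.hydra ..." in the output  -> the job was launched through Hydra
# plain "mpiexec" plus mpd daemons   -> the job is using the MPD ring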

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

Hi James,

I have compiled my code with ICC and Intel MPI and can run it with mpirun. Using the LSF platform, with bsub -a openmpi -n number_cpus mpirun.lsf a.out, it does not run and I get the error.

James Tullos (Intel) wrote:

Hi,

What does

-a openmpi

do?  The Intel® MPI Library is not compatible with OpenMPI.  When I use LSF*, I run with a job script similar to the attached file (renamed to .txt for attaching), using

bsub -W <time> < run.sh

Try something similar to this and see if it works.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Attachments:

run.txt (225 Bytes)
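The attached script is not reproduced in the thread; a hypothetical sketch of such an LSF job script (paths, job name, and rank count are placeholders) could be:

#!/bin/bash
#BSUB -J impi_job             # job name
#BSUB -n 64                   # total MPI ranks requested from LSF
#BSUB -o impi_job.%J.out      # combined stdout/stderr file

# Set up the Intel MPI environment (path depends on the installation).
source /opt/intel/impi/<version>/bin64/mpivars.sh
export I_MPI_DEBUG=5

# Under LSF, mpirun picks up the hosts allocated to this job.
mpirun ./a.out

# submitted with:  bsub -W <time> < run.sh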
Eh C. wrote:

Hi James, Thanks.

I used -a intelmpi. The problem is that I frequently get the error below:

ys1466:5738:ee08e700: 22629011 us(22629011 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 43628981 us(20999970 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 64629008 us(21000027 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 85628981 us(20999973 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 106628981 us(21000000 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
When I use your script, it asks for the number of processors (i.e. mpirun -n $number_of_processors); it cannot run with just mpirun.

James Tullos (Intel) wrote:

Hi,

That appears to be a problem with InfiniBand*.  Please check your IB connections.  You can use ibdiagnet to do this.

Add a -n <numranks> to my script if needed.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

I used ibdiagnet and got the error below. I wonder whether this is the reason I cannot launch jobs using mpirun.lsf.
Plugin Name                                   Result     Comment
libibdiagnet_cable_diag_plugin                Succeeded  Plugin loaded
libibdiagnet_cable_diag_plugin-2.1.1          Failed     Plugin options issue - Option "get_cable_info" from requester "Cable Diagnostic (Plugin)" already exists in requester "Cable Diagnostic (Plugin)"

---------------------------------------------
Discovery
-E- Failed to initialize
---------------------------------------------
Summary
-I- Stage                     Warnings   Errors     Comment   
-I- Discovery                                       NA
-I- Lids Check                                      NA
-I- Links Check                                     NA
-I- Subnet Manager                                  NA
-I- Port Counters                                   NA
-I- Nodes Information                               NA
-I- Speed / Width checks                            NA
-I- Partition Keys                                  NA
-I- Alias GUIDs                                     NA

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log

-E- A fatal error occurred, exiting...

James Tullos (Intel) wrote:

Hi,

Yes, that is very likely part of the problem.  Try running with I_MPI_FABRICS=shm:tcp to use sockets instead of InfiniBand*, and that will help determine if there is another problem.  Once you have the InfiniBand* working correctly, try again without setting the fabric.
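For instance (the rank count and binary name are placeholders):

# Use shared memory within a node and TCP sockets between nodes,
# bypassing the InfiniBand/DAPL path that ibdiagnet flagged as faulty.
export I_MPI_FABRICS=shm:tcp
mpirun -n 64 ./a.out

# or per run:
# mpirun -genv I_MPI_FABRICS shm:tcp -n 64 ./a.out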

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh C. wrote:

Hi James,

I compiled my code with IBM PE and it launched with mpirun.lsf, but I am still wondering why it cannot launch with Intel MPI.

Thanks

James Tullos (Intel) wrote:

Is mpirun.lsf using InfiniBand*?

James.

Eh C. wrote:

Both mpirun.lsf and mpirun are using InfiniBand*. I wonder whether Intel MPI supports IBM PE.

Thanks

James Tullos (Intel) wrote:

Are you using the IBM* MPI implementation to compile, and the Intel® MPI Library to run?  That is not supported.  You will need to compile and run with the same implementation.

James.

Eh C. wrote:

I compiled with Intel MPI and tried to launch it using mpirun.lsf.

James Tullos (Intel) wrote:

Hi,

Please try running with I_MPI_FABRICS=shm:tcp and let me know if this works.  Also, please attach your /etc/dat.conf file.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
