MPI Library 4.1 and Torque

Dear all,

I'm trying to run a classical MPI test code on our cluster, and I'm still in trouble with it. I have installed the Intel Cluster Studio XE 2013 for Linux and Torque 4.1.3. 

If I don't use Torque ("mpirun -f machine -np 18 ./code"), it runs fine (machine is the list of nodes). If I use Torque, it runs until the walltime limit is reached and is then killed with the following errors:

=>> PBS: job killed: walltime 143 exceeded limit 120
[mpiexec@node4] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:221): assert (!closed) failed
[mpiexec@node4] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to send SIGUSR1 downstream
[mpiexec@node4] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@node4] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@node4] main (./ui/mpich/mpiexec.c:718): process manager error waiting for completion

Do you have any idea?

Thanks in advance,

M.

Hi Marc,

First I have to ask the obvious question.  How long does the job take to complete without Torque*?  If the job takes longer than the walltime limit (120 in the log above), increase the allocated time for the job.

If that is not the case, then please send me the output with I_MPI_DEBUG=5 and we'll proceed from there.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

The code should just print lines like "hello world i'm proccessor number ". I will run it with a longer walltime.

The first test, with "mpirun -genv I_MPI_DEBUG 5 -np 32 ./code" in my batch file, doesn't give any more information...

M.

Hi Marc,

Please send me the output from the following commands:

which mpirun

env | grep I_MPI

ldd ./code

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

Sorry for the delay. Here is the output you asked for:

$ which mpirun
/opt/intel/impi/4.1.0.024/intel64/bin/mpirun

$ env |grep I_MPI
I_MPI_ROOT=/opt/intel/impi/4.1.0.024

$ ldd ./code
linux-vdso.so.1 => (0x00007ffff81ff000)
libdl.so.2 => /lib64/libdl.so.2 (0x00000036efc00000)
libmpi.so.4 => /opt/intel/impi/4.1.0.024/intel64/lib/libmpi.so.4 (0x00007f3e13c0f000)
libmpigf.so.4 => /opt/intel/impi/4.1.0.024/intel64/lib/libmpigf.so.4 (0x00007f3e139df000)
librt.so.1 => /lib64/librt.so.1 (0x0000003fac200000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003fad200000)
libm.so.6 => /lib64/libm.so.6 (0x00000036f0800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000036fb000000)
libc.so.6 => /lib64/libc.so.6 (0x00000036f0000000)
/lib64/ld-linux-x86-64.so.2 (0x00000036ef800000)

Thanks for your interest,

M.

Hi Marc,

Please send the output with -verbose.  Let's see if that offers any insight about what's going on.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

I finally solved the problem. It was caused by interactions between OpenMPI and Intel MPI... Thanks a lot for your help.

M.

Hi Marc,

I'm glad to hear it's resolved now.  Are you attempting to use both OpenMPI and the Intel® MPI Library on the same program?  The two are not binary compatible.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

On our new cluster, I let the users choose the MPI they want. They only use one of them in a given program (I have set up module files, so they can load the MPI and compilers they want). I have to say that in the first tests, Intel MPI is much more efficient with our codes.

Sincerely,

M.

Hi Marc,

There is no problem with having both installed on the same cluster.  You just need to make certain that you are running with the same implementation that you use in compiling/linking.
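A quick way to check this is to compare the MPI library the binary is linked against with the MPI installation that provides mpirun. The following is only a minimal sketch: the function names mpi_lib_from_ldd and lib_under_root are made up for illustration, and it assumes the MPI library shows up in the ldd output as libmpi.so.* or libmpi_mt.so.*, as in the listings in this thread.

```shell
# Hypothetical helpers (not part of Intel MPI or Torque).

# Read ldd output on stdin and print the path of the first MPI library
# (libmpi.so.* or libmpi_mt.so.*); prints nothing if none is found.
mpi_lib_from_ldd() {
    awk '$1 ~ /^libmpi(_mt)?\.so/ && $2 == "=>" { print $3; exit }'
}

# Succeed if the library path ($1) lives under the MPI root directory ($2).
lib_under_root() {
    case "$1" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example use inside a job script (commented out so the sketch has no side effects):
#   lib=$(ldd ./code | mpi_lib_from_ldd)
#   lib_under_root "$lib" "$I_MPI_ROOT" || echo "binary and mpirun may not match: $lib"
```

If the check fails, the binary was probably linked against a different MPI implementation than the one whose mpirun is first in PATH, so it is worth reviewing which modules are loaded in the job script.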

I'm glad to hear that our implementation is working well for you.  If you do have performance concerns, or any others, feel free to let us know, and we'll see what can be done.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Hi James,

We have a user with a problem similar to the one Marc described at the top of this discussion.

This user is trying to launch a job with PBSPro (12.1.1) and here is the beginning of his script:

[here are the PBS parameters]

module load intelmpi/4.1.0
module load intel/13.0.1
module load fftw/3.3.3-intel_intelmpi-13.0.1_4.1.0


PW_LAUNCH="mpirun -genv I_MPI_FABRICS shm:tmi -np 160 /home/user/QE/espresso-5.0.2/bin/pw.x"

NEB_LAUNCH="mpirun -genv I_MPI_FABRICS shm:tmi -np 160 /home/user/QE/espresso-5.0.2/bin/neb.x"
...

I tried what you asked Marc to do:

which mpirun
/opt/software/intel/13.0.1/impi/4.1.0/bin64/mpirun

env | grep I_MPI
I_MPI_FABRICS=shm:tmi
I_MPI_ROOT=/opt/software/intel/13.0.1/impi/4.1.0

ldd ./job.sh
not a dynamic executable

Any clue?

Thank you.

Jean-Claude

Hi Jean-Claude,

Please provide the exact error message the user is getting.  Also, running ldd on the job script will not provide anything useful; you'll need to run it on the actual binary that is called.
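As a rough illustration, the binary can be picked out of a launch line such as the PW_LAUNCH string from the job script and passed to ldd. This is only a sketch: launch_binary is a hypothetical helper, and it assumes the executable is the last word of the launch line and that the line contains no glob characters or quoted arguments.

```shell
# Hypothetical helper: print the last word of a launch line,
# assumed here to be the executable that mpirun starts.
launch_binary() {
    for word in $1; do :; done   # unquoted on purpose: split the line into words
    printf '%s\n' "$word"
}

# Example use (commented out so the sketch has no side effects):
#   ldd "$(launch_binary "$PW_LAUNCH")"
```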

Hi James,

Sorry for the late reply.

Here is the error reported by our user:

[mpiexec@b325] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:221): assert (!closed) failed
[mpiexec@b325] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to send SIGUSR1 downstream
[mpiexec@b325] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@b325] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@b325] main (./ui/mpich/mpiexec.c:718): process manager error waiting for completion

And here is the correct ldd output (sorry about that):

ldd pw.x
    linux-vdso.so.1 =>  (0x00007fff6cb65000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003ecc800000)
    libmkl_scalapack_lp64.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_scalapack_lp64.so (0x00007f33d11ed000)
    libmkl_blacs_intelmpi_lp64.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so (0x00007f33d0fb1000)
    libmkl_intel_lp64.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007f33d08a0000)
    libmkl_intel_thread.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_intel_thread.so (0x00007f33cf90b000)
    libmkl_core.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_core.so (0x00007f33ce6fd000)
    libmpi_mt.so.4 => /opt/software/intel/13.0.1/impi/4.1.0/lib64/libmpi_mt.so.4 (0x00007f33ce0c3000)
    libmpigf.so.4 => /opt/software/intel/13.0.1/impi/4.1.0/lib64/libmpigf.so.4 (0x00007f33cde92000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003ecd000000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003eccc00000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003ecc400000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003ecc000000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ed0000000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003ecbc00000)

Thank you.

Jean-Claude

And the second binary called by mpirun:

ldd neb.x
    linux-vdso.so.1 =>  (0x00007fff8e7b9000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003ecc800000)
    libmkl_scalapack_lp64.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_scalapack_lp64.so (0x00007fd682970000)
    libmkl_blacs_intelmpi_lp64.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so (0x00007fd682734000)
    libmkl_intel_lp64.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007fd682023000)
    libmkl_intel_thread.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_intel_thread.so (0x00007fd68108e000)
    libmkl_core.so => /opt/software/intel/13.0.1/mkl/lib/intel64/libmkl_core.so (0x00007fd67fe80000)
    libmpi_mt.so.4 => /opt/software/intel/13.0.1/impi/4.1.0/lib64/libmpi_mt.so.4 (0x00007fd67f846000)
    libmpigf.so.4 => /opt/software/intel/13.0.1/impi/4.1.0/lib64/libmpigf.so.4 (0x00007fd67f615000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003ecd000000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003eccc00000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003ecc400000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003ecc000000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ed0000000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003ecbc00000)

Regards,

Jean-Claude

Try running the test program provided and see if it shows the same behavior.  The source is located at $I_MPI_ROOT/test; compile the one for whichever language you prefer.

Hi James,

Thanks for your answer (and sorry for my very late one...).

I compiled $I_MPI_ROOT/test/test.c, and it works fine.

Compilation:
mpicc test.c -o intel-test

Library check:
ldd intel-test
    linux-vdso.so.1 =>  (0x00007fff4651d000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003ecc800000)
    libmpi.so.4 => /opt/software/intel/13.0.1/impi/4.1.0/lib64/libmpi.so.4 (0x00007f5ae50f0000)
    libmpigf.so.4 => /opt/software/intel/13.0.1/impi/4.1.0/lib64/libmpigf.so.4 (0x00007f5ae4ec0000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003ecd000000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003eccc00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003ecc000000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003ecbc00000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ed0000000)

Modules loaded:
module load intel/13.0.1
module load intelmpi/4.1.0
module load fftw/3.3.3-intel_intelmpi-13.0.1_4.1.0

So the problem probably comes from our client's script.

Thank you.

Jean-Claude
