MPI Library 4.1 and Torque

Marc O.

Dear all,

I'm trying to run a classical MPI test code on our cluster, and I'm still having trouble with it. I have installed Intel Cluster Studio XE 2013 for Linux and Torque 4.1.3.

If I don't use Torque, "mpirun -f machine -np 18 ./code" runs fine (machine is the list of nodes). If I use Torque, it runs and only stops at the end of the walltime, with the following errors:

=>> PBS: job killed: walltime 143 exceeded limit 120
[mpiexec@node4] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:221): assert (!closed) failed
[mpiexec@node4] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to send SIGUSR1 downstream
[mpiexec@node4] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@node4] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@node4] main (./ui/mpich/mpiexec.c:718): process manager error waiting for completion

Do you have any idea?

Thanks in advance,

M.

James Tullos (Intel)

Hi Marc,

First I have to ask the obvious question.  How long does the job take to complete without Torque*?  If the job takes more than 2 hours, increase the allocated time for the job.

If that is not the case, then please send me the output with I_MPI_DEBUG=5 and we'll proceed from there.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
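
As an illustration of increasing the allocated time, here is a minimal sketch that converts a number of seconds into the HH:MM:SS form accepted by Torque's "-l walltime=" resource request. It assumes the two numbers in the "=>> PBS: job killed" message are seconds; the helper name to_walltime and the script name job.sh are hypothetical.

```shell
#!/bin/sh
# Convert a number of seconds ($1) to the HH:MM:SS form used by "-l walltime=".
# Assumption: the limit reported in the PBS kill message is expressed in seconds.
to_walltime() {
    s=$1
    printf '%02d:%02d:%02d\n' $(( s / 3600 )) $(( (s % 3600) / 60 )) $(( s % 60 ))
}

# Usage (job.sh is a hypothetical job script):
#   qsub -l walltime="$(to_walltime 600)" job.sh
```

With this helper, the limit of 120 from the error message corresponds to 00:02:00, so a request such as 600 seconds would give the test noticeably more room.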

Marc O.

Hi James,

The code should only print things like "hello world, I'm processor number N". I will run it for a longer time.

The first test, with "mpirun -genv I_MPI_DEBUG 5 -np 32 ./code" in my batch file, doesn't give any more information.

M.

James Tullos (Intel)

Hi Marc,

Please send me the output from the following commands:

which mpirun

env | grep I_MPI

ldd ./code

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
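
The three commands above can also be gathered into one report, which is convenient to run from inside the batch job so that the output reflects the environment Torque actually provides. This is only a sketch; the function name collect_mpi_diag is hypothetical, and each step falls back to a short note instead of aborting when a command finds nothing.

```shell
#!/bin/sh
# Print the three diagnostics requested above for the binary given as $1.
collect_mpi_diag() {
    echo "== which mpirun"
    command -v mpirun || echo "(mpirun not on PATH)"
    echo "== env | grep I_MPI"
    env | grep I_MPI || echo "(no I_MPI variables set)"
    echo "== ldd $1"
    ldd "$1" 2>&1 || echo "(ldd failed for $1)"
}

# Usage, e.g. as the first line of the job script:
#   collect_mpi_diag ./code
```

Comparing the report from an interactive shell with the one from the batch job should show whether the two environments pick up the same mpirun and the same libraries.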

Marc O.

Hi James,

Sorry for the delay. Here is the output you asked for:

$ which mpirun
/opt/intel/impi/4.1.0.024/intel64/bin/mpirun

$ env |grep I_MPI
I_MPI_ROOT=/opt/intel/impi/4.1.0.024

$ ldd ./code
linux-vdso.so.1 => (0x00007ffff81ff000)
libdl.so.2 => /lib64/libdl.so.2 (0x00000036efc00000)
libmpi.so.4 => /opt/intel/impi/4.1.0.024/intel64/lib/libmpi.so.4 (0x00007f3e13c0f000)
libmpigf.so.4 => /opt/intel/impi/4.1.0.024/intel64/lib/libmpigf.so.4 (0x00007f3e139df000)
librt.so.1 => /lib64/librt.so.1 (0x0000003fac200000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003fad200000)
libm.so.6 => /lib64/libm.so.6 (0x00000036f0800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000036fb000000)
libc.so.6 => /lib64/libc.so.6 (0x00000036f0000000)
/lib64/ld-linux-x86-64.so.2 (0x00000036ef800000)

Thanks for your interest,

M.

James Tullos (Intel)

Hi Marc,

Please send the output with -verbose.  Let's see if that offers any insight about what's going on.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Marc O.

Hi James,

I finally solved the problem. It was caused by interactions between OpenMPI and Intel MPI. Thanks a lot for your help.

M.

James Tullos (Intel)

Hi Marc,

I'm glad to hear it's resolved now.  Are you attempting to use both OpenMPI and the Intel® MPI Library on the same program?  The two are not binary compatible.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Marc O.

Hi James,

On our new cluster, I let the users choose the MPI they want. They only use one of them in a given program (I have set up module files, so they can load the MPI and compilers they want). I have to say that in our first tests, Intel MPI is much more efficient with our codes.

Sincerely,

M.

James Tullos (Intel)

Hi Marc,

There is no problem with having both installed on the same cluster.  You just need to make certain that you are running with the same implementation that you use in compiling/linking.

I'm glad to hear that our implementation is working well for you.  If you do have performance concerns, or any others, feel free to let us know, and we'll see what can be done.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
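
One way to check that the launcher and the linked library come from the same implementation is to compare their installation prefixes. The sketch below assumes that both live under a single install root (as they do under I_MPI_ROOT in the output earlier in this thread); the function names libmpi_path and same_mpi_root are hypothetical.

```shell
#!/bin/sh
# Print the resolved path of libmpi from `ldd` output read on stdin.
# Only the line whose library name starts with "libmpi.so" is considered.
libmpi_path() {
    awk '$1 ~ /^libmpi\.so/ && $2 == "=>" { print $3; exit }'
}

# Succeed only if launcher path $1 and library path $2 both sit under root $3.
# Assumption: one MPI installation keeps its bin/ and lib/ under a common root.
same_mpi_root() {
    case "$1" in "$3"/*) ;; *) return 1 ;; esac
    case "$2" in "$3"/*) ;; *) return 1 ;; esac
}

# Usage inside a job script (assumes ./code is the binary and I_MPI_ROOT is set):
#   lib=$(ldd ./code | libmpi_path)
#   same_mpi_root "$(which mpirun)" "$lib" "$I_MPI_ROOT" || echo "MPI mismatch" >&2
```

Running this check at the start of the batch job, rather than only in an interactive shell, would catch the case where the job environment finds a different mpirun than the one used for compiling and linking.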
