[SOLVED] Intel MPI crashes in more than one node


Iván S.

Dear all,

I am compiling different codes (details at the end) using the Intel Cluster Studio 2013 for Linux (C and Fortran compilers, MKL BLACS and MKL FFTW3) + Intel MPI 4.0.3.008. The programs run without problems on one computing node, but they crash when I try to use more than one computing node.

I have gathered all the possible information from the execution and MPI calls with these options of mpirun: -v -check_mpi -genv I_MPI_DEBUG 5. The resulting information is in the attached files.

The interesting information is at the end of the files, where you can find:

from vasp.log:
[23] ERROR: LOCAL:EXIT:SIGNAL: fatal error
[23] ERROR: Fatal signal 11 (SIGSEGV) raised.
[23] ERROR: Signal was encountered at:
[23] ERROR: hamil_mp_hamiltmu_ (/home/ivasan/programas/VASP/vasp.5.3_test/vasp)
[23] ERROR: After leaving:
[23] ERROR: mpi_allreduce_(*sendbuf=0x7fff5d1ce340, *recvbuf=0x18e19c0, count=1, datatype=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=0xffffffffc4060000 CART_SUB CART_CREATE CART_SUB CART_CREATE COMM_WORLD [18:23], *ierr=0x7fff5d1ce2ac->MPI_SUCCESS)

from abinit.log:
[23] ERROR: LOCAL:MPI:CALL_FAILED: error
[23] ERROR: Null communicator.
[23] ERROR: Error occurred at:
[23] ERROR: mpi_comm_rank_(comm=MPI_COMM_NULL, *rank=0x29319b8, *ierr=0x7fff83fabb74)
[23] ERROR: initmpi_grid_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/51_manage_mpi/initmpi_grid.F90:178)
[23] ERROR: invars1_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/invars1.F90:1015)
[23] ERROR: invars1m_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/invars1m.F90:186)
[23] ERROR: m_ab6_invars_mp_ab6_invars_load_ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/57_iovars/m_ab6_invars_f90.F90:548)
[23] ERROR: MAIN__ (/home/ivasan/programas/abinit/abinit-6.12.3b/src/98_main/abinit.F90:260)
[23] ERROR: main (/home/ivasan/programas/abinit/abinit-6.12.3b/bin/abinit)
[23] ERROR: (/lib64/libc-2.5.so)
[23] ERROR: (/home/ivasan/programas/abinit/abinit-6.12.3b/bin/abinit)

So in both cases the problems seem to be related to MPI.

What can I do to solve these errors?

Thanks in advance for your help.

Iván

CODES:

- VASP V5.3.2 (http://www.vasp.at/). I posted this issue at the support forum: http://cms.mpi.univie.ac.at/vasp-forum/forum_viewtopic.php?3.12037

- Abinit V6.12.3 (http://www.abinit.org/). I posted this issue at the support forum: http://forum.abinit.org/viewtopic.php?f=3&t=1851

Attachments:
abinit.log (139.24 KB)
vasp.log (103.93 KB)
Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
James Tullos (Intel)

Hi Ivan,

Would it be possible for you to try running with the latest version of the Intel® MPI Library, Version 4.1? This version is available at the Intel® Registration Center (https://registrationcenter.intel.com).

Also, please try running across two nodes with a simple hello world program using the same settings. You could also try oversubscribing a single node with the same jobs (rather than letting it go to two nodes) and see if that causes the same problem.

The VASP error may or may not be related to MPI. The message you are seeing simply indicates the last MPI call (which was successful based on the lack of error messages related to it) that occurred before the segmentation fault.

The Abinit error could be due to a communicator not being initialized correctly. It could be due to an incorrect communicator being passed to the MPI_Comm_rank routine. Check the MPI_Comm_rank call in the initmpi_grid routine.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Iván S.

Dear James,

Thanks for your comment. The same jobs (i.e. same programs with same inputs) run without problems on a single node. The problem arises when trying to use more than one computing node.

I will try using Intel MPI 4.1. In addition, people from Abinit have recommended that I use the latest update of the Intel Compiler V12 instead of the initial release of V13.

I will try both suggestions. I will post here the results of both tests just in case they can be useful (although it might take me some time).

Kind regards,

Iván

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
James Tullos (Intel)

Hi Ivan,

Try running a hello world program with "-genv I_MPI_DEBUG 5" on two nodes and send the output.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Iván S.

Dear James,

I have compiled the test program provided with the Intel MPI pack using:

>mpiicc -check_mpi test.c -o test.x

and then I have executed

>mpirun -IB --bootstrap ssh -genv I_MPI_DEBUG 5 -np NN -machinefile ./machines ./test.x

where NN=12 for 1 node and NN=24 for 2 nodes. The output files are attached, together with the "test.x.prot_.Xnode.txt" files that were also generated. I had to add the extension .txt to all the attached files in order to upload them.

Kind regards,

Iván

PS: the test.c program is located in the test directory of the Intel MPI installation folder.


Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
James Tullos (Intel)

Hi Ivan,

Ok, so you are able to run a simple application across multiple nodes. What happens if you run either VASP or Abinit with only 12 ranks, six on each node?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Iván S.

Hi James,

It seems that both programs also fail when using 12 ranks, 6 in each node.

Attached files are the outputs when using:
- abinit.6+6.log_.txt: -genv I_MPI_DEBUG 5
- abinit.6+6.log_.checkmpi.txt: -check_mpi -genv I_MPI_DEBUG 5
- vasp.6+6.log_.txt: -genv I_MPI_DEBUG 5
- vasp.6+6.log_.checkmpi.txt: -check_mpi -genv I_MPI_DEBUG 5

The error appears to be the same as in the 12+12 rank case.

Kind regards,

Iván


Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
Iván S.

Hi James,

I have installed Intel MPI 4.1 and the same errors occur. Nevertheless, this time I tried without -check_mpi and I got new errors that might help:

- When using -IB I get:

[11] Abort: Got completion with error 12, vendor code=81, dest rank=
at line 870 in file ../../ofa_poll.c

- When not using -IB or when using "-genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1"

[18:compute-0-3] unexpected disconnect completion event from [1:compute-0-2]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 18

I have seen this kind of error in other posts on this forum, but I couldn't find a solution.

At the moment I will try to recompile OFED just in case the new version of the Intel MPI "doesn't like" how it was previously compiled.

Any other suggestion is welcome.

Kind regards,

Iván

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
James Tullos (Intel)

Hi Ivan,

Recompiling OFED is not likely to solve the problem. Can you please try running the Intel® MPI Benchmarks across two nodes, preferably with 12 ranks per node just to match what you were seeing earlier?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Iván S.

Hi James,

You can find attached the output of:

mpirun -bootstrap ssh -v -genv I_MPI_DEBUG 5 -np 24 -machinefile ./machines IMB-MPI1

I have run this job from the command line (output.12+12.txt) and using SGE (output.12+12.sge.txt).

They have finished without problems.

By the way, our cluster has passed all the Intel Cluster Ready tests.

Kind regards,

Iván

Attachments:
output.1212.txt (293.13 KB)
output.1212.sge.txt (300.47 KB)
Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
James Tullos (Intel)

Hi Ivan,

Have you tried running either of these programs with another MPI implementation, such as MPICH?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Iván S.

Hi James,

I didn't try any other MPI implementation, only Intel MPI. I would like to have Intel MPI working. And somehow it is working ... only when I use one computing node. I don't know what else I can check ... I am running out of ideas.

Kind regards,

Iván

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
James Tullos (Intel)

Hi Ivan,

I'm trying to run ABINIT here. On the tutorial datasets I haven't run into a problem yet, but I can't run the one you posted on the ABINIT forum. What is the atomic psp file you are using?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Iván S.

Hi James,

You can find attached the files you need to run the job (just delete '.txt' extensions and '_' characters).

To run it, use the following (with the mpirun options before abinit, i.e. mpirun -v -genv I_MPI_DEBUG ...):

abinit < files >& log

Some comments about the cSi.in file: you can find at the beginning the variables related to the parallelization:
paral_kgb 1
npkpt 2
npband 12
wfoptalg 4
istwfk *1
nloalg 4
fftalg 401
fft_opt_lob 2
accesswff 1

Important points:
- The number of processors used must be equal to npkpt * npband (in this example 12*2=24)
- There is another parameter in the input file (nband) that has to be a multiple of npband (in this example nband = 1200 = npband *10).
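The two constraints above can be checked before submitting a job. The following is only a minimal sketch: the function name layout_is_valid and its nprocs argument (the value passed to mpirun -np) are hypothetical and are not part of ABINIT.

```c
/* Returns 1 if the parallel layout satisfies both rules stated in the post,
 * 0 otherwise. nprocs is the number of MPI ranks requested with "mpirun -np"
 * (an assumed input; ABINIT itself does not expose this function). */
int layout_is_valid(int nprocs, int npkpt, int npband, int nband)
{
    if (npkpt <= 0 || npband <= 0 || nband <= 0)
        return 0;
    if (nprocs != npkpt * npband)   /* rule 1: ranks = npkpt * npband */
        return 0;
    if (nband % npband != 0)        /* rule 2: nband is a multiple of npband */
        return 0;
    return 1;
}
```

With the values from this post, layout_is_valid(24, 2, 12, 1200) accepts the layout, while requesting only 12 ranks with the same input would be rejected.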

Let me know if you have any comment.

Thanks a lot for your time and kind regards,

Iván

Attachments:
files.txt (50 bytes)
csi.in.txt (17.95 KB)
14-si.lda.fhi.txt (132.68 KB)
Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
James Tullos (Intel)

Ivan,

I'm trying to run the dataset you sent. However, even on a single node it appears to be hanging. How long should this set take to run?

James.

Iván S.

Dear James,

It is a long job. To check that it works without problems, you can search for this sentence:

"Iteration: ( 1/50) Internal Cycle: (1/1)"

in the log file. If you find it, the job is running without problems (and you can kill it).
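The check described above can be automated with a small helper that scans the log for the marker sentence. This is a sketch only: log_contains is a hypothetical name, and the marker string must match the spacing that actually appears in the ABINIT log.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if some line of the file contains marker, 0 if none does,
 * and -1 if the file cannot be opened. Lines longer than the buffer are
 * read in pieces, so a marker split across two pieces would be missed. */
int log_contains(const char *path, const char *marker)
{
    char line[4096];
    int found = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, marker) != NULL) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

For example, log_contains("log", "Internal Cycle: (1/1)") could be polled while the job runs; a return value of 1 means the job has reached the iteration section.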

When I run this job on one node I have no problems. When I choose more than one node, the job just stops before starting that section and the sentence above does not appear.

Regards,

Iván

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
Iván S.

Hi James,

I have been also doing my homework, and I have found the origin of the problems: the submitted jobs require a large amount of memory, and there are problems with the stack.

I have found different possible solutions to this problem:

- Including "ulimit -s unlimited" in the .bashrc or .bash_profile file (this did not work in my case).

- Including the option "-heap-arrays" when compiling the application, but in my case the tasks "ate" all the memory of the computing nodes and "died".

- Including the option "-mcmodel=large" when compiling the application, but it didn't work in my case.

- Adding this type of file to the application:
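#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

/* The trailing underscore presumably lets Fortran code call this routine as "stacksize". */
void stacksize_()
{
    int res;
    struct rlimit rlim;

    /* Report the stack limits inherited from the environment. */
    getrlimit(RLIMIT_STACK, &rlim);
    printf("Before: cur=%lld,hard=%lld\n", (long long)rlim.rlim_cur, (long long)rlim.rlim_max);

    /* Request an unlimited stack for both the soft and the hard limit. */
    rlim.rlim_cur = RLIM_INFINITY;
    rlim.rlim_max = RLIM_INFINITY;
    res = setrlimit(RLIMIT_STACK, &rlim);

    /* Report the limits again; res is 0 if setrlimit succeeded. */
    getrlimit(RLIMIT_STACK, &rlim);
    printf("After: res=%d,cur=%lld,hard=%lld\n", res, (long long)rlim.rlim_cur, (long long)rlim.rlim_max);
}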
This routine explicitly removes the limit on the stack size; it is called at the very beginning of the application. This was the option that worked for me.
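A more conservative variant of the routine above, given only as a sketch: instead of requesting RLIM_INFINITY for both limits, it raises just the soft limit to the existing hard limit, which an unprivileged process is always allowed to do (raising the hard limit itself can fail without the required privileges). The name raise_stack_soft_limit is hypothetical and does not come from the original post, and whether the resulting limit is large enough depends on how the hard limit is configured on the compute nodes.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft stack limit to the current hard limit.
 * Returns 0 on success and -1 if either system call fails. */
int raise_stack_soft_limit(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_STACK, &rlim) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* The hard limit is left untouched; only the soft limit changes. */
    rlim.rlim_cur = rlim.rlim_max;

    if (setrlimit(RLIMIT_STACK, &rlim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

As with the original routine, this would need to run at the very beginning of the application, before any large stack allocation happens.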

In summary, there was a problem that occurred when jobs were submitted and they required a large amount of memory.

In any case, thanks for your time!

Best regards,

Iván

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
James Tullos (Intel)

Hi Ivan,

Great, I'm glad the problem is solved. Please feel free to contact us again if there are issues in the future.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

somanath999gmail.com

Hi Ivan,

I am having the same problem as you when running the UM model. I compiled this model using Intel® MPI, but when running it I get the following error:

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
libc.so.6          0000003488A30215  Unknown               Unknown  Unknown
libc.so.6          0000003488A31CC0  Unknown               Unknown  Unknown
N216L85.exe        000000000040A416  Unknown               Unknown  Unknown
N216L85.exe        000000000044D172  Unknown               Unknown  Unknown
libmpi.so.4        00002B8DB34627FA  Unknown               Unknown  Unknown
libmpi.so.4        00002B8DB3386661  Unknown               Unknown  Unknown
libmpigf.so.4      00002B8DB393D279  Unknown               Unknown  Unknown
N216L85.exe        000000000175EADD  Unknown               Unknown  Unknown
N216L85.exe        0000000001058602  Unknown               Unknown  Unknown
N216L85.exe        000000000106DFEA  Unknown               Unknown  Unknown
N216L85.exe        0000000000B91CFD  Unknown               Unknown  Unknown
N216L85.exe        000000000089B3FA  Unknown               Unknown  Unknown
N216L85.exe        00000000004E1F04  Unknown               Unknown  Unknown
N216L85.exe        000000000048548F  Unknown               Unknown  Unknown
N216L85.exe        000000000040D76B  Unknown               Unknown  Unknown
N216L85.exe        0000000000404C7C  Unknown               Unknown  Unknown
N216L85.exe        0000000000404BAC  Unknown               Unknown  Unknown
libc.so.6          0000003488A1D974  Unknown               Unknown  Unknown
N216L85.exe        0000000000404AB9  Unknown               Unknown  Unknown
send desc error
send desc error
[11] Abort: Got completion with error 12, vendor code=81, dest rank=
 at line 870 in file ../../ofa_poll.c
[9] Abort: Got completion with error 12, vendor code=81, dest rank=
 at line 870 in file ../../ofa_poll.c

I need a solution to this error. Please suggest some ways to resolve it.

Thanks in advance for your help.

Somanath Moharana

Attachments:
um-error.txt (12.49 KB)
Iván S.

Dear Somanath,

According to your attached file, it seems that your problem is related to shared libraries that are not found on the executing nodes:

    pbs_demux: error while loading shared libraries: libtorque.so.2: cannot open shared object file: No such file or directory

Make sure that you have access to the libraries you need in all the nodes you use (check your LD_LIBRARY_PATH in your profile). In addition, it seems that you are using PBS/Torque. The problem could be due to a bad integration between Intel MPI and PBS/Torque.
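One way to check the LD_LIBRARY_PATH suggestion on each node is a small program that looks for the library file in every directory listed in that variable. This is a sketch under assumptions: found_in_ld_library_path is a hypothetical helper, and it inspects only LD_LIBRARY_PATH, whereas the dynamic loader also consults other locations (for example the system library cache and default directories), so a result of 0 does not by itself prove that loading will fail.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if libname is readable in some directory listed in
 * LD_LIBRARY_PATH, and 0 otherwise (including when the variable is unset). */
int found_in_ld_library_path(const char *libname)
{
    const char *env = getenv("LD_LIBRARY_PATH");
    char *copy;
    char *dir;
    int found = 0;

    if (env == NULL || libname == NULL)
        return 0;

    /* strtok modifies its input, so work on a private copy. */
    copy = malloc(strlen(env) + 1);
    if (copy == NULL)
        return 0;
    strcpy(copy, env);

    for (dir = strtok(copy, ":"); dir != NULL; dir = strtok(NULL, ":")) {
        char path[4096];
        int n = snprintf(path, sizeof path, "%s/%s", dir, libname);

        if (n < 0 || (size_t)n >= sizeof path)
            continue;               /* skip paths that do not fit */
        if (access(path, R_OK) == 0) {
            found = 1;
            break;
        }
    }
    free(copy);
    return found;
}
```

Running this on a compute node with found_in_ld_library_path("libtorque.so.2") would show whether the library from the error message is visible through the environment that the job actually receives.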

Anyway, check this post where I explained in more detail what I did to solve my problem:

http://software.intel.com/en-us/forums/topic/370967

Regards,

Ivan

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
somanath999gmail.com

Hi Ivan,

Thanks for your suggestion. I tried to run the UM model without PBS/Torque, but I am facing the same problem as before:

N216L85.exe        000000000044D172  Unknown               Unknown  Unknown
libmpi.so.4        00002B11F5A697FA  Unknown               Unknown  Unknown
libmpi.so.4        00002B11F598D661  Unknown               Unknown  Unknown
libmpigf.so.4      00002B11F5F44279  Unknown               Unknown  Unknown
N216L85.exe        000000000175EADD  Unknown               Unknown  Unknown
N216L85.exe        0000000001059A61  Unknown               Unknown  Unknown
N216L85.exe        000000000106DFEA  Unknown               Unknown  Unknown
N216L85.exe        0000000000B91CFD  Unknown               Unknown  Unknown
N216L85.exe        000000000089B3FA  Unknown               Unknown  Unknown
N216L85.exe        00000000004E1F04  Unknown               Unknown  Unknown
N216L85.exe        000000000048548F  Unknown               Unknown  Unknown
N216L85.exe        000000000040D76B  Unknown               Unknown  Unknown
N216L85.exe        0000000000404C7C  Unknown               Unknown  Unknown
N216L85.exe        0000000000404BAC  Unknown               Unknown  Unknown
libc.so.6          000000354F81D974  Unknown               Unknown  Unknown
N216L85.exe        0000000000404AB9  Unknown               Unknown  Unknown
[7:compute-0-11.local] unexpected disconnect completion event from [2:compute-0-5.local]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 7
[6:compute-0-11.local] unexpected disconnect completion event from [2:compute-0-5.local]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 6
[11:compute-0-17.local] unexpected disconnect completion event from [2:compute-0-5.local]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 11

I think it is the same memory issue you described before, but I can't find a solution for this error.

Kind Regards,

Somanath Moharana

Attachments:
um-log.txt (13.38 KB)
Iván S.

Dear Somanath,

Just one quick test: can you log in to one of the computing nodes and execute the program there? It is just to check that the libraries are accessible from the computing nodes.

Regards,

Ivan

Iván Santos Tejido Dpto. Electricidad y Electrónica Universidad de Valladolid, Spain
somanath999gmail.com

Dear Ivan,

There is no problem when running sample MPI codes on the compute nodes. I think the cause is something else.

Regards,

Somanath Moharana

somanath999gmail.com

Dear Ivan,

The problem is solved. The error was caused by incompatible libraries of the model, which was initially configured for an IBM machine.

Thanks for your help and support.

Regards,

Somanath Moharana
