Multiple threads: forrtl: error (78): process killed (SIGTERM)


I am running CAM on one node.

mpif90 -V :

Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.1.163 Build 20130313
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY

mpif90 --showme

ifort -I/home/jph/openmpi-icc/include -I/home/jph/openmpi-icc/lib -L/home/jph/openmpi-icc/lib -lmpi_f90 -lmpi_f77 -lmpi -ldl -lm -lnuma -Wl,--export-dynamic -lrt -lnsl -lutil

mpirun -V : mpirun (Open MPI) 1.6.3

LDFLAGS=

-heap-arrays 10    -L/home/jph/netcdf/lib -lnetcdf -lnetcdff -L/home/jph/nersc/cam/CAM_1.0/benchmark/bld/esmf/lib/libO/linux_intel -lesmf -L/home/jph/openmpi-icc/lib -lmpi -I/home/jph/openmpi-icc/include -I/home/jph/openmpi-icc/lib -L/home/jph/openmpi-icc/lib -lmpi_f90 -lmpi_f77 -lmpi -ldl -lm -lnuma -Wl,--export-dynamic -lrt -lnsl -lutil

ldd cam
linux-vdso.so.1 => (0x00007fff515ff000)
libnetcdf.so.7 => /home/jph/netcdf/lib/libnetcdf.so.7 (0x00007f109d128000)
libnetcdff.so.5 => /home/jph/netcdf/lib/libnetcdff.so.5 (0x00007f109cce6000)
libmpi.so.1 => /home/jph/openmpi-icc/lib/libmpi.so.1 (0x00007f109c8e9000)
libmpi_f90.so.3 => /home/jph/openmpi-icc/lib/libmpi_f90.so.3 (0x00007f109c6e6000)
libmpi_f77.so.1 => /home/jph/openmpi-icc/lib/libmpi_f77.so.1 (0x00007f109c4ae000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003060000000)
libm.so.6 => /lib64/libm.so.6 (0x0000003060800000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003077000000)
librt.so.1 => /lib64/librt.so.1 (0x0000003061400000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003072200000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003070600000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003060c00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003060400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003066800000)
libcurl.so.4 => /usr/lib64/libcurl.so.4 (0x000000306f800000)
libimf.so => /opt/intel/composer_xe_2013.3.163/compiler/lib/intel64/libimf.so (0x00007f109bfc8000)
libsvml.so => /opt/intel/composer_xe_2013.3.163/compiler/lib/intel64/libsvml.so (0x00007f109b5fd000)
libirng.so => /opt/intel/composer_xe_2013.3.163/compiler/lib/intel64/libirng.so (0x00007f109b3f6000)
libintlc.so.5 => /opt/intel/composer_xe_2013.3.163/compiler/lib/intel64/libintlc.so.5 (0x00007f109b1a8000)
libifport.so.5 => /opt/intel/composer_xe_2013.3.163/compiler/lib/intel64/libifport.so.5 (0x00007f109af78000)
libifcore.so.5 => /opt/intel/composer_xe_2013.3.163/compiler/lib/intel64/libifcore.so.5 (0x00007f109ac42000)
libifcoremt.so.5 => /opt/intel/composer_xe_2013.3.163/compiler/lib/intel64/libifcoremt.so.5 (0x00007f109a8dc000)
/lib64/ld-linux-x86-64.so.2 (0x000000305fc00000)
libidn.so.11 => /lib64/libidn.so.11 (0x000000306ec00000)
libldap-2.4.so.2 => /lib64/libldap-2.4.so.2 (0x0000003074400000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x0000003069000000)
libkrb5.so.3 => /lib64/libkrb5.so.3 (0x0000003068800000)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x0000003068c00000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x0000003067400000)
libz.so.1 => /lib64/libz.so.1 (0x0000003061000000)
libssl3.so => /usr/lib64/libssl3.so (0x0000003072800000)
libsmime3.so => /usr/lib64/libsmime3.so (0x0000003073400000)
libnss3.so => /usr/lib64/libnss3.so (0x0000003071200000)
libnssutil3.so => /usr/lib64/libnssutil3.so (0x0000003071600000)
libplds4.so => /lib64/libplds4.so (0x0000003070a00000)
libplc4.so => /lib64/libplc4.so (0x0000003070e00000)
libnspr4.so => /lib64/libnspr4.so (0x0000003071a00000)
libssh2.so.1 => /usr/lib64/libssh2.so.1 (0x000000306ac00000)
liblber-2.4.so.2 => /lib64/liblber-2.4.so.2 (0x0000003074000000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003062000000)
libsasl2.so.2 => /usr/lib64/libsasl2.so.2 (0x0000003073c00000)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x0000003068400000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003068000000)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x0000003069c00000)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x0000003067800000)
libcrypt.so.1 => /lib64/libcrypt.so.1 (0x000000306e400000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003061c00000)
libfreebl3.so => /lib64/libfreebl3.so (0x000000306f000000)

limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 10240 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 64 kbytes
maxproc 1024
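The limit output above comes from csh/tcsh and shows a 10240 KB (10 MB) stack. Stack exhaustion is one commonly checked cause of a forrtl SIGSEGV. The link flags above include -heap-arrays 10, but that is a compiler option, and whether it was also used when compiling is not shown. Below is a minimal bash sketch that inspects the soft stack limit and tries to raise it before launching. It assumes bash rather than tcsh; in tcsh the equivalent is "limit stacksize unlimited".

```shell
#!/usr/bin/env bash
# Sketch for bash (assumption); the `limit` output above came from csh/tcsh,
# where the equivalent command is: limit stacksize unlimited
# Soft stack limit in KB; 10240 here corresponds to "stacksize 10240 kbytes".
stack_before=$(ulimit -s)
echo "current stack limit: ${stack_before}"

# Raising the soft limit only succeeds up to the hard limit (ulimit -Hs).
if ulimit -s unlimited 2>/dev/null; then
    echo "stack limit raised to: $(ulimit -s)"
else
    echo "could not raise stack limit; hard limit is $(ulimit -Hs)"
fi

# On a single node, ranks that mpirun starts from this shell should inherit
# the new limit (commented out so the sketch runs without MPI):
# mpirun -np 56 ./cam
```

If the hard limit is also 10240, only an administrator can raise it. On a multi-node run, the remote daemons do not necessarily inherit this shell's limits.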

free
             total       used       free     shared    buffers     cached
Mem:      65922644   13222824   52699820          0      41260    1123156
-/+ buffers/cache:   12058408   53864236
Swap:     33030136      67608   32962528
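For scale, the free output (in KB, the default unit of free) can be divided across the 56 ranks of the failing run. This is plain integer arithmetic on the numbers above. It assumes that all 56 ranks share this one node evenly and that the snapshot reflects memory available when cam starts.

```shell
#!/usr/bin/env bash
# Values copied from the `free` output above (units: KB).
total_kb=65922644
free_kb=52699820
ranks=56

# Integer division: memory per rank if the 56 ranks share this node evenly.
total_per_rank_kb=$(( total_kb / ranks ))
free_per_rank_kb=$(( free_kb / ranks ))
echo "total per rank: ${total_per_rank_kb} KB"   # 1177190 KB, about 1.1 GB
echo "free per rank:  ${free_per_rank_kb} KB"    # 941068 KB, about 0.9 GB
```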

When I run mpirun -np 4 ./cam, it works well.

But when I run mpirun -np 56 ./cam, I get a lot of errors:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image             PC                Routine            Line     Source
cam               0000000000572ACE  dp_coupling_mp_d_  271      dp_coupling.F90
cam               000000000084AE59  stepon_            550      stepon.F90
cam               00000000004BAF1F  MAIN__             241      cam.F90
cam               000000000043CF7C  Unknown            Unknown  Unknown
libc.so.6         000000306041ECDD  Unknown            Unknown  Unknown
cam               000000000043CE39  Unknown            Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image             PC                Routine            Line     Source
cam               0000000000572ACE  dp_coupling_mp_d_  271      dp_coupling.F90
cam               000000000084AE59  stepon_            550      stepon.F90
cam               00000000004BAF1F  MAIN__             241      cam.F90
cam               000000000043CF7C  Unknown            Unknown  Unknown
libc.so.6         000000306041ECDD  Unknown            Unknown  Unknown
cam               000000000043CE39  Unknown            Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image             PC                Routine            Line     Source

forrtl: error (78): process killed (SIGTERM)
Image             PC                Routine            Line     Source
libc.so.6         00000030604CE117  Unknown            Unknown  Unknown
libmpi.so.1       00002B5F7C7D8CB5  Unknown            Unknown  Unknown
libmpi.so.1       00002B5F7C70B682  Unknown            Unknown  Unknown
mca_coll_tuned.so 00002B5F83A538EE  Unknown            Unknown  Unknown
mca_coll_tuned.so 00002B5F83A58FFE  Unknown            Unknown  Unknown
libmpi.so.1       00002B5F7C719809  Unknown            Unknown  Unknown
libmpi_f77.so.1   00002B5F7CCCDCD8  Unknown            Unknown  Unknown
cam               00000000008CD29D  mpialltoallint_    879      wrap_mpi.F90
cam               000000000075FCA1  phys_grid_mp_tran  2402     phys_grid.F90
cam               0000000000572BFE  dp_coupling_mp_d_  291      dp_coupling.F90
cam               000000000084AE59  stepon_            550      stepon.F90
cam               00000000004BAF1F  MAIN__             241      cam.F90
cam               000000000043CF7C  Unknown            Unknown  Unknown
libc.so.6         000000306041ECDD  Unknown            Unknown  Unknown
cam               000000000043CE39  Unknown            Unknown  Unknown

forrtl: error (78): process killed (SIGTERM)
Image             PC                Routine            Line     Source
libc.so.6         00000030604CE117  Unknown            Unknown  Unknown
libmpi.so.1       00002AD849D52CB5  Unknown            Unknown  Unknown
libmpi.so.1       00002AD849C85682  Unknown            Unknown  Unknown
mca_coll_tuned.so 00002AD850FCD8EE  Unknown            Unknown  Unknown

Stack trace terminated abnormally.
forrtl: error (78): process killed (SIGTERM)

The code is the CAM 3.1.p2 source distribution; I have not edited it.

I followed http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/ up to step #2, but I did not find the answer.

I don't know why this happens. Can you help me? Thanks for your help.


What does your MPI hostfile look like? See the file /home/jph/openmpi-icc/etc/openmpi-default-hostfile for information on setting one up if you don't already have one.

Thank you. I have the file /home/jph/openmpi-icc/etc/openmpi-default-hostfile, but it is empty.

I have one node. It has 2 physical processors with 8 cores each, and the system reports 32 processors.
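These numbers already show that the failing run asks for more MPI ranks than the node has processors. Below is a small bash sketch of the arithmetic, using only the figures from this post. That the 32 reported processors are hyperthreads of the 16 physical cores is an assumption.

```shell
#!/usr/bin/env bash
# Figures from the post: 2 physical processors x 8 cores, 32 logical
# processors reported, and the failing run used -np 56.
sockets=2
cores_per_socket=8
logical_cpus=32
ranks=56

physical_cores=$(( sockets * cores_per_socket ))
echo "physical cores: ${physical_cores}"
echo "logical CPUs:   ${logical_cpus}"
echo "MPI ranks:      ${ranks}"

# More ranks than logical CPUs means several ranks must time-share each CPU.
if (( ranks > logical_cpus )); then
    echo "oversubscribed: ${ranks} ranks > ${logical_cpus} logical CPUs"
fi
```

This thread does not establish that oversubscription causes the SIGSEGV. It is one configuration difference between the working -np 4 run and the failing -np 56 run.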

I edited openmpi-default-hostfile to contain node1 slots=2, but the program still has errors.
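The slots value in an Open MPI hostfile tells mpirun how many processes to place on that host, so slots=2 describes far fewer slots than the node's cores. Below is a minimal sketch that writes a single-node hostfile with one slot per physical core. It assumes the hostname is node1, as in the post, and that 16 slots (2 x 8 cores) is wanted; both are illustrative and should be adjusted.

```shell
#!/usr/bin/env bash
# Sketch: write an Open MPI hostfile for this single node.
# Assumptions: hostname "node1" (from the post), one slot per physical core.
hostfile=$(mktemp)
printf 'node1 slots=16\n' > "$hostfile"
cat "$hostfile"

# Launch with no more ranks than slots (commented out so the sketch runs
# without MPI installed):
# mpirun --hostfile "$hostfile" -np 16 ./cam
```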
