WRF 3.9.1.1 run-time failures with the Intel compiler (Intel debug mode, Open MPI and MPICH work fine)


Hi,
I am trying to run a 168-hour WRF simulation (two nested domains) on a RHEL 7.4 cluster in an Intel 19u2 environment.
I prepared WRF's input files (wrfinput01, wrfinput02 and wrfbdy) using WPS (+ real.exe) compiled with the Intel compilers.

When I run the simulation with WRF compiled with debug settings (./configure -D, select sm; ./compile em_real), the simulation works fine (it completed 48 hours of simulated time without any issue) but is very slow, so I terminated it.

Then I compiled an optimized version of WRF with Intel 2019 and faced several types of issues:
1. Simulation hang. Initially I compiled WRF with the hybrid setting (sm+dm), but with this the simulation on the same set of input (as mentioned above) got stuck after simulating 10 minutes. After removing -DMPI2_THREAD_SUPPORT and -ip and downgrading O3 to O2, the simulation ran for 24 hours and hung again. I tried 3 times to check whether it was a fluke, but the simulation gets stuck at the same timestep (after 24 hours of simulation). The same simulation with 2018r3 terminates at the 16th hour with a segfault. I am now using the dm (instead of sm+dm) setting to compile WRF.

2. Simulation segfault. As the debug version was working fine, I decided to introduce optimization gradually with Intel 2019, but then the simulation fails (segfault) after 2 minutes. With O2 I tried various combinations (-heap-arrays/-no-heap-arrays), but the simulation still fails.

 

Then I tried 3 other MPI implementations: MPICH, MVAPICH2 and Open MPI. Note that I used the input generated by the Intel-built WPS (+ real.exe) for the following:
MPICH2 - was slow, but the simulation completed more than 48 hours without any issue
MVAPICH2 - hung
Open MPI - successfully completed 168 hours of simulation in 21 hours of walltime

 

This is definitely not an issue with the input or the simulation setup, since the simulation works fine in debug mode and with Open MPI. Because the issue varies with the compiler version and the compilation flags used, I am not sure whether a debugger (gdb) would help me in this case.

Could you please advise me on the methodology I should follow to work out valid compilation flags for this issue?

 

Here is the compilation line with which wrf.exe was generated in debug mode:

mpiifort -f90=ifort -o wrf.exe   -ip -fp-model precise -w -ftz -align all -fno-alias -FR -convert big_endian -xHost -fp-model fast=2 -no-heap-arrays -no-prec-div -no-prec-sqrt -fno-common -xCORE-AVX2 -g -g -O0 -fno-inline -no-ip -g -traceback  -ip -xHost -fp-model fast=2 -no-prec-div -no-prec-sqrt -ftz -align all -fno-alias -fno-common -xCORE-AVX2 -g  wrf.o ../main/module_wrf_top.o libwrflib.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/external/fftpack/fftpack5/libfftpack.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/external/io_grib1/libio_grib1.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/external/io_grib_share/libio_grib_share.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/external/io_int/libwrfio_int.a -L/home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/external/esmf_time_f90 -lesmf_time /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/external/RSL_LITE/librsl_lite.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/frame/module_internal_header_util.o /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/frame/pack_utils.o  -L/home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_intel19u2_debug/WRFV3/external/io_netcdf -lwrfio_nf -L/home/puneet/MyTempSoftwares/WRFV3.9.1.1_Deps_intelmpi2019u2//lib -lnetcdff -lnetcdf     -L/home/puneet/MyTempSoftwares/WRFV3.9.1.1_Deps_intelmpi2019u2//lib -lhdf5_fortran -lhdf5 -lm -lz
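
The line above repeats several options and passes -fp-model with two different values (precise and fast=2). A minimal Python sketch for spotting this kind of repetition in a long command line follows. The function name `summarize_flags`, the `TAKES_VALUE` set and the shortened example string are illustrative assumptions. Which of two conflicting values the compiler actually honors should be checked against the ifort documentation rather than inferred from this script.

```python
import shlex
from collections import Counter, defaultdict

# Options whose value is passed as a separate token (assumption: only ones
# that appear in the link line quoted above are listed here).
TAKES_VALUE = {"-fp-model", "-convert", "-align", "-o"}

def summarize_flags(command):
    """Return (repeated, conflicting) for a compiler command line.

    repeated:    {flag: count} for flags given more than once
    conflicting: {flag: [values...]} for valued flags given different values
    """
    tokens = shlex.split(command)
    counts = Counter()
    values = defaultdict(list)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in TAKES_VALUE and i + 1 < len(tokens):
            counts[tok] += 1
            if tokens[i + 1] not in values[tok]:
                values[tok].append(tokens[i + 1])
            i += 2  # skip the value token as well
            continue
        if tok.startswith("-"):
            counts[tok] += 1
        i += 1
    repeated = {f: n for f, n in counts.items() if n > 1}
    conflicting = {f: v for f, v in values.items() if len(v) > 1}
    return repeated, conflicting

# Shortened excerpt of the debug line quoted above.
line = ("mpiifort -f90=ifort -o wrf.exe -ip -fp-model precise -xHost "
        "-fp-model fast=2 -g -g -O0 -fno-inline -no-ip -g -traceback "
        "-ip -xHost -fp-model fast=2")
repeated, conflicting = summarize_flags(line)
print(repeated)
print(conflicting)
```

Running the same function on the full line would also show that both -ip and -no-ip are present, which is worth resolving before comparing debug and optimized builds.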

Here is the compilation line for Open MPI:
 

time mpif90 -DMPI2_SUPPORT -o wrf.exe -fopenmp -O2 -ftree-vectorize -funroll-loops -w -ffree-form -ffree-line-length-none -fconvert=big-endian -frecord-marker=4 wrf.o ../main/module_wrf_top.o libwrflib.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/external/fftpack/fftpack5/libfftpack.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/external/io_grib1/libio_grib1.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/external/io_grib_share/libio_grib_share.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/external/io_int/libwrfio_int.a -L/home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/external/esmf_time_f90 -lesmf_time /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/external/RSL_LITE/librsl_lite.a /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/frame/module_internal_header_util.o /home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/frame/pack_utils.o -L/home/puneet/MySoftwares/UTILS/WRF/3.9.1.1_openmpi3.1.2_fullopt/WRFV3/external/io_netcdf -lwrfio_nf -L/home/puneet/MyTempSoftwares/WRFV3.9.1.1_Deps_gcc4.8.5//lib -lnetcdff -lnetcdf -L/home/puneet/MyTempSoftwares/WRFV3.9.1.1_Deps_gcc4.8.5//lib -lhdf5_fortran -lhdf5 -lm -lz

 

Please let me know if compilation logs / more information is required from my end. 


Two questions: Does your code do I/O that might conflict with the MPI parallelization? And did you try using debug flags (all compile-time and run-time checks on, such as NaN initialization, FPE trapping and bounds checking) to see whether the hang might be a programming error?

 

Hi,
I am using open source software: http://www2.mmm.ucar.edu/wrf/users/download/get_sources.html#WRF-ARW. Yes, the software performs I/O: after every 3 hours of simulation, a .nc (netCDF) file is generated. I am not sure whether the MPI parallelization is affected while the file is being written. Here are the files generated during a simulation that hung:

-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 05:18 wrfout_d01_2016-03-10_00:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 05:18 wrfout_d02_2016-03-10_00:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 05:32 wrfout_d01_2016-03-10_03:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 05:32 wrfout_d02_2016-03-10_03:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 05:45 wrfout_d01_2016-03-10_06:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 05:45 wrfout_d02_2016-03-10_06:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 05:58 wrfout_d01_2016-03-10_09:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 05:58 wrfout_d02_2016-03-10_09:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 06:11 wrfout_d01_2016-03-10_12:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 06:11 wrfout_d02_2016-03-10_12:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 06:24 wrfout_d01_2016-03-10_15:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 06:25 wrfout_d02_2016-03-10_15:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 06:38 wrfout_d01_2016-03-10_18:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 06:38 wrfout_d02_2016-03-10_18:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 06:51 wrfout_d01_2016-03-10_21:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 06:51 wrfout_d02_2016-03-10_21:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 07:04 wrfout_d01_2016-03-11_00:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 07:05 wrfout_d02_2016-03-11_00:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 07:18 wrfout_d01_2016-03-11_03:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 07:18 wrfout_d02_2016-03-11_03:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 07:31 wrfout_d01_2016-03-11_06:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 07:32 wrfout_d02_2016-03-11_06:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 07:45 wrfout_d01_2016-03-11_09:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 07:45 wrfout_d02_2016-03-11_09:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 07:58 wrfout_d01_2016-03-11_12:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 07:59 wrfout_d02_2016-03-11_12:00:00.nc
-rw-r--r-- 1 puneet internalusers 1504737500 Mar 19 08:12 wrfout_d01_2016-03-11_15:00:00.nc
-rw-r--r-- 1 puneet internalusers 3405553220 Mar 19 08:12 wrfout_d02_2016-03-11_15:00:00.nc

I have only tried the debug flags in the "configure.wrf_inteldebug_intelforum.txt" file. If it were a programming error, it should have behaved the same way (hang/segfault) with Open MPI and with the Intel debug settings.

However, I am currently trying to gradually add stronger optimization flags to configure.wrf_inteldebug_intelforum.txt (for example, replacing O0 with O2).
I will also try your suggestion of adding the following checking flags to the optimized setting (O2 + -xHost):
-fpe0 -check noarg_temp_created,bounds,format,output_conversion,pointers,uninit -ftrapuv -unroll0 -u

If there are any additional flags that could help me with this, please let me know.

Attached is the file (containing the compilation flags) with which the generated wrf.exe was able to simulate 168 hours (nested domain) successfully. I confirmed this observation on 3 runs.

Now I will incrementally add optimization flags to narrow down the issue.
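
The incremental approach described above can be sketched as a small loop: start from a flag set that is known to work and add one candidate flag at a time until a run fails. This is only an illustration. `first_failing_flag` and `run_ok` are hypothetical names, `run_ok` stands in for whatever rebuild-and-run procedure is actually used (it is not part of WRF), and the baseline shown is a placeholder rather than the contents of the attached file. The candidate flags are ones mentioned in this thread.

```python
def first_failing_flag(baseline, candidates, run_ok):
    """Add candidates to baseline one at a time; return the first flag
    whose addition makes run_ok() fail, or None if all of them pass.

    run_ok(flags) is a caller-supplied callable (hypothetical) that rebuilds
    and runs the model with the given flags and returns True on success.
    """
    flags = list(baseline)
    for flag in candidates:
        trial = flags + [flag]
        if not run_ok(trial):
            return flag
        flags = trial  # keep the flag and move on to the next one
    return None

# Placeholder baseline and candidate flags taken from this thread.
baseline = ["-g", "-traceback"]
candidates = ["-O2", "-xHost", "-ip", "-no-prec-div", "-no-prec-sqrt"]

# Stand-in for a real build-and-run step: pretend "-ip" triggers the failure.
def fake_run_ok(flags):
    return "-ip" not in flags

print(first_failing_flag(baseline, candidates, fake_run_ok))
```

A flag reported this way is only the point at which the failure first appears in combination with the flags added before it, so it is worth retesting that flag on its own against the baseline.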
 

Attachments:
configure.wrf_working.txt (text/plain, 23.76 KB)

Hi

I installed parallel_studio_xe_2019_update4_cluster_edition on Ubuntu. When I install WRF using the Intel compiler, it gives me this error:

"To DISABLE large file support in NetCDF, set the environment variable WRFIO_NCD_NO_LARGE_FILE_SUPPORT to 1 and run configure again. Set to any other value to avoid this message.
Testing for NetCDF, C and Fortran compiler
One of compilers testing failed!
Please check your compiler."

When I typed "dpkg --list | grep compiler", I got:

jaipur:/opt/intel/parallel_studio_xe_2019.4.070/bin> dpkg --list | grep compiler
ii  g++                                        4:5.3.1-1ubuntu1                                            amd64        GNU C++ compiler
ii  g++-5                                      5.4.0-6ubuntu1~16.04.11                                     amd64        GNU C++ compiler
ii  g++-5-multilib                             5.4.0-6ubuntu1~16.04.11                                     amd64        GNU C++ compiler (multilib support)
ii  g++-multilib                               4:5.3.1-1ubuntu1                                            amd64        GNU C++ compiler (multilib files)
ii  gcc                                        4:5.3.1-1ubuntu1                                            amd64        GNU C compiler
ii  gcc-5                                      5.4.0-6ubuntu1~16.04.11                                     amd64        GNU C compiler
ii  gcc-5-multilib                             5.4.0-6ubuntu1~16.04.11                                     amd64        GNU C compiler (multilib support)
ii  gcc-multilib                               4:5.3.1-1ubuntu1                                            amd64        GNU C compiler (multilib files)
ii  gfortran                                   4:5.3.1-1ubuntu1                                            amd64        GNU Fortran 95 compiler
ii  gfortran-5                                 5.4.0-6ubuntu1~16.04.11                                     amd64        GNU Fortran compiler
ii  hardening-includes                         2.7ubuntu2                                                  all          Makefile for enabling compiler flags for security hardening
ii  libllvm3.8:amd64                           1:3.8-2ubuntu4                                              amd64        Modular compiler and toolchain technologies, runtime library
ii  libxkbcommon0:amd64                        0.5.0-1ubuntu2.1                                            amd64        library interface to the XKB compiler - shared library

I did install the Intel compiler:

jaipur:/opt/intel/parallel_studio_xe_2019.4.070/bin> whereis icc
icc: /opt/intel/bin/icc /opt/intel/compilers_and_libraries_2019.4.243/linux/bin/intel64/icc /opt/intel/compilers_and_libraries_2019.4.243/linux/bin/intel64/icc.cfg

So I cannot see the Intel compiler in the package list. How can I link my Intel compiler to WRF?

Please share your experience on how to install WRF using the Intel compilers rather than the gfortran compilers.

I look forward to a positive reply.
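
A note on the dpkg output above: dpkg only lists software installed as Debian packages, so a compiler placed under /opt/intel by its own installer would not be expected to appear there. What matters for the WRF build is whether the compiler drivers can be found on PATH in the shell that runs configure. A minimal check, as a sketch (the function name `find_compilers` is illustrative; ifort, icc and mpiifort are the driver names that appear in this thread):

```python
import shutil

def find_compilers(names=("ifort", "icc", "mpiifort")):
    """Map each driver name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}

for name, path in find_compilers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a driver is reported as not found, the Intel environment script has probably not been sourced in that shell. For this product generation that is typically compilervars.sh (for example "source /opt/intel/bin/compilervars.sh intel64"), but treat the exact path as an assumption and check your installation.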

Note that icc is the name of the Intel C/C++ compiler driver. You may or may not have installed the Intel Fortran compiler (ifort) when you installed Parallel Studio, so check that first.

It would be far easier for you to build and run WRF using a supported compiler such as Gfortran. If the build system for WRF does not contain a configuration that uses the Intel compiler, chances are slim that you can build WRF using Intel Fortran. Nor is this forum a support forum for questions specific to third party packages such as WRF.

Hi,

I installed Intel parallel_studio_xe_2019_update4_cluster_edition. I am trying to install the WRF (3.9.1.1) component in NU-WRF using the Intel compilers.

I get the following errors:

-------------------------------------------------------------------------------------------------------------------------------

libwrflib.a(module_wrf_error.o): In function `wrf_message_':
module_wrf_error.f90:(.text+0x0): multiple definition of `wrf_message_'
libwrflib.a(noahmp36_wrf_routines.o):noahmp36_wrf_routines.F90:(.text+0x0): first defined here
libwrflib.a(module_wrf_error.o): In function `wrf_error_fatal_':
module_wrf_error.f90:(.text+0xa50): multiple definition of `wrf_error_fatal_'
libwrflib.a(noahmp36_wrf_routines.o):noahmp36_wrf_routines.F90:(.text+0x1e0): first defined here
libwrflib.a(module_wrf_error.o): In function `wrf_error_fatal3_':
module_wrf_error.f90:(.text+0xf60): multiple definition of `wrf_error_fatal3_'
libwrflib.a(noahmp36_wrf_routines.o):noahmp36_wrf_routines.F90:(.text+0x80): first defined here
real_em.o: In function `med_sidata_input_':
real_em.f90:(.text+0x167a): undefined reference to `module_wps_io_arw_mp_read_wps_'
real_em.o: In function `assemble_output_':
real_em.f90:(.text+0x2d1a): undefined reference to `module_big_step_utilities_em_mp_couple_'
real_em.f90:(.text+0x2f5b): undefined reference to `module_big_step_utilities_em_mp_couple_'
real_em.f90:(.text+0x319b): undefined reference to `module_big_step_utilities_em_mp_couple_'
real_em.f90:(.text+0x33d4): undefined reference to `module_big_step_utilities_em_mp_couple_'
real_em.f90:(.text+0x3c58): undefined reference to `module_big_step_utilities_em_mp_couple_'

---------------------------------------------------------------------------------------------------------------------------------------------------
  configure.wrf :

DESCRIPTION     =       INTEL ($SFC/$SCC)
DMPARALLEL      =        1
OMPCPP          =       # -D_OPENMP
OMP             =       # -openmp -fpp -auto
OMPCC           =       # -openmp -fpp -auto
SFC             =       ifort
SCC             =       icc
CCOMP           =       icc
DM_FC           =       mpif90 -f90=$(SFC)
DM_CC           =       mpicc -cc=$(SCC) -DMPI2_SUPPORT
FC              =       time $(DM_FC)
CC              =       $(DM_CC) -DFSEEKO64_OK
LD              =       $(FC)
RWORDSIZE       =       $(NATIVE_RWORDSIZE)
PROMOTION       =       -real-size `expr 8 \* $(RWORDSIZE)` -i4
ARCH_LOCAL      =       -DNONSTANDARD_SYSTEM_FUNC  -DWRF_USE_CLM
CFLAGS_LOCAL    =       -w -O2 -ip #-xHost -fp-model fast=2 -no-prec-div -no-prec-sqrt -ftz -no-multibyte-chars
LDFLAGS_LOCAL   =       -ip #-xHost -fp-model fast=2 -no-prec-div -no-prec-sqrt -ftz -align all -fno-alias -fno-common
CPLUSPLUSLIB    =       
ESMF_LDFLAG     =       $(CPLUSPLUSLIB)
FCOPTIM         =       -O2
FCREDUCEDOPT    =       $(FCOPTIM)
FCNOOPT        =       -O0 -fno-inline -no-ip
FCDEBUG         =       # -g $(FCNOOPT) -traceback # -fpe0 -check noarg_temp_created,bounds,format,output_conversion,pointers,uninit -ftrapuv -unroll0 -u
FORMAT_FIXED    =       -FI
FORMAT_FREE     =       -FR
FCSUFFIX        =
BYTESWAPIO      =       -convert big_endian
RECORDLENGTH    =       -assume byterecl
FCBASEOPTS_NO_G =       -ip -fp-model precise -w -ftz -align all -fno-alias $(FORMAT_FREE) $(BYTESWAPIO) #-xHost -fp-model fast=2 -no-heap-arrays -no-prec-div -no-prec-sqrt -fno-common
FCBASEOPTS      =       $(FCBASEOPTS_NO_G) $(FCDEBUG)
MODULE_SRCH_FLAG =     
TRADFLAG        =      -traditional-cpp
CPP             =      /lib/cpp -P -nostdinc
AR              =      ar
ARFLAGS         =      ru
M4              =      m4
RANLIB          =      ranlib
RLFLAGS        =    
CC_TOOLS        =      $(SCC)

I don't know where it is going wrong. Kindly help me to solve this issue.
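
The linker output quoted above contains two kinds of errors: symbols defined twice ("multiple definition") and symbols never found ("undefined reference"). A minimal sketch for grouping them, so that a long log can be read at a glance, follows. The function name `summarize_link_errors` is illustrative, and the sample log is a shortened excerpt of the one above.

```python
import re
from collections import Counter

MULTIPLE = re.compile(r"multiple definition of `([^']+)'")
UNDEFINED = re.compile(r"undefined reference to `([^']+)'")

def summarize_link_errors(log):
    """Return (duplicates, undefined) from GNU ld style error text.

    duplicates: sorted list of symbols defined more than once
    undefined:  Counter mapping each unresolved symbol to its reference count
    """
    duplicates = sorted(set(MULTIPLE.findall(log)))
    undefined = Counter(UNDEFINED.findall(log))
    return duplicates, undefined

# Shortened excerpt of the linker output quoted above.
log = """\
module_wrf_error.f90:(.text+0x0): multiple definition of `wrf_message_'
module_wrf_error.f90:(.text+0xa50): multiple definition of `wrf_error_fatal_'
real_em.f90:(.text+0x167a): undefined reference to `module_wps_io_arw_mp_read_wps_'
real_em.f90:(.text+0x2d1a): undefined reference to `module_big_step_utilities_em_mp_couple_'
real_em.f90:(.text+0x2f5b): undefined reference to `module_big_step_utilities_em_mp_couple_'
"""
duplicates, undefined = summarize_link_errors(log)
print(duplicates)
print(dict(undefined))
```

In the log above, the duplicated symbols are defined in both module_wrf_error.o and noahmp36_wrf_routines.o inside libwrflib.a, while the undefined ones are referenced from real_em.o.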
