Still segmentation ending of wrf simulations

We have successfully compiled WRF V3.1.1 with the Intel compiler v11.1.

We have also run simulations successfully after setting 'ulimit -s unlimited', but this only works for 'small' domains. A domain of 60x80x26 points runs, but a domain of 390x250x40 points does not. Our nodes have 8 GB of memory, and we can simulate the larger domain with the Portland Group compiler.

Any suggestion will be very welcome

Thank you in advance

As you are aware, WRF is complicated. Thus it is impossible to say "oh, obviously the problem is XYZ".

I would assume that you have read through our articles on WRF here: http://software.intel.com/en-us/articles/building-the-wrf-with-intel-compilers-on-linux-and-improving-performance-on-intel-architecture/

As in all cases, I would recommend trying to find out where the code is dying by compiling with -g -traceback OR running the code under a suitable debugger ( IDB for single-node runs, TotalView for MPI runs or serial runs ).

We do have a couple of engineers on staff who work with our customers on WRF. If you have a supported version of the compiler ( anything other than "non-commercial license" ) that is still under a non-expired license, you can send in a support request at http://premier.intel.com. In the problem report you can send in your specific input files for this simulation so that our folks can try to reproduce the problem.

If you have just a non-commercial license, use the -g -traceback and send us the source file and line. That might help us identify where the code is dying.

Also, do you run the code directly on the node in serial mode, or do you launch it with a script through a batch system like PBS, Moab, LSF, etc.? Does your launch script run 'ulimit -s unlimited'? Also, ensure that your nodes allow 'ulimit -s unlimited'; that is, confirm that after the command is entered the shell did indeed raise the limit to unlimited. Many systems are configured with hard limits, and the ulimit request is ignored (Mac OS, for example).
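A minimal sketch of that check, assuming a bash launch script. The function names try_raise_stack and report_stack_limits are made up for illustration and are not part of WRF or any batch system. It attempts to raise the limit and then prints what the shell actually granted, so the job log shows whether the request took effect.

```shell
#!/bin/bash
# Sketch for the top of a batch launch script (for example a PBS job script).

# Try to raise the soft stack limit of this shell.
# This fails if the hard limit configured on the node is lower.
try_raise_stack() {
    ulimit -s unlimited 2>/dev/null
}

# Print the soft and hard stack limits currently in effect for this shell.
report_stack_limits() {
    echo "soft stack limit: $(ulimit -S -s)"
    echo "hard stack limit: $(ulimit -H -s)"
}

if ! try_raise_stack; then
    echo "warning: could not set the stack limit to unlimited" >&2
fi
report_stack_limits
```

If the soft limit printed here is still a number rather than "unlimited", the request was refused, and the hard limit line shows the ceiling that the system administrator would have to change.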

And of course, what compiler options are you using for the build? Are you building serial or parallel, OpenMP or non-OpenMP? Which compiler version and WRF version are you using?

ron

You also don't make clear whether you are running on a 32 bit or a 64 bit OS. You might run into memory limits, especially on 32 bits. My recollection is that WRF uses mostly dynamic memory allocation, so you should not need -mcmodel medium -shared-intel to support 64 bit addressing.

** re-edited post **

Sorry for my first short post.

I am using a 64-bit build, and I followed the suggestions in the article mentioned above for compiling WRF V3.1.1.

I also have to correct the initial description in my post. I forgot to mention that WRF simulations work properly with a serial build of WRF (with ifort and icc). They fail when a parallel build (with MPICH v1.2.7) is used. I tried another domain to check the build, and it did not work with that one either (now: 167x139x40).

The build uses distributed memory (not shared memory). It runs under a PBS queue system, and the 64-bit nodes accept the 'ulimit -s unlimited' setting.

I tried to use the TotalView software properly, and I found that the segmentation fault occurs in the 'solve_em.f90' module (between lines #284 and #293, just where it calls 'get_ijk_from_grid'). The line numbers refer to the preprocessed source: the WRF sources are .F90 files, which become .f90 after preprocessing. I have attached the TotalView results.

Thank you in advance,

Llus

Attachment: total_view.inf (111.46 KB)

Have you resolved this issue? I notice the message is from February. If not, hopefully I can help you because you just helped me solve the nearly identical problem. Working on a new architecture, I was seeing a segmentation violation in the call to 'get_ijk_from_grid' in both serial and MPI builds. The new machine had a very small stacksize set and I forgot to set it to 'unlimited'. Doing that resolved my problems completely. That subroutine call passes a LOT of rather deep derived types and structs which need a lot of room on the stack.

If you are still able to run a serial job, but not parallel, my suggestion would be to check your stacksize on a compute node after your job is launched. I would bet that during job startup, your MPI or batch environment is giving you a small stacksize on the compute nodes. Just a guess. But regardless, thanks for reminding me to grow my own stacksize.
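One way to check this is sketched below, assuming bash is available on the compute nodes. probe_stack is a hypothetical helper, not a WRF or MPI tool. Each process prints the name of the node it runs on and the stack limit it actually sees.

```shell
#!/bin/bash
# Hypothetical probe: report the stack limit seen by this process.
probe_stack() {
    # uname -n prints the node (host) name; ulimit -s prints the soft
    # stack limit, either "unlimited" or a size in kilobytes.
    echo "$(uname -n): stack limit = $(ulimit -s)"
}

probe_stack
```

Saved as an executable script and started through the same batch script and MPI launcher as wrf.exe (the exact launcher invocation depends on your MPI installation), it should print one line per process. A number instead of "unlimited" on any node would suggest that the job environment is resetting the limit.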

I have solved this issue. The segmentation fault only occurs with one particular physics parameterization of the model (the CAM sw/lw radiation scheme). A colleague at my institution was told that it is simply a problem with the optimization level used for compilation.

The recommended WRF build configuration (Intel, distributed memory) is:

(...)
DMPARALLEL = 1
OMPCPP = # -D_OPENMP
OMP = # -openmp -fpp -auto
SFC = ifort
SCC = icc
DM_FC = mpif90 -f90=$(SFC)
DM_CC = mpicc -cc=$(SCC) -DMPI2_SUPPORT
FC = $(DM_FC)
CC = $(DM_CC) -DFSEEKO64_OK
LD = $(FC)
RWORDSIZE = $(NATIVE_RWORDSIZE)
PROMOTION = -i4
ARCH_LOCAL = -DNONSTANDARD_SYSTEM_FUNC
CFLAGS_LOCAL = -w -O3 -ip
LDFLAGS_LOCAL = -ip
CPLUSPLUSLIB =
ESMF_LDFLAG = $(CPLUSPLUSLIB)
FCOPTIM = -O3
FCREDUCEDOPT = $(FCOPTIM)
FCNOOPT = -O0 -fno-inline -fno-ip
FCDEBUG = # -g $(FCNOOPT) -traceback
FORMAT_FIXED = -FI
FORMAT_FREE = -FR
FCSUFFIX =
BYTESWAPIO = -convert big_endian
FCBASEOPTS = -w -ftz -align all -fno-alias -fp-model precise $(FCDEBUG) $(FORMAT_FREE) $(BYTESWAPIO)
MODULE_SRCH_FLAG =
TRADFLAG = -traditional
CPP = /lib/cpp -C -P
AR = ar
(...)

Simply changing the optimization from '-O3' to '-O2' makes it work properly. (It also works with '-O1', but that makes the simulations slower.)

In case someone wants to fix it, these are the messages produced by a build of the model with checking enabled (FCDEBUG = -traceback -check all):

(...) 
Timing for main: time 2001-11-10_00:02:30 on domain 1: 358.70981 elapsed seconds.
forrtl: warning (402): fort: (1): In call to PRE_RADIATION_DRIVER, an array temporary was created for argument #49
forrtl: warning (402): fort: (1): In call to PRE_RADIATION_DRIVER, an array temporary was created for argument #51
forrtl: warning (402): fort: (1): In call to RADIATION_DRIVER, an array temporary was created for argument #134
forrtl: warning (402): fort: (1): In call to RADIATION_DRIVER, an array temporary was created for argument #136
forrtl: warning (402): fort: (1): In call to SURFACE_DRIVER, an array temporary was created for argument #181
forrtl: warning (402): fort: (1): In call to SURFACE_DRIVER, an array temporary was created for argument #183
forrtl: warning (402): fort: (1): In call to PBL_DRIVER, an array temporary was created for argument #87
forrtl: warning (402): fort: (1): In call to PBL_DRIVER, an array temporary was created for argument #89
forrtl: warning (402): fort: (1): In call to CUMULUS_DRIVER, an array temporary was created for argument #21
forrtl: warning (402): fort: (1): In call to CUMULUS_DRIVER, an array temporary was created for argument #23
forrtl: warning (402): fort: (1): In call to FDDAGD_DRIVER, an array temporary was created for argument #58
forrtl: warning (402): fort: (1): In call to FDDAGD_DRIVER, an array temporary was created for argument #60
forrtl: warning (402): fort: (1): In call to MICROPHYSICS_DRIVER, an array temporary was created for argument #55
forrtl: warning (402): fort: (1): In call to MICROPHYSICS_DRIVER, an array temporary was created for argument #57
forrtl: warning (402): fort: (1): In call to DIAGNOSTIC_OUTPUT_CALC, an array temporary was created for argument #20
forrtl: warning (402): fort: (1): In call to DIAGNOSTIC_OUTPUT_CALC, an array temporary was created for argument #22
Timing for main: time 2001-11-10_00:05:00 on domain 1: 32.11490 elapsed seconds.
(...)
  • On phys/module_radiation_driver.F, subroutine pre_radiation_driver arguments #49, 51 are: i_end, j_end
  • On phys/module_radiation_driver.F, subroutine radiation_driver arguments #134, 136 are: i_end, j_end
  • On phys/module_surface_driver.F, subroutine surface_driver arguments #181, 183 are: i_end, j_end
  • On phys/module_pbl_driver.F, subroutine pbl_driver arguments #87, 89 are: i_end, j_end
  • On phys/module_cumulus_driver.F, subroutine cumulus_driver arguments #21, 23 are: i_end, j_end
  • On phys/module_fddagd_driver.F, subroutine fddagd_driver arguments #58, 60 are: i_end, j_end
  • On phys/module_microphysics_driver.F, subroutine microphysics_driver arguments #55, 57 are: i_end, j_end
  • On phys/module_diagnostics.F, subroutine diagnostic_output_calc arguments #20, 22 are: i_end, j_end

Subroutine definitions

  INTEGER, DIMENSION(num_tiles), INTENT(IN) ::                   & 
& i_start,i_end,j_start,j_end

Definition in frame/module_domain_type.F, WRF derived type for the domain TYPE(domain):

TYPE domain
    (...)
    INTEGER,POINTER :: i_start(:),i_end(:)
    INTEGER,POINTER :: j_start(:),j_end(:)
    (...)
    INTEGER :: num_tiles ! taken out of namelist 20000908
    (...)

Some information about WRF tiles

In frame/module_tiles.F, the subroutines set_tiles1, set_tiles2 and set_tiles3 contain:

IF ( ASSOCIATED(grid%i_start) ) THEN ; DEALLOCATE( grid%i_start ) ; NULLIFY( grid%i_start ) ; ENDIF
IF ( ASSOCIATED(grid%i_end) ) THEN ; DEALLOCATE( grid%i_end ) ; NULLIFY( grid%i_end ) ; ENDIF
IF ( ASSOCIATED(grid%j_start) ) THEN ; DEALLOCATE( grid%j_start ) ; NULLIFY( grid%j_start ) ; ENDIF
IF ( ASSOCIATED(grid%j_end) ) THEN ; DEALLOCATE( grid%j_end ) ; NULLIFY( grid%j_end ) ; ENDIF
ALLOCATE(grid%i_start(num_tiles))
ALLOCATE(grid%i_end(num_tiles))
ALLOCATE(grid%j_start(num_tiles))
ALLOCATE(grid%j_end(num_tiles))
grid%max_tiles = num_tiles

Many thanks to all of you

Llus

Best Reply

It seems that you also have -check on, as the warnings about argument temporaries being created are not shown by default. I suspect that these temporaries are overflowing the stack and hence causing the segfault. See if adding -heap-arrays to the compiler options helps.
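To see which calls generate the temporaries, the warnings can be tallied from the run log. This is a rough sketch, assuming the log lines have the format quoted in the post above. count_array_temps is a hypothetical helper name, and the log file is whichever file your run writes these warnings to.

```shell
#!/bin/bash
# Hypothetical helper: count "array temporary" warnings per called routine.
# Argument: path to a log file containing forrtl warning (402) lines.
count_array_temps() {
    # Keep only the routine name from each matching warning line,
    # then count how often each name occurs.
    sed -n 's/.*In call to \([A-Z0-9_]*\), an array temporary was created.*/\1/p' "$1" \
        | sort | uniq -c | awk '{print $1, $2}'
}

# Run on a log file if one was given on the command line.
if [ -f "${1:-}" ]; then
    count_array_temps "$1"
fi
```

Each output line is a count followed by a routine name. Routines that appear at every time step are the first places to look when judging how much stack the temporaries may need.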

I also recommend your reading our Knowledge Base articles on building WRF using Intel compilers.

Steve - Intel Developer Support

As you suggested, it works! I changed the following in the build configuration (maybe not in the most appropriate place, but it works):

(...)
CFLAGS_LOCAL = -w -O3 -heap-arrays -ip
(...)
FCOPTIM = -O3 -heap-arrays

And now it is working.

Many thanks!

Llus

Hi guys, I am having a hard time running WRF3.2_CLM3.5 (link: http://www.mmm.ucar.edu/wrf/users/contributed/contributed.html); the executables are not being created in the /wrf/main directory. I should mention that I tested all the other versions of WRF (3.1 to 3.4.1) and they run smoothly. However, compilation of the combined WRF3.2_CLM3.5 code was unsuccessful after repeated attempts.

I am using Intel composer_xe_2013.2.146 on a 32-bit Linux OS (CentOS 6) with Intel 5600 processors, 32 GB of RAM, and OpenMP (1.6.4).

I have set FCOPTIM=-O2, the stack size with 'ulimit -s unlimited', and OMP_NUM_THREADS=1. These settings worked perfectly for every other WRF code (WRF3.1 to WRF3.4.1), but not for the WRFV3.2_CLM3.5 code that I plan to run for my research. I also tried the "-heap-arrays" option, but it does not seem to help here.

During compilation I keep getting the warning 'missing return statement at end of non-void function "gen_shutdown_closes"', as well as errors like "module_configure.f90(19522): error #7002: Error in opening the compiled module file.  Check INCLUDE paths.   [MODULE_CONFIGURE]" and "module_configure.f90(19614): error #6406: Conflicting attributes or multiple declaration of name.   [MODEL_CONFIG_REC]", whereas all the other codes (WRF3.1-3.4.1) build without problems.

I have attached configure.wrf and the compile log for em_real. Any helpful suggestion will be highly appreciated and acknowledged.

Yours truly,

Supantha Paul
PhD Research Scholar,
IITB, Climate Centre,
Mumbai, Powai 400076
India
Mo:+919167481338

Supantha,

The "missing return" warning is coming from C code.

You didn't manage to attach the log. For error 7002, it is saying that the module MODULE_CONFIGURE didn't get compiled, so you should figure out why. Analysis of error 6406 would require seeing the sources.

Steve - Intel Developer Support
