Performance hints for WRF on Intel® architecture

By Roman S Dubtsov, Published: 08/02/2012, Last Updated: 08/02/2012


This document explains how to configure the Weather Research & Forecasting (WRF) run-time environment to achieve the best performance and scalability on Intel® architecture with Intel® software tools.


Applies to: WRF v. 2.x through 3.1.1.

Application Notes:

 The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility. WRF is suitable for a broad spectrum of applications across scales ranging from meters to thousands of kilometers.

 This description was taken from

Obtaining Source Code:

WRF source codes can be downloaded from
or from

Building the Application:

Links to Knowledge base articles describing how to build WRF using Intel tools can be found in the article "Building WRF and WPS with Intel® Compilers on Linux and improving performance on Intel® architecture."

Running the Application:

Reducing MPI overhead:

When the computation-to-communication ratio becomes small, communication overhead grows substantially. In many cases it can be reduced by using hybrid MPI+OpenMP parallelization. However, running a hybrid-parallel application requires careful placement of its MPI processes onto CPU cores. The general guidelines are that the OpenMP threads of a process should share a cache (the lower the level in the cache hierarchy, the better) and that all cores on each node must be active. To enable special handling of hybrid-parallel applications with Intel® MPI Library, add -genv I_MPI_PIN_DOMAIN omp to the mpiexec parameters, and specify the number of MPI processes per compute node using the -perhost mpiexec switch or via a machinefile (see the Intel® MPI Library manual for more information). To "pin" individual OpenMP threads to cores or logical CPUs, add -genv KMP_AFFINITY compact to the mpiexec parameters.

To troubleshoot and verify the process and thread placement, add -genv I_MPI_DEBUG 5 -genv KMP_AFFINITY compact,verbose to the mpiexec parameters. The MPI and OpenMP libraries will then print which CPUs each process and thread was placed on.
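The options from this section can be combined into a single launch line, roughly as in the sketch below. The node count, the processes-per-node and threads-per-process layout, the use of OMP_NUM_THREADS to set the thread count, and the ./wrf.exe path are all illustrative assumptions. The command is only printed so that it can be inspected before use.

```shell
#!/bin/sh
# Sketch: assemble a hybrid MPI+OpenMP launch line for Intel MPI Library.
# NODES, PPN, THREADS and WRF_EXE are illustrative assumptions.
NODES=2                 # compute nodes (assumed)
PPN=4                   # MPI processes per node (assumed)
THREADS=2               # OpenMP threads per MPI process (assumed)
WRF_EXE=./wrf.exe       # assumed location of the WRF executable
AFFINITY=compact        # use "compact,verbose" to print thread placement

# PPN * THREADS should equal the number of cores per node (8 here),
# so that all cores on each node are active.
NP=$((NODES * PPN))     # total number of MPI processes

CMD="mpiexec -perhost $PPN -n $NP \
-genv OMP_NUM_THREADS $THREADS \
-genv I_MPI_PIN_DOMAIN omp \
-genv KMP_AFFINITY $AFFINITY \
$WRF_EXE"

# Print instead of executing; once it looks right, run it with: eval "$CMD"
echo "$CMD"
```

Adding -genv I_MPI_DEBUG 5 to the same line and switching AFFINITY to compact,verbose gives the troubleshooting variant described above.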

Improving cache and memory bandwidth utilization:

WRF is very sensitive to memory bandwidth. However, it is possible to reduce its demands by exploiting peculiarities of the WRF hybrid parallelization scheme.

WRF uses a 2D grid decomposition of the forecast/simulation region, which is called a domain. The portion of the domain that belongs to a particular process is called a patch. Each patch can be further decomposed into tiles. Each tile is processed separately and, if OpenMP is enabled, several tiles are processed concurrently by different threads. However, a tile may still be too large to fit into the CPU's cache. To reduce the tile size, increase the number of tiles. By default, the number of tiles is 1 or, if OpenMP is enabled, the number of OpenMP threads. To specify a custom value, edit the WRF run-time configuration file namelist.input: find the domains section, check that it does not already contain a numtiles entry, and add the following line:

numtiles = X,

where X is the desired number of tiles.
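The edit can also be scripted. The following is a minimal sketch that works on a small made-up namelist written to a temporary file; the NUMTILES value and the demo file contents are assumptions, and for real use NAMELIST would point at the actual namelist.input.

```shell
#!/bin/sh
# Sketch: add "numtiles = X," to the &domains group of a namelist file.
# The demo namelist below is a made-up stand-in for a real namelist.input.
NUMTILES=8                         # desired number of tiles (assumed value)
NAMELIST=$(mktemp)                 # demo file; use the real namelist.input instead

cat > "$NAMELIST" <<'EOF'
 &domains
 time_step = 60,
 max_dom   = 1,
 /
 &physics
 mp_physics = 3,
 /
EOF

# Refuse to add a second numtiles entry.
if grep -qi '^[[:space:]]*numtiles[[:space:]]*=' "$NAMELIST"; then
    echo "numtiles is already set; edit the existing line instead" >&2
else
    # Insert the new line right after the "&domains" group header.
    awk -v n="$NUMTILES" '
        { print }
        tolower($1) == "&domains" { printf " numtiles = %s,\n", n }
    ' "$NAMELIST" > "$NAMELIST.new" && mv "$NAMELIST.new" "$NAMELIST"
fi

cat "$NAMELIST"
```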

Note that the approach described above applies regardless of the WRF parallelization scheme in use. Also, the only way to find the numtiles value that gives the best performance is experimentation, since this value depends on the workload and the hardware configuration. For reference, the table below lists the best numtiles values found on the Intel Endeavor cluster with Intel® Xeon® X5560 processors (2 sockets per compute node, 4 cores per socket, Intel® Hyper-Threading Technology turned off).


# of cores   # of MPI processes   # of OpenMP threads   numtiles
8            4                    2                     64
16           8                    2                     32
32           16                   2                     12
64           32                   2                     8
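Because the best value has to be found by trial, it can help to prepare one run directory per candidate. The sketch below does only the preparation step; the candidate list, the directory names and the three-line stand-in namelist are assumptions, and the actual WRF launch and timing are left as a comment.

```shell
#!/bin/sh
# Sketch: prepare one run directory per candidate numtiles value.
# CANDIDATES and the directory layout are assumptions, not recommendations.
CANDIDATES="4 8 16 32 64"
WORK=$(mktemp -d)                  # scratch area for the demo

# Minimal stand-in for a real namelist.input that has no numtiles entry yet.
printf ' &domains\n max_dom = 1,\n /\n' > "$WORK/namelist.input"

for n in $CANDIDATES; do
    dir="$WORK/numtiles_$n"
    mkdir -p "$dir"
    # Add the numtiles line directly after the &domains header.
    awk -v n="$n" '{ print } $1 == "&domains" { printf " numtiles = %s,\n", n }' \
        "$WORK/namelist.input" > "$dir/namelist.input"
    # Launch and time WRF in "$dir" here, then compare the elapsed times.
    echo "prepared $dir/namelist.input"
done
```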


Using Intel® Math Kernel Library (MKL) DFT for polar filters:

WRF planetary/global simulations use FFT to filter out noise in polar areas. Unfortunately, the FFT library that comes with WRF, fftpack5, is not very fast. Calls to fftpack5 can be replaced with calls to the Intel® MKL DFT interface. Depending on the workload, Intel® MKL DFT may speed up the simulation by up to 3x. Please download the source code patch from the compressed .gz archive. To enable Intel® MKL DFT, modify the existing configure.wrf by adding '-DMKL_DFTI' to ARCHFLAGS_LOCAL and adding the MKL libraries to LDFLAGS_LOCAL (see the Intel® Math Kernel Library Link Line Advisor for more information).
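The configure.wrf change might be scripted roughly as follows. This sketch edits a two-line demo file rather than a real configure.wrf. The demo contents are made up, and the MKL_LIBS value is only a placeholder that assumes an Intel compiler accepting -mkl=sequential; the actual link line should come from the Link Line Advisor.

```shell
#!/bin/sh
# Sketch: enable the MKL DFT code path in a configure.wrf-style file.
# The demo lines and the MKL_LIBS value are assumptions; take the real
# link line from the Intel Math Kernel Library Link Line Advisor.
MKL_LIBS="-mkl=sequential"         # assumed placeholder link flags
CONF=$(mktemp)                     # demo file; use the real configure.wrf instead
printf 'ARCHFLAGS_LOCAL = \nLDFLAGS_LOCAL = \n' > "$CONF"   # demo content

cp "$CONF" "$CONF.bak"             # keep a backup before editing

# Append the new flags to the end of each variable's definition line.
sed -e '/^ARCHFLAGS_LOCAL[[:space:]]*=/ s/[[:space:]]*$/ -DMKL_DFTI/' \
    -e "/^LDFLAGS_LOCAL[[:space:]]*=/ s|[[:space:]]*\$| $MKL_LIBS|" \
    "$CONF.bak" > "$CONF"

cat "$CONF"
```

WRF has to be rebuilt after the change for the new flags to take effect.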

Speeding up computations by reducing precision:

The default WRF build configuration for x86 and x86_64 platforms with Intel compilers specifies -fp-model precise in the Fortran compiler base flags. This setting provides the best accuracy at the expense of performance. If your workload is numerically stable, it is possible to obtain valid results with the compiler's default precision settings or even with reduced ones. To revert to the default precision settings, edit the WRF build configuration file configure.wrf and remove -fp-model precise from the Fortran compiler flags stored in the FCBASEOPTS variable. To specify relaxed precision settings, replace -fp-model precise with -fp-model fast=2 -no-prec-div -no-prec-sqrt. This change can improve simulation speed by 3-5%.
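As an illustration, the flag replacement could be done with sed, as in the sketch below. It works on a one-line demo file; the FCBASEOPTS line shown is a made-up stand-in, since the real definition in configure.wrf depends on the WRF version and the chosen configuration.

```shell
#!/bin/sh
# Sketch: relax floating-point settings in a configure.wrf-style file.
# The demo line below is a made-up stand-in for the real FCBASEOPTS definition.
CONF=$(mktemp)                     # demo file; use the real configure.wrf instead
echo 'FCBASEOPTS = -w -ftz -fp-model precise $(FCDEBUG)' > "$CONF"

cp "$CONF" "$CONF.bak"             # keep a backup before editing

# Variant 1 (compiler default precision): delete the flag.
#   sed '/^FCBASEOPTS/ s/ *-fp-model precise//' "$CONF.bak" > "$CONF"
# Variant 2 (relaxed precision): swap in the faster flags.
sed '/^FCBASEOPTS/ s/-fp-model precise/-fp-model fast=2 -no-prec-div -no-prec-sqrt/' \
    "$CONF.bak" > "$CONF"

cat "$CONF"
```

After rebuilding, it is worth comparing the output of a short run against a build that used -fp-model precise before trusting the faster configuration.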


Improved performance.

Known Issues or Limitations:

All approaches described above require careful and possibly tedious experimentation to find the parameters that best match the hardware and software configuration. Be patient.



