Performance hints for WRF on Intel® architecture


This document explains how to configure the Weather Research & Forecasting (WRF) run-time environment to achieve the best performance and scalability on Intel® architecture with Intel® software tools.


WRF v. 2.x-3.1.1.

Application Notes:

 The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility. WRF is suitable for a broad spectrum of applications across scales ranging from meters to thousands of kilometers.

 This description was taken from

Obtaining Source Code:

WRF source codes can be downloaded from
or from

Building the Application:

Links to Knowledge base articles describing how to build WRF using Intel tools can be found in the article "Building WRF and WPS with Intel® Compilers on Linux and improving performance on Intel® architecture."

Running the Application:

Reducing MPI overhead:

Communication overhead becomes substantial when the ratio of computation to communication is small. In many cases it can be reduced by using hybrid MPI+OpenMP parallelization. However, running a hybrid-parallel application requires careful placement of its MPI processes onto CPU cores. The general guidelines are that the OpenMP threads of a process should share a cache (the lower the level in the cache hierarchy, the better) and that all cores on each node must be active. To enable special handling of hybrid-parallel applications with Intel® MPI Library, add -genv I_MPI_PIN_DOMAIN omp to the mpiexec parameters, and specify the number of MPI processes per compute node using the -perhost mpiexec switch or via a machinefile (see the Intel® MPI Library manual for more information). To "pin" individual OpenMP threads to cores or logical CPUs, add -genv KMP_AFFINITY=compact to the mpiexec parameters.

To troubleshoot and verify the process and thread placement, add -genv I_MPI_DEBUG 5 -genv KMP_AFFINITY=compact,verbose to the mpiexec parameters. This causes the MPI and OpenMP libraries to print which CPUs each process and thread was placed on.
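As a rough illustration of how the switches above fit together, the following sketch assembles an mpiexec command line for a given node layout. The function name, the node and core counts, and the executable path ./wrf.exe are assumptions made for this example. Passing OMP_NUM_THREADS explicitly is also an addition not mentioned above; it is included because the omp pinning domain is sized from that variable.

```python
def build_mpiexec_command(nodes, cores_per_node, procs_per_node,
                          exe="./wrf.exe", verbose=False):
    """Assemble an mpiexec command line for a hybrid MPI+OpenMP WRF run.

    Hypothetical helper: it only builds the string, it does not launch anything.
    """
    if cores_per_node % procs_per_node != 0:
        raise ValueError("procs_per_node must divide cores_per_node evenly")
    # One OpenMP thread per core, so that every core on the node is active.
    threads = cores_per_node // procs_per_node
    affinity = "compact,verbose" if verbose else "compact"
    args = [
        "mpiexec",
        "-perhost", str(procs_per_node),           # MPI processes per node
        "-genv", "OMP_NUM_THREADS", str(threads),  # assumption: set explicitly
        "-genv", "I_MPI_PIN_DOMAIN", "omp",        # hybrid-aware process pinning
        "-genv", f"KMP_AFFINITY={affinity}",       # thread pinning, as written above
    ]
    if verbose:
        args += ["-genv", "I_MPI_DEBUG", "5"]      # print process placement
    args += ["-n", str(nodes * procs_per_node), exe]
    return " ".join(args)


# Example: 4 nodes with 8 cores each, 2 MPI processes per node.
print(build_mpiexec_command(nodes=4, cores_per_node=8, procs_per_node=2))
```

Calling the helper with verbose=True adds the troubleshooting switches, which is a convenient way to check placement once before starting long runs.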

Improving cache and memory bandwidth utilization:

WRF is very sensitive to memory bandwidth. However, it is possible to reduce its demands by exploiting peculiarities of the WRF hybrid parallelization scheme.

WRF uses a 2D grid decomposition over the forecast/simulation region, which is called a domain. The portion of the domain that belongs to a particular process is called a patch. Each patch can be further decomposed into tiles. Each tile is processed separately and, if OpenMP is enabled, several tiles are processed concurrently by different threads. However, a tile may still be too large to fit into the CPU's cache. To reduce the tile size, increase the number of tiles. By default, the number of tiles is 1 or, if OpenMP is enabled, the number of OpenMP threads. To specify a custom value, edit the WRF run-time configuration file namelist.input. Find the domains section of namelist.input, check that it does not already contain a numtiles value, and add the following line:

numtiles = X,

where X is the desired number of tiles.
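The edit described above can also be scripted, which helps when many numtiles values have to be tried. The following is a minimal sketch that assumes the domains group starts with a line beginning with &domains and ends with a line containing only a slash; the function name and the sample namelist content are illustrative only.

```python
import re


def set_numtiles(namelist_text, numtiles):
    """Return namelist text with 'numtiles = N,' set in the &domains group."""
    out = []
    in_domains = False
    done = False
    for line in namelist_text.splitlines():
        stripped = line.strip().lower()
        if stripped.startswith("&domains"):
            in_domains = True
        elif in_domains and re.match(r"numtiles\s*=", stripped):
            # An existing value is replaced rather than duplicated.
            line = f" numtiles = {numtiles},"
            done = True
        elif in_domains and stripped == "/":
            if not done:
                out.append(f" numtiles = {numtiles},")
                done = True
            in_domains = False
        out.append(line)
    if not done:
        raise ValueError("no &domains group found")
    return "\n".join(out) + "\n"


# Illustrative namelist fragment, not a complete configuration.
sample = "&domains\n e_we = 74,\n e_sn = 61,\n/\n&physics\n mp_physics = 3,\n/\n"
print(set_numtiles(sample, 8))
```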

Note that the approach described above does not really depend on the WRF parallelization scheme. Also, the only way to find the numtiles value that gives the best performance is experimentation, since this value depends on the workload and the hardware configuration. For reference, the table below lists the best numtiles values for the Intel Endeavor cluster with Intel® Xeon X5560 processors (2 sockets per compute node, 4 cores per socket, Intel® Hyper-Threading Technology turned off).

# of cores    # of MPI processes    # of OpenMP threads    numtiles
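Since the best numtiles value has to be found by experiment, it helps to compare runs by their average time per model step. The sketch below assumes that the WRF log (for example rsl.error.0000) contains lines of the form "Timing for main: ... on domain N: T elapsed seconds"; check this against your own logs. The function name, the skip parameter, and the sample log lines are made up for illustration.

```python
import re

# Assumed log line format; adjust the pattern if your WRF version differs.
TIMING_RE = re.compile(
    r"Timing for main: .* on domain\s+(\d+):\s+([0-9.]+) elapsed seconds"
)


def mean_step_time(log_text, domain=1, skip=1):
    """Average elapsed seconds per step for one domain.

    The first `skip` steps are ignored because they often include
    one-time initialization cost.
    """
    times = [
        float(m.group(2))
        for m in TIMING_RE.finditer(log_text)
        if int(m.group(1)) == domain
    ]
    times = times[skip:]
    if not times:
        raise ValueError("no timing lines found for this domain")
    return sum(times) / len(times)


# Invented sample lines, for illustration only.
sample_log = (
    "Timing for main: time 2001-10-25_00:03:00 on domain   1:    4.00000 elapsed seconds\n"
    "Timing for main: time 2001-10-25_00:06:00 on domain   1:    2.00000 elapsed seconds\n"
    "Timing for main: time 2001-10-25_00:09:00 on domain   1:    3.00000 elapsed seconds\n"
)
print(mean_step_time(sample_log))
```

Running the same workload with several numtiles values and keeping the one with the lowest mean step time is one simple way to organize the experiments.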

Using Intel® Math Kernel Library (MKL) DFT for polar filters:

WRF planetary/global simulations use FFTs to filter out noise in polar areas. Unfortunately, the FFT library that comes with WRF, fftpack5, is not very fast. It is possible to replace calls to fftpack5 with calls to the Intel® MKL DFT interface. Depending on the workload, Intel® MKL DFT may speed up the simulation by up to 3x. The source code patch is available for download as a compressed .gz archive. To enable Intel® MKL DFT, apply the patch, then modify the existing configure.wrf by adding '-DMKL_DFTI' to ARCHFLAGS_LOCAL and adding the MKL libraries to LDFLAGS_LOCAL (see the Intel® Math Kernel Library Link Line Advisor for more information).
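The configure.wrf change can be scripted as in the sketch below, which appends text to a makefile-style variable assignment. The helper name and the sample file content are assumptions, and $(MKL_LIBS) is a hypothetical placeholder for whatever link line the Link Line Advisor produces for your setup.

```python
import re


def append_to_make_var(text, var, extra):
    """Append `extra` to the first 'VAR = ...' line in makefile-style text."""
    pattern = re.compile(rf"^({re.escape(var)}[ \t]*=.*?)[ \t]*$", re.MULTILINE)

    def repl(match):
        assignment = match.group(1)
        if extra in assignment:
            return assignment  # already present, leave the line unchanged
        return f"{assignment} {extra}"

    new_text, count = pattern.subn(repl, text, count=1)
    if count == 0:
        raise ValueError(f"variable {var} not found")
    return new_text


# Illustrative fragment; a real configure.wrf is much longer.
cfg = "ARCHFLAGS_LOCAL = -DNONSTANDARD_SYSTEM_FUNC\nLDFLAGS_LOCAL = -ip\n"
cfg = append_to_make_var(cfg, "ARCHFLAGS_LOCAL", "-DMKL_DFTI")
cfg = append_to_make_var(cfg, "LDFLAGS_LOCAL", "$(MKL_LIBS)")  # placeholder
print(cfg)
```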

Speeding up computations by reducing precision:

The default WRF build configuration for x86 and x86_64 platforms with Intel compilers specifies -fp-model precise in the Fortran compiler base flags. This setting provides the best accuracy at the expense of performance. If your workload is numerically stable, it is possible to obtain valid results with default or even reduced precision settings. To revert to the default precision settings, edit the WRF build configuration file configure.wrf and remove -fp-model precise from the Fortran compiler flags stored in the FCBASEOPTS variable. To specify relaxed precision settings, replace -fp-model precise with -fp-model fast=2 -no-prec-div -no-prec-sqrt. This change can improve simulation speed by 3-5%.
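The following sketch shows one way to apply either change to the text of configure.wrf. It assumes the flag appears on a line assigning a variable whose name starts with FCBASEOPTS; the function name, the mode names, and the sample flags are illustrative.

```python
RELAXED_FLAGS = "-fp-model fast=2 -no-prec-div -no-prec-sqrt"


def set_fp_model(configure_text, mode):
    """Rewrite '-fp-model precise' in FCBASEOPTS* assignments.

    mode 'default' removes the flag, mode 'relaxed' replaces it
    with the relaxed precision flags.
    """
    replacement = {"default": "", "relaxed": RELAXED_FLAGS}[mode]
    out = []
    for line in configure_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name.startswith("FCBASEOPTS") and "-fp-model precise" in line:
            line = line.replace("-fp-model precise", replacement)
            # Normalize spacing left behind by the removal.
            head, value = line.split("=", 1)
            line = f"{head.rstrip()} = {' '.join(value.split())}"
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative fragment; check the actual flags in your configure.wrf.
cfg = "FCBASEOPTS = -w -ftz -fp-model precise $(FORMAT_FREE)\nFCOPTIM = -O3\n"
print(set_fp_model(cfg, "relaxed"))
```

Because reduced precision can change results, it is worth comparing the output of a short run against a run built with -fp-model precise before adopting the change.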


Improved performance.

Known Issues or Limitations:

All approaches described above require careful and possibly tedious experimentation to find the parameters that best match your hardware and software configuration. Be patient.






Silvina R.:


I had the same problem as Kevin when applying the patch to WRFV3.7. Is there a new version of the patch? Has anyone figured this out?



mike p.:


Can you point to a chart of MKL FFT and FFTPACK performance vs. transform length, real-to-complex and complex-to-real?

I am particularly interested in data transform lengths less than 5000.


Kevin R.:

I downloaded the compressed patch, unzipped it, and tried to apply it with "patch < 0001-mkl-fft-for-polar-filters.patch". I got the following back:

$ patch < ~/0001-mkl-fft-for-polar-filters.patch
patching file module_polarfft.F
Hunk #2 succeeded at 9 (offset 1 line).
Hunk #3 succeeded at 209 (offset 38 lines).
Hunk #4 succeeded at 518 (offset 86 lines).
Hunk #5 FAILED at 455.
Hunk #6 FAILED at 509.
Hunk #7 succeeded at 632 (offset 104 lines).
Hunk #8 FAILED at 557.
Hunk #9 succeeded at 671 with fuzz 2 (offset 103 lines).
Hunk #10 FAILED at 622.
4 out of 10 hunks FAILED -- saving rejects to file module_polarfft.F.rej


I tried to figure out exactly what the differences are for the ones that failed, but I'm unclear about a couple of the bigger ones.

The problem may be that this patch was from an earlier version of WRF; I attempted to patch on WRF V3.6. Is there an updated patch for the latest version of WRF? Thank you.

lee_icc:


I am using WRFv3.5 and two-way coupled WRFv3.3-CMAQv5.0.1.

For 'Reducing MPI overhead', can I apply these options with the MPICH library?

For 'Using Intel Math Kernel Library (MKL) DFT for polar filters', I have downloaded the patch file 0001-mkl-fft-for-polar-filters.patch.gz. How do I use this patch? Are there any steps or instructions?


chenwg2000:

Hi, I want to use MKL to replace the FFT subroutine with your attached patch, but some errors occurred during compilation. Would you please share your configure.wrf with us?

Vipin Kumar E K (Intel):


The patch is now attached to the article in the compressed file 0001-mkl-fft-for-polar-filters.patch.gz. Please download it, extract it, and apply the patch to use MKL DFT for polar filters.


Roman Dubtsov (Intel):


Unfortunately I cannot see the e-mail. Feel free to ping me at so that I can send you the patch.


