WRF Conus12km on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® Processors

I. Overview

This document demonstrates how to obtain, build, and run the Weather Research and Forecasting (WRF) model on an Intel® Xeon® processor-based server, natively on a single Intel® Xeon Phi™ coprocessor, and in symmetric mode using both together. It also describes the WRF software configuration and affinity settings that extract the best performance on this server.

II. Introduction

The WRF model is a numerical weather prediction system designed to serve atmospheric research and operational forecasting needs. WRF is used by academic atmospheric scientists, forecast teams at operational centers, application scientists, and others. Please see http://www.wrf-model.org/index.php for more details about WRF. The source code and input files can be downloaded from the NCAR website. The latest version as of this writing is WRFV3.6. In this article, we use the conus12km benchmark.

III. Compiling, Running, and Validating WRF to run natively on an Intel® Xeon Phi™ coprocessor (Single Card)

You can obtain Intel® Composer XE, which includes the Intel® C/C++ and Fortran Compilers, from https://registrationcenter.intel.com/regcenter/register.aspx, or register at https://software.intel.com/en-us/ to get a free 30-day evaluation copy.

Compile WRF

  1. Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
  2. Source the environment for Intel® MPI Library and for the Intel Compiler:
    1. source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
    2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
  3. On bash, export the path for the host netcdf and host pnetcdf. Having netcdf and pnetcdf built for the Intel Xeon Phi coprocessor is a prerequisite.
    1. export NETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.6/netcdf/mic/
    2. export PNETCDF=/localdisk/igokhale/k1om/trunk/WRFV3.6/pnetcdf/mic/
  4. Turn on Large file IO support by entering: export WRFIO_NCD_LARGE_FILE_SUPPORT=1
  5. cd into the ../WRFV3/ directory, run ./configure, and select the option to build with the coprocessor (option 17). On the next prompt for nesting options, press return for the default, which is 1.
  6. In the configure.wrf that is created, remove -DUSE_NETCDF4_FEATURES and replace -O3 with -O2.
  7. Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
  8. Run ./compile wrf >& build.mic. This will build a wrf.exe in the ../WRFV3/main folder.
  9. For a new, clean build, run ./clean -a and repeat the process.
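
The compile steps above can be condensed into a single shell session. The paths below are illustrative placeholders, not real install locations; substitute your own Intel tool, netcdf, and pnetcdf paths:


```shell
# Condensed native (coprocessor) build of WRFV3.6 -- all paths are illustrative.
source /opt/intel/impi/4.1.1.036/mic/bin/mpivars.sh
source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
export NETCDF=/path/to/netcdf/mic/        # netcdf built for the coprocessor
export PNETCDF=/path/to/pnetcdf/mic/      # pnetcdf built for the coprocessor
export WRFIO_NCD_LARGE_FILE_SUPPORT=1     # large-file IO support

cd WRFV3
./configure                               # choose option 17 (coprocessor), nesting default 1
# Post-configure edits: drop -DUSE_NETCDF4_FEATURES and lower -O3 to -O2
sed -i -e 's/-DUSE_NETCDF4_FEATURES//' -e 's/-O3/-O2/' configure.wrf
# Line 7578: replace "!DEC$ vector always" with "!DEC$ SIMD"
sed -i '7578s/!DEC\$ vector always/!DEC\$ SIMD/' dyn_em/module_advect_em.F
./compile wrf >& build.mic                # produces main/wrf.exe
```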

Run WRF

  1. Download the CONUS12_rundir from http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3/ and place it in ../WRFV3.
  2. Copy the binary from ../WRFV3/main/wrf.exe to the ../CONUS12_rundir/wrf.exe.
  3. Copy/link all files in WRFV3/run to the directory where you are executing wrf.exe (i.e., the CONUS12_rundir).
  4. Edit the namelist.input to add "use_baseparam_fr_nml = .t."  under the &dynamics heading without the quotation marks (").
  5. cd into the CONUS12_rundir and execute WRF as follows on a coprocessor natively with the runtime parameters in the following script:

Script to run on coprocessor (native)


bash-4.1$ cat wrf.sh
source /opt/intel/impi/4.1.0.030/mic/bin/mpivars.sh
export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013.2.146/compiler/lib/mic/
export KMP_STACKSIZE=62m
ulimit -s unlimited
export I_MPI_DEBUG=5
export WRF_NUM_TILES_X=3
export WRF_NUM_TILES_Y=60
export I_MPI_PIN_MODE=mpd
export KMP_PLACE_THREADS=60C,3T
export OMP_NUM_THREADS=180
export KMP_AFFINITY=balanced,granularity=thread
export KMP_LIBRARY=turnaround
export KMP_BLOCKTIME=infinite
mpiexec.hydra -np 1 ./wrf.exe

6. The run is complete when it prints 'wrf: SUCCESS COMPLETE WRF' on the screen. You will find two files, rsl.error.0000 and rsl.out.0000, in your CONUS12_rundir directory.

7. After the run, compute the total time taken to simulate 149 timesteps. The sum and mean values are of interest for WRF (lower is better). The following parsing script may help:


bash-4.1$ cat gettiming.sh
grep 'Timing for main' rsl.out.0000 | sed '1d' | head -149 | awk '{print $9}' | awk -f stats.awk

bash-4.1$ cat stats.awk
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
{
    i++
    a += $1
    if ( $1 > max ) max = $1
    if ( $1 < min ) min = $1
}
END{ printf("---\n%10s  %8d\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }
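
As a quick sanity check of stats.awk, you can feed it a few synthetic values (made-up numbers, not real WRF timings):


```shell
# Recreate stats.awk as listed above (note the \n escapes in the printf format).
cat > stats.awk <<'EOF'
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
{
    i++
    a += $1
    if ( $1 > max ) max = $1
    if ( $1 < min ) min = $1
}
END{ printf("---\n%10s  %8d\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }
EOF
# Three synthetic per-timestep values: sum = 6.0, mean = 2.0, mean/max = 0.8
printf '1.5\n2.0\n2.5\n' | awk -f stats.awk
```

For a real run, gettiming.sh pipes the 149 extracted timestep values into this same script.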

Validation of your runs

To validate that the WRF run is correct, do the following:

  1. diffwrf your_output wrfout_reference > diffout_tag
  2. The 'DIGITS' column should contain a high value (>3). If it does, the WRF run is considered valid.

IV. Compiling WRF to run on a 2-Socket Intel® Xeon® processor-based server

We used these instructions on a 2-Socket Intel® Xeon® E5-26xx processor-based server.

Compile WRF

  1. Download and un-tar the WRFV3.6 source code from the NCAR repository http://www.mmm.ucar.edu/wrf/users/download/get_sources.html#V351.
  2. Source the environment for Intel® MPI Library and for the Intel Compiler
    1. source /opt/intel/impi/4.1.1.036/intel64/bin/mpivars.sh
    2. source /opt/intel/composer_xe_2013/bin/compilervars.sh intel64
  3. Export the path for the host netcdf and pnetcdf. Having netcdf and  pnetcdf built for the host is a prerequisite.
    1. export NETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.6/netcdf/xeon/
    2. export PNETCDF=/localdisk/igokhale/IVB/trunk/WRFV3.6/pnetcdf/xeon/
  4. Turn on large file IO support by entering: export WRFIO_NCD_LARGE_FILE_SUPPORT=1
  5. cd into the WRFV3 directory created in step 1, run ./configure and select option 21: "Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc  (dm+sm)". On the next prompt for nesting options, hit return for the default, which is 1.
  6. In the configure.wrf that is created, remove -DUSE_NETCDF4_FEATURES and replace -O3 with -O2.
  7. Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.
  8. Run ./compile wrf >& build.snb.avx. This will build a wrf.exe in the ../WRFV3/main folder. (Note: to speed up compile times, set the environment variable J to "-j 4" or however many parallel make tasks you wish to use.)
  9. For a new, clean build, run ./clean -a and repeat the process.

Run WRF

  1. Download the CONUS12_rundir from http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3/ and place it in ../WRFV3.
  2. Copy the binary from ../WRFV3/main/wrf.exe to the ../CONUS12_rundir/wrf.exe.
  3. Copy/link all files in WRFV3/run to the directory where you are executing wrf.exe (i.e., the CONUS12_rundir).
  4. Edit the namelist.input to add "use_baseparam_fr_nml = .t." under the &dynamics heading without the quotation marks (").
  5. cd into the CONUS12_rundir and execute WRF with the runtime parameters in the following script:

Here is an example script to run on an Intel Xeon processor-based host:


bash-4.1$ cat run.sh
source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64
source /opt/intel/impi/4.1.0.030/intel64/bin/mpivars.sh
ulimit -s unlimited
export I_MPI_PIN_MODE=mpd
export OMP_NUM_THREADS=2
export KMP_STACKSIZE=64m
export KMP_AFFINITY=scatter,granularity=thread
export KMP_BLOCKTIME=infinite
export KMP_LIBRARY=turnaround
export WRF_NUM_TILES=48
mpiexec.hydra -np 12 ./wrf.exe


6. The run is complete when it prints 'wrf: SUCCESS COMPLETE WRF' on the screen. You will find two files, rsl.error.0000 and rsl.out.0000, in your CONUS12_rundir directory.

7. After the run, compute the total time taken to simulate 149 timesteps with the script below. The sum and mean values are of interest for WRF (lower is better).

The following script should help parse the output:


bash-4.1$ cat gettiming.sh
grep 'Timing for main' rsl.out.0000 | sed '1d' | head -149 | awk '{print $9}' | awk -f stats.awk

bash-4.1$ cat stats.awk
BEGIN{ a = 0.0 ; i = 0 ; max = -999999999 ; min = 9999999999 }
{
    i++
    a += $1
    if ( $1 > max ) max = $1
    if ( $1 < min ) min = $1
}
END{ printf("---\n%10s  %8d\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }

Validation of the run

  1. To validate that the WRF run is correct, do the following:
    1. diffwrf your_output wrfout_reference > diffout_tag
    2. The 'DIGITS' column should contain a high value (>3). If it does, the WRF run is considered valid.
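
If you want to automate the DIGITS check, a short awk filter can flag low values. The diffout_tag excerpt below is hypothetical (real diffwrf output may be formatted differently), so adjust the field index to match your actual file:


```shell
# Hypothetical diffout_tag excerpt -- real diffwrf output may differ in layout.
cat > diffout_tag <<'EOF'
Field    Ndifs   Dims   RMS(1)      RMS(2)      DIGITS
U        1234    3      0.1E+01     0.1E+01     6
V        1234    3      0.2E+01     0.2E+01     5
EOF
# Flag any field whose DIGITS value (assumed to be the last column) is 3 or lower.
awk 'NR > 1 && $NF + 0 <= 3 { bad = 1; print "LOW DIGITS:", $1, $NF }
     END { print (bad ? "CHECK FAILED" : "CHECK PASSED") }' diffout_tag
```

With the sample data above, every DIGITS value exceeds 3, so the filter prints CHECK PASSED.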

Compiler Options:

  • -mmic : build an application that runs natively on an Intel Xeon Phi coprocessor
  • -openmp : enable the compiler to generate multi-threaded code based on the OpenMP* directives (same as -fopenmp)
  • -O3 : enable aggressive optimizations by the compiler
  • -opt-streaming-stores always : generate streaming stores
  • -fimf-precision=low : low precision for higher performance
  • -fimf-domain-exclusion=15 : generate the lowest-precision sequences for single-precision and double-precision code
  • -opt-streaming-cache-evict=0 : turn off all cache line evicts
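
For context, these switches typically appear together on the Fortran optimization line of the coprocessor configure.wrf. The fragment below is illustrative only, not the exact stock file:


```make
# Illustrative configure.wrf excerpt (actual stock contents may differ)
FCOPTIM = -O2 -mmic -openmp \
          -fimf-precision=low -fimf-domain-exclusion=15 \
          -opt-streaming-stores always -opt-streaming-cache-evict=0
```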

V. Additional Resources (NETCDF)

Here is how you can compile NETCDF for the Intel Xeon Phi coprocessor.

  1. Download NETCDF from http://www.unidata.ucar.edu/downloads/netcdf/netcdf-3_6_2/index.jsp.
  2. Create a directory called NETCDF: mkdir NETCDF.
  3. cd into the NETCDF directory and untar netcdf-3.6.2.tar.gz (tar xvzf netcdf-3.6.2.tar.gz).
  4. cd into netcdf-3.6.2 (created after untarring netcdf-3.6.2.tar.gz).
  5. Source the Intel compiler, e.g., source /opt/intel/composer_xe_2013/bin/compilervars.csh intel64.
  6. Set the following environment variables:

        setenv CPPFLAGS "-DpgiFortran"
        setenv CXX "icpc"
        setenv CC "icc"
        setenv F77 "ifort"

  7. Run this command on the terminal: ./configure NM=nm --prefix=/path/to/NETCDF --disable-cxx --host=x86_64-k1om-linux --build=x86_64-unknown-linux.
  8. Run this command on the terminal: make CFLAGS=-mmic FCFLAGS=-mmic LDFLAGS=-mmic.
  9. Then run: make install.

Now netcdf has been built for the Intel Xeon Phi coprocessor and is ready for use.

VI. Run WRF Conus12km in symmetric mode on a 2-Socket Intel® Xeon® Processor-based server with Intel® Xeon Phi™ Coprocessors

Script to run in symmetric mode

This example uses node01 as the host node.

When you request nodes, make sure you have a large stack size on the coprocessor, e.g., MIC_ULIMIT_STACKSIZE=365536.


source /opt/intel/impi/4.1.0.036/mic/bin/mpivars.sh
source /opt/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh intel64
export I_MPI_DEVICE=rdssm
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto

./run.symmetric


Below is the run.symmetric script that launches the code in symmetric mode:


run.symmetric script

#!/bin/sh
mpiexec.hydra \
  -host node01 -n 12 -env WRF_NUM_TILES 20 -env KMP_AFFINITY scatter -env OMP_NUM_THREADS 2 -env KMP_LIBRARY turnaround -env OMP_SCHEDULE static -env KMP_STACKSIZE 190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/x86/wrf.exe \
  : -host node01-mic1 -n 8 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 30 -env KMP_LIBRARY turnaround -env OMP_SCHEDULE static -env KMP_STACKSIZE 190M -env I_MPI_DEBUG 5 /path/to/CONUS2.5_rundir/mic/wrf.sh

In ../CONUS2.5_rundir/mic, create the wrf.sh below; it is needed for the Intel Xeon Phi coprocessor part of the run.


wrf.sh script

export LD_LIBRARY_PATH=/opt/intel/compiler/2013_sp1.1.106/composer_xe_2013_sp1.1.106/compiler/lib/mic:$LD_LIBRARY_PATH

/path/to/CONUS2.5_rundir/mic/wrf.exe

VII. Acknowledgements

The author would like to thank all who have contributed to the WRF project to date.

VIII. About the Author

Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel AZ SSG).

For more complete information about compiler optimizations, see our Optimization Notice.