How to get WRF running on the Intel® Xeon Phi™ Coprocessor

 

WRF on Intel® Xeon Phi™ Coprocessors and Intel® Xeon® processors

I. Overview

This document describes how to obtain, build, and run the WRF model on a single Intel® Xeon Phi™ Coprocessor and on an Intel® Xeon® processor-based server. It also lists the WRF software configuration and affinity settings we used to get the best performance from each of these single-node systems.

II. Introduction 

The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. WRF is used by academic atmospheric scientists, forecast teams at operational centers, and application scientists, among others. See http://www.wrf-model.org/index.php for more details about WRF. The source code and input files can be downloaded from the NCAR website.

III. Compiling, Running, and Validating WRF on a Standalone Intel® Xeon Phi™ Coprocessor (Single Card)

 

Compile WRF

1. Download and un-tar the WRF3.5 source code from the NCAR repository http://www2.mmm.ucar.edu/wrf/users/download/get_sources.html#WPS

2. Source the setup files for Intel® MPI Library and Intel® Compiler (example:)


source /opt/intel/impi/4.1.0.030/mic/bin/mpivars.sh

source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64

3. Export the path to the NetCDF build for the coprocessor. Having NetCDF built for the Intel® Xeon Phi™ Coprocessor is a prerequisite (see Section V).

export NETCDF=/localdisk/igokhale/KNC/trunk/WRFV3.4/netcdf/mic/ 

4. cd into the ../WRFV3/ directory and run ./configure and select option 21.

5. Edit configure.wrf to change mpicc to mpiicc in the DM_CC setting (this will be fixed in the next release of WRF).

6. Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.

7. Run ./compile wrf >& build.mic

8. This will build a wrf.exe in the ../WRFV3/main folder.

9. For a new, clean build, run ./clean -a and repeat the process.
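The two manual edits in steps 5 and 6 can also be scripted. The following is a minimal sketch, not part of the original procedure: the function names patch_dm_cc and patch_simd are hypothetical, and it assumes that DM_CC is defined on a single line of configure.wrf and that the directive sits on the line number given in step 6 for your copy of the source.

```shell
# Sketch of steps 5 and 6 as scripted edits (function names are hypothetical).
# Each function rewrites the file through a temporary copy.

# Step 5: on the DM_CC line only, replace the mpicc wrapper with mpiicc.
# Assumes DM_CC is defined on a single line. Usage: patch_dm_cc FILE
patch_dm_cc() {
    sed '/^DM_CC[[:space:]]*=/ s/=\([[:space:]]*\)mpicc/=\1mpiicc/' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Step 6: on one given line, replace "!DEC$ vector always" with "!DEC$ SIMD".
# Usage: patch_simd FILE LINE
patch_simd() {
    sed "$2"'s/!DEC\$ vector always/!DEC$ SIMD/' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}
```

From the WRFV3 directory this would be called as patch_dm_cc configure.wrf and patch_simd dyn_em/module_advect_em.F 7578. Both functions leave the file unchanged if the expected text is not found, so it is worth confirming the result with grep before compiling.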

 

Run WRF

 

1. Download the CONUS12_rundir from http://www2.mmm.ucar.edu/wrf/users/download/get_sources.html#WPS  and place it in ../WRFV3.

2. Copy the binary from ../WRFV3/main/wrf.exe to the ../CONUS12_rundir/wrf.exe

3. You need to copy/link at least all files in WRFV3/run to the directory where you are executing wrf.exe (i.e. the CONUS12_rundir)

4. Edit the namelist.input to add "use_baseparam_fr_nml = .t."  under the &dynamics heading without the quotation marks (").

5. cd into the CONUS12_rundir and run WRF natively on the coprocessor with the runtime parameters in the script below.

Script to run on Intel Xeon Phi coprocessor (native)


bash-4.1$ cat wrf.sh

source /opt/intel/impi/4.1.0.030/mic/bin/mpivars.sh

export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013.2.146/compiler/lib/mic/

export KMP_STACKSIZE=62m

ulimit -s unlimited

export I_MPI_DEBUG=5

export WRF_NUM_TILES_X=3

export WRF_NUM_TILES_Y=60

export I_MPI_PIN_MODE=mpd

export KMP_PLACE_THREADS=60C,3T

export OMP_NUM_THREADS=180

export KMP_AFFINITY=balanced,granularity=thread

export KMP_LIBRARY=turnaround

export KMP_BLOCKTIME=infinite

mpiexec.hydra -np 1 ./wrf.exe

6. The run is complete when it prints 'wrf: SUCCESS COMPLETE WRF' on the screen. You will have two files, rsl.error.0000 and rsl.out.0000, in your CONUS12_rundir directory.

7. After the run, compute the total time taken to simulate 149 timesteps with the scripts below. The sum and mean values are the figures of interest for WRF (lower is better). The following parsing scripts may help:


bash-4.1$ cat gettiming.sh 

grep 'Timing for main' rsl.out.0000 | sed '1d' | head -149 | awk '{print $9}' | awk -f stats.awk

bash-4.1$ cat stats.awk 

BEGIN{ a = 0.0 ; i = 0 ; max = -999999999  ; min = 9999999999 }

{

    i ++ 

    a += $1

    if ( $1 > max ) max = $1

    if ( $1 < min ) min = $1

}

END{ printf("---\n%10s  %8d\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }
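Looking back at wrf.sh, the threading values are related: KMP_PLACE_THREADS=60C,3T selects 60 cores with 3 threads each, which gives the 180 threads set in OMP_NUM_THREADS, and WRF_NUM_TILES_X times WRF_NUM_TILES_Y (3 x 60) also gives 180. The sketch below checks that arithmetic before a run. The helper names are hypothetical, the parser handles only the "<cores>C,<threads>T" form used here, and treating "tiles equal threads" as a requirement is an assumption drawn from the values in this document rather than a documented rule.

```shell
# Sketch: sanity-check the threading variables used in wrf.sh.
# Helper names are hypothetical; only the "<n>C,<m>T" form is parsed.

# Print cores * threads-per-core for a value such as "60C,3T".
place_threads_total() {
    cores=${1%%[Cc],*}      # text before "C,"      -> 60
    tpc=${1#*,}             # text after the comma  -> 3T
    tpc=${tpc%[Tt]}         # drop the trailing T   -> 3
    echo $((cores * tpc))
}

# Succeeds only if KMP_PLACE_THREADS, OMP_NUM_THREADS and the tile grid agree.
# ASSUMPTION: one tile per thread, as in the values used in this document.
check_mic_threads() {
    placed=$(place_threads_total "$KMP_PLACE_THREADS")
    tiles=$((WRF_NUM_TILES_X * WRF_NUM_TILES_Y))
    echo "placed threads: $placed, OMP_NUM_THREADS: $OMP_NUM_THREADS, tiles: $tiles"
    [ "$placed" -eq "$OMP_NUM_THREADS" ] && [ "$tiles" -eq "$OMP_NUM_THREADS" ]
}
```

Calling check_mic_threads after the export lines in wrf.sh prints the three numbers and returns a non-zero status if they differ, which makes it easier to spot a mismatch when adapting the script to a card with a different core count.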

Validation of your runs

To check whether the WRF run produced valid results, compare your output with the reference output:

 diffwrf your_output wrfout_reference > diffout_tag 

The 'DIGITS' column should contain a high value (>3). If it does, the WRF run is considered valid.
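Reading the DIGITS column by eye is tedious for a full output file, so the check can be scripted. The sketch below is an illustration under a stated assumption: it takes DIGITS to be the sixth whitespace-separated column of each data row in the diffwrf output (after the field name, Ndifs, Dims, and the two RMS values). Verify this against your own diffout_tag file and pass a different column number if it differs. The function name check_digits is hypothetical.

```shell
# Sketch: report fields whose DIGITS value is not above a threshold.
# ASSUMPTION: DIGITS is column 6 of each data row; override with the 3rd argument.
# Usage: check_digits DIFFOUT [MIN_DIGITS] [COLUMN]
check_digits() {
    awk -v min="${2:-3}" -v col="${3:-6}" '
        # Data rows have an integer in column 2 (Ndifs) and in the DIGITS column.
        $2 ~ /^[0-9]+$/ && $col ~ /^[0-9]+$/ {
            rows++
            if ($col + 0 <= min + 0) { bad++; print "low agreement:", $1, "DIGITS=" $col }
        }
        END {
            printf "%d fields checked, %d at or below %d digits\n", rows, bad, min
            exit (bad > 0)
        }' "$1"
}
```

Running check_digits diffout_tag returns a non-zero status if any listed field has 3 or fewer matching digits. A report of zero fields checked means no data rows were recognized, which may simply indicate that the column assumption does not match your diffwrf build.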

 

IV. Compiling WRF to run on a 2-Socket Intel® Xeon® server

We used these instructions on a 2-Socket Intel® Xeon® E5-26xx system.

Compile WRF

1. Download and un-tar the WRF3.5 source code from the NCAR repository http://www2.mmm.ucar.edu/wrf/users/download/get_sources.html#WPS

2. Source the setup files for Intel MPI Library and Intel Compiler (example:)


source /opt/intel/impi/4.1.0.030/intel64/bin/mpivars.sh

source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64

3. Export the path to the host NetCDF build. Having NetCDF built for the host (an Intel® Xeon® processor-based server in our case) is a prerequisite.

export NETCDF=/localdisk/igokhale/KNC/trunk/WRFV3.4/netcdf/xeon/ 

4. cd into the WRFV3 directory created in step 1, run ./configure, and select option 25: “Linux x86_64 i486 i586 i686, Xeon (SNB with AVX mods) ifort compiler with icc  (dm+sm)”. At the next prompt, for nesting options, press Return to accept the default, which is 1.

5. Edit the configure.wrf to remove -DINTEL_ALIGN64 from the ARCH_LOCAL flags. (this will be fixed in the next release of WRFV3.5)

6. Replace !DEC$ vector always with !DEC$ SIMD on line 7578 in the dyn_em/module_advect_em.F source file.

7. Run ./compile wrf >& build.snb.avx . This will build a wrf.exe in the ../WRFV3/main folder.  (Note: to speed up compiles, set the environment variable J to “-j 4” or whatever number of parallel make tasks you wish to use.)

8. For a new, clean build, run ./clean -a and repeat the process.
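Step 5 above can be applied with a one-line sed edit instead of a text editor. This is a minimal sketch with a hypothetical function name, strip_align64, and it assumes ARCH_LOCAL is defined on a single line of configure.wrf; if your file continues the definition across several lines, edit it by hand.

```shell
# Sketch of step 5: remove -DINTEL_ALIGN64 from the ARCH_LOCAL line only.
# Assumes ARCH_LOCAL is a single-line definition. Usage: strip_align64 FILE
strip_align64() {
    sed '/^ARCH_LOCAL[[:space:]]*=/ s/[[:space:]]*-DINTEL_ALIGN64//' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}
```

A typical sequence from the WRFV3 directory would be strip_align64 configure.wrf, then export J="-j 4" as suggested in step 7, then the compile command from step 7. Other lines that mention the flag are left alone, since the step only names ARCH_LOCAL.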

 

Run WRF

1. Download the CONUS12_rundir from  http://www2.mmm.ucar.edu/WG2bench/conus12km_data_v3/ and place it in ../WRFV3.

2. Copy the binary from ../WRFV3/main/wrf.exe to the ../CONUS12_rundir/wrf.exe

3. You need to copy/link at least all files in WRFV3/run to the directory where you are executing wrf.exe (i.e. the CONUS12_rundir)

4. Edit the namelist.input to add "use_baseparam_fr_nml = .t."  under the &dynamics heading without the quotation marks (").

5. cd into the CONUS12_rundir and run WRF with the runtime parameters in the script below.

Here is an example script to run it on an Intel® Xeon® host:


bash-4.1$ cat run.sh 

source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64

source /opt/intel/impi/4.1.0.030/intel64/bin/mpivars.sh

ulimit -s unlimited

export I_MPI_PIN_MODE=mpd

export OMP_NUM_THREADS=8

export KMP_STACKSIZE=64m

export KMP_AFFINITY=scatter,granularity=thread

export KMP_BLOCKTIME=infinite

export KMP_LIBRARY=turnaround

export WRF_NUM_TILES=32

mpiexec.hydra -np 2 ./wrf.exe

6. The run is complete when it prints 'wrf: SUCCESS COMPLETE WRF' on the screen. You will have two files, rsl.error.0000 and rsl.out.0000, in your CONUS12_rundir directory.

7. After the run, compute the total time taken to simulate 149 timesteps with the scripts below. The sum and mean values are the figures of interest for WRF (lower is better).

The following script should help parse the output:


bash-4.1$ cat gettiming.sh 

grep 'Timing for main' rsl.out.0000 | sed '1d' | head -149 | awk '{print $9}' | awk -f stats.awk

bash-4.1$ cat stats.awk 

BEGIN{ a = 0.0 ; i = 0 ; max = -999999999  ; min = 9999999999 }

{

    i ++ 

    a += $1

    if ( $1 > max ) max = $1

    if ( $1 < min ) min = $1

}

END{ printf("---\n%10s  %8d\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n%10s  %15f\n","items:",i,"max:",max,"min:",min,"sum:",a,"mean:",a/(i*1.0),"mean/max:",(a/(i*1.0))/max) }
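If keeping two files around is inconvenient, the same selection and statistics can be folded into one shell function. The sketch below is a variant of gettiming.sh and stats.awk, not a replacement from the original author: wrf_timing_summary is a hypothetical name, and like the scripts above it assumes the elapsed time is the ninth field of each 'Timing for main' line and skips the first such line.

```shell
# Sketch: gettiming.sh and stats.awk combined into one function.
# Usage: wrf_timing_summary [LOGFILE] [STEPS]   (defaults: rsl.out.0000, 149)
wrf_timing_summary() {
    log=${1:-rsl.out.0000}
    steps=${2:-149}
    grep 'Timing for main' "$log" | sed '1d' | head -n "$steps" |
        awk '{
                 v = $9 + 0                      # elapsed seconds for this step
                 n++; sum += v
                 if (n == 1 || v > max) max = v
                 if (n == 1 || v < min) min = v
             }
             END {
                 if (n == 0) { print "no timing lines found"; exit 1 }
                 printf "items: %d\nmax: %f\nmin: %f\nsum: %f\nmean: %f\n",
                        n, max, min, sum, sum / n
             }'
}
```

Calling wrf_timing_summary in the CONUS12_rundir prints the item count, maximum, minimum, sum, and mean. Unlike stats.awk, it reports an error instead of dividing by zero when the log contains no usable timing lines.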

Validation of the run

To check whether the WRF run produced valid results, compare your output with the reference output:

diffwrf your_output wrfout_reference > diffout_tag 

The 'DIGITS' column should contain a high value (>3). If it does, the WRF run is considered valid.

 

Compiler Options:

-mmic : build an application that natively runs on Intel® Xeon Phi™ Coprocessor

-openmp : enable the compiler to generate multi-threaded code based on the OpenMP* directives (same as -fopenmp)

-O3 : enable aggressive optimizations by the compiler.

-opt-streaming-stores always : generate streaming stores

-fimf-precision=low : low precision for higher performance

-fimf-domain-exclusion=15 : gives lowest precision sequences for Single precision and Double precision. 

-opt-streaming-cache-evict=0 : turn off all cache line evicts.

 

V. Additional Resources (NETCDF)

The following steps build NetCDF for the Intel® Xeon Phi™ Coprocessor.

1. Download NETCDF from http://www.unidata.ucar.edu/downloads/netcdf/netcdf-3_6_2/index.jsp

2. Create a directory called NETCDF:   ‘mkdir NETCDF’

3. cd into NETCDF directory and untar netcdf-3.6.2.tar.gz (tar xvzf netcdf-3.6.2.tar.gz)

4. cd into netcdf-3.6.2 (created after untaring the netcdf-3.6.2.tar.gz)

5. Source the Intel compiler environment, for example: source /opt/intel/composer_xe_2013.5.192/bin/compilervars.csh intel64

6. Set the following environment variables (csh syntax):

    setenv CPPFLAGS "-DpgiFortran"

    setenv CXX "icpc"

    setenv CC "icc"

    setenv F77 "ifort"

7. Run the following in a terminal: ./configure NM=nm --prefix=/path/to/NETCDF --disable-cxx --host=x86_64-k1om-linux --build=x86_64-unknown-linux

8. Run in a terminal: make CFLAGS=-mmic FCFLAGS=-mmic LDFLAGS=-mmic

9. Run on terminal: make install

10. Now, netcdf has been built for Intel® Xeon Phi™ Coprocessor and is ready for use.
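Before pointing the NETCDF variable at the new install, it can help to confirm that the prefix contains the Fortran include file and the static library that the WRF build looks for. The following is a minimal sketch; check_netcdf_prefix is a hypothetical helper, and the two file names assume a default NetCDF 3.6.2 install with the Fortran interface enabled.

```shell
# Sketch: confirm a NetCDF install prefix has the files the WRF build expects.
# ASSUMPTION: default 3.6.2 layout with include/netcdf.inc and lib/libnetcdf.a.
# Usage: check_netcdf_prefix [PREFIX]   (defaults to $NETCDF)
check_netcdf_prefix() {
    prefix=${1:-$NETCDF}
    status=0
    for f in include/netcdf.inc lib/libnetcdf.a; do
        if [ -f "$prefix/$f" ]; then
            echo "found   $prefix/$f"
        else
            echo "missing $prefix/$f"
            status=1
        fi
    done
    return $status
}
```

For example, check_netcdf_prefix /path/to/NETCDF lists each expected file as found or missing and returns a non-zero status if either one is absent. It does not verify that the library was actually built with -mmic.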

VI. Acknowledgements

The author would like to thank all who contributed to the WRF project to date.

VII. About the Author

Indraneil Gokhale is a Software Architect in the Intel Software and Services Group (Intel AZ SSG).

For more complete information about compiler optimizations, see the Optimization Notice.