The Weather Research and Forecasting (WRF) Model is a numerical weather prediction (NWP) system designed for both atmospheric research and operational forecasting. It consists of about half a million lines of code, predominantly in Fortran*.
WRF was first demonstrated on the prototype Intel® Xeon Phi™ Coprocessor (codenamed Knights Ferry) at the Supercomputing Conference in November 2011. WRF is the first NWP model to have been ported to and run in its entirety on Intel’s Many Integrated Core Architecture. It uses a two-level domain decomposition, first over MPI ranks and then within each MPI rank over threads (OpenMP*). This decomposition provides large amounts of hybrid MPI/OpenMP parallelism that is well-suited to the many-core architecture of Intel® Xeon Phi™ Processor x200 (codenamed Knights Landing).
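The two-level decomposition described above can be illustrated with a minimal sketch. It is not WRF's actual implementation: the function names, the grid size, the rank layout, and the choice to split each rank's subdomain into horizontal strips (one per thread) are all assumptions made for illustration.

```python
def split_range(start, end, parts):
    """Split the half-open range [start, end) into `parts` contiguous,
    nearly equal chunks; the first (n % parts) chunks get one extra index."""
    base, extra = divmod(end - start, parts)
    bounds = []
    lo = start
    for p in range(parts):
        hi = lo + base + (1 if p < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def decompose(nx, ny, ranks_x, ranks_y, threads_per_rank):
    """Return {rank: [tile, ...]}; each tile is ((x_lo, x_hi), (y_lo, y_hi)).

    Level 1: the nx-by-ny grid is cut into ranks_x * ranks_y rectangles,
             one per MPI rank.
    Level 2: each rectangle is cut along y into one strip per thread
             (an assumed tiling strategy, chosen for simplicity).
    """
    layout = {}
    rank = 0
    for y_bounds in split_range(0, ny, ranks_y):
        for x_bounds in split_range(0, nx, ranks_x):
            strips = split_range(y_bounds[0], y_bounds[1], threads_per_rank)
            layout[rank] = [(x_bounds, strip) for strip in strips]
            rank += 1
    return layout


# Hypothetical sizes, not the benchmark's real dimensions.
layout = decompose(nx=400, ny=300, ranks_x=4, ranks_y=2, threads_per_rank=8)

# Every grid cell should belong to exactly one tile.
covered = sum((x1 - x0) * (y1 - y0)
              for tiles in layout.values()
              for (x0, x1), (y0, y1) in tiles)
print(len(layout), "ranks,", covered, "cells covered")
```

In a hybrid run, each rank would own one of the level-1 rectangles and its threads would loop over the level-2 strips, so the same grid can be mapped onto few ranks with many threads or many ranks with few threads.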
The results on Knights Landing are a notable example of the value of avoiding offload programming models.
The computational cost in WRF is spread over a relatively flat profile. It comprises dynamics, which is a computational fluid dynamics (CFD) finite-volume solver using explicit finite-difference approximation, and physics modules that represent atmospheric processes such as radiative transfer, convection and other moist processes, and land/sea surface and boundary layer effects.
The flatness of WRF’s profile argues against “offload” programming for coprocessors, which is best suited to applications where most of the cost is concentrated in a small section of code comprising a few thousand lines. Like Intel® Xeon® processors, Knights Landing supports industry-standard, mature, and widely supported programming models (MPI, OpenMP, and a vector ISA). Unlike the Intel® Xeon Phi™ Coprocessor (codenamed Knights Corner), Knights Landing nodes can be self-hosted, avoiding the need for host/coprocessor data movement.
The book (Intel® Xeon Phi™ Processor High Performance Programming, 2nd Edition – Knights Landing Edition) describes, among other topics on Knights Landing, the performance of the full WRF model using the 12 km resolution Continental United States (CONUS12km) benchmark. The CONUS12km benchmark is a widely cited industry-standard workload. Results are provided to explain the nearly threefold speedup compared to Knights Corner. In particular, the impact of the high-bandwidth 16 GB MCDRAM system on the Knights Landing node is discussed, along with the benefits of the AVX-512 instruction set. The use of both all-to-all and quadrant cluster modes, with flat memory mode, is also discussed.
The CONUS12km workload is a compute-only benchmark. Therefore, we ignore the first time step and measure performance over the following 149 time steps to exclude I/O and initialization costs from our measurements. Time steps within the series vary in cost; for example, 5 of the 149 steps involve radiative transfer physics and cost 3.5x more than other steps on Knights Landing. Performance comparisons are based on the average wall-clock time per time step on an unloaded compute node.
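The measurement described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' timing harness: the function name, the cost of the discarded first step, the unit cost of an ordinary step, and the positions of the five radiation steps are all hypothetical. Only the counts (149 timed steps, 5 radiation steps) and the 3.5x factor come from the text.

```python
def mean_step_time(step_times, skip=1):
    """Average per-step wall-clock time, ignoring the first `skip` steps
    (which include I/O and initialization)."""
    timed = step_times[skip:]
    return sum(timed) / len(timed)


BASE = 1.0              # assumed cost of an ordinary step (arbitrary units)
RADIATION_FACTOR = 3.5  # from the text: radiation steps cost 3.5x more

# One discarded first step (its cost of 10.0 is made up) plus 149 timed steps.
step_times = [10.0] + [BASE] * 149

# Mark 5 of the timed steps as radiation steps; these indices are assumed.
for i in (30, 60, 90, 120, 149):
    step_times[i] = BASE * RADIATION_FACTOR

avg = mean_step_time(step_times)
print(f"average timed step: {avg:.4f} x ordinary step cost")
```

Under these assumptions the average is (144 + 5 × 3.5) / 149 = 161.5 / 149, or roughly 1.08 times the cost of an ordinary step, which shows how the few expensive radiation steps raise the reported mean.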
The higher memory bandwidth that MCDRAM provides is the most important factor contributing to the superior CONUS12km WRF performance on Knights Landing. Profiling on Knights Landing shows that WRF CONUS12km is a bandwidth-bound application with bursts of high bandwidth demand.
Knights Landing is the best-performing processor for the WRF CONUS12km benchmark among those compared here. It is faster than both a 2-socket Intel® Xeon® E5-2697 v4 (codenamed Broadwell) processor and a 2-socket Intel® Xeon® E5-2697 v3 (codenamed Haswell) processor. WRF CONUS12km also runs significantly faster on Knights Landing than on Knights Corner.
Overall, we found that running on Knights Landing delivered a substantial performance improvement over the previous-generation Knights Corner coprocessors. This was possible only because the per-core AVX-512 vector units in Knights Landing were able to process data at the rate that the higher memory bandwidth makes available. Knights Landing outperforms the other processors because of its higher memory bandwidth, superior vector capabilities, and higher thread counts.
About the Authors
Indraneil Gokhale is a Software Architect in the Intel Software and Services Group.
John Michalakes is a UCAR Visiting Scientist at the Naval Research Laboratory, Marine Meteorology Division.