Recipe: Building and Running YASK (Yet Another Stencil Kernel) on Intel® Processors

By Charles R Yount, Published: 07/29/2016, Last Updated: 07/29/2016

Overview

YASK (Yet Another Stencil Kernel) is a framework that facilitates design exploration and tuning of HPC stencil kernels through techniques including vector folding, cache blocking, memory layout, loop construction, temporal wave-front blocking, and others.  YASK contains a specialized source-to-source translator that converts scalar C++ stencil code to SIMD-optimized code.  A properly tuned stencil kernel can run up to 2.8 times faster on the Intel® Xeon Phi™ processor than the same program on an Intel® Xeon® processor. The performance advantage of the Intel® Xeon Phi™ processor can be attributed to its high-bandwidth memory and 512-bit SIMD instructions.

Introduction

A very important subset of HPC workloads uses stencil computations to update values of data in time and space. Conceptually, the kernel of a typical 3D iterative Jacobi stencil computation can be shown by the following pseudo-code, which iterates over the points in a 3D grid:

for t = 1 to T do
  for i = 1 to Dx do
    for j = 1 to Dy do
      for k = 1 to Dz do
        u(t + 1, i, j, k) ← S(t, i, j, k)
      end for
    end for
  end for
end for

where T is the number of time-steps; Dx, Dy, and Dz are the problem-size dimensions; and S(t, i, j, k) is the stencil function.  For very simple 1D and 2D stencils, modern compilers can often recognize the data access patterns and optimize code generation to take advantage of vector registers and cache lines, but for more complicated stencils, combined with modern multi-core processors with shared caches and memories, the task of producing optimal code is beyond the scope of most compilers.
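The pseudo-code above can be made concrete with a minimal C++ sketch. This is not YASK code: the Grid3D helper, the one-point halo with fixed boundary values, and the choice of a 7-point average for the stencil function S are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of YASK): a dense 3D grid with a one-point
// halo on every side, so interior indices run from 1 to Dx, Dy, Dz as in
// the pseudo-code.
struct Grid3D {
    int dx, dy, dz;            // interior sizes
    std::vector<double> data;  // (dx+2)*(dy+2)*(dz+2) values, halo included

    Grid3D(int dx_, int dy_, int dz_, double init)
        : dx(dx_), dy(dy_), dz(dz_),
          data(static_cast<std::size_t>(dx_ + 2) * (dy_ + 2) * (dz_ + 2), init) {
        assert(dx_ > 0 && dy_ > 0 && dz_ > 0);
    }
    std::size_t index(int i, int j, int k) const {
        return (static_cast<std::size_t>(i) * (dy + 2) + j) * (dz + 2) + k;
    }
    double& at(int i, int j, int k) { return data[index(i, j, k)]; }
    double at(int i, int j, int k) const { return data[index(i, j, k)]; }
};

// Assumed stencil function S: the average of a point and its six nearest
// neighbors, read from the grid that holds time-step t.
double S(const Grid3D& u, int i, int j, int k) {
    return (u.at(i, j, k) +
            u.at(i - 1, j, k) + u.at(i + 1, j, k) +
            u.at(i, j - 1, k) + u.at(i, j + 1, k) +
            u.at(i, j, k - 1) + u.at(i, j, k + 1)) / 7.0;
}

// Run T Jacobi time-steps. Two buffers are used so that every value for
// step t + 1 is computed only from values of step t.
void jacobi(Grid3D& u, int T) {
    Grid3D next = u;  // copy so the halo values carry over
    for (int t = 1; t <= T; ++t) {
        for (int i = 1; i <= u.dx; ++i)
            for (int j = 1; j <= u.dy; ++j)
                for (int k = 1; k <= u.dz; ++k)
                    next.at(i, j, k) = S(u, i, j, k);
        std::swap(u.data, next.data);  // u now holds step t + 1
    }
}
```

For example, constructing Grid3D g(4, 4, 4, 0.0), setting one interior point, and calling jacobi(g, 1) spreads that value to the point's six neighbors. The innermost loop over k is the one a compiler would normally try to vectorize, which is the starting point that the techniques described below try to improve on.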

YASK is a tool that allows a user to experiment with different types of data distribution, including vector folding, and with different loop structures, which may yield better-performing code than compiler optimizations alone. YASK is currently focused on single-node OpenMP optimizations.

The following graphic shows the typical YASK usage model:

High-level components

Introductory Tutorial

This tutorial will walk a user through the necessary steps to build and execute YASK jobs.

Vector Folding Customization

Vector folding, otherwise known as multi-dimensional vectorization, is the process of packing vector registers with blocks of data which are not necessarily contiguous in order to optimize data and cache reuse.  For a complete discussion of vector folding, please refer to the document titled “Vector Folding: improving stencil performance via multi-dimensional SIMD-vector representation.” Vector folding by hand is complicated and error-prone, so YASK provides a software tool that translates standard sequential code into new code, which can then be compiled into a faster, more efficient program.
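One way to see why folding can improve data reuse is to count how many distinct input vectors a stencil reads to produce one output vector. The following is a simplified counting model, not YASK's implementation: the Fold struct, the floor_div helper, and the restriction to axis-aligned ("star") stencils are assumptions made for illustration.

```cpp
#include <cassert>
#include <set>
#include <tuple>

// Hypothetical description of a fold: how many grid points one SIMD vector
// spans in each dimension (x * y * z = number of SIMD lanes).
struct Fold { int x, y, z; };

// Division that rounds toward negative infinity, so points on the negative
// side of the origin map to negative vector indices.
static int floor_div(int a, int b) {
    return (a >= 0) ? a / b : -((-a + b - 1) / b);
}

// Count the distinct input vectors read to compute ONE output vector for a
// star stencil that reaches `radius` points along each axis.
int vectors_touched(Fold f, int radius) {
    assert(f.x > 0 && f.y > 0 && f.z > 0 && radius >= 0);
    std::set<std::tuple<int, int, int>> vecs;
    // Record which folded vector contains grid point (x, y, z).
    auto touch = [&](int x, int y, int z) {
        vecs.insert(std::make_tuple(floor_div(x, f.x),
                                    floor_div(y, f.y),
                                    floor_div(z, f.z)));
    };
    // Visit every lane of the output vector anchored at the origin and
    // every input point its stencil reads.
    for (int x = 0; x < f.x; ++x)
        for (int y = 0; y < f.y; ++y)
            for (int z = 0; z < f.z; ++z)
                for (int d = -radius; d <= radius; ++d) {
                    touch(x + d, y, z);
                    touch(x, y + d, z);
                    touch(x, y, z + d);
                }
    return static_cast<int>(vecs.size());
}
```

Under this model, with 16 SIMD lanes and a radius of 8 (the reach of a 16th-order stencil), vectors_touched({16, 1, 1}, 8) counts 35 input vectors for a traditional one-dimensional layout, while vectors_touched({4, 4, 1}, 8) counts 25 for a 4x4 fold. The model ignores alignment, shuffles, and caching effects, so it only indicates the direction of the benefit.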

Download detailed Vector Folding paper  [PDF 330 KB]

Loop Structure Customization

In combination with vector folding, executing loops across multiple threads yields additional performance.  By allowing a user to experiment with loop structure via OpenMP constructs, YASK offers yet another avenue for code optimization. There are three main loop control customizations: ‘Rank’ loops break the problem into OpenMP regions, ‘Region’ loops break each OpenMP region into cache blocks, and ‘Block’ loops iterate over each vector cluster in a cache block.
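The inner two levels of this hierarchy can be sketched as nested, tiled loops. The code below is an illustration only and does not reproduce the loops YASK generates: the Dims struct and sweep_region function are hypothetical names, the OpenMP pragma placement and guided schedule are assumptions, and the body merely counts the points each cluster would cover instead of evaluating a stencil.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical size triple used for the region, cache block, and cluster.
struct Dims { int x, y, z; };

// Sweep a region of size `d` in cache blocks of size `b`; inside each
// block, step by vector clusters of size `c`. Returns the number of grid
// points covered, which should equal d.x * d.y * d.z.
long long sweep_region(Dims d, Dims b, Dims c) {
    assert(b.x > 0 && b.y > 0 && b.z > 0 && c.x > 0 && c.y > 0 && c.z > 0);
    long long points = 0;
    // 'Region' loops: one iteration per cache block. Blocks of a Jacobi
    // sweep are independent, so they can be shared among OpenMP threads.
#ifdef _OPENMP
#pragma omp parallel for schedule(guided) reduction(+ : points)
#endif
    for (int bx0 = 0; bx0 < d.x; bx0 += b.x) {
        for (int by0 = 0; by0 < d.y; by0 += b.y) {
            for (int bz0 = 0; bz0 < d.z; bz0 += b.z) {
                // Clip the block at the region boundary.
                const int bx1 = std::min(bx0 + b.x, d.x);
                const int by1 = std::min(by0 + b.y, d.y);
                const int bz1 = std::min(bz0 + b.z, d.z);
                // 'Block' loops: one iteration per vector cluster.
                for (int x = bx0; x < bx1; x += c.x)
                    for (int y = by0; y < by1; y += c.y)
                        for (int z = bz0; z < bz1; z += c.z) {
                            // A real kernel would evaluate the stencil for
                            // this cluster; here we only count its points.
                            const long long nx = std::min(x + c.x, bx1) - x;
                            const long long ny = std::min(y + c.y, by1) - y;
                            const long long nz = std::min(z + c.z, bz1) - z;
                            points += nx * ny * nz;
                        }
            }
        }
    }
    return points;
}
```

Calling sweep_region with different block and cluster sizes always covers the same points; only the traversal order changes. That order determines which data stays in cache between iterations, which is why the block sizes are worth tuning per processor.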

Performance

AWP-ODC: One of the stencils included in YASK is awp-odc, a staggered-grid finite difference scheme used to approximate the 3D velocity-stress elastodynamic equations: http://hpgeoc.sdsc.edu/AWPODC. Applications using this stencil simulate the effect of earthquakes to help evaluate designs for buildings and other at-risk structures. Using a problem size of 1024*384*768 grid points, the Intel® Xeon Phi™ processor 7250 improved performance by up to 2.8x compared to the Intel® Xeon® processor E5-2697 v4.

AWP-ODC

Configuration details: YASK HPC Stencils, AWP-ODC kernel

Intel® Xeon® processor E5-2697 v4: Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz (Turbo ON), 18 Cores/Socket, 36 Cores, 72 Threads (HT on), DDR4 128GB, 2400 MHz, Red Hat Enterprise Linux Server release 7.2

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=awp arch=hsw cluster=x=2,y=2,z=2 fold=y=8 omp_schedule=guided mpi=1
  • ./stencil-run.sh -arch hsw -ranks 2 -bx 74 -by 192 -bz 20 -pz 2 -dx 512 -dy 384 -dz 768

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250, 68 core, 272 threads, 1400 MHz core freq. (Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 86B.0010.R00, DDR4 96GB 2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux Server release 6.7

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=awp arch=knl INNER_BLOCK_LOOP_OPTS='prefetch(L1,L2)'
  • ./stencil-run.sh -arch knl -bx 128 -by 32 -bz 32 -dx 1024 -dy 384 -dz 768

ISO3DFD: Another of the stencils included in YASK is iso3dfd, a 16th-order in space, 2nd-order in time, finite-difference code found in seismic imaging software used by energy-exploration companies to predict the location of oil and gas deposits. Using a problem size of 1536*1024*768 grid points, the Intel® Xeon Phi™ processor 7250 improved performance by up to 2.6x compared to the Intel® Xeon® processor E5-2697 v4.

ISO3DFD

Configuration details: YASK HPC Stencils, iso3dfd kernel

Intel® Xeon® processor E5-2697 v4: Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz (Turbo ON), 18 Cores/Socket, 36 Cores, 72 Threads (HT on), DDR4 128GB, 2400 MHz, Red Hat Enterprise Linux Server release 7.2

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=iso3dfd arch=hsw mpi=1
  • ./stencil-run.sh -arch hsw -ranks 2 -bx 256 -by 64 -bz 64 -dx 768 -dy 1024 -dz 768

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250, 68 core, 272 threads, 1400 MHz core freq. (Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 86B.0010.R00, DDR4 96GB 2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux Server release 6.7

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=iso3dfd arch=knl
  • ./stencil-run.sh -arch knl -bx 192 -by 96 -bz 96 -dx 1536 -dy 1024 -dz 768

Additional Resources

Intel® Xeon Phi™ Processor

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804