Recipe: Building and Running YASK (Yet Another Stencil Kernel) on Intel® Processors

Overview

YASK (Yet Another Stencil Kernel) is a framework that facilitates design exploration and tuning of HPC stencil kernels through techniques including vector folding, cache blocking, memory layout, loop construction, and temporal wave-front blocking.  YASK contains a specialized source-to-source translator that converts scalar C++ stencil code into SIMD-optimized code.  With proper tuning, a stencil kernel can run up to 2.8 times faster on an Intel® Xeon Phi™ processor than the same program on an Intel® Xeon® processor. The performance advantage of the Xeon Phi processor can be attributed to its high-bandwidth memory and 512-bit SIMD instructions.

Introduction

A very important subset of HPC computing is the use of stencil computations to update temporal and spatial data values. Conceptually, the kernel of a typical 3D iterative Jacobi stencil computation can be expressed by the following pseudo-code, which iterates over the points in a 3D grid:

for t = 1 to T do
  for i = 1 to Dx do
    for j = 1 to Dy do
      for k = 1 to Dz do
        u(t + 1, i, j, k) ← S(t, i, j, k)
      end for
    end for
  end for
end for

where T is the number of time-steps; Dx, Dy, and Dz are the problem-size dimensions; and S(t, i, j, k) is the stencil function.  For very simple 1D and 2D stencils, modern compilers can often recognize the data-access patterns and generate code that takes advantage of vector registers and cache lines. For more complicated stencils running on modern multi-core processors with shared caches and memories, however, producing optimal code is beyond the capability of most compilers.

YASK is a tool that lets a user experiment with different types of data distribution, including vector folding, and with loop structures that may yield better-performing code than compiler optimizations alone. YASK is currently focused on single-node OpenMP optimizations.

The following graphic shows the typical YASK usage model:

[Figure: high-level components of YASK]

Introductory Tutorial

A complete introductory tutorial can be found in the documentation section of the YASK website. The tutorial walks a user through the steps necessary to build and execute YASK jobs.

Vector Folding Customization

Vector folding, also known as multi-dimensional vectorization, is the process of packing vector registers with blocks of data that are not necessarily contiguous in memory in order to improve data and cache reuse.  For a complete discussion of vector folding, please refer to the paper “Vector Folding: Improving Stencil Performance via Multi-dimensional SIMD-vector Representation.” Vector folding by hand is complicated and error-prone, so YASK provides a software tool that translates standard sequential code into new code that can be compiled into a faster, more efficient executable.

Download detailed Vector Folding paper [PDF 330 KB]

Loop Structure Customization

In combination with vector folding, distributing loop execution across multiple threads yields additional performance.  By allowing a user to experiment with loop structure via OpenMP constructs, YASK offers yet another avenue for code optimization. There are three main loop-control customizations: ‘Rank’ loops break the problem into OpenMP regions, ‘Region’ loops break each OpenMP region into cache blocks, and ‘Block’ loops iterate over each vector cluster within a cache block.

Performance

AWP-ODC: One of the stencils included in YASK is awp-odc, a staggered-grid finite-difference scheme used to approximate the 3D velocity-stress elastodynamic equations: http://hpgeoc.sdsc.edu/AWPODC. Applications using this stencil simulate the effects of earthquakes to help evaluate designs for buildings and other at-risk structures. Using a problem size of 1024×384×768 grid points, the Intel® Xeon Phi™ processor 7250 improved performance by up to 2.8x compared to the Intel® Xeon® processor E5-2697 v4.

[Chart: AWP-ODC relative performance, Intel® Xeon Phi™ processor 7250 vs. Intel® Xeon® processor E5-2697 v4]

Configuration details: YASK HPC Stencils, AWP-ODC kernel

Intel® Xeon® processor E5-2697 v4: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (Turbo ON), 18 cores/socket, 36 cores, 72 threads (HT on), DDR4 128 GB at 2400 MHz, Red Hat Enterprise Linux Server release 7.2

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=awp arch=hsw cluster=x=2,y=2,z=2 fold=y=8 omp_schedule=guided mpi=1
  • ./stencil-run.sh -arch hsw -ranks 2 -bx 74 -by 192 -bz 20 -pz 2 -dx 512 -dy 384 -dz 768

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (Turbo ON), 1700 MHz uncore freq., 16 GB MCDRAM at 7.2 GT/s, BIOS 86B.0010.R00, DDR4 96 GB at 2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux Server release 6.7

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=awp arch=knl INNER_BLOCK_LOOP_OPTS='prefetch(L1,L2)'
  • ./stencil-run.sh -arch knl -bx 128 -by 32 -bz 32 -dx 1024 -dy 384 -dz 768

ISO3DFD: Another of the stencils included in YASK is iso3dfd, a finite-difference code, 16th-order in space and 2nd-order in time, found in the seismic-imaging software used by energy-exploration companies to predict the location of oil and gas deposits. Using a problem size of 1536×1024×768 grid points, the Intel® Xeon Phi™ processor 7250 improved performance by up to 2.6x compared to the Intel® Xeon® processor E5-2697 v4.

[Chart: ISO3DFD relative performance, Intel® Xeon Phi™ processor 7250 vs. Intel® Xeon® processor E5-2697 v4]

Configuration details: YASK HPC Stencils, iso3dfd kernel

Intel® Xeon® processor E5-2697 v4: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (Turbo ON), 18 cores/socket, 36 cores, 72 threads (HT on), DDR4 128 GB at 2400 MHz, Red Hat Enterprise Linux Server release 7.2

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=iso3dfd arch=hsw mpi=1
  • ./stencil-run.sh -arch hsw -ranks 2 -bx 256 -by 64 -bz 64 -dx 768 -dy 1024 -dz 768

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (Turbo ON), 1700 MHz uncore freq., 16 GB MCDRAM at 7.2 GT/s, BIOS 86B.0010.R00, DDR4 96 GB at 2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux Server release 6.7

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=iso3dfd arch=knl
  • ./stencil-run.sh -arch knl -bx 192 -by 96 -bz 96 -dx 1536 -dy 1024 -dz 768

11 comments

drMikeT:

Thanks, Chuck, I resolved the issue with "DP" reporting the same rates as SP. It was one of my omissions while building the codes.

Mike

Chuck Yount (Intel):

Announcement: YASK version 2 is now available on the 'master' branch at https://github.com/intel/yask. These changes have been in development in the 'develop' and 'v2_alpha' branches over the last 7 months. There are significant new features, including arbitrary grid dimensions and APIs for C++ and Python. The code has also been reorganized, so a clean 'git clone' is recommended. If you run 'git pull', also run 'git status' and/or 'git clean' to make sure files left over from version 1 don't cause mysterious errors. Please see docs/YASK-intro.pdf for more information on the changes.

Chuck Yount (Intel):

No, the FLOPS rate is not normalized, so we would expect the DP rate to be roughly half that of the SP rate.

  • How did you specify the DP vs SP compilation?
  • Did you run "make clean" before recompiling with new settings?
drMikeT:

Hello Chuck,

I compiled all stencils with both 4-byte (single-precision) and 8-byte (double-precision) elements. The best estimated FP rates remain the same, though. Are all rates reported in equivalent single-precision numbers?

Thanks!

Michael 

Excerpt from SKL:

machine_name- Num_ranks = Ranks_per_node X num_nodes x num_cores_per_rank
exp/helios_16-32=16X2x20-2017-11-16_19:51:58-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            7.11358T
exp/helios_16-32=16X2x20-2017-11-16_19:51:58-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            7.12314T
exp/helios_16-32=16X2x20-2017-11-16_20:30:19-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            7.11507T
exp/helios_16-32=16X2x20-2017-11-17_20:07:30-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            7.11561T
exp/helios_16-32=16X2x20-2017-11-17_20:07:30-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            7.09007T
exp/helios_16-32=16X2x20-2017-11-18_07:39:58-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            5.00941T
exp/helios_16-32=16X2x20-2017-11-18_07:39:58-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            5.1548T
exp/helios_16-4=2X2x20-2017-11-17_19:33:50-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            915.487G
exp/helios_16-4=2X2x20-2017-11-17_19:33:50-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            912.857G
exp/helios_16-4=2X2x20-2017-11-17_20:29:01-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            913.946G
exp/helios_16-4=2X2x20-2017-11-17_20:29:01-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            912.981G
exp/helios_16-4=2X2x20-2017-11-18_06:49:44-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            683.333G
exp/helios_16-4=2X2x20-2017-11-18_06:49:44-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            690.287G
exp/helios_16-8=16X2-2017-11-11_22:24:43-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            1.80632T
exp/helios_16-8=16X2-2017-11-11_22:24:43-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            1.79252T
exp/helios_16-8=4X2x20-2017-11-17_19:39:22-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            1.80271T
exp/helios_16-8=4X2x20-2017-11-17_19:39:22-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            1.79352T
exp/helios_16-8=4X2x20-2017-11-18_06:57:44-AVX512-IMPI__2X20-YASK-iso3dfd-FP4.dat:best-throughput (est-FLOPS):            1.30931T
exp/helios_16-8=4X2x20-2017-11-18_06:57:44-AVX512-IMPI__2X20-YASK-iso3dfd-FP8.dat:best-throughput (est-FLOPS):            1.29956T

Chuck Yount (Intel):

Good question regarding snoop settings. I have not done those studies as we're not currently focused on BDW and HSW. Feel free to run the experiments and let us know if you find anything interesting :).

Michael T.:

I was wondering how the performance of the stencil benchmarks varies under different snoop configuration modes on Haswell or Broadwell systems?

Chuck Yount (Intel):

The unexpected performance is due to the two-socket BDW not having SMT enabled. By default, the YASK kernel runs on only half of the CPUs on a BDW or HSW, so when SMT is off, half of the cores are unused. To get the expected performance, do one of the following:

  1. Enable SMT.
  2. Run the kernel with the following options to use all the cores when SMT is off: “-thread_divisor 1 -block_threads 1”.

BTW, the compiler messages are just a lot of warning messages that can be ignored (I will mention this in a later release of the recipe slide).

drMikeT:

The performance differences are at times significant. You reported an estimate of ~256 GF/s for ISO3DFD, and I can attain only ~150-165 GF/s.

I am also using 2.133 GT/s DDR4 DIMMs on our Broadwell system and the Intel 2017.1 tool-set.


user@host [pts/1]yask-master $ make clean; make stencil=iso3dfd arch=hsw mpi=1
user@host [pts/1]yask-master $ ./stencil-run.sh -arch hsw -ranks 2 -- -d 1024 -dx 512

...

best-time (sec):                        20.2204
best-throughput (prob-size-points/sec): 2.65509G
best-throughput (points-updated/sec):   2.65509G
best-throughput (est-FLOPS):            161.961G
────────────────────────────────────────────────────────────
Notes:
 prob-size-points/sec is based on problem-size as described above.
 ...
user@host $ ./stencil-run.sh -arch hsw -ranks 2 -bx 256 -by 64 -bz 64 -dx 768 -dy 1024 -dz 768

....

time in halo exch (sec):                7.6495
────────────────────────────────────────────────────────────
time (sec):                             24.5541
throughput (prob-size-points/sec):      2.45979G
throughput (points-updated/sec):        2.45979G
throughput (est-FLOPS):                 150.047G
time in halo exch (sec):                7.55161
────────────────────────────────────────────────────────────
time (sec):                             24.737
throughput (prob-size-points/sec):      2.4416G
throughput (points-updated/sec):        2.4416G
throughput (est-FLOPS):                 148.938G
time in halo exch (sec):                7.73886
────────────────────────────────────────────────────────────
best-time (sec):                        24.5541
best-throughput (prob-size-points/sec): 2.45979G
best-throughput (points-updated/sec):   2.45979G
best-throughput (est-FLOPS):            150.047G
────────────────────────────────────────────────────────────
Notes:
...

Trying to build with the options mentioned on this web page sends the compiler into an infinite loop.

user@host $ make clean ; make stencil=awp arch=hsw cluster=x=2,y=2,z=2 fold=y=8 omp_schedule=guided mpi=1
rm -fv src/*.[io] *.optrpt src/*.optrpt *.s src/stencil_rank_loops.hpp src/stencil_region_loops.hpp src/stencil_halo_loops.hpp src/stencil_block_loops.hpp src/layout_macros.hpp src/layouts.hpp src/stencil_macros.hpp src/stencil_code.hpp make-report.txt
removed `src/stencil_main.hsw.optrpt'
...

mpiicpc -g -O3 -std=c++11 -Wall -xCORE-AVX2 -debug extended -Fa -restrict -ansi-alias -fno-alias -fimf-precision=low -fast-transcendentals -no-prec-sqrt -no-prec-div -fp-model fast=2 -fno-protect-parens -rcd -ftz -fma -fimf-domain-exclusion=none -qopt-assume-safe-padding -qoverride-limits -vec-threshold0 -qopt-report=5 -qopt-report-phase=VEC,PAR,OPENMP,IPO,LOOP -no-diag-message-catalog -fopenmp   -c -o src/stencil_main.hsw.o src/stencil_main.cpp
icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location
src/realv.hpp(473): remark #15335:  was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
src/realv.hpp(473): remark #15335:  was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
src/realv.hpp(473): remark #15335:  was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
src/realv.hpp(473): (col. 17) warning #13379:  was not vectorized with "simd"
src/realv.hpp(473): remark #15335:  was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
src/realv.hpp(473): remark #15335:  was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
src/realv.hpp(473): remark #15335:  was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
src/realv.hpp(473): (col. 17) warning #13379:  was not vectorized with "simd"
...

Chuck Yount (Intel):

Hi, Michael,

I believe the DDR4 on the BDW is 2133 MHz. Are you running on 2-sockets of Intel Xeon E5-2697 v4 with 18 cores per socket? How far off are your results? Can you send the log file?

Regarding your suggestion about putting the stencil name in the file, coincidentally, I pushed a change with that feature a few days ago. If you do a 'git pull', you should get that update. It also creates a log file with each run.

Regarding documentation, the "Example" stencils don't solve any specific problem, they are just example shapes. The "FSG" and "SSG" codes were contributed by another user, so I don't have any documentation myself. But, you should be able to find papers and other descriptions online for "full" and "standard staggered grid" finite-difference formulations.

drMikeT:

Hello Chuck,

Thanks for the nice Stencil framework for benchmarks! 

I am also using the same Broadwell model that you mention here (and in the presentation), with the same compilation and run-time command lines, but I cannot get numbers as high as those you report. What is the DDR4 speed in your tests?

Could you point to some documentation on the other stencils that your code supports (besides iso3dfd and AWP-ODC)?

One suggestion: include the stencil name in the executable file name, so that we don't have to recompile each time we want to switch stencils.

 

Thanks again!

Michael