Optimizing Memory Bandwidth on STREAM Triad

ID 660002
Updated 3/30/2021

This document demonstrates the best methods to obtain peak memory bandwidth performance on Intel® Xeon processors using the de facto industry standard benchmark for the measurement of computer memory bandwidth, STREAM.

Introduction

The STREAM benchmark is a simple, synthetic benchmark designed to measure sustainable memory bandwidth (in MB/s) and a corresponding computation rate for four simple vector kernels: Copy, Scale, Add, and Triad. Its source code is freely available from the STREAM website, "STREAM: Sustainable Memory Bandwidth in High Performance Computers." STREAM is also part of the HPC Challenge Benchmark suite.

STREAM Rules

The general rule for STREAM is that each array must be at least 4x the size of the sum of all the last-level caches used in the run, or 1 million elements, whichever is larger.
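As a rough illustration of this rule, the following C sketch computes the smallest array length that satisfies it. The function name `stream_min_elements` and the cache sizes used below are hypothetical and chosen only for illustration; they are not part of the STREAM source.

```c
#include <stdio.h>

/* Floor imposed by the STREAM rule: at least 1 million elements. */
#define STREAM_MIN_ELEMENTS 1000000ULL

/*
 * Hypothetical helper: smallest element count per array that satisfies
 * the STREAM sizing rule, given the combined size in bytes of all
 * last-level caches used in the run and the size of one element.
 */
unsigned long long stream_min_elements(unsigned long long total_llc_bytes,
                                       unsigned long long elem_size)
{
    /* Each array must hold at least 4x the combined LLC size (rounded up). */
    unsigned long long by_cache =
        (4ULL * total_llc_bytes + elem_size - 1ULL) / elem_size;

    return by_cache > STREAM_MIN_ELEMENTS ? by_cache : STREAM_MIN_ELEMENTS;
}

/* Prints the minimum array length for a given combined LLC size. */
void print_stream_min_elements(unsigned long long total_llc_bytes)
{
    printf("Minimum elements per array: %llu\n",
           stream_min_elements(total_llc_bytes, sizeof(double)));
}
```

For example, assuming a two-socket system with 32 MiB of last-level cache per socket (64 MiB in total), each array needs at least 256 MiB, or 33,554,432 double-precision elements. An array of 268,435,456 doubles (2 GiB per array) comfortably exceeds that minimum.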

Standard vs. Tuned

The STREAM author defines two categories for citing a memory bandwidth score. Results obtained with the kernels from the source referenced above, used "as is," are considered Standard. The Tuned category allows users or vendors to submit results based on modified source code, and it explicitly allows kernels coded in assembly language. Tuned code must be based on the sample harness that the author provides on the STREAM webpage.

This article provides instructions to compile and run the STREAM benchmark without any source code modifications, so the performance results obtained fall under the Standard category.

Triad

Of the four vector kernels, Triad is the most complex, and it is highly relevant to HPC.

The STREAM Triad kernel is as follows:

#pragma omp parallel for
for (i = 0; i < N; i++) {
	a[i] = b[i] + c[i] * SCALAR;
}
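To show how a Triad pass translates into a bandwidth figure, here is a minimal, self-contained C sketch. The function names `triad` and `triad_mb_per_s` are hypothetical and are not taken from the STREAM source. The sketch assumes double-precision elements and counts three arrays of traffic per iteration (two reads and one write), reported in MB/s with 1 MB = 10^6 bytes.

```c
#include <stddef.h>

/*
 * Hypothetical stand-alone Triad kernel: a[i] = b[i] + scalar * c[i].
 * The pragma takes effect only when OpenMP is enabled at compile time;
 * otherwise the loop runs serially.
 */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    long i;
    long count = (long)n;

    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        a[i] = b[i] + scalar * c[i];
    }
}

/*
 * Bandwidth implied by one Triad pass over n elements that took
 * `seconds` to complete. Three arrays are touched per iteration
 * (b and c are read, a is written), so 3 * sizeof(double) bytes move
 * per element.
 */
double triad_mb_per_s(size_t n, double seconds)
{
    double bytes = 3.0 * (double)sizeof(double) * (double)n;
    return bytes / seconds / 1.0e6;
}
```

With this accounting, one pass over 1,000,000 elements that takes one second corresponds to 24 MB/s. The timing itself is left to the caller in this sketch.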

Compile and Run STREAM on Intel Xeon Processors

  1. Download the latest STREAM benchmark source code.
  2. Download the Intel C Compiler.
  3. Use the following Intel C Compiler options:
    1. Common options for all Intel CPUs: -DNTIMES=100 -DOFFSET=0 -DSTREAM_TYPE=double -DSTREAM_ARRAY_SIZE=268435456 -Wall -O3 -mcmodel=medium -qopenmp -shared-intel -qopt-streaming-stores always
    2. Add one of the following options, depending on the instruction set architecture that the processor supports:
      1. For Intel® Advanced Vector Extensions (Intel® AVX): -xAVX
      2. For Intel® Advanced Vector Extensions 2 (Intel® AVX2): -xCORE-AVX2
      3. For Intel® Advanced Vector Extensions 512 (Intel® AVX-512): -xCORE-AVX512 -qopt-zmm-usage=high
  4. Set the number of OpenMP threads to the total number of available physical cores: export OMP_NUM_THREADS=<available_number_of_physical_cores>
  5. Set the OpenMP thread affinity.
    1. If Hyper-Threading is enabled: export KMP_AFFINITY=granularity=fine,compact,1,0
    2. If Hyper-Threading is disabled: export KMP_AFFINITY=compact
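Before running the benchmark, it can be useful to confirm that the thread count set in the steps above is actually in effect. The following C sketch is one possible check and is not part of STREAM. The function names `active_thread_count` and `report_threads` are hypothetical. It uses the standard OpenMP runtime call `omp_get_num_threads` and falls back to a single thread when OpenMP is not enabled at compile time.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/*
 * Hypothetical helper: returns the number of threads that execute an
 * OpenMP parallel region, or 1 if the program was built without OpenMP.
 */
int active_thread_count(void)
{
    int count = 1;

#ifdef _OPENMP
    #pragma omp parallel
    {
        /* One thread records the team size; the others wait at the
           implicit barrier at the end of the single construct. */
        #pragma omp single
        count = omp_get_num_threads();
    }
#endif

    return count;
}

/* Prints the thread count so it can be compared with OMP_NUM_THREADS. */
void report_threads(void)
{
    printf("Threads in parallel region: %d\n", active_thread_count());
}
```

If the value printed by `report_threads` does not match the number of physical cores, the OMP_NUM_THREADS setting has probably not been exported to the shell that launches the benchmark. To inspect where threads are placed, the Intel OpenMP runtime also accepts a `verbose` modifier in KMP_AFFINITY.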

Acknowledgements

The author would like to thank the Intel Compiler team, Paul Besl, and John McCalpin, the author of the STREAM Benchmark.