Optimizing Memory Bandwidth in Knights Landing on STREAM Triad

Overview

This document demonstrates how to obtain peak memory bandwidth on the Intel® Xeon Phi™ processor (codenamed Knights Landing) using the STREAM* benchmark, the de facto industry-standard benchmark for measuring sustained memory bandwidth.

Introduction

The STREAM benchmark is a simple synthetic benchmark designed to measure sustainable memory bandwidth (in MB/s) and a corresponding computation rate for four simple vector kernels (Copy, Scale, Add, and Triad). Its source code is freely available from http://www.cs.virginia.edu/stream/. STREAM is also a part of the HPCC Benchmark suite.
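
As a rough illustration of what Triad measures, the sketch below shows the kernel and the byte accounting behind the reported MB/s figure. The function names are hypothetical and are not taken from the STREAM source; the accounting assumes the double-precision data type, two reads and one write per element, and 1 MB = 10^6 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Triad kernel: a[i] = b[i] + scalar * c[i] */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes counted for one Triad pass: b and c are read, a is written. */
double triad_bytes(size_t n)
{
    return 3.0 * sizeof(double) * (double)n;
}

/* Bandwidth in MB/s for one pass that took `seconds` of wall time. */
double triad_mbps(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return 1.0e-6 * triad_bytes(n) / seconds;
}
```

A caller would time one call to triad() and pass the elapsed seconds to triad_mbps(). The real benchmark repeats each kernel several times and reports the best time.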

STREAM Rules

The general rule for STREAM is that each array must be at least 4x the size of the sum of all the last level caches used in the run or 1 million elements, whichever is larger.
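
The sizing rule can be checked with a few lines of C. This is a minimal sketch with a hypothetical helper name; the caller supplies the combined last-level cache size, since that value depends on the system under test.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum elements per array under the STREAM rule: the larger of
   1,000,000 elements or 4x the combined last-level cache size. */
size_t stream_min_elements(size_t total_llc_bytes, size_t elem_size)
{
    assert(elem_size > 0);
    size_t cache_bytes = 4 * total_llc_bytes;
    /* Round up so the array is never smaller than 4x the cache. */
    size_t from_cache = (cache_bytes + elem_size - 1) / elem_size;
    size_t floor_elems = 1000000;
    return from_cache > floor_elems ? from_cache : floor_elems;
}
```

For example, assuming 34 MiB of combined L2 cache (an assumption for illustration, not a figure from this article) and 8-byte elements, the helper returns 17,825,792 elements per array.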

Standard versus Tuned Versions

The “standard” results come from the unmodified C or Fortran* code running with the double-precision data type on production hardware. The “tuned” results also use the double-precision data type but allow code modifications, including assembly language code. Our reported measurements generally represent standard results.

Measuring the bandwidth of a new memory type (MCDRAM, in this case) does not require code modifications, although they are possible. In the memory modes that expose MCDRAM as a separate NUMA domain, a standard NUMA tool such as numactl can bind the run to that memory.

Knights Landing Memory Architecture Overview

Knights Landing has two types of memory: DDR4 and MCDRAM.

  • DDR4 is a low-bandwidth, high-capacity memory.

  • MCDRAM is a high-bandwidth, low-capacity (up to 16 GB) memory, packaged with the Knights Landing silicon.

MCDRAM can be configured as a third-level cache (memory-side cache), as a distinct NUMA node, or as a mix of the two. These configurations are selected at boot time as cache, flat, or hybrid mode, respectively. This article demonstrates how to obtain peak memory bandwidth from DDR4 and MCDRAM in flat mode, where DDR4 and MCDRAM are individually addressable memories exposed as distinct NUMA nodes.

Source Modifications

Align the memory allocations at 2 MB boundaries

Static Allocations Example

static double a[N+OFFSET] __attribute__((aligned(2097152)));

Dynamic Allocations Example

a = (STREAM_TYPE *)_mm_malloc(sizeof(STREAM_TYPE)*(STREAM_ARRAY_SIZE+OFFSET),2097152);

Repeat the above for all three arrays (a, b, and c) in the STREAM benchmark.
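
Where _mm_malloc is not available, the same 2 MB alignment can be requested with POSIX posix_memalign. The sketch below is a portable substitute, not the method used for the measurements in this article, and the helper names are hypothetical. It also shows a simple way to confirm the alignment of a returned pointer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_memalign in strict modes */
#endif
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_2MB ((size_t)2097152)

/* Allocate n doubles on a 2 MB boundary; returns NULL on failure.
   Release the memory with free(). */
double *alloc_aligned_2mb(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, ALIGN_2MB, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}

/* Nonzero if p lies on a 2 MB boundary. */
int is_aligned_2mb(const void *p)
{
    return ((uintptr_t)p % ALIGN_2MB) == 0;
}
```

Calling alloc_aligned_2mb once for each of a, b, and c, and checking each result with is_aligned_2mb, mirrors the dynamic allocation example above.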

Compiler Flags Used (Intel Compiler)

-mcmodel=medium -shared-intel -O3 -xMIC-AVX512 -DSTREAM_ARRAY_SIZE=134217728 -DOFFSET=0 -DNTIMES=10 -qopenmp -qopt-streaming-stores always

Note: Each array has 134,217,728 double-precision elements (1 GiB per array, 3 GiB in total), which satisfies the STREAM run rules.

Memory Mode: Flat (Addressable) DDR4 + MCDRAM

  • In flat mode, both DDR4 and MCDRAM are available as separately addressable memory.

  • Source modifications are needed only to place individual allocations in DDR4 and others in MCDRAM within the same application.

  • To place all of an application's allocations in either DDR4 or MCDRAM, use a NUMA tool such as numactl.

    • In this case no source modifications are needed. Users can run the whole application out of MCDRAM, provided its memory footprint fits within the MCDRAM capacity.

  • Because DDR4 and MCDRAM are separate NUMA nodes in flat mode, you must measure DDR4 and MCDRAM bandwidth in separate runs.
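
Whether a run fits in MCDRAM can be estimated before launching it. The following is a minimal sketch with hypothetical helper names; it counts only the three STREAM arrays and treats the MCDRAM capacity as a caller-supplied value, since the usable amount reported by the system may be somewhat below the nominal 16 GB.

```c
#include <stdint.h>

/* Total bytes held by the three STREAM arrays a, b, and c. */
uint64_t stream_footprint_bytes(uint64_t elements, uint64_t elem_size)
{
    return 3u * elements * elem_size;
}

/* Nonzero if the footprint fits in the given MCDRAM capacity. */
int fits_in_mcdram(uint64_t footprint_bytes, uint64_t mcdram_bytes)
{
    return footprint_bytes <= mcdram_bytes;
}
```

With 134,217,728 double-precision elements per array, the footprint is 3,221,225,472 bytes (3 GiB), which fits comfortably within a 16 GiB capacity.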

To Measure DDR4 Bandwidth

  • Run command (e.g. KNL 7250):

    export OMP_NUM_THREADS=68; export KMP_AFFINITY=scatter; numactl -m 0 ./stream

    Use the numactl -H command to determine the available NUMA nodes in the system.

To Measure MCDRAM Bandwidth (Primary Method)

Measure the MCDRAM bandwidth using STREAM in flat mode by binding all memory allocations to the MCDRAM NUMA node at run time with numactl. No source modification is necessary.

  • Run command (e.g. KNL 7250):

    export OMP_NUM_THREADS=68; export KMP_AFFINITY=scatter; numactl -m 1 ./stream

To Measure MCDRAM Bandwidth (Alternate Method)

Use the MEMKIND/HBW APIs with source modifications:

The standard memory allocations in STREAM are static arrays:

static STREAM_TYPE a[STREAM_ARRAY_SIZE+OFFSET],
 b[STREAM_ARRAY_SIZE+OFFSET],
 c[STREAM_ARRAY_SIZE+OFFSET];

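To allocate from MCDRAM (high-bandwidth memory) instead, declare a, b, and c as pointers (STREAM_TYPE *a, *b, *c;) and allocate them as follows:

hbw_posix_memalign((void **)&a, 2097152, sizeof(STREAM_TYPE) * (STREAM_ARRAY_SIZE + OFFSET));
hbw_posix_memalign((void **)&b, 2097152, sizeof(STREAM_TYPE) * (STREAM_ARRAY_SIZE + OFFSET));
hbw_posix_memalign((void **)&c, 2097152, sizeof(STREAM_TYPE) * (STREAM_ARRAY_SIZE + OFFSET));

The syntax is int hbw_posix_memalign(void **memptr, size_t alignment, size_t size). The function is declared in hbwmalloc.h, is provided by the memkind library (link with -lmemkind), and returns 0 on success.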

  • Run command:

    export OMP_NUM_THREADS=68; export KMP_AFFINITY=scatter; ./stream_hbw_malloc
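
One way to keep a single source file for both the standard and the MCDRAM builds is to hide the allocator behind a small wrapper. This is a sketch only: the USE_HBW macro and the stream_alloc and stream_free names are hypothetical, and the fallback path uses POSIX posix_memalign so the file still builds where memkind is not installed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_memalign in strict modes */
#endif
#include <stdlib.h>
#ifdef USE_HBW                    /* hypothetical build switch */
#include <hbwmalloc.h>            /* from memkind; link with -lmemkind */
#endif

/* Allocate `bytes` on a 2 MB boundary; returns 0 on success. */
int stream_alloc(void **ptr, size_t bytes)
{
#ifdef USE_HBW
    return hbw_posix_memalign(ptr, 2097152, bytes);
#else
    return posix_memalign(ptr, 2097152, bytes);
#endif
}

/* Release memory with the deallocator matching stream_alloc. */
void stream_free(void *ptr)
{
#ifdef USE_HBW
    hbw_free(ptr);
#else
    free(ptr);
#endif
}
```

Building with -DUSE_HBW and -lmemkind selects the high-bandwidth allocator, and building without them gives an ordinary DDR4 allocation. Memory obtained from hbw_posix_memalign must be released with hbw_free, which is why the wrapper pairs the two calls.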

Summary

Because MCDRAM provides the high bandwidth that Knights Landing workloads depend on, bandwidth should be measured and reported separately for each memory configuration.

Expected STREAM Triad (GB/s) results for Intel® Xeon Phi™ Processor 7250:

Cluster Mode: Quadrant; Memory Mode: DDR + MCDRAM Flat

Memory         STREAM Triad (GB/s)
MCDRAM Flat    ~475 - 490
DDR4 Flat      ~90
