Quick Start Guide for the Intel® Xeon Phi™ Processor x200 Product Family

By Loc Q Nguyen

Published: 03/27/2015   Last Updated: 09/27/2017


This document describes the process for taking the Intel® Xeon Phi™ processor from the point where the hardware has been received up to the point where the processor is ready to be used by the programmer.

This document does:

  • Provide a high-level overview of the architecture of the processor, focusing on those parts of the architecture that differ from other Intel® processors.
  • Provide configuration options specific to the Intel Xeon Phi processor.

This document does not:

  • Provide information on basic system administration.
  • Provide information on optimizing code for the Intel Xeon Phi processor.

Additional Documentation

Basic system architecture

The Intel® Xeon Phi™ x200 product family is the second-generation Intel Xeon Phi product. It is a many-core processor based on modern Intel Atom® microarchitecture with considerable High Performance Computing (HPC)-focused improvements. As shown in Figure 1, it has a maximum of 72 cores with 4 threads per core, giving a total of 288 CPUs as viewed by the operating system. The cores are laid out in units called tiles. Each tile contains a pair of cores, a shared L2 cache, and a hub connecting the tile to the interprocessor interconnect.

Figure 1. Intel® Xeon Phi™ processor architecture.
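
The core, thread, and tile counts above are related by simple arithmetic. The Python sketch below works through the numbers for the maximum configuration and prints what the operating system reports for comparison. The function and constant names are illustrative only and are not part of any Intel tool.

```python
import os

CORES_PER_TILE = 2    # each tile contains a pair of cores
THREADS_PER_CORE = 4  # hardware threads per core


def logical_cpus(cores, threads_per_core=THREADS_PER_CORE):
    """Number of CPUs the operating system is expected to see."""
    return cores * threads_per_core


def tile_count(cores, cores_per_tile=CORES_PER_TILE):
    """Number of tiles that hold the given number of cores."""
    return cores // cores_per_tile


if __name__ == "__main__":
    print("expected logical CPUs:", logical_cpus(72))
    print("tiles:", tile_count(72))
    # os.cpu_count() returns the CPU count seen by the OS (or None if unknown).
    print("reported by the OS:", os.cpu_count())
```

On a part with fewer than 72 cores, pass the actual core count to both functions.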

Major architectural innovations include the addition of on-package MCDRAM and clustering modes.

MCDRAM is high-bandwidth memory located in the same package as the processor. It can be configured in one of three modes: cache mode, flat mode, or hybrid mode. In cache mode, the MCDRAM is used as an L3 cache; in flat mode, it is used as additional addressable memory; in hybrid mode, a portion of each unit of MCDRAM is used as L3 cache with the remainder being used as additional addressable memory. The MCDRAM configuration is set at boot time and cannot be changed without a reboot.

When all or part of the MCDRAM is used as an L3 in-memory cache, access to the cache is transparent to software and requires no code modifications on the part of the user. When all or part of the MCDRAM is used as flat memory, it can be used transparently, as an extension of the DDR memory address space, or non-transparently, by explicitly allocating space on the MCDRAM using hbwmalloc, or both. Transparent use brings no automatic performance gain. To get performance gains, software must be aware of the increased bandwidth of MCDRAM and use it effectively for bandwidth-critical data structures.
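
One way to see whether flat (or hybrid) MCDRAM is visible to Linux is to look for a NUMA node that has no CPUs attached. The sketch below is a minimal example and rests on two assumptions: that addressable MCDRAM shows up as a CPU-less NUMA node, and that the kernel exposes NUMA nodes under /sys/devices/system/node. The helper name memory_only_nodes is hypothetical.

```python
import glob
import os
import re


def memory_only_nodes(node_root="/sys/devices/system/node"):
    """Return the ids of NUMA nodes that have no CPUs attached."""
    result = []
    for path in glob.glob(os.path.join(node_root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if match is None:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as handle:
                cpulist = handle.read().strip()
        except OSError:
            continue  # node directory without a readable cpulist file
        if cpulist == "":
            # An empty CPU list means the node contributes memory only.
            result.append(int(match.group(1)))
    return sorted(result)


if __name__ == "__main__":
    print("CPU-less NUMA nodes:", memory_only_nodes())
```

An empty result would be expected in cache mode, where the MCDRAM is not addressable. A non-empty result suggests node ids that could be targeted with a NUMA-aware allocator or memory policy.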

Clustering refers to dividing the available cores into contiguous blocks of cores, called clusters. The clustering mode affects the memory latency between the tiles and MCDRAM, and therefore affects performance. The three clustering modes are: Quadrant, Sub-NUMA, and All-to-All.

The default clustering mode is Quadrant. Quadrant mode divides the cores into four sections called quadrants and attempts to decrease intra-process communication time by keeping all threads of a single process close together; it provides good overall performance and is transparent to the programmer. Sub-NUMA clustering mode attempts to increase memory performance by keeping shared memory accesses to MCDRAM closer to the quadrant where the request originated; it offers the possibility of greater performance but requires software redesign to achieve it. All-to-All mode does not divide the cores into multiple clusters and is generally used only as a fail-safe mode.

The cluster mode is set at boot time and cannot be changed without rebooting. Both Quadrant and Sub-NUMA clustering modes require that the same number of equal-capacity DIMMs be installed on each memory controller; if this condition is not met, All-to-All mode is automatically selected as a fallback.
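
The DIMM population rule can be expressed as a small check. The sketch below is an illustration only: the function name and the input format (one list of DIMM capacities per memory controller) are hypothetical, and it interprets the rule as "every controller has the same number of DIMMs and all DIMMs have the same capacity".

```python
def effective_cluster_mode(requested, dimms_per_controller):
    """Return the cluster mode expected to take effect after boot.

    requested: "Quadrant", "Sub-NUMA", or "All-to-All".
    dimms_per_controller: one list of DIMM capacities (in GB) per controller.
    """
    if requested == "All-to-All":
        return requested
    counts = {len(dimms) for dimms in dimms_per_controller}
    capacities = {cap for dimms in dimms_per_controller for cap in dimms}
    # Balanced means: same DIMM count everywhere and a single DIMM capacity.
    balanced = len(counts) == 1 and len(capacities) <= 1
    return requested if balanced else "All-to-All"


if __name__ == "__main__":
    print(effective_cluster_mode("Quadrant", [[16, 16, 16], [16, 16, 16]]))
    print(effective_cluster_mode("Sub-NUMA", [[16, 16, 16], [16, 16]]))
```

The first call describes a balanced population and keeps the requested mode; the second has an unequal DIMM count and falls back to All-to-All.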

New Instruction Set

The Intel Xeon Phi x200 product family uses the standard Intel® Architecture (IA) Instruction Set Architecture (ISA), similar to that of other Intel® processors, including 256-bit Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2).

The Intel Xeon Phi x200 product family also supports Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. Each core has two vector processing units (VPUs) that operate on 512-bit vector registers. Four subsets of Intel AVX-512 instructions are available in the Intel Xeon Phi processor:

  • AVX512-F: Fundamental instruction set
  • AVX512-CD: Conflict Detection instruction set
  • AVX512-ER: Exponential and Reciprocal instruction set
  • AVX512-PF: Prefetch instruction set
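
On Linux, the four subsets listed above can be checked against the CPU feature flags in /proc/cpuinfo. The sketch below assumes the kernel reports them with the flag names avx512f, avx512cd, avx512er, and avx512pf; the function name is illustrative.

```python
# Subset name from the list above -> flag name as printed in /proc/cpuinfo.
CPUINFO_FLAGS = {
    "AVX512-F": "avx512f",
    "AVX512-CD": "avx512cd",
    "AVX512-ER": "avx512er",
    "AVX512-PF": "avx512pf",
}


def avx512_subsets(cpuinfo_text):
    """Map each subset name to True/False using the first 'flags' line."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags = set(value.split())
            break
    return {name: flag in flags for name, flag in CPUINFO_FLAGS.items()}
```

A typical use would be to read the text of /proc/cpuinfo and pass it to avx512_subsets(); a kernel without Intel AVX-512 support would be expected to report all four as False even on capable hardware.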

Installing the Software Stack

The Intel Xeon Phi processor should be able to use any operating system that executes on other Intel processors, as long as that operating system supports Intel AVX-512 registers. Table 1 lists some of the latest operating systems that can run on the Intel Xeon Phi processor.

Operating System Enabling Status
CentOS* 7.3 64-bit kernel 3.10.0-514
CentOS 7.2 64-bit kernel 3.10.0-327
Red Hat* Enterprise Linux* Server 7.3 64-bit kernel 3.10.0-514
Red Hat Enterprise Linux Server 7.2 64-bit kernel 3.10.0-327
SUSE* Linux Enterprise Server SLES 12 SP2 kernel 4.4.21-69-default
SUSE Linux Enterprise Server SLES 12 SP1 kernel 3.12.49-11-default

Table 1. Operating system support for the Intel® Xeon Phi™ processor.

Although the processor should be able to run any operating system that supports the IA ISA containing Intel AVX-512 registers, Intel validates against a limited number of operating systems. The initial validation is being done against Red Hat* Enterprise Linux* (RHEL) 7, CentOS* 7, and SUSE* Linux Enterprise Server (SLES) 12. As delivered, these distributions do not support the Intel Xeon Phi processor; they require patches that can be obtained from Intel Xeon Phi Processor Software. From the Intel® Developer Zone page for the Intel Xeon Phi Processor, under “Software and Tools,” select “Intel Xeon Phi Processor Software”. Then download and install the software for the supported operating system, from those listed in Table 1, that is installed with your processor.
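
Before downloading the patches, it can help to confirm that the running kernel corresponds to one of the rows in Table 1. The sketch below is a rough check under the assumption that the kernel release string begins with the version shown in the table, followed by a distribution suffix; the names TABLE1_KERNELS and table1_match are hypothetical.

```python
import platform

# Kernel version prefixes taken from Table 1.
TABLE1_KERNELS = {
    "3.10.0-514": "CentOS 7.3 / Red Hat Enterprise Linux Server 7.3",
    "3.10.0-327": "CentOS 7.2 / Red Hat Enterprise Linux Server 7.2",
    "4.4.21-69": "SUSE Linux Enterprise Server 12 SP2",
    "3.12.49-11": "SUSE Linux Enterprise Server 12 SP1",
}


def table1_match(release):
    """Return the Table 1 entry matching a kernel release string, or None."""
    for prefix, distro in TABLE1_KERNELS.items():
        if release == prefix or release.startswith((prefix + ".", prefix + "-")):
            return distro
    return None


if __name__ == "__main__":
    release = platform.release()
    print(release, "->", table1_match(release) or "not listed in Table 1")
```

A result of "not listed" does not by itself mean the system is unsupported; it only means the kernel differs from the versions named in Table 1.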

Configuration Options

After installing the software stack, you can configure the Cluster and Memory Modes on your system. Please refer to the following article for information related to configuring Cluster and Memory modes supported by the Intel Xeon Phi processor.

Intel® Xeon Phi™ x200 Processor - Memory Modes and Cluster Modes: Configuration and Use Cases

Intel® Tools

Intel provides the micperf tool for monitoring and evaluating the Intel Xeon Phi processor. micperf is designed to incorporate a variety of benchmarks into a simple user experience with a single interface for execution.

This tool and its documentation can be obtained from the Intel Xeon Phi Processor Software page. The README and User’s Guide of micperf can also be found in /usr/share/doc/micperf-<version>.

Because the Intel Xeon Phi processor implements the IA ISA, it is able to run other tools available from Intel.

In addition to the operating system, the OpenFabrics* Enterprise Distribution (OFED) software should be installed if a high-performance network is utilized. For the Intel Xeon Phi processor without an integrated fabric interface, any version of this software supported by a normal Intel Xeon processor should be usable, including OpenFabrics, Mellanox*, and Intel® True Scale Fabric. For Intel Xeon Phi processors with an integrated Intel® Omni-Path Fabric (Intel® OP Fabric) interface, use the Intel® Omni-Path Software (Intel® OP Software) instead. Intel OP Software can be downloaded from https://downloadcenter.intel.com/search?keyword=Omni-Path.

User Environment

The user environment is, in part, dictated by the system administrator’s choice of operating system.

Unlike the first-generation Intel Xeon Phi coprocessor, for which Intel provided a minimal user environment as part of the MPSS, Intel provides no user environment for the latest generation processor. Administrators can install an environment with as many features as they choose.

Also, unlike the coprocessor, all tools will run natively on the processor. Running the compilers natively removes the complications that previously occurred when attempting to build third-party and open source software that relied on configure scripts to determine the architecture and available compilers.

Intel® Parallel Studio XE

The following Intel® products support—or will support, in the case of tools not yet released—program development on the Intel Xeon Phi processor:

  • Intel® C Compiler/Intel® C++ Compiler/Intel® Fortran Compiler
  • Intel® Math Kernel Library (Intel® MKL)
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
  • Intel® Integrated Performance Primitives (Intel® IPP)
  • Intel® Cilk™ Plus
  • Intel® Threading Building Blocks (Intel® TBB)
  • Intel® VTune™ Amplifier XE
  • Intel® Advisor XE
  • Intel® Inspector XE
  • Intel® MPI Library
  • Intel® Trace Analyzer and Collector
  • Intel® Cluster Ready
  • Intel® Cluster Checker

Open Source Tools

Support for the processor is included in the mainline for GDB 7.12 (included in Intel® Parallel Studio XE 2018).

Basic Programming

Programmers should note three major changes: the MCDRAM on-package high-bandwidth memory, the new clustering modes, and the new Intel AVX-512 instructions. This section provides a very brief description of the Intel software and libraries that facilitate the use of MCDRAM and clustering modes.

For more specific information on these three topics, see the pre-release version of Intel 64 and IA-32 Architectures Software Developer Manuals.

To simplify the use of both MCDRAM and the new clustering modes, Intel is working with the Open Source community to develop the hbw_malloc library. Figure 2 shows the syntax for calling this library. This library is based on the jemalloc and memkind APIs and libraries. It provides a simple way to exploit both new capabilities by simply replacing a program’s malloc() calls with hbw_malloc() calls in C/C++ and the FASTMEMORY directive for Fortran.

Figure 2: High Bandwidth malloc (hbwmalloc) APIs

hbwmalloc is intended to be a simple and low-cost way of allowing developers to take advantage of both MCDRAM and cluster modes. hbw_malloc() allocates memory from MCDRAM when possible. It also is aware of the clustering mode and will automatically allocate, if possible, memory that is closer to the tile with the allocating thread.

Figure 3 shows the syntax for both C/C++ and Fortran to allocate memory from MCDRAM.

Figure 3: Code snippets illustrating the use of hbwmalloc() for using MCDRAM and clustering modes
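
As a quick sanity check that the hbwmalloc interface is installed and can see high-bandwidth memory, the library can be probed without writing a C program. The sketch below loads the memkind shared library through Python's ctypes and calls hbw_check_available(), which the hbwmalloc interface documents as returning 0 when high-bandwidth memory is available. That the library can be located under the name "memkind" is an assumption, and the wrapper function name is hypothetical.

```python
import ctypes
import ctypes.util


def hbw_memory_available():
    """True/False as reported by hbw_check_available(); None if memkind cannot be loaded."""
    name = ctypes.util.find_library("memkind")  # assumed library name
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        check = lib.hbw_check_available
    except (OSError, AttributeError):
        return None
    check.restype = ctypes.c_int
    check.argtypes = []
    # A return value of 0 means high-bandwidth memory is available.
    return check() == 0


if __name__ == "__main__":
    print("high-bandwidth memory available:", hbw_memory_available())
```

A result of None only says the library was not found on this system. In application code the allocation calls themselves, such as hbw_malloc(), would be made from C/C++ or Fortran as described above, not through this probe.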




Intel® AVX-512 supported functionality (F, CD, ER, PF)

  • 512-bit floating point/integer
  • 32 registers + 8 mask registers
  • Embedded rounding and broadcast
  • Scalar/Intel® SSE/Intel® AVX “promotions”
  • Transcendental support
  • Gather/scatter
  • Must run an OS that supports Intel AVX-512
  • Use Intel® C++ and Intel® Fortran Compilers 2015 (version 15.0 or later)
  • GCC 4.7+ is enabled


Mesh clustering

Supports 3 types of clustering in the inter-processor mesh:

  • All-to-all
  • Quadrant
  • SubNUMA 4 (SNC4)
  • SW: Intel C++ and Intel Fortran Compilers 2015 (version 15.0 or later), GCC 4.7+
  • BIOS for configuring at boot; use of hbwmalloc.h


  • Processor: Up to 16 GB high-bandwidth on-package memory (MCDRAM) exposed as NUMA node
  • ~500 GB/s sustained bandwidth
  • Configurable usage as in-memory L3 cache, flat addressable memory, or hybrid cache and flat
  • Coprocessor: Up to 16 GB of high-bandwidth (MCDRAM) on-package memory (supports only flat model)
  • SW: Intel C++ and Intel Fortran Compilers 2015 (version 15.0 or later), GCC 4.7+
  • BIOS for configuring at boot; use of hbwmalloc.h


Six channels of DDR4 – Up to 384 GB


Silvermont Core

  • Enhancements include four threads/core
  • Two memory ops/cycle, out-of-order vector/FP, 32K L1 cache, 64 micro-TLBs, larger data TLBs including 1G page support


Legacy Intel® Xeon® processor compatibility

Binary compatible with legacy code for the Intel Xeon processor


Future Intel Xeon processor compatibility

The same ISA with some minor exceptions:

  • No Intel® Transactional Synchronization Extensions (Intel® TSX) in Intel® Xeon Phi™ processor
  • Small set of Intel AVX-512 HPC-specific instructions in Intel Xeon Phi processor but not in Intel Xeon processor
  • Small set of Intel AVX-512 instructions in Intel Xeon processor but not Intel Xeon Phi processor; for example, 512-bit byte/word support (promotions from Intel® AVX2, some additions)

SW: Intel C++ and Intel Fortran Compilers 2015 (version 15.0 or later), GCC 4.9+

Languages, libraries, and tools

Currently released Intel® languages and tools recognize Intel AVX-512 instructions

SW: Intel® Parallel Studio XE 2015 Update 2, GCC 4.9+

Open Source support

GCC generates Intel AVX-512 and other Intel AVX-512 extensions for KNL; GDB supports Intel AVX-512 and other Intel AVX-512 extensions for KNL

GCC 4.9+/GDB 7.8.1+

OS support for Intel Xeon Phi processor

  • Any IA (Intel® Architecture) OS supporting Intel AVX-512; KNL is validated against SUSE* 12 and RHEL* 7 distributions
  • KNL-F supported only by Linux*

Linux kernels 3.15+

Table 2: Software feature enabling for the Intel® Xeon Phi™ processor

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804