The following article gives an overview of the different memory modes and cluster modes supported by the Intel® Xeon Phi™ x200 processor – codename Knights Landing (KNL). It shows how each mode can be configured in the BIOS and lists application characteristics that make it favorable to use one mode over another. The intent of this article is to give developers and system administrators guidance on which memory mode or cluster mode to use with their applications.
It is important to note that this article does not give detailed information about the KNL architecture, nor does it give examples or a detailed explanation of any specific programming model. Useful references are included at the end of the article for further reading.
The Intel® Xeon Phi™ x200 processor is offered with two types of memory – DDR4 as high-capacity memory and MCDRAM as high-bandwidth memory.
Figure 1. KNL package overview
Some of the architectural highlights for the Intel® Xeon Phi™ x200 product family – codename Knights Landing – are as follows:
- Up to 72 cores, organized as up to 36 tiles connected by a 2D mesh interconnect; each tile contains two cores, two vector processing units (VPUs) per core, and a 1 MB L2 cache shared between the two cores.
- Support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
- Up to 16 GB of on-package MCDRAM high-bandwidth memory, plus up to 384 GB of DDR4 across six channels.
- A self-boot processor, binary compatible with Intel® Xeon® processors.
Figure 2. KNL Tile
Currently, Intel® Xeon Phi™ x200 processor-based systems can be booted into a particular mode by selecting the appropriate configuration in the BIOS. While the KNL system is booting, press the setup key (F2 in most cases) to enter the BIOS. Once in the BIOS, the memory mode and cluster mode configurations can be set using the following selections.
EDKII Menu -> Advanced -> Uncore configurations
Figure 3. BIOS Screen - EDKII Menu
Figure 4. BIOS Screen - Advanced Menu
Memory modes define the ways in which the different memory components, DDR and MCDRAM, are exposed to software. Developers can choose one mode over another depending on the application's bandwidth and latency characteristics.
In this mode, all of the MCDRAM behaves as a memory-side, direct-mapped cache in front of DDR4. As a result, there is only a single visible pool of memory, and MCDRAM appears as a high-bandwidth, high-capacity L3 cache.
Cache mode can be selected as given below:
Figure 5. BIOS Screen – Selecting Cache Mode
Use of MCDRAM in cache mode is completely transparent to the user. This makes the mode advantageous for legacy applications where modifying the code is challenging or would require significant engineering effort. Under the right conditions, this mode can offer near-full MCDRAM performance along with the full capacity of DDR. In general, applications whose data accesses have spatial and temporal locality can achieve near-peak performance, while streaming applications with very large data sets, or applications with no data reuse, will see performance closer to DDR levels.
In flat mode, MCDRAM is used as software-visible, operating system (OS) managed addressable memory (exposed as a separate NUMA node), so memory can be selectively allocated to your advantage from either DDR4 or MCDRAM. With slight modifications to your software to use both types of memory at the same time, flat mode can deliver uncompromised performance. To support this, the system is designed so that, by default, all code and data are allocated out of DDR, preserving MCDRAM as a precious resource for bandwidth-critical data structures.
Detection and selective allocation of high-bandwidth memory can be done using the memkind library. More information about using this library can be found in the references at the end of this article.
Flat mode can be selected as given below:
Figure 6. BIOS Screen – Selecting Flat Mode
This mode works best for applications that do not show ideal caching characteristics, for example applications that stream large data sets without much reuse, or that randomly access very large data sets. In such cases, where application performance is limited by DDR bandwidth and/or the application's memory footprint grows beyond the available MCDRAM capacity, you can often boost performance by identifying bandwidth-critical hotspots and selectively allocating only those critical data structures from high-bandwidth memory.
The hybrid mode offers a bit of both worlds – part of MCDRAM is configured as addressable memory and the rest as cache. MCDRAM is divided in such a way that peak bandwidth can still be achieved in both portions, even though the MCDRAM capacity available to each is reduced.
There are two variants of hybrid mode – hybrid25 and hybrid50 – in which 25% and 50% of MCDRAM, respectively, is allocated as cache. In both variants, the two types of memory (DDR and MCDRAM) are exposed to software as two NUMA nodes, similar to flat mode. To take advantage of MCDRAM as high-bandwidth memory, code modifications or explicit NUMA controls for memory allocation are required. Use of MCDRAM as cache, however, remains transparent to the user.
Hybrid mode can be selected by changing the Memory Mode. Hybrid50 or Hybrid25 can be selected by changing MCDRAM Cache Size.
Figure 7. BIOS Screen – Selecting Hybrid50 Mode
Hybrid mode is a good choice for customers running a mix of applications with different bandwidth, cache, and latency characteristics. It is also beneficial when an application can take advantage of high-bandwidth memory in one portion of its runtime while utilizing a larger, high-performance cache in another. In addition, this mode allows different users to exploit MCDRAM as either flat memory or cache without rebooting the system.
As the name suggests, these modes make only a single type of memory available in the system, i.e., MCDRAM. This is similar to Intel® Xeon® systems, which generally have one type of memory – DDR.
No special configuration is required to enable this mode.
Some customers may choose MCDRAM-only systems for applications that have very high bandwidth requirements but a memory footprint smaller than the MCDRAM capacity.
Cluster modes define the different ways in which memory requests originating from the cores are fulfilled by the memory controllers; that is, they determine how memory accesses are routed across the chip to reach the appropriate memory channel. To understand the routing of a memory request from a core (in case of an L2 miss), we define three elements/agents with specific roles as follows:
- Core: generates the memory request when an access misses in its L2 cache.
- Tag directory: a directory, distributed across the chip, to which the request is forwarded in order to determine which memory channel serves the requested address.
- Memory channel: the DDR or MCDRAM channel that ultimately services the request.
In this mode, memory accesses can be routed from any core to any memory channel. Although this is not the default mode, the KNL system automatically falls back to it whenever there is memory asymmetry (in the number or capacity of DIMMs) or any other irregularity in the system configuration. This mode can offer very good performance, but it is not fully optimized because the average distance traveled by memory requests is not minimized. In the worst case, all three agents can be in different parts or quadrants of the chip, generating a lot of mesh traffic.
All2All cluster mode can be selected as follows:
Figure 8. BIOS Screen – Selecting All2All Mode
All2All mode can be used in situations where customers want to be very flexible in terms of DIMM configuration. This mode is completely transparent to the users and can optimally support both MPI and threading models like OpenMP or Intel® Threading Building Blocks (Intel® TBB).
Quadrant is the default mode, selected when the DIMM configuration is symmetric and the system is free of any other configuration irregularities. In this mode, the capacity and number of DIMMs on all channels must be the same. It is not mandatory to populate all the DIMMs; however, doing so achieves maximum bandwidth. Whenever a memory request is generated from a core, it is forwarded to the tag directory, which in turn forwards it to memory in case of an L2 miss. Because the tag directory and the memory channel are always in the same quadrant, this generally results in less mesh traffic than All2All mode and hence can provide better performance. It is important to note that, unlike SNC4 mode, the chip is not divided into multiple NUMA domains; however, there will still be two NUMA domains when flat or hybrid memory mode is selected.
Quadrant mode can be selected as follows:
Figure 9. BIOS Screen – Selecting Quadrant Mode
Quadrant mode is ideal for customers who want better performance than All2All without any code modifications; its benefits are obtained simply by using a symmetric DIMM configuration. For this reason it was chosen as the default mode for KNL systems, provided there is a symmetric DIMM configuration and no system irregularities. With reduced mesh traffic, quadrant mode can provide better performance than All2All mode. For certain shared-memory workloads, where an application uses all the cores in a single process through a threading library like OpenMP or Intel TBB, this mode can also provide better performance than the sub-NUMA clustering modes.
In this mode, all three agents involved in servicing a memory request – the core, the tag directory, and the memory channel – are always in the same region (the same quarter in SNC4, or the same half in SNC2). The resulting localization reduces mesh traffic and congestion, providing an opportunity for the absolute highest performance. To achieve consistently optimized performance in this mode, code modifications or NUMA environment controls are required, because the separate clusters or regions of the chip appear to the OS as separate NUMA nodes. Conceptually, this is very similar to a multi-socket Intel® Xeon® system, where the OS manages the memory attached to each socket separately and the system is subject to both the benefits and the limitations of NUMA.
SNC-4 cluster mode can be selected as given below:
Figure 10. BIOS Screen – Selecting SNC-4 Mode
This mode is suitable for customers who want to use distributed-memory programming models with MPI, or hybrid MPI-OpenMP (or MPI-TBB), on KNL. For such applications, processes can be distributed across the SNC regions as if they were separate sockets, and within each cluster or region each MPI rank can exploit thread-level parallelism using OpenMP or TBB. To achieve this, proper MPI affinity controls, such as MPI pinning and pin domains, must be used to affinitize each MPI rank to a unique NUMA node. Without pinning, processes and threads may migrate to a NUMA node that does not hold their allocated memory, thereby reducing performance.
This article gave a brief overview of the different memory modes and cluster modes supported by Intel® Xeon Phi™ x200 processors, along with snapshots showing how each mode can be configured in the BIOS. It also made recommendations about the application characteristics and behavior to consider when deciding which memory mode or cluster mode to use.
Getting Ready for Intel® Xeon Phi™ x200 Product Family (/content/www/us/en/develop/articles/getting-ready-for-knl.html)
What disclosures has Intel® made about Knights Landing (https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing)
An Overview of Programming for Intel® Xeon processors and Intel® Xeon Phi™ coprocessors (https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf)
Knights Corner: Your Path to Knights Landing (/content/www/us/en/develop/videos/knights-corner-your-path-to-knights-landing.html)
High Bandwidth Memory (HBM): how will it benefit your application? (/content/www/us/en/develop/articles/high-bandwidth-memory-hbm-how-will-it-benefit-your-application.html)
GitHub - memkind and jemalloc (https://github.com/memkind)
Sunny Gogar received a Master’s degree in Electrical and Computer Engineering from the University of Florida, Gainesville and a Bachelor’s degree in Electronics and Telecommunications from the University of Mumbai, India. He is currently a software engineer with Intel Corporation's Software and Services Group. His interests include parallel programming and optimization for Multi-core and Many-core Processor Architectures.
 BIOS – Basic Input Output System
 This diagram is for conceptual purposes only and only illustrates a processor and memory – it is not to scale and does not include all functional areas of the processor, nor does it represent actual component layout. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
 NUMA – Non Uniform Memory Access
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804