Dynamic random-access memory (DRAM) dual in-line memory modules (DIMMs) cannot supply data as fast as Intel® Xeon Phi™ processors can consume it. To help alleviate this problem, Intel Corporation offers two technologies that place DRAM closer to the cores:
- Multi-Channel DRAM (MCDRAM) is proprietary, high-bandwidth memory that physically sits atop the family of Intel Xeon Phi processors, code named Knights Landing.
- High-Bandwidth Memory (HBM), which is compatible with JEDEC standards, is high-bandwidth memory designed for a generation of Intel Xeon Phi processors, code named Knights Hill.
However, even the higher bandwidth of these new technologies is still too low; a high L1 and L2 cache hit rate is necessary for Intel Xeon Phi processors to run at full speed.
Why These Technologies Are Needed
Intel Xeon Phi processors are the freight trains of the data processing world. While each core is not as fast or as flexible as an Intel® Xeon® processor core, there are more of them and each core has twin vector processing units to move a lot of data along exactly the same path.
The 144 vector processing units can perform approximately 3 teraflops in double precision. That is 3 million million operations per second, and twice as many if you can use single-precision operations instead. But if the data is not available for the units to process, the units sit idle.
Note: All numbers in this article are approximate, and vary widely depending on the hardware configuration and how it is used. Consult technical documentation to get exact numbers for a particular configuration.
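One way to see where a figure of roughly 3 teraflops can come from is to multiply it out. The sketch below is not taken from Intel documentation: only the count of 144 vector processing units comes from this article, while the 512-bit vector width, one fused multiply-add per lane per cycle, and the 1.3 GHz clock are illustrative assumptions.

```python
# Back-of-the-envelope peak throughput.
VECTOR_UNITS = 144            # from the article
VECTOR_BITS = 512             # assumption: 512-bit vector registers
FLOPS_PER_LANE_PER_CYCLE = 2  # assumption: one fused multiply-add per lane per cycle
CLOCK_HZ = 1.3e9              # assumption: a clock of about 1.3 GHz


def peak_flops(bits_per_element: int) -> float:
    """Peak floating-point operations per second for a given element width."""
    lanes = VECTOR_BITS // bits_per_element  # elements processed per vector
    return VECTOR_UNITS * lanes * FLOPS_PER_LANE_PER_CYCLE * CLOCK_HZ


if __name__ == "__main__":
    print(f"double precision: {peak_flops(64) / 1e12:.2f} teraflops")
    print(f"single precision: {peak_flops(32) / 1e12:.2f} teraflops")
```

Under these assumptions the double-precision result lands close to 3 teraflops, and halving the element width doubles the lane count, which is why single precision gives twice the rate.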
Data can come from three places:
- It can already be loaded into the vector processing unit, perhaps as the result of a previous operation.
- It can be loaded from the private L1 and L2 caches that abut each core.
- It can come from memory outside the processor through a fast set of paths that form a grid around the cores.
The first two places have a huge advantage over the third – the cores need not share the bandwidth of the memory devices outside the processor, or the channels going to them.
L1 and L2 caches can supply about 25 terabytes of data per second, enough to keep all the vector processing units busy all the time, but they are limited in size: about 0.5 MB per core.
While some applications perform many operations on such a small data set, many applications have much larger data sets. If data cannot be cached in the registers of the vector processing unit, or in the L1 and L2 caches, then it must come from the double data rate (DDR) memory in the DIMMs or from the in-package memory. The six channels of DDR memory outside the processor can supply about 90 GB of data per second, while the many channels of MCDRAM can supply about 400 GB of data per second.
Why These Technologies Are Not Enough
The 400 GB of data per second from MCDRAM is only about 2% of the 25 terabytes of data per second the caches can supply and the vector processing units can consume.
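The gap can be worked out directly from the approximate figures quoted in this article. The sketch below also derives how many operations a program would need to perform per byte fetched from each memory to keep the vector processing units busy. That second ratio is an illustrative, roofline-style estimate and not a number from the article, and the function names are hypothetical.

```python
# Approximate figures quoted in the article.
PEAK_DP_FLOPS = 3e12        # double-precision operations per second
CACHE_BYTES_PER_S = 25e12   # L1 and L2 caches
MCDRAM_BYTES_PER_S = 400e9  # in-package MCDRAM
DDR_BYTES_PER_S = 90e9      # six channels of DDR memory


def share_of_cache_bandwidth(bytes_per_s: float) -> float:
    """Fraction of the cache bandwidth that a memory can deliver."""
    return bytes_per_s / CACHE_BYTES_PER_S


def flops_needed_per_byte(bytes_per_s: float) -> float:
    """Operations to perform per byte fetched so the units never wait."""
    return PEAK_DP_FLOPS / bytes_per_s


if __name__ == "__main__":
    for name, bw in (("MCDRAM", MCDRAM_BYTES_PER_S), ("DDR", DDR_BYTES_PER_S)):
        print(f"{name}: {share_of_cache_bandwidth(bw):.2%} of cache bandwidth, "
              f"{flops_needed_per_byte(bw):.1f} operations needed per byte")
```

With these numbers MCDRAM delivers about 1.6% of the cache bandwidth, and a program streaming from it would need roughly 7.5 operations per byte to stay compute bound; from DDR the requirement rises to about 33.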
What You Should Do
Look carefully into how your algorithms spread load and store instructions across the threads that execute on the cores. Try to use data repeatedly from the L1 and L2 caches after the data first moves there. When the next lot of data is needed, or when you must store more data than fits in those caches, the MCDRAM provides another 16 GB of storage before you must go off-package to the 384 GB of storage available on DDR4 DIMMs.
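Reusing data while it is still in the L1 and L2 caches is often done by blocking (tiling) loops. The following is a minimal sketch of that loop structure for a matrix multiply, assuming square tiles of doubles and a budget of about 0.5 MB per core as quoted earlier. The function names are hypothetical, and pure Python will not show the speedup itself; a compiled language would use the same loop nest.

```python
import math

CACHE_BYTES_PER_CORE = 512 * 1024  # article: about 0.5 MB of L1 and L2 per core
BYTES_PER_DOUBLE = 8


def tile_edge(cache_bytes: int = CACHE_BYTES_PER_CORE, tiles_resident: int = 3) -> int:
    """Largest square tile edge such that `tiles_resident` tiles of doubles fit."""
    return math.isqrt(cache_bytes // (tiles_resident * BYTES_PER_DOUBLE))


def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists), one tile at a time."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # The inner loops touch only one tile each of a, b, and c,
                # so that working set can stay cache resident while reused.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

With the default budget, `tile_edge()` returns 147, meaning three 147 x 147 tiles of doubles fit in about 0.5 MB. In practice the usable tile is smaller, because other data and threads sharing the core compete for the same cache.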
Some algorithms can subdivide data into hot data (frequently accessed data that does not fit in the processor caches) and cold data (other rarely used data). In these cases, it is possible to place the hot data in the MCDRAM and the cold data in the DIMMs. It may even be possible to prefetch the data from the DIMMs into the MCDRAM before it is needed.
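Deciding which data counts as hot can be framed as a small packing problem. The sketch below is one possible heuristic, not a method from this article: it sorts buffers by accesses per byte and fills the 16 GB of MCDRAM greedily. The `Buffer` type, the access counts, and the buffer names are all hypothetical. The actual placement, typically done through a high-bandwidth allocator or NUMA policy settings, is not shown.

```python
from typing import NamedTuple

GIB = 2**30
MCDRAM_BYTES = 16 * GIB  # article: 16 GB of MCDRAM


class Buffer(NamedTuple):
    name: str
    size_bytes: int  # must be greater than zero
    accesses: int    # hypothetical count taken from a profiling run


def place_buffers(buffers, capacity=MCDRAM_BYTES):
    """Greedy plan: most accesses per byte go to MCDRAM first, the rest to DDR."""
    placement = {}
    free = capacity
    by_heat = sorted(buffers, key=lambda b: b.accesses / b.size_bytes, reverse=True)
    for buf in by_heat:
        if buf.size_bytes <= free:
            placement[buf.name] = "MCDRAM"
            free -= buf.size_bytes
        else:
            placement[buf.name] = "DDR"
    return placement


if __name__ == "__main__":
    plan = place_buffers([
        Buffer("grid", 12 * GIB, 1200),
        Buffer("halo", 2 * GIB, 1000),
        Buffer("table", 6 * GIB, 300),
        Buffer("checkpoint", 40 * GIB, 4),
    ])
    print(plan)
```

A greedy plan like this is not guaranteed to be optimal, but it is easy to reason about and to compare against measurements.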
In other algorithms, accesses are unpredictable, but once data is accessed, it is likely to be accessed again in the near future. For these cases, you can set aside a portion of the MCDRAM at boot time to cache the DIMMs.
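Whether caching the DIMMs pays off depends on how often accesses land on data that is still resident. The toy model below estimates a hit rate for a sequence of addresses, assuming a direct-mapped cache with 64-byte lines. The function name and the tiny cache size used in the comments are illustrative assumptions, scaled far below real hardware.

```python
def hit_rate(addresses, cache_lines, line_bytes=64):
    """Fraction of accesses served by a direct-mapped cache of `cache_lines` lines."""
    if not addresses:
        return 0.0
    slots = [None] * cache_lines  # line currently held by each cache slot
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        slot = line % cache_lines  # direct-mapped: each line has exactly one slot
        if slots[slot] == line:
            hits += 1
        else:
            slots[slot] = line     # miss: fetch the line, evicting the old one
    return hits / len(addresses)
```

For example, with four cache lines the sequence `[0, 0, 64, 0]` hits on half of its accesses, while `[0, 256, 0, 256]` never hits, because both addresses map to the same slot and keep evicting each other. Recently used data benefits from the cache only as long as conflicting lines do not displace it.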
The previous article, What's New About Modern Hardware, introduced three changes in memory technology: on-package memory, non-volatile DIMMs (NVDIMMs), and Intel® Omni-Path Fabric (Intel® OP Fabric). This article gives more details about the on-package memory. The next article, Why Efficient Use of the Memory Subsystem is Critical to Performance, discusses the algorithms and analysis that are critical to getting the best possible performance out of an Intel Xeon Phi processor and MCDRAM.
You may also want to read Performance Improvement Opportunities with NUMA Hardware, a series that covers the basics of efficiently using these new technologies.
About the Author
Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. Bevin enjoys the problem-solving aspects of being a 911 paramedic, and often describes the differences between debugging people and debugging code: with patients, there is no backup/restore, no pause/resume, limited visibility into what is happening inside them, and the whole interaction is over in about an hour. Come to think of it, apart from the one-hour bit, that is a lot like debugging a massive application on a computer cluster!