Dynamic random-access memory (DRAM) dual-inline memory modules (DIMMs) cannot supply the bandwidth of data Intel® Xeon Phi™ processors can consume. To help alleviate this problem, Intel Corporation offers two technologies that place DRAM memory closer to cores:
However, even the higher bandwidth of these new technologies is still too low; a high L1 and L2 cache hit rate is necessary for Intel Xeon Phi processors to run at full speed.
Intel Xeon Phi processors are the freight trains of the data processing world. While each core is not as fast or as flexible as an Intel® Xeon® processor core, there are more of them and each core has twin vector processing units to move a lot of data along exactly the same path.
The 144 vector processing units can together perform approximately 3 teraflops, that is, 3 million million double-precision floating-point operations per second, and twice as many if you can use single precision instead. But if the data is not available for the units to process, the units sit idle.
Note: All numbers in this article are approximate, and vary widely depending on the hardware configuration and how it is used. Consult technical documentation to get exact numbers for a particular configuration.
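As a rough check on the 3-teraflop figure, the peak rate can be estimated from the number of vector units. The sketch below is only an estimate under stated assumptions: 512-bit vectors (eight doubles each), one fused multiply-add per unit per cycle, and a clock of about 1.3 GHz. None of those three values comes from this article, so treat them as illustrative.

```python
# Back-of-the-envelope peak throughput estimate.
# VECTOR_UNITS comes from the article; the remaining constants are assumptions.
VECTOR_UNITS = 144          # from the text: twin vector units per core
DOUBLES_PER_VECTOR = 8      # assumption: 512-bit vectors / 64-bit doubles
FLOPS_PER_FMA = 2           # assumption: a fused multiply-add counts as 2 operations
CLOCK_HZ = 1.3e9            # assumption: approximate core clock


def peak_flops(doubles_per_vector=DOUBLES_PER_VECTOR):
    """Peak floating-point operations per second across all vector units."""
    return VECTOR_UNITS * doubles_per_vector * FLOPS_PER_FMA * CLOCK_HZ


if __name__ == "__main__":
    double_precision = peak_flops()
    # Single precision packs twice as many elements into each vector.
    single_precision = peak_flops(doubles_per_vector=2 * DOUBLES_PER_VECTOR)
    print(f"double precision: {double_precision / 1e12:.2f} teraflops")
    print(f"single precision: {single_precision / 1e12:.2f} teraflops")
```

Under these assumptions the double-precision estimate comes out just under 3 teraflops, and the single-precision estimate is twice that.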
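Data can come from three places: the registers of the vector processing units, the L1 and L2 caches, and memory (the in-package memory or the DIMMs).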
The first two places have a huge advantage over the third – the cores need not share the bandwidth of the memory devices outside the processor, or the channels going to them.
L1 and L2 caches can supply enough data to keep all the vector processing units busy all the time, but they are limited in size: about 0.5 MB per core. Together they provide about 25 terabytes of data per second.
While some applications perform many operations on such a small data set, many applications have much larger data sets. If data cannot be held in the registers of the vector processing units, or in the L1 and L2 caches, then it must come from the double data rate (DDR) memory in the DIMMs or from the in-package memory, multi-channel DRAM (MCDRAM). The six channels of DDR memory outside the processor can supply about 90 GB of data per second, whereas the many channels of MCDRAM can supply about 400 GB of data per second.
The 400 GB of data per second from MCDRAM is only about 2% of the 25 terabytes of data per second that the caches can supply and the vector processing units can consume.
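One way to read these figures is to ask how many operations must be performed on each value fetched from memory to keep the vector units busy. The sketch below uses only the approximate numbers quoted in this article; the function names are illustrative, and the model ignores latency, stores, and prefetching.

```python
# Approximate figures quoted in the article (bytes per second, operations per second).
PEAK_FLOPS = 3e12        # ~3 teraflops, double precision
CACHE_BW = 25e12         # ~25 TB/s from the L1 and L2 caches
MCDRAM_BW = 400e9        # ~400 GB/s from MCDRAM
DDR_BW = 90e9            # ~90 GB/s from the DDR DIMMs
BYTES_PER_DOUBLE = 8


def fraction_of_cache_bandwidth(bandwidth):
    """Memory bandwidth expressed as a fraction of the cache bandwidth."""
    return bandwidth / CACHE_BW


def flops_needed_per_double(bandwidth, peak=PEAK_FLOPS):
    """Operations each fetched double must feed to sustain the peak rate."""
    doubles_per_second = bandwidth / BYTES_PER_DOUBLE
    return peak / doubles_per_second


if __name__ == "__main__":
    for name, bw in (("MCDRAM", MCDRAM_BW), ("DDR", DDR_BW)):
        print(f"{name}: {fraction_of_cache_bandwidth(bw):.2%} of cache bandwidth, "
              f"{flops_needed_per_double(bw):.0f} operations per double")
```

With these numbers, each double fetched from MCDRAM has to be reused for roughly 60 operations, and each double fetched from DDR for roughly 270, before the vector units stop waiting on memory. That reuse has to happen in the registers and caches.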
Look carefully into how your algorithms spread load and store instructions across the threads that execute on the cores. Try to use data repeatedly from the L1 and L2 caches after the data first moves there. When the next lot of data is needed, or when you must store more data than fits in those caches, the MCDRAM provides another 16 GB of storage before you must go off-package to the 384 GB of storage available on DDR4 DIMMs.
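Reusing data while it is still in the L1 and L2 caches is often done by blocking (tiling) loops. The sketch below shows the loop structure for a blocked matrix multiply and picks a tile size from the "about 0.5 MB per core" figure. It is written in pure Python to show the access pattern only; the function names are illustrative, the rule of keeping three tiles resident is an assumption, and a real implementation would be compiled, vectorized code.

```python
import math

CACHE_BYTES_PER_CORE = 512 * 1024   # "about 0.5 MB per core" from the text
BYTES_PER_DOUBLE = 8


def max_tile_size(cache_bytes=CACHE_BYTES_PER_CORE, tiles_resident=3):
    """Largest t such that `tiles_resident` t-by-t tiles of doubles fit in cache.

    Assumption: one tile each of A, B, and C should be resident at once.
    """
    return math.isqrt(cache_bytes // (tiles_resident * BYTES_PER_DOUBLE))


def blocked_matmul(a, b, tile):
    """Multiply matrices (lists of rows) one tile at a time.

    Each tile of `a` and `b` is reused for a whole tile of the result
    before the loops move on to the next tile.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    print("tile size:", max_tile_size())
    print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1))
```

The tile size is the tuning knob: too large and the tiles spill out of the cache, too small and loop overhead dominates.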
Some algorithms can subdivide data into hot data (frequently accessed data that does not fit in the processor caches) and cold data (other rarely used data). In these cases, it is possible to place the hot data in the MCDRAM and the cold data in the DIMMs. It may even be possible to prefetch the data from the DIMMs into the MCDRAM before it is needed.
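A hot/cold split can be planned before any allocation happens. The sketch below is one possible greedy heuristic, not a method described in this article: it ranks buffers by accesses per gigabyte and fills the 16 GB of MCDRAM with the densest ones. The `Buffer` class, `plan_placement` function, and the sample access rates are all hypothetical; in practice the rates would come from profiling.

```python
from dataclasses import dataclass

MCDRAM_CAPACITY_GB = 16.0   # from the text


@dataclass
class Buffer:
    name: str
    size_gb: float
    accesses_per_second: float   # hypothetical: measured by a profiler


def plan_placement(buffers, capacity_gb=MCDRAM_CAPACITY_GB):
    """Greedily assign the most densely accessed buffers to MCDRAM.

    Returns a dict mapping buffer name to "MCDRAM" or "DDR".
    """
    placement = {}
    free_gb = capacity_gb
    by_density = sorted(
        buffers,
        key=lambda buf: buf.accesses_per_second / buf.size_gb,
        reverse=True,
    )
    for buf in by_density:
        if buf.size_gb <= free_gb:
            placement[buf.name] = "MCDRAM"
            free_gb -= buf.size_gb
        else:
            placement[buf.name] = "DDR"
    return placement


if __name__ == "__main__":
    example = [
        Buffer("grid", 10.0, 1e9),
        Buffer("lookup", 4.0, 8e8),
        Buffer("scratch", 6.0, 3e8),
        Buffer("history", 100.0, 1e6),
    ]
    print(plan_placement(example))
```

The plan then has to be carried out by the allocator or the operating system, for example with a high-bandwidth-memory allocator such as the memkind library's `hbw_malloc`, or with NUMA placement tools when the MCDRAM is exposed as a separate memory node.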
In other algorithms, accesses are unpredictable, but once data is accessed, it is likely to be accessed again in the near future. For these cases, you can set aside a portion of the MCDRAM at boot time to cache the DIMMs.
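How much this helps depends on how often an access finds its data already in the MCDRAM. The sketch below is a deliberately simplified model, assuming every byte is served at either the MCDRAM rate or the DDR rate quoted in this article, and ignoring the extra traffic a miss generates to fill the cache. The function name is illustrative.

```python
MCDRAM_GBS = 400.0   # approximate MCDRAM bandwidth from the text
DDR_GBS = 90.0       # approximate DDR bandwidth from the text


def effective_bandwidth_gbs(hit_rate, fast=MCDRAM_GBS, slow=DDR_GBS):
    """Average bandwidth when a fraction `hit_rate` of bytes hits the MCDRAM cache.

    Simplified model: the time per byte is the weighted sum of the time
    per byte at each level; miss-handling overhead is ignored.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    time_per_gb = hit_rate / fast + (1.0 - hit_rate) / slow
    return 1.0 / time_per_gb


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.9, 1.0):
        print(f"hit rate {rate:.0%}: {effective_bandwidth_gbs(rate):.0f} GB/s")
```

In this model a 50% hit rate yields only about 147 GB per second, while a 90% hit rate yields just under 300 GB per second, so the benefit of the cache grows steeply as the hit rate approaches 100%.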
The previous article, What's New About Modern Hardware, introduced three changes in memory technology: on-package memory, non-volatile DIMMs (NVDIMMs), and Intel® Omni-Path Fabric (Intel® OP Fabric). This article gives more details about the on-package memory. The next article, Why Efficient Use of the Memory Subsystem is Critical to Performance, discusses the algorithms and analysis that are critical to getting the best possible performance out of an Intel Xeon Phi processor and its MCDRAM.
You may also want to read Performance Improvement Opportunities with NUMA Hardware, a series that covers the basics of efficiently using these new technologies.
Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. Bevin enjoys the problem-solving aspects of being a 911 paramedic, and often describes the differences between debugging people and debugging code: with patients, there is no backup/restore, no pause/resume, limited visibility into what is happening inside them, and the whole interaction is over in about an hour. Come to think of it, apart from the one-hour bit, that is a lot like debugging a massive application on a computer cluster!