Programming modern computers rarely requires an understanding of underlying hardware and software; consequently, most programmers do not know how the memory subsystem works.
However, such lack of knowledge can ultimately produce a 10x or worse slowdown in application performance – especially since the arrival of new hardware technologies.
This article explains why optimal usage of the memory subsystem can have massive performance benefits.
It is hard to know where a modern computer begins and ends. The boundary is blurred by issues such as wired and wireless networks, data transfer across such networks by load and store instructions, and I/O devices that themselves contain complicated software and the hardware to run them. The following is a useful simplification rather than a complete truth.
Think of a computer as a motherboard with:
- one or more processors
- the memory modules that make up main memory
- the I/O devices

Each processor contains from one to hundreds of cores. Each core contains hardware to:
- fetch and decode instructions
- execute those instructions, including the load and store instructions that move data between the core and memory
The rest of this article describes how a load instruction executed in a core moves data from a memory module – in this case, a dual in-line memory module (DIMM) – to that core.
Sometimes you do not want to know how a thing is made – things like sausages, laws, and (it turns out) memory accesses – because knowing changes your perception of that thing forever. But if you are one of those people who need to know, read on…
When you map a file into memory, you specify the virtual addresses to use. (If you ever examined a C or C++ pointer variable using a debugger, you examined 64-bit virtual addresses.) Pointer arithmetic is done using virtual addresses.
A virtual address consists of two parts:
- the virtual page number (the upper bits)
- the offset of the byte within the page (the lower bits)

Translation hardware, controlled by the OS, converts the page number into a physical page number and adds back in the offset, using per-process page tables maintained by the OS and accelerated by cached entries in translation lookaside buffers (TLBs).
For example, the Intel® Core™ i7-6700 processor has two levels of TLBs: small, fast first-level TLBs backed by a larger, slower second-level TLB.
Applications that rapidly touch more pages than TLBs can map may stall doing virtual-to-physical address translation.
A TLB miss is expensive. You can reduce the TLB miss rate by:
- using larger pages, so that each TLB entry maps more memory
- improving locality of reference, so that fewer distinct pages are touched in any short interval
Once the physical address has been determined, a read or write request is sent to the L1 cache. The L1 cache will either perform the access or propagate it deeper into the memory subsystem.
If nothing useful can be done, the core stalls. Unfortunately, the OS is almost unaware of the stall: the application appears to be running, and it is hard to tell if the application is slower than it should be. You need tools to examine hardware performance counters to see stall details.
The accesses propagating through the memory subsystem combine a request type with the relevant physical address and, for stores, the data being written.
Data moves around most of the memory subsystem in 64-byte quantities called cache lines. A cache entry, which is some transistors that can store a physical address and a cache line, is filled when a cache line is copied into it. Pages are evenly divided into cache lines – the first 64 bytes of a 4096-byte page is a cache line, with the 64 bytes stored together in a cache entry; the next 64 bytes is the next cache line, etc.
Each cache line may:
- be absent from every cache
- be present in one or more caches
- hold, in some cache, newer data than the copy in the DIMM, because a core has written to it
Cores, I/O devices, and other devices send requests to caches to either read or write a cache entry for a physical address. The lowest six bits of the physical address are not sent – they are used by the core to select the bytes within the cache line. The core sends separate requests for each cache line it needs.
Note: Memory mapping can result in two or more virtual addresses mapping to the same physical address. This often happens with shared libraries and demand-zero pages.
The L1 cache – usually 32 KB of data and 32 KB of instructions – is private to the adjacent core, which is why it can supply data so quickly. All hyper-threads in a core share the L1 cache.
The next cache out, the L2 cache, is sometimes private and sometimes shared by two cores.
If they exist, the L3 and L4 caches are shared by all the cores, and perhaps by more than one processor.
If the Multi-Channel DRAM (MCDRAM) is used to cache the DIMMs, that cache is also shared.
Cache sharing is not necessarily a problem when the cores are accessing different data; however, if a core is trying to store more than its share of data in the shared caches, it can push out another core’s data, so that neither benefits. In some situations the faster-accessing core can dominate and use the entire cache for itself, causing a load imbalance.
The red ring road buses shown in the diagram below connect the L2 caches to portions of the L3 cache, as well as to the Intel® QuickPath Interconnect (Intel® QPI) links, the peripheral component interconnect express (PCIe) links, and the home agents for the memory controllers. The two ring roads are themselves connected by two short, escalator-like buses.
Traffic on the buses only goes as far as necessary, and different traffic can be on different sections of the bus simultaneously. Obviously the further traffic must go, the higher the latency. Bandwidth must be shared when traffic uses the same sections of the bus, but usually it is not the limiting factor.
A memory access that starts at a core and misses the L1 and L2 caches goes along the buses to the home agent for the target memory controller for that physical address. If the destination memory controller is on a different socket in a multiprocessor system, the traffic goes over the Intel QPI link to a ring bus on the target processor, which adds more delays.
The Intel® Xeon Phi™ processor has a different layout. This processor generation, code named Knights Landing, uses a grid rather than the two interconnected rings, and does not have an L3 cache at all; otherwise the traffic follows the same basic patterns. These processors have home agents and memory controllers for two different types of dynamic random-access memory (DRAM): double data rate (DDR) DIMMs and MCDRAM.
There are multiple memory controllers near the bus, each controlling multiple channels. Each memory controller is connected to the bus by a home agent. The home agent for the memory controller recognizes the physical addresses for its channels.
Interleaving is a technique for spreading adjacent addresses within a page across multiple memory devices, so that hardware-level parallelism increases the available bandwidth – assuming some other access is not already consuming it. Virtual addresses within a page map to adjacent physical addresses, so without interleaving, consecutive accesses would all go to the same memory controller and swamp it. Physical addresses can be interleaved across sockets, or only within a socket – the choice is firmware-selectable.
Before 3D XPoint™ DIMMs, interleaving was done per one or two cache lines (64 bytes or 128 bytes), but DIMM non-volatile memory characteristics motivated a change to every four cache lines (256 bytes). So now four adjacent cache lines go to the same channel, and then the next set of four cache lines go to the next channel.
Intel® microarchitecture code named Skylake supports 1, 2, 3, 4, and 6-way interleaving. If five channels are populated, the home agents are probably using a 2-way and a 3-way interleave. 3-way and 6-way interleaving support implies the hardware can do mod 3 calculations on the addresses – a non-trivial amount of work.
The home agent translates the physical address into a channel address and passes it to the memory controller. Each memory controller has a table to find what to do with each range of channel addresses it is handling.
For example: On Intel Xeon Phi processors code named Knights Landing, this table is how a memory controller knows if the address is in the range of addresses where the MCDRAM is caching a more distant DDR DIMM. If the address is for a DIMM, the memory controller translates the channel address into a (channel, dimm, rank, bank, row, column) tuple used in a conversation over a bus to the DIMM.
Despite their complexity of operation, memory controllers are implemented in specialized logic circuits, not micro-programmed!
The previous article, Why Efficient Use of the Memory Subsystem is Critical to Performance, discusses the algorithms and analysis that are critical to getting the best possible performance out of an Intel Xeon Phi processor and its MCDRAM.
This article discusses how knowing the number of devices, the buses, and data sharing opportunities can help you understand:
The next article, Detecting and Avoiding Bottlenecks, describes how to recognize and avoid congestion in this complex network connecting the cores to the main memory.
In addition, there is a series of articles starting with Performance Improvement Opportunities with NUMA Hardware that provides an introduction to all aspects of modern NUMA hardware and how it can be used within various application domains.
Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. These articles would be much harder to read without the extensive editorial work of one of Intel’s writers – Nancee Moster. Bevin made the mistakes, Nancee recognized and called him on many of them – but we are sure some have leaked on through…
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804