Programming modern computers rarely requires an understanding of underlying hardware and software; consequently, most programmers do not know how the memory subsystem works.
However, this lack of knowledge can produce a 10x or worse slowdown in application performance, especially with the arrival of new hardware technologies.
This article explains why optimal usage of the memory subsystem can have massive performance benefits.
It is hard to know where a modern computer begins and ends. The boundary is blurred by issues such as wired and wireless networks, data transfer across such networks by load and store instructions, and I/O devices that themselves contain complicated software and the hardware to run it. The following is a useful simplification rather than a complete truth.
Think of a computer as a motherboard with:
- Multiple, mounted processors
- Buses connecting those processors to:
- I/O devices (data is transferred in huge slabs)
- Memory devices (data is transferred in much smaller slabs)
Each processor contains from one to hundreds of cores. Each core contains hardware to:
- Fetch instructions.
- Decode those instructions.
- Schedule the instructions for execution.
- Execute the instructions.
- Fetch the data the instructions need.
- Store the data the instructions produce.
The rest of this article describes how a load instruction executed in a core moves data from a memory module – in this case, a dual in-line memory module (DIMM) – to that core.
Virtual Addresses, Physical Addresses, and Beyond
Sometimes you do not want to know how a thing is made – things like sausages, laws, and (it turns out) memory accesses – because knowing changes your perception of that thing forever. But if you are one of those people who need to know, read on…
Instructions Generate Virtual Addresses
When you map a file into memory, you specify the virtual addresses to use. (If you ever examined a C or C++ pointer variable using a debugger, you examined 64-bit virtual addresses.) Pointer arithmetic is done using virtual addresses.
Hardware Translates Virtual Addresses into Physical Addresses
A virtual address consists of two parts:
- Page number – Acts as an index into a page table
- Offset – The location difference between the byte address you want and the start of the page
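For the common 4-KB page size, this split is simple bit arithmetic – the low 12 bits are the offset and the remaining bits are the page number. A minimal sketch:

```python
PAGE_SIZE = 4096          # 4-KB pages
OFFSET_BITS = 12          # log2(4096)

def split_virtual_address(va):
    """Split a virtual address into (page number, offset within page)."""
    page_number = va >> OFFSET_BITS
    offset = va & (PAGE_SIZE - 1)
    return page_number, offset

# Example: an address 100 bytes into the third page
va = 2 * PAGE_SIZE + 100
print(split_virtual_address(va))   # (2, 100)
```

Larger page sizes change only the offset width – 21 bits for 2-MB pages, 30 bits for 1-GB pages.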
Hardware configured by the OS translates the page number into a physical page number and appends the offset. The translation walks per-process page tables maintained by the OS, and is accelerated by cached entries in translation lookaside buffers (TLBs).
For example, the Intel® Core™ i7-6700 processor has two levels of TLBs:
- First level
- TLBs for data can map 1-GB pages (4-way set associative, 4 entries) or 4-KB pages (4-way set associative, 64 entries)
- TLBs for instructions can map 4-KB pages (8-way set associative, 64 entries)
- Second level has TLBs for both data and instructions and can map either 4-KB or 2-MB pages (6-way associative, 1536 entries) and 1-GB pages (4-way associative, 16 entries)
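These entry counts translate directly into TLB "reach" – the total memory a TLB level can map without a miss. A quick sketch using the figures above:

```python
def tlb_reach(entries, page_size):
    """Total bytes mappable by a TLB with the given entry count and page size."""
    return entries * page_size

KB, MB, GB = 1024, 1024**2, 1024**3

# First-level data TLB: 64 entries of 4-KB pages vs. 4 entries of 1-GB pages
print(tlb_reach(64, 4 * KB) // KB, "KB")    # 256 KB
print(tlb_reach(4, 1 * GB) // GB, "GB")     # 4 GB

# Second level: 1536 entries of 2-MB pages
print(tlb_reach(1536, 2 * MB) // GB, "GB")  # 3 GB
```

A working set that sweeps through more than a few hundred kilobytes of 4-KB pages already exceeds the first-level data-TLB reach.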
Applications that rapidly touch more pages than TLBs can map may stall doing virtual-to-physical address translation.
A TLB miss is expensive. You can reduce the TLB miss rate by:
- Changing the page size at boot time
- Using big and huge pages
- Carefully planning how your data fits in the pages
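The effect of planning how data fits in pages can be illustrated by counting the distinct pages a traversal touches – a rough proxy for TLB pressure. The element size and access patterns below are illustrative assumptions:

```python
PAGE_SIZE = 4096

def pages_touched(addresses):
    """Count the distinct pages a sequence of byte addresses touches."""
    return len({addr // PAGE_SIZE for addr in addresses})

N = 1024  # number of 8-byte elements accessed

# Sequential traversal of 8-byte elements: many accesses per page
sequential = [i * 8 for i in range(N)]

# Strided traversal, one element per 4-KB page: every access is a new page
strided = [i * PAGE_SIZE for i in range(N)]

print(pages_touched(sequential))  # 2   (8 KB of data spans 2 pages)
print(pages_touched(strided))     # 1024
```

Both traversals perform the same number of accesses, but the strided one touches 512 times as many pages – far more than the TLBs can map at once.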
Once the physical address has been determined, a read or write request is sent to the L1 cache. The L1 cache will either perform the access or propagate it deeper into the memory subsystem.
For more information, see:
- Translation Caching: Skip, Don’t Walk (the Page Table)
- Intel® 64 and IA-32 Architectures Optimization Reference Manual
Cores Do Other Things Until Access Is Satisfied
If nothing useful can be done, the core stalls. Unfortunately, the OS is almost unaware of the stall: the application appears to be running, and it is hard to tell if the application is slower than it should be. You need tools to examine hardware performance counters to see stall details.
Physical Addresses Cascade Through the Caches
Each access propagating through the memory subsystem combines a specific request, the physical address it targets, and, for writes, the data.
Data moves around most of the memory subsystem in 64-byte quantities called cache lines. A cache entry – a small block of transistors that can store a physical address and a cache line – is filled when a cache line is copied into it. Pages are evenly divided into cache lines: the first 64 bytes of a 4096-byte page form one cache line, stored together in a cache entry; the next 64 bytes form the next cache line, and so on.
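Because lines are 64 bytes, the low six bits of an address select a byte within a line, and the bits above them select the line. A minimal sketch of the arithmetic, using the 64-byte line and 4096-byte page sizes from the text:

```python
LINE_SIZE = 64      # bytes per cache line
LINE_BITS = 6       # log2(64)
PAGE_SIZE = 4096

def cache_line_of(addr):
    """Return (cache line number within page, byte offset within line)."""
    line = (addr % PAGE_SIZE) >> LINE_BITS
    offset = addr & (LINE_SIZE - 1)
    return line, offset

# Byte 70 of a page sits in the second cache line (line 1), at offset 6
print(cache_line_of(70))        # (1, 6)
# A 4096-byte page holds 4096 // 64 = 64 cache lines
print(PAGE_SIZE // LINE_SIZE)   # 64
```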
Each cache line may:
- Not be cached
- Occupy an entry in one cache
- Be duplicated in several caches
Cores, I/O devices, and other devices send requests to caches to either read or write a cache entry for a physical address. The lowest six bits of the physical address are not sent – they are used by the core to select the bytes within the cache line. The core sends separate requests for each cache line it needs.
- Reads – If a cache has the requested physical address in a cache entry, the cache returns the data. If not, the cache requests the data from deeper in the memory subsystem and evicts some cache entry to make room. If the evicted cache entry has been modified, it must be written to the deeper memory subsystem as part of this eviction. This means a stream of reads may slow down because an earlier set of writes must be pushed deeper into the memory subsystem. A small queue of written data buffers the communication from the sender to the receiver.
- Writes – If the cache does not have the cache line in a cache entry, the cache reads it from deeper in the memory subsystem, evicting some other physical address from a cache entry to make room. The read is necessary to obtain all 64 bytes, because the write probably changes only some of them. The first time a cache entry is written, the entries for this physical address in all other caches are invalidated. This makes the first write to a cache line more expensive than later writes.
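The fill, eviction, and write-back behavior described above can be illustrated with a toy cache model. This is a sketch only – it is direct-mapped and single-level, whereas real caches are set-associative and kept coherent across cores:

```python
class ToyCache:
    """A toy direct-mapped, write-back cache of whole 64-byte lines."""
    LINE = 64

    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        self.entries = {}        # index -> (line address, dirty flag)
        self.writebacks = []     # line addresses pushed deeper on eviction

    def _lookup(self, addr):
        line_addr = addr // self.LINE          # drop the low six bits
        index = line_addr % self.num_entries   # direct-mapped placement
        return line_addr, index

    def _fill(self, line_addr, index):
        # Evict whatever occupies the entry; write it back if modified.
        if index in self.entries:
            old_line, dirty = self.entries[index]
            if dirty:
                self.writebacks.append(old_line)
        self.entries[index] = (line_addr, False)

    def read(self, addr):
        line_addr, index = self._lookup(addr)
        hit = self.entries.get(index, (None, False))[0] == line_addr
        if not hit:
            self._fill(line_addr, index)          # read miss: fetch the line
        return hit

    def write(self, addr):
        line_addr, index = self._lookup(addr)
        hit = self.entries.get(index, (None, False))[0] == line_addr
        if not hit:
            self._fill(line_addr, index)          # write miss: read the line first
        self.entries[index] = (line_addr, True)   # mark the line dirty
        return hit

cache = ToyCache()
cache.write(0)      # miss: line 0 is fetched, then marked dirty
cache.read(0)       # hit: same line
cache.read(256)     # miss: line 4 maps to the same entry, evicting dirty line 0
print(cache.writebacks)   # [0] – the dirty line was pushed deeper
```

Note how the read of address 256 triggered a write-back: this is the effect described above, where a stream of reads slows down because earlier writes must be pushed deeper into the memory subsystem.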
Note: Memory mapping can result in two or more virtual addresses mapping to the same physical address. This often happens with shared libraries and demand-zero pages.
Caches Can Be Shared
The L1 cache – usually 32 KB of data and 32 KB of instructions – is private to the adjacent core, which is why it can supply data so quickly. All hyper-threads in a core share the L1 cache.
The next cache out, the L2 cache, is sometimes private and sometimes shared by two cores.
If they exist, the L3 and L4 caches are shared by all the cores, and perhaps by more than one processor.
If the Multi-Channel DRAM (MCDRAM) is used to cache the DIMMs, that cache is also shared.
Cache sharing is not necessarily a problem when the cores are accessing different data; however, if a core is trying to store more than its share of data in the shared caches, it can push out another core’s data, so that neither benefits. In some situations the faster-accessing core can dominate and use the entire cache for itself, causing a load imbalance.
Buses Connect L2 Cache to L3 Cache and Memory Controllers
The red ring road buses shown in the diagram below connect the L2 caches to portions of the L3 cache, as well as to the Intel® QuickPath Interconnect (Intel® QPI) links, the peripheral component interconnect express (PCIe) links, and the home agents for the memory controllers. The two ring roads are themselves connected by two short, escalator-like buses.
Traffic on the buses only goes as far as necessary, and different traffic can use different sections of the bus simultaneously. Obviously, the farther the traffic must travel, the higher the latency. Bandwidth must be shared when traffic uses the same sections of the bus, but it is usually not the limiting factor.
A memory access that starts at a core and misses the L1 and L2 caches goes along the buses to the home agent for the target memory controller for that physical address. If the destination memory controller is on a different socket in a multiprocessor system, the traffic goes over the Intel QPI link to a ring bus on the target processor, which adds more delays.
The Intel® Xeon Phi™ processor has a different layout. This processor generation, code named Knights Landing, uses a grid rather than the two interconnected rings, and does not have an L3 cache at all; otherwise the traffic follows the same basic patterns. These processors have home agents and memory controllers for two different types of dynamic random-access memory (DRAM): double data rate (DDR) DIMMs and MCDRAM.
Home Agents Recognize Their Subset of the Physical Addresses
There are multiple memory controllers near the bus, each controlling multiple channels. Each memory controller is connected to the bus by a home agent. The home agent for the memory controller recognizes the physical addresses for its channels.
Interleaving is a technique for spreading adjacent addresses within a page across multiple memory devices, so that hardware-level parallelism increases the available bandwidth from the devices – assuming some other access is not already using up the bandwidth. Virtual addresses within a page map to adjacent physical addresses, so without interleaving, consecutive accesses would be sent to the same memory controller and swamp it. Physical addresses can interleave across sockets, or just within a socket – the choice is firmware-selectable.
Before 3D XPoint™ DIMMs, interleaving was done every one or two cache lines (64 or 128 bytes), but the characteristics of non-volatile DIMM memory motivated a change to every four cache lines (256 bytes). So now four adjacent cache lines go to the same channel, and the next four cache lines go to the next channel.
Intel® microarchitecture code named Skylake supports 1-, 2-, 3-, 4-, and 6-way interleaving. If five channels are populated, the home agents are probably using a 2-way and a 3-way interleave. Supporting 3-way and 6-way interleaving implies the hardware can do mod-3 calculations on addresses – a non-trivial amount of work.
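Channel selection at 256-byte granularity can be sketched as a divide and a modulo. The 3-channel configuration below is an illustrative assumption; real home agents use more elaborate address hashing:

```python
GRANULE = 256   # four 64-byte cache lines per interleave group

def channel_of(phys_addr, num_channels):
    """Pick the memory channel for a physical address (toy interleave)."""
    return (phys_addr // GRANULE) % num_channels

# Four adjacent cache lines (addresses 0..255) all land on one channel...
print([channel_of(a, 3) for a in (0, 64, 128, 192)])   # [0, 0, 0, 0]
# ...and subsequent groups of four rotate through the channels.
print(channel_of(256, 3))   # 1
print(channel_of(512, 3))   # 2
print(channel_of(768, 3))   # 0  (wraps around)
```

The `% 3` here is the mod-3 calculation mentioned above; doing it in hardware on every address is what makes 3-way and 6-way support non-trivial.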
The Memory Channel Accesses the Memory Device
The home agent translates the physical address into a channel address and passes it to the memory controller. Each memory controller has a table to determine what to do with each range of channel addresses it is handling.
For example: On Intel Xeon Phi processors code named Knights Landing, this table is how a memory controller knows whether the address is in the range where the MCDRAM is caching a more distant DDR DIMM. If the address is for a DIMM, the memory controller translates the channel address into a (channel, dimm, rank, bank, row, column) tuple used in a conversation over a bus to the DIMM.
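The tuple translation amounts to slicing bit fields out of the channel address (the channel itself was already selected by the home agent). The field widths below are invented for illustration – real decodings vary by platform and memory configuration:

```python
# Illustrative field widths, low bits first; not those of any real controller.
FIELDS = [("column", 10), ("row", 15), ("bank", 3), ("rank", 1), ("dimm", 1)]

def decode_channel_address(chan_addr):
    """Split a channel address into a (dimm, rank, bank, row, column) dict."""
    out = {}
    for name, bits in FIELDS:              # peel fields off the low end first
        out[name] = chan_addr & ((1 << bits) - 1)
        chan_addr >>= bits
    return out

addr = (1 << 10) | 5                       # row 1, column 5 under these widths
print(decode_channel_address(addr))
# {'column': 5, 'row': 1, 'bank': 0, 'rank': 0, 'dimm': 0}
```

Which bits feed which field matters for performance: accesses that stay within an open row are much cheaper than accesses that force the DIMM to open a new one.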
Despite their complexity of operation, memory controllers are implemented in specialized logic circuits, not micro-programmed!
The previous article, Why Efficient Use of the Memory Subsystem is Critical to Performance, discusses the algorithms and analysis that are critical to getting the best possible performance out of an Intel Xeon Phi processor and its MCDRAM.
This article discusses how knowing the number of devices, the buses, and data sharing opportunities can help you understand:
- The origin of access latency and bandwidth variations
- Why planning the routes data takes through the network can make a huge difference in execution times
The next article, Detecting and Avoiding Bottlenecks, describes how to recognize and avoid congestion in this complex network connecting the cores to the main memory.
In addition, there is a series of articles starting with Performance Improvement Opportunities with NUMA Hardware that provides an introduction to all aspects of modern NUMA hardware and how it can be used within various application domains.
About the Author
Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. These articles would be much harder to read without the extensive editorial work of one of Intel’s writers – Nancee Moster. Bevin made the mistakes, Nancee recognized and called him on many of them – but we are sure some have leaked on through…