by Ravi Hegde
Enabling/disabling hardware prefetch mechanisms on discrete applications can help system integrators and software developers obtain optimal performance for solutions running on Intel® Core™ Microarchitecture-based processors.
Continuous improvement in processor performance, particularly in terms of instruction-level parallelism, has made processors increasingly 'data hungry.' A high computation bandwidth has to be matched with a similarly powerful memory subsystem to avoid creating a bottleneck. In addition to the advances currently being made in memory-subsystem performance, hardware architects have provided several techniques to avoid frequent memory access that would otherwise slow down the processor. Prefetching is one such technique: it helps alleviate potential bottlenecks by fetching instructions and/or data from memory into the cache well before the processor needs them, thus reducing the load-to-use latency.
This paper compares the prefetch support available in processors based on the Intel Core Microarchitecture (including the Dual-Core Intel® Xeon® processor 5100 series), with the corresponding support in previous generations of processors based on Intel NetBurst® microarchitecture. It briefly covers prefetch behavior, discusses guidelines for enabling/disabling the prefetch mechanisms, and looks at the impact of enabling/disabling prefetch mechanisms on different applications.
System integrators and software developers will benefit from this paper by gaining the background to make use of hardware prefetch mechanisms in the Intel Core Microarchitecture.
Prefetching Support in Intel NetBurst® Microarchitecture
Processors based on the Intel NetBurst® microarchitecture expose two prefetch mechanisms through the BIOS: automatic hardware prefetch and Adjacent Cache Line Prefetch.
- Hardware prefetcher: The hardware prefetcher operates transparently, without programmer intervention, to fetch streams of data and instructions from memory into the unified second-level cache. The prefetcher is capable of handling multiple streams in either the forward or backward direction. It is triggered when successive cache misses occur in the last-level cache and a stride in the access pattern is detected, such as in the case of loop iterations that access array elements. The prefetching occurs up to a page boundary. This feature can be disabled through the BIOS.
- Adjacent Cache-Line Prefetch: The Adjacent Cache-Line Prefetch mechanism, like automatic hardware prefetch, operates without programmer intervention. When enabled through the BIOS, two 64-byte cache lines are fetched into a 128-byte sector, regardless of whether the additional cache line has been requested or not. In applications with relatively poor spatial locality, the cache miss ratio is higher. A cache miss on an Intel® Pentium® 4 processor-based system (with adjacent sector prefetch enabled) brings in 128 bytes, leading to higher bus utilization (assuming that the application did not need the other 64 bytes). When adjacent sector prefetch is disabled, an Intel® Pentium® 4 processor-based system fetches only 64 bytes. The other 64 bytes of the sector in the last-level cache are not used unless the application explicitly issues a load to that address. Disabling adjacent sector prefetch on Intel® Pentium® 4 processor-based systems can therefore reduce bus traffic.
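The two behaviors described above can be illustrated with a small software model. The sketch below is a toy, not the hardware's actual algorithm: the three-miss trigger, the `depth` parameter, the 4 KB page size, and the function names are all assumptions made for illustration. Only the 64-byte line, the 128-byte sector, and the page-boundary limit come from the text.

```python
LINE = 64      # cache-line size in bytes (from the text)
SECTOR = 128   # two adjacent lines form one sector (from the text)
PAGE = 4096    # assumed page size; the text only says "page boundary"


def adjacent_line(addr):
    """Base address of the other 64-byte line in the same 128-byte sector."""
    line_base = addr & ~(LINE - 1)
    return line_base ^ LINE


def stride_prefetch_candidates(miss_addrs, depth=2):
    """Toy stride detector (hypothetical heuristic).

    If the last three misses are equally spaced, propose up to `depth`
    further addresses in the same direction, stopping at the page boundary.
    """
    if len(miss_addrs) < 3:
        return []
    a, b, c = miss_addrs[-3:]
    stride = c - b
    if stride == 0 or b - a != stride:
        return []  # no constant stride detected
    page = c // PAGE
    candidates = []
    nxt = c + stride
    while len(candidates) < depth and nxt >= 0 and nxt // PAGE == page:
        candidates.append(nxt)
        nxt += stride
    return candidates


if __name__ == "__main__":
    print(hex(adjacent_line(0x1000)))                     # partner line in the sector
    print(stride_prefetch_candidates([0, 64, 128]))       # forward stream
    print(stride_prefetch_candidates([256, 192, 128]))    # backward stream
    print(stride_prefetch_candidates([3904, 3968, 4032])) # next line crosses the page
```

In this model a forward or backward stream yields prefetch candidates, an irregular pattern yields none, and a stream that reaches the end of its page stops producing candidates.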
Prefetching in the Intel® Core™ Microarchitecture
Processors based on the Intel Core Microarchitecture also expose two prefetch mechanisms through the BIOS. These are called the Data Prefetch Logic (DPL) and L2 Streaming Prefetch. This paper uses the terms "Data Prefetch Logic" and "Hardware Prefetch" interchangeably. It also uses the terms "L2 Streaming Prefetch" and "Adjacent Cache Line Prefetch" interchangeably.
- Data Prefetch Logic: The DPL prefetch mechanism is similar to the hardware prefetch mechanism provided by the Intel NetBurst® microarchitecture. This prefetch mechanism fetches streams of instructions and data from memory to the unified second-level cache upon detecting a stride.
- L2 Streaming Prefetch: The L2 Streaming Prefetcher, like the Intel NetBurst® microarchitecture's Adjacent Cache Line Prefetch, fetches the adjacent 64-byte cache line.
How to Turn on the Prefetchers
DPL Prefetch and L2 Streaming Prefetch can be enabled or disabled through the BIOS. Figure 1 shows a snapshot of a BIOS screen, and the circled region shows the configuration parameters used to control DPL and L2 Streaming Prefetch. The Hardware Prefetcher option (enabled by default) turns on DPL, and the Adjacent Cache Line Prefetch option (disabled by default) turns on the L2 Streamer.
Figure 1. Snapshot of a BIOS screen showing DPL Prefetch (Hardware Prefetcher) and L2 Streaming Prefetch (Adjacent Cache Line Prefetch)
DPL Prefetch and L2 Streaming Prefetch settings can also be changed programmatically by writing a device driver utility that changes the relevant bits in the IA32_MISC_ENABLE register (MSR 0x1A0). Such a utility makes it possible to enable or disable the prefetch mechanisms without any server downtime.
The table below shows the bits in the IA32_MISC_ENABLE MSR that have to be changed in order to control the DPL and L2 Streaming Prefetch:
|Prefetcher Type||MSR (0x1A0) Bit||Value|
|DPL (Hardware Prefetch)||Bit 9||0 = Enable, 1 = Disable|
|L2 Streamer (Adjacent Cache Line Prefetch)||Bit 19||0 = Enable, 1 = Disable|
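As a rough illustration of the bit manipulation such a utility would perform, the sketch below computes a new IA32_MISC_ENABLE value from a current one. It only does the arithmetic: reading and writing the MSR itself needs privileged access through a driver or an operating-system MSR interface, which is not shown. The function and constant names are hypothetical, and the bit positions are taken from the table above.

```python
IA32_MISC_ENABLE = 0x1A0        # MSR address (from the text)
DPL_DISABLE_BIT = 9             # 0 = DPL enabled, 1 = disabled (from the table)
L2_STREAMER_DISABLE_BIT = 19    # 0 = L2 Streamer enabled, 1 = disabled
MSR_MASK = (1 << 64) - 1        # MSRs are 64 bits wide


def with_prefetchers(msr_value, dpl_enabled, l2_streamer_enabled):
    """Return msr_value with only the two prefetch bits changed.

    Both bits are 'disable' bits, so enabling a prefetcher clears its bit
    and disabling it sets the bit. All other bits are preserved.
    """
    value = msr_value
    settings = (
        (DPL_DISABLE_BIT, dpl_enabled),
        (L2_STREAMER_DISABLE_BIT, l2_streamer_enabled),
    )
    for bit, enabled in settings:
        if enabled:
            value &= ~(1 << bit)
        else:
            value |= 1 << bit
    return value & MSR_MASK


def prefetcher_state(msr_value):
    """Decode the two prefetch bits into enabled/disabled flags."""
    return {
        "dpl_enabled": not (msr_value >> DPL_DISABLE_BIT) & 1,
        "l2_streamer_enabled": not (msr_value >> L2_STREAMER_DISABLE_BIT) & 1,
    }


if __name__ == "__main__":
    # Example value only; a real utility would read it from MSR 0x1A0.
    current = 0x0
    new = with_prefetchers(current, dpl_enabled=True, l2_streamer_enabled=False)
    print(hex(new), prefetcher_state(new))
```

A read-modify-write of this kind matters because IA32_MISC_ENABLE holds other unrelated settings that must be left untouched.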
When to Turn on the Prefetch Mechanisms
Prefetch mechanisms were designed to hide memory latency by fetching data from main memory into the cache based on heuristics, thus improving the computation bandwidth. One side effect, however, is an increase in bus utilization. A decision to enable or disable prefetch mechanisms should be made based on the nature of the application: memory-intensive applications with high bus utilization could see a performance degradation if prefetch is turned on. Another consideration is spatial locality, since prefetching can hamper the performance of applications that lack good spatial locality by causing cache pollution, which results in high levels of cache misses.
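The effect of spatial locality on bus traffic can be estimated with simple arithmetic. The sketch below is a deliberately simplified model, not a measurement: it assumes an infinitely large cache (no evictions), ignores the stride-based prefetcher, and counts only bytes brought in by adjacent cache line prefetch. The function name and the two access patterns are hypothetical.

```python
LINE = 64     # bytes fetched per miss without adjacent line prefetch
SECTOR = 128  # bytes fetched per miss with adjacent line prefetch


def bus_bytes(addresses, adjacent_line_prefetch):
    """Bytes transferred from memory for a set of accessed byte addresses.

    Simplified model: each distinct 64-byte line (or 128-byte sector, when
    adjacent line prefetch is on) is fetched exactly once.
    """
    unit = SECTOR if adjacent_line_prefetch else LINE
    fetched_units = {addr // unit for addr in addresses}
    return len(fetched_units) * unit


if __name__ == "__main__":
    # Good spatial locality: walk a 4 KB buffer 8 bytes at a time.
    sequential = range(0, 4096, 8)
    # Poor spatial locality: touch one address every 256 bytes.
    sparse = range(0, 4096 * 64, 256)

    for name, pattern in (("sequential", sequential), ("sparse", sparse)):
        off = bus_bytes(pattern, adjacent_line_prefetch=False)
        on = bus_bytes(pattern, adjacent_line_prefetch=True)
        print(f"{name}: {off} bytes without, {on} bytes with adjacent line prefetch")
```

Under these assumptions the sequential walk moves the same number of bytes either way, while the sparse pattern moves twice as many bytes with adjacent line prefetch enabled, none of the extra bytes being used.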
Servers based on the Intel® Core™ Microarchitecture offer advanced prefetch mechanisms that are designed to effectively hide memory latency and improve application performance. The default prefetch setting provides optimal performance for many workloads, so careful consideration should be given to memory-bandwidth utilization of applications before enabling or disabling these mechanisms.
- Intel® Core™ Microarchitecture: This Web site describes this processor architecture in depth, as well as describing features and benefits and showing how it provides a solid foundation for new server, desktop, and mobile platforms.
- Intel’s Core Microarchitecture Redefines Computing: This news release captures the strategic relevance of the architecture, as well as identifying early performance wins associated with processors based on Intel Core microarchitecture.
- Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance: This Technology@Intel Magazine article provides an in-depth overview of the Intel Core microarchitecture.
- Preparing Applications for Intel Core Microarchitecture: This Technology@Intel Magazine article suggests some techniques to either improve or maintain the performance of an existing application when running on systems with these new Intel processors.
- How to Choose between Hardware and Software Prefetch on 32-Bit Intel® Architecture: This Knowledge Base item contrasts the capabilities and suitability of hardware and software prefetching for various implementations.