by Ravi Hegde
Enabling or disabling hardware prefetch mechanisms for specific applications can help system integrators and software developers obtain optimal performance for solutions running on Intel® Core™ Microarchitecture-based processors.
Continuous improvement in processor performance, particularly in instruction-level parallelism, has made processors increasingly 'data hungry.' High computation bandwidth has to be matched by a similarly capable memory subsystem to avoid creating a bottleneck. In addition to ongoing advances in memory-subsystem performance, hardware architects have provided several techniques that reduce how often the processor must wait on main memory. Prefetching is one such technique: it fetches instructions and/or data from memory into the cache well before the processor needs them, which reduces the effective load-to-use latency.
This paper compares the prefetch support available in processors based on the Intel Core Microarchitecture (including the Dual-Core Intel® Xeon® processor 5100 series), with the corresponding support in previous generations of processors based on Intel NetBurst® microarchitecture. It briefly covers prefetch behavior, discusses guidelines for enabling/disabling the prefetch mechanisms, and looks at the impact of enabling/disabling prefetch mechanisms on different applications.
System integrators and software developers will benefit from this paper by gaining the background to make use of hardware prefetch mechanisms in the Intel Core Microarchitecture.
Processors based on the Intel NetBurst® microarchitecture expose two prefetch mechanisms through the BIOS: automatic hardware prefetch and Adjacent Cache Line Prefetch.
Processors based on the Intel Core Microarchitecture also expose two prefetch mechanisms through the BIOS. These are called the Data Prefetch Logic (DPL) and L2 Streaming Prefetch. This paper uses the terms "Data Prefetch Logic" and "Hardware Prefetch" interchangeably. It also uses the terms "L2 Streaming Prefetch" and "Adjacent Cache Line Prefetch" interchangeably.
DPL Prefetch and L2 Streaming Prefetch can be enabled or disabled through the BIOS. Figure 1 shows a snapshot of a BIOS screen; the circled region shows the configuration parameters that control DPL and L2 Streaming Prefetch. The Hardware Prefetcher option (enabled by default) turns on DPL, and the Adjacent Cache Line Prefetch option (disabled by default) turns on the L2 Streamer.
Figure 1. Snapshot of a BIOS screen showing DPL Prefetch (Hardware Prefetcher) and L2 Streaming Prefetch (Adjacent Cache Line Prefetch)
DPL Prefetch and L2 Streaming Prefetch settings can also be changed programmatically, by writing a device driver utility that changes the corresponding bits in the IA32_MISC_ENABLE register (MSR 0x1A0). Such a utility can enable or disable the prefetch mechanisms without a reboot into the BIOS setup, and therefore without any server downtime.
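As a minimal sketch of the low-level part of such a utility, assuming a Linux system with the `msr` kernel module loaded (`modprobe msr`) and root privileges, the C helpers below read and write a 64-bit MSR through the per-CPU device file `/dev/cpu/<n>/msr`, where the file offset selects the MSR address. The helper names (`msr_device_path`, `msr_read`, `msr_write`) are hypothetical, and other operating systems need a different kernel-mode mechanism, which is why the text above speaks of a device driver.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Address of IA32_MISC_ENABLE, as given in the text. */
#define IA32_MISC_ENABLE 0x1A0u

/* Hypothetical helper: build the Linux msr device path for one logical CPU,
 * e.g. "/dev/cpu/0/msr". Returns 0 on success, -1 if the buffer is too small. */
int msr_device_path(char *buf, size_t size, unsigned cpu)
{
    int n = snprintf(buf, size, "/dev/cpu/%u/msr", cpu);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* Hypothetical helper: read one 64-bit MSR. With the Linux msr driver,
 * the file offset passed to pread() is the MSR address. */
int msr_read(const char *dev_path, uint32_t reg, uint64_t *value)
{
    assert(value != NULL);
    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}

/* Hypothetical helper: write one 64-bit MSR (requires root on Linux). */
int msr_write(const char *dev_path, uint32_t reg, uint64_t value)
{
    int fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof value ? 0 : -1;
}
```

Because the Linux driver exposes MSRs per logical processor, a complete utility would typically read IA32_MISC_ENABLE, change only the prefetch bits, and write the value back on every CPU, for example by iterating `msr_device_path` over all CPU numbers.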
The table below shows the bits in the IA32_MISC_ENABLE MSR that have to be changed in order to control the DPL and L2 Streaming Prefetch:
|Prefetcher Type|MSR (0x1A0) Bit|Value|
|---|---|---|
|DPL (Hardware Prefetch)|Bit 9|0 = Enable, 1 = Disable|
|L2 Streamer (Adjacent Cache Line Prefetch)|Bit 19|0 = Enable, 1 = Disable|
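The bit logic in the table can be captured in two small helper functions. This is a sketch with hypothetical names: it only computes a new IA32_MISC_ENABLE value and does not touch hardware. Every other bit is preserved, so the result is suitable for writing back as part of a read-modify-write sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in IA32_MISC_ENABLE (MSR 0x1A0), taken from the table.
 * A bit value of 1 DISABLES the corresponding prefetcher. */
#define MISC_ENABLE_DPL_DISABLE_BIT         9u  /* DPL (Hardware Prefetch) */
#define MISC_ENABLE_L2_STREAMER_DISABLE_BIT 19u /* L2 Streamer (Adjacent Cache Line Prefetch) */

/* Hypothetical helper: true if the prefetcher controlled by `bit`
 * is enabled, i.e. its disable bit is 0. */
bool prefetch_is_enabled(uint64_t misc_enable, unsigned bit)
{
    assert(bit < 64);
    return ((misc_enable >> bit) & 1u) == 0;
}

/* Hypothetical helper: return a copy of `misc_enable` with the prefetcher
 * enabled (bit cleared) or disabled (bit set); all other bits unchanged. */
uint64_t prefetch_set(uint64_t misc_enable, unsigned bit, bool enable)
{
    assert(bit < 64);
    uint64_t mask = UINT64_C(1) << bit;
    return enable ? (misc_enable & ~mask) : (misc_enable | mask);
}
```

For example, `prefetch_set(value, MISC_ENABLE_DPL_DISABLE_BIT, false)` sets bit 9, which disables DPL, and passing `true` clears the bit again to re-enable it.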
When to Turn on the Prefetch Mechanisms
Prefetch mechanisms are designed to hide memory latency by using heuristics to fetch data from main memory into the cache before it is requested, which keeps the execution units supplied with data. One side effect, however, is higher bus utilization. The decision to enable or disable the prefetch mechanisms should therefore be based on the nature of the application: memory-intensive applications that already have high bus utilization could see a performance degradation if prefetch is turned on. Another consideration is spatial locality. In applications without good spatial locality, prefetched lines are often never used and can evict useful data (cache pollution), which raises the cache-miss rate and hurts performance.
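The spatial-locality point can be made concrete with a small microbenchmark sketch. This is an illustration of our own, not one of the measurements behind this paper, and all function names are hypothetical. It sums the same array once in sequential order, a pattern hardware prefetchers are built to detect, and once in a shuffled order, which defeats them. Running it with the prefetchers enabled and then disabled (through the BIOS or MSR 0x1A0) gives a rough, system-specific view of how much a given access pattern benefits; no timing results are claimed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Sum `data` by visiting indices in the order given by `order`.
 * The visit order determines the spatial locality of the loop. */
uint64_t sum_in_order(const uint32_t *data, const size_t *order, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[order[i]];
    return sum;
}

/* Fill `order` with 0..n-1: sequential, prefetch-friendly access. */
void make_sequential(size_t *order, size_t n)
{
    for (size_t i = 0; i < n; i++)
        order[i] = i;
}

/* Fisher-Yates shuffle of 0..n-1: poor spatial locality.
 * rand() is adequate for a rough illustration, not for statistics. */
void make_shuffled(size_t *order, size_t n, unsigned seed)
{
    make_sequential(order, n);
    if (n < 2)
        return;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}

/* Monotonic wall-clock time in seconds. */
double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Time one traversal; stores the checksum in *sum and returns seconds. */
double time_traversal(const uint32_t *data, const size_t *order,
                      size_t n, uint64_t *sum)
{
    double t0 = seconds_now();
    *sum = sum_in_order(data, order, n);
    return seconds_now() - t0;
}
```

A driver would fill an array much larger than the L2 cache, build both orders, and call `time_traversal` on each. Both traversals touch exactly the same elements, so any difference in elapsed time comes from the access pattern rather than the amount of work.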
Servers based on the Intel® Core™ Microarchitecture offer advanced prefetch mechanisms that are designed to effectively hide memory latency and improve application performance. The default prefetch setting provides optimal performance for many workloads, so the memory-bandwidth utilization of an application should be considered carefully before enabling or disabling these mechanisms.