Optimizing Application Performance on Intel® Core™ Microarchitecture Using Hardware-Implemented Prefetchers

by Ravi Hegde

Introduction

Enabling or disabling hardware prefetch mechanisms for individual applications can help system integrators and software developers obtain optimal performance for solutions running on Intel® Core™ Microarchitecture-based processors.

Continuous improvement in processor performance, particularly in terms of instruction-level parallelism, has made processors increasingly 'data hungry.' High computation bandwidth has to be matched with a similarly capable memory subsystem to avoid creating a bottleneck. In addition to ongoing advances in memory-subsystem performance, hardware architects have provided several techniques to avoid frequent memory accesses that would otherwise slow down the processor. Prefetching is one such technique: it helps alleviate potential bottlenecks by fetching instructions and/or data from memory into the cache well before the processor needs them, thus reducing the effective load-to-use latency.

This paper compares the prefetch support available in processors based on the Intel Core Microarchitecture (including the Dual-Core Intel® Xeon® processor 5100 series) with the corresponding support in previous generations of processors based on the Intel NetBurst® microarchitecture. It briefly covers prefetch behavior, discusses guidelines for enabling/disabling the prefetch mechanisms, and looks at the impact of these settings on different applications.

System integrators and software developers will benefit from this paper by gaining the background to make use of hardware prefetch mechanisms in the Intel Core Microarchitecture.


Prefetching Support in Intel NetBurst® Microarchitecture

Processors based on the Intel NetBurst® microarchitecture expose two prefetch mechanisms through the BIOS: automatic hardware prefetch and Adjacent Cache Line Prefetch.

  • Hardware prefetcher: The hardware prefetcher operates transparently, without programmer intervention, to fetch streams of data and instructions from memory into the unified second-level cache. The prefetcher can handle multiple streams in either the forward or backward direction. It is triggered when successive cache misses occur in the last-level cache and a stride in the access pattern is detected, such as loop iterations that access array elements (a short sketch of such a pattern follows this list). Prefetching continues only up to a page boundary. This feature can be disabled through the BIOS.
  • Adjacent Cache-Line Prefetch: The Adjacent Cache-Line Prefetch mechanism, like automatic hardware prefetch, operates without programmer intervention. When it is enabled through the BIOS, two 64-byte cache lines are fetched into a 128-byte sector, regardless of whether the additional cache line has been requested. Applications with relatively poor spatial locality have a higher cache miss ratio, and a cache miss on an Intel® Pentium® 4 processor-based system with adjacent sector prefetch enabled brings in 128 bytes, leading to higher bus utilization when the application does not need the other 64 bytes. With adjacent sector prefetch disabled, an Intel® Pentium® 4 processor-based system fetches only 64 bytes; the other 64 bytes of the sector are not brought into the last-level cache unless the application explicitly issues a load to that address. Disabling adjacent sector prefetch on Intel® Pentium® 4 processor-based systems can therefore reduce bus traffic.
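
The kind of access pattern these prefetchers target can be made concrete with a short sketch. The following C fragment is illustrative only (the array size and the stride values are arbitrary choices, not taken from this article): a constant-stride loop over a large array is exactly the pattern the hardware prefetcher detects after a few consecutive last-level-cache misses, while the stride-16 variant touches only one element per 128-byte sector, so an adjacent-line prefetch brings in data the loop never reads.

    #include <stddef.h>
    #include <stdlib.h>

    #define N (1 << 20)   /* 1M doubles (8 MB), larger than a typical L2 cache */

    /* Constant-stride traversal: after several consecutive last-level-cache
     * misses with a detectable stride, the hardware prefetcher starts
     * fetching lines ahead of the loop, stopping at each page boundary. */
    static double sum_strided(const double *a, size_t n, size_t stride)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i += stride)
            sum += a[i];
        return sum;
    }

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        if (!a)
            return 1;
        for (size_t i = 0; i < N; i++)
            a[i] = (double)i;

        /* Stride 1: sequential and prefetch-friendly; every 64-byte line
         * (8 doubles) is fully used, so the adjacent 64-byte line fetched
         * by Adjacent Cache Line Prefetch is useful as well. */
        volatile double s1 = sum_strided(a, N, 1);

        /* Stride 16: only one double per 128-byte sector is read, so the
         * extra line brought in by adjacent-line prefetch is wasted bus
         * bandwidth for this loop. */
        volatile double s2 = sum_strided(a, N, 16);

        (void)s1;
        (void)s2;
        free(a);
        return 0;
    }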

 


Prefetching in the Intel® Core™ Microarchitecture

Processors based on the Intel Core Microarchitecture also expose two prefetch mechanisms through the BIOS. These are called the Data Prefetch Logic (DPL) and L2 Streaming Prefetch. This paper uses the terms "Data Prefetch Logic" and "Hardware Prefetch" interchangeably. It also uses the terms "L2 Streaming Prefetch" and "Adjacent Cache Line Prefetch" interchangeably.

  • Data Prefetch Logic: The DPL prefetch mechanism is similar to the hardware prefetch mechanism provided by the Intel NetBurst® microarchitecture. This prefetch mechanism fetches streams of instructions and data from memory to the unified second-level cache upon detecting a stride.
  • L2 Streaming Prefetch: The L2 Streaming Prefetcher, like the Intel NetBurst® microarchitecture's Adjacent Cache Line Prefetch, fetches the adjacent 64-byte cache line.

 


How to Turn on the Prefetchers

DPL Prefetch and L2 Streaming Prefetch can be enabled or disabled through the BIOS. Figure 1 shows a snapshot of a BIOS screen; the circled region highlights the configuration parameters used to control DPL and L2 Streaming Prefetch. Developers can enable the Hardware Prefetcher option (enabled by default) to turn on DPL, and the Adjacent Cache Line Prefetch option (disabled by default) to turn on the L2 Streamer.



Figure 1. Snapshot of a BIOS screen showing DPL Prefetch (Hardware Prefetcher) and L2 Streaming Prefetch (Adjacent Cache Line Prefetch)

DPL Prefetch and L2 Streaming Prefetch settings can also be changed programmatically by writing a device driver utility that changes the bits in the IA32_MISC_ENABLE register (MSR 0x1A0). Such a utility makes it possible to enable or disable the prefetch mechanisms without requiring any server downtime.

The table below shows the bits in the IA32_MISC_ENABLE MSR that have to be changed in order to control the DPL and L2 Streaming Prefetch:

Prefetcher Type                              MSR (0x1A0) Bit   Value
DPL (Hardware Prefetch)                      Bit 9             0 = Enable, 1 = Disable
L2 Streamer (Adjacent Cache Line Prefetch)   Bit 19            0 = Enable, 1 = Disable
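
As a hedged illustration of such a utility, the sketch below uses the Linux msr driver (the /dev/cpu/<n>/msr character device, available once the msr kernel module is loaded) rather than a custom device driver; a Windows implementation would need an equivalent kernel-mode driver. The bit positions match the table above. This is a minimal sketch, must run as root, and operates on one logical processor at a time, so the read-modify-write would have to be repeated for every logical processor whose setting should change.

    /* prefetch_toggle.c - minimal sketch for reading and flipping the DPL
     * (bit 9) and L2 Streamer / Adjacent Cache Line Prefetch (bit 19)
     * controls in IA32_MISC_ENABLE (MSR 0x1A0) through the Linux msr
     * driver.  Illustrative only: requires root and "modprobe msr". */
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define IA32_MISC_ENABLE   0x1A0
    #define DPL_DISABLE_BIT    (1ULL << 9)   /* 1 = hardware (DPL) prefetch off  */
    #define ACL_DISABLE_BIT    (1ULL << 19)  /* 1 = adjacent line prefetch off   */

    int main(int argc, char **argv)
    {
        const char *dev = (argc > 1) ? argv[1] : "/dev/cpu/0/msr";
        int fd = open(dev, O_RDWR);
        if (fd < 0) {
            perror("open msr device");
            return 1;
        }

        uint64_t val;
        if (pread(fd, &val, sizeof val, IA32_MISC_ENABLE) != sizeof val) {
            perror("read IA32_MISC_ENABLE");
            return 1;
        }
        printf("IA32_MISC_ENABLE = 0x%" PRIx64 "\n", val);
        printf("DPL prefetch:           %s\n", (val & DPL_DISABLE_BIT) ? "disabled" : "enabled");
        printf("Adjacent line prefetch: %s\n", (val & ACL_DISABLE_BIT) ? "disabled" : "enabled");

        /* Example policy: enable both prefetchers by clearing the disable bits. */
        uint64_t newval = val & ~(DPL_DISABLE_BIT | ACL_DISABLE_BIT);
        if (newval != val &&
            pwrite(fd, &newval, sizeof newval, IA32_MISC_ENABLE) != sizeof newval) {
            perror("write IA32_MISC_ENABLE");
            return 1;
        }

        close(fd);
        return 0;
    }

The same bits can also be inspected and changed from a shell with the rdmsr and wrmsr utilities from the msr-tools package.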

 

When to Turn on the Prefetch Mechanisms

Prefetch mechanisms are designed to hide memory latency by fetching data from main memory into the cache based on heuristics, thus improving computation bandwidth. One side effect, however, is an increase in bus utilization. A decision to enable or disable prefetch mechanisms should be based on the nature of the application: memory-intensive applications with high bus utilization could see performance degradation if prefetch is turned on. Another consideration is spatial locality, since prefetching can hamper the performance of applications that do not have good spatial locality by causing cache pollution, which results in higher cache miss rates (the sketch below illustrates such an access pattern).
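
To make the spatial-locality point concrete, the sketch below (illustrative only; the node layout, list length, and shuffling scheme are arbitrary choices) builds a randomly ordered linked list and walks it. Each load depends on the previous one and lands on an unrelated cache line, so there is no stride for DPL to lock onto, and adjacent-line prefetch mostly drags in 64-byte lines the walk never touches. Timing such a traversal with the prefetchers enabled and then disabled, for example with the MSR utility sketched earlier, is one practical way to decide which setting suits a given workload.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 100000

    struct node {
        struct node *next;
        long payload;
        char pad[48];            /* pad each node to a full 64-byte cache line */
    };

    /* Pointer chasing: successive loads hit unrelated cache lines, so the
     * stride-based DPL prefetcher cannot help, and the adjacent 64-byte line
     * fetched by L2 Streaming Prefetch is usually never read. */
    static long chase(const struct node *head)
    {
        long sum = 0;
        for (const struct node *p = head; p; p = p->next)
            sum += p->payload;
        return sum;
    }

    int main(void)
    {
        struct node *pool = malloc(N * sizeof *pool);
        size_t *order = malloc(N * sizeof *order);
        if (!pool || !order)
            return 1;

        /* Fisher-Yates shuffle of the node order so that successive nodes
         * in the list sit on scattered, unrelated cache lines. */
        for (size_t i = 0; i < N; i++)
            order[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i];
            order[i] = order[j];
            order[j] = t;
        }

        for (size_t i = 0; i + 1 < N; i++) {
            pool[order[i]].next = &pool[order[i + 1]];
            pool[order[i]].payload = (long)i;
        }
        pool[order[N - 1]].next = NULL;
        pool[order[N - 1]].payload = N - 1;

        printf("sum = %ld\n", chase(&pool[order[0]]));

        free(order);
        free(pool);
        return 0;
    }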


Conclusion

Servers based on the Intel® Core™ Microarchitecture offer advanced prefetch mechanisms that are designed to effectively hide memory latency and improve application performance. The default prefetch settings provide optimal performance for many workloads, but careful consideration should be given to the memory-bandwidth utilization of an application before enabling or disabling these mechanisms.



