Optimize the use of prefetches in code for the Intel® Pentium® 4 processor and the Pentium® M processor. The performance of most applications can be considerably improved if the data they require can be fetched from the processor caches, rather than from main memory.
Standard techniques to bring data into the processor before it is needed involve additional programming, which can be difficult to implement and may require special steps to prevent performance degradation. The Streaming SIMD Extensions (SSE) instructions addressed these issues by providing various prefetch instructions. The Pentium® 4 processor and Pentium M processor extend prefetching support via automatic hardware data prefetching, a mechanism that brings data into the caches based on current access patterns and requires no programmer intervention.
Use the following general prefetch coding guidelines:
- Take advantage of the hardware prefetcher’s ability to prefetch data that is accessed in linear patterns, in either the forward or backward direction.
- Use a current-generation compiler, such as the Intel® C++ Compiler, that supports C++ language-level features for SSE. Such compilers provide intrinsics for the SSE and MMX™ technology instructions that allow you to optimize cache utilization; examples include _mm_prefetch, _mm_stream, _mm_load, and _mm_sfence (several of these appear in the sketches following this list). For more details on these intrinsics, refer to the Intel C++ Compiler User’s Guide.
- Facilitate compiler optimization:
  - Minimize use of global variables and pointers.
  - Minimize use of complex control flow.
  - Use the const modifier; avoid the register modifier.
  - Choose data types carefully and avoid type casting.
- Optimize the prefetch scheduling distance (see the scheduling-distance sketch following this list):
  - Far enough ahead to allow the interim computation to overlap the memory access time.
  - Near enough that the prefetched data is not evicted from the data cache before it is used.
- Use prefetch concatenation – arrange prefetches to avoid unnecessary prefetches at the end of an inner loop and to prefetch the first few iterations of the inner loop for the next outer-loop iteration (see the concatenation sketch following this list).
- Minimize the number of prefetches – prefetch instructions are not completely free in terms of bus cycles, machine cycles, and resources. Excessive usage of prefetches can adversely impact application performance.
- Interleave prefetch with computation instructions – for best performance, prefetch instructions must be interspersed with other computational instructions in the instruction sequence, rather than clustered together.
- Use cache-blocking techniques – improve the cache hit rate by using strip mining for one-dimensional arrays (see the separate item, How to Use Strip Mining to Optimize Memory Use on 32-Bit Intel® Architecture) or loop blocking for two-dimensional arrays (see the separate item, Loop Blocking to Optimize Memory Use on 32-Bit Intel® Architecture).
- Balance single-pass versus multi-pass execution (see the single-pass/multi-pass sketch following this list):
  - An algorithm can use single-pass or multi-pass execution, defined as follows: single-pass, or unlayered, execution passes a single data element through an entire computation pipeline; multi-pass, or layered, execution performs a single stage of the pipeline on a batch of data elements before passing the entire batch on to the next stage.
  - General guideline: if your algorithm is single-pass, use prefetchnta; if your algorithm is multi-pass, use prefetcht0.
- Resolve memory-bank conflict issues – minimize memory-bank conflicts by applying array grouping to place contiguously used data together, or by allocating data within 4 KB memory pages (see the array-grouping sketch following this list).
- Resolve cache-management issues – minimize the disturbance of temporal data held within the processor’s caches by using streaming-store instructions, as appropriate (see the streaming-store sketch following this list).
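
The sketches below illustrate several of the guidelines above. They are illustrative sketches rather than tuned implementations; the function names, prefetch distances, and data sizes are assumptions chosen for clarity. The first sketch shows prefetch scheduling distance and the interleaving of prefetch instructions with computation, assuming a 64-byte cache line and a distance of 8 cache lines that would need to be tuned for the target platform.

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define PF_DIST_LINES   8    /* illustrative scheduling distance, in cache lines */
#define FLOATS_PER_LINE 16   /* assuming a 64-byte cache line and 4-byte floats */

float sum_of_squares(const float *a, int n)
{
    float sum = 0.0f;
    int i;

    for (i = 0; i < n; i++) {
        /* Issue one prefetch per cache line, interleaved with the computation
           rather than clustered together; the distance is far enough ahead to
           overlap the memory access, and near enough that the line is still
           in the cache when it is finally used. */
        if ((i % FLOATS_PER_LINE) == 0 &&
            i + PF_DIST_LINES * FLOATS_PER_LINE < n)
            _mm_prefetch((const char *)&a[i + PF_DIST_LINES * FLOATS_PER_LINE],
                         _MM_HINT_T0);

        sum += a[i] * a[i];   /* interim computation overlaps the prefetch */
    }
    return sum;
}
```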
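The next sketch shows prefetch concatenation on a contiguous two-dimensional array; process() is a hypothetical per-element routine, and the array dimensions and prefetch stride are illustrative. Because the rows are adjacent in memory, the prefetch issued near the end of each inner loop falls on the start of the next row, so no end-of-row guard (and no wasted prefetch) is needed.

```c
#include <xmmintrin.h>

#define ROWS 100
#define COLS 32
#define PF_AHEAD 8            /* illustrative prefetch stride, in floats */

extern void process(float x); /* hypothetical per-element work */

void walk_rows(float a[ROWS][COLS])
{
    int ii, jj;

    for (ii = 0; ii < ROWS; ii++) {
        for (jj = 0; jj < COLS; jj++) {
            /* No end-of-row check on the prefetch address: the prefetch
               issued at jj == COLS - PF_AHEAD lies in row ii+1 and fetches
               its first few iterations.  Prefetch instructions never fault,
               so the final out-of-range address is harmless. */
            if ((jj % PF_AHEAD) == 0)
                _mm_prefetch((const char *)&a[ii][jj + PF_AHEAD], _MM_HINT_T0);

            process(a[ii][jj]);
        }
    }
}
```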
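The next sketch contrasts single-pass and multi-pass execution; stage1() and stage2() are hypothetical pipeline stages and the prefetch distance is illustrative. The single-pass version uses _MM_HINT_NTA (prefetchnta) because each element is touched only once; the multi-pass version uses _MM_HINT_T0 (prefetcht0) because the batch is revisited by the next stage. One prefetch per iteration is shown for brevity; in practice one per cache line is sufficient.

```c
#include <xmmintrin.h>

#define PF_DIST 128           /* illustrative prefetch distance, in floats */

extern float stage1(float x); /* hypothetical pipeline stages */
extern float stage2(float x);

/* Single-pass (unlayered): each element goes through the whole pipeline once
   and is never touched again, so prefetch with the non-temporal hint to
   limit cache pollution. */
void single_pass(const float *in, float *out, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&in[i + PF_DIST], _MM_HINT_NTA);
        out[i] = stage2(stage1(in[i]));
    }
}

/* Multi-pass (layered): the whole batch is revisited by the next stage, so
   prefetch with prefetcht0 to keep it in the cache hierarchy. */
void multi_pass(float *buf, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&buf[i + PF_DIST], _MM_HINT_T0);
        buf[i] = stage1(buf[i]);
    }
    for (i = 0; i < n; i++)
        buf[i] = stage2(buf[i]);  /* second pass reuses the cached batch */
}
```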
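The next sketch shows array grouping; the field names and array sizes are illustrative. Three arrays that are always accessed together are merged into a single array of structures, so each iteration touches one contiguous region of memory rather than three separate streams that can conflict in the memory banks.

```c
#define N 1024

/* Before: three separate streams that are always touched together; they can
   land on conflicting memory banks and on different pages. */
float pos_x[N], pos_y[N], pos_z[N];

/* After array grouping: data that is used together is contiguous, so each
   iteration accesses a single region of memory. */
struct point { float x, y, z; };
struct point pos[N];

float sum_of_squared_lengths(int n)
{
    float sum = 0.0f;
    int i;

    for (i = 0; i < n; i++)
        sum += pos[i].x * pos[i].x
             + pos[i].y * pos[i].y
             + pos[i].z * pos[i].z;
    return sum;
}
```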
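The final sketch shows streaming (non-temporal) stores; the function name and the scaling operation are illustrative. Because the destination buffer will not be read again soon, _mm_stream_ps writes it without evicting temporal data already held in the caches, and _mm_sfence orders the weakly-ordered streaming stores before the buffer is consumed elsewhere. It assumes dst is 16-byte aligned and n is a multiple of 4.

```c
#include <xmmintrin.h>

/* Scale a large buffer whose destination will not be read again soon.
   _mm_stream_ps writes around the caches, so temporal data kept by the rest
   of the application is not disturbed. */
void scale_to_stream(float *dst, const float *src, int n, float k)
{
    __m128 scale = _mm_set1_ps(k);
    int i;

    for (i = 0; i < n; i += 4) {
        __m128 v = _mm_mul_ps(_mm_loadu_ps(&src[i]), scale);
        _mm_stream_ps(&dst[i], v);  /* non-temporal store */
    }
    _mm_sfence();                   /* order the streaming stores */
}
```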