Optimize code to take full advantage of the processor's cache, whose size varies among target machines. This optimization is particularly important for high-performance applications.
Dynamically detect cache size at runtime with the cpuid instruction and adjust performance-critical code accordingly. Optimize data structures either to fit in one half of the first-level cache or in the second-level cache. Optimizing for one half of the first-level cache will bring the greatest performance benefit. If one half of the first-level cache is too small to be practical, optimize for the second-level cache.
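The sizing rule above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name choose_cache_budget and its parameters are hypothetical, and the cache sizes are assumed to have been obtained at runtime already (for example, through cpuid) rather than queried here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from any library): pick a working-set budget.
 *
 * l1_bytes, l2_bytes   - cache sizes, assumed to come from a runtime query
 *                        such as cpuid; this sketch does not query them.
 * min_practical_bytes  - smallest working set the algorithm can usefully
 *                        operate on (an application-specific assumption).
 *
 * Returns half of the first-level cache when that is large enough to be
 * practical, and the full second-level cache size otherwise. */
size_t choose_cache_budget(size_t l1_bytes, size_t l2_bytes,
                           size_t min_practical_bytes)
{
    assert(l1_bytes > 0 && l2_bytes >= l1_bytes);

    size_t half_l1 = l1_bytes / 2;
    if (half_l1 >= min_practical_bytes) {
        return half_l1;   /* preferred target: one half of L1 */
    }
    return l2_bytes;      /* fallback target: the second-level cache */
}
```

With a 32 KB first-level cache and an 8 KB minimum, the helper returns 16 KB. If the minimum were larger than half of the first-level cache, it would return the second-level cache size instead.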
Optimizing for a point in between (for example, for the entire first-level cache) will likely not bring a substantial improvement over optimizing for the second-level cache. Although current compilers often do a good job of enhancing locality, also consider using manual techniques to enhance locality, such as blocking, loop interchange and loop skewing, as described in the article, "Cache Blocking Technique on Hyper-Threading Technology Enabled Processors."
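As an illustration of blocking, the following minimal sketch transposes a square matrix one tile at a time so that each source tile and destination tile can stay cache-resident while it is being used. It is not taken from the cited article. The names transpose_blocked and tile_for_budget are hypothetical, and the rule that two tiles of doubles must fit in the byte budget is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest tile edge such that two tile x tile blocks
 * of doubles (one source tile, one destination tile) fit in budget_bytes.
 * Returns at least 1. */
size_t tile_for_budget(size_t budget_bytes)
{
    size_t tile = 1;
    while (2 * (tile + 1) * (tile + 1) * sizeof(double) <= budget_bytes) {
        tile++;
    }
    return tile;
}

/* Transpose an n x n row-major matrix, visiting it in tile x tile blocks.
 * src and dst must not overlap. */
void transpose_blocked(const double *src, double *dst, size_t n, size_t tile)
{
    assert(src != NULL && dst != NULL && tile > 0);

    for (size_t ii = 0; ii < n; ii += tile) {
        size_t i_end = (ii + tile < n) ? ii + tile : n;   /* clamp last block */
        for (size_t jj = 0; jj < n; jj += tile) {
            size_t j_end = (jj + tile < n) ? jj + tile : n;
            /* All accesses below stay inside one source tile and one
             * destination tile. */
            for (size_t i = ii; i < i_end; ++i) {
                for (size_t j = jj; j < j_end; ++j) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```

The tile edge would normally be derived from the chosen cache budget, for example tile_for_budget(16 * 1024) for half of a 32 KB first-level cache. The best value in practice depends on the machine and should be measured.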
Enable prefetch generation in your compiler by using the /QxW or /QaxW switch in the Intel® C++ Compiler or the /arch:SSE2 switch in the Microsoft .NET* 2003 C++ compiler. As the compiler's prefetch implementation improves, automatic prefetch insertion by the compiler may outperform manual insertion.
If you are using a compiler that does not support software prefetching, intrinsics or inline assembly may be used to manually insert prefetch instructions. Chapter 6 of the IA-32 Intel Architecture Optimization Reference Manual contains an example of using software prefetch to implement a memory-copy algorithm.
If a load is found to miss frequently, with a significant negative performance impact, first try moving the load up so that it executes earlier. If that change does not reduce the number of load misses, insert a prefetch before the load of the data. Be aware that manual prefetch is independent of the hardware prefetching capabilities in both the Pentium® M processor and the Intel® NetBurst™ microarchitecture. These mechanisms are separate, and hardware prefetch is not improved by manual prefetching; moreover, excessive manual prefetching can degrade performance.
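A minimal sketch of manual prefetch insertion follows. It is not the memory-copy example from the reference manual. The function sum_indirect, the macro PREFETCH_READ, and the distance parameter are hypothetical names introduced for illustration. The sketch assumes a GCC- or Clang-compatible compiler that provides __builtin_prefetch, and the macro expands to nothing elsewhere. On x86 compilers, the _mm_prefetch intrinsic from <xmmintrin.h> could play the same role.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC/Clang builtin. Arguments: address, 0 = prefetch for read,
 * 3 = keep in all cache levels. On other compilers this is a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ(addr) ((void)0)
#endif

/* Sum data[idx[i]] for i in [0, count). The load of data[idx[i]] follows an
 * irregular pattern, so a prefetch is issued `distance` iterations ahead of
 * the load that will need the data. */
long sum_indirect(const int *data, const size_t *idx, size_t count,
                  size_t distance)
{
    assert(data != NULL && idx != NULL);

    long total = 0;
    for (size_t i = 0; i < count; ++i) {
        /* Only prefetch while the look-ahead index is still in range. */
        if (distance < count - i) {
            PREFETCH_READ(&data[idx[i + distance]]);
        }
        total += data[idx[i]];
    }
    return total;
}
```

The prefetch distance is a tuning parameter. A value that is too small hides little latency, and prefetching too aggressively can degrade performance, so the value should be chosen by measurement on the target processor.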