March 10, 2009 1:00 AM PDT
Resolve cache misses that cause a significant number of stall cycles. These occur when data is not in the desired cache and data retrieval requires access to a slower cache, memory, or disk.
Prefetch the data in advance so that it is available when needed, or restructure the code so that data use is more localized. This can mean any of the following:
- Raising the optimization level passed to the compiler.
- Inserting prefetch intrinsics into the source appropriately.
- Restructuring the algorithm to do more work on a given piece of data.
- Restructuring the data so that sequential accesses use sequential addresses (for example, addresses within the same cache line).
Cache misses occur because the data is either not being reused efficiently or is not laid out so that each loaded cache line is fully used.
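As an illustration of the prefetch-intrinsic option in the list above, here is a minimal sketch in C. It assumes a GCC- or Clang-compatible compiler, which provides `__builtin_prefetch`; other compilers expose their own prefetch intrinsics under different names. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current one to
   request. A suitable value depends on the memory latency and on how much
   work each iteration does. */
#define PREFETCH_DISTANCE 16

/* Sums n doubles, asking the hardware to start loading the element that
   will be needed PREFETCH_DISTANCE iterations from now. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for reading,
               third argument 3 = high temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

A simple sequential sum like this is often handled well by the compiler or hardware without help, so treat it only as a demonstration of where the intrinsic goes. Explicit prefetching tends to pay off more for access patterns the compiler cannot predict.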
A common pattern is for data to be organized as a linked list of structures that the program walks node by node. Probably the best solution is to allocate space for a block of structures at a time and to organize the data within each block as arrays. The advantage is that structure elements are stored consecutively, which makes very efficient use of the cache lines. Further, while the current block is being processed, the next block in the linked list can be prefetched so that its memory-access latency is hidden.
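The blocked layout described above could look roughly like the following sketch. The type `struct block`, its fields, the capacity `BLOCK_CAPACITY`, and the function `sum_values` are all hypothetical names for illustration, and `__builtin_prefetch` again assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_CAPACITY 64  /* hypothetical block size; tune for the target cache */

/* One list node holds a whole block of records, stored as parallel arrays
   instead of one small structure per node. */
struct block {
    size_t        count;                  /* number of valid entries */
    int           key[BLOCK_CAPACITY];
    double        value[BLOCK_CAPACITY];
    struct block *next;
};

/* Sums every value in the list. While the current block is processed,
   the start of the next block's value array is prefetched. */
double sum_values(const struct block *head)
{
    double sum = 0.0;
    for (const struct block *b = head; b != NULL; b = b->next) {
        assert(b->count <= BLOCK_CAPACITY);
        if (b->next != NULL) {
            /* Requests only the first cache line of the next array. */
            __builtin_prefetch(b->next->value, 0, 3);
        }
        for (size_t i = 0; i < b->count; i++) {
            sum += b->value[i];   /* consecutive addresses within the block */
        }
    }
    return sum;
}
```

Only one pointer is followed per block rather than one per record, and the inner loop reads consecutive addresses. A fuller version might issue several prefetches to cover more of the next block.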
Frequently, the algorithm uses very little of the data in a structure at any given time. Loading a cache line then brings little benefit, because only a small fraction of the data in that line is actually used.
From the perspective of using the cache efficiently, it is better to have a structure of arrays than a linked list of structures. If this is not practical, keep floating-point and integer data in separate parallel data structures. This helps use the cache lines in the L1 cache (which holds integer data only) and the L2 cache more effectively.
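To make the layout difference concrete, the sketch below contrasts the two organizations for a made-up particle record. All type, field, and function names are hypothetical. In the structure-of-arrays version, a loop that touches only `mass` reads consecutive addresses, and the integer `id` data sits apart from the floating-point fields.

```c
#include <assert.h>
#include <stddef.h>

#define N_PARTICLES 1024  /* hypothetical capacity */

/* Array of structures: each record mixes all fields, so summing only
   `mass` still pulls x, y, z and id through the cache. */
struct particle_aos {
    double x, y, z;
    double mass;
    int    id;
};

/* Structure of arrays: every field is stored contiguously. */
struct particles_soa {
    double x[N_PARTICLES];
    double y[N_PARTICLES];
    double z[N_PARTICLES];
    double mass[N_PARTICLES];
    int    id[N_PARTICLES];
};

/* Strided access: one useful double per record-sized step. */
double total_mass_aos(const struct particle_aos *p, size_t n)
{
    assert(p != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += p[i].mass;
    return total;
}

/* Unit-stride access: every byte of each loaded cache line is used. */
double total_mass_soa(const struct particles_soa *p, size_t n)
{
    assert(p != NULL && n <= N_PARTICLES);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += p->mass[i];
    return total;
}
```

Both functions compute the same result; they differ only in how many cache lines must be loaded to produce it.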
Alternatively, if the code reuses a small block of data, but only after walking through very large blocks (that are not used), the cache has to be reloaded each time the program returns to the small block. In such a case, it is better to restructure the algorithm, if possible, so that it completely finishes working with a piece of data before moving on. This is the usage model that cache lines are designed for.
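One simple instance of finishing with a piece of data before moving on is fusing two passes over an array into one. The sketch below is only an illustration with hypothetical function names: the two-pass version revisits each element after the whole array has been traversed, while the fused version does all the work on an element while it is still in cache.

```c
#include <assert.h>
#include <stddef.h>

/* Two passes: when the array is larger than the cache, the elements
   touched early in the first loop have been evicted by the time the
   second loop reads them again. */
double scale_then_sum_two_pass(double *a, size_t n, double factor)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* One pass: each element is scaled and accumulated while it is still
   in cache, so every cache line is loaded only once. */
double scale_then_sum_fused(double *a, size_t n, double factor)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        a[i] *= factor;
        sum += a[i];
    }
    return sum;
}
```

Both versions perform the same arithmetic in the same order, so they return the same value. The difference shows up only in memory traffic for large arrays.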
Another option is possible if the data resides in L3 but not in the faster cache levels that the hardware accesses first. In looping code, memory-access latency is absorbed through software pipelining and rotating registers. By unrolling the loops aggressively, you can increase the cycles per iteration of the loop and so increase the amount of latency that the pipelining absorbs.
If you can unroll far enough that the compiler's optimization automatically absorbs the L3 latency (i.e., 13 cycles), then these cache misses will not contribute to any pipeline stalls. Keep this option in mind while optimizing applications running on Itanium® Processor Family architectures.
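In practice the unrolling is usually requested through compiler options or hints, but a hand-unrolled loop shows the idea. The sketch below is an assumption-laden illustration rather than a recommended implementation: the function name `sum_unrolled4` and the unroll factor of 4 are arbitrary choices, and the factor needed to cover a given latency depends on the loop body and the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n doubles with the loop unrolled four times. Each iteration issues
   four independent loads and additions, giving the scheduler more work
   to overlap with outstanding memory accesses. */
double sum_unrolled4(const double *a, size_t n)
{
    assert(a != NULL || n == 0);

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main unrolled body: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Because the partial sums are combined in a different order than in a simple loop, floating-point rounding can differ slightly from the non-unrolled result.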
Introduction to Microarchitectural Optimization for Itanium® Processors

