March 10, 2009 1:00 AM PDT
Resolve memory access stalls in the EXE pipeline stage on 64-Bit Intel® Architecture. Memory access stalls occur when the data is not available in the caches as expected. The instructions that are dependent on this data being loaded and available will stall until the load has completed. There are two main causes for the data not being available:
- The data is not resident in the desired cache level (cache miss)
- The virtual-to-physical translation did not occur optimally (DTLB miss)
A DTLB miss will result in a cache miss after the translation has been accomplished. A level-1 DTLB hit is required for integer data to be retrieved from the L1 data cache, and a level-2 DTLB hit is required for floating-point data to be retrieved from the L2 cache.
Compilers schedule instructions assuming the following optimal latencies for data loads:
- Integer data is loaded from the L1 cache with a one-cycle latency
- Floating-point data is loaded from the L2 cache with a six-cycle latency
The compiler may schedule instructions to absorb more latency than the minimum but will usually be able to absorb at least the minimums stated above.
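The scheduling assumption above can be made concrete with a small arithmetic sketch. It models the stall seen by a dependent instruction as the load's actual latency minus the latency the compiler scheduled for, floored at zero. The function name `stall_cycles`, the 14-cycle delivery time, and the 9-cycle schedule are illustrative assumptions, not measured processor figures; only the one-cycle and six-cycle minimums come from the text.

```python
# Minimum load latencies the compiler schedules for, as stated in the text.
INT_L1_LATENCY = 1  # integer load that hits the L1 data cache
FP_L2_LATENCY = 6   # floating-point load that hits the L2 cache


def stall_cycles(actual_latency, scheduled_latency):
    """Cycles a dependent instruction waits beyond what the schedule absorbed."""
    return max(0, actual_latency - scheduled_latency)


if __name__ == "__main__":
    # The load arrives exactly when the schedule expects it: no stall.
    print(stall_cycles(INT_L1_LATENCY, INT_L1_LATENCY))
    # Hypothetical slower delivery (14 cycles is an assumed value).
    print(stall_cycles(14, FP_L2_LATENCY))
    # The compiler scheduled extra slack (assumed 9 cycles), so less is exposed.
    print(stall_cycles(14, 9))
```

In this simplified model, any latency the compiler schedules beyond the minimum directly reduces the exposed stall, which is why a miss does not always cost its full latency.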
Address cache misses and data-address conflicts separately. Either of these two conditions can cause the delivery of data to take more clock cycles than the optimum that the compiler uses for minimum latency scheduling:
- Cache misses occur when data is not in the desired cache and data retrieval requires access to a slower cache, memory, or even disk. This condition is covered in the separate item, Resolve Cache Misses on 64-Bit Intel® Architecture.
- Data address conflicts can occur when multiple data accesses are issued on a single cycle or in rapid succession. Address conflicts can cause the data deliveries to interfere with each other. Access paths for integer and floating-point (FP) data are different, so address conflicts for these data types differ. This condition is covered in the separate item, Resolve Address Conflicts on 64-Bit Intel® Architecture.
Always focus your efforts on reducing the largest contributions to the stall cycles, and do not work on issues that make little difference to the overall performance. Before addressing the issue of reducing memory-access stall cycles, first establish that they contribute in a significant way to the performance of your program.
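One way to apply this rule is to compare memory-access stall cycles with total cycles before doing any tuning. The sketch below is a minimal illustration only: the function names, the sample cycle counts, and the 10% threshold are all assumptions chosen for the example, and would be replaced by figures from your own profiler and your own judgment of what counts as significant.

```python
def stall_fraction(memory_stall_cycles, total_cycles):
    """Share of all cycles spent stalled on memory accesses."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return memory_stall_cycles / total_cycles


def worth_tuning(memory_stall_cycles, total_cycles, threshold=0.10):
    """True if memory stalls reach the (assumed) significance threshold."""
    return stall_fraction(memory_stall_cycles, total_cycles) >= threshold


if __name__ == "__main__":
    # Hypothetical counts, not measurements from any real program.
    total = 2_000_000
    stalls = 500_000
    print(f"memory stall share: {stall_fraction(stalls, total):.1%}")
    print("worth tuning:", worth_tuning(stalls, total))
```

If the share falls below the chosen threshold, the effort is probably better spent on whichever stall category dominates instead.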
Once the exact nature of the stall is identified, you can formulate an appropriate resolution.
Introduction to Microarchitectural Optimization for Itanium® Processors

