March 10, 2009 10:18 AM PDT
Determine memory-access stall penalties due to simple cache misses. Whenever a load instruction attempts to access data from a data cache array that does not contain the desired data, it encounters a cache miss. All integer loads attempt to access the first-level data cache (L1D) first. All floating-point loads access the L2 first.
The load request is escalated up through the cache hierarchy (L1->L2->L3 for integers, L2->L3 for floats), to memory, and ultimately to disk until the desired data is found. The compiler schedules the use of the data assuming the fastest cache system satisfies the data request. As the data request is escalated through the memory hierarchy, the latency for the ultimate delivery of the data increases. Simultaneously, the cache-miss and cache-reference counters for each level of the hierarchy are incremented appropriately, leaving a trail for you to investigate.
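The counter trail described above can be turned into a per-level breakdown of where loads were satisfied. The following is a minimal sketch, not a description of the actual Itanium performance-monitoring events: the tuple layout, the function name, and the sample counts are all hypothetical, and it assumes that every miss at one level shows up as a reference at the next.

```python
def loads_satisfied_per_level(counters):
    """Return how many loads each level of the hierarchy satisfied.

    counters is a list of (level_name, references, misses) tuples ordered
    from the fastest cache to the slowest. The layout is an assumption made
    for this sketch, not a real counter interface.
    """
    served = {}
    for name, references, misses in counters:
        if misses > references:
            raise ValueError(f"{name}: more misses than references")
        # A reference that did not miss was satisfied at this level.
        served[name] = references - misses
    # Loads that missed the last cache level were escalated to memory.
    served["Memory"] = counters[-1][2]
    return served


# Hypothetical counter readings for an integer-load workload.
example_counters = [("L1", 1000, 100), ("L2", 100, 20), ("L3", 20, 5)]
served = loads_satisfied_per_level(example_counters)
```

In this made-up sample, the per-level counts add up to the number of L1 references, which is a quick consistency check on the counter readings.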
Cache misses are caused when either of the following conditions occurs:
• The cache line containing the desired data has been replaced by other data.
• The cache line has not yet been loaded with the required data.
Use the microbenchmark for simple access given here. The latency is forced on every iteration of the loop by trying to move the data from r28 to r29. The target cache for the test is forced simply by correctly setting the range within the accessed buffer. This is controlled by the contents of r34. The baseline can then be determined from the version on the right, which is identical except that there is no load executed.
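To illustrate how the buffer range selects the target cache, the sketch below maps a buffer size to the level expected to satisfy the loads. The cache capacities are placeholder values chosen for illustration and are not taken from this article; substitute the capacities of the processor under test. The function name is also hypothetical.

```python
# Placeholder capacities in bytes, fastest level first.
# These are assumptions for illustration, not values from the article.
ASSUMED_CAPACITY = [
    ("L1", 16 * 1024),
    ("L2", 256 * 1024),
    ("L3", 3 * 1024 * 1024),
]


def target_level(buffer_bytes, capacities=ASSUMED_CAPACITY):
    """Return the first level whose capacity can hold the whole buffer."""
    for name, size in capacities:
        if buffer_bytes <= size:
            return name
    # A buffer larger than every cache is served from main memory.
    return "Memory"
```

This is a simplification: in practice the range is usually chosen well inside one level and well beyond the previous one, so that replacement effects at the boundary do not blur the measurement.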
The result of executing such code is summarized in the following table. The latencies derived with this microbenchmark should not be interpreted as absolute results. They are used here to determine relations that can serve as guides in application analysis:
| Memory subsystem accessed | Latency (in cycles) |
| L1 | 1 |
| L2 | 5 |
| L3 | 13.3 |
| Memory | 209.6 |
These are the measured penalties for integer-data access. The L3 latency is slightly longer than the stated value in the Itanium® Processor Reference Manual for Software Development and Optimization. Observing the minimum L3 latency requires a more carefully constructed test that is not quite as general.
For the purposes of this item, we will use the measured number, as all other tests will be based on the one shown. The latency from main memory is also a function of the chipset. The latencies quoted in this document were measured on an Intel® platform using the 870 chipset. Note that one cycle is added to the result from the microbenchmark program to account for the cycle spent issuing the second stop bit in the first bundle.
If the code is modified to load floating-point data, as shown below, the penalties become slightly different due to the change in the path the data uses to arrive in the register file. Floating-point data is loaded directly from the L2 Cache.
The resulting latencies are shown in the following table. Note that, again, one cycle has to be added to the measured result for the second stop bit in the first bundle.
| Memory subsystem accessed | Latency (in cycles) |
| L2 | 6 |
| L3 | 13.1 |
| Memory | 209.5 |
With this information, it is possible to construct a model to account for the memory access stalls. For more information on creating such a model, see the manual, "Introduction to Microarchitectural Optimization for Itanium® Processors," Chapter 6.
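As a rough illustration of such a model (not the one developed in the manual), the sketch below estimates stall cycles from the measured latencies in the two tables. It assumes that the compiler scheduled each load for the fastest level, so the stall per load is the level's latency minus the fastest latency, and that misses do not overlap. The function name and the per-level load counts are hypothetical.

```python
# Measured latencies in cycles, taken from the tables in this article.
INT_LATENCY = {"L1": 1, "L2": 5, "L3": 13.3, "Memory": 209.6}
FP_LATENCY = {"L2": 6, "L3": 13.1, "Memory": 209.5}


def stall_cycles(served, latency):
    """Estimate total stall cycles for loads satisfied at each level.

    served maps a level name to the number of loads it satisfied.
    Assumption: the code was scheduled for the fastest level, so only
    the latency beyond that level counts as a stall.
    """
    fastest = min(latency.values())
    return sum(count * (latency[level] - fastest)
               for level, count in served.items())


# Hypothetical load counts per level, for illustration only.
int_served = {"L1": 900, "L2": 80, "L3": 15, "Memory": 5}
fp_served = {"L2": 90, "L3": 8, "Memory": 2}

int_stalls = stall_cycles(int_served, INT_LATENCY)
fp_stalls = stall_cycles(fp_served, FP_LATENCY)
```

Because the measured latencies are relative rather than absolute, the result is best used to compare the contribution of each level, not to predict exact cycle counts.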

