March 10, 2009 12:00 AM PDT
Resolve address conflicts that cause a significant number of stall cycles. Cache misses occur when data is not in the desired cache and data retrieval requires access to a slower cache, memory, or even disk.
Address conflicts are more common than most people realize and can be as costly as cache misses. Address conflicts are caused by technical details of the cache and cache access hardware; they are therefore more difficult to understand, though sometimes much easier to avoid.
Use the event-based-sampling feature of the Intel® VTune™ Performance Analyzer to determine if and where address conflicts are a problem, then use a debugger to identify the variables that have the address conflicts.
Address conflicts occur because of features in the access mechanisms used by one or more of the caches. In the Itanium® processor, there are two main types of address conflicts: bank conflicts and secondary misses to an L2 cache line.
The L2 cache is made up of 16 banks, each 16 bytes wide. If two loads (most commonly floating-point loads) try to access the same bank on the same cycle, the L2 OzQ forces one of the accesses to be re-issued. For a floating-point load, for example, this raises the latency of the second access from six cycles (when the two accesses go to different banks) to 12 cycles.
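As a rough illustration of the bank arithmetic, the following C sketch maps an address to a bank index. It assumes that banks are interleaved on consecutive 16-byte chunks, so that address bits 4 through 7 select the bank; the text implies this layout but does not state it. The names `l2_bank_of` and `l2_same_bank` are hypothetical helpers, not part of any Intel tool or library.

```c
#include <stdbool.h>
#include <stdint.h>

/* From the text: 16 banks, each 16 bytes wide. */
enum { L2_BANK_WIDTH = 16, L2_BANK_COUNT = 16 };

/* Assumption: banks are interleaved on consecutive 16-byte chunks,
   so the bank pattern repeats every 256 bytes. */
static unsigned l2_bank_of(const void *p)
{
    return (unsigned)(((uintptr_t)p / L2_BANK_WIDTH) % L2_BANK_COUNT);
}

/* True when two addresses map to the same bank under that assumption,
   i.e. when loads to them on the same cycle could conflict. */
static bool l2_same_bank(const void *a, const void *b)
{
    return l2_bank_of(a) == l2_bank_of(b);
}
```

Under this model, two 16-byte aligned addresses that differ by a multiple of 256 bytes always land in the same bank, which is why arrays with matching base alignment are a likely source of conflicts.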
There are two types of mechanisms that can create bank conflicts:
- An access conflict between two elements contained in the same cache line.
- An access conflict between two blocks of data (arrays) that have a base-address alignment and a coherent access pattern.
The second of these mechanisms can become a significant issue when looping code accesses multiple arrays, although you can check these alignments easily.
If a linked list of structures creates an alignment problem, it may be a little more difficult to fix, as it may not occur on every access. This circumstance illustrates another complexity associated with linked lists, in that the alignment of the elements can be made excessively complicated, particularly if they are dynamically managed.
The following line of code creates a 256-byte aligned buffer:
buf = buf + 256 - ((UINT64)buf%256);
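where UINT64 is a typedef for an unsigned 64-bit integer, wide enough to hold a pointer. Because the expression always advances buf (by a full 256 bytes if buf is already aligned), allocate at least 256 bytes more than the buffer needs.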
The address can then be incremented by whatever multiple of 16 bytes resolves the bank conflict. Always keep buffers 16-byte aligned to preserve the option of double loads; malloc returns a 16-byte aligned address.
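A minimal sketch of how the 256-byte alignment and the 16-byte skew might be combined in portable C is shown here. The helper names `align_up` and `alloc_skewed` are hypothetical, and the sketch uses `uintptr_t` in place of the UINT64 typedef. The caller keeps the raw pointer, since that is the one that must be passed to `free`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Round p up to the next multiple of align (align must be a power of two).
   Unlike the one-line expression above, an already aligned pointer is
   returned unchanged. */
static void *align_up(void *p, uintptr_t align)
{
    uintptr_t a = (uintptr_t)p;
    return (void *)((a + align - 1) & ~(align - 1));
}

/* Allocate `size` usable bytes that start `skew` bytes past a 256-byte
   boundary. Choose skew as a multiple of 16 to keep 16-byte alignment.
   *raw receives the pointer that must later be passed to free(). */
static void *alloc_skewed(size_t size, size_t skew, void **raw)
{
    *raw = malloc(size + 256 + skew);   /* room for alignment and skew */
    if (*raw == NULL)
        return NULL;
    return (char *)align_up(*raw, 256) + skew;
}
```

For example, allocating one array with a skew of 0 and a second with a skew of 16 places their first elements in adjacent 16-byte chunks, so equal-index accesses no longer share a base alignment.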
In a real application, cache-line replacements occur regularly. This replacement rearranges the bank allocations with respect to addressing. The 16 L2 banks correspond to two L2 cache lines. The pattern repeats as four sets to give the complete eight-way associative set.
The OzQ will recirculate data accesses if a pending request for the same cache line caused an L2 cache miss. Only one pending access to a missing cache line in L2 is escalated to L3 and/or the system bus at a time. Subsequent misses that occur while the pending line is being updated recirculate until the cache line has been completely updated. While the first miss can return data from the first part of the cache line that is loaded to the L2 cache, any further accesses to that cache line prior to its complete replacement must wait until the cache line has been completely updated. This causes an extra latency in the data delivery to the registers.
There are several techniques to avoid the extra latency. The best practice is to avoid the cache-miss latency entirely by either of the following means:
- Building the program with /O3 optimization to generate the prefetches
- Using prefetch intrinsics
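The article does not name a specific prefetch intrinsic, so the following sketch substitutes the GCC/Clang `__builtin_prefetch` builtin purely as a stand-in; the Intel compiler's own intrinsic may differ. The function name `sum_with_prefetch` and the prefetch distance `PREFETCH_AHEAD` are hypothetical, and a suitable distance would have to be found by measurement.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; tune by measurement. */
enum { PREFETCH_AHEAD = 64 };

/* Sum an array while requesting data a fixed distance ahead, so the
   cache line is (ideally) already present when the load is issued. */
static double sum_with_prefetch(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = prefetch for read, 3 = high locality. */
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 3);
        }
        sum += x[i];
    }
    return sum;
}
```

A prefetch is only a hint and does not change the result of the loop, so it can be added or removed freely while tuning.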
Alternatively, the secondary miss can be avoided by one of the following techniques:
- Separating the source lines that access the same structure or array sufficiently to allow the cache line to update from the first access. This will take about 10 cycles if the data is in L3. Note: this approach really is not possible if the data is in main memory.
- Replacing arrays (or linked lists) of structures with structures of arrays, which avoids putting elements that are accessed together in the same cache line.
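The second technique can be sketched in C as follows. The `particle` type, the field names, the array size `N_PARTICLES`, and the conversion helper `aos_to_soa` are all hypothetical and serve only to contrast the two layouts.

```c
#include <stddef.h>

enum { N_PARTICLES = 1024 };   /* hypothetical element count */

/* Array of structures: x, y and z of one element are adjacent, so loads
   of all three usually hit the same cache line back to back. */
struct particle {
    double x, y, z;
};

/* Structure of arrays: each field lives in its own array, so x[i], y[i]
   and z[i] fall in different cache lines. */
struct particles_soa {
    double x[N_PARTICLES];
    double y[N_PARTICLES];
    double z[N_PARTICLES];
};

/* Copy up to N_PARTICLES elements from the AoS layout into the SoA layout. */
static void aos_to_soa(const struct particle *in,
                       struct particles_soa *out, size_t n)
{
    for (size_t i = 0; i < n && i < N_PARTICLES; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

With this particular size the three arrays start a multiple of 256 bytes apart, which under the bank layout described earlier could trade secondary misses for bank conflicts. Padding each array by a multiple of 16 bytes is one possible way to skew them.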
Counting recirculations due to secondary L2 misses cannot usually be done exactly on Itanium® processors. An upper limit on the number can be determined by summing four sub-events of L2_FORCE_RECIRC. The sum may double-count some secondary misses and even include some misses to multiple cache lines from the same associative set.
Introduction to Microarchitectural Optimization for Itanium® Processors

