Identify Bank/Address Conflicts on 64-Bit Intel Architecture


Challenge

Identify and locate bank and address conflicts in the EXE pipeline. Microprocessor cache architectures usually have access structures that allow very low latencies, but in some circumstances those latencies cannot be achieved. One of the more common restrictions arises when the program needs to load multiple pieces of data at the same time, which requires the compiler to generate code that issues two load instructions in the same cycle.

Under these conditions, address conflicts can arise, causing less-than-optimal data-access latencies. This effect is most likely to occur in floating-point-intensive applications, with their large data samples, but it is by no means restricted to them.

The L2 cache is 256 KB, eight-way associative, with 128-byte cache lines. It is organized as 16 banks, each 16 bytes wide. When multiple loads that must be satisfied by L2 are issued in the same cycle, complications can arise if there are address or bank conflicts. (If the loads can be satisfied by L1, there are no conflicts, and both loads complete in one cycle.)
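The bank geometry above can be turned into a small address check. The following is a minimal sketch, assuming the bank is selected by address bits 4 through 7 (that is, consecutive 16-byte chunks map to consecutive banks); the text does not state the exact mapping, and the names l2_bank and same_bank are hypothetical helpers introduced for illustration.

```python
# Geometry taken from the text: 16 banks, each 16 bytes wide.
BANK_WIDTH = 16
NUM_BANKS = 16


def l2_bank(addr: int) -> int:
    """Return the bank index for a byte address.

    Assumption: banks are interleaved on 16-byte boundaries, so the
    bank index is given by address bits 4-7.
    """
    return (addr // BANK_WIDTH) % NUM_BANKS


def same_bank(addr_a: int, addr_b: int) -> bool:
    """True if both addresses fall in the same bank under the assumed mapping."""
    return l2_bank(addr_a) == l2_bank(addr_b)


# Two addresses 256 bytes apart wrap around to the same bank;
# two addresses 16 bytes apart land in neighboring banks.
print(same_bank(0x1000, 0x1100))
print(same_bank(0x1000, 0x1010))
```

Under this assumption, any two loads whose addresses differ by a multiple of 256 bytes (16 banks times 16 bytes) are candidates for a bank conflict when they are issued in the same cycle.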

When multiple memory accesses issued in the same cycle target data stored in the same L2 bank, a conflict arises. The conflict increases latency because the L2 OzQ invokes a different access mechanism to resolve it. The effect is most apparent when floating-point data is loaded, as there is no additional interaction with the L1 cache.


Solution

Use the drill-down feature of the Intel® VTune™ Performance Analyzer to determine exactly which lines of code are responsible for the conflicts. Run the code in a debugger to examine the addresses of the data being loaded. Then reorder the floating-point data accesses so that loads issued in the same cycle no longer target the same bank.
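Once the load addresses are known from the debugger, the effect of reordering can be estimated offline. The sketch below is illustrative only: it assumes two streams of 8-byte floating-point values read pairwise in the same cycle, the same bits 4-7 bank mapping (an assumption, not stated in the text), and made-up base addresses. The lead parameter is a hypothetical knob that lets the second stream run a few elements ahead of the first.

```python
BANK_WIDTH = 16  # bytes per bank, from the text
NUM_BANKS = 16   # number of banks, from the text


def l2_bank(addr):
    # Assumed mapping: bank index = address bits 4-7.
    return (addr // BANK_WIDTH) % NUM_BANKS


def count_conflicts(base_a, base_b, n, lead=0, elem_size=8):
    """Count iterations where a[i] and b[i + lead] share a bank.

    base_a, base_b: byte addresses of the two arrays (as read in a debugger)
    lead: how many elements the second stream runs ahead of the first
    """
    conflicts = 0
    for i in range(n):
        addr_a = base_a + i * elem_size
        addr_b = base_b + (i + lead) * elem_size
        if l2_bank(addr_a) == l2_bank(addr_b):
            conflicts += 1
    return conflicts


# Hypothetical base addresses that differ by a multiple of 256 bytes.
BASE_A = 0x10000
BASE_B = 0x20000

same_step = count_conflicts(BASE_A, BASE_B, 1024)          # a[i] with b[i]
staggered = count_conflicts(BASE_A, BASE_B, 1024, lead=2)  # a[i] with b[i + 2]
print(same_step, staggered)
```

With these example addresses every pair collides when both streams advance in step, while staggering the second stream by two elements (16 bytes) moves each pair into neighboring banks. Whether a given reordering is legal and profitable in real code depends on the loop and on how the compiler schedules the loads.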

L2 bank conflicts cause several architectural event counters to increment, and these counters can be used to identify the issue and its severity. L2_OZQ_Cancels1.Bank_CONF is incremented for every L2 access that experiences a bank conflict. L2_Bypass is also incremented by both loads if there is a bank conflict. Bank conflicts are therefore easy to detect. Comparing these counts with L2_Data_References.L2_Data_Reads gives the fraction of L2 reads that are associated with bank conflicts.
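The severity estimate described above is a simple ratio. The following sketch shows the arithmetic with made-up counter values; the function name and the sample numbers are hypothetical, and real values would come from a VTune analyzer collection of the two events.

```python
def bank_conflict_fraction(bank_conf: int, l2_data_reads: int) -> float:
    """Fraction of L2 data reads associated with a bank conflict.

    bank_conf:      count of L2_OZQ_Cancels1.Bank_CONF
    l2_data_reads:  count of L2_Data_References.L2_Data_Reads
    """
    if l2_data_reads <= 0:
        raise ValueError("l2_data_reads must be positive")
    return bank_conf / l2_data_reads


# Hypothetical counts, for illustration only.
fraction = bank_conflict_fraction(bank_conf=250_000, l2_data_reads=2_000_000)
print(f"{fraction:.1%} of L2 data reads hit a bank conflict")
```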

In a complex piece of code, event skid in the VTune analyzer display may obscure the exact location of the bank conflict. If the Bank_CONF subevent has skidded to an obviously incorrect place, use other events to identify the correct location. Address conflicts and the associated bank conflicts require multiple loads per cycle, so look for places where two loads are issued together. The disassembly view of the executable, intermingled with the source code in the VTune analyzer, helps identify the location.

The Data_EAR_Event can also be very useful in this circumstance. Selecting the subevent with a latency above the minimum (umask = 1, latency >= 8 cycles) localizes long-latency accesses. If the event L3_READS.DATA_READ.ALL is collected in conjunction, a higher-than-normal ratio of long-latency samples to L3 data reads is a reasonable indicator, because it means there are long-latency loads that do not miss in L2. All loads that miss L2 satisfy the latency condition in any case.

Remember that the EAR events are exact in their location but are inexact with regard to number. The hardware samples a fraction, and that fraction can depend on how the program interacts with the cache-access hardware.
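Because EAR samples are exact in location but not in number, it makes sense to use them to rank locations rather than to count conflicts. The sketch below assumes the samples have been exported as (instruction address, data address, latency) tuples; that record layout, the function name, and the sample values are all hypothetical and do not reflect a specific VTune analyzer export format.

```python
from collections import Counter


def long_latency_hotspots(samples, min_latency=8):
    """Rank instruction addresses by number of long-latency samples.

    samples: iterable of (instruction_addr, data_addr, latency_cycles)
    min_latency: threshold in cycles (8 matches the subevent in the text)
    """
    counts = Counter(
        ip for ip, _data_addr, latency in samples if latency >= min_latency
    )
    # most_common() sorts by sample count, highest first.
    return counts.most_common()


# Made-up samples for illustration.
samples = [
    (0x4000, 0x1000, 5),
    (0x4010, 0x1100, 9),
    (0x4010, 0x1200, 11),
    (0x4020, 0x1300, 8),
]
hotspots = long_latency_hotspots(samples)
for ip, n in hotspots:
    print(hex(ip), n)
```

The resulting ranking points at the load instructions worth inspecting in the disassembly view; the sample counts themselves should be read as relative weights, not as the true number of slow loads.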

Two separate items will be useful in making best use of the information established using this item:

 


Source

Introduction to Microarchitectural Optimization for Itanium® Processors

 

