Quantify Integer Bank-Conflict Penalties on 64-Bit Intel Architecture


March 10, 2009 1:00 AM PDT



Challenge

Measure the performance penalty associated with L2 bank conflicts on integer-data loads. Access to cacheable integer data always goes through the L1D cache. This complicates the issue of addressing conflicts and their impact on the flow through the core pipeline. In order to understand addressing conflicts with integer data, you must comprehend the data flow from the caches in a bit more detail.

Integer address conflicts are an issue only when multiple integer-data accesses require data from the L2 cache simultaneously. There are essentially no issues associated with multiple data accesses served directly from the L1D. The L1D has two load ports and two store ports. The load ports are fully dual-ported, meaning that any two load addresses can be read from the L1D in parallel without conflict. Stores access the L1D data array in eight groups that are eight bytes wide. Stores do have the potential for conflicts, but special hardware is provided to keep these conflicts from impacting performance.

When considering the latency of multiple integer misses to independent cache lines, the bandwidth between the caches can impact the overall latency for the multiple data requests from L2. In addition to the requested data, entire 64-byte L1 cache lines are brought in. Thus, multiple accesses (requiring multiple fills) that might suffer addressing conflicts between the L2 banks have the additional bottleneck of having to transfer multiple cache lines. In fact, multiple integer accesses with no addressing conflicts encounter additional latency, as only one L1 cache line can be updated from L2 at a time.

The address conflicts are complicated by the fact that not only must the requested data not occupy the same L2 bank, but the entire cache lines associated with the data must not overlap. The net result is that for integer-data address conflicts, the L2 cache behaves as if it has four banks that are 64 bytes in width. As the data paths are different from those for floating-point data (which load directly to the floating-point register file from L2), the penalties are also different.
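The two bank views can be sketched in C as a simplified model. This is an illustration, not a description of the hardware's actual indexing logic: it assumes the L2 bank is selected by the address modulo 256 bytes (16 banks of 16 bytes each), and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 16 L2 banks, each 16 bytes wide (a 256-byte cycle),
   and 64-byte L1 cache lines. */
enum { BANK_WIDTH = 16, NUM_BANKS = 16, LINE_SIZE = 64 };

/* Bank touched by a single access (the floating-point view). */
unsigned fp_bank(uintptr_t addr) {
    return (unsigned)((addr / BANK_WIDTH) % NUM_BANKS);
}

/* Effective bank for an integer load: the whole 64-byte line is filled,
   so four adjacent 16-byte banks behave as one 64-byte-wide bank. */
unsigned int_bank(uintptr_t addr) {
    return (unsigned)((addr / LINE_SIZE) % (NUM_BANKS * BANK_WIDTH / LINE_SIZE));
}

/* Two accesses conflict in this model when they map to the same bank. */
bool fp_conflict(uintptr_t a, uintptr_t b)  { return fp_bank(a) == fp_bank(b); }
bool int_conflict(uintptr_t a, uintptr_t b) { return int_bank(a) == int_bank(b); }
```

Under this model, addresses 0x1000 and 0x1030 fall in different 16-byte banks but in the same 64-byte line, so only the integer view reports a conflict.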


Solution

Modify the floating-point-access microbenchmark from the separate item, Quantify Floating-Point Bank-Conflict Penalties on 64-Bit Intel® Architecture, to use integer loads and move instructions. The following microbenchmark achieves that goal:

You can remove the address conflicts by adjusting the base address of the second buffer stored in r36. Do this from a main program that calls the assembler function from within a loop that adjusts the second buffer’s base address in steps of 16 bytes (the L2 bank width).
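A minimal sketch of such a driver loop in C follows. The kernel signature (`load_pair_fn`), the driver name (`sweep_second_buffer`), and the stand-in kernel (`offset_stub`) are all assumptions made for illustration; the real assembler routine's interface may differ. The sketch also leaves out whatever cache preparation is needed to ensure that both loads are actually served from L2.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-byte steps (the L2 bank width) across one assumed 256-byte bank cycle. */
enum { BANK_STEP = 16, SWEEP_SPAN = 256 };

/* Hypothetical signature for the assembler kernel: performs one load from
   each buffer and returns the measured latency in cycles. */
typedef uint64_t (*load_pair_fn)(const void *buf1, const void *buf2);

/* Calls the kernel once per 16-byte offset of the second buffer's base.
   buf2 must have at least SWEEP_SPAN bytes available, and cycles_out must
   hold SWEEP_SPAN / BANK_STEP entries. Returns the number of entries written. */
size_t sweep_second_buffer(load_pair_fn kernel, const char *buf1,
                           const char *buf2, uint64_t *cycles_out) {
    size_t n = 0;
    for (size_t off = 0; off < SWEEP_SPAN; off += BANK_STEP) {
        cycles_out[n++] = kernel(buf1, buf2 + off);
    }
    return n;
}

/* Stand-in kernel for exercising the driver without the assembler routine:
   it reports the byte distance between the two addresses, not a latency. */
uint64_t offset_stub(const void *buf1, const void *buf2) {
    return (uint64_t)((uintptr_t)buf2 - (uintptr_t)buf1);
}
```

Passing the real kernel in place of `offset_stub` would yield one latency sample per base-address offset, which can then be compared against the conflict pattern.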

The pattern for integer-data conflicts is different from that for floating-point data. If the integer addresses are such that their cache lines lie in overlapping L2 banks, then there is a conflict. This translates into a one-in-four chance of a conflict, instead of the one-in-16 chance that applies to floating-point data.
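The one-in-four and one-in-16 figures can be checked by counting, as in the sketch below. It assumes a 256-byte bank cycle (16 banks of 16 bytes), a first load at offset 0 of that cycle, and a second load stepped in 16-byte increments; the function name `conflicting_offsets` is hypothetical.

```c
/* One assumed bank cycle: 16 banks of 16 bytes each, swept in 16-byte steps. */
enum { BANK_CYCLE = 256, STEP = 16 };

/* Counts how many of the 16 possible offsets of the second load fall in the
   same bank as a first load at offset 0, for banks bank_width bytes wide.
   Use 16 for the floating-point view and 64 for the integer view. */
unsigned conflicting_offsets(unsigned bank_width) {
    unsigned hits = 0;
    for (unsigned off = 0; off < BANK_CYCLE; off += STEP) {
        if (off / bank_width == 0) {
            hits++;  /* same bank as the first load */
        }
    }
    return hits;
}
```

With 16-byte banks, only one of the 16 offsets conflicts. With effective 64-byte banks, four of the 16 do, which matches the one-in-four ratio.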

On running the microbenchmark, you will find that if the two addresses fall within the same 64-byte window that corresponds to an L1 cache line in the L2 banks, the latency (which is due to the second load) increases to 11 cycles. If the two addresses do not overlap in this sense, the latency is seven cycles. Compare these numbers to the five-cycle integer latency for a single access from L2 and the one-cycle access from L1 that the compiler uses for scheduling:

Access Mode                          Latency (in Cycles)
Single Access from L1                1
Single Access from L2                5
Double Access from L2, no Overlap    7
Double Overlapping Access from L2    11



Source

Introduction to Microarchitectural Optimization for Itanium® Processors