March 10, 2009 1:00 AM PDT
Measure the performance penalty associated with bank conflicts on floating-point loads. If you are dealing with a looping algorithm and have unrolled the loops (or if the compiler has done this for you), then more than the minimal latency can be absorbed by the scheduling of the instructions. Even so, removing the bank conflicts will reduce the OzQ activity and can improve the throughput of L2 access.
Modify the latency microbenchmark from the separate item, Quantify Memory-Stall Penalties on 64-Bit Intel Architecture, to use two 256-byte-aligned buffers, and run the assembler shown below within a high-level loop that progressively bumps the base address of the second buffer, stored in r36, by 16 bytes.
This steps the relative alignment of the accesses in multiples of the bank width. If the buffers are both 256-byte aligned, then the conflict occurs on the first iteration of the outer loop and repeats every 16 iterations of the outer (relative-alignment-driving) loop. The code on the right measures the baseline, in which no loads are performed and no memory-access stalls are encountered, so that it can be subtracted off.
Running this benchmark from a high-level loop shows that the bank conflict results in an additional latency of six cycles. You can remove the additional six cycles of latency by shifting the base address by 16 (or 32, 48, 64…up to 240) bytes. Remember that the banking structure spans 256 bytes.
| Access Mode | Latency (in Cycles) |
|---|---|
| Single Access | 6 |
| Double Access with no Bank Overlap | 6 |
| Double Access with a Bank Overlap | 12 |
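The outer, alignment-driving loop described above can be sketched in C. This is a minimal illustration, not the original benchmark: `touch_pair` is a hypothetical stand-in for the timed assembly, `MAX_STEPS` is an arbitrary sweep length, and the code only records which iterations place both accesses at the same offset within the 256-byte span (the iterations where the text says the conflict occurs). It does not measure cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BANK_SPAN  256u  /* bytes spanned by the banking structure (from the text) */
#define STEP_BYTES 16u   /* per-iteration bump of the second base address (from the text) */
#define MAX_STEPS  64u   /* illustrative sweep length; an assumption */

/* Two 256-byte-aligned buffers; the second has room for the largest bump. */
static _Alignas(256) unsigned char buf_a[BANK_SPAN];
static _Alignas(256) unsigned char buf_b[BANK_SPAN + MAX_STEPS * STEP_BYTES];

/* Hypothetical stand-in for the timed assembly: one access per buffer.
   A real run would time the load pair and subtract the no-load baseline. */
static unsigned touch_pair(const volatile unsigned char *a,
                           const volatile unsigned char *b)
{
    return (unsigned)*a + (unsigned)*b;
}

/* Step the second base by 16 bytes per iteration and record, in hits[],
   the iterations where both addresses sit at the same offset inside the
   256-byte span. Returns the number of such iterations. */
static unsigned sweep_alignment(unsigned steps, unsigned char *hits)
{
    unsigned count = 0;
    assert(steps <= MAX_STEPS);
    for (unsigned i = 0; i < steps; ++i) {
        const unsigned char *a = buf_a;
        const unsigned char *b = buf_b + (size_t)i * STEP_BYTES;
        (void)touch_pair(a, b);  /* the timed kernel would go here */
        hits[i] = ((uintptr_t)a % BANK_SPAN) == ((uintptr_t)b % BANK_SPAN);
        count += hits[i];
    }
    return count;
}
```

With a sweep of 32 steps, the matching-alignment iterations are 0 and 16, which mirrors the "first iteration, then every 16 iterations" pattern. Any shift of 16 to 240 bytes in between leaves the two accesses at different offsets within the span.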
You can compute an upper limit on the contribution to CPI from floating-point loads that encounter bank conflicts with the following relation. It will overestimate the true effect, but it can be used as a guide:
CPI(FP Bank Conflicts) = L2_OZQ_Cancels1.Bank_CONF * 3 / IA64IR
The penalty of six cycles is divided by two because L2_OZQ_Cancels1.Bank_CONF counts every load that has a conflict, and thus double-counts the loads that are canceled and re-issued. This is, of course, an upper bound on the penalty due to bank conflicts. If you are dealing with a looping algorithm and have unrolled the loops (or if the compiler has done this for you), then more than the minimal latency can be absorbed by the scheduling of the instructions. Even so, removing the bank conflicts will reduce the OzQ activity and can improve the throughput of L2 access.
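The relation above is simple enough to wrap in a helper. The following is a minimal sketch, assuming the two counter values have already been collected by whatever profiling tool is in use. The function name and parameter names are illustrative, and IA64IR is taken here to be the retired-instruction count.

```c
#include <assert.h>
#include <stdint.h>

/* Six-cycle penalty divided by two, because the event counts both loads
   of each conflicting pair (per the text). */
#define CYCLES_PER_COUNTED_CONFLICT 3.0

/* Upper bound on the CPI contribution of FP-load bank conflicts.
   bank_conf_cancels: value of L2_OZQ_Cancels1.Bank_CONF
   instr_retired:     value of IA64IR (assumed: retired instructions) */
static double fp_bank_conflict_cpi(uint64_t bank_conf_cancels,
                                   uint64_t instr_retired)
{
    assert(instr_retired > 0);  /* avoid dividing by zero */
    return (double)bank_conf_cancels * CYCLES_PER_COUNTED_CONFLICT
           / (double)instr_retired;
}
```

For example, 2,000,000 counted conflicts over 100,000,000 retired instructions bounds the contribution at 0.06 CPI. Because unrolled or well-scheduled loops absorb part of the latency, the real cost is usually lower than this figure.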
For one common aspect of removing bank conflicts, see the separate item, Remove Many Bank Conflicts on 64-Bit Architecture.
Introduction to Microarchitectural Optimization for Itanium® Processors

