Quantify the Penalty of Branch Misprediction on 64-Bit Architecture


July 31, 2009 9:00 PM PDT



Challenge

Determine the performance penalty associated with the misprediction of a conditional branch on a processor based on 64-bit Intel® architecture. A separate item, How to Identify Branch Misprediction on 64-Bit Intel® Architecture, shows how to identify stalls due to branch misprediction.


Solution

Use a simple loop as shown in the following code:

.b1_2:                              .b1_2:
{ .mfi                              { .mfi
    nop.m 0                             nop.m 0
    nop.f 0                             nop.f 0
    nop.i 0                             nop.i 0
}                                   }
{ .mfi                              { .mfi
    nop.m 0                             nop.m 0
    nop.f 0                             nop.f 0
    xor r11=1,r10;;                     xor r11=1,r10;;
}                                   }
{ .mfi                              { .mfi
    nop.m 0                             nop.m 0
    nop.f 0                             nop.f 0
    nop.i 0                             nop.i 0
}                                   }
{ .mfi                              { .mfi
    nop.m 0                             nop.m 0
    nop.f 0                             nop.f 0
    cmp4.eq.unc p7,p8=r11,r0;;          cmp4.eq.unc p7,p8=r11,r0;;
}                                   }
{ .mfi                              { .mfi
    nop.m 0                             nop.m 0
    nop.f 0                             nop.f 0
    nop.i 0                             nop.i 0
}                                   }
{ .mib                              { .mib
    nop.m 0                             nop.m 0
    nop.i 0                             nop.i 0
    (p8) br.cond.sptk.clr .b2_2         (p8) br.cond.sptk.clr .b2_2
}                                   }
.b2_2:                              .b2_2:
{ .mfi                              { .mfi
    nop.m 0                             nop.m 0
    nop.f 0                             nop.f 0
    mov r10=r11                         nop.i 0
}                                   }
{ .mfb                              { .mfb
    nop.m 0                             nop.m 0
    nop.f 0                             nop.f 0
    br.ctop.sptk .b1_2;;                br.ctop.sptk .b1_2;;
}                                   }

 

Register r10 is initialized to zero before the loop. The code on the left mispredicts the branch on every other iteration of the loop. The code on the right establishes the baseline, with no branch mispredictions: the mov r10=r11 instruction is replaced by a nop.i so that the predicate p8 does not toggle. Whenever the branch is not taken, there is a misprediction and the pipeline must be flushed, even though the taken and not-taken paths lead to the same code.
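The toggling behavior described above can be checked with a small model. The following Python sketch is not a cycle-accurate simulation; it only tracks r10, r11, and the predicate p8 across iterations. It assumes that the branch is predicted taken on every iteration (a simplification of the sptk and clr hints), so every not-taken outcome counts as a misprediction. The function name simulate and its parameters are hypothetical, introduced only for illustration.

```python
def simulate(iterations, update_r10):
    """Count mispredictions in a simplified model of the loop.

    Assumption: the branch is predicted taken every time (sptk hint,
    with clr discarding learned history), so each not-taken outcome
    is a misprediction.
    """
    r10 = 0  # initialized to zero before the loop
    mispredictions = 0
    for _ in range(iterations):
        r11 = r10 ^ 1        # xor r11=1,r10
        p7 = (r11 == 0)      # cmp4.eq.unc p7,p8=r11,r0
        p8 = not p7          # p8 is the complement of p7
        taken = p8           # (p8) br.cond.sptk.clr .b2_2
        if not taken:
            mispredictions += 1
        if update_r10:
            r10 = r11        # mov r10=r11 (left); nop.i (right)
    return mispredictions


left = simulate(1000, update_r10=True)    # code on the left
right = simulate(1000, update_r10=False)  # code on the right
print(left, right)  # prints "500 0"
```

Under these assumptions, the left-hand variant mispredicts on half of the iterations and the right-hand variant never mispredicts, which matches the description above.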

For 1000 iterations of the loop, the code on the left takes 7000 cycles and the code on the right takes 4000. Because half of the loop iterations have a branch misprediction, the 3000 extra cycles are spread over 500 mispredictions, so you can conclude that a mispredicted predicated conditional branch costs six cycles. Note that such a simple loop does not incur the larger penalties that a more realistic branch misprediction would suffer from having to locate new instructions for the L1 instruction cache.
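The arithmetic behind the six-cycle figure can be written out as a short Python sketch. The function name misprediction_penalty and its parameter names are hypothetical; the numbers are the ones quoted above.

```python
def misprediction_penalty(cycles_with, cycles_baseline, iterations, mispredict_rate):
    """Average cycles lost per mispredicted branch.

    cycles_with      -- cycle count of the mispredicting loop
    cycles_baseline  -- cycle count of the loop with no mispredictions
    iterations       -- number of loop iterations measured
    mispredict_rate  -- fraction of iterations that mispredict
    """
    mispredictions = iterations * mispredict_rate
    return (cycles_with - cycles_baseline) / mispredictions


# 7000 cycles vs. 4000 cycles over 1000 iterations, half of them mispredicted
penalty = misprediction_penalty(7000, 4000, 1000, 0.5)
print(penalty)  # prints 6.0
```

The same calculation applies to other measurements, provided the baseline loop differs from the test loop only in whether the branch mispredicts.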

On Itanium® 2 processors, the branch-prediction hardware is always used; the sptk branch hint merely initializes the prediction history table (PHT). This is different from Itanium® processors, where such a branch hint precludes the use of the branch-prediction hardware. With the clr hint, the table is reinitialized each time.


Source

Introduction to Microarchitectural Optimization for Itanium® Processors