[Sandy-bridge loop buffer]

zakaria-bendifallah

Hello all,

I have a question about branch prediction on the Sandy Bridge platform.

It is known that recent Intel architectures have a loop buffer after the decoder, which lets the pipeline's front end be powered down when the loop size does not exceed a certain bound.

My question: what if a loop fits into the loop buffer and has a branch inside? Does the buffer stay active only until the branch is encountered, or does it interact with the branch predictor?

Zakaria

Thomas Willhalm (Intel)

Zakaria,

there can be branches inside a loop that is executed by the loop stream detector (LSD).

The Intel 64 and IA-32 Architectures Optimization Reference Manual lists in section 2.1.2 the necessary conditions:

The loops with the following attributes qualify for LSD/micro-op queue replay:

Up to eight chunk fetches of 32 instruction bytes
Up to 28 micro-ops (~28 instructions)
All micro-ops are also resident in the Decoded ICache
Can contain no more than eight taken branches, and none of them can be a CALL or RET
Cannot have mismatched stack operations, for example more PUSH than POP instructions
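
As a rough illustration, the conditions listed above can be written as a simple checklist. This is only a sketch: the struct loop_profile, its field names, and the function lsd_may_qualify are hypothetical and do not come from the manual or from any Intel tool, and "mismatched stack operations" is simplified here to "PUSH count differs from POP count". Passing the check means the loop meets the listed conditions, not that the hardware will necessarily replay it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-loop summary; the field names are illustrative only. */
struct loop_profile {
    unsigned fetch_chunks_32b;        /* 32-byte instruction fetch chunks spanned */
    unsigned micro_ops;               /* micro-ops in the loop body */
    unsigned taken_branches;          /* taken branches per iteration */
    bool all_uops_in_decoded_icache;  /* every micro-op resident in the Decoded ICache */
    bool has_call_or_ret;             /* loop body contains a CALL or RET */
    unsigned pushes;                  /* PUSH instructions in the body */
    unsigned pops;                    /* POP instructions in the body */
};

/* Returns true when the profile satisfies every condition quoted above.
   Assumption: stack operations count as "matched" when pushes == pops. */
bool lsd_may_qualify(const struct loop_profile *lp)
{
    assert(lp != NULL);
    return lp->fetch_chunks_32b <= 8
        && lp->micro_ops <= 28
        && lp->all_uops_in_decoded_icache
        && lp->taken_branches <= 8
        && !lp->has_call_or_ret
        && lp->pushes == lp->pops;
}
```

For example, a loop of 12 micro-ops with two taken branches passes the check, while the same loop with 29 micro-ops or with a CALL inside does not.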

Kind regards
Thomas

zakaria-bendifallah

Hi Thomas,

Sorry, I forgot to check the manual.
Well, up to eight taken branches, this is just wonderful :)

Thanks a lot.

Best regards,
Zakaria

Thomas Willhalm (Intel)

Zakaria,

you are welcome. The Optimization Reference Manual is full of gems, but they are very easy to miss.

Kind regards
Thomas

Tim Prince

One point I missed is whether Loop Stream detection differs between Nehalem and Sandy Bridge. As I understand it, the micro-op cache on Sandy Bridge is intended to supplement Loop Stream detection.
For GNU compilers, I have found the options -funroll-loops --param max-unroll-times=4 effective with the Loop Stream implementation from Nehalem on. Small loops can be unrolled by 4 to good effect and still hit the loop stream detector or micro-op cache (even when there is an if block), eliminating the need for more aggressive unrolling.
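
To make this concrete, here is a minimal sketch of the kind of small loop with an if block that the options above are aimed at. The function name sum_positive, the file name in the comment, and the use of -O2 are assumptions for illustration. Whether the unrolled loop actually fits the loop stream detector or micro-op cache depends on the code the compiler generates, so that would have to be checked on the generated assembly or with performance counters.

```c
#include <assert.h>
#include <stddef.h>

/* Sums only the positive elements: a short loop body containing an if block.
   Possible build line (file name and -O2 are assumptions; the unrolling
   options are the ones mentioned in the post above):
     gcc -O2 -funroll-loops --param max-unroll-times=4 -c sum_positive.c */
double sum_positive(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (x[i] > 0.0)
            sum += x[i];
    }
    return sum;
}
```

Capping the unroll factor at 4 keeps the unrolled body short, which is the point of the suggestion: the loop gets some of the benefit of unrolling while staying small.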
