[Sandy-bridge loop buffer]

Hello all,

I have a question about branch prediction on the Sandy Bridge platform.

As is well known, recent Intel architectures have a loop buffer after the decoders that lets the pipeline's front end power down while a loop is replayed, provided the loop size does not exceed a certain bound.

My question is: what if the loop fits into the loop buffer and has a branch inside? Will the buffer stay active only until the branch is encountered, or is there some coupling with the branch predictor?

Zakaria

Zakaria,

There can be branches inside a loop that is replayed by the loop stream detector (LSD).

The Intel 64 and IA-32 Architectures Optimization Reference Manual lists the necessary conditions in section 2.1.2:

The loops with the following attributes qualify for LSD/micro-op queue replay:

Up to eight chunk fetches of 32-instruction-bytes
Up to 28 micro-ops (~28 instructions)

All micro-ops are also resident in the Decoded ICache

Can contain no more than eight taken branches and none of them can be a CALL or RET

Cannot have mismatched stack operations. For example, more PUSH than POP instructions.
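
As a rough illustration only, the conditions quoted above can be written down as a simple predicate. This is a sketch, not something the hardware or the manual exposes: the struct `loop_desc`, its field names, and the function `lsd_candidate` are all hypothetical, and you would have to obtain the numbers yourself (for example from a disassembly). "Mismatched stack operations" is modeled here simply as the PUSH and POP counts being unequal.

```c
#include <stdbool.h>

/* Hypothetical description of one loop body; all names are illustrative. */
struct loop_desc {
    int  fetch_chunks_32b;       /* 32-byte fetch chunks the loop spans */
    int  micro_ops;              /* micro-ops in the loop body */
    bool all_in_decoded_icache;  /* every micro-op resident in the Decoded ICache */
    int  taken_branches;         /* taken branches per iteration */
    bool has_call_or_ret;        /* any CALL or RET inside the loop */
    int  pushes;                 /* PUSH instructions in the loop */
    int  pops;                   /* POP instructions in the loop */
};

/* Returns true if the loop meets every condition from the quoted list. */
static bool lsd_candidate(const struct loop_desc *l)
{
    return l->fetch_chunks_32b <= 8
        && l->micro_ops <= 28
        && l->all_in_decoded_icache
        && l->taken_branches <= 8
        && !l->has_call_or_ret
        && l->pushes == l->pops;   /* assumption: "mismatched" means unequal counts */
}
```

For example, a loop described as `{2, 12, true, 2, false, 0, 0}` passes the check, while the same loop with `has_call_or_ret` set to `true` does not.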

Kind regards
Thomas

Hi Thomas,

Sorry, I forgot to check the manual.
Well, up to 8 branches, this is just wonderful :)

Thanks a lot.

Best regards,
Zakaria

Zakaria,

You are welcome. The Optimization Reference Manual is full of gems, but they are very easy to miss.

Kind regards
Thomas

One point I missed is whether Loop Stream Detection differs between Nehalem and Sandy Bridge. As I understand it, the micro-op cache on Sandy Bridge is intended to supplement Loop Stream Detection.
For GNU compilers, I have found the options -funroll-loops --param max-unroll-times=4 effective with the Loop Stream implementation from Nehalem onward. Small loops can be unrolled by 4 to good effect and still hit the loop stream detector or the micro-op cache (even when they contain an if block), which removes the need for more aggressive unrolling.
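
To make that concrete, here is the kind of small loop with an if block I have in mind. It is only a sketch: the function name `sum_positive` and the file name in the build comment are made up for illustration. Whether GCC keeps the `if` as a real branch or turns it into a conditional move, and whether the loop unrolled by 4 still fits the limits quoted earlier in the thread, depends on the compiler version and target, so it is worth checking the generated assembly (for example with -S).

```c
#include <stddef.h>

/* Build with the options mentioned above (file name is hypothetical):
 *   gcc -O2 -funroll-loops --param max-unroll-times=4 -c sum_positive.c
 */

/* Sum only the positive elements; the `if` is the branch inside the loop. */
long sum_positive(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)
            sum += a[i];
    }
    return sum;
}
```

The loop body is only a handful of instructions, so capping the unroll factor at 4 keeps the unrolled body small instead of letting it grow further.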
