[Sandy-bridge loop buffer]

zakaria-bendifallah

Hello all,

I have a question about branch prediction on the Sandy Bridge platform.

It is known that recent Intel architectures have a loop buffer after the decoder, which lets the pipeline's front end be powered down when the loop size does not exceed a certain bound.

My question is: what happens if the loop fits into the loop buffer and has a branch inside? Will the buffer stay active only until the branch is encountered, or is there some coupling with the branch predictor?

Zakaria

Thomas Willhalm (Intel)

Zakaria,

there can be branches inside a loop that is replayed from the Loop Stream Detector (LSD).

The Intel 64 and IA-32 Architectures Optimization Reference Manual lists in section 2.1.2 the necessary conditions:

The loops with the following attributes qualify for LSD/micro-op queue replay:

Up to eight chunk fetches of 32-instruction-bytes
Up to 28 micro-ops (~28 instructions)

All micro-ops are also resident in the Decoded ICache

Can contain no more than eight taken branches and none of them can be a CALL or RET

Cannot have mismatched stack operations. For example, more PUSH than POP instructions.
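
The conditions listed above can be read as a simple checklist. The following is only a sketch of that checklist in C: the `loop_profile` struct, its field names, and `lsd_may_qualify` are hypothetical names made up for illustration, not part of any Intel tool or API, and the hardware may apply further constraints that this check does not model.

```c
#include <stdbool.h>

/* Hypothetical summary of a loop body. The field names are illustrative
   only; they mirror the conditions quoted from the manual. */
struct loop_profile {
    int  fetch_chunks_32b;            /* 32-byte instruction fetch chunks spanned */
    int  micro_ops;                   /* micro-ops in the loop body */
    bool all_uops_in_decoded_icache;  /* every micro-op resident in the Decoded ICache */
    int  taken_branches;              /* taken branches per iteration */
    bool has_call_or_ret;             /* any CALL or RET in the loop */
    int  pushes;                      /* PUSH instructions in the loop */
    int  pops;                        /* POP instructions in the loop */
};

/* Returns true if the loop passes every condition in the quoted list.
   Assumption: "mismatched stack operations" is modeled as pushes != pops. */
bool lsd_may_qualify(const struct loop_profile *p)
{
    return p->fetch_chunks_32b <= 8
        && p->micro_ops <= 28
        && p->all_uops_in_decoded_icache
        && p->taken_branches <= 8
        && !p->has_call_or_ret
        && p->pushes == p->pops;
}
```

For instance, a loop spanning two fetch chunks with 12 micro-ops and one taken branch passes this check, while the same loop with 40 micro-ops or with a CALL inside does not.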

Kind regards
Thomas

zakaria-bendifallah

Hi Thomas,

Sorry, I forgot to check the manual.
Well, up to eight taken branches, this is just wonderful :)

Thanks a lot.

Best regards,
Zakaria

Thomas Willhalm (Intel)

Zakaria,

you are welcome. The Optimization Reference Manual is full of gems, but they are very easy to miss.

Kind regards
Thomas

Tim Prince

One thing I missed is the question of whether Loop Stream Detection differs between Nehalem and Sandy Bridge. The micro-op cache on Sandy Bridge is intended to supplement Loop Stream Detection, as I understand it.
For GNU compilers, I have found the options -funroll-loops --param max-unroll-times=4 effective with the Loop Stream Detector implementation from Nehalem on. Small loops may be unrolled by 4 to good effect and still hit the Loop Stream Detector or the micro-op cache (even when there is an if block), eliminating the need for more aggressive unrolling.
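
As an illustration of the kind of small loop with an if block meant here, a minimal sketch follows. The function `sum_positive` and the file name below are made up for this example; whether the unrolled loop actually stays within the Loop Stream Detector or micro-op cache limits depends on the compiler version and the generated code, so it would need to be checked on the target machine.

```c
#include <stddef.h>

/* Sum only the positive elements of a[0..n-1].
   The if block is the branch inside the loop body. */
double sum_positive(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (a[i] > 0.0)
            s += a[i];
    }
    return s;
}
```

Assuming the code is saved as sum_positive.c, it could be compiled with something like `gcc -O2 -funroll-loops --param max-unroll-times=4 -c sum_positive.c`. The -O2 level is an assumption; only the unrolling options come from the post above.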
