TITLE: Categorizing delivery of micro-ops from front end to back end of the instruction pipeline
ISSUE_NAME:
FrontEnd^FE_BW^Deliver0uop
FrontEnd^FE_BW^Deliver1uop
FrontEnd^FE_BW^Deliver2uops
FrontEnd^FE_BW^Deliver3uops
FrontEnd^FE_BW^Deliver4uops
DESCRIPTION:
We can use performance monitoring events to break down the distribution of cycles in which 0, 1, 2, 3, or 4 micro-ops are delivered from the front end:
FrontEnd^FE_BW^Deliver0uop = Cycles when 0 uops were delivered from FE to BE of pipeline
FrontEnd^FE_BW^Deliver1uop = Cycles when 1 uop was delivered from FE to BE of pipeline
FrontEnd^FE_BW^Deliver2uops = Cycles when 2 uops were delivered from FE to BE of pipeline
FrontEnd^FE_BW^Deliver3uops = Cycles when 3 uops were delivered from FE to BE of pipeline
FrontEnd^FE_BW^Deliver4uops = Cycles when 4 uops were delivered from FE to BE of pipeline
RELEVANCE:
This performance monitoring usage applies only to the microarchitectures code-named Sandy Bridge and Ivy Bridge.
EXAMPLE:
Calculations
Percentage of cycles the front end is effective (delivering four micro-ops), or execution is back end bound:
%FE.DELIVERING =
100 * ( CPU_CLK_UNHALTED.THREAD -
IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering three micro-ops per cycle:
%FE.DELIVER.3UOPS =
100 * ( IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE -
IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering two micro-ops per cycle:
%FE.DELIVER.2UOPS =
100 * ( IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE -
IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering one micro-op per cycle:
%FE.DELIVER.1UOPS =
100 * ( IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE -
IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering zero micro-ops per cycle:
%FE.DELIVER.0UOPS =
100 * ( IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE ) /
CPU_CLK_UNHALTED.THREAD;
Average Micro-ops Delivered per Cycle: This ratio assumes that the front end could
potentially deliver four micro-ops per cycle when bound in the back end.
AVG.uops.per.cycle =
(4 * (%FE.DELIVERING) + 3 * (%FE.DELIVER.3UOPS) + 2 * (%FE.DELIVER.2UOPS) +
(%FE.DELIVER.1UOPS ) ) / 100
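The calculations above can be sketched in a few lines of Python. This is a minimal illustration, not part of any profiling tool: the FrontEndCounts type, its field names, and the fe_delivery_breakdown function are hypothetical, and the sample counter values are invented for the arithmetic rather than measured. It assumes the five raw event counts have already been collected over the same interval.

```python
from dataclasses import dataclass


@dataclass
class FrontEndCounts:
    """Raw event counts (hypothetical container; field names are illustrative)."""
    clk: int   # CPU_CLK_UNHALTED.THREAD
    le3: int   # IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE
    le2: int   # IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE
    le1: int   # IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE
    zero: int  # IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE


def fe_delivery_breakdown(c: FrontEndCounts) -> dict:
    """Apply the percentage formulas from this section to one set of counts."""
    if c.clk <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")

    def pct(cycles: int) -> float:
        return 100.0 * cycles / c.clk

    m = {
        "%FE.DELIVERING": pct(c.clk - c.le3),
        "%FE.DELIVER.3UOPS": pct(c.le3 - c.le2),
        "%FE.DELIVER.2UOPS": pct(c.le2 - c.le1),
        "%FE.DELIVER.1UOPS": pct(c.le1 - c.zero),
        "%FE.DELIVER.0UOPS": pct(c.zero),
    }
    # Weighted average; cycles counted in %FE.DELIVERING are credited
    # with four micro-ops, as the ratio in the text assumes.
    m["AVG.uops.per.cycle"] = (
        4 * m["%FE.DELIVERING"]
        + 3 * m["%FE.DELIVER.3UOPS"]
        + 2 * m["%FE.DELIVER.2UOPS"]
        + m["%FE.DELIVER.1UOPS"]
    ) / 100
    return m


# Invented sample counts, chosen only to make the arithmetic easy to follow.
sample = FrontEndCounts(clk=1000, le3=600, le2=400, le1=250, zero=100)
metrics = fe_delivery_breakdown(sample)
for name, value in metrics.items():
    print(f"{name:22s} {value:6.2f}")
```

Because each CYCLES_LE_n count includes the cycles counted by the lower thresholds, the five percentages should sum to 100; a sum that drifts from 100 suggests the counts were not collected over the same interval.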
SOLUTION:
The distribution of micro-ops delivered per cycle hints at which front end
bottlenecks might be occurring. Issues such as length-changing prefixes (LCPs) and
penalties from switching from the decoded ICache to the legacy decode pipeline tend
to result in zero micro-ops being delivered for several cycles. Fetch bandwidth
issues and decoder stalls result in fewer than four micro-ops delivered per cycle.
RELATED_SOURCES:
NOTES:

