Categorizing delivery of micro-ops from front end to back end of the instruction pipeline

Michael Chynoweth (Intel)

TITLE: Categorizing delivery of micro-ops from front end to back end of the instruction pipeline

ISSUE_NAME:

FrontEnd^FE_BW^Deliver0uop

FrontEnd^FE_BW^Deliver1uop

FrontEnd^FE_BW^Deliver2uops

FrontEnd^FE_BW^Deliver3uops

FrontEnd^FE_BW^Deliver4uops

DESCRIPTION:

We can use performance monitoring events to break down the distribution of cycles in which 0, 1, 2, 3, or 4 micro-ops are delivered from the front end (FE) to the back end (BE):

FrontEnd^FE_BW^Deliver0uop = Cycles when 0 uops were delivered from FE to BE of pipeline

FrontEnd^FE_BW^Deliver1uop = Cycles when 1 uop was delivered from FE to BE of pipeline

FrontEnd^FE_BW^Deliver2uops = Cycles when 2 uops were delivered from FE to BE of pipeline

FrontEnd^FE_BW^Deliver3uops = Cycles when 3 uops were delivered from FE to BE of pipeline

FrontEnd^FE_BW^Deliver4uops = Cycles when 4 uops were delivered from FE to BE of pipeline

RELEVANCE:
This performance monitoring usage applies only to the architectures code-named Sandy Bridge and Ivy Bridge.

EXAMPLE:

Calculations
Percentage of cycles in which the front end is effective (delivering four micro-ops) or execution is back-end bound:

%FE.DELIVERING =
    100 * ( CPU_CLK_UNHALTED.THREAD -
            IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE ) /
    CPU_CLK_UNHALTED.THREAD;

Percentage of cycles the front end is delivering three micro-ops per cycle:

%FE.DELIVER.3UOPS =
    100 * ( IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE -
            IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE ) /
    CPU_CLK_UNHALTED.THREAD;

Percentage of cycles the front end is delivering two micro-ops per cycle:

%FE.DELIVER.2UOPS =
    100 * ( IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE -
            IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE ) /
    CPU_CLK_UNHALTED.THREAD;

Percentage of cycles the front end is delivering one micro-op per cycle:

%FE.DELIVER.1UOPS =
    100 * ( IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE -
            IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE ) /
    CPU_CLK_UNHALTED.THREAD;

Percentage of cycles the front end is delivering zero micro-ops per cycle:

%FE.DELIVER.0UOPS =
    100 * ( IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE ) /
    CPU_CLK_UNHALTED.THREAD;
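
The five percentages above can be computed together once the raw event counts have been collected. The following Python sketch is only an illustration: the function name, the parameter names, and the sample counts are hypothetical and do not come from any Intel tool. It assumes all five events were counted over the same measurement interval.

```python
def fe_delivery_breakdown(clk, le3, le2, le1, zero):
    """Percentage of cycles in each front-end delivery bucket.

    Hypothetical parameter names, mapped to the events in the text:
      clk  = CPU_CLK_UNHALTED.THREAD
      le3  = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE
      le2  = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE
      le1  = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE
      zero = IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
    """
    if clk <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")

    def pct(cycles):
        return 100.0 * cycles / clk

    # Each "LE_n" event includes every lower bucket, so adjacent
    # counts are subtracted to isolate a single bucket.
    return {
        "%FE.DELIVERING": pct(clk - le3),
        "%FE.DELIVER.3UOPS": pct(le3 - le2),
        "%FE.DELIVER.2UOPS": pct(le2 - le1),
        "%FE.DELIVER.1UOPS": pct(le1 - zero),
        "%FE.DELIVER.0UOPS": pct(zero),
    }


# Made-up counts, for illustration only.
breakdown = fe_delivery_breakdown(
    clk=1_000_000, le3=400_000, le2=250_000, le1=150_000, zero=100_000
)
for name, value in breakdown.items():
    print(f"{name:>18} = {value:5.1f}")
```

Because the buckets are built from successive differences, the five percentages add up to 100, which is a quick sanity check on the collected counts.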

Average Micro-ops Delivered per Cycle: This ratio assumes that the front end could potentially deliver four micro-ops per cycle when bound in the back end.

AVG.uops.per.cycle =
    ( 4 * (%FE.DELIVERING) + 3 * (%FE.DELIVER.3UOPS) + 2 * (%FE.DELIVER.2UOPS) +
      (%FE.DELIVER.1UOPS) ) / 100
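
As a worked instance of the weighted average, the short Python sketch below evaluates the ratio for a set of made-up percentages. The function and argument names are hypothetical, chosen here only to mirror the metric names in the formula.

```python
def avg_uops_per_cycle(pct_delivering, pct_3uops, pct_2uops, pct_1uops):
    """Weighted average of micro-ops delivered per cycle.

    Inputs are percentages (0 to 100) corresponding to %FE.DELIVERING,
    %FE.DELIVER.3UOPS, %FE.DELIVER.2UOPS and %FE.DELIVER.1UOPS.
    Cycles delivering zero micro-ops contribute nothing to the sum.
    """
    weighted = (4 * pct_delivering + 3 * pct_3uops
                + 2 * pct_2uops + pct_1uops)
    return weighted / 100.0


# Made-up percentages, for illustration only:
# (4*60 + 3*15 + 2*10 + 5) / 100 = 310 / 100
avg = avg_uops_per_cycle(60.0, 15.0, 10.0, 5.0)
print(f"AVG.uops.per.cycle = {avg:.2f}")
```

Note that %FE.DELIVERING is weighted by four even though it also covers back-end-bound cycles, which is exactly the assumption stated for this ratio.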

SOLUTION:

Seeing the distribution of micro-ops delivered per cycle hints at the front-end bottlenecks that might be occurring. Issues such as length-changing prefixes (LCPs) and penalties from switching from the decoded ICache to the legacy decode pipeline tend to result in zero micro-ops being delivered for several cycles. Fetch bandwidth issues and decoder stalls result in fewer than four micro-ops delivered per cycle.

RELATED_SOURCES:

NOTES:
