Front End Bound

TITLE: Front End Bound

ISSUE_NAME: Frontend

DESCRIPTION:

This category reflects slots where the Frontend of the processor undersupplies its Backend. The Frontend is the first portion of the pipeline, where the branch predictor predicts the next address to fetch, cache lines are fetched and parsed into instructions, and the instructions are decoded into micro-ops (uops) that the Backend can execute later. The purpose of the Frontend cluster is to deliver uops to the Backend whenever the latter can accept them. The IDQ (decoded uops queue) queues the uops delivered by the Frontend to the Backend. Stalls due to instruction-cache misses are one example of stalls that should be counted in the Frontend Bound bucket.

To calculate this bucket, we use a counter designated for non-delivered uops (stalled allocation pipeline slots) that counts only when such uops could otherwise have been accepted, that is, when there was no Backend stall:

IDQ_UOPS_NOT_DELIVERED.CORE / (4*CPU_CLK_UNHALTED.THREAD)
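The arithmetic of this formula can be sketched as follows. This is a minimal illustration, assuming the two event counts have already been collected by a profiler and that the pipeline is 4 slots wide, as in the formula. The function name, parameter names and sample counts are hypothetical and not part of any tool's API.

```python
def frontend_bound(idq_uops_not_delivered_core, cpu_clk_unhalted_thread,
                   slots_per_cycle=4):
    """Return the fraction of pipeline slots lost to the Frontend.

    Both arguments are raw event counts for the same measurement interval.
    slots_per_cycle=4 mirrors the constant in the formula above.
    """
    if cpu_clk_unhalted_thread <= 0:
        raise ValueError("cycle count must be positive")
    total_slots = slots_per_cycle * cpu_clk_unhalted_thread
    return idq_uops_not_delivered_core / total_slots


# Hypothetical counts: 1.2M undelivered uop slots over 1M unhalted cycles.
ratio = frontend_bound(1_200_000, 1_000_000)
print(f"Front End Bound: {ratio:.1%}")  # prints "Front End Bound: 30.0%"
```

With these made-up counts, 1.2 million of the 4 million available slots went unfilled, so 30% of the slots are attributed to the Frontend.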

The qualification with no Backend stall is very important here, as it lets us correctly identify the slots in which the Frontend was the limiter. Accounting at slot granularity is also important, as it catches slight inefficiencies where fewer than the optimal number of uops were delivered in a cycle. IDQ_UOPS_NOT_DELIVERED.CORE is a notable counter, introduced in SandyBridge and defined with the Top Down mindset. Prior to SandyBridge it was difficult to get an accurate estimate of Frontend penalties, especially in client workloads. Such workloads typically do not suffer from long-latency stalls like icache or iTLB misses, but may suffer from issues like limited instruction decoding bandwidth, which carry smaller penalties and often manifest as less than the optimal delivery of 4 uops/cycle (1, 2 or 3 uops). Such cycles were traditionally considered good because some allocation did occur, which led to Frontend issues being underestimated.
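The difference between slot and cycle granularity can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of how the hardware counter is implemented: the per-cycle delivery trace is invented, every cycle is assumed to have no Backend stall, and both function names are hypothetical.

```python
WIDTH = 4  # allocation slots per cycle, as in the formula above


def slot_level_fraction(delivered_per_cycle):
    """Count every unfilled slot, including those in partially filled cycles."""
    missed = sum(WIDTH - d for d in delivered_per_cycle)
    return missed / (WIDTH * len(delivered_per_cycle))


def cycle_level_fraction(delivered_per_cycle):
    """Count only cycles with no delivery at all (the traditional view)."""
    stalled = sum(1 for d in delivered_per_cycle if d == 0)
    return stalled / len(delivered_per_cycle)


# Invented trace: uops delivered in each of 8 cycles, no Backend stall.
trace = [4, 3, 2, 0, 4, 1, 4, 3]
print(slot_level_fraction(trace))   # 0.34375 (11 of 32 slots unfilled)
print(cycle_level_fraction(trace))  # 0.125   (1 of 8 cycles fully stalled)
```

In this made-up trace the cycle-level view reports 12.5%, while slot accounting reports about 34%, because the cycles that delivered 1, 2 or 3 uops also lost slots.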

When HyperThreading (HT) is enabled, allocation alternates between the two threads. IDQ_UOPS_NOT_DELIVERED.CORE is designed to account only for the thread currently allocating. This attributes allocation slots accurately, which enables the Top Level breakdown for HT.

RELEVANCE:

This metric can be used to determine, at a high level, whether the CPU is bound by front end issues.

EXAMPLE:

I-cache misses, iTLB misses, Frontend penalties after misprediction clears, LCP stalls, DSB to MITE switches, decoder inefficiency and various other front end issues can cause this metric to be high.

SOLUTION:

Drill down into the lower level front end metrics to find the specific performance issue.

RELATED_SOURCES:

NOTES:

This metric is measured by specifically counting instances where the backend of the machine is requesting uops and the front end is unable to fill all pipeline slots.

EQUATION: IDQ_UOPS_NOT_DELIVERED.CORE / (4*CPU_CLK_UNHALTED.THREAD)
