Counting stall cycles on Sandy Bridge

Counting stall cycles on Sandy Bridge

I have (yet another) question about performance counters. I am testing the following simple code on Sandy Bridge:

{
      int i;
      double a = 1.0, b = 2.0, c = 1.0, d = 3.0, e = 1.0, f = 4.0;
      for(i=0; i<1000000; i++){
            a+=b;
            c+=d;
            e+=f;
      }
      printf(".", a, b, c, d, e, f);
}

The assembly looks as follows:

.L77:
    vaddsd    %xmm8, %xmm0, %xmm0
    vaddsd    %xmm7, %xmm2, %xmm2
    vaddsd    %xmm6, %xmm4, %xmm4
    subl    $1, %edi
    jne    .L77

To the best of my knowledge, the above iteration takes 3 cycles per iteration: 3 independent add operations are issued to port 1, so the latency of 3 clocks per add is hidden. Moreover, loop management goes to other ports. The execution is hence bounded by Port 1 and on average 1 loop iteration takes 3 clocks. I have verified that this is the case by timing the code. On my computer (i7-2620M fixed at 2.7GHz), executing the above loop takes 0.001128s, which is almost exactly 3 clocks. The loop has 5 instructions in total, but 4 uops due to fusion, so in every clock cycle there should be on average 4/3 uops issued and retired, with at least 1 uop issued/retired.

Now I expected that analyzing the performance counters for such a simple case would be easy, but I must be clearly missing something! Below are some counter values (per iteration) I obtained:

evt 0x0e, mask 0x01, cmask 0, inv 0  UOPS_ISSUED.ANY            4
evt 0x0e, mask 0x01, cmask 1, inv 1  UOPS_ISSUED.ANY, stalls    2
evt 0xa2, mask 0x01, cmask 0, inv 0, RESOURCE_STALLS.ANY        2
evt 0xc2, mask 0x01, cmask 1, inv 1, UOPS_RETIRED.ALL, stalls   0.4
evt 0xc2, mask 0x02, cmask 0, inv 0, UOPS_RETIRED.RETIRE_SLOTS  4

I don't understand why do I see stalls reported? In every cycle there is at least one instruction issued/retired, which would be my understanding of no stalls. However, I get non-zero values for UOPS_ISSUED stall cycles, UOPS_RETIRED stall cycles (different) and RESOURCE_STALLS.ANY. Now I do not exactly know what RESOURCE_STALLS.ANY means, but surely the two other counters should read 0? Or are those counters counting unused resources per clock cycle? But then it also does not add up: there are 4 retirement slots, but only 4/3 used in every clock cycle, so UOPS_RETIRED.ALL, stalls should then show a value of 2.666, not 0.4.

Could anyone explain why am I completely wrong here?

Thanks!

 

9 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Hello Marcin,

Is there an event to show how many uops or instructions are retired per cycle? It seems like there is an event to count how often just 1 instruction (or uop) is retired/cycle or 2 or more instructions (or uops)/cycle, or 3 or more instr (or uops)/cycle, etc.

It seems possible (likely) that more than 1 instruction (or uop) is getting retired per cycle and that there are stalls the other cycles.

Pat

Best Reply

Here are some ideas....   I am drawing heavily from Agner Fog's wonderful documents on the microarchitecture of various processors, and trying to pay very careful attention to the specific wording in Intel's description of the performance counters.

Event 0eh, Umask 01h: UOPS_ISSUED.ANY --- shows 4 per iteration, exactly as expected.  But also shows 2 cycles per iteration with no increment of the counter.   The exact definition of the counter is "increments each cycle the number of Uops issued by the RAT to the RS".   From Agner Fog's microarchitecture description, it appears that the RAT (Register Alias Table) can handle 4 uops per clock and can rename 4 registers per clock.  The loop has four Uops, so only one cycle would be needed for the RAT to issue these to the RS, leaving two cycles per iteration with no uops issued -- as your measurements seem to show.   

I don't quite understand how the RAT manages to perform the required register renaming in one cycle --- the code uses six xmm registers, so I would have expected the RAT to require two cycles for the renaming, leaving only one idle cycle per iteration.   On the other hand, if we consider four iterations of the loop, there would be 4*6=24 registers to be renamed, which should take the RAT 6 cycles.  This is 50% of the 12 cycles required for the four iterations, so would increase the average number cycles with no RAT to RS transfers from 1 to 1.5.   The only way I can get to 2 cycles with no RAT to RS transfers would be for the RAT to handle all the register renaming in 1 cycle.

Event A2h, Mask 01h: RESOURCE_STALLS.ANY -- shows 2.    This is defined as the "Cycles Allocation is stalled due to resource related reason".   I think "Allocation" is the key word here, and that this event is showing exactly the same cycles as the previous counter.  If I am correct, the processor issues instructions until the Reservation Station for Port 1 is full.  This backs up through the Reorder Buffer to the Register Alias Table.  Once the processor gets into this state, the RAT then issues four Uops in one cycle once every three cycles as Port 0 reservation station entries become available.   Of course it is plausible that the RAT could issue 1 or 2 uops in every cycle (averaging four every three cycles), but my hypothesis of "bursty" behavior appears consistent with your measurements.  You might try the other Umasks for Event A2h -- my guess is that most of the stalls will be associated with Umask 04h "Cycles stalled due to no eligible RS entry available".

Event C2h, Umask 01h: UOPS_RETIRED.ALL -- shows 4 uops retired per iteration (as expected), but shows 0.4 cycles per iteration with no uops retired.   I think "retired" is the key word here.  I can't find exact numbers on how many Uops the Sandy Bridge core can retire per cycle, but it is likely at least 4 (the specification for the Core 2 and Nehalem cores, according to Intel's Performance Optimization Guide).  Since you only need to retire 4 Uops in 3 cycles, the "burstiness" of the retirement could give you values anywhere between 0 "idle" cycles and 2 "idle" cycles.  The observed value of 0.4 cycles with no instructions retired suggests that the mechanism is less "bursty" than the issue mechanism, and that you retire the 4 uops in an average of 2.6 cycles.

Summary:  It is critical to pay attention to the differences between fetch, issue, execution, and retirement when trying to understand modern out-of-order processors.  Any of these mechanisms that are not operating at 100% utilization may exhibit varying degrees of "burstiness", so even if you know that you are *executing* one Uop per cycle on Port 1 (which you can verify directly using Event A1h, Umask 02h), this only constrains the *average* behavior of the issue and retirement mechanisms, and does not imply that those mechanisms stay in cycle-by-cycle synchronization with the execution.  (The same applies to instruction fetch, though that case is potentially much more complicated due to the Uop buffers supported by the hardware.)

Disclaimer: I have no particular insights into Intel's processor design, so I certainly cannot rule out either misunderstanding on my part or bugs in the performance counters on Intel's part.  All I can say is that these results appear plausible for a strongly OOO system with decoupled issue, execution, and retirement.

John D. McCalpin, PhD
"Dr. Bandwidth"

Of course as soon as I posted the previous note I thought of a completely plausible mechanism that would account for the Event 0Eh and A2h results.

Assume that the structure of the pipeline includes this ordering:

  1. Fetch
  2. Decode
  3. Rename (Register Alias Table: RAT)
  4. Re-Order (Re-Order Buffer: ROB) Read
  5. Reservation Station (RS) -- queues for each of the Ports (=Execution units)
  6. Execute
  7. Re-Order (ROB) Writeback results
  8. Retire

These are all structures inside a single core, so the flow of information between units is 1:1.  This means that credit-based flow control between units is a reasonable approach to handling "back-up" in the pipeline.    Recall that the Renamer can rename four registers per cycle and send four Uops per cycle from the RAT to the RS.  (It looks like steps 3 & 4 above might have been integrated in Sandy Bridge, but the details don't matter.)    The ROB is large -- 168 entries according to http://www.anandtech.com/show/6355/intels-haswell-architecture/8, while the Reservation Stations are somewhat smaller at 54 entries.  (This makes sense -- the Reservation Stations are only for instructions whose arguments are ready, while the ROB needs to be larger to allow rearranging around instructions whose arguments are not ready.)     The flow-control mechanism only operates when either the ROB or the RS is full, which means that there is lots of work already queued up.  For example, if the RS is full (54 entries), there is not a lot of benefit in telling the Renamer immediately when one additional RS entry becomes available.  It makes perfect sense to build the credit-based flow control on larger "chunks", with 4 Uops as a completely reasonable size.   This means that once the instruction issue has filled up either the ROB or RS, the RAT is only going to get permission to send new Uops once at least 4 ROB or RS entries have become available.  So the RAT may be busy for 2 or even 3 cycles performing the register renaming, but it will only send the Uops in groups of 4.   Since there are 4 Uops in the loop, this means one cycle of sending Uops and two cycles not sending uops -- "stalled" in the sense of Events 0Eh and A2h,  and in agreement with the cycle stall measurements for these two events.

This leaves the cycles with retirement stalls as the only event without a quantitative story -- it is clear that Retirement can be bursty in this case, but it is not obvious why it should retire instructions in 2.6 out of 3 cycles and not in the other 0.4 cycles.  There are lots of mechanisms that might be at work:  fused micro-ops might cause different retirement behavior;  the Loop Stream Detector may be involved somehow.   The results might also vary from run to run, or vary with code alignment.   The main point is that since the code is only retiring instructions at 1/3 of the peak rate, burstiness should not be a surprise, so idle cycles should not be a surprise.

John D. McCalpin, PhD
"Dr. Bandwidth"

While thinking about these topics, it occurred to me that the explanation of why SSE and AVX Floating-Point instructions are overcounted may be hiding in here as well.

In Section 2.2.3 of the Performance Optimization Guide, the "scheduler" is described as the entity that "queues micro-ops until all source operands are ready.  Schedules and dispatches ready micro-ops to the available execution units [...]".    This has puzzled me, because for the SSE and AVX floating-point operations, Events 10h and 11h are supposed to count the various FP uops "executed", but they actually count values that are quite a bit higher if there are cache misses -- up to 6x higher than expected in cases that I have tested.

The explanation may be in the definition of "ready".   The scheduler has no idea whether the memory arguments to an instruction are in the cache or not, so its definition of "ready" is more likely to be related to the computability of the memory addresses, rather than the return of the actual data.   For FP instructions with register arguments, the renamer/scheduler can determine whether instructions have been issued that will define the values in the input registers to the FP instruction.  If any of these involve memory reads, the scheduler might declare the FP instruction arguments "ready" as soon as all the instructions that define the FP instructions inputs have been issued.  (The scheduler might reasonably wait four cycles (the minimum L1 cache latency) after issuing the last MOV instruction before declaring that the FP instruction's arguments are "ready", but that does not change the idea here.)   If the FP instruction has a memory argument as an input, the scheduler might declare that the FP instruction's arguments are "ready" as soon as any instructions that define the values of the registers used in computing the address of that memory argument have been issued.

In the event of L1 Data Cache hits, this allows instructions to flow straight through the pipeline at minimum latency.   In the event of L1 Data Cache misses, the hardware may attempt to execute the FP instruction only to discover that one (or more) of its inputs is not actually ready.  The instruction to *define* that input has been issued, but because of the L1 Data Cache miss, the corresponding data value has not arrived by the time that the FP instruction reaches the Port 0 or Port 1 functional unit.   So the instruction is *rejected* to be *retried* later.   As far as I can tell, Intel does not describe this process, which I first ran across in the design of the IBM POWER4 processor.  It certainly has to push the instruction back into the RS buffer (or perhaps it simply avoids removing it), but the details of what controls how long the processor will wait before retrying the instruction are not at all clear from available documentation.   It is likely to be based on how many other instructions are in the RS buffer targeting the same execution port, but probably has a minimum "wait" time as well.   I have not pursued modeling this because with a factor of up to 6 overcounting in the STREAM benchmark, the model of the overcounting would have to have errors of well under 15% to be able to recover the desired value (actual number of FP arithmetic operations that either completed execution or retired) to within 10%, and it does not seem likely that I would be able to generate an accurate enough model and populate it with good enough data to reliably obtain such small errors for arbitrary codes.  Validation of such a model would require detailed analysis of many codes, and even successful validation might not provide insight into cases for which the model would deliver significant errors.  Intel is likely to provide an "FP arithmetic instructions retired" counter at some point in the future, which would be even more helpful if the compiler were to provide an option to instrument the code to accumulate the nominal specified number of FP operations by basic block and by loop trip count.   When these values differed, it would be an indication that either the compiler was able to eliminate nominal specified operations or that the compiler expanded operations (such as replacing a divide loop with an iterative sequence of multiply-add instructions).  (The compiler could easily report instances in which either of these optimizations occurred, but you would have to run the code to get either the hardware counts or the aggregated totals for the compiler-inserted counts for the specific problem size and data set of interest.)

John D. McCalpin, PhD
"Dr. Bandwidth"

As there is no interdependency between those three vaddsd instruction they could be pipelined inside the  floating point SSE unit.One question more related to design puzzles me. Where are stored floating point number components like sign ,exponent and significand during the arithmetical calculations.Are these components  stored in physical register file thus decreasing the number of available physical registers used for register renaming or maybe they are stored in some kind of  buffers inside the floating point SSE execution unit?

Thank you all very much for your response - that was most insightful. It seems both Patric and John agree on the fact that the instructions can be retired (also dispatched/issued/delivered) in 'bursts', and not necessarily as I assumed - on the fly. If we assume that the pipeline structure John has outlined is accurate, then I guess what I wanted to find is an event that monitors the Execute step and tells me if at any point in time no uops are executed. Now I am not even sure that such an event can exist for a general, real-world program!

Before I go on:

>> You might try the other Umasks for Event A2h -- my guess is that most of the stalls will be associated with Umask 04h "Cycles stalled due to no eligible RS entry available". <<

This turned out to be correct - A2h04h and A2h01h give the same value.

I have performed a few more synthetic tests. First of all, I have added MORE stall monitoring events :) CYCLE_ACTIVITY.CYCLES_NO_DISPATCH, UOPS_DISPATCHED.THREAD stalls, IDQ_UOPS_NOT_DELIVERED.CORE.  I have checked that the obtained counters are the same when using AVX and SSE2. I have also considered 1, 2, and 3 adds in the loop body. In all these cases the per-iteration time is 3 cycles due to the latency of add and the inter-iteration data dependency. I have finally unrolled the loop 10 and 100 times. The numbers reported for unrolled runs are normalized to the 'basic' iteration, i.e., they are divided by the unroll factor.

Results are as follows:

AVX                                         3 adds        2 adds        1 add
--------------------------------------------------------------------------------
(0x0e,0x01) UOPS_ISSUED.ANY                    4           3             2
(0x0e,0x01) UOPS_ISSUED.ANY, stalls            2           2             2
(0xa2,0x01) RESOURCE_STALLS.ANY                2           2             2
(0xc2,0x01) UOPS_RETIRED.ALL, stalls           0.4         1-2           2
(0xa3,0x04) CYCLE_ACTIVITY.CYCLES_NO_DISPATCH  0.17        0.6-4 (!)     4  
(0xb1,0x01) UOPS_DISPATCHED.THREAD stalls      0.08        0.2-1 (!)     1  
(0x9c,0x01) IDQ_UOPS_NOT_DELIVERED.CORE        0           1             2
(0xa8,0x01) LSD.UOPS                           12          9             6

AVX    unroll 10                            3 adds        2 adds        1 add
--------------------------------------------------------------------------------
(0x0e,0x01) UOPS_ISSUED.ANY                    3.1        2.1            1.1
(0x0e,0x01) UOPS_ISSUED.ANY, stalls            2.3        2.4            2.7
(0xa2,0x01) RESOURCE_STALLS.ANY                2.3        2.4            2.7
(0xc2,0x01) UOPS_RETIRED.ALL, stalls           0.75       1              2
(0xa3,0x04) CYCLE_ACTIVITY.CYCLES_NO_DISPATCH  0.14       3.8            7.6
(0xb1,0x01) UOPS_DISPATCHED.THREAD stalls      0.05       1              2
(0x9c,0x01) IDQ_UOPS_NOT_DELIVERED.CORE        0          0.3            0.1
(0xa8,0x01) LSD.UOPS                           0          10             11

AVX    unroll 100                           3 adds        2 adds        1 add
--------------------------------------------------------------------------------
(0x0e,0x01) UOPS_ISSUED.ANY                    3          2              1
(0x0e,0x01) UOPS_ISSUED.ANY, stalls            2.3        2.5            2.75
(0xa2,0x01) RESOURCE_STALLS.ANY                2.3        2.5            2.75
(0xc2,0x01) UOPS_RETIRED.ALL, stalls           0.9        1              2
(0xa3,0x04) CYCLE_ACTIVITY.CYCLES_NO_DISPATCH  0.14       4              8
(0xb1,0x01) UOPS_DISPATCHED.THREAD stalls      0.05       1              2
(0x9c,0x01) IDQ_UOPS_NOT_DELIVERED.CORE        0          0              0
(0xa8,0x01) LSD.UOPS                           0          0              0

 

I can now confidently say that the only counter that gives obvious results is UOPS_ISSUED.ANY (uff!!). Well, maybe LDS.UOPS is also not doing too bad ;)

 

The closest I can get to my Execute step stalls, I guess, is the UOPS_DISPATCHED.THREAD stalls counter (0xb1 0x01). It works well for the unrolled cases, but not for the basic loop. I guess here the loop jump instruction is dispatched in a different cycle than the add operation, which shows as fewer stalls. What is really puzzling about this counter for the basic loop is its high variability when testing 2 add instructions. The same can be said about the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH. Both counters show really large variations with no impact on the overall performance. Moreover, this variability depended on which add instruction I commented out, i.e., which registers were used. That is a bit terrible - for some configurations the results were consistently more stable than for others. That would require more tests, as the behavior is not really intuitive. Finally, except for the basic loop, both UOPS_DISPATCHED.THREAD stalls CYCLE_ACTIVITY.CYCLES_NO_DISPATCH seems to give predictable results and corresponds well to the value I would like to read. In fact it seems like those are really the same counters that differ by a factor of 4.

UOPS_ISSUED.ANY, stalls and RESOURCE_STALLS.ANY grow with the unroll factor, and are larger in cases of fewer add instructions. I guess the values for the unrolled tests are easy to explain now, after Johns comments. Let's take the 1 add instruction case, 100 times unrolled. 2.75 issue stall cycles means that there is on average 0.25 active cycle for every 3 cycles, or in other words 1 active issue cycle every 12 clocks. 1 add instruction takes 3 cycles, therefore in 12 cycles there are 4 add instructions - which is exactly one full uops issue. The same considerations for 2 add instructions yield 0.5 active issue cycles every 3 cycles, or 1 issue cycle every 6 clocks - this is exactly 4 add uops. For 3 adds we have on average 3 issue cycles every 12 cycles, which again fits the 12 instructions executed in 12 cycles. So it seems that the issue logic tries to use the entire 4-uop issue bandwidth, with stall cycles in between.

UOPS_RETIRED.ALL, stalls behaves strangely - for the unrolled case and 3 adds it seems to converge to 1, being a predictable 1 and 2 for 2 adds and 1 add, respectively. I guess no bursts in retirement in these cases for some reason.

I am not sure yet ;) but I think these excerises and your replies help me understand this a bit better now. Still, some things are not clear: as John noted, the speed of RAT seems to be too high. Also, on a more phylosophical note - how can I use those counters, if their interpretation is so complicated?? :)

 

Patric, the only thing similar to your suggestion that I have found is the following description found in the Oprofile documentation (http://oprofile.sourceforge.net/docs/intel-sandybridge-events.php)

idq_uops_not_delivered

uops not delivered to IDQ.

0x01: (name=core) Count number of non-delivered uops to Resource Allocation Table (RAT).
0x01: (name=cycles_0_uops_deliv.core) Counts the cycles no uops were delivered
0x01: (name=cycles_le_1_uop_deliv.core) Counts the cycles less than 1 uops were delivered
0x01: (name=cycles_le_2_uop_deliv.core) Counts the cycles less than 2 uops were delivered
0x01: (name=cycles_le_3_uop_deliv.core) Counts the cycles less than 3 uops were delivered
0x01: (name=cycles_ge_1_uop_deliv.core) Cycles when 1 or more uops were delivered to the by the front end.
0x01: (name=cycles_fe_was_ok) Counts cycles FE delivered 4 uops or Resource Allocation Table (RAT) was stalling FE.

 
 
 
 

It would seem this event analyzes the instruction bandwidth from the Instruction Decode Queue to RAT. Do I understand correctly that this stage takes place is this before the 'uops_issued' in the pipeline?

This is the event I have used in the second round of tests. What is interesting is that it reads all 0 in the unrolled cases, which means that there is a constant flow of instructions from IDQ to RAT. It is not clear to me how to trigger different functionality of the counter (all masks are the same, and Cmask/inv settings are not explained). In the Intel manual I find this:

9CH 01H IDQ_UOPS_NOT_DELIVERED.CORE  Count number of non-delivered uops to RAT perthread. Use Cmask to qualify uop b/w.

The above description is a bit cryptic. How do I get all the counts described by Oprofile documentation?

 

On a simple test case (dependent pointer chasing), I have found that Event A2h, Umask 01h "RESOURCE_STALLS.ANY" tracks the idle cycles correctly.   I.e., if my code report 245 cycles for the memory latency, RESOURCE_STALLS.ANY reports 244 stall cycles per load.   This is not a big surprise, since the there is only one instruction capable of executing at any given time.

One problem with trying to come up with a "stall cycle" metric is that there are too many degrees of freedom -- many of the stalls are irrelevant because they are hidden behind other stalls, and many of the stalls are irrelevant because they are effects rather than causes.

What you want is a tool that looks for delays in the critical path.  Unfortunately the part that is critical can move from one functional unit to another very quickly.   Assuming that you have found a delay on the critical path, a direct "fix" may or may not help, depending on what other stalls were hiding behind the one you are concerned about.

For most of the codes that I look at, the ultimate limit to performance is memory access, with loads being a more common limit than stores.  So I would probably start by looking at how many cycles had uops issued to port 2 and port 3 (Event A1 Umask 0C and Event A1 Umask 30).    You can probably invert these counters to get cycles in which no operations were issued, but since that is the same as subtracting from total cycles, it is not particularly helpful.   It would be very interesting to know how many cycles have uops issued to both ports 2 and 3 and how many cycles have uops issued to neither.  It is not inconceivable that you could get the latter by trying Event A1 Umask 3C and setting the Inverse bit, but a lot of these bit mask combinations don't do what you might expect, so it would take some testing....

John D. McCalpin, PhD
"Dr. Bandwidth"

Leave a Comment

Please sign in to add a comment. Not a member? Join today