Hi all, the UOPS_ISSUED.ANY rate/sec is normally higher than the UOPS_RETIRED.ANY rate/sec. Is that always the case? In one test I am observing more retired uops than issued uops. I just want to understand in what situation that could happen. Thanks

UOPS_ISSUED.ANY counts uops that have already been decoded from instructions and are issued by the Resource Allocation Table (RAT) into the Reservation Station (RS), where they wait until they are ready to execute.

UOPS_RETIRED.ANY counts all retired uops (executed and then retired).

A uop can be issued but then not executed, or executed but not retired.

>>>In one test I am observing that retired UOPs is more than issued UOPs.>>>

Maybe this is counting instructions that are decoded into more than one uop. Could it also be counting microcode assists?

I checked the optimization manual. It says: A "per thread" measurement can be made from the difference between the uops issued and uops retired as the latter two of the above events can be counted per thread. It over counts slightly, by the mispredicted uops that are eliminated in the RS before they can waste cycles being executed, but this is usually a small correction.
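To make the manual's "difference between the uops issued and uops retired" concrete, here is a minimal sketch in Python. The function name and the counter readings are hypothetical (they are not from the manual or from any profiling tool), and both counts are assumed to come from the same thread over the same interval.

```python
def wasted_uop_fraction(uops_issued: int, uops_retired: int) -> float:
    """Fraction of issued uops that never retired.

    A rough proxy for speculative waste. A negative result means more
    uops retired than were issued, which points at a counting-domain
    difference rather than at wasted work.
    """
    if uops_issued <= 0:
        raise ValueError("uops_issued must be positive")
    return (uops_issued - uops_retired) / uops_issued


# Hypothetical counter readings, made up for illustration.
issued = 1_050_000
retired = 1_000_000
print(f"wasted fraction: {wasted_uop_fraction(issued, retired):.3%}")
```

If the result comes out negative, as in the test described in the original question, the simple "issued minus retired" interpretation does not apply to that workload.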

Generally speaking, UOPS_ISSUED >= UOPS_RETIRED. However, if some uops are fused first, the number of dispatched uops is reduced, but shouldn't the number of retired uops be the same as the original (unfused) count?

That makes more sense once you take uop fusion into account.

Best Reply

Agner Fog provides a good discussion of "micro-op fusion" in the "microarchitecture.pdf" document at http://www.agner.org/optimize/

He documents the way that uops are counted in the "fused domain" (which applies to uop issue) and in the "unfused domain" (which applies to uop retirement).  Starting with the Pentium M, memory writes (stores) and "read-modify" operations (i.e., arithmetic or logical operations that include an operand from memory) are subject to micro-op fusion ("microarchitecture.pdf", pp 89-91), and he reports that subsequent processors use the same approach on the same types of operations. 

In the document "instruction_tables.pdf" at the same location, he provides detailed tables of the uop counts of each instruction for each processor generation, including counts in both the "fused domain" and the "unfused domain".  Whenever the number of uops in the unfused domain is greater than the number of uops in the fused domain, there will be an increase in retired uops relative to issued uops.
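As a rough illustration of that counting model, the sketch below sums fused-domain and unfused-domain uop counts over a small instruction mix. The per-instruction counts and the mix are placeholder values chosen for illustration, not figures copied from "instruction_tables.pdf"; look up the real numbers for your processor generation there.

```python
# Illustrative (fused_domain, unfused_domain) uop counts per instruction form.
# Placeholder values only; consult instruction_tables.pdf for real ones.
UOP_COUNTS = {
    "add r,r": (1, 1),    # register-only ALU op: no fusion involved
    "add r,[m]": (1, 2),  # ALU op with a memory operand: load + ALU, fused
    "mov [m],r": (1, 2),  # store: store-address + store-data, fused
}


def count_uops(mix):
    """Return (issued, retired) for a mix of {instruction form: count}.

    Issued is counted in the fused domain and retired in the unfused
    domain, following the description in this thread.
    """
    issued = sum(UOP_COUNTS[name][0] * n for name, n in mix.items())
    retired = sum(UOP_COUNTS[name][1] * n for name, n in mix.items())
    return issued, retired


# Hypothetical instruction mix.
mix = {"add r,r": 500, "add r,[m]": 300, "mov [m],r": 200}
issued, retired = count_uops(mix)
print(f"issued={issued} retired={retired}")
```

With these placeholder numbers the mix retires more uops than it issues, purely because of the difference between the two counting domains.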

As noted previously, the relative count is shifted in the other direction whenever speculatively issued uops are not retired.  It seems likely that either effect could be larger depending on the details of the code under test.
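The two opposing effects can be put side by side in a toy model. This is only a sketch under stated assumptions: the function and parameter names are hypothetical, the numbers are invented, and it ignores everything except fusion and discarded speculative uops.

```python
def issued_minus_retired(fused_on_path, unfused_on_path, squashed_fused):
    """Toy model of the two opposing effects.

    fused_on_path:   uops issued on the correct path (fused domain)
    unfused_on_path: the same work counted at retirement (unfused domain)
    squashed_fused:  speculatively issued uops that never retire
    """
    issued = fused_on_path + squashed_fused
    retired = unfused_on_path
    return issued - retired


# Invented numbers: lots of fused memory ops, few mispredicted branches.
store_heavy = issued_minus_retired(1000, 1500, 50)
# Invented numbers: little fusion, many mispredicted branches.
branchy = issued_minus_retired(1000, 1050, 300)
print(store_heavy, branchy)
```

A negative result corresponds to the case in the original question (retired exceeds issued); a positive result is the more familiar case where speculation dominates.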

To make things more confusing, there are also counters for "uops dispatched" (to the execution ports), and these can be much larger than either uops issued or uops retired due to instruction retries.  Instructions are retried if their input operands are not "ready" when they are dispatched.  The most common cause of this is L1 Data Cache misses.   It is pretty easy to come up with test cases in which instructions are dispatched an average of 4 times or more, and I have seen cases with an average of 12 dispatches per retired uop.  (There is no reason to believe this to be an upper bound -- it just happened to be the largest value I saw in a particular set of tests with lots of cache misses to memory locations located behind the other socket in a two-socket system.)
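The "average dispatches per retired uop" figure is just a ratio of two counter totals. A minimal sketch, with a hypothetical function name and made-up counter values chosen to reproduce the factor of 12 mentioned above:

```python
def dispatches_per_retired_uop(uops_dispatched: int, uops_retired: int) -> float:
    """Average number of times a uop was sent to an execution port
    for each uop that eventually retired."""
    if uops_retired <= 0:
        raise ValueError("uops_retired must be positive")
    return uops_dispatched / uops_retired


# Made-up totals for a cache-miss-heavy run.
ratio = dispatches_per_retired_uop(12_000_000, 1_000_000)
print(f"{ratio:.1f} dispatches per retired uop")
```

A ratio well above 1 suggests heavy retrying, which is consistent with the cache-miss explanation given in this post.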

If I understand the descriptions correctly, "macro-op fusion" will not cause "uops issued" to be larger than "uops retired" because these cases remain fused through execution and retirement.  The standard example here is the fusion of the compare and branch operations at the bottom of a loop.

John D. McCalpin, PhD
"Dr. Bandwidth"
