PMC acronyms: what do they stand for?

perfwise

Hi, I am trying to understand the purpose of the PMC events in the Intel System Guide 2 so that I can measure performance on my Sandy Bridge CPU. Can someone tell me what the following stand for:

IQ (as in PMC 87H)
PMH (as in PMC 85H)
MITE (as in PMC 79H)
DSB (as in PMC 79H)
PBS (as in PMC C2H)
LBR (as in PMC CCH)

Also, looking at the definition of PMC F0H mask 20H, L2_TRANS.L2_FILL: are these fills to the D$ or I$ that are serviced by the L2? Similarly for PMC F0H mask 40H, L2_TRANS.L2_WB: are these writebacks to the L2 from either the D$ or the I$?

Lastly, if there is any outline of where these resources sit in the Sandy Bridge pipeline, that would also be useful. Thanks.

Patrick Fay (Intel)

Hello perfwise,

Refer to Intel 64 and IA-32 Architectures Optimization Reference Manual, http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

IQ = instruction queue, see Figure 2-1, page 40 (labeled 'Instr Queue' in the figure)

PMH is Page Miss Handler. page 73

MITE = Legacy decode pipeline. This is the top of figure 2-1 from '32K L1 Instruction Cache' to 'Decoders'. See also the figure in section B.3.7.1 (page 683), where it is called the Legacy decode pipeline.

DSB is Decode Stream Buffer. This is also the 'Decoded Icache' in the section B.3.7.1 figure and '1.5K uOP Cache' in figure 2-1.

PBS... I assume you mean PEBS. PEBS is Precise Event Based Sampling. See section B.2.3.

LBR = Last Branch Record (or Register). See section B.2.3.4

For L2_TRANS.L2_FILL, this will take more digging, but I expect the fills can be due to either code (I$) misses or data (D$) misses.

For L2_TRANS.L2_WB, this should be entirely D$ (dcache). WB (writebacks) only occur when a cacheline in L2 is modified and needs to be evicted to memory. Code is usually read-only (unless you have self-modifying code), so code should not be getting 'written back' to memory.

Hope this helps,
Pat

perfwise

Pat, thank you for your very thorough response. One more question, on the writeback policy of the L1 and L2 and the inclusive nature of the L3. Data in the L3 is guaranteed to be in "either" the L1 or the L2, correct (because it's inclusive). Correct? When data is modified within the data cache and then written back to the L2, at what point is it written to the L3? Is it written when evicted from the L2 (I presume so, since the L2 is writeback)? Until that eviction, the copy in the L3 is not current with the modified copy in the L1 and L2, correct? Thanks

Patrick Fay (Intel)

Hello perfwise,
On Sandy Bridge, the L2 can be characterized as 'non-inclusive, non-exclusive'.
See Table 2-5 of the Optimization guide (URL in previous reply) for the cache policies and characteristics by cache level.
You asked "Data in the L3 is guaranteed to be in "either" the L1 or the L2, correct (because it's inclusive). Correct?"
Yes, sort of...
The cacheline can be in L1 and L2 and L3 or
The line can be in L1 and not in L2 but always in L3 or
The line can be in L2 and not in L1 but always in L3 or
The line can be only in L3 (not in L1 nor in L2).
A modified line in the L1 will be written back to the L2 if the L2 has a copy of the line or, if the line isn't in the L2, the line can be written back directly to the L3.
If the modified line is written back to the L2, then the line won't be written back to the L3 unless the line is evicted from the L2 or the line is requested by another core.
The L3 keeps track of which core has the line. When the L3 gets a request for that line it checks that core's L1 for the line and then it checks the L2 for that line. If neither the L1 nor L2 have the line then the L3 copy is the most current.
Pat
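
The placement cases and write-back rules in the reply above can be condensed into a tiny model. This is only an illustrative sketch of the wording in that post, not a description of the real hardware logic; the function names are made up for the example.

```python
def writeback_target(line_in_l2: bool) -> str:
    """Destination of a modified line written back from the L1, per the post:
    the L2 if it already holds a copy, otherwise directly the L3."""
    return "L2" if line_in_l2 else "L3"


def most_current_copy(in_l1: bool, in_l2: bool) -> str:
    """Level holding the current copy when the L3 services a request.

    The L3 is inclusive, so the line is always present there. The owning
    core's L1 is checked first, then its L2; if neither has the line,
    the L3 copy is the most current one.
    """
    if in_l1:
        return "L1"
    if in_l2:
        return "L2"
    return "L3"


if __name__ == "__main__":
    # Enumerate the four placement cases listed in the post.
    for in_l1 in (True, False):
        for in_l2 in (True, False):
            print(f"L1={in_l1!s:5} L2={in_l2!s:5} -> current copy in "
                  f"{most_current_copy(in_l1, in_l2)}")
```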

perfwise

Pat,
I'm looking at the L1D hit and miss rates, or at least trying to determine what they are. There's very little documentation about how to measure them.. but I've found some documentation on the internet stating:

PMC 43 MASK 1 = ALL L1D req
PMC 43 MASK 2 = ALL L1D req - cacheable

PMC 51 MASK 1 = ALL L1D fills
PMC 51 MASK 2 = ALL L1D fills in modified state (accesses which modify the line requested?)
PMC 51 MASK 8 = ALL L1D evictions of modified data

Is there a PMC to measure miss rates from the L1D cache? I see that PMC 48 MASK 2 measures something related to L1D misses, but I'm confused by the documentation here:

http://software.intel.com/sites/products/documentation/hpc/amplifierxe/e...

What does this last PMC measure specifically? Once a miss to the L1D occurs, an FP allocation occurs and this PMC is "incremented" by the number of allocations currently outstanding? Is this what this PMC measures? If so, it covers both cacheable demand requests and HW prefetch requests.

Any help in clarifying how to measure L1D requests, L1D hits, L1D misses and L1D writebacks (PMC 28 MASK F) would be very helpful.

Thanks

perfwise

Pat,
Following up, the cache protocol used is MESI. All "Invalid" requests are actually misses to the cache, correct? If so, then to tabulate the misses to the L1D you would only need to add:

L1D LD's in I state: PMC 40 MASK 1
+
L1D ST's in I state: PMC 41 MASK 1
+
L1D WriteBacks in I state: PMC 28 MASK 1

Is this correct? Also, could you clarify what an L1D writeback in I state is? I can envision the load and store cases, which in MESI are misses to a cache that doesn't have the requested line allocated. What is the latter: a writeback of data from the L1 to the L2 that's invalid?

Thanks

Patrick Fay (Intel)

Hello perfwise,
Event 0x43 is not in the SDM. Usually when an event is not in the SDM it means that an issue was found with the event or the event was not tested.
So I can't comment on event 0x43.
You've got event 0x51 pretty well characterized.
Usually there are 2 main ways to characterize misses.
We can use HitPerUop (hits per uop (micro-op)) which also permits comparing different sections of code.
Hit ratio is = 'hits / accesses' and tells you the percentage of accesses which hit the cache.

Miss = total accesses - total hits.

For the L1D Load Hit Ratio, you can use:
MEM_LOAD_UOPS_RETIRED.L1_HIT/MEM_UOPS_RETIRED.ALL_LOADS

The L1D Load Miss Ratio is:
(MEM_UOPS_RETIRED.ALL_LOADS - MEM_LOAD_UOPS_RETIRED.L1_HIT)/MEM_UOPS_RETIRED.ALL_LOADS

I don't see a MEM_STORE_UOPS_RETIRED.L1_HIT event so we apparently can't compute a L1D store hit ratio on Sandy Bridge.

I'm pretty sure the PMC event 0x51 (L1D.*) counts lines fetched due to prefetchers.
I ran a simple test looping over a 64KB array (so every cache line misses) and MEM_UOPS_RETIRED.ALL_LOADS was very close to L1D.REPLACEMENT. Prefetchers were enabled, so I conclude that L1D.REPLACEMENT counts lines fetched into the L1 due to demand and/or prefetchers.

You can compute a 'L1D load Hits per Uop' with:
MEM_LOAD_UOPS_RETIRED.L1_HIT/UOPS_RETIRED.ALL

The SDM only has 2 subevents for PMC 28 (L2_L1D_WB_RQSTS.*), umasks 0x4 and 0x8. I'm not sure these will be helpful for measuring L1 hits/misses.

Hope this helps,
Pat

Patrick Fay (Intel)

For the MESI question...
PMC 40 mask 1, PMC 41 mask 1, and PMC 28 mask 1 are not in the SDM.

I think you can calculate lots of info for the L1D with the 4 events:

MEM_UOPS_RETIRED.ALL_LOADS (PMC 0xD0 umask=0x81)
MEM_UOPS_RETIRED.ALL_STORES (PMC 0xD0 umask=0x82)
MEM_LOAD_UOPS_RETIRED.L1_HIT (PMC 0xD1 umask=0x1)
L1D.REPLACEMENT (PMC 0x51 umask=0x1)

Then you can compute:

%loads= 100.0*MEM_UOPS_RETIRED.ALL_LOADS/(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES)

%stores= 100.0*MEM_UOPS_RETIRED.ALL_STORES/(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES)

%L1D_load_hit = 100.0*(MEM_LOAD_UOPS_RETIRED.L1_HIT)/MEM_UOPS_RETIRED.ALL_LOADS

%L1D_load_miss = 100.0*(MEM_UOPS_RETIRED.ALL_LOADS - MEM_LOAD_UOPS_RETIRED.L1_HIT)/MEM_UOPS_RETIRED.ALL_LOADS

%L1D_store_hit= min(100.0, 100.0*(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES - MEM_LOAD_UOPS_RETIRED.L1_HIT - L1D.REPLACEMENT)/MEM_UOPS_RETIRED.ALL_STORES)

%L1D_store_miss= 100.0*zeroifneg(-MEM_UOPS_RETIRED.ALL_LOADS + MEM_LOAD_UOPS_RETIRED.L1_HIT + L1D.REPLACEMENT)/MEM_UOPS_RETIRED.ALL_STORES

The last 2 equations have caps to keep them between 0 and 100. I've seen the last 2 equations be 'out of bounds' by up to 5%.

Hope this helps,
Pat
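
The six percentages above translate directly into a few lines of code. The sketch below is a minimal illustration, assuming the four raw counts were collected over the same interval and that the load and store counts are nonzero; the function name, the dictionary keys and the sample counts are invented for the example.

```python
def l1d_metrics(all_loads, all_stores, l1_hit, l1d_replacement):
    """Derive the L1D percentages from the post out of four raw counts.

    all_loads        -> MEM_UOPS_RETIRED.ALL_LOADS
    all_stores       -> MEM_UOPS_RETIRED.ALL_STORES
    l1_hit           -> MEM_LOAD_UOPS_RETIRED.L1_HIT
    l1d_replacement  -> L1D.REPLACEMENT
    """
    total = all_loads + all_stores
    return {
        "loads_pct": 100.0 * all_loads / total,
        "stores_pct": 100.0 * all_stores / total,
        "L1D_load_hit_pct": 100.0 * l1_hit / all_loads,
        "L1D_load_miss_pct": 100.0 * (all_loads - l1_hit) / all_loads,
        # Capped at 100, as in the post.
        "L1D_store_hit_pct": min(
            100.0, 100.0 * (total - l1_hit - l1d_replacement) / all_stores
        ),
        # zeroifneg() from the post is written as max(0, ...).
        "L1D_store_miss_pct": 100.0
        * max(0, -all_loads + l1_hit + l1d_replacement)
        / all_stores,
    }


if __name__ == "__main__":
    # Made-up counts, not measurements.
    for name, value in l1d_metrics(1000, 500, 900, 150).items():
        print(f"{name:20s} {value:8.3f}")
```

The store figures are inferred rather than measured: lines replaced in the L1D that are not explained by load misses are attributed to stores, which is why the post caps them.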

perfwise

Pat, thanks for the very detailed and informative response. FYI, I have now used PMCs 28, 40, 41 and 42, and their counts match the expected values pretty well, so at least they appear to be reliable. I measured this in pointer-chasing, read/write and copy tests across L1-hit and L1-miss scenarios.

One interesting observation: using PMC 9C with MASK 01 (IDQ_UOPS_NOT_DELIVERED.CORE), I get ~3.5-3.6 uops per cycle in a very optimized piece of code, while PMC C2 MASK 01 reports 3.15 uops/clock. For the former PMC I'm using the CMASK to get the distribution of uops delivered to the core (I'm also doing this for the MS, MITE, DSB and LSD sources of uops). The uops coming out of the MS, MITE, DSB and LSD match what is reported from the UopQ -> ROB, but these two differ from the retired count. This isn't code that is prone to lots of speculation either; at least I wouldn't expect it to be, since it's a highly optimized GEMM implementation. Any ideas as to why I'd observe this discrepancy? Thanks.

Patrick Fay (Intel)

Hello perfwise,
Sorry to take so long to reply. I had to do some other work.

Can you try the technique in Section B.3.2 of the SDM optimization guide, "Locating Stalls in the Microarchitecture Pipeline"?
The technique uses the 3 Sandy Bridge events:
IDQ_UOPS_NOT_DELIVERED.CORE
UOPS_ISSUED.ANY
UOPS_RETIRED.RETIRE_SLOTS

This lets us break down the stalls in the pipeline.

For instance, for a memory latency test (array size 40 MBs), the pipeline should be stalled on the backend waiting for memory. The breakdown shows:
%FE_Bound 0.71%
%Bad_Speculation 0.03%
%Retiring 0.98%
%BE_Bound 98.27%

For a memory latency test (array size 4 KB), the pipeline is still mostly stalled waiting on loads. The latency program is just a big unrolled loop of nothing but dependent linked-list loads.
%FE_Bound 0.091%
%Bad_Speculation 0.002%
%Retiring 6.619%
%BE_Bound 93.288%

If I do a memory read bandwidth test and shorten the array size to fit in L1D (down to 4KB) then I get the result below. For the read bw test, I just do a touch of each 64 byte cache line. The out-of-order pipeline is able to figure out the next load so that lots of loads are underway at the same time.
%FE_Bound 0.414%
%Bad_Speculation 0.003%
%Retiring 99.539%
%BE_Bound 0.044%

If I do a memory read bandwidth test with an array size of 40MBs I get the results below. Now the prefetchers can work effectively and bring the data quickly enough into L1D so that we still retire (relatively) a lot of uops (compared to memory latency test where we were 98% BE_bound).
%FE_Bound 0.907%
%Bad_Speculation 0.135%
%Retiring 11.023%
%BE_Bound 87.935%

Pat
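
The four percentages reported above come from the slot-based breakdown in Section B.3.2. The sketch below is a hedged reconstruction rather than the tool used in the post: the %Retiring formula is the one quoted later in this thread, while the %FE_Bound and %Bad_Speculation formulas, including the INT_MISC.RECOVERY_CYCLES term, are written from my recollection of the manual and should be checked against the current text. The function name and the sample counts are invented.

```python
def pipeline_breakdown(clk, idq_uops_not_delivered, uops_issued,
                       retire_slots, recovery_cycles=0, width=4):
    """Split the issue slots into the four categories used in the post.

    clk                     -> CPU_CLK_UNHALTED.THREAD
    idq_uops_not_delivered  -> IDQ_UOPS_NOT_DELIVERED.CORE
    uops_issued             -> UOPS_ISSUED.ANY
    retire_slots            -> UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles         -> INT_MISC.RECOVERY_CYCLES (assumed term, see lead-in)
    """
    slots = width * clk  # N = 4 * cycles on a 4-wide machine
    fe_bound = 100.0 * idq_uops_not_delivered / slots
    bad_spec = 100.0 * (uops_issued - retire_slots
                        + width * recovery_cycles) / slots
    retiring = 100.0 * retire_slots / slots
    # Whatever is left over is attributed to the back end.
    be_bound = 100.0 - (fe_bound + bad_spec + retiring)
    return {
        "%FE_Bound": fe_bound,
        "%Bad_Speculation": bad_spec,
        "%Retiring": retiring,
        "%BE_Bound": be_bound,
    }


if __name__ == "__main__":
    # Made-up counts, not measurements.
    for name, pct in pipeline_breakdown(1000, 400, 2100, 2000).items():
        print(f"{name:18s} {pct:7.3f}")
```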

perfwise

As always, thanks for the response, but I'd like to back up.
PMC 9C, IDQ_UOPS_NOT_DELIVERED, as stated in B53, B.3.7.1, counts the # of uops not delivered from the UopQ to the Retire/Rename/ROB stage every cycle. A couple of questions:

* When I use a CNT mask of 4, I am measuring the # of cycles where 0 uops were delivered, right?
* When I use a CNT mask of 3, I am measuring the # of cycles where 0 or 1 uops were delivered, right?

Likewise for CNT masks of 2 and 1, and from these I can get the distribution of uops dispatched from the UopQ. I can get the following:

# clocks where 0 uops: IDQ_UOPS_NOT_DELIVERED w/CMASK=4
# clocks where 1 uops: IDQ_UOPS_NOT_DELIVERED w/CMASK=3 - IDQ_UOPS_NOT_DELIVERED w/CMASK=4
# clocks where 2 uops: IDQ_UOPS_NOT_DELIVERED w/CMASK=2 - IDQ_UOPS_NOT_DELIVERED w/CMASK=3
# clocks where 3 uops: IDQ_UOPS_NOT_DELIVERED w/CMASK=1 - IDQ_UOPS_NOT_DELIVERED w/CMASK=2

Am I correct in the logic above? I am comparing the "uops per clock" from this distribution with that of the retired uops in PMC C2, and they differ by a wide margin. This is the problem I have, and it is why I'm asking whether my statements above regarding the behavior of PMC 9C make sense. I'm using UNIT MASK=0x1.

Patrick Fay (Intel)

Hello Perfwise,
You can look here (http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/snb/events/idq_uops_not_delivered.html or google IDQ_UOPS_NOT_DELIVERED and then take the software.intel.com link) for a description of the event.
A cmask of 4 has the definition: Cycles per thread when 4 or more uops are not delivered to Resource Allocation Table (RAT).
So yes, if I've got all my negations correct then your question "when I use a CNT mask of 4, I am measuring the # of cycles where 0 uops were delivered" is correct.
The next question (cmask=3 means 1 or less uops delivered) is correct as well.

If cmask is 0 then you just get "Uops not delivered to Resource Allocation Table (RAT) per thread".

I think your "# clocks where ..." logic is ok as long as you are measuring event over the same interval.

So I'd expect that (assuming you are measuring everything over same # cycles):
EQN_1 = (count_of_1_retiring + 2 * count_of_2_retiring + 3 * count_of_3_retiring + 4 * count_of_4_retiring) is = retired_uops.all
and where
count_of_4_retiring = IDQ_UOPS_NOT_DELIVERED w umask=0x1, cmask=0x1, invert bit=0x1
count_of_3_retiring = IDQ_UOPS_NOT_DELIVERED w/CMASK=1 - IDQ_UOPS_NOT_DELIVERED w/CMASK=2
count_of_2_retiring = IDQ_UOPS_NOT_DELIVERED w/CMASK=2 - IDQ_UOPS_NOT_DELIVERED w/CMASK=3
count_of_1_retiring = IDQ_UOPS_NOT_DELIVERED w/CMASK=3 - IDQ_UOPS_NOT_DELIVERED w/CMASK=4

How close is EQN_1 to being equal?
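
For readers programming these variants by hand, the umask, cmask and invert settings discussed above correspond to fields of the IA32_PERFEVTSELx register: event select in bits 0-7, unit mask in bits 8-15, invert in bit 23 and counter mask in bits 24-31. The sketch below only packs those four fields; the helper name is hypothetical, and the enable, USR and OS bits are deliberately left out because they depend on how the counter is being driven.

```python
def perfevtsel(event, umask, cmask=0, inv=False):
    """Pack event select, unit mask, invert and counter mask into one value.

    Only these four fields are encoded; enable/USR/OS bits are not set here.
    """
    for name, value in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits, got {value:#x}")
    return event | (umask << 8) | (int(inv) << 23) | (cmask << 24)


if __name__ == "__main__":
    EVENT, UMASK = 0x9C, 0x01  # IDQ_UOPS_NOT_DELIVERED.CORE
    # Cycles with at least n uops NOT delivered, for n = 4..1.
    for n in (4, 3, 2, 1):
        print(f"cmask={n}:        {perfevtsel(EVENT, UMASK, cmask=n):#010x}")
    # Cycles with fewer than 1 uop not delivered (cmask=1, invert=1).
    print(f"cmask=1, inv=1: {perfevtsel(EVENT, UMASK, cmask=1, inv=True):#010x}")
```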

perfwise

If you do as you say above and get a distribution, you can determine the uops per clock and then compare with PMC C2 (Uops.retired); you will not get the same uops per clock. I'm somewhat troubled by this. I'm running a simple L2 bandwidth read test, getting about 1.5 upc from PMC C2 and, by contrast, >3 upc from IDQ_UOPS_NOT_DELIVERED. So either something isn't making sense in the definition of this PMC or my understanding is flawed.

Patrick Fay (Intel)

We are basically talking about following SDM Optimization manual section B.3.7.1.

B.3.7.1 does it slightly differently:

cycles_DELIVER.1UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE

cycles_DELIVER.2UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE

cycles_DELIVER.3UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE

cycles_DELIVER.4UOPS = CPU_CLK_UNHALTED.THREAD - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE

I think that the right hand side of each eqn can be figured out from the umask/cmask info in http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/snb/events/idq_uops_not_delivered.html
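
The subtractions above can be wrapped in a small helper that turns the cumulative "at most N uops delivered" counts into a per-cycle distribution and an average. This is a sketch under the assumption that all five counts were collected over the same interval; the function name and the sample counts are invented for illustration.

```python
def delivery_distribution(clk, cycles_0, cycles_le_1, cycles_le_2, cycles_le_3):
    """Turn cumulative IDQ_UOPS_NOT_DELIVERED cycle counts into a distribution.

    clk          -> CPU_CLK_UNHALTED.THREAD
    cycles_0     -> ...CYCLES_0_UOPS_DELIV.CORE
    cycles_le_N  -> ...CYCLES_LE_N_UOP_DELIV.CORE
    Returns (cycles per uop count, percent of cycles, average uops per cycle).
    """
    cycles = {
        0: cycles_0,
        1: cycles_le_1 - cycles_0,
        2: cycles_le_2 - cycles_le_1,
        3: cycles_le_3 - cycles_le_2,
        4: clk - cycles_le_3,
    }
    percent = {n: 100.0 * c / clk for n, c in cycles.items()}
    average = sum(n * c for n, c in cycles.items()) / clk
    return cycles, percent, average


if __name__ == "__main__":
    # Made-up counts, not measurements.
    cycles, percent, average = delivery_distribution(1000, 100, 150, 300, 400)
    for n in sorted(cycles):
        print(f"{n} uops: {cycles[n]:5d} cycles ({percent[n]:5.1f}%)")
    print(f"average uops per cycle: {average:.3f}")
```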

perfwise

Pat, this is what I'm doing. Unfortunately, the uops-retired distribution (PMC C2) and the one reported by IDQ_UOPS_NOT_DELIVERED (PMC 9C), which I am determining using formulas like those above, differ by a wide margin. See below for 3 separate tests you can run: L1 read (all reads, from the L1D), L2 read (all reads, from the L2) and L3 read (all reads, from the L3).

L1 read:
========
Upc at Retire 2.9168
  - % of clk 8 uop Ret 0.0000
  - % of clk 7 uop Ret 0.0553
  - % of clk 6 uop Ret 0.1179
  - % of clk 5 uop Ret 0.3096
  - % of clk 4 uop Ret 48.2388
  - % of clk 3 uop Ret 0.2291
  - % of clk 2 uop Ret 47.3791
  - % of clk 1 uop Ret 0.6329
Upc from UopQ 2.9918
  - % of clk 4 uop Disp 50.7043
  - % of clk 3 uop Disp 0.5193
  - % of clk 2 uop Disp 47.2180
  - % of clk 1 uop Disp 0.3713

L2 read:
========
Upc at Retire 1.3127
  - % of clk 8 uop Ret 0.0000
  - % of clk 7 uop Ret 0.0005
  - % of clk 6 uop Ret 0.0443
  - % of clk 5 uop Ret 0.0443
  - % of clk 4 uop Ret 22.4056
  - % of clk 3 uop Ret 1.8244
  - % of clk 2 uop Ret 13.2909
  - % of clk 1 uop Ret 9.1035
Upc from UopQ 3.5675
  - % of clk 4 uop Disp 78.5401
  - % of clk 3 uop Disp 0.0764
  - % of clk 2 uop Disp 21.1532
  - % of clk 1 uop Disp 0.0549

L3 read:
========
Upc at Retire 1.3518
  - % of clk 8 uop Ret 0.0000
  - % of clk 7 uop Ret 0.0008
  - % of clk 6 uop Ret 0.0055
  - % of clk 5 uop Ret 0.0085
  - % of clk 4 uop Ret 27.8184
  - % of clk 3 uop Ret 3.4786
  - % of clk 2 uop Ret 5.1014
  - % of clk 1 uop Ret 3.1904
Upc from UopQ 3.8059
  - % of clk 4 uop Disp 80.7067
  - % of clk 3 uop Disp 19.2463
  - % of clk 2 uop Disp 0.0069
  - % of clk 1 uop Disp 0.0108

So my question is: why am I getting such different distributions on PMC 9C? I've already stated what I'm using to measure these. Any chance you can check and verify that the documentation I'm using is correct? The retire UPC measures close to the IPC, so something isn't adding up. Correct me if I'm wrong:
the count for PMC 9C with umask=0x01 is incremented by (4 - #uops delivered), correct?

Patrick Fay (Intel)

I believe the issue is that all of the IDQ*CYCLE* events only count while uops are being retired.
So the AVG.uops.per.cycle equation (in SDM Optimization manual section B.3.7.1) has to be adjusted.
If you compute (as in section B.3.7.1)
%Retiring = 100 * ( UOPS_RETIRED.RETIRE_SLOTS/ (CPU_CLK_UNHALTED.THREAD * 4))

and compute:
Adj.AVG.uops.per.cycle = %Retiring * AVG.uops.per.cycle / 100

then I think you'll find that
Adj.AVG.uops.per.cycle is = UOPS_RETIRED.ANY/CPU_CLK_UNHALTED.THREAD

I'll explain this more tomorrow but it is very late now and I have to go to bed.
Pat

perfwise

Pat, I'm looking at B.3.7.1 and using the formulas listed there, which are identical to the formulas I'm using to generate the data above. I think of water in a pipe with different expanding sections. Unless there's a leak (speculative uops which are never retired), the # of uops which one can measure using PMC 9C should match that measured from PMC C2, especially using the documentation we are referring to. This isn't happening, so something is either incorrect in the documentation or not documented at all. That's my assertion.

The documentation in B.3.7.1 doesn't mention anything about "retire slots" in its formulas. Can you rewrite the formulas so that they generate a distribution which does match? Right now I'm "thoroughly" confused. I appreciate the help, but I'm going by what's written here, and today I measure more uops, 64B, coming out of the IDQ than the 23B that are being retired. The IDQ_UOPS_NOT_DELIVERED stat is supposed to measure uops coming from the UopQ, but above you mention it only counts when uops are retired. Going by the counts I'm measuring using the difference

CLKS_3_UOP_DELIVERED = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE

if you use these numbers for the CLKS, you simply do not get counts which agree with PMC C2.

Thanks Pat, but I think I need it spelled out really clearly why I'm observing this. I'd like to get these distributions to match, and I don't understand the statement that PMC 9C only counts when uops are retired; the total count out of PMC 9C isn't matching PMC C2, as I've stated above. Thanks for any clarification.

Patrick Fay (Intel)

Sorry to confuse you.
I probably shouldn't have mentioned the "event only counting when uops are being retired".
That was just late night speculation.
Section B.3.2 talks about %retiring and says it is:
%Retiring = 100 * ( UOPS_RETIRED.RETIRE_SLOTS/ N) ; where N=4

Here is what I see:
Currently Section B.3.7.1 defines:
AVG.uops.per.cycle = (4 * (%FE.DELIVERING) + 3 * (%FE.DELIVER.3UOPS) + 2 * (%FE.DELIVER.2UOPS) + (%FE.DELIVER.1UOPS ) ) / 100

I found that AVG.uops.per.cycle was sometimes much higher than
uops.per.cycle = UOPS_RETIRED.ALL/CPU_CLK_UNHALTED.THREAD

I also observed that, if I compute an 'adjusted AVG.uops.per.cycle' as in:
adjusted AVG.uops.per.cycle = %Retiring * AVG.uops.per.cycle / 100
then 'adjusted AVG.uops.per.cycle' is = uops.per.cycle.
At this point I'm not going to speculate why this extra factor is necessary.
If you use the factor, do you get an 'adjusted AVG.uops.per.cycle' that agrees with 'uops.per.cycle'?
If not, I might have a programming error (always a possibility).
Pat

perfwise

Pat, can you tell me what "retire slots" are? C2 umask=0x01 measures actually retired uops. C2 umask=0x02 measures how many "retire slots" are used. What's the difference? Can you quantify this succinctly for me? I want a clear idea. The reason I ask is that B.3.2 uses UOPS_RETIRED.RETIRE_SLOTS rather than UOPS_RETIRED.ALL, and I don't understand why the former is used rather than the latter. And what you're saying is that you need to "reduce" the number of events measured for PMC 9C by this factor to get a meaningful number? That seems quite broken to me. Is there no better explanation for this overcounting? Maybe some architect has one, since both you and I have observed this behavior. It seems to me the event doesn't work if you have to arbitrarily reduce the count, and you have also seen the overcounting in your own work. Thanks Pat

Patrick Fay (Intel)

Hello Perfwise,
I can't tell what the effective difference is between UOPS_RETIRED.RETIRE_SLOTS and UOPS_RETIRED.ALL. I think the number that the 2 counters return will be the same. But they do count different things.
UOPS_RETIRED.ALL counts simply what it says.
UOPS_RETIRED.RETIRE_SLOTS counts, for each cycle, the number of retirement slots used.
The 2 quantities should be the same and in my measurements, they are the same to 4 significant digits.

I'm saying the equation for AVG.uops.per.cycle doesn't work as expected.
If, for example the %FE.DELIVER.0UOPS component of AVG.UOPS.per.cycle is not counting correctly then the number AVG.uops.per.cycle will be too high.

Just looking at another case below.
The %Retiring is 70%. The %DELIVER.4UOPS is 99%.
Does it make sense that you can be retiring 4UOPs 99% of the time but only retiring 70% of the time?
So 30% of the time you aren't retiring uops.

It seems %FE.DELIVER.0UOPS and/or %DELIVER.1UOPS and/or %DELIVER.2UOPS and/or %DELIVER.3UOPS may be undercounting.
There are many possible explanations that will require time to check. An event could be coded wrong, an event could be broken, or my utility could have an error.

I have asked the guy who wrote that section of the SDM to help figure this out but he is very busy and it may be a week before he gets back to me.
In the meantime I'll try some other tools for collecting the events.
Please be patient and we'll figure this out.
Pat

%FE_Bound 0.210371
%Bad_Speculation 0.267872
%Retiring 70.738993
%BE_Bound 28.782764

%FE.DELIVER.0UOPS 0.110400
%DELIVER.1UOPS 0.061143
%DELIVER.2UOPS 0.010504
%DELIVER.3UOPS 0.061832
%DELIVER.4UOPS 99.756121

AVG.uops.per.cycle 3.992921
adj.AVG.uops.per.cycle 2.824552
uops.per.cycle 2.830101
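
The three summary figures in this post can be reproduced from the percentages listed with it, using the AVG.uops.per.cycle definition from Section B.3.7.1 and the %Retiring scaling described earlier in the thread. The snippet is just a check of that arithmetic; the function names are invented, and the inputs are the values posted above.

```python
def avg_uops_per_cycle(pct_deliver):
    """Weighted average of uops delivered per cycle.

    pct_deliver maps a uop count (0..4) to the percentage of cycles
    in which that many uops were delivered.
    """
    return sum(n * pct for n, pct in pct_deliver.items()) / 100.0


def adjusted_avg(pct_retiring, avg):
    """Scale the average by %Retiring, as proposed in the thread."""
    return pct_retiring * avg / 100.0


if __name__ == "__main__":
    # Percentages copied from the post above.
    pct_deliver = {0: 0.110400, 1: 0.061143, 2: 0.010504,
                   3: 0.061832, 4: 99.756121}
    pct_retiring = 70.738993

    avg = avg_uops_per_cycle(pct_deliver)
    adj = adjusted_avg(pct_retiring, avg)
    print(f"AVG.uops.per.cycle     {avg:.6f}")
    print(f"adj.AVG.uops.per.cycle {adj:.6f}")
```

Both results agree with the posted 3.992921 and 2.824552 to six decimals, and the adjusted value lands within about 0.2% of the measured uops.per.cycle of 2.830101.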

perfwise

Pat, Thank you for following up and reporting the issue. Also thanks for the comment on the Retire slots vs Retired Uops. I look forward to hearing what the experts find about the behavior of the PMC 0x9C. Thanks again.. perfwise

Patrick Fay (Intel)

Hello Perfwise,
After talking to more folks, I realized the answer is there in the optimization guide.

The manual says: "The event IDQ_UOPS_NOT_DELIVERED counts when the maximum of four micro-ops are not delivered to the rename stage, while it is requesting micro-ops. When the pipeline is backed up the rename stage does not request any further micro-ops from the front end."

So, when there are cache misses (back-end stalls), the rename stage is not requesting uops, so the front end is not delivering uops, and so this counter doesn't increment.

The methodology in sections B.3.2-B.3.7 is intended to be used in sequence.
First determine if you are front or back end stalled (section B.3.2) and then, if you are front end stalled, use section B.3.7 to further analyze the workload.
Or, as the manual puts it:
B.3.7 Front End Stalls
Stalls in the front end should not be investigated unless the analysis in Section B.3.2
showed at least 30% of a granularity being bound in the front end.

Sound reasonable?
Pat
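
A toy model makes the effect of that sentence in the manual concrete. The sketch below is an illustration under simplifying assumptions, not a model of the real pipeline: each cycle is described only by whether the rename stage is requesting uops and how many uops were delivered, and stalled cycles deliver nothing. The function names and the ten-cycle trace are invented.

```python
def avg_uops_from_event(cycles, width=4):
    """Average uops/cycle as the cmask arithmetic would reconstruct it.

    cycles: list of (rename_requesting, uops_delivered), one tuple per cycle.
    """
    attributed = 0
    for requesting, delivered in cycles:
        # The event adds (width - delivered) only while rename is requesting
        # uops. In a back-end-stalled cycle it adds 0, which the subtraction
        # formulas cannot tell apart from a full-width delivery.
        not_delivered = (width - delivered) if requesting else 0
        attributed += width - not_delivered
    return attributed / len(cycles)


def true_uops_per_cycle(cycles):
    """Uops actually delivered per cycle over the same trace."""
    return sum(delivered for _, delivered in cycles) / len(cycles)


if __name__ == "__main__":
    # 3 cycles delivering 4 uops, then 7 cycles stalled on the back end.
    trace = [(True, 4)] * 3 + [(False, 0)] * 7
    print("reconstructed from event:", avg_uops_from_event(trace))
    print("actually delivered:      ", true_uops_per_cycle(trace))
```

In this trace the event-based reconstruction reports 4.0 uops per cycle while only 1.2 are delivered, the same kind of gap seen in the L2 and L3 read tests earlier in the thread.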
