Load_Latency performance counter ambiguity

The May 2018 Combined SDM, Chapter 19, Sections 2 and 6, lists the performance counters for Skylake and Haswell, respectively.

 

Under section 2 you will find the following 8 events:

Event Number   Umask Value   Event Mask Mnemonic
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8
...
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512

 

Their description reads: "Counts loads when the latency from first dispatch to completion is greater than <X> cycles." for the corresponding value of X: 2, 4, 8, etc. In particular, nothing in the description indicates that these counters measure randomly sampled memory loads. In fact, as stated, I would expect a precise count of these events, up to skid, in perf record.

 

Under section 6, among others, you will find:

Event Number   Umask Value   Event Mask Mnemonic              Description
CDH            01H           MEM_TRANS_RETIRED.LOAD_LATENCY   Randomly sampled loads whose latency is above a user defined threshold. [Specify threshold in MSR 3FAH]

 

My question is: can MEM_TRANS_RETIRED.LOAD_LATENCY be used to emulate the former 8 performance counters listed for Skylake, or are the semantics as stated in the descriptions correct, which would prohibit this emulation by proxy?
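To make "emulation by proxy" concrete, here is a minimal sketch of what I mean: building a perf event string from the shared Event Number and Umask plus a latency threshold. It assumes perf's `ldlat` format term on Intel PMUs and the `pp` precise modifier; the function names are my own and purely illustrative, and whether the resulting counts mean the same thing as the "GT" events is exactly what I am asking.

```python
# Sketch only: encodes the generic load-latency event with a chosen threshold.
# EVENT_SELECT and UMASK come from the SDM tables quoted above; the helper
# names are hypothetical and not part of perf or the kernel.
EVENT_SELECT = 0xCD
UMASK = 0x01


def raw_config() -> int:
    """Raw event encoding: umask in bits 15:8, event select in bits 7:0."""
    return (UMASK << 8) | EVENT_SELECT


def load_latency_event(threshold_cycles: int) -> str:
    """Return a perf event string for loads slower than threshold_cycles.

    Assumes the threshold fits in 16 bits, matching the width of perf's
    `ldlat` term on Intel PMUs.
    """
    if not 1 <= threshold_cycles <= 0xFFFF:
        raise ValueError("threshold must be between 1 and 65535 cycles")
    return (
        f"cpu/event={EVENT_SELECT:#x},umask={UMASK:#x},"
        f"ldlat={threshold_cycles}/pp"
    )


if __name__ == "__main__":
    # Print one candidate stand-in per threshold of interest.
    for threshold in (4, 8):
        print(load_latency_event(threshold))
```

The resulting string would be passed to `perf record -e`; nothing in this sketch says whether the hardware then counts every qualifying load or only a random subset.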

 

I am aware that the Event Number and Umask are the same, but I am unsure whether the hardware implementation of these events is consistent across Haswell and Skylake. I would like to get an official answer from Intel.

 

 

Thank you.

 


But seriously, an answer would be nice.

To my knowledge the implementation for these latency events is similar on all microarchitectures - they randomly select loads to track.

One simple way to check is to collect the MEM_INST_RETIRED.ALL_LOADS_PS and MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 events at the same time. If the latency event samples loads at random, MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 should show a much lower count than MEM_INST_RETIRED.ALL_LOADS_PS.
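The reasoning behind that comparison can be sketched with a toy model. This is not hardware behavior: the latency values, the tracking probability, and the function names are all made up for illustration. It only shows how an exact count of loads above a threshold would differ from a count taken over a randomly tracked subset.

```python
import random


def exact_count(latencies, threshold):
    """Count every load whose latency exceeds the threshold."""
    return sum(1 for latency in latencies if latency > threshold)


def sampled_count(latencies, threshold, track_probability, rng):
    """Count only loads that were randomly chosen for tracking AND exceed
    the threshold. track_probability is a made-up parameter, not a
    documented hardware value."""
    count = 0
    for latency in latencies:
        if rng.random() < track_probability and latency > threshold:
            count += 1
    return count


# Synthetic latencies in cycles; purely illustrative values.
rng = random.Random(0)
latencies = [rng.choice((4, 5, 12, 40, 200)) for _ in range(100_000)]

total_loads = len(latencies)
exact = exact_count(latencies, threshold=4)
sampled = sampled_count(latencies, threshold=4, track_probability=0.01, rng=rng)

print("all loads:", total_loads)
print("exact count above threshold:", exact)
print("sampled count above threshold:", sampled)
```

In this model the sampled count can never exceed the exact count, and with a small tracking probability it falls far below it, which is the kind of gap the suggested measurement looks for.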

The reason I am asking about this seemingly trivial, pedantic matter is that the Linux kernel assumes the answer to that question is "Yes, the semantics are the same and emulation as stated is correct," and it has code that relies on this.

The statement "implementation for these latency events is similar on all microarchitectures" is exactly the problem, which is why I am asking such a pedantic question. Your suggested solution is frustrating because it tells me that you did not really look into my question; that approach cannot possibly answer it.

The "GT" counters are not available on Haswell, so any suggestion involving their use is ruled out. I do not own a Skylake-based machine. Even if I did, the results would only tell me whether, on Skylake specifically, these two counters are counted from randomly sampled loads. That would be useful in telling Intel to make its descriptions more precise one way or the other. Whatever the result, either the "GT" description would have to change from implying "exact counts" to explicitly stating "random samples", or the retired load latency event would have to change from "random loads" to "exact counts". Based on what I've been told so far, there are no other logical options left.

This is a question for your engineers. After you have found out the answer, could you please, for the sake of all that is holy, correct, exact, precise and trustworthy, update the bloody manual?

 

Thank you. 
