MEM_TRANS_RETIRED.LOAD_LATENCY_GT* unexpected results.

Aram S.'s picture

Hi all,

I am trying to use VTune Amplifier (Linux version) to profile memory access latency. To get familiar with it, I am profiling a toy program that just loads a big array of data. I use the command-line version like this:

amplxe-cl -collect-with runsa -knob event-config=MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32,MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 ./load

The result I get is the following.

============================================================================

CPU
---
Parameter          r000runsa                      
-----------------  -------------------------------
Name               Intel(R) Xeon(R) E5v2 processor
Frequency          2394229995                     
Logical CPU Count  48                             

Summary
-------
Elapsed Time:  7.757
CPU Usage:     1.000

Event summary
-------------
Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC                            18538027807                              9269  2000003          
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32                          0                                 0  100007           
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64                      24036                                 6  2003             
amplxe: Executing actions 100 % done

=======================================================================

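From the explanation of the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* events, the count of *_GT_32 must be greater than or equal to the count of *_GT_64, since every load that takes more than 64 cycles also takes more than 32 cycles. In this case it is not, and this behavior is reproducible.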

I checked the errata published in the specification update and found paragraph BT241, which mentions that "The affected events may undercount, resulting in inaccurate memory profiles"; its list of affected events contains MEM_TRANS_RETIRED.LOAD_LATENCY.

Can somebody please explain why the count of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32 is less than that of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64?

Thank you,

Best Regards, Aram

Peter Wang (Intel)'s picture

Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC                            18538027807                              9269  2000003          
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32                          0                                 0  100007           
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64                      24036                                 6  2003             

Could this be due to the larger SAV (sample-after value) of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32? There are only 6 samples for MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 and 0 samples for MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32.

I recommend trying:

amplxe-cl -collect-with runsa -knob event-config=MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32:sa=2000,MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64:sa=2000 ./load

 

Aram S.'s picture

Peter thank you for your response.

I wasn't aware of this sample-after-value parameter. A high default SAV does indeed explain why GT_32 is 0.

I tried your recommendation and the results are:

Event summary
-------------
Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC                            19358029037                              9679  2000003          
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32                      88000                                22  2000             
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64                      12000                                 3  2000             
amplxe: Executing actions 100 % done                                           

Much better. However, I would expect the total number of events to be samples * events_per_sample, but the number I get is twice that. Why is that?

Thanks.

Peter Wang (Intel)'s picture

As I explained, the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32 count was zero because its SAV was 100007, so no sample was captured. It does not mean the event didn't occur.
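The sample-after-value explanation can be illustrated with a little arithmetic. The sketch below is a simplified model, not VTune's actual implementation: it assumes one sample is recorded each time a counter has seen SAV events, and that the reported count is reconstructed from the samples. The function names are made up for illustration, and the figure of about 88,000 GT_32 events is taken from the second run reported in this thread.

```python
# Simplified model of event-based sampling (an assumption, not VTune internals):
# one sample is recorded every time the counter has accumulated SAV events.

def samples_taken(event_count: int, sav: int) -> int:
    """Number of samples recorded after `event_count` events."""
    return event_count // sav

def estimated_count(samples: int, sav: int) -> int:
    """Event count reconstructed from the number of samples."""
    return samples * sav

# Roughly the GT_32 count reported in the second run of this thread.
events = 88_000

for sav in (100_007, 2_000):
    n = samples_taken(events, sav)
    print(f"SAV={sav}: {n} samples, estimated count {estimated_count(n, sav)}")
```

Under this model, fewer events than the SAV produce no sample at all, so the reported count is zero even though the event occurred. The sample numbers in the real run differ from this model because two latency events were collected together.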

That is a good question: why are the counts equal to 2 * SAV * samples? I think the reason is that two events sometimes occur at the same time, but VTune can only record one event at a time. If you profile the events separately, you will get "counts = SAV * samples".

Aram S.'s picture

Hi Peter,

I wrote

"A high default SAV number indeed explains why GT_32 is 0."

I was trying to say that I understood the explanation in your first post. I apologize for the miscommunication; my English skills are not that good.

 

About counts = 2 * SAV * samples: yes, you are correct. If I profile only one LATENCY event, then counts = SAV * samples. To be more precise, counts = NUM_OF_LATENCY_EVENTS * SAV * samples, but the end result (total counts) stays more or less the same when using 1, 2, or 3 events simultaneously, so the end result is accurate.
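The relation counts = NUM_OF_LATENCY_EVENTS * SAV * samples can be checked against the numbers posted in this thread. The short script below only does that arithmetic; the rows are copied from the two "Event summary" tables above, and the helper name `scale_factor` is just an illustrative choice.

```python
# Rows copied from the two "Event summary" tables in this thread:
# (run, event, reported count, sample count, events per sample)
rows = [
    ("run 1", "CPU_CLK_UNHALTED.REF_TSC", 18_538_027_807, 9_269, 2_000_003),
    ("run 1", "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", 24_036, 6, 2_003),
    ("run 2", "CPU_CLK_UNHALTED.REF_TSC", 19_358_029_037, 9_679, 2_000_003),
    ("run 2", "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", 88_000, 22, 2_000),
    ("run 2", "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", 12_000, 3, 2_000),
]

def scale_factor(count: int, samples: int, sav: int) -> float:
    """Ratio between the reported count and samples * SAV."""
    return count / (samples * sav)

for run, event, count, samples, sav in rows:
    print(f"{run}  {event:<38} factor = {scale_factor(count, samples, sav):.1f}")
```

The clock event comes out with a factor of 1, while both latency events come out with a factor of 2, which matches the number of latency events collected in each run.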

 

Thanks again for your answers.

Regards,

Aram 

MrAnderson (Intel)'s picture

@Aram

How many cores are in your system? Each core will generate a sample of its own after SAV events.

Regards, MrAnderson
Aram S.'s picture

Hello MrAnderson,

The system is a two-socket IVB-EP machine. Each package has 12 physical cores. For this experiment I have HT enabled, so there are 48 visible processors in total.

I can't see a correlation between the number of available cores on my system and the formula COUNTS = NUM_OF_LATENCY_EVENTS * SAV * SAMPLES. Additionally, the profiled program was single-threaded. I think the explanation is related to the mechanism by which the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* events are collected. I am currently reading the documentation to figure this out.

MrAnderson (Intel)'s picture

Hi Aram:

Another thought (I just haven't taken the time to check it myself): you might look at the documentation (Software Developer's Manual) for the exact processor family. Sometimes the hardware counters are known to "double count", which is outside VTune Amplifier XE's control. Also, have you checked out the Tuning Guides? There might be some guidance on these counters in there.

Regards, MrAnderson
MrAnderson (Intel)'s picture

BTW, did you check out this article, MEM_TRANS_RETIRED.LOAD_LATENCY events?

Regards, MrAnderson
Aram S.'s picture

Hi,

Thank you for the pointers you provided; I am checking them out.

In the related article from the above post, the author claims that only one LATENCY event can be sampled in a given time period, although the reason for this limitation is not clear to me yet. If I collect LATENCY_GT_4 and LATENCY_GT_64 at the same time and a load instruction with a latency of 100 cycles is encountered, it seems perfectly reasonable to me that both GT_4 and GT_64 should be incremented. I tried a few tests using a single event and multiple events (2-3), and I couldn't find any discrepancy.
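One way to reconcile "only one LATENCY event at a time" with counts that still come out right is time multiplexing. The sketch below is a toy model under that assumption, which this thread does not confirm: each threshold is active for a slice of the loads in turn, and the raw count is scaled by the number of thresholds. The workload, the slice length, and the function names are all hypothetical.

```python
# Toy model (an assumption, not confirmed VTune behavior): only one latency
# threshold is active at a time, thresholds take turns in fixed-size slices,
# and raw counts are scaled by the number of thresholds sharing the counter.

def true_counts(latencies, thresholds):
    """Loads above each threshold if every threshold could be counted at once."""
    return {t: sum(lat > t for lat in latencies) for t in thresholds}

def multiplexed_counts(latencies, thresholds, slice_len):
    """Raw and scaled counts when thresholds share the counter round-robin."""
    raw = {t: 0 for t in thresholds}
    for i, lat in enumerate(latencies):
        active = thresholds[(i // slice_len) % len(thresholds)]
        if lat > active:
            raw[active] += 1
    scaled = {t: raw[t] * len(thresholds) for t in thresholds}
    return raw, scaled

# Hypothetical workload: a repeating pattern of load latencies in cycles.
latencies = [10, 40, 100, 5] * 1000
thresholds = [32, 64]

raw, scaled = multiplexed_counts(latencies, thresholds, slice_len=100)
print("raw:   ", raw)
print("scaled:", scaled)
print("true:  ", true_counts(latencies, thresholds))

# A single 100-cycle load satisfies both GT_4 and GT_64.
print(true_counts([100], [4, 64]))
```

In this model each threshold sees only half of the loads, so its raw count is half of the true count, and scaling by the number of thresholds recovers the total for a steady workload. That would be consistent with the observation that the totals stay about the same with 1, 2, or 3 events.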

Cheers
