MEM_TRANS_RETIRED.LOAD_LATENCY events

Intel® Microarchitecture Codename Sandy Bridge provides 8 precise MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* events. They let you pinpoint loads whose latency exceeded a given threshold, measured in CPU clock cycles. For example, the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 event counts loads that took more than 4 clocks, and the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512 event counts loads that took more than 512 clocks.

Intel® VTune™ Amplifier XE performance profiler samples these events differently from most other events. When you choose to sample one of them, special hardware tracks a data load from issue to completion. This is more complicated than simply counting instances of an event (as in normal event-based sampling), so only some loads are tracked. Loads are chosen at random, the latency of each is measured, and the matching event(s) are incremented (latency >4, >8, >16, etc.). Because of the way these events are sampled, only a small percentage of an application's data loads can be tracked at any one time.
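The tracking described above can be modeled with a small Python sketch. It is only a conceptual illustration, not how the hardware or the profiler is implemented: the names thresholds_exceeded and sample_event, the sample_fraction parameter, and the use of a seeded random number generator are all assumptions made for this example. The threshold list assumes the eight events double from 4 up to 512.

```python
import random

# Assumed thresholds of the eight LOAD_LATENCY_GT_* events (4 doubling to 512).
THRESHOLDS = [4, 8, 16, 32, 64, 128, 256, 512]

def thresholds_exceeded(latency_cycles):
    """Return every threshold that a load of this latency exceeds."""
    return [t for t in THRESHOLDS if latency_cycles > t]

def sample_event(latencies, threshold, sample_fraction, seed=0):
    """Model sampling one LOAD_LATENCY_GT_<threshold> event.

    Each load is tracked with probability sample_fraction (a hypothetical
    knob for this sketch). Returns (tracked_loads, event_count).
    """
    rng = random.Random(seed)
    tracked = 0
    count = 0
    for latency in latencies:
        if rng.random() >= sample_fraction:
            continue  # this load was not picked for tracking
        tracked += 1
        if latency > threshold:
            count += 1  # the tracked load exceeded the threshold
    return tracked, count
```

For instance, thresholds_exceeded(20) returns [4, 8, 16], so a tracked 20-cycle load would be counted by the GT_4, GT_8, and GT_16 events but not by GT_32 or higher.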

By sampling a range of latency thresholds with these events, you can determine your application's general latency distribution and, because the events are precise, pinpoint any overly long loads. However, data from these events will not correlate with other events such as MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS, which counts all loads within the sampling interval. The MEM_TRANS_RETIRED.LOAD_LATENCY events will show a much smaller total number of loads.

Also note that only one of these events can be sampled in a given time period. If you are sampling the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 event, for example, every tracked load with a latency above 4 is counted, including loads with latencies greater than 128, 256, etc. The events also count only data loads (not code loads) and only demand loads (not hardware prefetches).
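Because each event counts every tracked load above its threshold, the counts for successive thresholds are cumulative. One way to turn them into counts per latency range is to subtract neighboring thresholds, as in the Python sketch below. The function name per_range_counts and the example numbers are made up for illustration. Since only one event can be sampled at a time, real counts would come from separate sampling periods, so the differences are approximate and can even come out negative.

```python
def per_range_counts(cumulative):
    """Convert 'latency > threshold' counts into counts per latency range.

    cumulative maps a threshold (in cycles) to the number of tracked loads
    whose latency exceeded it.
    """
    thresholds = sorted(cumulative)
    ranges = {}
    # Loads above `lower` but not above `upper` fall in lower+1 .. upper.
    for lower, upper in zip(thresholds, thresholds[1:]):
        ranges[f"{lower + 1}-{upper}"] = cumulative[lower] - cumulative[upper]
    # Everything above the largest threshold stays in one open-ended range.
    last = thresholds[-1]
    ranges[f">{last}"] = cumulative[last]
    return ranges

# Made-up counts, one per LOAD_LATENCY_GT_* threshold.
example = {4: 1000, 8: 400, 16: 150, 32: 60, 64: 25, 128: 10, 256: 4, 512: 1}
print(per_range_counts(example))
```

With these example numbers, 600 of the 1000 tracked loads above 4 cycles fall in the 5-8 cycle range, and only 1 load exceeds 512 cycles.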

Comments

Hi Shannon and thank you for the article,

You write:

An additional thing to know about these events is that only one can be sampled in a given time period.  If you are sampling the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4 event, for example, all tracked loads with a latency above 4 would be counted, including loads greater than 128, 256, etc.

This part made me wonder. Are you saying that the latency distribution obtained (when measuring multiple LATENCY_GT events at the same time) would be a cumulative histogram, and thus not desirable?

Can you explain further why only one such event can be sampled at a given period?

Thank you.

Aram