Utilizing Performance Monitoring Events to find Problematic Loads Due to Latency in the Memory Hierarchy

The most common bottleneck found across applications is stalls on loads due to latencies in the memory hierarchy.  Admittedly, this is also one of the most difficult issues to fix.  I plan to use this blog to help users identify the issue, and will follow it with another blog on methodologies to alleviate it.  I promise not to recommend generic fixes such as cache blocking, which can only be applied to a very small percentage of applications in the market. 

The first problem is to determine what sort of view can help flush out issues with load latencies.  My favorite view for this is to see the load presented in the context of the surrounding instructions, in the order in which they were most typically retired, along with the clockticks tagged to each instruction.  The problematic loads are then presented within these common streams of execution with a breakdown of where each load was satisfied in the memory hierarchy (L1, L2, L3, etc.).  This view lets us relate the interactions between events firing during execution, and it presents each load alongside an estimate of that load's cost to the workload.  I will spend the rest of this blog explaining how we produce the view below on the Core (Nehalem) architecture.

In the view above, the instructions are presented in retirement order along the x-axis, while the clockticks tagged to each of those instructions are presented along the y-axis.  Any instructions producing significant bottlenecks along the “stream” of instructions are marked as “spikes,” and a combination of performance event data and static analysis is used to attempt to determine the issue.  For example, a bottleneck due to a load is labeled “SPIKE2,” and data from the performance monitoring events indicates that it is missing the last-level cache ~6% of the time.  Keep in mind that every LLC miss costs ~200+ cycles, so a 6% miss rate in the LLC is more relevant than the ~21% hit rate in the LLC, which costs 30+ cycles per hit.
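To see why the miss rate dominates, it helps to compare the two contributions directly.  The sketch below multiplies each rate by its approximate per-access cost; the latencies are the coarse ballpark numbers from the text, not exact figures for any particular part.

```python
# Rough per-load cost comparison for a spike like SPIKE2.
# Latencies are the approximate values quoted in the text (assumptions).
LLC_MISS_COST = 200   # ~200+ cycles for a last-level cache miss
LLC_HIT_COST = 30     # ~30+ cycles for a hit in the LLC

llc_miss_pct = 0.06   # ~6% of this load's samples miss the LLC
llc_hit_pct = 0.21    # ~21% of this load's samples hit in the LLC

miss_contribution = llc_miss_pct * LLC_MISS_COST  # estimated cycles per load
hit_contribution = llc_hit_pct * LLC_HIT_COST     # estimated cycles per load

print(miss_contribution, hit_contribution)
```

Even though misses are far rarer than LLC hits here, their estimated contribution (~12 cycles per load) is nearly double that of the hits (~6.3 cycles per load), which is why the 6% miss rate is the number to chase first.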

On Core processors, the Intel architects have done a phenomenal job with the processor's performance monitoring unit to quantify load latency issues.  The Core architecture is the first from Intel to include event capability to precisely break down, for any given load, the level of the memory hierarchy (L1D, L2, L3, another core's cache, RAM, etc.) where the load was satisfied.  This is accomplished with a precise event that tags where a load hits.

Loads tend to have more performance impact than stores because load latencies tend to push out the execution and retirement of instructions; in the common case, stores do not push out retirement.  The load breakdown can be performed at any granularity, including process, module, function, and instruction.  When a load is found to be a bottleneck, it is recommended to start with the precise load breakdown and, if that does not explain the bottleneck, move on to other issues that can impact loads.  The two most useful views for these events are estimating the cost of a load to explain a bottleneck and obtaining a percentage breakdown of the level of cache in which the load was found.  I recommend relying on the precise load breakdown events listed below instead of the load latency events, which rely on a statistical sampling technique in the time domain that can skew the data.

All of the load events follow the same skid as all precise events, which tag to the next instruction retired.  This means that the precise load events will always be tagged to the instruction retired after the load.

A percentage breakdown of load sources can be produced at any granularity, including a single instruction, function, module, or process.  This is particularly useful at the single-instruction level, showing the breakdown of where the load was found in the cache hierarchy.  In its current state this study works for a single socket only, but it can be extended to multi-socket as well.  Each level of the memory hierarchy can be broken down using this methodology, although only the L3 is provided as an example. 

Example:  LLC_HIT% = Percentage of time the load was found in the LLC.
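A minimal sketch of how such a breakdown is computed from event counts: divide each precise load-source count by the total retired loads.  The sample counts below are made up, and the MEM_LOAD_RETIRED sub-event names are the Nehalem-era ones; verify them against your processor's PMU event reference.

```python
# Hypothetical sample counts per precise load-source event (illustrative).
counts = {
    "MEM_LOAD_RETIRED.L1D_HIT": 90_000,
    "MEM_LOAD_RETIRED.L2_HIT": 4_000,
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 3_000,
    "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM": 1_000,
    "MEM_LOAD_RETIRED.LLC_MISS": 2_000,
}

# Total loads = sum over all sources; each source's percentage follows.
total = sum(counts.values())
breakdown = {ev: 100.0 * n / total for ev, n in counts.items()}

# e.g. LLC_HIT% for this instruction/function/module/process
print(breakdown["MEM_LOAD_RETIRED.LLC_UNSHARED_HIT"])
```

The same arithmetic applies at any granularity; only the scope over which the samples are aggregated changes.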

We can also utilize the precise load breakdown events to provide rough estimates of load latencies.

Do not use this methodology to attempt to determine the cost of L1D issues, since the typical ~4-cycle cost is easily hidden by the pipeline.  If you are hitting 100% in the L1D and you have highly dependent loads, then the L1D may be the problem.  Please do not pay too much attention to the exact latencies I have provided for each level of the memory hierarchy.  The estimate only has to be in the right ballpark, so I have provided very rough numbers.

Cost of L2 Latency

Cost of LLC Latency

Cost of HITs in Other Cores' Caches

Cost of HITMs in Other Cores' Caches

Cost of Memory Latency (assumes local DRAM and not bandwidth bound)

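The cost estimates above all take the same form: multiply the precise load-source count by an assumed latency for that level, then express the result as a share of total clockticks.  The sketch below illustrates that shape; the latency numbers are deliberately coarse ballpark assumptions (as the text advises), and the event names are the Nehalem-era ones, so check both against your own PMU reference.

```python
# Very rough per-level latency assumptions, in cycles (ballpark only).
ASSUMED_LATENCY = {
    "MEM_LOAD_RETIRED.L2_HIT": 10,                  # L2 hit
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 35,        # unshared LLC hit
    "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM": 60,  # other core's cache
    "MEM_LOAD_RETIRED.LLC_MISS": 200,               # local DRAM
}

def estimated_cost_pct(event_counts, total_cycles):
    """Estimate each load source's share (%) of total clockticks:
    count * assumed latency / total cycles."""
    return {
        ev: 100.0 * n * ASSUMED_LATENCY[ev] / total_cycles
        for ev, n in event_counts.items()
        if ev in ASSUMED_LATENCY
    }

# Hypothetical counts over a 200M-cycle sampling window.
sample = {
    "MEM_LOAD_RETIRED.L2_HIT": 1_000_000,
    "MEM_LOAD_RETIRED.LLC_MISS": 100_000,
}
print(estimated_cost_pct(sample, total_cycles=200_000_000))
```

A level whose estimated share is a few percent or more of total clockticks is worth investigating; a fraction of a percent generally is not, no matter how alarming the raw miss count looks.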
Hitting in the line fill buffer (LFB) is one of the difficulties with this analysis.  A load that hits in the LFB means that a previous hardware prefetch, load, or store has already missed the L1D at an address on the same cache line and has allocated a fill buffer for that line.  The latency of our demand load is then variable, since it hits in the existing line fill buffer.  There is an advanced technique that uses the LFB source of the load latency event to cover this case, which may be covered in a later blog.

Performance issues due to load latency in the memory hierarchy are among the most difficult performance issues to resolve.  One study I have found particularly useful is locating where heavy cache line replacements occur in the code, which shows where the application is evicting potentially reused data from the various levels of cache.  I will provide a methodology to accomplish this in a follow-up blog.  These locations may or may not coincide with the hot portions of code identified by the load latency study above, and often do not.  For instance, regular traversal of a large data structure can unintentionally flush the various levels of cache.