Reducing cache misses in your application on Sandy Bridge and Atom Architectures

I have received enough questions about some articles I wrote for the Intel64 and IA-32 Architectures Optimization Manual that I decided to port these articles over to blogs, where I can respond to customers and update the content as needed.  This cache line replacement analysis is one of the articles I am porting from the optimization manual, with some additional edits and notes.

When an application has many cache misses, it is a good idea to determine where cache lines are being replaced most frequently. The instructions responsible for a high number of cache replacements are not always where the application spends the majority of its time, since replacements can be driven by the hardware prefetchers and by store operations, which in the common case do not hold up the pipeline.  This means that a cold area of code can thrash your cache, causing heavy cache misses in other areas of your code and costing performance. By traversing large arrays or data structures, developers can unknowingly cause heavy cache line replacements.

Required events (Sandy Bridge Architecture)

L1D.REPLACEMENT - Replacements in the 1st level data cache.

L2_LINES_IN.ALL - Cache lines being brought into the L2 cache.

OFFCORE_RESPONSE.DATA_IN_SOCKET.LLC_MISS_LOCAL.DRAM_0 - Cache lines being brought into the last level cache (LLC).

Usages of events:

Identifying the replacements that potentially cause performance loss can be done at the process, module, and function level. Do it in two steps:

• Use the precise load breakdown to identify the memory hierarchy level at which loads are satisfied and cause the highest penalty.  That study is described by the following links:

Nehalem = /en-us/blogs/2010/09/30/utilizing-performance-monitoring-events-to-find-problematic-loads-due-to-latency-in-the-memory-hierarchy

Sandy Bridge = Intel64 and IA-32 Architecture Optimization Manual

• Identify, using the formulas below, which portion of code causes the majority of the replacements in the level below the one that satisfies these high penalty loads.

For example, if there is a high penalty due to loads hitting the LLC, check the code that is causing replacements in the L2 and the L1. In the formulas below, the numerators are the replacements attributed to a module or function. The denominators are the sum of all replacements in a cache level for the process. This enables you to identify the process, module, and function that is causing the majority of the replacements.
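The level selection in the second step can be sketched in a few lines of Python. This is only an illustration of the rule described above, assuming that "the level below" means every cache level closer to the core than the one satisfying the loads, as in the LLC example. The function name events_to_check and the level labels are hypothetical; the event names are the Sandy Bridge events listed earlier.

```python
# Cache levels ordered from the core outward, paired with the Sandy Bridge
# replacement events listed earlier in this article.
REPLACEMENT_EVENTS = [
    ("L1D", "L1D.REPLACEMENT"),
    ("L2", "L2_LINES_IN.ALL"),
    ("LLC", "OFFCORE_RESPONSE.DATA_IN_SOCKET.LLC_MISS_LOCAL.DRAM_0"),
]


def events_to_check(satisfying_level):
    """Return the replacement events for the levels closer to the core
    than the level that satisfies the high-penalty loads.

    Hypothetical helper: the interpretation of "level below" is an
    assumption based on the LLC example in the text.
    """
    levels = [level for level, _ in REPLACEMENT_EVENTS]
    if satisfying_level not in levels:
        raise ValueError(f"unknown cache level: {satisfying_level}")
    index = levels.index(satisfying_level)
    return [event for _, event in REPLACEMENT_EVENTS[:index]]


# Loads hitting the LLC: look at replacements in the L1D and the L2.
print(events_to_check("LLC"))
```

For loads satisfied by the L2, the same helper returns only the L1D event.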

L1D Cache Replacements

%L1D.REPLACEMENT = L1D.REPLACEMENT / SumOverTheProcess(L1D.REPLACEMENT);

L2 Cache Replacements

%L2.REPLACEMENT = L2_LINES_IN.ALL / SumOverTheProcess(L2_LINES_IN.ALL);

L3 Cache Replacements

%L3.REPLACEMENT = OFFCORE_RESPONSE.DATA_IN_SOCKET.LLC_MISS_LOCAL.DRAM_0 / SumOverTheProcess(OFFCORE_RESPONSE.DATA_IN_SOCKET.LLC_MISS_LOCAL.DRAM_0);
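As a rough illustration of how these ratios might be computed once event counts have been collected per function, here is a minimal Python sketch. It assumes the profiler output has already been reduced to (function, count) pairs for a single event; the function names and counts are made up for the example, and replacement_share is a hypothetical helper, not part of any Intel tool.

```python
from collections import Counter


def replacement_share(samples):
    """Return each function's fraction of the process-wide event count.

    samples: iterable of (function_name, event_count) pairs for one event,
    for example L1D.REPLACEMENT. A function may appear more than once.
    """
    totals = Counter()
    for function, count in samples:
        totals[function] += count
    process_total = sum(totals.values())  # SumOverTheProcess(event)
    if process_total == 0:
        return {}
    return {function: count / process_total for function, count in totals.items()}


# Hypothetical L1D.REPLACEMENT counts per function, for illustration only.
l1d_samples = [
    ("copy_rows", 600_000),
    ("parse_input", 150_000),
    ("copy_rows", 150_000),
    ("render", 100_000),
]

shares = replacement_share(l1d_samples)
for function, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{function}: {share:.1%}")
```

With these invented counts, copy_rows accounts for three quarters of the L1D replacements in the process, which would make it the first place to look even if it is not where most of the run time is spent. The same helper applies unchanged to the L2 and LLC events.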

On the Atom architecture, you can perform the same analysis using the following events:

L1D Cache Replacements Event = L1D_CACHE.REPL

L2 Cache Line Replacements Event = L2_LINES_IN.SELF.ANY