Understanding L2 Miss performance counters

I am trying to understand the performance counters related to L2 misses on the Haswell microarchitecture. Can someone tell me why the L2_RQSTS:MISS counter value is greater than OFFCORE_RESPONSE_1:ANY_REQUEST:ANY_RESPONSE? Sometimes the two values are very close, but for some benchmarks L2_RQSTS:MISS is around 20-30% higher than OFFCORE_RESPONSE_1:ANY_REQUEST:ANY_RESPONSE. Is that because L2 misses to a cache line that is already being serviced do not generate offcore responses? Or is there some other reason? Thanks in advance.


I see the L2_RQSTS:MISS event described in the VTune configuration files, but I don't see anything that maps exactly to OFFCORE_RESPONSE_1:ANY_REQUEST:ANY_RESPONSE.    (On the other hand, the most recent version of VTune uses a fairly complex syntax in its configuration files and it is certainly possible that I am not understanding it correctly.)

Is there any way to find out exactly what is programmed into MSR 0x1A7 for this case?

John D. McCalpin, PhD
"Dr. Bandwidth"

Hi John,

MSR 1A7H is used for "OFFCORE response Performance monitoring". You can find the complete details about the event codes and counters here: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

futureishere,

Could you please send me the result file for the above observed behavior?

Regards,

Sukruth H V

Hi Sukruth,

I am using libpfm4.4 with perf_events to get the counter values. This is what I get when I run the yada benchmark from the STAMP benchmark suite with locks on 4 threads, with the L2 prefetcher turned off in the BIOS:

task -i -e L2_RQSTS:MISS,OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_RESPONSE ./yada -a15 -i inputs/ttimeu1000000.2 -t 4

264442632 L2_RQSTS:MISS
218706989 OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_RESPONSE
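
For scale, the gap between the two counts above can be worked out directly. This is only a small arithmetic sketch in Python; the function name is illustrative and not part of any tool mentioned in this thread.

```python
def relative_excess(larger: int, smaller: int) -> float:
    """Return how much `larger` exceeds `smaller`, as a fraction of `smaller`."""
    return (larger - smaller) / smaller


l2_miss = 264_442_632  # L2_RQSTS:MISS from the run above
offcore = 218_706_989  # OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_RESPONSE from the run above

excess = relative_excess(l2_miss, offcore)
print(f"L2_RQSTS:MISS exceeds the offcore count by {excess:.1%}")
```

For this run the excess comes out at roughly 21%, which is within the 20-30% range described in the original question.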

 

I am aware of the use of the 0x1A7 MSR -- the question was determining exactly which bits it contained.   There is no point in trying to understand a difference between two performance counter events unless the exact programming of the events is clear.

I will look up the results from the libpfm4 translation after lunch....

John D. McCalpin, PhD
"Dr. Bandwidth"

Hi John,

ANY_REQUEST is alias to DMND_DATA_RD:DMND_RFO:DMND_IFETCH:WB:PF_DATA_RD:PF_RFO:PF_IFETCH:PF_LLC_DATA_RD:PF_LLC_RFO:PF_LLC_IFETCH:BUS_LOCKS:STRM_ST:OTHER

and  ANY_RESPONSE is used to set the bit 'Any' (Offset 16) in MSR_OFFCORE_RSP_x.

So the event OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_RESPONSE would set bits 16-15,11-0 in MSR_OFFCORE_RSP_0 register as per my understanding.
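
The bit positions named above can be turned into a register value with a few lines of Python. This sketch only does the arithmetic for the bits as stated in this post (16-15 and 11-0); it has not been checked against what libpfm4 actually writes to the MSR, and the helper name is hypothetical.

```python
def mask_from_bits(*ranges):
    """Build an integer mask from inclusive (low, high) bit ranges."""
    mask = 0
    for low, high in ranges:
        for bit in range(low, high + 1):
            mask |= 1 << bit
    return mask


# Bits 11-0 (request types) plus bits 16-15, as described in the post.
any_request_any_response = mask_from_bits((0, 11), (15, 16))
print(hex(any_request_any_response))
```

Comparing this value with the one actually programmed into MSR_OFFCORE_RSP_0 would settle the question of exactly which bits are set.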

Xeon processors support 2 forms of L2 streaming prefetches. In one case, the data will be fetched into L2. In the other case, the data will only be fetched into L3. This 2nd case is also known as LLC prefetch (or L3 prefetch), though it is still initiated by L2.

The Haswell PMU has a bug and can't count whether LLC prefetches hit or miss in the LLC. However, L2_RQSTS.MISS will count those. That is why you are seeing the difference. If you disable the L2 prefetcher, then your numbers should match.

According to section 18.11.4 of the latest revision of Vol 3 of the Intel Arch SW Developer's Guide (325384-051), bits 3 and 7-11 of the offcore response MSRs are reserved in Haswell microarchitecture.   It is hard to tell exactly what this means, but it is certainly possible that the implementation has changed enough that setting these bits causes inaccurate readings.

John D. McCalpin, PhD
"Dr. Bandwidth"

Thanks for the reply Vish. The results I posted before were with L2 prefetcher already turned off.

Thanks for pointing that out, John. I ran the benchmark again with just bits 15,6-4,2-0 set and got the same result.

Some ideas for further analysis:

1. Have you tried setting up a test case for which the number of L2 misses is known in advance?  (Since you already know how to turn the L2 prefetcher off this should not be too tricky.)  Then you could see which counter is closer to the expected value.

2. Have you tried this test on an earlier processor?  Sometimes the counter functionality is effectively identical across processor generations, sometimes changes in implementation cause a legacy event to misbehave, and sometimes the new implementation brings in new bugs.

John D. McCalpin, PhD
"Dr. Bandwidth"

I looked at some cases where I knew what answers to expect on my Sandy Bridge EP (Xeon E5-2690) systems....

Using the STREAM benchmark (one thread) with inline performance counter reads and hardware prefetch disabled, I get exact matches between L2_RQSTS.DEMAND_DATA_RD_MISS (Event 0x24, Umask 0x02 -- note that the encodings are very different on Sandy Bridge and Haswell !) and OFFCORE_RESPONSE_0 with Request type "DMND_DATA_RD" (bit 0) and Response type "Any" (bit 16) as the only two bits set.  Here "exact match" means that the counts differ by at most one increment in the last decimal place when counting 5 million events (for the COPY kernel) or 10 million events (for the TRIAD kernel).

For the same benchmark, I also get exact matches between L2_RQSTS.RFO_MISS (Event 0x24, Umask 0x08) and OFFCORE_RESPONSE_0 with Request type "DMND_RFO" (bit 1) and Response type "Any" (bit 16) as the only two bits set.

When I enable the L2 prefetchers the results no longer match.   The Offcore Response counts make sense -- the sum of Demand Read responses and L2 HW Prefetch Read responses is about 2% higher than the expected number of read responses.   Similarly the sum of Demand RFO responses and L2 HW PF RFO responses is about 2% higher than the expected number of RFO responses.    An overcount of 2% is more than I expect (since STREAM uses all the data in a 40 million element vector -- prefetches should all be used), but it is close enough for most performance work.

On the other hand, with the HW prefetchers enabled I have been unable to come up with an interpretation of the L2 counters that makes much sense. Some L2 demand read misses are converted to L2 demand read hits, but the sum of demand read hits and misses is 30% lower than the expected value for the COPY kernel and 17% lower than the expected value for the TRIAD kernel. 
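
As a sanity check on the expected counts quoted in this post, the number of cache lines touched per kernel can be derived from the array size. This is a sketch under stated assumptions: 8-byte (double) elements, 64-byte cache lines, the standard STREAM kernel definitions, and ordinary (non-streaming) stores so that each written line costs one RFO.

```python
CACHE_LINE_BYTES = 64    # assumption: 64-byte cache lines
ELEMENT_BYTES = 8        # assumption: STREAM arrays hold 8-byte doubles
N_ELEMENTS = 40_000_000  # vector length quoted in the post

lines_per_array = N_ELEMENTS * ELEMENT_BYTES // CACHE_LINE_BYTES

# (arrays read, arrays written) per kernel, per the standard STREAM definitions:
#   COPY:  c[i] = a[i]
#   TRIAD: a[i] = b[i] + q * c[i]
kernels = {"COPY": (1, 1), "TRIAD": (2, 1)}

expected = {
    name: {
        "read_lines": reads * lines_per_array,
        # assumption: no streaming stores, so each written line needs one RFO
        "rfo_lines": writes * lines_per_array,
    }
    for name, (reads, writes) in kernels.items()
}

for name, counts in expected.items():
    print(name, counts)
```

Under these assumptions COPY reads 5 million lines and TRIAD reads 10 million, which lines up with the event counts mentioned above.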

John D. McCalpin, PhD
"Dr. Bandwidth"

Hi  John,

Sorry for the delayed response. I was trying to get access to an Ivy Bridge machine to confirm your observations. And yes, I see the same results as you do on an Ivy Bridge Core i7-3770 machine. The L2_RQSTS and OFFCORE_RESPONSE_0 results do match with the L2 prefetcher turned off.

So I don't understand why I can't get these counters to match on my Haswell machine (Core i7-4770), unless there's a bug in the Haswell PMU.

I am observing a few more strange things on Haswell. With the L2 prefetcher turned off, for the yada benchmark, I still see a nonzero count for L2_RQSTS:ALL_PF, while L2_RQSTS:L2_PF_HIT and L2_RQSTS:L2_PF_MISS are zero as expected:

task -i -e L2_RQSTS:ALL_PF,L2_RQSTS:L2_PF_HIT,L2_RQSTS:L2_PF_MISS ./cmd 4

52220075        L2_RQSTS:ALL_PF

0        L2_RQSTS:L2_PF_HIT
0        L2_RQSTS:L2_PF_MISS

That can't be expected behavior, can it?

I am not sure prefetchers are actually getting disabled properly on your system. Can you please read MSR 0x1A0 and report the value?

Value of 0x1A4: 3

I am not sure which MSR this is, as this address is not mentioned in the manual. In case you meant the IA32_MISC_ENABLE MSR, its value is 0x1A0: 4000850089.

 

I meant to write 0x1A0 :). Does your BIOS expose an option to disable the L1 prefetcher? If so, can you give that a try?

No, it doesn't. :(

btw, how are you measuring these events? Is that through Linux Perf or your own tool programming the PMU?

I am using libpfm 4.5.0 that uses Linux perf_events underneath.

Hi Vish,

Any updates on this? I tried to verify the results on a system with an Intel motherboard (DQ87PG), but it doesn't even provide the option to disable the hardware prefetchers in the BIOS! :(

We have just publicly disclosed how to enable/disable the h/w prefetchers on Intel processors code-named Nehalem, Westmere, Sandy Bridge, Ivy Bridge and Haswell. Please refer to https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-co... that I just posted.
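
To make the MSR 0x1A4 values that appear in this thread (3 and 0xF) easier to read, here is a small Python decoder. The bit assignment below is an assumption based on the layout commonly cited from that article (a set bit disables the corresponding prefetcher); please verify it against the article before relying on it. The function and table names are illustrative.

```python
# Assumed layout of MSR 0x1A4 (a set bit means the prefetcher is DISABLED).
PREFETCH_DISABLE_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU (L1 data) hardware prefetcher",
    3: "DCU IP prefetcher",
}


def disabled_prefetchers(msr_value):
    """Return the names of the prefetchers whose disable bit is set."""
    return [
        name
        for bit, name in PREFETCH_DISABLE_BITS.items()
        if (msr_value >> bit) & 1
    ]


for value in (0x3, 0xF):
    print(hex(value), disabled_prefetchers(value))
```

Under that assumed layout, a value of 3 leaves both L1 (DCU) prefetchers enabled, while 0xF disables all four.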

 

Hurray!

John D. McCalpin, PhD
"Dr. Bandwidth"

Thanks. So here's what I have observed till now:

Ivy Bridge (Core i7 - 3770): I can get L2_RQSTS and OFFCORE_RESPONSE counters to match after I turn off L2 prefetcher:

Example:

task -i -e L2_RQSTS:DEMAND_DATA_RD_HIT,L2_RQSTS:ALL_DEMAND_DATA_RD,OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE,L2_RQSTS:RFO_MISS,OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE ./cmd 

282424771		 L2_RQSTS:DEMAND_DATA_RD_HIT
453431389		 L2_RQSTS:ALL_DEMAND_DATA_RD
171006619		 OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE
93759998		 L2_RQSTS:RFO_MISS
93759998		 OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE
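
Since this run measured hits and total demand data reads rather than misses, the L2 miss count has to be derived by subtraction before comparing it with the offcore count. A short Python check of the numbers above (variable names are illustrative):

```python
# Ivy Bridge counts copied from the run above.
demand_rd_hit = 282_424_771  # L2_RQSTS:DEMAND_DATA_RD_HIT
demand_rd_all = 453_431_389  # L2_RQSTS:ALL_DEMAND_DATA_RD
offcore_rd = 171_006_619     # OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE
rfo_miss = 93_759_998        # L2_RQSTS:RFO_MISS
offcore_rfo = 93_759_998     # OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE

derived_rd_miss = demand_rd_all - demand_rd_hit

print("derived demand read misses:", derived_rd_miss)
print("offcore minus derived misses:", offcore_rd - derived_rd_miss)
print("RFO difference:", offcore_rfo - rfo_miss)
```

The derived read-miss count differs from the offcore count by a single event, and the RFO counts are identical.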

Haswell (Core i7 - 4770): With just L2 prefetcher turned off, I cannot get L2_RQSTS and OFFCORE_RESPONSE to match:

task -i -e L2_RQSTS:DEMAND_DATA_RD_MISS,OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE,L2_RQSTS:DEMAND_RFO_MISS,OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE ./cmd 

147112571		L2_RQSTS:DEMAND_DATA_RD_MISS
174721797		OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE
92979516		L2_RQSTS:DEMAND_RFO_MISS
47244887		OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE

When I turn off both L2 prefetcher and L1 prefetcher (by writing 0xF to MSR 0x1A4), I get demand data reads to match but not RFO.

task -i -e L2_RQSTS:DEMAND_DATA_RD_MISS,OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE,L2_RQSTS:DEMAND_RFO_MISS,OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE ./cmd

158255799		 L2_RQSTS:DEMAND_DATA_RD_MISS
158291310		 OFFCORE_RESPONSE_0:DMND_DATA_RD:ANY_RESPONSE
94228412		 L2_RQSTS:DEMAND_RFO_MISS
48380242		 OFFCORE_RESPONSE_1:DMND_RFO:ANY_RESPONSE

I have verified Haswell results on two different machines.
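
The Haswell mismatches are easier to compare as ratios. The Python sketch below only restates the counts from the two runs above as offcore-to-L2 ratios; it does not attempt to explain them, and the dictionary layout is just for illustration.

```python
# (L2_RQSTS miss count, OFFCORE_RESPONSE count) pairs from the Haswell runs above.
runs = {
    "L2 prefetcher off": {
        "demand data read": (147_112_571, 174_721_797),
        "demand RFO": (92_979_516, 47_244_887),
    },
    "L1 and L2 prefetchers off": {
        "demand data read": (158_255_799, 158_291_310),
        "demand RFO": (94_228_412, 48_380_242),
    },
}

ratios = {
    config: {kind: offcore / l2 for kind, (l2, offcore) in events.items()}
    for config, events in runs.items()
}

for config, per_kind in ratios.items():
    for kind, ratio in per_kind.items():
        print(f"{config:26s} {kind:17s} offcore/L2 = {ratio:.4f}")
```

With both prefetchers off the demand data read ratio is within about 0.03% of 1, while the RFO offcore count stays at roughly half of the L2 RFO miss count in both configurations.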

In my tests on SNB-EP systems I did not see any changes when I disabled the L1 HW prefetchers, but the test code that I used only has contiguous accesses that either all miss or all hit at any level of the cache.  There is very little in the documentation that helps to understand how the counters treat L1 HW prefetches, so I think quite a broad suite of tests would be required to understand what is going on.

The results so far show that there is something funny going on with L2 HW prefetches -- in my testing I see 20%-30% *fewer* demand read accesses to the L2 (hits + misses) when L2 HW prefetching is enabled.  That suggests that the data fetched by the L2 HW prefetchers is getting "picked up" by demand read accesses via a different path -- one that does not increment this performance counter.  (Can anyone think of another scenario that would fit these results?)

We know that there are significant undercounts for the LLC events due to some kind of bypassing, since the "partial workarounds" involve (mostly undocumented) features with names like "disable bypass", so it is certainly not implausible that bypass mechanisms exist at the L2 cache level as well.  The Haswell results above suggest that L1 HW prefetches are putting data in a bypass path that can be picked up by L1 Data Cache miss demand reads -- something that did not happen on SNB and IVB, but which is not a surprising evolution.  

The low counts for OFFCORE_RESPONSE:DMND_RFO:ANY_RESPONSE on the Haswell system might point to a simple bug in the OFFCORE_RESPONSE counter event, or may point to a bypass path that is available to RFOs on Haswell, but not on earlier systems, and which can be activated by demand RFOs (i.e., does not require HW prefetches to activate). 

Another possibility is that Haswell handles some interactions between demand accesses and prefetches differently.  For example, if an L2 miss buffer is allocated by a HW prefetch and a demand access reaches the buffer before the data is returned, the request type associated with the buffer might be changed and the information about the *original* request type could be lost.

Another possibility is that Haswell has different policies for HW prefetching.  Intel's documentation has always been a bit sparse about RFO prefetches (for both L1 and L2 HW prefetchers), and there is considerable fuzziness about the algorithm used to decide whether the L2 HW prefetches will bring the data into the LLC or into the LLC and the L2.   One could also imagine L1 prefetches bringing data into the LLC and L1, but not allocating it in the L2.   Unfortunately it is tricky to study any of these topics when you have to deal with uncertainties in both the underlying hardware behavior and the accuracy of the performance counters.

It might be interesting to see if the bypass-disable workarounds for the LLC undercounting on SNB have any impact on these L2 counts.

Unfortunately, the first-order takeaway is that these L2 access counters do not appear to give reliable results under normal operation (i.e., with the L2 HW prefetchers active).   The offcore response counters look reliable on SNB and IVB, but (at least) the RFO sub-event might be unreliable on Haswell.  This deserves more investigation -- perhaps those events are picked up in another category, or perhaps we will need to go to the CBo counters to capture the information we need.   (I have had trouble wrapping my head around the CBo counter definitions, so I have not included them in most of my analyses so far -- now that the Haswell EP Uncore Performance Monitoring Guide is available, I guess it is time to get to work on adding them.)

I have a lot of sympathy for the folks that have to implement the core performance counters -- it is extremely difficult to architect a monitoring facility when the thing being monitored is being designed by a different team and when that thing becomes more complex in unexpected ways from one generation to the next.  It is even worse when what you are trying to monitor involves the interaction of two or more subsystems being designed by two or more teams -- their primary focus has to be to (1) get the interaction right, and (2) make its performance better than the previous version.  Making sure that all paths between the units report information to the performance monitoring units in the way that the performance monitors expect does not have the same kind of career-determining implications as (1) and (2).   It gets worse when there is a fundamental inconsistency between what the performance monitoring unit wants and what the innovative new hardware design actually does.  This is one of the main reasons why it has been so hard for any vendor to have a stable set of performance event definitions.

John D. McCalpin, PhD
"Dr. Bandwidth"
