First, I can generally make excellent sense of the evictions/allocations on the L1D and correlate them with requests counted by L2_RQSTS; likewise, misses from L2_RQSTS can be correlated with both LONGEST_LAT_CACHE (which doesn't include any L2 HW PF) and OFFCORE_REQUESTS.
However, when doing long vector operations over data larger than the L3, which therefore miss all levels of the cache, I observe discrepancies between the L1D allocations and L2_RQSTS:Demand_RD, and between the L2 HW PF reported as missing the L2 by L2_RQSTS and the L2 HW PF arriving at OFFCORE_REQUESTS.

An illustration is a sequential read loop: a 64B read with 1 SW prefetch (512B ahead), then loop, copying 16MB of data. I observe via PMCs on SandyBridge that:

- There are 124 L1D Alloc/Evict
- L2_RQSTS:DEMAND_REQ = 82 (not 124!)
- L2_RQSTS:ALL_PF = 156 (23 hit and 133 miss)
- OFFCORE_REQUESTS: (ALL_DATA_RD - DEMAND_DATA_RD - DEMAND_CODE_RD) = 47 (this count includes L2 HW PF but not LLC HW PF)

Questions:
- the allocations/evictions associated with L1D do not match the DC requests/RFO requests to L2_RQSTS. I'm using a simple loop that either loads or stores, with software prefetch.
- can someone please explain this observation? The documentation for L2_RQSTS states that the count is inclusive of L1 HW prefetches as well as demand requests. Is the L1D able to read or write directly from the L3 when it detects that a streaming operation is being performed?
- please provide some explanation.
- The number of L2 HW PF missing in the L2, as reported by L2_RQSTS:PF_MISS, is much higher than the number that can be derived above from OFFCORE_REQUESTS (L2_RQSTS reports 133 misses, but OFFCORE_REQUESTS shows only 47 getting to the L3).
- why is this? I only observe this when memory intensive tests are run. Can someone please provide some explanation?
- OFFCORE_REQUESTS is very close to the UNC_CBO_CACHE_LOOKUP values, which I'm also measuring. I'm measuring across all CBOs (0-3), and the numbers by MESI state make sense: I-state requests are misses, while M/E/S are hits. By comparing the OFFCORE_REQUESTS counts for demand reads and L2 PF with the hits and misses on UNC_CBO_CACHE_LOOKUP, I believe that the L2 PF requests (the 47 reported by OFFCORE_REQUESTS) are missing the L3, and that approximately 82 demand requests are hitting in the L3, likely on data brought in by the L2 HW PF.
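For concreteness, the read loop described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the exact test code: the names `stream_read` and `make_buffer` are hypothetical, the loads are plain 8-byte reads rather than vector loads, and `__builtin_prefetch` (a GCC/Clang builtin) stands in for whatever prefetch instruction the real loop issues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LINE_BYTES     64   /* one cache line is read per iteration */
#define PREFETCH_AHEAD 512  /* software prefetch distance in bytes */

/* Hypothetical helper: allocate `bytes` bytes and fill every 8-byte word
 * with 1. Returns NULL if the allocation fails. */
uint64_t *make_buffer(size_t bytes)
{
    uint64_t *buf = malloc(bytes);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < bytes / sizeof(uint64_t); i++)
        buf[i] = 1;
    return buf;
}

/* Hypothetical sketch of the test loop: walk `bytes` bytes sequentially,
 * reading one full 64B line per iteration and issuing one software
 * prefetch 512B ahead of the current line. The sum is returned so the
 * compiler cannot discard the loads. */
uint64_t stream_read(const uint64_t *buf, size_t bytes)
{
    const size_t words_per_line = LINE_BYTES / sizeof(uint64_t);
    const size_t prefetch_words = PREFETCH_AHEAD / sizeof(uint64_t);
    const size_t nwords = bytes / sizeof(uint64_t);
    uint64_t sum = 0;

    assert(buf != NULL);
    for (size_t i = 0; i + words_per_line <= nwords; i += words_per_line) {
        /* rw = 0 (read), locality = 3 (keep in all cache levels) */
        if (i + prefetch_words < nwords)
            __builtin_prefetch(&buf[i + prefetch_words], 0, 3);
        for (size_t w = 0; w < words_per_line; w++)
            sum += buf[i + w];
    }
    return sum;
}
```

Calling `stream_read(buf, 16 * 1024 * 1024)` on a buffer from `make_buffer` walks the 16MB region once, one 64B line per iteration; the PMCs would be read before and after that call.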
If someone can explain the behavior and the discrepancies between the L1D requests and the L2_RQSTS/OFFCORE_REQUESTS HW PF counts, it would be very helpful.

Thanks in advance,
perfwise