I'm trying to analyze and compare two simple geometric multi-grid kernels using performance counters to see how much of the data (in Bytes) is coming from L1, L2, LLC and the DRAM for each implementation.
I realize that getting an accurate count is extremely difficult with so many different things going on underneath (prefetching, instructions, cache lines, etc.), so I am trying to get at least a *rough estimate*.
I'm using LIKWID to analyze my code and I was hoping to get what I need using the following counters:
More precisely, I was hoping that A = B + C + D + E, and that I could use these counters to get a rough estimate of the proportion of data coming from L1, L2, L3, and DRAM. For example, if my total bytes accessed were 1 GB, then the data served from L1 would be roughly 1 GB * B / A (i.e., B / (B + C + D + E)), and so on.
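To make the arithmetic concrete, here is a minimal sketch of the breakdown I had in mind. The counter values are entirely made up for illustration; B through E stand for the hypothetical per-level byte counts, assuming they really did partition the total traffic A:

```python
# Hypothetical per-level counts (placeholders for the real LIKWID
# event counts B..E; the numbers below are invented for illustration).
counters = {"L1 (B)": 6.0e8, "L2 (C)": 2.5e8, "L3 (D)": 1.0e8, "DRAM (E)": 0.5e8}

# Under my assumption, A should equal the sum of B..E.
total = sum(counters.values())

# Fraction of traffic attributed to each level: B / (B + C + D + E), etc.
fractions = {level: count / total for level, count in counters.items()}

total_bytes = 1e9  # e.g. 1 GB of total demanded data

for level, frac in fractions.items():
    print(f"{level}: {frac:.1%} -> {frac * total_bytes / 1e6:.0f} MB")
```

This is only the arithmetic I intended, not a claim about what the counters actually measure; as described below, the measured A turns out to be larger than the sum.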
Unfortunately, A > B + C + D + E, and I'm not sure why. Maybe B, C, D, and E don't count off-core cache accesses? Or maybe I'm just wrong about what these counters are counting.
So, basically, I have two questions:
1) What exactly are A, B, C, D, and E counting? and
2) Is there any way to get a (rough) breakdown of how much of my kernel's required/demanded data is coming from where (L1, L2, L3, or DRAM)?