We are trying to monitor the performance of our application. Our goal is to measure how many L3 cache misses are satisfied by local DRAM, remote DRAM, and remote cache on our 4-socket NUMA machine. The hardware counters we are currently using are listed in the Intel manual as MEM_LOAD_UOPS_RETIRED.LLC_MISS, MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM, and MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM. We have written a small benchmark to verify the results. We frequently see zero local DRAM accesses, even when running on a single core with the memory pinned to that core's socket. Furthermore, the amount of memory that is written and read exceeds the size of the L3 cache, so we expect misses to reach DRAM. When we add MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS to the mix, we see a huge number of L3 misses that are not accounted for by any of the DRAM counters. We have also tried disabling the prefetchers. What's going on here?
We're currently using four Intel Xeon E5-4620 processors.
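To make the accounting we expect concrete, here is a minimal Python sketch (not our actual tooling). The function names and the sample values are hypothetical. It assumes every LLC miss should be attributed to exactly one of the source counters, which is our expectation rather than something we have confirmed in the Intel manual.

```python
def unaccounted_llc_misses(llc_miss, local_dram, remote_dram, remote_cache=0):
    """Return the LLC misses not attributed to any data-source counter.

    All arguments are raw counter readings from the same measurement run.
    Assumption: each miss is served by exactly one source.
    """
    attributed = local_dram + remote_dram + remote_cache
    return llc_miss - attributed


def source_shares(llc_miss, local_dram, remote_dram, remote_cache=0):
    """Return the fraction of LLC misses attributed to each source."""
    if llc_miss == 0:
        # Avoid dividing by zero when no misses were recorded.
        return {"local_dram": 0.0, "remote_dram": 0.0,
                "remote_cache": 0.0, "unaccounted": 0.0}
    gap = unaccounted_llc_misses(llc_miss, local_dram, remote_dram, remote_cache)
    return {
        "local_dram": local_dram / llc_miss,
        "remote_dram": remote_dram / llc_miss,
        "remote_cache": remote_cache / llc_miss,
        "unaccounted": gap / llc_miss,
    }


if __name__ == "__main__":
    # Made-up readings for illustration only, not measurements from our machine.
    print(source_shares(llc_miss=8, local_dram=0, remote_dram=1, remote_cache=1))
```

In our runs the "unaccounted" share is the one that dominates, which is what we cannot explain.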