Following up on my previous question (http://software.intel.com/en-us/forums/topic/500928), I am still working with the Intel PMU. My configuration is a dual-socket Westmere-EP system running a Linux 3.11.0-15-generic SMP kernel.
In this context I wrote a benchmark application that does the following:
- Pins my single-threaded application to a given core C
- Allocates 64 megabytes of memory on the NUMA node associated with C (using Linux libnuma)
- Starts counting off_core_response events as described in the Intel Architectures Software Developer's Manual Volume 3B (table 18-15), with the MSR_OFFCORE_RSP register configured to count REMOTE_CACHE_FWD
- Reads all the allocated memory using pointer chasing
- Stops counting and displays the result: the number of REMOTE_CACHE_FWD events is of the same order of magnitude as the size of the allocated and read memory.
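To make the pointer-chasing step concrete, here is a simplified sketch of the idea, not the actual benchmark code. It assumes a 64-byte cache line and uses one node per line, linked into a single random cycle so that each hop touches a different line. The names `build_chain`, `chase` and `cycle_length` are made up for this illustration, and the buffer can come from any allocator (the real benchmark allocates it on a chosen NUMA node with libnuma).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* One node per cache line: following `next` always lands on another line. */
struct node {
    struct node *next;
    char pad[CACHE_LINE - sizeof(struct node *)];
};

/* Small xorshift64 generator, enough to shuffle the chain. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Link all n nodes into one random cycle (Sattolo's algorithm applied
 * directly to the `next` pointers). */
void build_chain(struct node *nodes, size_t n, uint64_t seed)
{
    assert(nodes != NULL && n > 0);
    uint64_t state = seed | 1; /* xorshift state must be non-zero */

    for (size_t i = 0; i < n; i++)
        nodes[i].next = &nodes[i];

    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&state) % i); /* j in [0, i) */
        struct node *tmp = nodes[i].next;
        nodes[i].next = nodes[j].next;
        nodes[j].next = tmp;
    }
}

/* Follow the chain for `hops` dependent loads; the final pointer is
 * returned so the compiler cannot drop the loop. */
const struct node *chase(const struct node *start, size_t hops)
{
    const struct node *p = start;
    for (size_t i = 0; i < hops; i++)
        p = p->next;
    return p;
}

/* Number of hops needed to come back to `start`. */
size_t cycle_length(const struct node *start)
{
    size_t hops = 0;
    const struct node *p = start;
    do {
        p = p->next;
        hops++;
    } while (p != start);
    return hops;
}
```

With a 64-megabyte buffer and 64-byte lines this gives 1,048,576 nodes, and calling `chase(nodes, 1048576)` between starting and stopping the counters reads every line exactly once.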
The code is available here: https://github.com/ManuelSelva/c4fun/blob/master/pmu_msr/pmu_msr.c
Changing the benchmark to allocate the memory on the remote NUMA node results in a near-zero REMOTE_CACHE_FWD count (I checked that these events are replaced by offcore_response events with response REMOTE_DRAM).
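For reference, the two MSR_OFFCORE_RSP values being compared can be sketched as below. This is only an illustration under assumptions: the bit positions are the ones I believe apply to Nehalem/Westmere (request types in bits 0-7, response types in bits 8-15) and should be checked against the SDM table; the macro names and the `offcore_rsp_value` helper are made up for this example, and the demand-data-read request type is an assumption rather than necessarily what the benchmark programs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Nehalem/Westmere OFFCORE_RSP layout; verify against the SDM. */
#define REQ_DMND_DATA_RD     (1u << 0)  /* request: demand data read */
#define RSP_REMOTE_CACHE_FWD (1u << 12) /* response: forwarded by a remote cache */
#define RSP_REMOTE_DRAM      (1u << 13) /* response: served by remote DRAM */
#define RSP_LOCAL_DRAM       (1u << 14) /* response: served by local DRAM */

/* Hypothetical helper: combine one request mask and one response mask
 * into the 16-bit value written to the MSR. */
uint64_t offcore_rsp_value(uint32_t request_mask, uint32_t response_mask)
{
    assert((request_mask & ~0xFFu) == 0);    /* requests live in bits 0-7 */
    assert((response_mask & ~0xFF00u) == 0); /* responses live in bits 8-15 */
    return (uint64_t)(request_mask | response_mask);
}
```

Under these assumptions, counting demand data reads answered with REMOTE_CACHE_FWD means writing 0x1001, and the REMOTE_DRAM check uses 0x2001.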
So my question is: what are these REMOTE_CACHE_FWD events, and how can I get such events in the benchmark described above? I expected to observe them only in a multithreaded application where cores on different sockets share data. Is this true?
Thanks in advance for any hint you may have on the subject.