I ran a few NUMA memory bandwidth tests on our 4-socket Intel testbed, using the STREAM benchmark and VTune to monitor the hardware counters. The CPU model is Intel(R) Xeon(R) CPU E5-4620 (Sandy Bridge microarchitecture).
All the tests ran on the same host with the same settings, except for the combination of memory and CPU node bindings (local/remote).
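For concreteness, the binding combinations could be enumerated as in the sketch below. It assumes the runs are launched through numactl with its `--cpunodebind` and `--membind` options, a layout of 4 NUMA nodes (one per socket), and a STREAM binary at `./stream`. The helper names are hypothetical and are not taken from the actual test scripts.

```python
from itertools import product


def numactl_command(cpu_node, mem_node, binary="./stream"):
    """Build a numactl command line pinning CPUs and memory to given nodes.

    The binary path "./stream" is an assumption for illustration.
    """
    return [
        "numactl",
        f"--cpunodebind={cpu_node}",  # run threads only on this node's cores
        f"--membind={mem_node}",      # allocate memory only from this node
        binary,
    ]


def binding_matrix(num_nodes=4, binary="./stream"):
    """Enumerate every (CPU node, memory node) pair, labelled local or remote."""
    runs = []
    for cpu_node, mem_node in product(range(num_nodes), repeat=2):
        kind = "local" if cpu_node == mem_node else "remote"
        runs.append((kind, numactl_command(cpu_node, mem_node, binary)))
    return runs


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch is safe to run.
    for kind, cmd in binding_matrix():
        print(f"{kind:6s} {' '.join(cmd)}")
```

Printing the commands rather than running them keeps the sketch independent of the host. Each printed line can be pasted into a shell, or passed to `subprocess.run`, on a machine where numactl is installed.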
As expected, the local memory access bandwidth is twice the remote bandwidth. But some of the hardware counter readings are hard to explain.
The first one is LLC misses. The local case has only half as many LLC misses as the remote case. Is this related to the prefetch mechanisms? Both the remote and local access cases should have a similar number of cache misses, right?
The second is the LOAD_HIT_PRE.HW_PF reading. The local case has only one third as many prefetch hits as the remote case. That is also the opposite of what we expected.
What could explain these results?