I'm looking into memory bandwidth results on a Xeon E5-2620 v3 system with 2 NUMA nodes connected by 2 QPI links. With NUMA enabled in the BIOS, Intel's Memory Latency Checker (mlc) reports ~44 GB/s local throughput but only ~6 GB/s remote, which looks too low:
                Numa node
Numa node         0          1
    0         44266.2     6004.0
    1          5980.9    44311.9
With NUMA disabled (which, as far as I understand, results in cache-line interleaving across the two nodes), the combined throughput is ~40 GB/s, and PCM shows increased QPI traffic in this mode. Given the numbers measured with NUMA on, I would have expected the interleaved figure to land somewhere between 6 GB/s and 44 GB/s, yet it is close to the local-only number:
               Memory node
Socket            0          1
    0         39537.2    39588.7
    1         39515.2    39527.0
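To make my confusion concrete, here is the back-of-envelope I'm working from. It assumes cache-line interleaving sends roughly half of all accesses to the remote node, so the blended rate should be the harmonic mean of the local and remote figures (these assumptions are mine, not something mlc reports):

```python
# If local and remote paths really sustained 44 and 6 GB/s, and half of all
# interleaved accesses crossed QPI, the blended rate would be the harmonic
# mean of the two, far below the ~40 GB/s actually measured with NUMA off.
local_gbs, remote_gbs = 44.0, 6.0
blended = 2 / (1 / local_gbs + 1 / remote_gbs)
print(f"expected interleaved rate: {blended:.1f} GB/s")  # ~10.6 GB/s
```

So either the 6 GB/s remote number is misleading, or my model of interleaved traffic is wrong.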
I'm also curious how the tool (mlc) measures bandwidth. Does it rely on PMU counters, or does it just count the memory operations from the standpoint of a client?
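For clarity, here is a minimal sketch of what I mean by the "client standpoint" option: stream a buffer much larger than the LLC and divide the bytes moved by the elapsed wall time. This is purely my guess at the style of measurement; the buffer size and the read-plus-write accounting are my own assumptions, not mlc's actual implementation:

```python
# Hypothetical client-side bandwidth measurement: time one pass over a
# large buffer and report bytes moved per second. No PMU counters involved.
import time

BUF_BYTES = 64 * 1024 * 1024   # 64 MiB, well beyond any LLC

src = bytearray(BUF_BYTES)     # source buffer, touched so pages are mapped
t0 = time.perf_counter()
dst = bytes(src)               # one read pass over src plus one write of dst
dt = time.perf_counter() - t0

# Count both the read and the write traffic, as a client would see it.
gbs = 2 * BUF_BYTES / dt / 1e9
print(f"copy bandwidth: {gbs:.1f} GB/s")
```

If mlc instead reads uncore/PMU counters, the reported figure could include traffic the client never issued (prefetches, write-backs), which might explain some of the discrepancy I'm seeing.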