I am measuring memory-copy performance on a NUMA machine with four Intel Xeon E5-4620 processors. When I copy data within local memory, I get up to almost 10 GB/s. When I copy data from remote memory, however, performance is much worse: only around 1 GB/s. I use memcpy() to copy the data, and each copy is one page (4 KB).
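A minimal sketch of the kind of measurement described above, to make the setup concrete. It is not the exact benchmark code: the names `copy_by_pages`, `measure_copy_gbps`, and `PAGE_SIZE_BYTES` are made up for illustration, and NUMA placement of the two buffers is assumed to be handled by the caller (for example by allocating them with libnuma's `numa_alloc_onnode()`), so the sketch itself only needs POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Assumption: 4 KB per memcpy() call, as stated in the question. */
#define PAGE_SIZE_BYTES 4096

/* Copy `bytes` from src to dst one page per memcpy() call.
 * A trailing partial page is skipped. Returns the bytes actually copied. */
static size_t copy_by_pages(char *dst, const char *src, size_t bytes)
{
    size_t copied = 0;
    while (copied + PAGE_SIZE_BYTES <= bytes) {
        memcpy(dst + copied, src + copied, PAGE_SIZE_BYTES);
        copied += PAGE_SIZE_BYTES;
    }
    return copied;
}

/* Time `rounds` page-by-page passes over the buffers and return the
 * observed bandwidth in GB/s (1 GB = 1e9 bytes).
 * The caller is expected to have written to src beforehand, so that
 * its pages are already backed by physical memory on the intended node. */
static double measure_copy_gbps(char *dst, const char *src,
                                size_t bytes, int rounds)
{
    struct timespec t0, t1;
    size_t total = 0;

    /* Untimed warm-up pass so first-touch page faults on dst are not measured. */
    copy_by_pages(dst, src, bytes);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds; i++)
        total += copy_by_pages(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return 0.0; /* interval too short for the clock to resolve */
    return (double)total / secs / 1e9;
}
```

For a meaningful number, the buffers should be much larger than the last-level cache; otherwise the loop mostly measures cache bandwidth rather than local or remote DRAM.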
I wonder whether Intel processors provide special instructions for inter-processor data movement. I know Intel uses QPI for inter-processor communication. Does QPI expose any interface to programmers? Is the performance above the best I can get?