We have an Intel Sandy Bridge E5-4640 machine with 4 sockets. According to the specs, QPI should provide 16 GB/s from one NUMA node to another. I ran a STREAM benchmark with all the memory allocated on node 1 and all 16 threads running on node 0, so I would expect the measured bandwidth to be close to the QPI figure. However, I see only ~4 GB/s. The local memory bandwidth is also limited to ~30 GB/s, where I expected ~50 GB/s. Any ideas why this might be the case?
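
To make the comparison concrete, here is a minimal Python sketch of the arithmetic behind these numbers. It assumes the usual STREAM byte-counting convention (8-byte doubles, 16 bytes per iteration for copy and scale, 24 for add and triad) and 1 GB = 1e9 bytes. The function names, the array size, and the timing are hypothetical, picked only so that the result lands on the ~4 GB/s figure quoted above; they are not measurements from this machine.

```python
# Bytes moved per loop iteration under the STREAM counting convention
# (8-byte doubles; each array read or write is counted once).
BYTES_PER_ITERATION = {"copy": 16, "scale": 16, "add": 24, "triad": 24}


def stream_bandwidth_gbs(kernel, n_elements, seconds):
    """Return the bandwidth in GB/s (1 GB = 1e9 bytes) for one kernel pass."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return BYTES_PER_ITERATION[kernel] * n_elements / seconds / 1e9


def fraction_of_expected(measured_gbs, expected_gbs):
    """Return measured bandwidth as a fraction of the expected figure."""
    return measured_gbs / expected_gbs


if __name__ == "__main__":
    # Hypothetical run: triad over 10 million elements taking 0.06 s.
    remote = stream_bandwidth_gbs("triad", 10_000_000, 0.06)
    print(f"remote: {remote:.1f} GB/s, "
          f"{fraction_of_expected(remote, 16.0):.0%} of the 16 GB/s QPI figure")
    # Local case, using the ~30 GB/s and ~50 GB/s figures from the question.
    print(f"local: {fraction_of_expected(30.0, 50.0):.0%} of the expected 50 GB/s")
```

Plugging in the per-kernel times that STREAM reports makes it easy to see how far each case falls short of its expected figure, and whether the remote and local runs fall short by a similar proportion.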