I'm optimising a multi-threaded application for a Xeon E5-2680 system (2 sockets, 8 physical cores per socket). Threads are bound to cores through affinity: threads 0-7 to cores 0-7 on socket 0, then threads 8-15 to cores 8-15 on socket 1. Performance scales linearly up to 8 threads, i.e. as long as the application is confined to a single socket (or NUMA node). With more than 8 threads, execution times become highly variable.
I suspect this is caused by inefficient remote memory accesses. How can I confirm that with VTune? Is there a way to count the number of remote memory accesses? Which analysis should I run?
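
For illustration, the binding scheme described above can be sketched as follows. This is a simplified sketch, not the application's actual code. The helper names (`core_for_thread`, `socket_for_core`, `pin_current_thread`) are hypothetical. It assumes Linux and assumes that logical CPU ids 0-7 belong to socket 0 and 8-15 to socket 1. The real numbering depends on the machine and should be checked (for example with `lscpu`) before drawing conclusions about which threads run on the remote socket.

```python
import os

CORES_PER_SOCKET = 8  # from the description: 8 physical cores per socket
NUM_SOCKETS = 2       # from the description: 2 sockets


def core_for_thread(thread_id):
    """Return the core a thread is bound to: thread i -> core i."""
    if not 0 <= thread_id < CORES_PER_SOCKET * NUM_SOCKETS:
        raise ValueError("thread_id out of range: %d" % thread_id)
    return thread_id


def socket_for_core(core_id):
    """Return the socket (NUMA node) a core belongs to.

    ASSUMPTION: core ids are numbered contiguously per socket
    (0-7 on socket 0, 8-15 on socket 1). Verify this on the target machine.
    """
    return core_id // CORES_PER_SOCKET


def pin_current_thread(thread_id):
    """Bind the calling thread to its core (Linux only, hypothetical helper)."""
    core = core_for_thread(thread_id)
    if hasattr(os, "sched_setaffinity"):
        # On Linux, pid 0 refers to the caller.
        os.sched_setaffinity(0, {core})
    return core


# Show which socket a few threads end up on under this scheme.
for t in (0, 7, 8, 15):
    print(t, socket_for_core(core_for_thread(t)))
```

Under this numbering assumption, threads 0-7 land on socket 0 and threads 8-15 on socket 1, so any data first touched by a thread on socket 0 is remote for threads 8-15.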