I have read the book Intel Xeon Phi High Performance Programming, and something in chapter 8 (Architecture) bothered me. To quote:
Core-to-core transfers are not "always" significantly better than memory latency times. Optimization for better core-to-core transfers has been considered, but because of ring hashing methods and the resulting distribution of addresses around the ring, no software optimization has been found that improves on the excellent built-in hardware optimization. No doubt people will keep looking! The architects for the coprocessor, at Intel, maintain searching for such optimizations through alternate memory mappings will not matter in large part because the performance of the on-die interconnect is so high.
In what situations do core-to-core transfers perform poorly, and to what extent can we rely on the on-die interconnect (ODI)? Suppose the communication is relatively fine-grained, or, say, 4 different cores try to communicate over the ODI at the same time. Will this have a major impact on ODI performance because it is ring-based? How many simultaneous communications can the ODI handle?
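To make the question concrete, here is a toy model of what "distribution of addresses around the ring" might imply. It is only a sketch under loud assumptions, not a description of the actual hardware: the ring is treated as bidirectional, a core-to-core transfer is simplified to requester -> directory -> owner -> requester, the directory stop for an address is assumed to be picked uniformly by the hash, and the stop count of 60 and all function names are hypothetical.

```python
from statistics import mean


def ring_hops(src, dst, stops):
    """Shortest hop count between two stops on a bidirectional ring."""
    d = abs(src - dst) % stops
    return min(d, stops - d)


def transfer_hops(requester, owner, directory, stops):
    """Hops for the simplified path requester -> directory -> owner -> requester."""
    return (ring_hops(requester, directory, stops)
            + ring_hops(directory, owner, stops)
            + ring_hops(owner, requester, stops))


def expected_hops(requester, owner, stops):
    """Average hop count over every directory position (uniform hash assumed)."""
    return mean(transfer_hops(requester, owner, d, stops) for d in range(stops))


STOPS = 60  # hypothetical number of ring stops, chosen only for illustration

near = expected_hops(0, 1, STOPS)           # requester and owner are neighbours
far = expected_hops(0, STOPS // 2, STOPS)   # requester and owner are opposite

print("neighbouring cores:", near, "hops on average")
print("opposite cores:    ", far, "hops on average")
```

In this model, even two neighbouring cores pay for the trip to a directory stop that software cannot choose, which would be consistent with the book's remark that placing communicating threads on nearby cores gains little. It says nothing about contention or about how many transfers the ring can carry at once, which is the part I am asking about.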