Question about ODI (on-die-interconnect) performance

Hi all,

I have been reading the book Intel Xeon Phi High Performance Programming, and in Chapter 8 (Architecture) there was something that bothered me. To quote:

Core-to-core transfers are not "always" significantly better than memory latency times. Optimization for better core-to-core transfers has been considered, but because of ring hashing methods and the resulting distribution of addresses around the ring, no software optimization has been found that improves on the excellent built-in hardware optimization. No doubt people will keep looking! The architects for the coprocessor, at Intel, maintain searching for such optimizations through alternate memory mappings will not matter in large part because the performance of the on-die interconnect is so high.

In what situations do core-to-core transfers work poorly, and to what extent can we rely on the ODI? For example, if the communication is relatively fine grained, or if 4 different cores try to communicate over the ODI at the same time, will this have a major impact on ODI performance because it is ring based? How many simultaneous communications can the ODI handle?

We did not mean to imply that core-to-core transfers work poorly. The ring is much better than a bus would be, since many simultaneous transfers can be in flight on the ring at once. The ring is bidirectional, so you can think of it as a long series of point-to-point connections from one core to the adjacent core. With 61 cores, a large number of messages can be in flight at one time: essentially, each point-to-point link can carry communication independent of the other point-to-point connections that make up the rest of the ring.

The system is highly tuned to distribute memory accesses around the ring, so that making numerous requests to adjacent memory will not bottleneck on a single memory bank. Depending on your application, a large number of requests to adjacent memory that happens to be cached (and dirty) in another core may create enough of a bottleneck at that single core that the additional latency of a memory bank would have been similar.

The observation we made is correct, and was made to point out that optimizing for core-to-core transfers has not proven productive in our experience, but we do not have benchmarks to prove the point.

We'd be very interested to see any investigations created by others to try to illustrate this behavior.
