The Intel® Xeon Phi™ coprocessor: What is it and why should I care? Pt 1: Fitting it all on one slab of silicon

Hi, my name is Taylor Kidd. You may know me from such notables as "The Beginning Intel® Xeon Phi™ Coprocessor Workshop" and "The Advanced Intel® Xeon Phi™ Coprocessor Workshop," where I mesmerized audiences with over 10 hours of highly technical information. (See the "Training" tab for both.)

I am here to give you a very short introduction to the Intel® Many Integrated Core (Intel® MIC) Architecture. Hence the brief and snappy title, "The Intel Xeon Phi coprocessor: What is it and why should I care?"

The Intel® MIC Architecture, a.k.a. the Intel® Xeon Phi™ coprocessor, is what some have called a supercomputer on a card. If you are going to go that far, you might as well call it a supercomputer on a chip, since it all sits on one slab of silicon.

This simple fact is its biggest advantage as well as its biggest limitation. You see, by putting everything on one chip, designers are able to tightly integrate the entire coprocessor and avoid the onerous limits that result from having to move data off of, and then back onto, different pieces of silicon. This allows a remarkable degree of cooperation, and considerable communication bandwidth, between the coprocessor's cores. And what a number of cores it is: depending upon the SKU, you can have more than 60 cores on that coprocessor, all on a single slab of silicon.

What is the greatest bottleneck related to computation in modern processors? That is right, it is communication. When you are trying to squeeze all that data across a small number of pins, something has got to give. By having everything on one slab of silicon, you have just greatly reduced the magnitude of this problem. By the way, the figure below is for illustrative purposes only.

This same advantage, though, is also its greatest constraint. You have to make compromises to squeeze that many cores onto one slab of silicon. Remember that the largest slab of silicon you can reliably create (combined with the feature size) sets the upper limit on the number of transistors you have available as building blocks. So an Intel Xeon Phi coprocessor with 60+ cores has roughly the same number of transistors available to it as a state-of-the-art Intel® Xeon® processor. Wait: we have eight cores on one hand (a top-of-the-line Intel® Xeon® processor) and 60+ cores on the other (the Intel Xeon Phi coprocessor). So how does the Intel Xeon Phi coprocessor do it with the same number of transistors?

Let us take a look at the maximum transistor count possible on a useful slab of silicon. By "useful," I mean something that you can use to create a commercial product. So, I am leaving out all those "world records" where the maximum transistor count is some unbelievably large number, but the defect rate is so high and the yield so small that it will never be economically viable. Right now that number is around 5 billion. And that is to build one state-of-the-art, souped-up Intel® 8-core / 16-thread processor. If that is the state of the art, how do we get 60+ cores / 240+ threads on one slab of silicon?


The answer is simple once you think about it. What is the big advantage of the Intel MIC Architecture? Is it because it is the fastest processor around? No. It is because it has a whole lot of cores that can support even more threads, meaning that the advantage is in numbers, not speed. So why not use an older, smaller, but still very capable core? And that is what they did. The designers went back generations, literally back to one of the first modern cores, the Intel® Pentium® processor. The Intel® Pentium® processors had transistor counts between 3 and 40 million, depending upon which generation you choose. An 8-core Intel® Xeon® Processor E7 has 2.6 billion, with a "b." If you do the math, you can now get 60+ cores on one slab of silicon along with all that other stuff you need to tie them together.
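To make that math concrete, here is a back-of-the-envelope sketch. The transistor counts are the figures quoted above; treating the whole budget as available for cores is a deliberate simplification, since caches and the interconnect also consume transistors.

```python
# Back-of-the-envelope core-count estimate, using the figures quoted above.
XEON_E7_TRANSISTORS = 2_600_000_000    # 8-core Intel Xeon E7 budget ("2.6 billion")
PENTIUM_CORE_TRANSISTORS = 40_000_000  # upper end of the Pentium-era range (3-40 million)

# If the whole transistor budget went to Pentium-class cores alone:
raw_cores = XEON_E7_TRANSISTORS // PENTIUM_CORE_TRANSISTORS
print(raw_cores)  # 65 -- comfortably above 60

# In practice, caches, the interconnect, and the wide vector units also
# consume transistors, so the realizable count lands at "60+".
```

Even at the high end of the Pentium-era range, the same budget covers 60+ cores with room left over for the glue logic.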




Taylor IoT Kidd:

Hi Dmitry, pardon the delayed response. It was the end of the quarter, and we all know what that means. You're right: each core has 4 hardware contexts independent of the vector engine (except that there is only one vector engine per core). The result is a maximum of 240+ hardware threads, though your mileage will vary since all 4 contexts share the same pipeline.
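The thread math above can be sketched as follows; the 60-core count is an illustrative assumption, since the exact number varies by SKU.

```python
# Thread-count sketch for the figures in this thread (60-core SKU assumed).
CORES = 60                 # "more than 60 cores," depending on SKU
HW_CONTEXTS_PER_CORE = 4   # 4 hardware thread contexts per core
VECTOR_UNITS_PER_CORE = 1  # all 4 contexts share a single vector engine

threads = CORES * HW_CONTEXTS_PER_CORE
print(threads)  # 240 -- the "240+" hardware-thread figure
```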

To find a painful amount of additional detail, take a look at the System Software Developers Guide (SDG), specifically chapter 2.

Dmitry Oganezov (Intel):

Great post, Taylor! Though I don't quite understand one little thing...

It's clear that we can get 60+ cores on one slab of silicon, considering that these are close to classic Pentium cores. But what about 240+ threads? Do you mean 4 threads per core because of the in-order pipeline, or because of vector instructions?

