Part 1 showed how to fit 60+ cores onto a single chip. Part 2 showed how those cores can, theoretically, execute up to 3840 instructions in a single clock cycle [1]. Even with the advances of the latest generation of processors, this is a figure that cannot be beaten. So why is every computer not built this way? Why do we need those big-core processors? It is because we mostly write serial programs. We think serially, plodding along doing only one thing after another, so we program serially. It does not matter that it is possible to do 3840 operations simultaneously if your program only allows you to do one instruction at a time.
So am I saying that the large degree of parallelism available on the Intel Xeon Phi coprocessor is not useful? Not at all, but it does limit the class of applications where the coprocessor is most effective. Eventually, we will all probably think in parallel terms, but right now, such thinking is limited to scientists, and some strange computer programmers, who study things like airflow around jets, exploding stars, and turbulent flow around honey bee wings. And what are those supercomputers we always hear about computing? Well, they compute airflow around jets, exploding stars, turbulent flow around honey bee wings, and a whole host of other similar things. This is why Diane Bryant, SVP of the Datacenter and Connected Systems Group at Intel, when introducing the Intel Xeon Phi coprocessor, referred to the coprocessor as a supercomputer in a box.
THE HARE AND THE TORTOISE’S EXTENDED FAMILY
Now that we have spent all that time setting the stage, let us take a look at a more evocative analogy. Let us look at the classic case of the tortoise vs the hare of storybook fame, though in this scenario the hare does not fall asleep under a tree but keeps on doing his best all the way through the race.
Here is the scenario. There are two huge barrels that have to be filled with apples. There are two huge piles of apples, one for the hare and one for the tortoise. The tortoise fills one barrel while the hare fills the other. Each pile has 16384 apples. The hare, as you predicted, is faster than the tortoise, some 128 times faster. If we assume both the tortoise and the hare are diligent high tech workers who labor ceaselessly for their corporate masters, the hare will get his barrel filled (including unpaid overtime) in 1/128th of the time required by the tortoise. (See the clever illustration.)
I bet you can see where I am going with this analogy, given that you have obviously graduated from primary school. If we instead had 3840 tortoises working together against that one hare, then even though the hare is 128 times faster than any given tortoise, the team as a whole is 3840/128 = 30 times faster than he is. He will have only about one thirtieth of his barrel filled by the time the tortoises finish.
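For those who would rather check the arithmetic than trust my drawing, here is a minimal sketch of the race in Python. The numbers come straight from the story; everything else is my own assumption for illustration: one tortoise moves one apple per time unit, the tortoises split the pile evenly with no coordination overhead, and the function names are made up for this example.

```python
import math

APPLES_PER_PILE = 16384  # apples in each pile (from the story)
HARE_SPEEDUP = 128       # the hare is 128 times faster than one tortoise
NUM_TORTOISES = 3840     # one tortoise per simultaneous operation

def tortoise_team_time(apples, tortoises):
    """Time units for the team to empty the pile.

    Assumption: one tortoise moves one apple per time unit and the pile
    is split evenly, so the team finishes when the busiest tortoise does.
    """
    return math.ceil(apples / tortoises)

def hare_apples_moved(time_units, speedup, apples):
    """Apples the hare has moved after time_units, capped at the pile size."""
    return min(apples, time_units * speedup)

# One tortoise against one hare: the hare wins easily.
lone_tortoise_time = tortoise_team_time(APPLES_PER_PILE, 1)
lone_hare_time = math.ceil(APPLES_PER_PILE / HARE_SPEEDUP)

# 3840 tortoises against one hare: the tables turn.
team_time = tortoise_team_time(APPLES_PER_PILE, NUM_TORTOISES)
hare_done = hare_apples_moved(team_time, HARE_SPEEDUP, APPLES_PER_PILE)
hare_fraction = hare_done / APPLES_PER_PILE

print(f"Lone tortoise: {lone_tortoise_time} time units, hare: {lone_hare_time}")
print(f"Tortoise team finishes in {team_time} time units")
print(f"Hare has moved {hare_done} apples ({hare_fraction:.1%} of his barrel)")
```

Because apples do not come in fractions, the busiest tortoise carries 5 apples rather than 4.27, so in this sketch the hare gets to about 4% of his barrel instead of the idealized one thirtieth. Either way, he is nowhere near done.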
Need I point out that the host of tortoises is an Intel Xeon Phi coprocessor, and the hare is a latest generation IA processor? I did not think so.
So that is how and when the Intel Xeon Phi coprocessor can achieve its great performance, and why that tortoise is so snappy and happy.
Yes, I know about FLOPs and GPGPUs, but that is a different story. Let me just say that it is not just apples that need to be moved into those barrels.
*No hare, tortoise or copyright was hurt in the writing of this series of blogs. The 3rd grade drawing is mine and mine alone.
[1] Since the coprocessor’s cores execute asynchronously, this is not really a single clock cycle so much as equivalent to a single clock cycle.