The Intel Xeon Phi coprocessor: What is it and why should I care? Part 2: Getting even more parallelism

In part 1, we talked about how it was possible to squeeze all those 60+ cores onto one slab of silicon. Even so, 60+ Intel® Pentium® processor cores do not get you all the way to the actual performance of an Intel Xeon Phi coprocessor. You need to magnify the computational advantage of those 60+ cores even further, and given the design of the Intel MIC architecture, that means even more parallelism. The designers found two different ways of magnifying the capability of each of those Intel® Pentium® generation cores. The first was increasing the number of hardware threads that can execute per core. Do you recall the claim that we only use a quarter of our brain's capability? The same can be said of a computer core: it has a whole host of capability, but much of it lies unused. For example, if you are executing an add instruction, what is the multiplication circuitry doing, playing Pinochle?

That circuitry is just sitting there idle, taking up room and energy but doing little else. So why not use it? Intel did just that by enabling cores to execute multiple instructions simultaneously, provided they do not need the same circuitry at the same time. Modern IA cores do this today by allowing two hardware threads to execute simultaneously. Given the special-purpose environment of the Intel Xeon Phi coprocessor, the designers knew they could get away with four simultaneous threads. Now remember, four simultaneous threads is the maximum per core; the number of HW threads that actually pays off will vary. For a well-optimized application, it is roughly three.

Even with four threads executing per core, the generational gap between a modern big core and the Intel® Pentium® generation was still not closed. The Intel MIC Architecture needed still more parallelism.

First, let me give you some background. SIMD stands for Single Instruction, Multiple Data, one of the four computer architectures Flynn defined in 1966. The conventional computers we are all familiar with are SISD, or Single Instruction, Single Data: the computer executes one instruction on one piece of data at a time. With SIMD, that one instruction operates on multiple pieces of data simultaneously. Here is a simple way to visualize this. Say we have eight pairs of numbers to add. A SISD computer, i.e. a conventional computer, performs eight adds, one right after another. In contrast, a SIMD computer lines the two sets of eight data items up and executes that same add simultaneously, i.e. in parallel, on each pair. Thus you have a Single Instruction operating simultaneously on Multiple Data.

Most processors since the Intel® Pentium® generation have had some SIMD capability. For example, the Intel® Pentium® processor with MMX™ technology had a 64-bit SIMD engine that could operate on two packed 32-bit values with a single instruction (MMX™ itself handled packed integers; packed floating point SIMD arrived with SSE). So in our example above, a 64-bit SIMD engine could in principle add those eight values in four instructions by taking them two at a time. Life, and the computer industry, have not been idle since the Intel® Pentium® MMX™ days, and each generation has widened the SIMD engine. Intel® AVX, the latest generation, is 256 bits wide (eight 32-bit FP values) compared with the 64 bits (two values) of the MMX™ era.

This is how the Intel MIC Architecture gets the scaling it needs. It combines a large number of Intel® Pentium® generation cores (60+), enhances those cores with the ability to run four threads per core, and adds to that a whopping 512-bit (16 FP value) SIMD engine. Putting it all together, you have the capability for more than 60 * 4 * 16 = 3840 operations in flight simultaneously. Unfortunately, this does not translate to 3840 simultaneous FP operations, since only one of those four threads can use the SIMD engine at a time.


Next: PART 3: “Splitting Hares and Tortoises too”