Seeing One TeraFLOP/s, the software side, and feeling a bit emotional

I've known this day was coming, but when I saw Knights Corner clearly sustaining one TeraFLOP/s on DGEMM across a wide range of block sizes, I was surprised by my own emotional reaction. It is hard to describe; it was a good feeling.

On Tuesday, November 15, 2011, we showed a Knights Corner co-processor outside Intel for the first time. It is fresh silicon, first silicon, which is always exciting (if it works at all). Not only does it work, we were able to boot Linux* on it and demonstrate a very real, sustained TeraFLOP/s. We ran DGEMM with many block sizes and got very consistent results, which is something not all hardware and software can do. Our Math Kernel Library product will include DGEMM for Knights Corner when it comes out as a product, so this will be reproducible by all.

To our knowledge, we demonstrated the world's fastest DGEMM, and the first to go above one TeraFLOP/s on DGEMM. And it is a conservative measure: a real, sustained TeraFLOP/s (not a "raw" or other theoretical measure). And it is doing it now, not just on "paper." That part really hit home as I looked at it.
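For readers wondering what "sustained" means in practice, here is a minimal sketch of how a DGEMM rate is usually derived from a timed run. It assumes the conventional count of 2*m*n*k floating-point operations for multiplying an m-by-k matrix by a k-by-n matrix. The function names, matrix size, and timing are hypothetical illustrations, not the figures from our demonstration.

```python
def dgemm_flop_count(m, n, k):
    """Conventional operation count for C = A*B with A (m x k) and B (k x n):
    one multiply and one add per inner-product term, so 2*m*n*k in total."""
    return 2 * m * n * k


def sustained_tflops(m, n, k, seconds):
    """Sustained rate in TeraFLOP/s: operations actually performed,
    divided by measured wall-clock time (not a theoretical peak)."""
    return dgemm_flop_count(m, n, k) / seconds / 1e12


# Hypothetical example: a 10000 x 10000 x 10000 DGEMM finishing in 2 seconds.
rate = sustained_tflops(10000, 10000, 10000, 2.0)
print(f"{rate:.2f} TeraFLOP/s")
```

A "raw" or peak number, by contrast, comes from multiplying clock rate by the operations the hardware could issue per cycle, whether or not any real program achieves it.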

I knew what I would see; Knights Corner was not a surprise to me. But when I saw it, and could reach out and touch it, I was struck by the power. I was part of the ASCI Red project between Intel and Sandia National Labs, which built the world's first TeraFLOP/s computer. We reached the same point (one TeraFLOP/s) in December 1996, before we had finished building the machine. Now we've done it again... this time with a single processor. Both used x86 processors from Intel: ASCI Red used over nine thousand Pentium Pro processors (later upgraded to Pentium II Xeon processors to be the first past 2 TeraFLOP/s), and now a single pre-production Knights Corner does the same.

Obviously, both projects involve a lot of people both inside and outside Intel. There is a great team inside Intel, and great partners, that can all feel good about both accomplishments. I'm happy to be one of a handful of people involved in both "firsts" to a TeraFLOP/s.

And the trend continues. By the end of this decade, we should see a TeraFLOP/s from a 20W part (simple math: an ExaFLOP/s at 20MW means a TeraFLOP/s at 20W). A sustained DGEMM TeraFLOP/s in a notebook... it's coming. For now, we have Knights Corner, which is plenty amazing.
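Spelled out, the simple math looks like this. It assumes power efficiency scales uniformly from the full ExaFLOP/s system down to a single part, which is a simplification:

```latex
\[
\frac{1\ \text{ExaFLOP/s}}{20\ \text{MW}}
  = \frac{10^{18}\ \text{FLOP/s}}{2 \times 10^{7}\ \text{W}}
  = 5 \times 10^{10}\ \text{FLOP/s per W},
\qquad
\frac{10^{12}\ \text{FLOP/s}}{5 \times 10^{10}\ \text{FLOP/s per W}}
  = 20\ \text{W}.
\]
```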
