The Terascale-on-a-Chip Environment
"Developing for Terascale-on-a-Chip" is a series of three articles looking at the future of programming for many-core parallel processors that run at a teraflop and above: terascale-on-a-chip. This is the first article in the series.
Intel® (and the entire processor industry) has been chanting a parallel processing mantra for several years, and it’s only going to get more intense. The company now ships 4-core processors, and with Moore’s Law as a guide, we should expect 128 cores in ten years, 256 in twelve, 512 in fourteen, and over one thousand by 2023. Of course, that’s all theoretical. The reality is that Intel built an experimental 80-core processor called Polaris and tested it last year with a parallel application using message passing. Running a stencil benchmark, Polaris delivered over one single-precision teraflop (and is capable of two) from a single chip dissipating only 62 watts: terascale-on-a-chip. That’s a first.
Intel has been delivering firsts in parallel computing for many years. They were the first commercial company to ship a massively parallel processing (MPP) supercomputer in 1985 and, in 1996, the first to deliver a machine that ran over one teraflop (ASCI Option Red, with over 9,000 processors). In 2006, they were the first to deliver a quad-core processor for the general-purpose computing space. Intel’s experience in parallel computing runs deep.
When Intel talks about terascale-on-a-chip, they mean more than the multi-core architecture currently shipping. They’re talking about a near future where applications can run on hundreds of cores, processing terabytes of data per second using a single processor.
The future terascale processor will be quite different from today’s CPUs. Today’s processors are multi-core, consisting of a few high-performance, power-efficient, large (“fat”) cores with multiple advanced caches communicating over one or more system buses.
Tomorrow’s terascale processor will be “many-core,” with a large number of smaller, low-power cores. In some cases its architecture will be heterogeneous, integrating fat cores with a variety of smaller cores; advanced, shared caches; special-purpose accelerators (network offload, XML acceleration, graphics processors, and so on); and even on-chip, hierarchical memory arrays, all interconnected by a high-speed on-chip network. Unlike today’s processors, a terascale processor will not fail entirely when some of its cores fail. One might compare it to a fault-tolerant cluster of machines on silicon.
The whole architecture is designed to keep increasing processor performance, into the terascale range and beyond, without increasing the chip’s power requirements.
The result is an extremely power-efficient parallel processor whose many cores can support far more concurrency. And that has a profound impact on programmers.
For years, developers have reaped the benefits of the hardware evolution described by Moore’s Law without having to do much to their code. Improving the performance of their code simply required running it on the next-generation processors.
To see continued performance improvements on many-core processors, developers will have to think, from the ground up, in terms of highly parallel applications. They will need to expose more concurrency in their problem’s design and express far more concurrency in their code.
Fortunately, programmers will be able to utilize the same tools they do for threading today – OpenMP*, MPI*, and Pthreads* – in their applications for many-core processors. Plus, they will be able to take advantage of parallel programming tools, like Intel® Thread Checker, and Intel® Thread Profiler. Intel developers are also working on new programming models to meet the needs of a new generation of parallel programmers, such as Intel® Threading Building Blocks. There is a lot of information on the Multi-Core Developer Community on these topics. But programming for terascale-on-a-chip will demand even more.
Benefiting from terascale-on-a-chip comes down to Amdahl’s law and the serial fraction of the program (you can learn more about Amdahl’s law in “Nuts and Bolts of Multithreading”). With an infinite amount of parallel capacity, Amdahl’s law says that your best speedup is the inverse of the code’s serial fraction. For example, a program that is 25 percent sequential will run a theoretical maximum of four times faster on infinitely parallel hardware.
Another way of looking at it: if your target machine has only four cores, parallelizing 75 percent of your code already captures much of the available speedup (about 2.3x, by Amdahl’s law). This means that, with multi-core, finding loops and other obviously threadable functions and threading them for four or eight cores can deliver significantly better throughput. Leaving the rest of the application sequential costs little on a system with a few cores. For 64 cores, it’s an entirely different matter.
To maximize speedup on 64 or more cores, the code has to be 99 percent or more concurrent. Getting to this level of concurrency is challenging. You can’t do it by taking your serial algorithm and sprinkling the code with Pthreads or OpenMP constructs. You need to rethink the problem in terms of a well-designed parallel algorithm. You have to get creative with “wetware,” the stuff that goes on behind your eyes and between your ears. You have to go back to the original problem and consider it from a whole new perspective, that of a many-core parallel computing environment, to expose every bit of concurrency you can find.
This process takes work. Experienced parallel programmers will tell you that getting the serial fraction to 10 percent is difficult, but possible. And there are no automated tools (yet) that can replace the kind of intelligent creativity required.
Starting from the problem requires having knowledge (or access to someone with it) of the domain, the problem, algorithm types, and the kinds of data structures the problem depends on.
From there you can design a parallel algorithm with maximum concurrency in mind, which you can then express in your code. Methods to do this are discussed in article 2.
To continually maximize performance from your code, you need a lot more concurrency in the code than there is hardware in the system, both for today’s platforms and tomorrow’s. Software outlives hardware. Codes developed years ago for supercomputing and general-purpose applications are still in use today. Well-developed codes continue to scale as hardware evolves.
Parallel code for many-core processors will need to be adaptable to take advantage of a continuing many-core evolution when these machines finally enter the market. When core counts jump from, say, 64 to 128, if the code is not scalable, the serial portion becomes the obstacle to a leap in throughput performance. But if the designer has squeezed every bit of concurrency out of the code, then when more processing elements (cores) become available, the system scheduler can take advantage of the resources and run more simultaneous threads or processes, resulting in instant benefits from the new hardware, without modifying the software.
Stay ready to learn, work, and explore
There is already a solid foundation in the industry for developing highly parallel applications. Scientists working with supercomputers have relied for years (decades, in some cases) on languages and APIs like OpenMP, MPI, Pthreads, the C family of languages, Fortran, and more to develop code on huge machines. There is much to learn from these experts. For example, programmers familiar with coding in a shared-memory environment might find that a distributed-memory model with message passing (MPI), as on clusters, will better serve their needs and simplify debugging when coding for many-core processors.
Intel® threading and programming tools for developers provide a familiar environment for writing and optimizing highly concurrent code for terascale-on-a-chip. Programmers will be able to leverage these for their new code, yet there will be numerous opportunities to learn and apply new techniques and tools.
Intel is working on several projects that will benefit the programmer developing highly concurrent code. Here are a few.
- Software Transactional Memory is intended to simplify sharing data in a shared-memory environment without locking. As concurrency grows, protecting shared variables from concurrent thread activity becomes a more delicate balance. Software Transactional Memory will help programmers find the right tradeoffs.
- Exosequencer will ease the integration of heterogeneous cores into the programmer’s environment. With many different cores on a single chip, programmers will likely be exposed to different instruction set architectures (ISAs). The exosequencer will help integrate these different instruction sets into the familiar Intel® instruction set.
- Ct is designed to simplify parallel programming for certain kinds of applications. It’s a new data-parallel programming model that abstracts some of the expression of concurrency for applications whose data structures can be updated in parallel. With Ct, instead of explicitly coding how to update the data structures, the programmer identifies what can be updated in parallel, and the language takes care of the rest.
- OpenMP 3.0 and beyond will address the special needs of many-core programmers. The OpenMP 3.0 specification, due late in 2007, will define a powerful tasking model to support a wider range of algorithms than the original versions of OpenMP could. Future specifications will tackle heterogeneity, complex memory hierarchies, and improved robustness.
- Intel research teams are studying classes of emerging workloads, like Recognition, Mining, and Synthesis (RMS), or model-based programming. These workloads let Intel engineers explore what parallel programming will look like, and understand the issues that will drive many-core platforms five to ten years from now, when terascale-on-a-chip performance will be routine.
From the application level to the BIOS and operating systems, there is plenty of opportunity for developers to innovate for future terascale-on-a-chip machines.
Intel is currently working with academia and other forward thinkers. Research teams have come up with applications and codes ranging from advanced financial analytics to building “sports highlight reels” out of thousands of frames of footage in real time, and other data- and processing-intensive applications that demand terascale-on-a-chip capability.
Terascale-on-a-chip will allow developers to stretch their imaginations beyond taking computing to the next level. It will enable truly interesting and amazing uses and applications that either have never been thought of or have been mere science fiction until now. Think in terms of applications like the holodeck, and you’re in the realm of future applications enabled by highly parallel systems. We should all let our imaginations run wild.
- Developing for Terascale-on-a-Chip Computing : 2 of 3
- Developing for Terascale-on-a-Chip Computing : 3 of 3
About the Author
Ken Strandberg writes technical articles, white papers, seminars, web-based training, and technical marketing and interactive collateral for emerging technology companies, Fortune 100 enterprises, and multi-national corporations. Mr. Strandberg’s technology areas include Software, Industrial Technologies, Design Automation, Networking, Medical Technologies, Semiconductor, and Telecom. His technology background enables him to write from an engineering perspective for technical audiences. Ken’s work has appeared in national engineering magazines, such as EE Times and EE Design, and on Fortune 100 enterprise web sites. Mr. Strandberg lives in Nevada and can be reached at email@example.com.