by Shu-ling Garver and Bob Crepps
The demand for increased performance does not diminish, so more efficient ways to deliver that performance must be found.
Microprocessor performance has scaled over the last three decades from devices that could perform tens of thousands of instructions per second to tens of billions of instructions per second in today’s products. Our processors have evolved from super-scalar architecture to instruction-level parallelism, with each evolution making more efficient use of a fast single-instruction pipeline. Our goal is to continue that scaling, to reach a capability of 10 tera-instructions per second by the year 2015. The obvious question is: How do we get there?
The answer to that lies in Moore’s Law. In parallel with our architecture scaling, our process technology has advanced or scaled at a rate predicted by Moore’s Law. If we look ahead to the next decade, we will have the ability to integrate tens of billions of transistors on a single die. The dual-core Intel® Itanium® processor, code-named “Montecito,” already uses 1.7 billion transistors.
As mentioned, our architecture has evolved to get the maximum performance from a single pipeline. However, we can make better use of the increasing number of transistors by moving to architectures that use multiple pipelines, or threads, or cores. We call this shift to multiple threads and cores the Era of Tera-scale Computing.
Why are multi-core and multithreading so important? Fred Pollack, a now-retired Intel Fellow, postulated that a single-threaded processor will provide a diminishing return in performance versus power; that is, performance will scale more slowly than power for a single-threaded processor. Multiple-thread processors can scale performance much better. The demand for increased performance does not diminish, so it’s important to find more efficient ways to deliver that performance.
Multi at the Logic Level
An example of power and performance scaling at the logic level is shown here. Consider a logic block, which can be any function – an ALU, for example. Assume that this logic block has an operating voltage of 1 unit, a frequency of 1, throughput of 1, active power of 1, and leakage power of 1. If the voltage is reduced to .7, then the frequency must also be reduced to .7, which reduces throughput to .7. However, active and leakage power both drop to .35, because power scales roughly as the cube of voltage when frequency tracks voltage. If the logic block is then replicated, so that two blocks operate in parallel, the total power for the two blocks is .7, but the throughput is 1.4. This logic-level example illustrates how parallel processing can increase performance while reducing power consumption.
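The arithmetic above can be sketched in a few lines of Python (a minimal model, assuming frequency tracks voltage and active power scales as V²·f):

```python
# Normalized logic-block scaling: frequency tracks voltage,
# and active power scales as V^2 * f (so roughly as V^3).
def scale_block(voltage):
    frequency = voltage
    throughput = frequency
    active_power = voltage ** 2 * frequency
    return throughput, active_power

t, p = scale_block(0.7)
# Two scaled blocks operating in parallel:
print(round(2 * t, 2), round(2 * p, 2))  # → 1.4 0.69
```

The .69 total matches the article’s rounded .7 for two scaled blocks delivering 1.4 units of throughput.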
Next, consider multithreading. In the chart below, the 1GHz line shows decreasing performance as a result of cache misses. For each cache miss, the processor must wait hundreds of cycles for main memory access. During this time, the processor is not performing useful work, but it is consuming power. With a cache-miss rate of just a few percent, performance drops to 40%. As processor frequency increases, performance decreases even more.
System design requires that power supplies and thermal solutions must be designed for full processor utilization. The cost to build a system includes the cost of the full power delivery and thermal solution. How can we keep the system fully utilized, to get maximum performance and value?
When a single thread is running, the hardware is fully utilized. When there is a cache miss, the processor waits, consumes power, and performs no useful work. If the processor can launch a second thread that runs while the first is waiting, and a third when the second thread waits, and so on, then the hardware stays fully utilized all the time. In that way, multithreading can improve performance without impacting thermals and power delivery.
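This latency hiding can be captured in a toy model (the cycle counts here are illustrative assumptions, not figures from the article): suppose each thread computes for 100 cycles, then stalls 300 cycles on a memory access, and the core can switch to another ready thread whenever the current one stalls.

```python
# Toy model of multithreaded latency hiding: each thread alternates
# `compute` busy cycles with a `stall`-cycle memory wait, and the core
# switches to another ready thread whenever the current one stalls.
def utilization(threads, compute=100, stall=300):
    needed = 1 + stall / compute  # threads required to hide the stall completely
    return min(1.0, threads / needed)

print(utilization(1))  # 0.25 – a lone thread is stalled 300 of every 400 cycles
print(utilization(4))  # 1.0  – four threads keep the hardware fully busy
```

With four threads, the hardware never idles, so the power and thermal budget the system was built for is doing useful work all the time.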
Dual- and Multi-Core
We’ve looked at Tera-scale from the logic level and the thread level. Now let’s look at multiple cores.
First, consider this Rule of Thumb, which is derived from the relationship between power, voltage, and frequency and accounts for active and leakage power: a 1% change in voltage requires a corresponding 1% change in frequency. That causes a 3% change in power (power varies as a cubic function of voltage and frequency) and a 0.66% change in performance.
Assume we have a single core with cache whose voltage, frequency, power and performance are normalized to 1. Now replicate the core and share the cache between the two cores. Next, reduce the voltage and frequency by 15%, so that the power for each core is .5 and the total for the two cores is 1. According to the Rule of Thumb, the performance will be 1.8 for two cores consuming the same power as the single core.
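The dual-core arithmetic can be sketched with the Rule of Thumb treated as a linear approximation; being linear, it lands near (not exactly on) the rounded figures in the text, with per-core power coming out at about .55 rather than .5:

```python
# The Rule of Thumb as a linear approximation: per 1% change in voltage
# (and frequency), power changes ~3% and performance ~0.66%.
def rule_of_thumb(voltage_change_pct):
    power = 1 + 3 * voltage_change_pct / 100
    performance = 1 + 0.66 * voltage_change_pct / 100
    return power, performance

power, perf = rule_of_thumb(-15)  # reduce voltage and frequency by 15%
# Two cores at the reduced operating point:
print(round(2 * power, 2), round(2 * perf, 2))  # → 1.1 1.8
```

Two slower cores deliver roughly 1.8 units of performance in about the power budget of the original single core.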
Next, compare two cores with multiple cores. Start with a large core and cache that consumes 4 units of power and delivers 2 units of performance. Compare that to a small core that has been scaled to use one unit of power and has a corresponding performance of 1 unit. By combining 4 of the small cores together, the total power is equal to that of the large core, or 4 units, and the total performance is equal to 4 units, twice that of the large core for the same power.
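A sketch of that comparison in normalized units, assuming the workload parallelizes perfectly across the small cores:

```python
# One large core versus four small cores, normalized units from the text.
large_power, large_perf = 4, 2
small_power, small_perf = 1, 1

n = 4
print(n * small_power, n * small_perf)  # → 4 4  (same power budget, twice the performance)
print(large_perf / large_power, small_perf / small_power)  # perf per watt → 0.5 1.0
```

The small cores double performance per watt, but only if the software can keep all four busy, which is where Amdahl’s Law comes in.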
So far, we’ve talked about using multiple but identical logic blocks or cores, but there are other ways to scale performance by using special-purpose cores. In the chart shown below, one of the curves is labeled “GP MIPS @ 75W”. It represents general-purpose MIPS (millions of instructions per second) over time. The data was derived from various processors used to service a saturated Ethernet link, with processor power normalized to 75 watts. The curve labeled “TOE MIPS @~2W” comes from a special-purpose device: this TOE, or TCP Offload Engine, is a test chip made to process TCP traffic. The chart clearly shows that these “specialized MIPS” consume much less power to perform their specific function than general-purpose MIPS require to do the same task. The die shot shows that the TOE is a very small die and requires relatively few transistors. The high performance versus power of special-purpose hardware will find applications in network processing, multimedia, speech recognition, encryption, and XML processing, to name a few.
Adding together all of the techniques covered so far, it’s clear that the future of Tera-scale computing will include arrays of general-purpose cores and special-purpose cores linked by an interconnect fabric that can scale to tens or hundreds of cores. Such an arrangement is called a Heterogeneous Multi-Core Architecture.
Fine-Grain Power Management
Another opportunity in this Tera-scale Era is to improve efficiency through Fine-Grain Power Management. In the figure below, power supply response for today’s systems is represented. That response is very slow compared to the demands of most applications. For example, in email or a word processor, the system waits for tens or hundreds of milliseconds (or longer) for user input, while consuming full power. If the system were designed to switch from operating to standby mode in a few microseconds (or less), the power savings while waiting for the next keystroke could be huge. One way to achieve this is to move the power supply closer to the core. On-package and on-die voltage regulation are already being researched to make this happen.
Most existing applications are single-threaded, and most threaded applications have a sequential or single-threaded component. For best single-threaded performance, the core that is processing the thread should run at maximum voltage and frequency. By varying the voltage and frequency of individual cores in a multi-core processor as the number of threads varies, high performance and high energy efficiency can be obtained. Fine-grain voltage and frequency can be used to balance performance and energy use.
Parts of some applications can be very compute-intensive and can create hot spots on the die. In a multi-core device, these high energy-use functions can be moved from core to core to spread the heat across the die to reduce the overall temperature, avoid temperature throttling, and improve reliability. Tera-scale architectures offer many new opportunities to improve energy efficiency and thermal management.
From multithreading to multi-core, from circuit to microarchitecture, multi-everywhere takes advantage of Moore’s Law. This provides us with both opportunities and challenges in Giga-scale integration for future Tera-scale computing.
Performance Scaling in Multi-Core
Amdahl’s Law governs the speedup gained by using parallel processors on an application, versus using only one serial processor. The more parallelized the software, the better the speedup.
The following is a quote from Gene Amdahl in 1967:
“For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit co-operative solution...The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput of five to seven times the sequential processing rate, even if the housekeeping were done in a separate processor...At any point in time it is difficult to foresee how the previous bottlenecks in a sequential computer will be effectively overcome.”
The speed of a program is measured by the time it takes the program to execute, in any increment of time. Speedup is defined as the time it takes a program to execute serially (with one core) divided by the time it takes to execute in parallel (with multiple cores).
A Speedup Curve is simply a graph with the number of cores on the X-axis plotted against the speedup on the Y-axis. The best speedup we could hope for would yield a 45-degree curve: with 10 processors, we would realize a ten-fold speedup. A speedup below 1, by contrast, would mean that the program ran faster on a single processor than in parallel, which would make it a poor candidate for parallel computing.
For example, an application with a serial portion of 20% has a worse speedup curve than one with a serial portion of 6.7%. And past a certain core count, adding more cores yields diminishing returns.
With some substitution and manipulation, we get the formula for speedup: Speedup = 1 / (S + (1 − S) / N), where S is the serial fraction of the program and N is the number of cores.
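Amdahl’s speedup in code, with S the serial fraction and N the number of cores:

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup = 1 / (S + (1 - S) / N)."""
    return 1 / (serial_fraction + (1 - serial_fraction) / cores)

print(round(amdahl_speedup(0.20, 16), 2))     # 20% serial on 16 cores → 4.0
print(round(amdahl_speedup(0.067, 16), 2))    # 6.7% serial on 16 cores → 7.98
print(round(amdahl_speedup(0.20, 10**9), 2))  # even unlimited cores cap at 1/S → 5.0
```

Note that with 20% serial code, no number of cores can push the speedup past 5: the serial fraction sets a hard ceiling.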
From Multi to Many
Assume a 13mm processor die in the 22nm process time frame, with 4 billion transistors and 48MB of cache, consuming 100W. With that many transistors, we can design a processor with 12 large cores, 48 medium cores (multi-core), or 144 small cores (many-core).
Now let’s make another assumption: the throughput for a large core is 1, for a medium core 0.5, and for a small core 0.3.
In an ideal environment, total throughput for the large cores would be 12, for the medium cores 24, and for the small cores 48. In reality, however, software parallelism plays a big role: the graph shows that multi-core and many-core performance suffers when applications are less multithreaded. Software parallelism is therefore the key to multi-core and many-core success.
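The effect can be sketched by combining the core counts above with Amdahl’s Law. The per-core throughputs come from the text, treating the small core’s 0.3 as 1/3 so the ideal totals (12, 24, 48) line up; the model assumes the serial portion runs on a single core while the parallel portion uses all of them.

```python
# Throughput model: the serial fraction runs on one core, the rest on all cores.
def throughput(per_core, cores, serial_fraction):
    time = serial_fraction / per_core + (1 - serial_fraction) / (cores * per_core)
    return 1 / time

for name, per_core, cores in [("large", 1.0, 12), ("medium", 0.5, 48), ("small", 1 / 3, 144)]:
    ideal = throughput(per_core, cores, 0.0)   # perfectly parallel workload
    real = throughput(per_core, cores, 0.1)    # workload with 10% serial code
    print(f"{name}: ideal {ideal:.0f}, with 10% serial {real:.1f}")
```

With even 10% serial code, the 144 small cores fall behind the 12 large ones, because the slow small core bottlenecks the serial portion; this is exactly why software parallelism decides whether many-core wins.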
The Leap to Parallelism: Driving Energy-Efficient Performance
From multiprocessor supercomputing in the past to Hyper-Threading Technology (HT Technology), to dual-core (Intel® Core™ Duo and Intel® Core™2 Duo) and quad-core, the next leap is many-core and Tera-scale computing.
Future applications require extremely high computing capabilities that are currently unobtainable due to frequency-scaling and power limits. Intel’s research scientists are working to rethink micro- and platform architectures. This rethought architecture yields orders-of-magnitude improvements in inter-core latency and bandwidth over traditional multiprocessor systems. A fundamental architectural shift to many-core, with tens or hundreds of low-power, highly threaded IA cores per processor, will fully capitalize on these benefits. Platform capabilities for high-bandwidth memory and I/O, aggressive power management, and a scalable, balanced platform architecture are our motivation and opportunity for the future of Tera-scale computing.
Parallel Software Research at Intel Labs - The Key to Success
As we mentioned above with Amdahl’s Law and the progression from multi-core to many-core, software parallelism is critical to performance. Tomorrow’s "killer applications" demand it. The technology underlying these applications is likely to have broad applicability to a wide range of emerging applications with mass appeal in various market segments, including digital enterprise, digital home, and digital health. This article gives an example workload for our future vision: RMS (Recognition, Mining, and Synthesis).
Recognition is a type of machine learning that enables computers to model objects or events of interest to the user or application. Mining is searching for instances of that model in complex, often massive, static or streaming datasets. Synthesis is discovering “what if” cases of a model: if an instance of the model doesn’t exist, a computer should be able to create it in a virtual world.
The wave of digitization is all around us. While none of us has a crystal ball to predict the future, it is our belief that the next round of applications will be about solving the data explosion problem for end-users, a problem of growing concern for both enterprise and home users. Digital content continues to grow by leaps and bounds in various forms, including unstructured text on the Web; digital images from consumer cameras to high-definition medical images; streams of network access logs or e-Commerce transactions; and digital video data from consumer cameras and surveillance cameras. Add to this massive virtual-reality datasets and complex models capable of interactive, real-time rendering approaching photo-realism and real-world animation. (Source: Bob Liang and Pradeep Dubey, Compute-Intensive, Highly Parallel Applications and Uses, May 19, 2005).
Intel’s research on Multi is not limited to the circuit and microarchitecture levels; multi-everywhere truly includes software research into languages, workloads, applications, and more.
Programming Model for Tera-scale
Intel’s microprocessor research in programming models for Tera-scale includes parallel runtimes, compilers, tools, libraries, and scalable operating systems. Programming-language research is targeting ease of use.
- Explicit concurrent languages – parallelism that is visible to programmers. This kind of programming is for seasoned programmers like compiler writers.
- Implicit concurrent languages – parallelism that is invisible to programmers, who are the majority. Because the parallelism is implicit, programming for multiple cores is easy: you don’t need to know thread or core details. This is what we call Ease of Use: shifting from “programmer does the work” to “compiler does the work.”
- Domain-specific languages – languages for special usage models, e.g. multi-modal recognition, computer vision, graphics, etc.
Hardware and Software Speculative Multithreading
Tomorrow's computing workloads from RMS and other applications of the future will require exponential increases in computing power. Multi-core platforms and multithreading are paving the way for meeting these performance needs. However, they require a companion effort to accelerate the development of the thread-level parallelism necessary for reaching their full performance potential.
The Mitosis technology is a combined software/hardware approach showing great promise in this area. Its distinguishing feature is support for speculative threads, which the compiler can exploit to perform aggressive optimizations that parallelize code even when they are unsafe (speculative). What makes them safe is hardware that provides a safety net to recover a correct state whenever needed. The Mitosis technique shows excellent potential for providing additional thread-level parallelism with small power overhead in applications that are hard to parallelize by conventional approaches. (Source: Antonio Gonzalez, Speculative Threading: Creating New Methods of Thread-level Parallelization)
Multi-Core’s Future and Intel’s Research Program
Future multi-core processors will be scalable and energy efficient, with balanced design. Our research scientists are actively working on innovations to achieve this.
Examples of these innovations are configurable cache architectures; high-speed CMOS voltage regulators for fine-grain control of core voltage and frequency; and 3D stacked memory to bring more memory closer to the cores. At the software level, we have Transactional Memory, which replaces locks to simplify the writing of parallel programs. We are also actively collaborating with leading application developers as our multi-core co-travelers.
There are many more software and hardware research programs for multi-everywhere than we can list here. However, success is measured by Ease of Use: simplicity for the end users.
Intel has successfully transitioned from the eras of pipelined architecture and instruction-level parallelism to the new era of thread- and processor-level parallelism.
Multi is everywhere, and software is the key. There are many opportunities. Let’s embrace them.
About the Authors
Shu-ling Garver is the Technical Assistant (TA) at Intel DEG Architecture and Planning. Shu-ling joined Intel in 1989, and has since held several engineering and technical marketing positions. Shu-ling received her master's degrees in computer science and engineering from Oregon Graduate Institute, and holds a bachelor's degree in Computer Science from Portland State University.
Bob Crepps is a ten-year veteran of Intel. He has worked as an engineer in Motherboard Products and in Industry Enabling for technologies such as USB, AGP and PCI Express. His current role is as a technology strategist for the Microprocessor Technology Lab in Corporate Technology Group. Prior to Intel he was an analog design engineer and information systems engineer for fifteen years for various technology companies in the Northwest. Bob lives and works in Hillsboro, Oregon.