As the number of execution cores in multi-core processors continues to increase, software that is not properly threaded will suffer performance penalties and become less competitive in the marketplace. Decision makers at software companies should act now to position their products for success as hardware parallelism grows exponentially.
By Matt Gillespie
The shift to multi-core computing is still in its infancy. In the relatively near future, we will almost certainly look back at processors with single-digit numbers of cores with a touch of nostalgia. As hardware parallelism increases, software architectures must keep pace, and the performance penalties for failing to do so will grow steadily more severe. For example, on a dual-core system, unthreaded software may achieve as little as 50% theoretical processor utilization. That figure drops to 25% on a quad-core system, 12.5% on an 8-core system, and so on. In the same way, as the number of processor cores increases, the performance penalties associated with improper threading grow geometrically.
In February 2007, Intel demonstrated an 80-core processor. While this chip was created as a proof of concept, rather than as a marketable product, it demonstrates the scope of possibilities as on-die parallelism continues to advance. Such massive parallelism opens the door to application possibilities that are only hinted at today, like a phone that can translate into multiple languages in real time, advanced data mining techniques based on artificial intelligence, or even photorealistic PC games. As these new product areas develop, a clear prerequisite for software companies to succeed will be to thread their applications sufficiently to take advantage of the hardware.
Plan for Both Challenges and Opportunities
As software companies all over the world began to multithread their applications around the turn of the 21st century, they encountered many challenges. Identifying the opportunities for parallelism within their software models, choosing an appropriate threading model, and successfully implementing their threaded designs presented difficulties daunting enough that some companies did not pursue threading at all. Others took shortcuts in development that allowed them to exploit some hardware parallelism without truly re-architecting their applications.
Even given the likely multi-core advances over a time horizon of only a few years, such shortcuts will reveal themselves as temporary solutions that must be corrected with proper threading practices. For example, unsound threading practices may introduce high overhead as threads synchronize their data flows through the application. On a dual-core processor, the missed performance opportunity may be significant, even while the application still outperforms its serial version. On a quad-core processor, the synchronization overhead is likely to become more pronounced, and as the number of cores continues to increase, the overhead will eventually become unacceptable.
In this sense, the move from unthreaded software intended to run on single-core systems up to threaded software intended to run on dual-core systems may be regarded as a proof of concept. As mainstream processors increase to four cores and beyond, the threading models and practices that underlie mainstream software must become more sophisticated. This paper introduces some of the issues and opportunities for software makers associated with the industry transition to processors with large numbers of cores. It focuses on what decision makers can do today to position themselves and their companies for success in the future.
Recognize Multi-Threading as a Strategic Imperative
The first important step that software-development organizations must take to position their products for the multi-core future is to make a commitment to that goal. While the decision itself is simple enough, following through with it requires a certain amount of discipline, since it requires a long-term commitment to a design methodology that may (especially early on) seem to be at odds with near-term requirements.
For example, when a product is on a tight timeframe for completion, rigorous analysis of the opportunities for parallelization in its architecture can easily be pushed down the list of priorities. While it is inherently logical (and to some degree inevitable) to prioritize immediate product requirements over support for future ones, it is also true that 'retrofitting' parallelization into a product's architecture is likely to be more expensive in the long run than using an architecture that lends itself to multi-threading from the earliest design stages.
Thus, the key to accommodating multi-threading in all of the products a company develops is to achieve a balance between near-term time-to-market and cost requirements on one hand and long-term strategy on the other. For that reason, formally identifying highly threaded application designs both as a near-term and a long-term goal is an important step. To support the effort to thread existing and future applications, strategists must educate both management and developers about the importance of multi-threading as a long-term requirement. To be successful, that education must include both business needs and development techniques.
In terms of business needs, it is useful to consider the trends in multi-core hardware development. While today's mainstream processors can handle at most a few simultaneous software threads, it is clear that future hardware designs will support hundreds or even thousands. Industry resources such as Intel Software Insight, a magazine published by the Intel® Developer Zone, can be instrumental in educating stakeholders about the importance of planning software for the future capabilities of massively multi-core hardware architectures. Free electronic subscriptions allow you to receive the bi-monthly magazine, as well as optional e-mail delivery of articles, white papers, case studies, and product briefs selected using your user profile. You can also choose to receive notification of product announcements, training, and events that are of relevance to you.
Supporting the technical side of a development organization's move to multi-threading requires increased knowledge and ongoing training. In this area, Intel provides extensive resources to software developers, and the Intel® Developer Zone Parallel Programming Community is a good place to start. Key resources maintained at this site include white papers and other technical documentation, training, and a knowledgebase of brief, developer-focused articles, each of which addresses a specific challenge associated with threaded software development and gives a concrete solution that a developer can act on immediately. Intel Developer Zone also maintains a training portal that provides access to web-based training, as well as online seminars and on-demand webcasts that address a wide range of development topics.
Characterize the Threading Process in Sophisticated but Simple Terms
To instruct other people and the organization as a whole about the importance and requirements of software multi-threading, it is important to be able to explain clearly and concisely what the process entails. In simple terms, then, threading is the process of breaking the larger tasks addressed by a software application into subtasks that can each be processed separately and simultaneously. Each of those tasks can then be assigned to a separate core, and the results of the subtasks are recombined as necessary ('synchronized') to generate a coherent result.
Because multiple subtasks are performed at once, the overall result can be achieved more quickly, and that is the core benefit of multi-threading. As you might suspect, processes such as creating execution threads, assigning them to specific work, and then dismantling the threads afterward add complexity and performance overhead; done incorrectly, these processes can produce incorrect results or poor performance. That is why correct threading methodology is so important, both in dividing the task into the right subtasks and in coordinating the efforts of the threads that perform work on those subtasks. In considering the first of these issues (dividing the task into the right subtasks), it is useful to be familiar with two general ways of breaking tasks into smaller pieces: data decomposition and functional decomposition.
Data decomposition is based on a relatively simple idea: tasks that involve performing the same work over and over on different pieces of data can be subdivided by giving pieces of the overall data set to separate threads for parallel execution. One classic example of data decomposition is applying a filter or graphical effect to an image. By giving each of several threads a defined section of the picture, each one can apply the required algorithm, pixel by pixel, until it has completed its job. Each of these 'worker' threads reports to a master thread, which coordinates all of the tasks to ensure that the overall result remains coherent. Note that in this example, balancing the workloads among individual threads is relatively straightforward, since each workload can be defined as a given number of pixels.
Functional decomposition is also simple in concept. It consists of identifying each of the discrete things that a piece of software needs to do at any given time, and assigning a thread to each of them. For example, a word processor might need to continually refresh the display to reflect the user's work, perform background save and printing functions, and cross-reference each word of text as it is written against a saved dictionary list. Each of these tasks (and certainly many others) could ostensibly be assigned to a separate thread for independent execution, in addition to a coordinating master thread, as in the data-decomposition example. Workload balancing can be somewhat more complex in functional-decomposition problems than in data-decomposition ones, since it can be difficult to gauge how much work each subtask requires.
Characterize the Difficulty of Parallelizing Specific Workloads
Of course, an individual application may contain both data-decomposition and functional-decomposition problems, and real workloads are somewhat more complex than this two-category model suggests, but characterizing workloads in this fashion is a useful way of looking at high-level threading architectures. Another way of characterizing the threadability of workloads is by the amount of developer effort involved in creating the threaded version. In this sense, workloads may be seen as fitting into one of three broadly defined categories:
- Easily threaded workloads: These are problems such as those described above, which imply an obvious threading model. Such problems are sometimes referred to as "embarrassingly parallel," and they constitute perhaps 10-20 percent of all workloads.
- Moderately difficult-to-thread workloads: A far more common set of circumstances exists where workloads can be parallelized only with substantial effort, effort that is warranted when the potential performance gains protect a competitive advantage. Examples include some database applications, data mining, synthesis, and text and voice processing; such workloads constitute some 60 percent of the total.
- Very difficult-to-thread workloads: This category includes workloads that are very difficult to parallelize because of serial dependencies, in which the input data of one subtask depends upon the output data of another. The business advantages of threading such workloads must be carefully weighed against the cost and technical complexity of doing so.
Correlating the difficulty of threading tasks with the commercial value of doing so makes it possible to create a framework for identifying parallelization priorities. Such a framework gives software makers the means to begin general strategic planning around which workloads within their applications to parallelize first.
Architect Software with Future Hardware Innovation in Mind
An important consideration in developing threaded software is how many execution cores a given piece of software is designed to support. If an application is architected to support only up to a certain number of cores, it may need to be redesigned once the mainstream machines that run it have moved significantly beyond that number of cores. Instead, where possible, application logic should be built to take advantage of the full number of cores available to it. A very simple example of how this might be accomplished concerns the data-decomposition problem described above, applying a filter to the entirety of an image. To take advantage of an open-ended number of execution cores, the code could simply detect the number of logical processors available and create a suitable number of threads to accommodate them. The number of logical processors can be calculated using the following expression:

PL = PP × CP × H
where PL is the number of logical processors, PP is the number of physical processors, CP is the number of execution cores per processor, and H=2 for systems with Hyper-Threading Technology enabled and H=1 for systems without Hyper-Threading Technology. Note that the optimum number of threads is typically no greater than the number of logical processors, although it may be significantly less, depending on the nature of the problem.
Such flexibility will be increasingly important as processor architectures become larger and more complex. The 80-core chip mentioned above makes use of a 'tile' design that allows replication of many sets of identical structures on the silicon. Such innovations may be expected to allow the development of chips with an open-ended number of cores in the future. Another field that shows promise is the development of technologies that allow software to turn cores on and off as needed, as a power-efficiency measure. While such specific innovations cannot necessarily be anticipated, it is useful to consider the range of hardware that a piece of software may eventually run on. Robust and flexible software designs can accommodate many such changes as they arise, which emphasizes the value of spending extra time and effort in the early design phase of development projects.
The real benefit of the dramatic performance increases made possible by these massively parallel systems is the ability to address new types of problems that could not be tackled before. The core capability to address increasingly complex problems with parallel computing depends upon breaking those problems up in novel ways. One model that helps visualize this sort of analysis is Recognition, Mining, and Synthesis (RMS); its three components are as follows:
- Recognition: Creating mathematical models for describing a particular pattern of input data, such as traits that signify a given human personality type, or a pattern in medical imaging that represents a tumor.
- Mining: Querying data to identify instances of a model, such as examining survey results to locate people that match the desired personality type, or automated analysis of large numbers of medical images to locate tumors of the type under consideration.
- Synthesis: Analyzing the data to postulate its significance to a set of problems, such as estimating the success of an individual in a specific job role, or identifying what the likely progression of a tumor is in a given patient.
The RMS model is very powerful, and it represents the sort of software design that stands to benefit from the ongoing growth of parallelism in hardware. Large RMS models require very high compute density, and the large data sets involved often offer obvious opportunities for parallelism. Because of these characteristics, RMS implementations are a good example of the sort of forward-looking solution design that positions software makers for success tied to future processor innovation.
Implement Tools to Simplify and Verify Threading Models
Another important aspect to positioning a development organization for success in a long-term commitment to parallel software is to implement tools and techniques that can simplify the effort to create high-quality threaded software. The full range of Intel® Software Development Products is designed to work together in facilitating the development of parallel software:
- Intel® Compilers can automatically introduce threading into software. Compiler settings enable rich control over the use of industry-standard threading techniques such as auto-parallelization and OpenMP*. This automation makes the Intel Compilers a sound choice for initial threading efforts, enabling higher performance with a low commitment of effort.
- Intel® Performance Libraries provide very highly tuned software functions that can easily add parallelism to applications. Many of these functions are pre-threaded, and all are thread-safe, making them well suited to implementation in threaded software. This low-cost option also builds threading performance without extensive effort.
- Intel® VTune™ Performance Analyzer identifies sections of code that are good candidates for hand parallelization. By systematically identifying these sections and addressing the opportunities associated with them, you can focus on threading the code that will yield the greatest results.
- Intel® Threading Tools vastly simplify the debugging and performance optimization of threaded code. Intel® Thread Checker identifies potential threading errors in code before they happen, even when developers cannot observe those errors on a test machine. Intel® Thread Profiler examines thread behavior to identify potential points of performance improvement, enabling you to efficiently fine-tune threading behavior.
- Intel® Threading Building Blocks consists of ready-made library functions that simplify the development of threaded data structures and algorithms, speeding up software on multi-core hardware. Ready-to-use parallel algorithms support easy plug-in deployment into applications to deliver scalable software speed-up, detecting the number of available execution cores and dynamically tailoring code on that basis.
Consider the following perspective from James Reinders, director of marketing and business for the Intel® Software Development Products division, who says, “It strikes me that in terms of future development, the magnitude of the change that software developers are going to experience will be substantial. A decade from now, we’ll be looking back and thinking how much differently we approach writing program code. Parallelism, for everyone, is going to be ubiquitous. And, truthfully, this offers significant opportunities for companies and new possibilities for every single software developer, including tools developers such as our team in the Intel Software Development Group. Right now we have an excellent opportunity to rethink and plan programming strategies for the next decade and find ways to fully exploit the potential of multi-core processor architectures.”
Most software development organizations have made some commitment to optimizing their products for parallel architectures through multi-threading. Even as the industry has begun to see the necessity of this paradigm shift, however, many software companies have not yet fully committed to positioning themselves for success in a future of dozens, hundreds, or thousands of execution cores in a single processor. Inasmuch as that scenario seems likely in the not-so-distant future, it is valuable to consider steps that support long-term planning in that regard.
Recognizing that less-than-expert threading practices are likely to lead to unacceptable performance on platforms with large numbers of cores, organizations are beginning to plan long-term strategies around supporting massive parallelization in their software. This change makes it important for thought leaders within organizations to be able to articulate the technical and business requirements of positioning their companies for increased parallelism, making choices in process and planning that will support current and future generations of hardware.
The following materials provide a point of departure for further research on this topic:
- Intel® Multi-Core Developer Community brings together a wide range of developer resources related to creating software that takes optimal advantage of multi-core processing.
- Intel® Multi-Core Technology and Research Portal provides access to a variety of resources about current multi-core technology at Intel, as well as ongoing innovation and research.
- Intel® Software Development Products help to simplify the development of high-quality parallel software with tools that integrate with popular development environments.
About the Author
Matt Gillespie is an independent technical author and editor working out of the Chicago area and specializing in emerging hardware and software technologies. Before going into business for himself, Matt developed training for software developers at Intel Corporation and worked in Internet Technical Services at California Federal Bank. He spent his early years as a writer and editor in the fields of financial publishing and neuroscience.