by Clay Breshears
The Threading Methodology used at Intel has four major steps: Analysis, Design & Implementation, Debugging, and Performance Tuning. These steps are used to create a multithreaded application from a serial base code. While the use of software tools for the first, third, and fourth steps is well documented, there hasn’t been much written about how to do the Design & Implementation part of the process.
There are plenty of books published on parallel algorithms and computation. However, these tend to focus on message-passing, distributed-memory systems, or theoretical parallel models of computation that may or may not have much in common with realized multi-core platforms. If you’re going to be engaged in threaded programming, it can be helpful to know how to program or design algorithms for these models. Of course, these models are fairly limited and many software developers will not have had the opportunity to be exposed to systems that need such specialized programming.
Multithreaded programming is still more art than science. This article gives 8 Simple Rules that you can add to your palette of threading design methods. By following these rules you will have more success in writing the best and most efficient threaded implementation of your applications.
You can’t execute anything concurrently unless the operations that would be executed in parallel can be run independently of each other. We can easily think of different real world instances of independent actions being performed in order to satisfy a single goal. For example, building a house can involve many different workers with different skills. That is, carpenters, electricians, glazers, plumbers, roofers, painters, masons, lawn wranglers, etc. There are some obvious scheduling dependencies between pairs of workers (can’t put on roof shingles before walls are built, can’t paint the walls until the drywall is installed), but for the most part, the people involved in building a house can work independently of each other.
Another real-world example would be a DVD rental warehouse. Orders for movies are collected and then distributed to the workers that go out to where all the discs are stored and find copies to satisfy their assigned orders. Pulling out “My Fair Lady” by one worker does not interfere with another worker that is looking for “The Terminator,” nor will it interfere with a worker trying to locate episodes from the second season of “Seinfeld.” (We can assume that any conflicts that would result from unavailable inventory have been dealt with before orders are transmitted to the warehouse.) Also, packaging and mailing of each order will not interfere with disk searches or the shipping and handling of any other order.
There are cases where you will have exclusively sequential computations that cannot be made concurrent; many of these will be dependencies between loop iterations or steps that must be carried out in a specific order. An example for the latter is a pregnant reindeer. Normal gestation period is about eight consecutive months, so you can’t get a calf by putting eight cows on the job for one month. However, if Santa wanted to field a whole new sled team as soon as possible, he could have eight cows carrying his future team all at the same time.
There are two directions that you can use when approaching a project to thread a serial code. These are bottom-up and top-down. In the Analysis phase of the Threading Methodology, you identify the segments of your code that take the most execution time (a hotspot). If you are able to run those code portions in parallel, you will have the best chance at achieving the maximum performance possible.
In a bottom-up approach, you would attempt to thread the hotspots in your code. If this is not possible, you can search up the call stack of the application to determine if there is another place in the code that can be run in parallel and still executes the hotspots. For example, if you have a picture compression application, you can divide the processing of the picture into separate, independent regions to be processed in parallel. Even if it is possible to employ concurrency at the hotspot code, you should still look to see if it would be possible to implement that concurrency at a point in the code higher up in the call stack. This can increase the granularity of the execution done by each thread.
With the top-down approach, you first consider the whole application, what the computation is coded to accomplish, and, at a very abstract level, all the parts of the app that combine to realize that computation. If there is no obvious concurrency, you should distill the parts of the computation into successively smaller parts until you can identify independent computations. Results from the Analysis phase can guide your investigation to include the most time-consuming modules. Consider threading a video encoding application. You can start at the lowest level of independent pixels within a single frame, or realize that groups of frames can be processed independent of other groups. If the video encoding app is expected to process multiple videos, expressing your parallelism at that level may be easier to write and will be at the highest level of possible concurrency.
The granularity of concurrent computations is loosely defined as the amount of computation done before synchronization is needed. The longer the time between synchronizations, the coarser the granularity will be. Fine-grained parallelism runs the danger of not having enough work assigned to threads to overcome the overhead costs of using threads. Adding more threads, when the amount of computation doesn’t change, only exacerbates the problem. Coarse-grained parallelism has lower overhead costs and also tends to be more readily scalable to an increase in the number of threads. Top-down approaches to threading (or driving the point of threading as high in the call stack) are the best options to achieve a coarse-grain solution.
Processors have recently gone from being dual-core to quad-core. Intel has announced the 80-core Teraflop chip. The number of cores available in future processors will only increase. Thus, you should plan for such processor increases within your software. Scalability is the measure of an applic ation’s ability to handle changes, typically increases, in system resources (number of cores, memory size, bus speed, etc.) or data set sizes. In the face of more cores being available, write flexible code that can take advantage of different numbers of cores.
To paraphrase C. Northecote Parkinson, “Data expands to fill the processing power available.” This means that as the amount of computational power increases (the more cores), the more likely it will be that the data to be processed will expand. For the video encoding example above, it is unlikely the size of frames will change. Thus, increasing the number of threads working within a single frame when the number of cores increases will lower the total number of pixels each thread will process. This change will only serve to increase the overhead of using more threads and create an even more fine-grain execution. More likely is that the length or number of video streams to be encoded will increase. Adding more threads to handle longer (or additional) video streams will divide the increased workload at the point where it has actually increased. This makes the application very scalable.
Designing and implementing concurrency by data decomposition methods will be more scalable than functional decompositions. The number of independent functions is likely limited and fixed during the course of an application’s execution. Once each independent function has a thread and core to execute on, increasing the number of threads to take advantage of more cores will not increase performance of the application. Since data size tends to increase with the number of cores available, data decomposition designs will have the best chance for scalability.
Even though an application has been written with threads assigned to independent functions, there still may be possibilities of using extra threads when the input workload increases. Consider the house building example from above. There are a finite number of separate tasks to be done. However, if the size of the house to be built was doubled, you would expect extra workers might be assigned within some of those tasks. That is, extra painters, extra roofers, extra plumbers, etc., would be used. Therefore, you should be aware of the data decomposition possibilities that can arise from increased data sets, even within functionally decomposed tasks, and plan for the use of extra threads on extra cores available.
If your hotspot computations can be executed through a library call, then you should strongly consider using an equivalent library function instead of executing hand-written code. It’s never a good idea to “reinvent the wheel” by writing code that performs calculations already encapsulated with library routines that have been optimized. Many libraries, such as Intel® Math Kernel Library (MKL) and Intel® Integrated Performance Primitives (IPP), have functions that are already threaded to take advantage of multi-core processors.
Even more important than using threaded library routines, though, is to ensure that any library calls used are thread-safe. That is, if library routines are called from two different threads, the results of each call will return the correct answers. If routines are referencing and updating shared variables within the library, there may be data races that can adversely affect the reliability of the computations. In order to be thread-safe, a library routine can be written to be re-entrant (never updates anything except local variables) or add synchronization to protect access to shared resources. Check the library documentation for the thread-safety of any third-party library you are using.
If threaded libraries are insufficient to cover all the concurrency of an application and you must employ user-controlled threads, don’t use (more complex) explicit threads if OpenMP has all the functionality you need. Explicit threads do allow for finer control of the threading implementation. However, if you are only parallelizing compute-intensive loops or otherwise don’t need the extra flexibility you can get with explicit threads, there’s probably no reason to do more work than necessary. The more complex the implementation, the easier it will be to make a mistake and the harder it will be to maintain such code later.
OpenMP is focused on data decomposition methods, especially targeted to threading of loops that range over large data sets. Even if this is the only type of parallelism that can be introduced into an application, there may be external requirements (engineering practices dictated by your employer, management preferences) that will prohibit your use of OpenMP and you will need to implement your threading with an explicit model. In such a situation, you could still use OpenMP to prototype the planned concurrency and better estimate the potential performance gains, possible scalability, and how much effort will be needed to thread the serial code with explicit threads.
With serial computations, it is easy to predict the statement that will be executed following any other statement in a program. On the other hand, thread execution order is non-deterministic and controlled by the OS scheduler. What this means is that there is no reliable way of predicting the order of threads running from one execution to another or even which thread will be scheduled to run next. This is done primarily to hide execution latency within an application, especially when run on a system with fewer cores than threads. If a thread blocks because it needs memory that is not located in cache or to process an I/O request, the scheduler will swap out the blocked thread and swap in a thread that is ready to run.
Data races are a direct result of this scheduling non-determinism. If you assume that one thread will write a value into a shared variable before another thread will read that value, you may be right all of the time or you may be right some of the time or you may be right none of the time. Sometimes, if you’re lucky, the order of thread execution remains unchanged on a specific platform each and every time you run an application. Each and every difference between systems (bit locations on the disk or memory speed or frequency of the AC power coming out of the wall sockets) has the potential to alter the thread schedule observed. Code that relies on a particular order of execution among threads, through nothing more than positive thinking, will be in trouble with things like data races and deadlock.
From a performance perspective, it would be best to allow threads to run as unencumbered as possible, like greyhounds or thoroughbreds in a race. Don’t try to enforce a particular order of execution unless absolutely necessary. You need to be able to recognize those times when it is absolutely necessary and implement some form of synchronization to coordinate the execution order of threads relative to each other. Consider two friends reading a newspaper. Looking over the pages spread out on a table, the two may read at different rates and will likely be interested in different articles. Whichever friend finishes reading the pages first must wait for the other to complete her perusal of the articles. There are no restrictions on the order of articles either reader chooses to read or the amount of time one can devote. Each reader proceeds at their own pace and then synchronizes with the other reader before going to the next set of pages.
Synchronization is overhead that does not contribute to the furtherance of the computation, except to guarantee the correct answers are produced from the parallel execution of an application. It is a necessary evil. Even so, you should actively seek to keep the amount of synchronization to a minimum. This can most readily be done through the use of storage local to threads or use of exclusive memory locations (e.g., an array element indexed by thread ID).
Temporary work variables rarely need to be shared between threads. These should be declared or allocated locally to each thread. Variables that hold partial results for each thread should also be local to threads. Combining the partial results into a shared location will require some synchronization. Ensuring that the shared updates are done as infrequently as possible will keep the amount of overhead to a minimum. If using explicit threads, there are Thread Local Storage APIs available to enable the persistence of data local to threads from one threaded region to another or from one threaded function call to the next execution of the same function.
If local storage for each thread is not a valid option and you must coordinate access to shared resources through synchronization objects (like a lock), be sure to properly associate (or “attach”) locks to data items. The easiest way to do this is to have a 1:1 relationship of locks to datum. You can set a lock to protect more than one data item if the set of memory locations are always accessed in the same critical regions.
What if you have a large collection of data, for example, an array of 10,000 items? Protecting the whole array with a single lock will likely lead to a synchronization contention bottleneck. Should a different lock be created for each of the array elements? Even with 32 or 64 threads competing for access, this seems like a waste of memory space to protect against conflicts with a less than 1-in-100 average chance of occurring. Something in between these two extremes is called for, specifically, a set of “modulo locks.” Modulo locks protect every Nth instance of a data collection, where N is the number of locks used. For example, with two locks, one lock protects access to the even numbered elements and the other protects access to the odd-numbered. To access a protected element, a thread computes the modulus of the desired location and then must hold the corresponding modulo lock. The number of locks utilized should be based on the number of threads used and the likelihood of concurrently accessing the same element by two threads.
However you decided to associate locks to data items, never associate more than one lock to a single data object. Segal’s Law states “A man with a watch knows what time it is. A man with two watches is never sure.” If two different locks protect access to the same variable, one part of the code may use one lock for access while another section of code can use the other lock. Threads executing in these two code portions will create a data race since both will assume they have exclusive access to the contested data.
For comparing performance of applications, both serial and parallel, the bottom line is wall clock execution time. When choosing between two or more algorithms, programmers may rely on the asymptotic order of execution. This theoretic metric will almost always correlate with an application’s performance. That is, with everything else held constant, an application that uses an O(n log n) algorithm (i.e., quicksort) will run faster than an O(n2) algorithm (i.e., selection sort) that generates the same results.
In parallel applications, algorithms with a better asymptotic order of execution will run faster, too. Nonetheless, there will be times that the best serial algorithm will not be amenable to being parallelized. If a hotspot turns out to be something that can’t be easily turned into threaded code (and you can’t find a point higher in the call stack of the hotspot that can be made parallel), you should consider using a sub-optimal serial algorithm that may parallelize easier than the better serial algorithm currently in the code. Of course, there may be other, less drastic, changes that will allow you to parallelize a given portion of code.
As an example of this latter point, consider the linear algebra operation for the multiplication of two square matrices. Strassen’s Algorithm has one of the best asymptotic orders of execution, O(n2.81). This is obviously better than the O(n3) of the traditional triple nested loop algorithm. Strassen’s method divides up each of the matrices into four pieces and uses seven recursive calls to multiply the n/2 × n/2 submatrices. To parallelize these recursive calls, you could create a new thread to execute each of the seven independent submatrix multiplications until the submatrices achieve a given size. The number of threads will increase exponentially (much like the wives, cats, sacks, etc., coming from St. Ives). As the submatrices get smaller and smaller, the granularity of the assigned work given to a newly created thread will get finer and finer. Alternatively, a “pool” of seven threads could be spawned. Each of the initial seven submatrix multiplications would be assigned to one of the threads in the pool for concurrent execution. The thread pool executions proceed by recursively calling Strassen’s method to multiply the submatrices assigned, as the serial version had done. Systems with more than eight cores will hav e idle resources under this alternate scheme.
A much easier means to parallelize matrix multiplication is to use the triple-nested loop algorithm. There are several ways to perform data decomposition on the matrices (divide by rows, divide by columns, or divide by blocks) and assign the necessary computations to threads. This can be done by OpenMP pragmas at one of the loop levels or by explicit threads that implement the division of the matrices as needed. Less code modification is required for the simpler serial algorithm and the structure of the code would likely be left more intact than it would be if you attempted to thread Strassen’s Algorithm.
We’ve covered 8 Simple Rules that you should keep in mind when you are designing the threading that will transform a serial application into a parallel version. By following the rules presented here and keeping in mind practical programming rules, you can more easily create concurrent solutions that will be more robust, less likely to contain threading problems, and move toward optimal performance in less time.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804