Load Balancing Between Threads

Submit New Article

Last Modified On :   August 24, 2009 11:05 AM PDT
Rate
 


By Andrew Binstock

Unequal workloads can often eliminate the performance benefit of parallel code. Avoid this performance killer by profiling your code and knowing how to load-balance your software.

Unequal thread workloads occur whenever one thread requires more time to do a given amount of work, thereby diminishing performance. End users experience load imbalances as sluggish program response. Developers experience them as unexpectedly poor lift from parallelization. Such imbalances often have characteristic visual profiles in system monitoring and software development tools. For example, on modern operating systems that have graphical performance monitoring tools, thread imbalances can show up rather dramatically, as shown in Figure 1.



Figure 1. A load imbalance among threads as displayed in Microsoft* Windows* XP's perfmon application.


In this example, a dual quad-core workstation is running an application that is poorly load balanced. Four threads (CPUs 1-4 top row, where CPU 0 is leftmost) are very busy while the other four threads are comparatively inactive.

For developers with access to a performance profiler, such as the Thread Profiler feature of the Intel® VTune™ Performance Analyzer, it is possible to drill deeper into imbalances and compare the performance of individual threads against each other and begin to identify causes. In Intel® VTune™, unbalanced loads can be displayed in several formats. Figure 2 illustrates one of them.



Figure 2. Poorly load balanced threads as displayed in the Intel® VTune™ Performance Analyzer


Here, four threads are started at 0.43 seconds into program execution time, but they finish at very different points in time, even though they are performing the same actions.

When writing parallel code, load balancing between threads is a goal that should be kept in mind during all stages of program design and development. This article explains the ramifications of poor load balancing across threads and it presents various solutions to use when you discover that a thread is doing too much work.


The Hidden Cost of Uneven Load Balancing

It can be easy to think of load balancing as a best practice that is desirable, but not crucial. This perspective underappreciates the cost of poor load balancing. In particular, there are two common scenarios in which poor distribution of work can effectively eliminate nearly all the benefit of parallelization. Let's examine these scenarios.

Processing Barriers. Processing barriers, sometimes called synchronization barriers, occur frequently in parallel programs. Whenever several threads must complete before the next processing step can begin, there is a processing barrier. That is, a block on further work until all previous work is done. A common example is a matrix transformation that must be completed before the matrix can be processed further. In games, actions must frequently be completed in their entirety before their results can be displayed. (Barriers are implemented by several functions in threading APIs: waitForMultipleObjects() in Windows and #pragma omp barrier in OpenMP* are typical.

Whenever a barrier occurs, it becomes critical to insure that the code preceding it is load balanced. If the code is not balanced and one thread takes a long time to finish, it holds up all the other threads affected by the barrier. Essentially, it converts the processing to a purely sequential architecture-a disaster for parallel code. So, when profiling code, always look at thread performance that precedes synchronization barriers.

Producer-Consumer. In a producer-consumer design, one group of threads produces data that is passed via a data structure to a group of threads that processes it. One example is the event queue of multiple player games: events are created by multiple user threads and placed in a queue to be processed by a collection of consumer threads. There is in this arrangement an implicit dependency: consumer threads must wait on producers. This creates the potential for significant delays if the producer is hung up due to a load imbalance. Then, all downstream processing comes to a halt and the program reverts to a serial condition.

Notice that the same problem arises if the consumer threads become blocked, the producer can no longer pass data to them and must stop collecting data or find some other way to cache the data until the consumer threads resume.

Producer-consumer relationships can sometimes be difficult to profile. However, as they often use a queue to pass work from the producers to the consumers, it is possible to track how often the queue is empty or completely full. Either scenario highlights a situation in which load balancing has broken down: one group of threads is doing nothing while another group is working hard.

Other Scenarios. In programs where lots of threads are doing lots of work on different tasks simultaneously, poor load balance can be especially difficult to detect, but it is frequently less costly. The work done by many threads can essentially mask the latency of the excess load and the program will appear to run correctly. However, it will not be operating optimally. If such unbalanced loads occur in multiple places, then the overall performance might be considerably reduced. This situation should be looked for whenever the benefits of parallelization give surprisingly little lift.


What To Do?

How to handle load balancing issues tends to vary on whether the work load was designed via functional or data decomposition. Each form of decomposition suggests a different cause and has somewhat different solutions.

Functional Decomposition.
In this method of decomposition, threads are assigned different tasks to perform in parallel. A good example is the startup sequence of a game: many things must be verified, resources initialized, and tasks run before actual play can commence. This is a situation in which it behooves developers to use functional decomposition to reduce wait time and provide the best possible user experience: the sense of "instant on."

In decomposing the tasks, it is important to understand the order in which tasks need to be performed. If step A must be performed before step B, then developers have two options: place an execution barrier between the two steps or place both tasks in the same thread, thereby forcing them to be performed sequentially. As discussed earlier, both scenarios raise the possibility of a significant impact if load balancing is poor.

Returning to the game start-up example, there is an inherent processing barrier between the start-up sequence and the game playing itself. Consequently, start-up load needs to be balanced well so that no single unbalanced thread forces the game to wait. Should a thread appear to be doing this during a profiling session, the best course is to redesign the functional decomposition.

Look for one of two opportunities: creating smaller sub-tasks that can be assigned to other threads or, failing that, ways to mask the long thread's latency. In the first case, the design needs to be improved so that rather than relying on long running functional threads, lightweight tasks are issued and run in parallel. For example, if a single thread handles all the start-up activities, the code has an implicit case of unbalanced threading. It's clear, however, that the splash screens can be displayed and audio run as separate, independent, parallel tasks. So, it's important to look for small independent tasks in the overloaded thread that can be extracted out and run separately. This activity frequently entails some redesign. For example, a newly segregated subtask might have to recreate some context state in order to run correctly. Even with some more overhead, the thread is better off running separately than adding to the imbalance of an existing thread. (As mentioned in previous articles in this series, the ideal way to run these small tasks is by assigning them to a thread pool, so that their scheduling can be optimized by the pool and the OS scheduler.)

In some circumstances, it becomes impossible to fully rectify the workload imbalance, even after careful review of functional decomposition. When this occurs, it is beneficial to review the execution barriers in the hope of being able to mask the latency. For example, I previously referred to the barrier between start-up and the actual game. This barrier exists as a logical construct, but how impermeable does it actually need to be for the game to proceed? There are surely some tasks that can overlap with the game itself without detrimentally affecting the user experience. These tasks should be identified and not subject to the barrier, so that any imbalance they might cause in workload does not hamper program progress.

Data Decomposition
In data decomposition, the same operation is performed over multiple groups of data items. A load imbalance occurs exclusively if the data is itself unbalanced either by its quantity or its processing requirements. The remedy is to change either the distribution of the data elements between the threads or the way the data is processed by each thread. Let's look at some examples.

Take a simple case in which four threads are each processing a quarter of an array of numbers, as part of a numerical simulation. This should represent an ideally balanced workload. If this turns out not to be the case, the data should be examined. Initially, you should verify that each thread is in fact processing the same portion of the array. If the data blocks are equal, then the processing requirements must be studied. For example, Java*-similar to many languages-throws an exception (java.lang.ArithmeticException) in the event of division by zero. Exception handling is an inherently slow operation. So, if the data generates many divide-by-zero exceptions in one part of the array, that thread will experience a workload imbalance.

The process of examining first data quantity, then processing complexity can sometimes lead to unexpected solutions that need to be considered in the hunt for balanced load. For example, suppose several threads are accessing an open hash table (one in which each bucket is the head of a linked list). If one of these linked lists is unusually long, a thread that accesses it frequently will appear to run more slowly. This is a case of greater data allocation, but its solution is a better hashing algorithm for use in the hash table-one that does a better job of distributing values evenly through the table.

As with functional decomposition, a good way to mask latency when data decomposition does not allow optimal balancing is to break up the processing into small tasks by using smaller chunks of data. This approach caps the delay any one task can cause and spreads that delay onto multiple tasks that can be run on separate threads. This lightweight threading approach is best implemented with a thread pool. The thread pool found in the open-source Intel® Threading Build Blocks (Intel® TBB) library is particularly well adapted to this load balancing work. Intel® TBB uses an unusual design in its thread pool implementation: multiple task queues (as opposed to the more common single task queue from which all threads receive their work assignments). These task queues are self-monitoring and have the ability to "steal work" from other queues in the event of workload imbalance. Consequently, if you're programming in C++, this is a particularly useful library for transparently managing low-level load balance issues.


A final word: Design

The techniques presented in this article address the problem of resolving imbalances only once they're identified. While they are an important part of the multithreading developer's arsenal, they are not the first line of defense against the problem. The first line is design. When designing and implementing threaded code, it is advantageous to repeatedly ask yourself what would cause the threads to execute unevenly. In most cases, the specific situations can be anticipated and prepared for.

In some cases, the best that you can do is to monitor situations where load imbalances are especially costly. For example, when a producer-consumer pairing is using a queue to pass data, informational messages should be logged when the program is run in debug mode at every occasion in which the queue is completely empty or full.

In other cases, you can address the problem squarely. For example, in the case of the division-by-zero exceptions, a simple test for a zero divisor eliminates the exception. However, to identify these opportunities for enhanced performance and balanced load, you must continually ask yourself at design time: what could slow down this one thread? Only then will you enjoy the full benefit of parallel code.


About the Author

Andrew Binstock writes technical white papers at Pacific Data Works LLC. He is also a senior contributing editor for InfoWorld and a columnist for SD Times. He is the author or co-author of several books on programming, including two available from Intel Press. During his free time, he contributes to the open-source typesetting and page-layout project, Platypus, (http://platypus.pz.org). He can be reached through his blog at http://binstock.blogspot.com.