Data Decomposition: Sharing the Love and the Data

Data decomposition is a highly effective technique for breaking work into small tasks that can be parallelized. Do it well, and the performance of your games and visual applications will noticeably improve.

By Andrew Binstock

In two companion pieces [Boost Game Performance with Threads, Fine Grained Parallelism in Games], I described how games and visual software can use functional decomposition to obtain significant performance improvements. In this article, I examine data decomposition, which is the other primary form of breaking up monolithic processing into chunks that can be farmed out to multiple cores for parallel processing.

Decomposition


Whether you’re converting existing software to parallel programming or starting a new project in which you want to make use of multiple cores or processors, a key design element is the concept of decomposition. Unlike the single-threaded sequential programs that dominate on desktops and client-side endpoints, multithreaded software requires an approach that decomposes large problems into smaller ones that can be run concurrently. In fact, the speed-up delivered by parallel code depends almost entirely on the extent to which you can decompose the software into concurrent tasks. (This relationship is formalized by Amdahl’s Law.) So, it pays to spend some time thinking about how to break the design into discrete tasks. The better you do this, the better your performance.
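
For reference, Amdahl’s Law is usually stated as follows, where P is the fraction of the work that can be parallelized and N is the number of pipelines available:

    speedup(N) = 1 / ((1 - P) + P / N)

Even with unlimited pipelines, the speed-up is capped at 1 / (1 - P), which is why the quality of the decomposition matters so much.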

Discrete tasks are of two kinds: those in which each task has its own responsibility (functional decomposition) and those in which multiple tasks have the same responsibility but work on different pieces of data (data decomposition). The previously mentioned articles present examples of the former.

A simple example of data decomposition would be a routine that steps through an array of, say, 1 million integers and performs an arithmetic operation on each one. This work could be decomposed into multiple tasks, each of which performs the operation on some portion of the array. For example, 10 threads could each work on 100,000 elements of the array. Rendering routines for graphics quite frequently take this approach: each thread performs the computations for some sector of the screen. Today, this work is generally performed by GPUs on video cards. However, there is renewed interest in performing this work on CPUs and thereby using the many cores in today’s processors to full advantage. Intel’s new architecture development, code-named Larrabee, is one of several efforts in this direction.

Software development for games and visual applications is rife with opportunities for data decomposition: creating and traversing binary space-partitioning (BSP) trees for 3D rendering, complex collision detection, and limited geometry deformation (such as denting car bodies), to name a few. In commercial applications, data decomposition figures prominently in data generation (Monte Carlo simulations) and in linear algebra contexts, among others.

Avoiding Coarse-Grained Data Decomposition


Let’s examine the preceding example of carving an array of calculations up into smaller sub-arrays. Typically, each sub-array is assigned to a thread that performs the work on it. This approach can be implemented fairly simply, using only a few threading primitives found in all major threading libraries. The pseudo-code is presented in Listing 1.

  • Create an array of 1 million floating-point numbers
  • Load values into the array
  • For each of 10 threads,
    • Create a thread
    • Assign the thread 1/10th of the array to calculate
  • Wait until the threads have all finished.
  • Report results (or some subsequent action)

Listing 1. Pseudo-code showing data decomposition with threads.
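
As a concrete, deliberately simplified illustration of the pseudo-code above, the following sketch uses C++11 std::thread to split the array among a hard-coded number of workers. The array size, thread count, and the doubling operation are illustrative choices, not part of the original listing.

#include <thread>
#include <vector>

int main()
{
    const size_t kSize    = 1000000;        // 1 million elements
    const int    kThreads = 10;             // hard-coded thread count (the weakness discussed below)
    std::vector<float> data(kSize, 1.0f);   // load values into the array

    std::vector<std::thread> workers;
    const size_t chunk = kSize / kThreads;

    for (int t = 0; t < kThreads; ++t) {
        size_t begin = t * chunk;
        size_t end   = (t == kThreads - 1) ? kSize : begin + chunk;
        // Each thread performs the arithmetic operation on its own slice.
        workers.emplace_back([&data, begin, end] {
            for (size_t i = begin; i < end; ++i)
                data[i] *= 2.0f;
        });
    }

    for (auto &w : workers)
        w.join();                           // wait until the threads have all finished

    // report results (or some subsequent action) would go here
    return 0;
}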

This implementation is known as coarse-grained threading, due to the way the threads are deployed. Until approximately ten years ago, this was the primary way of doing parallel processing. Over time, as technology advanced, newer models improved on this basic approach. The improvements address some important constraints, the most important of which is that the number of threads is hard-coded. The general rule is that programs perform best when the number of threads equals the number of processor pipelines. Pipelines are distinct from cores: some cores, such as those on the recently released Intel® Core™ i7 processor, support simultaneous multithreading and so have more than one processing pipeline per core. (Note: A different term for processing pipeline is a physical thread, but due to the confusion this terminology creates, I’ll stick with pipelines.)

If code based on the design in Listing 1 were run on an unusual machine that had exactly 10 pipelines, it would perform optimally. But chances are that it will run on a machine with 1, 2, 4, or 8 pipelines. If the number of pipelines is greater than the number of threads, then some pipelines will be left unused. If the number of threads is greater than the number of pipelines, then thread-management overhead from swapping threads onto and off of cores will unnecessarily consume processing time. For example, consider this code running on an old Pentium® 4 processor with Intel® Hyper-Threading Technology. That processor had two pipelines. Because there are ten threads competing for the pipelines, threads will be constantly swapped in and out by the operating system. Each swap results in lost time. For this platform, only code with two threads will work optimally.

The problem of hard-coded thread counts could be solved by fetching the number of processor pipelines at runtime, but that solution is only partial. For one thing, the number of pipelines can be tricky to determine. For another, the right number of threads for a given number of pipelines can be difficult to compute if other parts of the application are also performing parallel processing.
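
As a minimal sketch of the runtime-query half of that idea, C++11 provides std::thread::hardware_concurrency(), which reports the number of hardware threads (pipelines) when the runtime can detect it. Note that it may return 0, and it knows nothing about what the rest of the application is doing.

#include <thread>
#include <cstdio>

int main()
{
    // Returns the number of hardware threads (pipelines) the
    // implementation can detect; 0 means the value is unknown.
    unsigned pipelines = std::thread::hardware_concurrency();
    if (pipelines == 0)
        pipelines = 2;   // fall back to a conservative guess

    std::printf("Detected %u hardware threads\n", pipelines);
    return 0;
}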

One way to solve this problem is to use OpenMP* (http://openmp.org/wp/), an API maintained by a vendor consortium that enables you to place pragmas in C/C++ and Fortran code to indicate that certain sections of the code can, or rather should, be parallelized. Listing 2 shows an example.

void a1(int n, float *a, float *b)
{
    int i;

// Ask an OpenMP-aware compiler to divide the iterations of the
// following loop among the available threads.
#pragma omp parallel for
    for (i = 1; i < n; i++)
        b[i] = (a[i] + a[i-1]) / 2.0;
}

Listing 2. Using OpenMP to parallelize a simple for loop. (Courtesy: OpenMP)

When the C compiler encounters the pragma in Listing 2, it simply ignores it if OpenMP support is not enabled. If OpenMP is in use, the OpenMP runtime fires off the appropriate number of threads and divides the array among them. It then restores the state of the variables so that the result is the same as if it had been computed via single-threaded, serial execution; so, the value of i is correct when the loop is complete. (A practical benefit of this approach is that debugging is easy: by commenting out the pragma, it’s possible to debug the loop as a sequential, single-threaded entity.)
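
Here is a quick, self-contained usage sketch. The driver values are arbitrary, and the exact compiler switch varies by compiler (for GCC it is -fopenmp), so consult your compiler’s documentation.

#include <cstdio>

// The routine from Listing 2, repeated here so the example stands alone.
void a1(int n, float *a, float *b)
{
    int i;

#pragma omp parallel for
    for (i = 1; i < n; i++)
        b[i] = (a[i] + a[i-1]) / 2.0;
}

int main()
{
    // Build with OpenMP enabled, e.g. "gcc -fopenmp" for GCC;
    // without the switch the pragma is ignored and the loop runs serially.
    const int n = 8;
    float a[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    float b[8] = { 0 };

    a1(n, a, b);

    for (int i = 1; i < n; i++)
        std::printf("b[%d] = %f\n", i, b[i]);
    return 0;
}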

Intel Threading Building Blocks


Another approach, one that introduces additional advantages and greater control over specific operations, is a thread pool, such as the ones provided by the Intel® Threading Building Blocks (TBB) library (see http://www.threadingbuildingblocks.org/ for information on the free and paid versions). Intel® TBB can create thread pools that divide tasks among threads. It manages the threads, load-balances among them, and takes cache efficiency into consideration. It maintains a queue of work for each thread and sets up a work-stealing arrangement, so that if one thread finishes early, it steals work from another thread’s queue.

In addition to thread pools and concurrent data structures, Intel® TBB also provides a construct similar to the OpenMP loop pragma we saw in Listing 2. Called parallel_for, it works as shown in Listing 3.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using namespace tbb;

// ApplyFoo is a body class (defined elsewhere) whose operator()
// processes one blocked_range<size_t>; G is the chosen grain size.
void ParallelApplyFoo( float a[], size_t n )
{
    parallel_for( blocked_range<size_t>(0,n,G), ApplyFoo(a) );
}


Listing 3. A basic implementation of a parallel_for in Intel® TBB.

As can be seen, this syntax uses C++ templates. The parameters to blocked_range are the beginning and end (n) of the range, followed by the “grain size” G, which is a way of specifying the granularity of the computing load handed to each thread. (parallel_for can also be called without specifying a grain size, in which case Intel® TBB computes a suitable size automatically.)
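
For readers who want to see the pieces Listing 3 leaves out, here is a fuller sketch. The Foo operation, the ApplyFoo body class, and the grain size of 1000 are illustrative assumptions, not part of the original listing.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using namespace tbb;

void Foo(float &x) { x *= 2.0f; }   // illustrative per-element operation

// Body class: TBB hands each worker thread a sub-range to process.
class ApplyFoo {
    float *const my_a;
public:
    ApplyFoo( float a[] ) : my_a(a) {}
    void operator()( const blocked_range<size_t> &r ) const {
        for ( size_t i = r.begin(); i != r.end(); ++i )
            Foo(my_a[i]);
    }
};

void ParallelApplyFoo( float a[], size_t n )
{
    // Grain size of 1000 elements chosen arbitrarily for illustration.
    parallel_for( blocked_range<size_t>(0, n, 1000), ApplyFoo(a) );
}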

The work-stealing feature of Intel® TBB works with the parallel-for loops too. So, this approach has a built-in load-balancing mechanism in addition to the for-loop parallelization capabilities mentioned in conjunction with OpenMP.

An advantage of using Intel® TBB or OpenMP for threading is that the software library is aware of all the parallelization it is doing, so that competing needs are balanced appropriately through internal automation. Developers need not worry about low-level thread management.

This aspect becomes important when combining data decomposition with functional decomposition.

Combining Data Decomposition with Functional Decomposition


There are times when a program must perform several tasks of different types. Some of those tasks might be best decomposed along functional lines, while others might best be broken up into data chunks. This situation frequently occurs in games. For example, Intel’s Smoke technology demo, which showcases threading in a game environment, highlights a typical case. The framework (http://software.intel.com/en-us/articles/smoke-game-technology-demo/) shows how developers can write games that maximize the use of parallel system resources.

Frames in the game are individual data-decomposition units. Within a frame, the work is frequently broken down along functional lines: there is the game AI layer, the physics layer, and, of course, the rendering. Smoke uses a thread pool to handle this mixture of tasks. I described thread pools in detail in my previous article. Essentially, they consist of a work queue into which tasks are placed by the application; on the back end, the pool manager assigns the queued tasks to available threads in the thread pool. The key to this approach, which is referred to as fine-grained parallelism, is to make the tasks small enough that threads can finish them quickly and then pick up the next task in the queue. It is not always possible to keep tasks small, but to the extent it can be done, load balancing is eased. Figure 1 shows how Smoke breaks up work into tasks and uses a thread pool for execution.

Figure 1. How parallel tasks are handled in Smoke

As you can see, the AI work has been subdivided via data decomposition into small tasks, while the rendering and physics tasks are not as amenable to this approach.

Thread pools are a powerful construct that enables fine-grained parallelism in games and commercial applications, and they work well for both data-decomposed and functionally decomposed tasks.
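
The mixture Smoke illustrates can be approximated, at a much smaller scale, with the task support in Intel® TBB. The sketch below is not the Smoke framework itself: UpdateAI, StepPhysics, Render, and RunFrame are hypothetical stand-ins, and tbb::task_group plus a nested parallel_for (assuming C++11 lambdas) is simply one way to mix functionally decomposed tasks with a data-decomposed one on a single thread pool.

#include "tbb/task_group.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>

struct Entity { float state; };

void StepPhysics() { /* functional task: advance the physics simulation */ }
void Render()      { /* functional task: draw the current frame */ }

void UpdateAI(std::vector<Entity> &entities)
{
    // Data-decomposed task: the AI update is split across entities,
    // so the pool's work stealing can balance it automatically.
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, entities.size()),
        [&](const tbb::blocked_range<size_t> &r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                entities[i].state += 1.0f;   // placeholder AI work
        });
}

void RunFrame(std::vector<Entity> &entities)
{
    tbb::task_group frame;
    frame.run([&] { UpdateAI(entities); }); // functional + data decomposition
    frame.run([]  { StepPhysics(); });      // functional decomposition only
    frame.run([]  { Render(); });
    frame.wait();                           // frame boundary: all tasks done
}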

From Data Decomposition to SIMD


In several ways, data decomposition resembles SIMD processor instructions. In SIMD (single instruction, multiple data), large processor registers are loaded with multiple data elements, almost invariably numeric values, on which a single operation is performed. By performing the single instruction across many operands at once, the processor saves a substantial amount of processing time. Essentially, SIMD enables data decomposition to occur at the instruction level, rather than at the thread level.
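
To make the idea concrete, the following sketch (assuming a processor and compiler with SSE support; the final installment covers SIMD properly) adds four pairs of floats with a single instruction via compiler intrinsics.

#include <xmmintrin.h>   // SSE intrinsics
#include <cstdio>

int main()
{
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float result[4];

    __m128 va   = _mm_loadu_ps(a);       // load four floats into one 128-bit register
    __m128 vb   = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb);    // one instruction adds all four pairs
    _mm_storeu_ps(result, vsum);

    std::printf("%f %f %f %f\n", result[0], result[1], result[2], result[3]);
    return 0;
}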

In the final article in this series [How Special Silicon Facilitates Parallel Arithmetic], I will look in detail at SIMD instructions as a way to improve performance by applying parallelization at the lowest possible level. I’ll include nifty means of automating SIMD and getting its benefits with little or no change to existing code.

About the Author


Andrew Binstock writes technical white papers at Pacific Data Works LLC. He is also a senior contributing editor for InfoWorld and a columnist for SD Times. He is the author or co-author of several books on programming, including “Programming With Hyper-Threading Technology” from Intel Press. During his free time, he contributes to the open-source typesetting and page-layout project, Platypus. He can be reached through his blog.
