Divide and Conquer

Problem

Parallelize a divide and conquer algorithm.

Context

Divide and conquer is widely used in serial algorithms. Common examples are quicksort and mergesort.

Forces

  • Problem can be transformed into subproblems that can be solved independently.

  • Splitting problem or merging solutions is relatively cheap compared to cost of solving the subproblems.

Solution

There are several ways to implement divide and conquer in Intel® Threading Building Blocks (Intel® TBB). The best choice depends upon circumstances.

  • If division always yields the same number of subproblems, use recursion and tbb::parallel_invoke.

  • If the number of subproblems varies, use recursion and tbb::task_group.

  • If ultimate efficiency and scalability is important, use tbb::task and continuation passing style.

Example

Quicksort is a classic divide-and-conquer algorithm. It divides a sorting problem into two subsorts. A simple serial version looks like:1

void SerialQuicksort( T* begin, T* end ) {
   if( end-begin>1  ) {
       using namespace std;
       T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
       swap( *begin, mid[-1] );
       SerialQuicksort( begin, mid-1 );
       SerialQuicksort( mid, end );
   }
}

The number of subsorts is fixed at two, so tbb::parallel_invoke provides a simple way to parallelize it. The parallel code is shown below:

void ParallelQuicksort( T* begin, T* end ) {
   if( end-begin>1 ) {
       using namespace std;
       T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
       swap( *begin, mid[-1] );
       tbb::parallel_invoke( [=]{ParallelQuicksort( begin, mid-1 );},
                             [=]{ParallelQuicksort( mid, end );} );
   }
}

Eventually the subsorts become small enough that serial execution is more efficient. The following variation, with the change shown in bold font, does sorts of less than 500 elements using the earlier serial code.

void ParallelQuicksort( T* begin, T* end ) {
   if( end-begin>=500 ) {
       using namespace std;
       T* mid = partition( begin+1, end, bind2nd(less<T>(),*begin) );
       swap( *begin, mid[-1] );
       tbb::parallel_invoke( [=]{ParallelQuicksort( begin, mid-1 );},
                             [=]{ParallelQuicksort( mid, end );} );
   } else {
       SerialQuicksort( begin, end );
   }
}

The change is an instance of the Agglomeration pattern.

The next example considers a problem where there are a variable number of subproblems. The problem involves a tree-like description of a mechanical assembly. There are two kinds of nodes:

  • Leaf nodes represent individual parts.

  • Internal nodes represent groups of parts.

The problem is to find all nodes that collide with a target node. The following code shows a serial solution that walks the tree. It records in Hits any nodes that collide with Target.

std::list<Node*> Hits;
Node* Target;
 
void SerialFindCollisions( Node& x ) {
   if( x.is_leaf() ) {
       if( x.collides_with( *Target ) )
           Hits.push_back(&x);
   } else {
       for( Node::const_iterator y=x.begin();y!=x.end(); ++y )
           SerialFindCollisions(*y);
   }
} 

A parallel version is shown below.

typedef tbb::enumerable_thread_specific<std::list<Node*> > LocalList;
LocalList LocalHits; 
Node* Target;    // Target node    
 
void ParallelWalk( Node& x ) {
   if( x.is_leaf() ) {
       if( x.collides_with( *Target ) )
           LocalHits.local().push_back(&x);
   } else {
       // Recurse on each child y of x in parallel
       tbb::task_group g;
       for( Node::const_iterator y=x.begin(); y!=x.end(); ++y )
           g.run( [=]{ParallelWalk(*y);} );
       // Wait for recursive calls to complete
       g.wait();
   }
}
 
void ParallelFindCollisions( Node& x ) {
   ParallelWalk(x);
   for(LocalList::iterator i=LocalHits.begin();i!=LocalHits.end(); ++i)
       Hits.splice( Hits.end(), *i );
} 

The recursive walk is parallelized using class task_group to do recursive calls in parallel.

There is another significant change because of the parallelism that is introduced. Because it would be unsafe to update Hits concurrently, the parallel walk uses variable LocalHits to accumulate results. Because it is of type enumerable_thread_specific, each thread accumulates its own private result. The results are spliced together into Hits after the walk completes.

The results will not be in the same order as the original serial code.

If parallel overhead is high, use the agglomeration pattern. For example, use the serial walk for subtrees under a certain threshold.

See Also

1 Production quality quicksort implementations typically use more sophisticated pivot selection, explicit stacks instead of recursion, and some other sorting algorithm for small subsorts. The simple algorithm is used here to focus on exposition of the parallel pattern.

For more complete information about compiler optimizations, see our Optimization Notice.