parallel_scan Template Function
Summary
Template function that computes parallel prefix.Header
#include "tbb/parallel_scan.h"
Syntax
template<typename Range, typename Body> void parallel_scan( const Range& range, Body& body ); template<typename Range, typename Body> void parallel_scan( const Range& range, Body& body, const auto_partitioner& ); template<typename Range, typename Body> void parallel_scan( const Range& range, Body& body, const simple_partitioner& );
Description
A parallel_scan(range,body) computes a parallel prefix, also known as parallel scan. This computation is an advanced concept in parallel computing that is sometimes useful in scenarios that appear to have inherently serial dependences.
A mathematical definition of the parallel prefix is as follows. Let × be an associative operation with left-identity element id_{×}. The parallel prefix of × over a sequence z_{0}, z_{1}, ...z_{n-1} is a sequence y_{0}, y_{1}, y_{2}, ...y_{n-1} where:
- y_{0} = id_{×} × z_{0}
- y_{i} = y_{i-1} × z_{i}
For example, if × is addition, the parallel prefix corresponds a running sum. A serial implementation of parallel prefix is:
T temp = id_{×}; for( int i=1; i<=n; ++i ) { temp = temp × z[i]; y[i] = temp; }
Parallel prefix performs this in parallel by reassociating the application of × and using two passes. It may invoke × up to twice as many times as the serial prefix algorithm. Given the right grain size and sufficient hardware threads, it can out perform the serial prefix because even though it does more work, it can distribute the work across more than one hardware thread.
Tip
Because parallel_scan needs two passes, systems with only two hardware threads tend to exhibit small speedup. parallel_scan is best considered a glimpse of a technique for future systems with more than two cores. It is nonetheless of interest because it shows how a problem that appears inherently sequential can be parallelized.
The template parallel_scan<Range,Body> implements parallel prefix generically. It requires the signatures described in the table below.
Pseudo-Signature |
Semantics |
---|---|
void Body::operator()( const Range& r, pre_scan_tag ) |
Accumulate summary for range r . |
void Body::operator()( const Range& r, final_scan_tag ) |
Compute scan result and summary for range r. |
Body::Body( Body& b, split ) |
Split b so that this and b can accumulate summaries separately. Body *this is object a in the table row below. |
void Body::reverse_join( Body& a ) |
Merge summary accumulated by a into summary accumulated by this, where this was created earlier from a by a's splitting constructor. Body *this is object b in the table row above. |
void Body::assign( Body& b ) |
Assign summary of b to this. |
A summary contains enough information such that for two consecutive subranges r and s:
- If r has no preceding subrange, the scan result for s can be computed from knowing s and the summary for r.
- A summary of r concatenated with s can be computed from the summaries of r and s.
For example, if computing a running sum of an array, the summary for a range r is the sum of the array elements corresponding to r.
The figure below shows one way that parallel_scan might compute the running sum of an array containing the integers 1-16. Time flows downwards in the diagram. Each color denotes a separate Body object. Summaries are shown in brackets.
- The first two steps split the original blue body into the pink and yellow bodies. Each body operates on a quarter of the input array in parallel. The last quarter is processed later in step 5.
- The blue body computes the final scan and summary for 1-4. The pink and yellow bodies compute their summaries by prescanning 5-8 and 9-12 respectively.
- The pink body computes its summary for 1-8 by performing a reverse_join with the blue body.
- The yellow body computes its summary for 1-12 by performing a reverse_join with the pink body.
- The blue, pink, and yellow bodies compute final scans and summaries for portions of the array.
- The yellow summary is assigned to the blue body. The pink and yellow bodies are destroyed.
Note that two quarters of the array were not prescanned. The parallel_scan template makes an effort to avoid prescanning where possible, to improve performance when there are only a few or no extra worker threads. If no other workers are available, parallel_scan processes the subranges without any pre_scans, by processing the subranges from left to right using final scans. That's why final scans must compute a summary as well as the final scan result. The summary might be needed to process the next subrange if no worker thread has prescanned it yet.
Example Execution of parallel_scan
The following code demonstrates how the signatures could be implemented to use parallel_scan to compute the same result as the earlier sequential example involving ×.
using namespace tbb; class Body { T sum; T* const y; const T* const z; public: Body( T y_[], const T z_[] ) : sum(id_{×}), z(z_), y(y_) {} T get_sum() const {return sum;} template<typename Tag> void operator()( const blocked_range<int>& r, Tag ) { T temp = sum; for( int i=r.begin(); i<r.end(); ++i ) { temp = temp × z[i]; if( Tag::is_final_scan() ) y[i] = temp; } sum = temp; } Body( Body& b, split ) : z(b.z), y(b.y), sum(id_{×}) {} void reverse_join( Body& a ) { sum = a.sum × sum;} void assign( Body& b ) {sum = b.sum;} }; float DoParallelScan( T y[], const T z[], int n ) { Body body(y,z); parallel_scan( blocked_range<int>(0,n), body ); return body.get_sum(); }
The definition of operator() demonstrates typical patterns when using parallel_scan.
- A single template defines both versions. Doing so is not required, but usually saves coding effort, because the two versions are usually similar. The library defines static method is_final_scan() to enable differentiation between the versions.
- The prescan variant computes the × reduction, but does not update y. The prescan is used by parallel_scan to generate look-ahead partial reductions.
- The final scan variant computes the × reduction and updates y.
The operation reverse_join is similar to the operation join used by parallel_reduce, except that the arguments are reversed. That is, this is the right argument of ×. Template function parallel_scan decides if and when to generate parallel work. It is thus crucial that × is associative and that the methods of Body faithfully represent it. Operations such as floating-point addition that are somewhat associative can be used, with the understanding that the results may be rounded differently depending upon the association used by parallel_scan. The reassociation may differ between runs even on the same machine. However, if there are no worker threads available, execution associates identically to the serial form shown at the beginning of this section.
If you change the example to use a simple_partitioner, be sure to provide a grainsize. The code below shows the how to do this for a grainsize of 1000:
parallel_scan(blocked_range<int>(0,n,1000), total, simple_partitioner() );