This post was originally published on blogs.rapidmind.com. RapidMind was acquired by Intel Corporation in August of 2009, and the RapidMind Multi-core Platform will merge with Intel Ct technology. Before joining Intel as part of the acquisition, Michael was a co-founder of RapidMind.
I have been posting about various structured patterns for parallel computation. These patterns can be used as the basis of a structured, deterministic-by-default approach to parallel computation. In my last post, I talked about the gather pattern. A gather pattern takes a collection of memory addresses (or equivalently, array indices) and reads them in parallel, resulting in an array of results.
Gather is a very general pattern, but it is useful to look at some special cases, since the properties of these special cases can be exploited to improve performance. In this post, I will discuss one such special case: the stencil pattern. A stencil is a gather in which all memory addresses used for reads are expressed as offsets relative to the location of a collection of outputs. Suppose we have an input array A, an output array B, and a set of integer offsets P. Then for every index k in the output array, we read from every input in A given by k + p for every p in P. Note that the index values in P may be multidimensional if the inputs and outputs are multidimensional. A stencil accesses a regular neighbourhood of the input for each output.
For example, suppose A and B are two-dimensional and P is the set {(0,0),(-1,0),(1,0),(0,-1),(0,1)}. The for output location (2,2) in B, we access A[2,2], A[1,2], A[3,2], A[2,1], and A[2,3]. For other outputs in B, the stencil is shifted appropriately. Note that the stencil does not specify what operation is performed on the inputs to generate the outputs. Also, it is possible for the stencil to generate reads from A outside its valid index range. A full specification of a stencil pattern has to include how these boundary cases are handled. This can be done by using default values in place of reads from A, or by modifying the indicies so they fall in the valid inputs of A (for instance, by clamping them to a valid range, by wrapping them, or by reflecting them).
The stencil pattern appears in a number of applications, but most notably in image processing and simulation. A wide variety of linear and non-linear image processing operations are specified using stencils, including linear convolution and non-linear noise reduction. Explicit solutions to partial differential equations, used in simulation and seismic reconstruction as well as in image processing, use iterative applications of stencil operations.
Efficient implementation of the stencil pattern depends on a number of optimizations. A good direct implementation in terms of low-level operations is surprisingly complex, but can significantly outperform a naive implementation.
The first optimization is boundary handling. There is usually a large region of the output for which no offset input reads from an invalid location in the input. As this is the most common case (since the number of samples in the interior grows much faster than the number on the boundary), we should include an optimized implementation for the interior that avoid unnecessary checking for out-of-bounds accesses into the input. Even along the boundaries, it is better to generate a number of special-case passes rather than check for boundary handling inside the inner loop. There may be a large number of such cases based on the size of the stencil and the dimensionality of the data.
Second, in the interior we want to use sliding windows over a strip in order to exploit spatial coherency. Adjacent outputs from the stencil typically share a number of inputs. By assigning a “strip” of the input volume to a single core and then working along this strip, we can cache previous reads from within the strip in on-chip memory and avoid reading them again, thereby saving off-chip main memory bandwidth. This reduces the amount of available parallelism, so there is a tradeoff here between parallelism and memory performance. However, with multidimensional stencil processing usually there is more than enough parallelism, so some can be sacrificed for more efficient use of memory (and less contention between cores for this typically limited resource). Adjacent strips may still make redundant memory reads. By making the strips relatively broad (again, resulting in a tradeoff between parallelism and memory performance) this “halo” effect can be reduced. A decision also needs to be made about the orientation of the strips. The optimal orientation can be depend on the layout of the data in memory as well as the shape of the stencil.
Finally, we want to vectorize the computation on processors with vector units. If the input consists of scalar data and is laid out contigously, stencil offsets tend to cause memory accesses that are not conveniently aligned with vectors, so swizzling (vector element rearranging) operations need to be inserted. Some of these can be vectorized and combined if the stencil and the strip processing can be treated as a whole and if the processor supports suitable whole-vector swizzling operations. Vectorization also interacts with strip processing and sliding windows. Normally we want to choose a strip width (or length, depending on the orientation of the strip) that is a multiple of the vector width. Both strip processing and vectorization lead to efficient processing of data that is a multiple of the strip width and the vector width. However, in real data the sizes may not be a convenient multiple of these widths, so usually some extra processing (without vectorization) needs to be done along an “outside” strip.
Hand implementing the above process using a low-level interface, needless to say, results in some very ugly and hard-to-maintain code. Also, some of the above optimizations are hardware-dependent (especially when it comes to proper vectorization). However, the basic “declarative” specification of the stencil is very simple and easy to understand. Therefore, doing the above optimizations for the stencil pattern is a prime candidate for automation. It should be noted that built-in support for stencil patterns is one of the key features of the 4.0 release of the RapidMind platform. With this release, it’s possible for developers to simply give the offsets of the stencil in a portable fashion and the platform will automatically create a suitable low-level implementation.
