This post was originally published on blogs.rapidmind.com. RapidMind was acquired by Intel Corporation in August of 2009, and the RapidMind Multi-core Platform will merge with Intel Ct technology. Before joining Intel as part of the acquisition, Michael was a co-founder of RapidMind.
In this blog series, I am reviewing a set of structured parallel patterns. Compositions of these patterns can be used to implement a wide variety of algorithms, and can be used as the basis for a parallel programming platform.
In this post, I am going to discuss a very simple pattern: gather. A gather operation is a parallel random read of a set of elements from memory. It is often described as a “collective” operation, where an array and a set of indices are provided as input and the output is another array which is the result of all the reads. Gather can also be seen as the combination of a simple “serial” memory read combined with the map pattern discussed previously. Using a random read inside a function used in a map results in a gather. As with other maps, the order in which elements are read in a gather should not affect the outcome of the gather.
Gather is simple but there are a few variants which are worth thinking about, since some of the variants allow for specific optimizations.
First, are the addresses static or dynamic? If the addresses are (relatively) static, then it may be possible to do some analysis to reorder memory accesses to improve performance. For example, when performing many versions of the FFT (Fast Fourier Transform) a “scramble” operation is required that reorders the inputs or outputs. This reordering is given by taking the index of every element in the array and computing a new location by bit-reversing that index. For example, in an 8-point scramble with 3-bit indices, 0 maps to 0, 1 maps to 4, 2 maps to 2, 3 maps to 6, 4 maps to 1, 5 maps to 5, 6 maps to 3, and 7 maps to 7. When performing a map operation, we have some flexibility in the order in which elements of the map are processed. Modern memory systems are most efficient when data is accessed in large, coherent blocks. So we may want to reorder processing so that once a block is read into on-chip memory, we gather all the elements from it that we need before moving onto the next. Of course, a map operation may have multiple gather operations, and also has to write its outputs somewhere. Reordering the elements of a map needs to consider the effect on the efficiency of all inputs and the output. However, if the inputs are static AND the platform knows they are static, there may be significant opportunities for optimization.
There are also some patterns of gather that are regular. For instance, a filtering operation may want to access certain neighbours of a pixel, but for every output pixel needs the same relative neighbourhood. This is a very common pattern in imaging and simulation, and we cover it later under the “stencil” pattern. Another regular pattern, which we call the “partition” pattern, results when input is explicitly broken into a set of coherent regions. Both of these can benefit from specific optimizations.
In general, though, addresses for a gather are computed dynamically, and are irregular. For instance, such a pattern may be needed when traversing a data structure for a ray-tracer. In this case, it is still possible to improve performance by using “extra” parallelism to hide latency. A memory system may have a high bandwidth, but may take a certain amount of time to satisfy each memory request. However, it is often possible to have many outstanding memory requests in flight at the same time. To hide the latency of memory access, we start one thread until it tries to access memory, then while that thread is waiting for memory we start another when, and when it reaches a point where it is accessing memory we start another one, and so forth until the memory request for the first thread can be satisfied. This approach may be supported in hardware: “hyperthreading”, or simultaneous multithreading, is mostly useful for hiding memory access latency. However, the technique can also be implemented in software by unrolling loops and using software pipelining. These optimizations can and should be optimized by a code generator, however, since they are complex and make code hard to understand and maintain. Also, unrolling loops also increases the pressure on other resources, such as cache and registers, so a balance has to be struck.
In summary, the simple operation of gather has a couple of variants with associated optimizations. If a gather is static, we may be able to reorder it for higher performance. If it is dynamic, we may be able to use latency hiding, as long as we do not exceed other resource limits. And finally, if the gather satisfies certain regular patterns, we may be able to apply yet another set of optimizations: I will discuss the stencil and partition patterns in particular in future posts.