This post was originally published on blogs.rapidmind.com. RapidMind was acquired by Intel Corporation in August of 2009, and the RapidMind Multi-core Platform will merge with Intel Ct technology. Before joining Intel as part of the acquisition, Michael was a co-founder of RapidMind.
Over a series of posts, I have been discussing various patterns for structured parallel computation. These patterns, when composed, can be used to express computations in a wide variety of domains. Scalable approaches to parallel computation have to take two things into account: tasks and data. Many approaches to parallel computation over-emphasize the former at the expense of the latter. However, in order to achieve scalable parallel performance, especially on modern hierarchical NUMA memory architectures, algorithms need to be designed in a way that emphasizes data locality and isolation.
The partition pattern provides a simple way for the software developer to express both data locality and isolation. In this pattern, a collection of data is broken down into a set of non-overlapping regions each called a partition, and then a single task is assigned to operate on each partition. Since the partitions are non-overlapping, these tasks can freely operate in parallel on the data in each partition without interfering with one another. Each task can both read and write to the data in their partition without fear of causing race conditions, since in this pattern there is a guarantee that only one task is accessing each partition. In addition, partitions that are created from contiguous regions exhibit spatial data locality, and so are a good match to hierarchical memory systems.
The partition can be thought of as a special case of the gather pattern, although since it also provides isolation it can also be used to support efficient race-free local writes (scatters). In combination with other patterns already discussed, a sequence of maps (as in the sequence and map patterns discussed earlier) using different partitions might be used in order to achieve global communication, even though each step in the sequence uses a local partition.
There are two variants of the partition pattern: regular and irregular. In a regular partition each partition is the same size. Unless specified otherwise it’s best to think of this as the default for the partition pattern, although if we want to be specific we can also call the regular partition case the “dice” pattern. In contrast, in an irregular partition, while still non-overlapping, partitions can be of different sizes. I will call irregular partitions and the associated pattern “segments” and will discuss them in a later post. Likewise, in a divide-and-conquer approach to algorithm design, you may want to apply (possibly irregular) partitioning/segmentation patterns recursively. This is related to yet another pattern, “nesting”. For now, let’s just focus on the simple regular partition.
Even with simple regular partitions, there are a few things to keep in mind. In particular, the hierarchical memory system hardware itself uses this pattern. Caches are divided into blocks (cache lines) of different sizes, and virtual memory is organized around pages. Data sharing between processors is managed on a block basis. Normally these blocks are powers of two in size and are aligned along power of two boundaries, and blocks at different levels of the memory hierarchy are nested.
It is usually beneficial to make the software partitioning used consistent with the partitioning used by the underlying hardware. Software developers should normally use aligned power-of-two partitions when designing algorithms, so that different software partitions map onto different hardware partitions. If instead the divisions between software partitions are located in the interior of hardware partitions, the hardware may treat the software partitions as shared even though they aren’t. This can result in extra (and unnecessary) interprocessor communication (cache coherency protocols at the hardware level) to keep the contents of the hardware blocks consistent when multiple processors are writing to the same blocks, even if different processors are updating different parts of the shared blocks. This “false sharing” adds hidden communication overhead that can severely impact scalability.
The optimal partition size can also be dependent on the hardware memory architecture, and there are often tradeoffs between data locality and algorithmic efficiency. A smaller partition may fit completely into a higher level of the memory hierarchy, allowing processing of that partition to be more efficient, but smaller partitions also usually incur more overhead for communication in the rest of the algorithm. As an example, consider matrix multiplication, which can be written in terms of nested sub-matrix multiplications. Larger sub-matrices give more data reuse and coherence, but use more local cache.
To avoid a dependency on any particular memory hardware, algorithms should be written in a generic way using parameterized partition sizes. Then, when installing the software on a particular processor, some testing can be used to tune the partition size to the optimum for that architecture without any need to change the software. Note that processors can vary in cache sizes and organization even within the same generation of processors from the same vendor.
The partition pattern is supported in the RapidMind platform with the “dice” data access operator, which divides an array into a kind of nested array of uniformly sized partitions. The result of the dice can then be passed to a “map” operation that operates on each partition in parallel. Using dice appropriately can significantly improve the performance of algorithm implementations. In addition, RapidMind supports very powerful mechanisms for generic programming which make it possible to write parameterized code that can be easily tuned to different processor architectures. In RapidMind this tuning can be done without any need to recompile the software.