Using the Blocking Technique
An important class of algorithmic changes involves changing the access
pattern so that working set fits in cache. By organizing data memory accesses,
you can load the cache with a small subset of a much larger data set.
The idea is then to work on this block of data in cache. By using/reusing
this data in cache we reduce the need to go to memory (memory bandwidth
pressure).
Blocking is an optimization technique that can help to avoid memory
bandwidth bottlenecks. Blocking enables you to exploit the inherent data
reuse available in the application by ensuring that the data remains in
cache across multiple uses. You can use the blocking technique on one-,
two- or three-dimension spatial data structures. Some iterative applications
can benefit from blocking over multiple iterations, which is called temporal
blocking, and further mitigate bandwidth bottlenecks, which requires merging
some kernels. In terms of code change, blocking typically involves loop
splitting. In most application codes, blocking is best performed by parameterization
and tuning of the block-factors.
For instance, the code snippet below shows an example of blocking with
two loops (
i1
and i2
) iterating over entire
data set. The original code streams through the entire set in the inner
loop, and must load the data[i2]
value from memory in each
iteration. Assuming NUM
is large, and compute is not arithmetic
intensive, we have no reuse in cache so this application is memory-bandwidth
bound.The original code without applying the blocking technique appears as
follows:
for (i1 = 0; i1 < NUM; i1 ++){ for (i2=0; i2 < NUM; i2++) { OUT[i1] += compute(data[i1], data[i2]); } }
The blocked code below is obtained by splitting the
i2
loop into an outer loop iterating over bodies in multiple of BLOCK
,
and an inner i22
loop, iterating over elements within the
block; and interleaving the i1
and i2
loops.
This code reuses a set of BLOCK
i2
values across
multiple iterations of the i1
loop. In case you choose BLOCK
so that it makes the set of values fit in the L2 cache, the memory traffic
is reduced by a factor of the BLOCK
.The modified code with use of one-dimensional blocking appears as follows:
for (i2 = 0; i2 < NUM; i2 += BLOCK) { for (i1=0; i1 < NUM; i1 ++) { for (i22=0; i22 < BLOCK; i22 ++) { OUT[i1] += compute(data[i1], data[i2 + i22]); } } }
For the more detailed example of tiling, refer to the General
Matrix Multiply Sample .