• 10/30/2018

Using the Blocking Technique

An important class of algorithmic changes involves changing the memory access pattern so that the working set fits in cache. By reorganizing data accesses, you can load the cache with a small subset of a much larger data set and then work on that block of data while it is in cache. Using and reusing the data in cache reduces the number of trips to memory, and with it the pressure on memory bandwidth.
Blocking is an optimization technique that helps avoid memory bandwidth bottlenecks. Blocking enables you to exploit the inherent data reuse available in the application by ensuring that the data remains in cache across multiple uses. You can use the blocking technique on one-, two-, or three-dimensional spatial data structures. Some iterative applications can benefit from blocking over multiple iterations, which is called temporal blocking; this requires merging some kernels, but further mitigates bandwidth bottlenecks. In terms of code changes, blocking typically involves loop splitting. In most application codes, blocking is best performed by parameterizing the block factors and tuning them.
For instance, the code snippet below shows an example of blocking with two loops (i1 and i2) iterating over the entire data set. The original code streams through the entire set in the inner loop, so it must load the data[i2] value from memory in each iteration. Assuming NUM is large and the compute is not arithmetically intensive, there is no reuse in cache, so this application is memory-bandwidth bound.
The original code, without the blocking technique, appears as follows:
for (i1 = 0; i1 < NUM; i1++) {
    for (i2 = 0; i2 < NUM; i2++) {
        OUT[i1] += compute(data[i1], data[i2]);
    }
}
The blocked code below is obtained by splitting the i2 loop into an outer loop that advances in multiples of BLOCK and an inner i22 loop that iterates over the elements within the block, and by interleaving the i1 and i2 loops. This code reuses a set of BLOCK i2 values across multiple iterations of the i1 loop. If you choose BLOCK so that this set of values fits in the L2 cache, memory traffic is reduced by a factor of BLOCK.
The modified code, using one-dimensional blocking, appears as follows:
for (i2 = 0; i2 < NUM; i2 += BLOCK) {
    for (i1 = 0; i1 < NUM; i1++) {
        for (i22 = 0; i22 < BLOCK; i22++) {
            OUT[i1] += compute(data[i1], data[i2 + i22]);
        }
    }
}
For a more detailed example of tiling, refer to the General Matrix Multiply Sample.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804