How to Use Loop Blocking to Optimize Memory Use on 32-Bit Intel® Architecture


Challenge

Improve memory utilization by means of loop blocking. The main purpose of loop blocking is to eliminate as many cache misses as possible. Consider the following loop, as it exists before blocking:

float A[MAX, MAX], B[MAX, MAX]

for (i=0; i< MAX; i++) {

for (j=0; j< MAX; j++) {

A[i,j] = A[i,j] + B[j, i];

}

}

 


Solution

Transform the memory domain of a given problem into smaller chunks, rather than sequentially traversing through the entire memory domain. Each chunk should be small enough to fit all the data for a given computation into the cache, thereby maximizing data reuse. In fact, one can treat loop blocking as strip mining (see the separate item "How to Use Strip Mining to Optimize Memory Use on 32-Bit Intel® Architecture") in two or more dimensions.

The two-dimensional array A is referenced in the j (column) direction and then referenced in the i (row) direction (column-major order); whereas array B is referenced in the opposite manner (row-major order). Assume the memory layout is in column-major order; therefore, the access strides of array A and B for the code in this item would be 1 and MAX, respectively.

The following code sample illustrates the loop presented in the Challenge section, after blocking:

float A[MAX, MAX], B[MAX, MAX];

for (i=0; i< MAX; i+=block_size) {

for (j=0; j< MAX; j+=block_size) {

for (ii=i; ii<i+block_size; ii++) {

for (jj=j; jj<j+block_size; jj++) {

A[ii,jj] = A[ii,jj] + B[jj, ii];

}

}

}

}

 

For the first iteration of the inner loop, each access to array B will generate a cache miss. If the size of one row of array A, that is, A[2, 0:MAX-1] , is large enough, by the time the second iteration starts, each access to array B will always generate a cache miss. For instance, on the first iteration, the cache line containing B[0, 0:7] will be brought in when B[0,0] is referenced, because the float-type variable is four bytes, and each cache line is 32 bytes. Due to the limitation of cache capacity, this line will be evicted due to conflict misses befo re the inner loop reaches the end. For the next iteration of the outer loop, another cache miss will be generated while referencing B[0,1]. In this manner, a cache miss occurs when each element of array B is referenced, that is, there is no data reuse in the cache at all for array B.

This situation can be avoided if the loop is blocked with respect to the cache size. In the figure below, a block_size is selected as the loop-blocking factor. Suppose that block_size is eight; then the blocked chunk of each array will be eight cache lines (32 bytes each). In the first iteration of the inner loop, A[0, 0:7] and B[0, 0:7] will be brought into the cache. B[0, 0:7] will be completely consumed by the first iteration of the outer loop. Consequently, B[0, 0:7] will only experience one cache miss after applying loop blocking optimization in lieu of eight misses for the original algorithm. As illustrated below, arrays A and B are blocked into smaller rectangular chunks so that the total size of two blocked A and B chunks is smaller than the cache size. This allows maximum data reuse.



As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. If MAX is huge, loop blocking can also help reduce the penalty from DTLB (data translation look-aside buffer) misses. In addition to improving the cache/memory performance, this optimization technique also saves external bus bandwidth.

The following items are related to this one:

 


Source

IA-32 Intel® Architecture Optimization Reference Manual

 


如需更全面地了解编译器优化,请参阅优化注意事项