Challenge
Improve memory utilization by means of strip mining. Strip mining, also known as loop sectioning, is a loop-transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance. First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine. Consider the following pseudocode, as it exists before strip mining:
|
The main loop consists of two functions:
- Transformation
- Lighting
For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data. If the size of array v[Num] is larger than the cache, then the coordinates for v[i] that were cached during Transform(v[i]) will be evicted from the cache by the time we do Lighting(v[i]). This means that v[i] will have to be fetched from main memory a second time, reducing performance.
Solution
Fragment large loops into smaller segments or strips, thus transforming the loop structure twofold:
- This technique increases the temporal and spatial locality in the data cache if the data is reusable in different passes of an algorithm.
- It also reduces the number of iterations of the loop by a factor of the length of each “vector,” or number of operations being performed per SIMD operation. In the case of Streaming SIMD Extensions, this vector or strip-length is reduced by four times: four floating-point data items per single Streaming SIMD Extensions single-precision floating-point SIMD operation are processed.
In the following code, the computation has been strip-mined to a size strip_size. The value strip_size is chosen such that strip_size elements of array v[Num] fit into the cache hierarchy. By doing this, a given element v[i] brought into the cache by Transform(v[i]) will still be in the cache when we perform Lighting(v[i]), and thus improve performance over the non-strip-mined code.
|
The following items are related to this one:
- How to Manipulate Data Structure to Optimize Memory Use on 32-Bit Intel® Architecture
- How to Use Loop Blocking to Optimize Memory Use on 32-Bit Intel® Architecture
Source
IA-32 Intel® Architecture Optimization Reference Manual

Comments
Wouldn't transforming the original loop to:
for (i=0; i<Num; i++) {
Transform(v[i]);
Lighting(v[i]);
}
Have the same effect? Of course many real world programs can't be transformed so easily.
Transforming the above code as:-
for (i=0; i<Num; i++) {
Transform(v[i]);
Lighting(v[i]);
}
Will not have a same effect as the strip mined code iff the size of the array/data is more than the cache size.