How to Use Strip Mining to Optimize Memory Use on 32-Bit Intel® Architecture

Submit New Article

Last Modified On :   May 7, 2008 2:48 AM PDT
Rate
 



Challenge

Improve memory utilization by means of strip mining. Strip mining, also known as loop sectioning, is a loop-transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance. First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine. Consider the following pseudocode, as it exists before strip mining:

typedef struct _VERTEX {
float x, y, z, nx, ny, nz, u, v;
} Vertex_rec;

main()
{
Vertex_rec v[Num];
....
for (i=0; i<Num; i++) {
Transform(v[i]);
}
for (i=0; i<Num; i++) {
Lighting(v[i]);
}
....
}

 

The main loop consists of two functions:

  1. Transformation
  2. Lighting

 

For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data. If the size of array v[Num] is larger than the cache, then the coordinates for v[i] that were cached during Transform(v[i]) will be evicted from the cache by the time we do Lighting(v[i]). This means that v[i] will have to be fetched from main memory a second time, reducing performance.


Solution

Fragment large loops into smaller segments or strips, thus transforming the loop structure twofold:

  1. This technique increases the temporal and spatial locality in the data cache if the data is reusable in different passes of an algorithm.
  2. It also reduces the number of iterations of the loop by a factor of the length of each “vector,” or number of operations being performed per SIMD operation. In the case of Streaming SIMD Extensions, this vector or strip-length is reduced by four times: four floating-point data items per single Streaming SIMD Extensions single-precision floating-point SIMD operation are processed.

 

In the following code, the computation has been strip-mined to a size strip_size. The value strip_size is chosen such that strip_size elements of array v[Num] fit into the cache hierarchy. By doing this, a given element v[i] brought into the cache by Transform(v[i]) will still be in the cache when we perform Lighting(v[i]), and thus improve performance over the non-strip-mined code.

main()
{
Vertex_rec v[Num];
....
for (i=0; i < Num; i+=strip_size) {
for (j=i; j < min(Num, i+strip_size); j++) {
Transform(v[j]);
}
for (j=i; j < min(Num, i+strip_size); j++) {
Lighting(v[j]);
}
}
}

 

The following items are related to this one:

 


Source

IA-32 Intel® Architecture Optimization Reference Manual