Strip Mining to Optimize Memory Use on 32-Bit Intel® Architecture

Submit New Article

March 10, 2009 2:00 AM PDT



Challenge

Improve memory utilization by means of strip mining. Strip mining, also known as loop sectioning, is a loop-transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance. First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine. Consider the following pseudocode, as it exists before strip mining:

  typedef struct _VERTEX {
float x, y, z, nx, ny, nz, u, v;
} Vertex_rec;

main()
{
Vertex_rec v[Num];
....
for (i=0; i<Num; i++) {
Transform(v[i]);
}
for (i=0; i<Num; i++) {
Lighting(v[i]);
}
....
}

 

The main loop consists of two functions:

  1. Transformation
  2. Lighting

 

For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data. If the size of array v[Num] is larger than the cache, then the coordinates for v[i] that were cached during Transform(v[i]) will be evicted from the cache by the time we do Lighting(v[i]). This means that v[i] will have to be fetched from main memory a second time, reducing performance.


Solution

Fragment large loops into smaller segments or strips, thus transforming the loop structure twofold:

  • This technique increases the temporal and spatial locality in the data cache if the data is reusable in different passes of an algorithm.
  • It also reduces the number of iterations of the loop by a factor of the length of each “vector,” or number of operations being performed per SIMD operation. In the case of Streaming SIMD Extensions, this vector or strip-length is reduced by four times: four floating-point data items per single Streaming SIMD Extensions single-precision floating-point SIMD operation are processed.

 

In the following code, the computation has been strip-mined to a size strip_size. The value strip_size is chosen such that strip_size elements of array v[Num] fit into the cache hierarchy. By doing this, a given element v[i] brought into the cache by Transform(v[i]) will still be in the cache when we perform Lighting(v[i]), and thus improve performance over the non-strip-mined code.

  main()
{
Vertex_rec v[Num];
....
for (i=0; i < Num; i+=strip_size) {
for (j=i; j < min(Num, i+strip_size); j++) {
Transform(v[j]);
}
for (j=i; j < min(Num, i+strip_size); j++) {
Lighting(v[j]);
}
}
}

 

The following items are related to this one:

 


Source

IA-32 Intel® Architecture Optimization Reference Manual