Strip Mining to Optimize Memory Use on 32-Bit Intel® Architecture


Challenge

Improve memory utilization by means of strip mining. Strip mining, also known as loop sectioning, is a loop-transformation technique that enables SIMD encoding of loops and also provides a means of improving memory performance. First introduced for vectorizers, the technique generates code in which each vector operation is done on a strip whose size is less than or equal to the maximum vector length of the target vector machine. Consider the following pseudocode, as it exists before strip mining:

  typedef struct _VERTEX {
    float x, y, z, nx, ny, nz, u, v;
  } Vertex_rec;

  main()
  {
    Vertex_rec v[Num];
    ....
    for (i=0; i<Num; i++) {
      Transform(v[i]);
    }
    for (i=0; i<Num; i++) {
      Lighting(v[i]);
    }
    ....
  }

 

The main loop consists of two functions:

  1. Transformation
  2. Lighting

 

For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data. If the size of array v[Num] is larger than the cache, then the coordinates for v[i] that were cached during Transform(v[i]) will be evicted from the cache by the time we do Lighting(v[i]). This means that v[i] will have to be fetched from main memory a second time, reducing performance.
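To see why the eviction happens, compare the working set of one pass against the cache. The snippet below is only a back-of-the-envelope sketch; the vertex count and the 256 KB cache capacity are illustrative assumptions, not values from the manual.

  #include <stdio.h>

  /* Rough check: is one pass over v[] larger than the cache?
     The vertex count and cache capacity below are assumed figures. */
  int main(void)
  {
      const size_t vertex_bytes = 8 * sizeof(float); /* Vertex_rec: 8 floats = 32 bytes */
      const size_t Num = 100000;                     /* assumed vertex count */
      const size_t cache_bytes = 256 * 1024;         /* assumed cache capacity */
      const size_t working_set = Num * vertex_bytes;

      if (working_set > cache_bytes)
          printf("Working set (%zu bytes) exceeds the cache; v[i] cached by "
                 "Transform() is evicted before Lighting() uses it.\n", working_set);
      return 0;
  }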


Solution

Fragment the large loop into smaller segments, or strips. This transforms the loop structure in two ways:

  • It increases the temporal and spatial locality in the data cache when the data is reusable across different passes of an algorithm.
  • It reduces the number of loop iterations by a factor equal to the length of each “vector,” that is, the number of operations performed per SIMD operation. With Streaming SIMD Extensions, that factor is four: each single-precision SIMD operation processes four floating-point data items (a minimal intrinsics sketch follows this list).
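The four-items-per-operation reduction can be seen in a small strip of SSE work. The sketch below only scales an array of floats; the function name, parameters, and scale factor are hypothetical and not taken from the manual.

  #include <xmmintrin.h>                   /* SSE intrinsics: __m128, _mm_mul_ps, ... */

  /* Illustrative only: scale an array of floats, four elements per SSE operation. */
  void scale_array(float *a, int n, float s)
  {
      __m128 vs = _mm_set1_ps(s);          /* broadcast s into all four lanes */
      int i;
      for (i = 0; i + 4 <= n; i += 4) {    /* one strip of four floats per iteration */
          __m128 va = _mm_loadu_ps(&a[i]); /* load four single-precision values */
          va = _mm_mul_ps(va, vs);         /* one SIMD multiply handles four items */
          _mm_storeu_ps(&a[i], va);
      }
      for (; i < n; i++)                   /* scalar remainder when n is not a multiple of 4 */
          a[i] *= s;
  }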

 

In the following code, the computation has been strip-mined to a size strip_size. The value strip_size is chosen such that strip_size elements of array v[Num] fit into the cache hierarchy. As a result, a given element v[i] brought into the cache by Transform(v[i]) is still in the cache when Lighting(v[i]) runs, which improves performance over the non-strip-mined code.

  main()
  {
    Vertex_rec v[Num];
    ....
    for (i=0; i < Num; i+=strip_size) {
      for (j=i; j < min(Num, i+strip_size); j++) {
        Transform(v[j]);
      }
      for (j=i; j < min(Num, i+strip_size); j++) {
        Lighting(v[j]);
      }
    }
  }
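One possible way to derive strip_size from the cache capacity and the vertex size is sketched below. The helper name, the 256 KB cache figure, and the half-cache safety margin are assumptions for illustration, not values given in the manual.

  #include <stddef.h>

  /* Hypothetical helper: size a strip so that strip_size vertices fit in the
     targeted cache level with room to spare. */
  static size_t choose_strip_size(size_t vertex_bytes)   /* 32 for Vertex_rec */
  {
      const size_t cache_bytes = 256 * 1024;   /* assumed cache capacity */
      const size_t usable = cache_bytes / 2;   /* leave room for other data */
      size_t strip = usable / vertex_bytes;    /* e.g. 131072 / 32 = 4096 */
      return strip > 0 ? strip : 1;
  }

With 32-byte vertices this would yield a strip of 4096 elements, so each Transform/Lighting pair works on data that is still resident when the second pass runs.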

 



Source

IA-32 Intel® Architecture Optimization Reference Manual

 

