Cookbook


Optimize Memory Access Patterns using Loop Interchange and Cache Blocking Techniques

Understanding how your application accesses data, and how that data fits in the cache, is a critical part of optimizing the performance of your code. In this recipe, use the Memory Access Patterns analysis and recommendations to identify and address common memory bottlenecks, using techniques like loop interchange and cache blocking.

Scenario

Many of the most common workloads in today's computer science domains - such as artificial intelligence, simulation, and modeling - rely on matrix multiplication. The algorithm in this recipe is a triply-nested loop with a multiply and an add operation in each iteration. It is computationally intensive and accesses a lot of memory.
for (i = 0; i < msize; i++) {
   for (j = 0; j < msize; j++) {
      for (k = 0; k < msize; k++) {
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
   }
}
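To experiment with this kernel outside of Intel Advisor, you can wrap it in a small stand-alone driver. The sketch below is an assumption-based harness, not the recipe's actual source: msize = 2048 is inferred from the stride of 2048 reported later in this recipe, and the allocation, initialization, and timing details are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MSIZE 2048   /* assumed; matches the stride of 2048 observed below */

int main(void)
{
    size_t msize = MSIZE;
    /* Heap-allocate: three 2048 x 2048 double matrices (~96 MB total)
       would overflow the stack. */
    double (*a)[MSIZE] = malloc(sizeof(double[MSIZE][MSIZE]));
    double (*b)[MSIZE] = malloc(sizeof(double[MSIZE][MSIZE]));
    double (*c)[MSIZE] = calloc(1, sizeof(double[MSIZE][MSIZE]));
    if (!a || !b || !c)
        return 1;

    for (size_t i = 0; i < msize; i++)
        for (size_t j = 0; j < msize; j++) {
            a[i][j] = (double)i + (double)j;
            b[i][j] = (double)i - (double)j;
        }

    clock_t start = clock();
    /* The baseline kernel from this recipe. */
    for (size_t i = 0; i < msize; i++)
        for (size_t j = 0; j < msize; j++)
            for (size_t k = 0; k < msize; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

    printf("c[last][last] = %g, elapsed = %.2f s\n",
           c[msize - 1][msize - 1],
           (double)(clock() - start) / CLOCKS_PER_SEC);
    free(a); free(b); free(c);
    return 0;
}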

Ingredients

This section lists the hardware and software used to produce the specific results shown in this recipe.

Establish a Baseline

  1. Run the Roofline analysis in the Intel Advisor CLI:
    advixe-cl --collect=roofline --project-dir=/user/test/matrix_project -- /user/test/matrix
    The matrix multiply application runs and completes in 151.36 seconds.
  2. Open the Roofline chart in the Intel Advisor GUI to visualize performance.
    The red dot is the main computational loop. Notice it is far below the DRAM bandwidth maximum, indicated by the lowest diagonal line. This tells us the loop has potential problems with memory accesses.
    Roofline chart of the main computational loop
  3. Open the Survey report in the Intel Advisor GUI to identify potential problems and recommendations (a command-line alternative is sketched after these steps).
    Notice Inefficient memory access patterns present in Performance Issues and 49% in Efficiency:
    Survey report with inefficient memory access patterns messages
    Survey report containing low efficiency result
  4. To investigate possible memory bandwidth issues:
    • Run a Memory Access Patterns analysis in the Intel Advisor CLI:
      advixe-cl --mark-up-loops --select=multiply.c.c,21 --project-dir=/user/test/matrix_project -- /user/test/matrix
      advixe-cl --collect=map --project-dir=/user/test/matrix_project -- /user/test/matrix
    • Open the Memory Access Patterns report in the Intel Advisor GUI to analyze how the data structure layout affects performance.
      Unit stride is preferred for vectorization; however, Intel Advisor detects 50% constant strides and 50% unit strides for this loop. Equally important, Intel Advisor detects a constant stride of 2048 on the read access and a uniform stride of 0 on the write access, while the range of memory accessed is 6 MB (a short sketch of where these stride numbers come from follows these steps).
      MAP report showing a loop's code
      MAP report with stride report results
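For step 3, Intel Advisor can also print the Survey findings as a text report without opening the GUI. A minimal sketch, assuming the project directory used above; the exact report layout varies by product version:
advixe-cl --report=survey --project-dir=/user/test/matrix_project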
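The stride numbers in step 4 follow directly from the layout of the arrays. The sketch below assumes msize = 2048 (matching the reported stride) and b declared as a static 2048 x 2048 array of doubles; it is illustrative only. Consecutive k iterations of the read b[k][j] land one full row, that is, msize elements apart, while the write c[i][j] does not move as k varies, which Intel Advisor reports as a stride of 0.

#include <stdio.h>
#include <stddef.h>

#define MSIZE 2048   /* assumed matrix dimension */

static double b[MSIZE][MSIZE];   /* static storage keeps the sketch small */

int main(void)
{
    size_t k = 0, j = 0;
    /* Element distance between b[k][j] on consecutive k iterations. */
    ptrdiff_t stride = ((char *)&b[k + 1][j] - (char *)&b[k][j])
                       / (ptrdiff_t)sizeof(double);
    printf("read stride: %td elements\n", stride);   /* prints 2048 */
    return 0;
}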

Perform a Loop Interchange

The constant stride of 2048 is a result of the access of b[k][j], where k is the index variable:
c[i][j] = c[i][j] + a[i][k] * b[k][j];
Interchanging the two inner loops will convert the constant stride to unit stride and eliminate the need to access memory over such a wide range.
  1. Edit the matrix multiply application by interchanging the two inner loops:
    for (i = 0; i < msize; i++) {
       for (k = 0; k < msize; k++) {
          for (j = 0; j < msize; j++) {
             c[i][j] = c[i][j] + a[i][k] * b[k][j];
          }
       }
    }
  2. Run another Roofline analysis in the CLI:
    advixe-cl --collect=roofline --project-dir=/user/test/matrix_project -- /user/test/matrix
    The matrix multiply application runs and completes in 5.53 seconds.
  3. Open the Roofline chart in the GUI to visualize performance.
    The loop, represented by the red dot, is now just below the L2 peak bandwidth:
    Roofline chart with the loop below the L2 peak bandwidth

Examine Memory Traffic at Each Level of the Memory Hierarchy

You can also use the Intel Advisor integrated Roofline analysis to examine memory traffic at each level of the memory hierarchy.
Prerequisite: Set the following environment variable: ADVIXE_EXPERIMENTAL=int_roofline
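The variable can be set in the shell session used for collection. A minimal sketch, assuming Linux with a bash-compatible shell:
export ADVIXE_EXPERIMENTAL=int_roofline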
  1. Run another Roofline analysis in the CLI with the cache simulation feature enabled:
    advixe-cl --collect=roofline --enable-cache-simulation --project-dir=/user/test/matrix_project -- /user/test/matrix
  2. Open the Roofline chart in the GUI to visualize performance.
    Each red dot represents the loop at one level of the memory hierarchy. Given that the far left dot reaches only the L3 bandwidth roof, it is clear the loop is not hitting L1 cache efficiently. The far right dot represents DRAM, and it appears to be at the DRAM bandwidth roof. So our next optimization goal: do a better job fitting our data into L1 cache.
    Roofline chart with dots representing the loop at each memory hierarchy level
  3. Check the Code Analytics tab below the Roofline chart to examine memory traffic:
    Code Analytics tab with memory traffic metrics
  4. Open the Survey report in the GUI to examine vector efficiency.
    Notice the loop is now vectorized at 96% efficiency using AVX2 (a compiler-side cross-check is sketched after these steps):
    Survey report displaying improved efficiency
  5. To verify the new access patterns:
    • Run another Memory Access Patterns analysis in the CLI:
      advixe-cl --mark-up-loops --select=multiply.c.c,21 --project-dir=/user/test/matrix_project -- /user/test/matrix
      advixe-cl --collect=map --project-dir=/user/test/matrix_project -- /user/test/matrix
    • Open the Memory Access Patterns report in the GUI.
      Intel Advisor shows all accesses are now unit stride:
      MAP report showing accesses are unit stride
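As a cross-check on the 96% vectorization efficiency reported in step 4, most compilers can emit their own vectorization report. A hedged example using GCC follows; the recipe does not state which compiler produced these results, so treat the exact flags as an illustration (with the Intel compiler, the analogous option family is -qopt-report):
gcc -O2 -ftree-vectorize -fopt-info-vec-optimized multiply.c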

Implement a Cache-Blocking Strategy

The matrix multiply application is now fairly well optimized. Another possible improvement that works well when processing a large amount of data: break up the application into smaller partitions to perform computations over a smaller range of memory. This technique is called cache blocking, and it will enable the loop to fit within L1 cache.
  1. Edit the matrix multiply application to add:
    • Pragmas to tell the compiler:
      • Memory is aligned.
      • Do not unroll.
    • Three additional nested loops so computations are performed serially in sections, or blocks (a note on choosing the block size mblock follows these steps):
      for (i0 = ibeg; i0 < ibound; i0 += mblock) {
         for (k0 = 0; k0 < msize; k0 += mblock) {
            for (j0 = 0; j0 < msize; j0 += mblock) {
               for (i = i0; i < i0 + mblock; i++) {
                  for (k = k0; k < k0 + mblock; k++) {
      #ifdef ALIGNED
      #pragma vector aligned
      #endif // ALIGNED
      #pragma nounroll
                     for (j = j0; j < j0 + mblock; j++) {
                        c[i][j] = c[i][j] + a[i][k] * b[k][j];
                     }
                  }
               }
            }
         }
      }
  2. Run another Roofline analysis in the CLI:
    advixe-cl --collect=roofline --enable-cache-simulation --project-dir=/user/test/matrix_project -- /user/test/matrix
    The matrix multiply application runs and completes in 3.29 seconds.
  3. Open the integrated Roofline chart in the GUI to visualize performance.
    The far left dot is closer to the L1 bandwidth roof, so we are doing a better job fitting our data into L1 cache. Interestingly, the far right dot, which represents DRAM, is now far below the DRAM bandwidth roof. This is because there is less DRAM traffic and more L1-L3 traffic, and thus higher arithmetic intensity and FLOPs. This is our objective.
    Roofline chart showing significantly less DRAM traffic
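The recipe does not give a value for mblock, and the best choice depends on the target machine's cache sizes. One common heuristic, shown here purely as an assumption-based sketch, is to size the block so that the three tiles in flight (one each from a, b, and c) fit in the L1 data cache; a 32 KB L1 data cache is assumed. Note that mblock must also divide msize evenly, because the blocked loops above have no remainder handling.

/* Hypothetical block-size choice: three mblock x mblock tiles of doubles
   should fit in an assumed 32 KB L1 data cache.
   3 * 32 * 32 * 8 bytes = 24,576 bytes <= 32,768 bytes. */
#define MBLOCK 32
_Static_assert(3 * MBLOCK * MBLOCK * sizeof(double) <= 32 * 1024,
               "three tiles must fit in the L1 data cache");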

Result Summary

Run                 Time Elapsed       Time Improvement
Baseline            151.36 seconds     N/A
Loop Interchange    5.53 seconds       27.37x
Cache Blocking      3.29 seconds       1.68x

Key Take-Aways

  • You can dramatically increase performance by optimizing memory.
  • Understanding the following is critical to meet performance optimization objectives:
    • How an application accesses data structures
    • How those data structures fit in cache
