Cookbook

  • 09/22/2020

Identify Bottlenecks Iteratively: Cache-Aware Roofline

Improving an application’s performance is often a multi-step process. You can take advantage of the Intel® Advisor Cache-Aware Roofline feature and supporting analysis types to perform step-by-step, systematic optimization: identify and address bottlenecks, then re-run analyses to see how your code improves with each iteration and what to do next. This section provides an illustrated example of a workflow that uses the Recommendations tab, Code Analytics tab, refinement reports, and other features to address a range of real-world issues that may be affecting your code.

Scenario

While the first versions of the Roofline feature supported only floating-point operations, Intel Advisor 2019 added support for integer operations as well, extending its usefulness to integer-heavy domains like machine learning. The process of optimizing an application with the Roofline feature is essentially the same for both types of operations, but the algorithm in this recipe uses integers. It is a standard matrix multiplication application and is subject to several performance bottlenecks.
for (i = 0; i < msize; i++) {
   for (j = 0; j < msize; j++) {
      for (k = 0; k < msize; k++) {
         c[i][j] += a[i][k] * b[k][j];
      }
   }
}
The process of iteratively addressing bottlenecks uses multiple analysis types and reports found throughout the Intel Advisor interface, as shown in the diagram below.
Intel Advisor: GUI diagram

Ingredients

This section lists the hardware and software used to produce the specific results shown in this recipe:

Collect Baseline Results

  1. Run a standard Roofline analysis on the compiled application and view the result in the GUI.
  2. Change the Roofline chart settings as necessary. In this case, set the data type to INT and the core count to 1 (the second thread in this application is a launch/setup thread).
  3. Save the result for later with the Snapshot control.
The large red dot is the matrix multiplication loop, which took approximately 166 seconds to run.
Roofline chart with matrix multiplication loop

Vectorize the Loop

The position of the loop on the chart indicates that the bottleneck is likely either the Scalar Add Peak or the DRAM Bandwidth roof.
  1. Check the Recommendations tab for hints about which problem to address first.
     Recommendation: Assumed dependency present
  2. Run a Dependencies analysis on the loop to determine whether it can be safely vectorized. The result can be found in the Refinement report. There is a dependency, but the Recommendations tab provides a suggested workaround.
     Recommendation: Proven (real) dependency present
  3. Add a reduction pragma to the code as Intel Advisor suggests.
     for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
           #pragma omp simd reduction(+:c[i][j])
           for (k = 0; k < msize; k++) {
              c[i][j] += a[i][k] * b[k][j];
           }
        }
     }
  4. Recompile, re-run the Roofline analysis, and view the result.
    Despite the change, there is little improvement.
    Roofline chart with little change in performance

Transpose a Matrix

  1. Search the Survey report for clues about the lack of improvement. The Efficiency bar in Vectorized Loops is grey, indicating the loop is no faster than the compiler's estimate of its scalar performance. There is also a possible inefficient memory access pattern listed in Performance Issues.
     Survey report with efficiency bar and performance issues column
  2. Check the Recommendations tab for more details on the possible performance issue.
     Recommendation: Possible inefficient memory access patterns present
  3. Run a Memory Access Patterns analysis on the loop. In the resulting Refinement report, we can see one of the matrices is traversed by column, which results in a poor stride of 32768.
     MAP report with stride information
  4. Create a transposed version of matrix b that can be traversed by row rather than by column.
     for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
           t[i][j] = b[j][i];
        }
     }
     for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
           #pragma omp simd reduction(+:c[i][j])
           for (k = 0; k < msize; k++) {
              c[i][j] += a[i][k] * t[j][k];
           }
        }
     }
  5. Recompile and re-run the Roofline analysis.
    Performance is now drastically improved, but despite vectorization, the loop now rests under the Scalar Add Peak.
    Roofline chart with loop under the Scalar Add Peak

Use the AVX2 Instruction Set

  1. Check the Vectorized Loops data in the Survey report. Efficiency is now high; however, the loop is only using SSE2 instructions, because the compiler defaults to this instruction set when no other is specified. Intel Advisor indicates a higher instruction set is available.
     Survey report indicates a higher instruction set is available
  2. Recompile with AVX2 instructions and re-run the Roofline analysis.
    The loop runs more than a full second faster.
    Roofline chart with significantly improved loop
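The exact option depends on the compiler in use; the invocations below are illustrative sketches (matmul.c is a placeholder file name, and the Intel compiler flags shown are the classic icc/icl spellings):

```shell
# Intel C/C++ compiler, Linux:
icc -O2 -xCORE-AVX2 matmul.c -o matmul
# Intel C/C++ compiler, Windows:
icl /O2 /QxCORE-AVX2 matmul.c
# GCC or Clang equivalent:
gcc -O2 -mavx2 matmul.c -o matmul
```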

Block for L2 Cache

The Scalar Add Peak is completely addressed. The next roofs are the L3 and L2 Bandwidth roofs.
  1. Run a Memory Access Patterns analysis on all three levels of the loop to evaluate memory usage.
    The memory footprint for the outer two loops is much larger than the caches on this machine.
    MAP report displaying memory footprints of three loops
  2. Divide the calculation into blocks sized for the L2 cache, to maximize the amount of work done for each piece of data loaded into the cache.
     for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
           t[i][j] = b[j][i];
        }
     }
     for (ichunk = 0; ichunk < msize; ichunk += CHUNK_SIZE) {
        for (jchunk = 0; jchunk < msize; jchunk += CHUNK_SIZE) {
           for (i = 0; i < CHUNK_SIZE; i++) {
              ci = ichunk + i;
              for (j = 0; j < CHUNK_SIZE; j++) {
                 cj = jchunk + j;
                 #pragma omp simd reduction(+:c[ci][cj])
                 for (k = 0; k < msize; k++) {
                    c[ci][cj] += a[ci][k] * t[cj][k];
                 }
              }
           }
        }
     }
  3. Recompile and re-run the Roofline analysis.
    While the dot did not move far, the time is cut nearly in half.
    Roofline report with time cut in half

Align Data

  1. There are no Recommendations at this point, so check the Code Analytics tab. The Compiler Notes on Vectorization indicate the data is unaligned. This can result in excessive memory operations, visible in the instruction mixes.
     Compiler Notes on the Code Analytics tab
  2. Use an appropriate aligned allocation function, such as _mm_malloc(), to allocate the data. Add an alignment pragma to the loop, and recompile with /Oa or -fno-alias.
     for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
           t[i][j] = b[j][i];
        }
     }
     for (ichunk = 0; ichunk < msize; ichunk += CHUNK_SIZE) {
        for (jchunk = 0; jchunk < msize; jchunk += CHUNK_SIZE) {
           for (i = 0; i < CHUNK_SIZE; i++) {
              ci = ichunk + i;
              for (j = 0; j < CHUNK_SIZE; j++) {
                 cj = jchunk + j;
                 #pragma vector aligned
                 #pragma omp simd reduction(+:c[ci][cj])
                 for (k = 0; k < msize; k++) {
                    c[ci][cj] += a[ci][k] * t[cj][k];
                 }
              }
           }
        }
     }
  3. Recompile, re-run the Roofline analysis, and use Compare to load the baseline result. The loop time drops to about 1 second from its original 166 seconds.
     Using the Roofline Compare feature to compare loop time

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804