• 2020
  • 12/17/2020

Frequent DRAM Accesses

This recipe profiles a memory-bound application with the Microarchitecture Exploration and Memory Access analyses of the Intel® VTune™ Profiler to find the cause of its frequent DRAM accesses.


This section lists the hardware and software tools used for the performance analysis scenario.
  • Application:
    matrix multiplication
    sample that multiplies two 2048x2048 matrices whose elements have the double type. The
    sample package is available with the product in the
    directory and from the Intel Developer Zone at
  • Performance analysis tools:
    • Intel® VTune™ Profiler version 2019 or newer: Microarchitecture Exploration (formerly General Exploration) and Memory Access analyses
      • Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.
      • Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are scalable. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.
      • Download the latest version of Intel® VTune™ Profiler from the product download page.
  • Operating system:
    Linux*, Ubuntu* 16.04 64-bit
  • CPU:
    Intel® Core™ i7-6700K processor
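
To see why this configuration is likely to be memory bound, it helps to estimate the working set. The sketch below is only an illustration: the helper names are hypothetical, and it assumes an 8-byte double and counts only the three matrices a, b, and c that the kernel reads and writes.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by one square n x n matrix (hypothetical helper). */
size_t matrix_bytes(size_t n, size_t elem_size)
{
    assert(elem_size > 0);
    return n * n * elem_size;
}

/* Combined size of the three matrices a, b, and c used by the kernel. */
size_t working_set_bytes(size_t n, size_t elem_size)
{
    return 3 * matrix_bytes(n, elem_size);
}
```

For n = 2048 and 8-byte elements, each matrix takes 32 MiB and the three together take 96 MiB. That is far larger than the 8 MB last-level cache of the listed CPU, so the data cannot stay cache resident between passes.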

Create a Baseline

The initial version of the sample code provides a naïve multiplication algorithm with the following code for the main kernel:
void multiply1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM])
{
    int i, j, k;

    // Naive implementation
    for (i = tidx; i < msize; i = i + numt) {
        for (j = 0; j < msize; j++) {
            for (k = 0; k < msize; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}
Running the compiled application takes about 26 seconds. This is the performance
baseline for further optimizations.
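
One way to record such a baseline from inside the program is to read a monotonic clock before and after the kernel call. This is a minimal sketch, not the timing code of the sample itself: now_seconds is a hypothetical helper, and it assumes a POSIX system that provides clock_gettime, such as the Linux system listed above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime from <time.h> */
#endif
#include <assert.h>
#include <time.h>

/* Current reading of the monotonic clock, in seconds (hypothetical helper). */
double now_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);   /* CLOCK_MONOTONIC is expected to be available */
    (void)rc;          /* avoid an unused-variable warning with NDEBUG */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

Call now_seconds() immediately before and after the kernel and subtract the two readings to get the elapsed wall-clock time. The monotonic clock is used so that system time adjustments do not distort the measurement.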

Run Microarchitecture Exploration Analysis

To have a high-level understanding of potential performance bottlenecks for the sample, start with the Microarchitecture Exploration analysis provided by
Intel® VTune™ Profiler:
  1. Click the
    New Project
    button on the toolbar and specify a name for the new project, for example:
  2. In the
    Configure Analysis
    window, select the
    Local Host
    target system type on the
  3. On the
    pane, select the
    Launch Application
    target type and specify an application for analysis.
  4. On the
    pane, click the browse button and select
    Microarchitecture Exploration
    analysis from the
  5. Click the
    VTune Profiler launches the application, collects data, and finalizes the data collection result by resolving symbol information, which is required for successful source analysis.

Identify Hardware Hotspots

Microarchitecture Exploration helps you see dominant performance bottlenecks in your code. Start your analysis with the
representation in the
view that displays CPU microarchitecture efficiency and CPU pipeline stalls for the analyzed application. According to the
below, the output pipe flow is very narrow, which means that the
metric value needs to be increased to improve application performance. The primary obstacle in the pipe is the
Memory Bound
metric value:
[Figure: Microarchitecture Pipe]
From the metric tree on the left, you see that performance is mostly bound by access to the DRAM.
When you switch to the Bottom-up view, you see that the application has one big hotspot function.
Double-clicking this function opens the Source view that highlights the most performance-critical code line:
Almost all of the time is spent in source line #51, which operates on three arrays: a, b, and c.

Run Memory Access Analysis

To find out which array is the most expensive to access, run the Memory Access analysis with the
Analyze dynamic memory objects
option enabled.

Identify Hot Memory Accesses

window for the Memory Access analysis result shows the top memory objects as follows:
[Figure: Top Memory Objects by Latency]
Click the first hotspot object
in the list to switch to the Bottom-up view and then double-click this object highlighted in the grid to open the Source view and see the line allocating this memory object:
You see the allocation for the
variable that is assigned to
, which is in turn assigned to array
b. So, you may conclude that the problematic array is
b. Click the
Open Source File Editor
button on the toolbar and have a look at the code again:
void multiply1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM])
{
    int i, j, k;

    // Naive implementation
    for (i = tidx; i < msize; i = i + numt) {
        for (j = 0; j < msize; j++) {
            for (k = 0; k < msize; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}
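You can now see the root cause of the problem: the innermost loop walks array b down a column, because in b[k][j] the index k changes fastest. C stores two-dimensional arrays row by row, so each iteration jumps a whole row ahead in memory and almost every access touches a different cache line.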
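
The size of those jumps can be worked out from the array declaration. The sketch below is illustrative only: it assumes NUM is 2048 and TYPE is double, as the matrix description in this recipe suggests, and the helper names and the 64-byte cache line size are assumptions rather than values taken from the sample.

```c
#include <assert.h>
#include <stddef.h>

#define NUM 2048       /* assumed matrix dimension, per the recipe */
typedef double TYPE;   /* assumed element type, per the recipe */

/* Byte distance between b[k][j] and b[k+1][j]: one full row. */
size_t column_step_bytes(void)
{
    return sizeof(TYPE[NUM]);
}

/* Byte distance between a[i][k] and a[i][k+1]: one element. */
size_t row_step_bytes(void)
{
    return sizeof(TYPE);
}

/* How many consecutive accesses with the given step land in one cache line. */
size_t accesses_per_line(size_t step_bytes, size_t line_bytes)
{
    assert(step_bytes > 0 && line_bytes > 0);
    return step_bytes >= line_bytes ? 1 : line_bytes / step_bytes;
}
```

With these assumptions, stepping along a row of a moves 8 bytes at a time, so eight consecutive accesses share one 64-byte line. Stepping down a column of b moves 16384 bytes at a time, so every access needs a new line.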

Apply Loop Interchange for Optimization

Apply the loop interchange algorithm to the main kernel
as follows:
for (i = tidx; i < msize; i = i + numt) {
    for (k = 0; k < msize; k++) {
        for (j = 0; j < msize; j++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
Compiling and running the new code results in a 1.3-second runtime, a 20x improvement over the original 26 seconds.
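
Before trusting the speedup, it is worth confirming that the interchanged loops compute the same product. The following is a minimal sketch of such a check, not part of the sample: it uses a small hypothetical size SMALL_N instead of 2048, drops the threading parameters tidx and numt, and fills the inputs with small integers so that every result is exactly representable in a double.

```c
#include <assert.h>

#define SMALL_N 8      /* small size for a quick check; the recipe uses 2048 */
typedef double TYPE;   /* element type, as in the recipe */

/* Original loop order from the recipe: i, j, k. */
void multiply_ijk(TYPE a[][SMALL_N], TYPE b[][SMALL_N], TYPE c[][SMALL_N])
{
    for (int i = 0; i < SMALL_N; i++)
        for (int j = 0; j < SMALL_N; j++)
            for (int k = 0; k < SMALL_N; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Interchanged loop order: i, k, j. */
void multiply_ikj(TYPE a[][SMALL_N], TYPE b[][SMALL_N], TYPE c[][SMALL_N])
{
    for (int i = 0; i < SMALL_N; i++)
        for (int k = 0; k < SMALL_N; k++)
            for (int j = 0; j < SMALL_N; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Runs both versions on the same inputs and returns the number of
   elements of the result that differ between them. */
int count_mismatches(void)
{
    TYPE a[SMALL_N][SMALL_N], b[SMALL_N][SMALL_N];
    TYPE c1[SMALL_N][SMALL_N], c2[SMALL_N][SMALL_N];
    int mismatches = 0;

    for (int i = 0; i < SMALL_N; i++) {
        for (int j = 0; j < SMALL_N; j++) {
            a[i][j] = (TYPE)((i + 2 * j) % 5);
            b[i][j] = (TYPE)((3 * i + j) % 7);
            c1[i][j] = 0;
            c2[i][j] = 0;
        }
    }

    multiply_ijk(a, b, c1);
    multiply_ikj(a, b, c2);

    for (int i = 0; i < SMALL_N; i++)
        for (int j = 0; j < SMALL_N; j++)
            if (c1[i][j] != c2[i][j])
                mismatches++;
    return mismatches;
}

/* Convenience self-check: aborts if the two loop orders disagree. */
void assert_loop_orders_agree(void)
{
    assert(count_mismatches() == 0);
}
```

Both orders add the products for a given c[i][j] in ascending k, so only the memory access pattern changes, not the arithmetic.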

What's Next

Re-run the Microarchitecture Exploration analysis on the optimized
code. The
diagram shows a significant increase of the
metric value, from 10.06% to 63.28%:
[Figure: Microarchitecture Exploration]
You can then focus on other flagged metrics, such as poor port utilization, to identify further areas for improvement.
