User Guide

Run Offload Modeling Perspective from Command Line

Intel® Advisor provides several methods to run the Offload Modeling perspective; these methods vary in simplicity and flexibility. You can run the advisor command line interface, or run the Python* scripts with Python 3.6 or 3.7 or with the advisor-python command line interface of Intel Advisor. The Python script methods do not support MPI applications.

Prerequisites

  • Set Intel Advisor environment variables with an automated script. The script enables the advisor CLI, the advisor-python command line tool, and the APM environment variable, which points to the directory with the Offload Modeling scripts and simplifies their use.
  • For Data Parallel C++, OpenMP* target, and OpenCL™ applications: set up environment variables to temporarily offload your application to a CPU for the analysis.
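For example, on Linux the setup may look as follows. This is a minimal sketch assuming a default oneAPI installation path; the script name, its location, and the exact offload variables depend on your product and compiler version:
source /opt/intel/oneapi/setvars.sh     # enables advisor, advisor-python, and sets APM
echo $APM                               # verify: should print the Offload Modeling scripts directory
export SYCL_DEVICE_FILTER=opencl:cpu    # DPC++: temporarily run kernels on a CPU OpenCL device
export OMP_TARGET_OFFLOAD=DISABLED      # OpenMP: execute target regions on the host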

Use advisor Command Line Interface

This method is the most flexible and can analyze MPI applications (see the sketch after this list).
In the commands below:
  • Replace <APM> with $APM on Linux OS or with %APM% on Windows OS.
  • Options in square brackets ([--<option>]) are recommended if you want to change how to collect data or model application performance. See advisor Command Line Interface for syntax details.
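For example, to profile an MPI application, you can launch the advisor CLI under your MPI launcher. This is a sketch assuming Intel® MPI Library's mpirun and four ranks; the exact invocation depends on your MPI launcher:
mpirun -n 4 advisor --collect=survey --project-dir=./advi -- ./myApplication
Each rank typically writes its result into a rank-specific subdirectory of the project directory.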
You can generate command lines for your application and configuration with one of the following:
  • Run collect.py with the --dry-run option from the CLI as follows:
    advisor-python <APM>/collect.py <project-dir> --dry-run -- <target-application>
  • Generate command lines from the Intel Advisor GUI.
Copy the commands to the clipboard and run them one by one from the command line. The generated commands might require you to add certain options and steps (for example, markup) to complete the flow.
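For example, on Linux (reusing this section's ./advi project directory and myApplication target):
advisor-python $APM/collect.py ./advi --dry-run -- ./myApplication
This prints the advisor command lines for each analysis step without executing them, so you can adjust and run them manually.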
Run the perspective as follows:
  1. Run the Survey analysis to collect basic performance metrics:
    advisor --collect=survey --project-dir=<project-dir> --stackwalk-mode=online --static-instruction-mix -- <target-application> [<target-options>]
    where:
    • --stackwalk-mode=online is an option to analyze stacks during collection. The online mode is recommended for profiling applications executed on a CPU. To profile a Data Parallel C++ (DPC++), C++/Fortran with OpenMP* pragmas, or OpenCL™ application running on a CPU, set the option to offline to analyze stacks after collection.
    • --static-instruction-mix is an option to collect static instruction mix data. This option is recommended.
  2. Run the Trip Counts and FLOP analysis to analyze loop call counts and model data transfers on the target device:
    advisor --collect=tripcounts --project-dir=<project-dir> --flop --enable-cache-simulation --target-device=<target> [--stacks] [--data-transfer=<mode>] [--profile-jit] -- <target-application> [<target-options>]
    where:
    • --flop is an option to collect data about floating-point and integer operations, memory traffic, and mask utilization metrics for AVX-512 platforms.
    • --enable-cache-simulation is an option to enable modeling of cache behavior.
    • --target-device=<target> is a specific target graphics processing unit (GPU) to model cache for, such as gen11_icl (default), gen12_dg1, or gen9_gt3. See target-device for a full list of possible values. Make sure to specify the same target device as for --collect=projection --config=<config-file>.
    • --stacks is an option to enable advanced collection of call stack data.
    • --data-transfer=<mode> is an option to enable modeling data transfers between host and target devices. Use off (default) to disable data transfer modeling, light to model only data transfers, medium to model data transfers, attribute memory objects, and track accesses to stack memory, or full to also enable data reuse analysis. Use with --enable-cache-simulation only.
    • --profile-jit is an option to analyze DPC++, C++/Fortran with OpenMP pragmas, or OpenCL code regions running on a CPU.
  3. Optional: Check for loop-carried dependencies:
    1. Mark loops for the Dependencies analysis to decrease overhead. Use the generic markup strategy to select only loops profitable for offloading:
      advisor --mark-up-loops --project-dir=<project-dir> --select markup=gpu_generic -- <target-application> [<target-options>]
      For more information about markup options, see Loop Markup to Minimize Overhead.
      The generic markup strategy is recommended if you have an application that does not use DPC++, C++/Fortran with OpenMP pragmas, or OpenCL, and you want to run the Dependencies analysis for it.
    2. Run the Dependencies analysis for the marked loops:
      advisor --collect=dependencies --project-dir=<project-dir> --loop-call-count-limit=16 [--select=<string>] [--filter-reductions] -- <target-application> [<target-options>]
      where:
      • --loop-call-count-limit=16 is the maximum number of call instances to analyze, assuming similar runtime properties over different call instances. This value is recommended.
      • --select=<string> selects loops for the analysis by loop IDs, source locations, or criteria such as scalar, has-issue, or markup=<markup-mode>. The recommended argument is --select markup=gpu_generic, which selects the loops recommended to run on a target. Use this option if you did not run --mark-up-loops --select=<string> to select loops or if you want to run the Dependencies analysis for a different set of loops.
      • --filter-reductions is an option to mark all potential reductions with a specific diagnostic.
    Information about loop-carried dependencies is important for modeling the performance of scalar loops. See Check How Assumed Dependencies Affect Modeling.
  4. Model application performance with the projection analysis:
    advisor --collect=projection --project-dir=<project-dir> --config=<config> [--no-assume-dependencies] [--data-reuse-analysis] [--assume-hide-taxes] [--jit] [--custom-config=<path>]
    where:
    • --config=<config> is a target GPU configuration to model performance for, such as gen11_icl (default), gen12_dg1, or gen9_gt3. See config for a full list of possible values.
    • --no-assume-dependencies is an option to assume that a loop does not have dependencies if its dependency type is unknown. The default is --assume-dependencies. Use --no-assume-dependencies if your application contains parallel and/or vectorized loops and you did not run the Dependencies analysis.
    • --data-reuse-analysis is an option to analyze potential data reuse between code regions when offloaded to a target GPU.
    • --assume-hide-taxes is an option to assume that an invocation tax is paid only the first time a kernel is launched.
    • --custom-config=<path> is a path to a custom .toml configuration file with additional modeling parameters. For details, see Advanced Modeling Configurations.
    • --jit is an option to model performance of DPC++, C++/Fortran with OpenMP pragmas, or OpenCL code regions running on a CPU.
Example
Collect performance data, check for dependencies in potentially profitable loops, and model application performance and data transfers on Intel® Iris® Xe MAX graphics (the gen12_dg1 configuration):
advisor --collect=survey --project-dir=./advi --stackwalk-mode=online --static-instruction-mix -- myApplication
advisor --collect=tripcounts --project-dir=./advi --flop --enable-cache-simulation --target-device=gen12_dg1 --stacks --data-transfer=light -- myApplication
advisor --mark-up-loops --project-dir=./advi --select markup=gpu_generic -- myApplication
advisor --collect=dependencies --project-dir=./advi --filter-reductions --loop-call-count-limit=16 -- myApplication
advisor --collect=projection --project-dir=./advi --config=gen12_dg1
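To profile a DPC++, C++/Fortran with OpenMP pragmas, or OpenCL application that temporarily runs on a CPU, a sketch of the same flow would use the CPU-oriented options described above (offline stack walking, --profile-jit, and --jit):
advisor --collect=survey --project-dir=./advi --stackwalk-mode=offline --static-instruction-mix -- myApplication
advisor --collect=tripcounts --project-dir=./advi --flop --enable-cache-simulation --target-device=gen12_dg1 --data-transfer=light --profile-jit -- myApplication
advisor --collect=projection --project-dir=./advi --config=gen12_dg1 --jit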

Run the collect.py and analyze.py Scripts

collect.py automates profiling and allows you to run all analysis steps in one command, while analyze.py models the performance of your application on a target device. This method is simple and moderately flexible, but it does not support MPI applications.
In the commands below:
  • Replace <APM> with $APM on Linux OS or with %APM% on Windows OS.
  • Options in square brackets ([--<option>]) are recommended if you want to change how to collect data or model application performance. See advisor Command Line Interface for syntax details.
Run the scripts as follows:
  1. Collect application performance metrics with collect.py:
    advisor-python <APM>/collect.py <project-dir> [--collect=<collect-mode>] [--config=<config-file>] [--markup=<markup-mode>] [--data-transfer] [--jit] -- <target> [<target-options>]
    where:
    • --collect=<collect-mode> is an option to specify what data is collected for your application:
      • Use basic to collect only basic Survey and Trip Counts and FLOP data.
      • Use refinement to collect only Dependencies data.
      • Use full (default) to collect Survey, Trip Counts and FLOP, and Dependencies data.
      See Check How Dependencies Affect Modeling for details on when you need to collect dependency data.
    • --config=<config-file> is a target GPU configuration to model performance for, such as gen11_icl (default), gen12_dg1, or gen9_gt3. Make sure to specify the same configuration file for collect.py and analyze.py.
    • --markup=<markup-mode> specifies the loops to collect Trip Counts and FLOP and/or Dependencies data for. This option decreases collection overhead. By default, it is set to generic to analyze only loops profitable for offloading.
    • --data-transfer is an option to enable modeling data transfers between host and device when offloaded to a target. Enabled by default.
    • --jit is an option to model performance of DPC++, C++/Fortran with OpenMP pragmas, or OpenCL code regions running on a CPU.
  2. Model the performance of your application on a target GPU device with a selected configuration with analyze.py:
    advisor-python <APM>/analyze.py <project-dir> [--config=<config-file>] [--assume-parallel] [--jit]
    where:
    • --config=<config-file> is a target GPU configuration to model performance for, such as gen11_icl (default), gen12_dg1, or gen9_gt3. Make sure to specify the same configuration file for collect.py and analyze.py.
    • --assume-parallel is an option to assume that a loop does not have dependencies if there is no information about the loop dependency type and you did not run the Dependencies analysis (with collect.py --collect=basic). For details, see Check How Dependencies Affect Modeling.
    • --jit is an option to model performance of DPC++, C++/Fortran with OpenMP pragmas, or OpenCL code regions running on a CPU.
See collect.py Script and analyze.py Script reference for a full list of available options.
Example
Collect performance data and model application performance on a target GPU with Intel® Iris® Xe MAX graphics (the gen12_dg1 configuration) on Linux OS:
advisor-python $APM/collect.py ./advi --config=gen12_dg1 -- myApplication
advisor-python $APM/analyze.py ./advi --config=gen12_dg1
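To split the collection into phases, you can first collect the basic data and then collect the Dependencies data separately. This is a sketch composed from the --collect modes described above:
advisor-python $APM/collect.py ./advi --collect=basic --config=gen12_dg1 -- myApplication
advisor-python $APM/collect.py ./advi --collect=refinement --config=gen12_dg1 -- myApplication
advisor-python $APM/analyze.py ./advi --config=gen12_dg1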

Run the run_oa.py Script

This method is the simplest but the least flexible, and it does not support analysis of MPI applications. You can use it to run all collection and modeling steps with one script.
In the command below:
  • Replace <APM> with $APM on Linux OS or with %APM% on Windows OS.
  • Options in square brackets ([--<option>]) are recommended if you want to change how to collect data or model application performance. See advisor Command Line Interface for syntax details.
Run the script as follows:
advisor-python <APM>/run_oa.py <project-dir> [--collect=<collect-mode>] [--config=<config-file>] [--markup=<markup-mode>] [--data-transfer] [--jit] -- <target> [<target-options>]
where:
  • --collect=<collect-mode> is an option to specify what data is collected for your application:
    • Use basic to collect only basic Survey and Trip Counts and FLOP data.
    • Use refinement to collect only Dependencies data.
    • Use full (default) to collect Survey, Trip Counts and FLOP, and Dependencies data.
    See Check How Dependencies Affect Modeling for details on when you need to collect dependency data.
  • --config=<config-file> is a target GPU configuration to model performance for, such as gen11_icl (default), gen12_dg1, or gen9_gt3.
  • --markup=<markup-mode> specifies the loops to collect Trip Counts and FLOP and/or Dependencies data for. This option decreases collection overhead. By default, it is set to generic to analyze only loops profitable for offloading.
  • --data-transfer is an option to enable modeling data transfers between host and device when offloaded to a target. Enabled by default.
  • --jit is an option to model performance of DPC++, C++/Fortran with OpenMP pragmas, or OpenCL code regions running on a CPU.
See run_oa.py Script reference for a full list of available options.
Example
Run the full collection and modeling with the run_oa.py script with the default gen11_icl configuration on Linux OS:
advisor-python $APM/run_oa.py ./advi -- myApplication
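On Windows OS, the equivalent run would look like the following sketch (the target name myApplication.exe is a placeholder):
advisor-python %APM%\run_oa.py .\advi -- myApplication.exe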

View the Results

Intel Advisor provides several ways to work with the Offload Modeling results generated from the command line.
View Results in CLI
After you run Performance Modeling with advisor --collect=projection or analyze.py, the result summary is printed in a terminal or a command prompt. In this summary report, you can view:
  • A description of the baseline platform where application performance was measured and of the target platform for which the application performance was modeled
  • The executable binary name
  • Top metrics for measured and estimated (accelerated) application performance
  • Top regions recommended for offloading to the target and performance metrics per region
For example:
Info: Selected accelerator to analyze: Intel Gen9 GT2 Integrated Accelerator 24EU 1150MHz.
Info: Baseline Host: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, GPU: Intel (R) .
Info: Binary Name: 'CFD'.

Measured CPU Time: 44.858s    Accelerated CPU+GPU Time: 15.425s
Speedup for Accelerated Code: 3.8x    Number of Offloads: 5    Fraction of Accelerated Code: 60%

Top Offloaded Regions
--------------------------------------------------------------------------------------------------------------------
 Location                                                | Time on Baseline | Time on Target | Speedup | Bound by   | Data Transfer
--------------------------------------------------------------------------------------------------------------------
 [loop in compute_flux_ser at euler3d_cpu_ser.cpp:226]   |          36.576s |         9.103s |   4.02x | L3_BW      |      12.091MB
 [loop in time_step_ser at euler3d_cpu_ser.cpp:361]      |           1.404s |         0.319s |   4.40x | L3_BW      |      10.506MB
 [loop in compute_step_factor_ser at euler3d_cpu_ser.... |           0.844s |         0.158s |   5.35x | Compute    |       4.682MB
 [loop in main at euler3d.cpp:848]                       |           1.046s |         0.906s |   1.15x | Dependency |      31.863MB
 [loop in Intel::OpenCL::TaskExecutor::in_order_execu... |           0.060s |         0.012s |   4.98x | Dependency |       0.303MB
--------------------------------------------------------------------------------------------------------------------
See Accelerator Metrics reference for more information about the metrics reported.
View Results in GUI
When you run the Intel Advisor CLI or Python scripts, an .advixeproj project is created automatically in the directory specified with --project-dir. This project is interactive and stores all the collected results and analysis configurations. You can view it in the Intel Advisor GUI.
To open the project in the GUI, run the following command from a command prompt:
advisor-gui <project-dir>
If the report does not open, click Show Result on the Welcome pane.
You first see a Summary report that includes the most important information about measured performance on the baseline platform and modeled performance on the target platform, including:
  • Main metrics for the modeled performance of your program, which indicate whether you should offload your application to a target device.
  • Specific factors that prevent your code from achieving better performance if executed on a target device, reported as Offload Bounded by.
  • Top five offloaded loops/functions that provide the highest benefit and top five not offloaded loops/functions with the reasons why they were not offloaded.
Offload Modeling Summary in GUI
View an Interactive HTML Report
When you run the Intel Advisor CLI or Python scripts, an additional set of CSV metric reports and an interactive HTML report are generated in the <project-dir>/e<NNN>/pp<NNN>/data.0 directory. These reports are lightweight and easy to share because they do not require the Intel Advisor GUI.
The HTML report is similar to the GUI project but also reports additional metrics. It contains a list of regions profitable for offloading and their performance metrics, such as offload data transfer traffic, the estimated number of cycles on a target device, the estimated speedup, and compute- versus memory-bound characterization.
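For example, on Linux you can open the HTML report directly in a browser. This sketch assumes the result directory e000/pp000 and the report file name report.html; check the data.0 directory for the exact names in your version:
firefox ./advi/e000/pp000/data.0/report.html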
Offload Modeling HTML report
Save a Read-only Snapshot
A snapshot is a read-only copy of a project result, which you can view at any time using the Intel Advisor GUI. To save an active project result as a read-only snapshot:
advisor --snapshot --project-dir=<project-dir> [--cache-sources] [--cache-binaries] -- <snapshot-path>
where:
  • --cache-sources is an option to add application source code to the snapshot.
  • --cache-binaries is an option to add application binaries to the snapshot.
  • <snapshot-path> is a path and a name for the snapshot. For example, if you specify /tmp/new_snapshot, the snapshot is saved in the tmp directory as new_snapshot.advixeexpz. You can skip this parameter to save the snapshot to the current directory as snapshotXXX.advixeexpz.
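For example, to pack the result with its sources and binaries into /tmp/new_snapshot.advixeexpz, reusing this section's ./advi project directory:
advisor --snapshot --project-dir=./advi --cache-sources --cache-binaries -- /tmp/new_snapshot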
To open the result snapshot in the Intel Advisor GUI, run the following command:
advisor-gui <snapshot-path>
You can visually compare the saved snapshot against the current active result or other snapshot results.

Next Steps

See Identify Code Regions to Offload to understand the results. This section is GUI-focused, but you can still use it to interpret the command line results.
For details about metrics reported, see Accelerator Metrics.
