User Guide

analyze.py Options

This script allows you to run an analysis on profiling data and generate report results.

Usage

advisor-python <APM>/analyze.py <project-dir> [--options]

Replace <APM> with $APM on Linux* OS or %APM% on Windows* OS.
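For example, a minimal invocation on Linux* OS (the ./advi project directory matches the examples later in this section):
    advisor-python $APM/analyze.py ./advi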

Options

The following options are available for use with the analyze.py script.

<project-dir>
Required. Specify the path to the Intel® Advisor project directory.
-h, --help
Show all script options.
--version
Display Intel® Advisor version information.
-v <verbose>, --verbose <verbose>
Specify the output verbosity level:
  • 1 - Show only error messages. This is the least verbose level.
  • 2 - Show warning and error messages.
  • 3 (default) - Show information, warning, and error messages.
  • 4 - Show debug, information, warning, and error messages. This is the most verbose level.
This option affects the console output, but does not affect logs and report results.
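For example, to show the most detailed console output (the project path is illustrative):
    advisor-python $APM/analyze.py ./advi --verbose 4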
--assume-dependencies (default) | --no-assume-dependencies
Assume that a loop has a dependency if the loop dependency type is not known. When disabled, assume that a loop does not have dependencies if the loop dependency type is unknown.
--assume-hide-taxes [<loop-id> | <file-name>:<line-number>]
Use an optimistic approach to estimate invocation taxes: hide all invocation taxes except the first one.
You can provide a comma-separated list of loop IDs and source locations to hide taxes for. If you do not provide a list, taxes are hidden for all loops.
--assume-never-hide-taxes (default)
Use a pessimistic approach to estimate invocation taxes: do not hide any invocation taxes.
--assume-ndim-dependency (default) | --no-assume-ndim-dependency
When searching for an optimal N-dimensional offload, assume there are dependencies between inner and outer loops.
--assume-parallel | --no-assume-parallel (default)
Assume that a loop is parallel if the loop type is not known.
--assume-single-data-transfer (default) | --no-assume-single-data-transfer
Assume data is transferred once for each offload, and all instances share the data. When disabled, assume each data object is transferred for every instance of an offload that uses it. This method assumes no data reuse between calls to the same kernel.
This option requires you to enable the following options during the Trip Counts collection (see the example below):
  • With collect.py, use --collect basic or --collect full.
  • With advisor --collect=tripcounts, use data-transfer=<mode>.
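For example, a hypothetical collection and analysis pair that models per-instance data transfers (the ./advi project and the ./myApp binary are illustrative):
    advisor-python $APM/collect.py ./advi --collect basic -- ./myApp
    advisor-python $APM/analyze.py ./advi --no-assume-single-data-transfer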
--atomic-access-pattern <pattern>
Select an atomic access pattern. Possible values: sequential, partial_sums_16, same. By default, it is set to partial_sums_16.
--check-profitability (default) | --no-check-profitability
Check the profitability of offloading regions. Only regions that can benefit from the increased speed are added to a report.
When disabled, add all evaluated regions to a report, regardless of the profitability of offloading specific regions.
--config <config>
Specify a configuration file by absolute path or name. If you specify a name, the model configuration directory is searched for the file first, then the current directory.
The following device configurations are available: gen11_icl (default), gen12_tgl, gen12_dg1, gen9_gt4, gen9_gt3, gen9_gt2.
You can specify several configurations by using the option more than once, as shown below.
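For example, to model performance for two target devices in a single run (the project path is illustrative):
    advisor-python $APM/analyze.py ./advi --config gen12_tgl --config gen9_gt2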
--count-logical-instructions (default) | --no-count-logical-instructions
Use the projection of x86 logical instructions to GPU logical instructions.
--count-memory-instructions (default) | --no-count-memory-instructions
Use the projection of x86 memory instructions to GPU SEND/SENDS instructions.
--count-mov-instructions | --no-count-mov-instructions (default)
Use the projection of x86 MOV instructions to GPU MOV instructions.
--count-send-latency {all, first, off}
Select how to model SEND instruction latency:
  • all - Assume each SEND instruction has an uncovered latency. This is the default value for GPU-to-GPU modeling with --gpu, --profile-gpu, or --analyze-gpu-kernels-only.
  • first - Assume only the first SEND instruction in a thread has an uncovered latency. This is the default value for CPU-to-GPU modeling.
  • off - Do not model SEND instruction latency.
--cpu-scale-factor <integer>
Assume a host CPU that is faster than the original CPU by the specified factor. All original CPU times are divided by the scale factor.
--data-reuse-analysis | --no-data-reuse-analysis (default)
Estimate data reuse between offloaded regions. Disabling can decrease analysis overhead.
This option requires you to enable the following options during the Trip Counts collection (see the example below):
  • With collect.py, use --collect full.
  • With advisor --collect=tripcounts, use data-transfer=full.
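For example, a hypothetical pair of commands that collects full data transfer data and then models data reuse (paths are illustrative):
    advisor-python $APM/collect.py ./advi --collect full -- ./myApp
    advisor-python $APM/analyze.py ./advi --data-reuse-analysis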
--data-transfer-histogram (default) | --no-data-transfer-histogram
Estimate fine-grained data transfer and latencies for each transferred object and add a memory object histogram to a report.
This option requires you to enable track-memory-objects or data-transfer=medium or higher (for the advisor CLI only) during the Trip Counts collection.
--disable-fp64-math-optimization
Disable accounting for optimized traffic for transcendentals on the GPU.
--enable-batching | --disable-batching (default)
Enable job batching for top-level offloads. Emulate the execution of more than one instance simultaneously.
--enable-edram
Enable eDRAM modeling in the memory hierarchy model. Make sure to use this option with both collect.py and analyze.py.
--enable-slm
Enable SLM modeling in the memory hierarchy model. Make sure to use this option with both collect.py and analyze.py.
--enforce-baseline-decomposition | --no-enforce-baseline-decomposition (default)
Use the same local size and SIMD width as measured on the baseline. When disabled, search for an optimal local size and SIMD width to optimize kernel execution time.
Enable this option for GPU-to-GPU performance modeling.
-e, --enforce-offloads | --no-enforce-offloads (default)
Skip the profitability check, disable analyzing child loops and functions, and ensure that the rows marked for offload are offloaded even if offloading child rows is more profitable.
--estimate-max-speedup (default) | --no-estimate-max-speedup
Estimate region speedup with relaxed constraints. Disabling can decrease performance model overhead.
--evaluate-min-speedup
Enable offload fraction estimation that reaches the minimum speedup defined in a configuration file. Disabled by default.
--exclude-from-report <items-to-exclude>
Specify items to exclude from a report. Available items: memory_objects, sources, call_graph, dependencies, strides.
Use this option if your report contains many memory objects or sources that slow down opening it in a browser.
This option affects only the data shown in the report and does not affect data collection.
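For example, to generate a lighter report without memory objects and sources (the project path is illustrative, and the items are assumed to be passed as a comma-separated list):
    advisor-python $APM/analyze.py ./advi --exclude-from-report memory_objects,sources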
--force-32bit-arithmetics
Force all arithmetic operations to be considered single-precision floating-point or int32 operations.
--force-64bit-arithmetics
Force all arithmetic operations to be considered double-precision floating-point or int64 operations.
--gpu (recommended) | --profile-gpu | --analyze-gpu-kernels-only
Model performance only for code regions running on a GPU. Use only one of the three options. Make sure to specify this option for both collect.py and analyze.py.
This is a preview feature. --analyze-gpu-kernels-only is deprecated and will be removed in future releases.
--hide-data-transfer-tax | --no-hide-data-transfer-tax (default)
Disable data transfer tax estimation. By default, data transfer tax estimation is enabled.
--ignore <list>
Specify a comma-separated list of runtimes or libraries. Time spent in regions from these runtimes and libraries is ignored when calculating the per-program speedup. This does not affect the estimated speedup of individual offloads.
--include-to-report <items-to-include>
Specify items to include in a report. Available items: memory_objects, sources, call_graph, dependencies, strides.
This option affects only the data shown in the report and does not affect data collection.
--jit
Enable data collection and analysis for applications with DPC++, OpenMP* target, and OpenCL™ code on a base platform.
--loop-filter-threshold <threshold>
Specify the loop filter threshold in seconds. The default is 0.02. Loop nests with a total time less than the threshold are ignored.
-m <markup>, --markup <markup>
Select a markup mode, which determines the regions to mark up for data collection and analysis.
--model-children (default) | --no-model-children
Analyze child loops of the region head to find whether any of them provide a more profitable offload.
--model-extended-math (default) | --no-model-extended-math
Model calls to math functions such as EXP, LOG, SIN, and COS as extended math instructions, if possible.
--model-system-calls (default) | --no-model-system-calls
Analyze regions with system calls inside. The actual presence of system calls inside a region may reduce model accuracy.
--mpi-rank <mpi-rank>
Model performance for the specified MPI rank if multiple ranks were analyzed.
--ndim-depth-limit <N>
When searching for an optimal N-dimensional offload, limit the maximum loop depth that can be converted to one offload. The limit must be in the range 1 <= N <= 6. The default value is 3.
--no-cachesim
Disable cache simulation during collection. The model assumes a 100% cache hit rate. Using this option decreases analysis overhead.
--no-stacks
Run data analysis without using call stack data. You can use this option to avoid incorrectly attributed call stack data, at the expense of accuracy.
--non-accel-time-breakdown
Provide a detailed breakdown of non-offloaded parts of offloaded regions.
-o <output-dir>, --out-dir <output-dir>
Specify the directory to put all generated files into. By default, results are saved in <advisor-project>/e<NNN>/pp<MMM>/data.0. If you specify an existing directory or an absolute path, results are saved in the specified directory. The new directory is created if it does not exist.
If you specify only a directory <name>, results are stored in <advisor-project>/e<NNN>/pp<MMM>/<name>.
If you use this option, you might not be able to open the analysis results in the Intel Advisor GUI.
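For example, to store results in <advisor-project>/e<NNN>/pp<MMM>/report (the project path is illustrative):
    advisor-python $APM/analyze.py ./advi --out-dir report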
-p <output-name-prefix>, --out-name-prefix <output-name-prefix>
Specify a string to add to the beginning of the output result file names.
If you use this option, you might not be able to open the analysis results in the Intel Advisor GUI.
--overlap-taxes | --no-overlap-taxes (default)
Enable asynchronous execution to overlap offload overhead with execution time. When disabled, assume no overlap of execution time and offload overhead.
--refine-repeated-transfer | --no-refine-repeated-transfer (default)
Reduce over-estimation of data transfer when --no-assume-single-data-transfer is used. This option counts how many times each data object is modified and limits the number of data transfers based on that result. For example, constant data may be used in each call to a loop, but needs to be transferred to a device only once.
This option requires you to enable the following options during the Trip Counts collection (see the example below):
  • With collect.py, use --collect full.
  • With advisor --collect=tripcounts, use data-transfer=full.
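For example, a hypothetical pair of commands that enables the refinement on top of per-instance transfer modeling (paths are illustrative):
    advisor-python $APM/collect.py ./advi --collect full -- ./myApp
    advisor-python $APM/analyze.py ./advi --no-assume-single-data-transfer --refine-repeated-transfer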
--search-n-dim (default) | --no-search-n-dim
Enable the search for an optimal N-dimensional offload.
-l [<file-name>:<line-number>], --select-loops [<file-name>:<line-number>]
Limit the analysis to the specified loop nests, each identified by its topmost loop. The parameter must be a comma-separated list of source locations in the following format: <file-name>:<line-number>.
--set-dependency [<IDs/source-locations>]
Assume loops have dependencies if they have IDs or source locations from the specified comma-separated list. If the list is empty, assume all loops have dependencies.
The --set-dependency option takes precedence over --set-parallel, so if a loop is listed in both, it is considered as having a dependency.
--set-parallel [<IDs/source-locations>]
Assume loops are parallel if they have IDs or source locations from the specified comma-separated list. If the list is empty, assume all loops are parallel.
The --set-dependency option takes precedence over --set-parallel, so if a loop is listed in both, it is considered as having a dependency.
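For example, to treat one loop as parallel and another as having a dependency, assuming the same bracketed list syntax as --select-loops (source locations are illustrative):
    advisor-python $APM/analyze.py ./advi --set-parallel [foo.cpp:34] --set-dependency [bar.cpp:192]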
--set-parameter <CLI-config>
Specify a single-line configuration parameter to modify, in the format "<group>.<parameter>=<new-value>". For example: "min_required_speed_up=0", "scale.Tiles_per_process=0.5". You can use this option more than once to modify several parameters.
Make sure to use this option for both collect.py and analyze.py with the same value, as shown below.
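For example, a hypothetical collection and analysis pair that overrides the minimum required speedup in both phases (paths are illustrative):
    advisor-python $APM/collect.py ./advi --set-parameter "min_required_speed_up=0" -- ./myApp
    advisor-python $APM/analyze.py ./advi --set-parameter "min_required_speed_up=0"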
--small-node-filter <threshold>
Specify the total time threshold, in seconds, to filter out nodes in program_tree.dot and program_tree.pdf that fall below this value. The default is 0.0.
--threads <number-of-threads>
Specify the number of parallel threads to use for offload heads.
--track-heap-objects | --no-track-heap-objects
Deprecated. Use --track-memory-objects.
--track-memory-objects (default) | --no-track-memory-objects
Attribute heap-allocated objects to the analyzed loops that access them. Disabling can decrease collection overhead.
This option requires you to enable the following options during the Trip Counts collection (see the example below):
  • With collect.py, use --collect basic or --collect full.
  • With advisor --collect=tripcounts, use data-transfer=medium or data-transfer=full.
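For example, a hypothetical advisor CLI collection followed by analysis with memory object tracking (paths and the exact advisor flag spelling are illustrative):
    advisor --collect=tripcounts --data-transfer=medium --project-dir=./advi -- ./myApp
    advisor-python $APM/analyze.py ./advi --track-memory-objects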
--use-collect-configs | --no-use-collect-configs (default)
Use configuration files from the collection phase in addition to the default and custom configuration files.
Examples

  • Run analysis with the default configuration on the project in the ./advi directory. The generated output is saved to the default advi/perf_models/mNNNN directory.
    advisor-python $APM/analyze.py ./advi
  • Run analysis using the Intel® Iris® Xe MAX graphics configuration (gen12_dg1) for specific loops of the ./advi project. Add both analyzed loops to the report regardless of their offloading profitability. The generated output is saved to the default advi/perf_models/mNNNN directory.
    advisor-python $APM/analyze.py ./advi --config gen12_dg1 --select-loops [foo.cpp:34,bar.cpp:192] --no-check-profitability
  • Run analysis with a custom configuration on the ./advi project. Mark up regions for analysis and assume a code region is parallel if its type is unknown. Save the generated output to the advi/perf_models/report directory.
    advisor-python $APM/analyze.py ./advi --config ./myConfig.toml --markup --assume-parallel --out-dir report
