User Guide

Intel® VTune™ Profiler User Guide

ID 766319
Date 12/16/2022
Public


Compiler Switches for Performance Analysis on Linux* Targets

Intel® VTune™ Profiler can analyze most native binaries on Linux target systems. However, the compiler settings below are recommended to make performance analysis easier and more productive:

| Use This Switch | To Do This |
| --- | --- |
| -g (highly recommended) | Enable generating the symbol information required to associate addresses with source lines and to properly walk the call stack in user-mode sampling and tracing collection types (Hotspots and Threading). |
| Release build or -O2 (highly recommended) | Enable maximum compiler optimization to focus the VTune Profiler on real performance problems that cannot be optimized with the compiler. |
| -shared-intel (Intel® C++ Compiler) or -shared-libgcc (GCC* Compiler) | Enable identifying the libm and C runtime calls as system functions and differentiating them from the user code when a proper filter mode is applied to the VTune Profiler collection result. |
| -debug inline-debug-info (Intel C++ Compiler) | Enable the VTune Profiler to identify inline functions and, according to the selected inline mode, associate the symbols for an inline function with the inline function itself or its caller. This is the default mode for GCC* 4.1 and higher. NOTE: The debug inline-debug-info option is enabled by default for the Intel® oneAPI DPC++/C++ Compiler if you compile with optimizations (-O2 or higher) and debug information (-g option). |
| -D TBB_USE_THREADING_TOOLS | Enable Intel® oneAPI Threading Building Blocks Analysis (oneTBB) for the VTune Profiler. This macro is automatically set if you compile with -D_DEBUG or -DTBB_USE_DEBUG. Without TBB_USE_THREADING_TOOLS set, the VTune Profiler will not properly identify concurrency issues related to using oneTBB constructs. |
| -qopenmp (highly recommended) (Intel C++ Compiler) | Enable the VTune Profiler to identify parallel regions due to OpenMP* pragmas. |
| -qopenmp-link dynamic (Intel C++ Compiler) | Enable the Intel Compiler to choose the dynamic version of the OpenMP runtime libraries which has been instrumented for the VTune Profiler. Usually, this option is enabled for the Intel Compiler by default. |
| -parallel-source-info=2 (Intel C++ Compiler) | Enable/disable source location emission when OpenMP or auto-parallelism code is generated. 2 is the level of source location emission that tells the compiler to emit path, file, routine name, and line information. |
| --info-for-profiling (Intel oneAPI DPC++ Compiler, Intel Fortran Compiler) | Enable generating debug information for GPU analysis of a SYCL application. Generate debug information for OpenMP* Offload applications compiled by Intel Fortran compiler. |
| -Xsprofile (Intel oneAPI DPC++ Compiler) | Enable source-level mapping of performance data for FPGA application analysis. |
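
As a rough illustration of how the GCC-related switches above might be combined on one command line, consider the sketch below. The file names app.c and app are hypothetical placeholders, and the flag selection is only one plausible combination, not an official recipe. The compiler is invoked only if gcc and the source file are actually present.

```shell
#!/bin/sh
# Sketch: combine the GCC switches recommended in the table above.
#   -g              symbol information for source-line mapping and stack walking
#   -O2             release-level optimization
#   -shared-libgcc  link against the shared version of libgcc
PROFILE_FLAGS="-g -O2 -shared-libgcc"

# app.c and app are hypothetical file names used for illustration only.
BUILD_CMD="gcc $PROFILE_FLAGS -o app app.c"
echo "$BUILD_CMD"

# Run the build only when the compiler and the source file are both available.
if command -v gcc >/dev/null 2>&1 && [ -f app.c ]; then
    $BUILD_CMD
fi
```

With an Intel compiler, the Intel-specific switches from the table (for example -shared-intel or -qopenmp) would be added in the same way.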

Avoid These Switches

The following compiler settings are NOT recommended:

| Do Not Use This Switch | Because Of This |
| --- | --- |
| Debug build or -O0 | Changes the performance of your application compared to a release build and may dramatically affect the profile, potentially causing you to analyze and try to optimize a section of code that is not a performance problem in the release build. |
| -static, -static-libgcc | Prevents the VTune Profiler from being able to run the user-mode sampling and tracing analysis types. See below for more details. NOTE: When you specify the -fast switch with the Intel Compiler, it automatically enables -static. |
| -static-intel | Prevents the user-mode sampling and tracing analysis types from distinguishing system functions properly. This is the default option for the Intel Compiler. |
| -qopenmp-link static | Chooses the static version of the OpenMP runtime libraries for the Intel Compiler. This version of the OpenMP runtime library does not contain the instrumentation data required for the VTune Profiler analysis. |
| -qopenmp_stubs | Prevents OpenMP code from being parallel. |
| -msse4a, -m3dnow | Generates binaries that use instructions not supported by Intel processors, which may cause unknown behavior when profiling with the VTune Profiler. |
| -debug [parallel \| extended \| emit-column \| expr-source-pos \| semantic-stepping \| variable-locations] | VTune Profiler works best with -debug full (the default mode when using -g). Other options, including parallel, extended, emit-column, expr-source-pos, semantic-stepping, and variable-locations, are not supported by the VTune Profiler. See -debug inline-debug-info for more information. |
| -coarray | Prevents the Threading analysis from properly identifying the locks that disable scaling in Coarray Fortran. |
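
One way to catch these switches before a profiling build is to scan the flag string used by the build. The helper below is a hypothetical sketch (check_flags is not part of VTune Profiler or any compiler). It only covers single-word switches from the list above, so two-word forms such as -qopenmp-link static or -debug parallel are deliberately not handled.

```shell
#!/bin/sh
# check_flags "FLAGS": print every switch in FLAGS that the list above advises against.
# Hypothetical helper for illustration; single-word switches only.
check_flags() {
    bad=""
    for flag in $1; do
        case "$flag" in
            -O0|-static|-static-libgcc|-static-intel|-qopenmp_stubs|-msse4a|-m3dnow|-coarray|-coarray=*)
                bad="$bad $flag" ;;
        esac
    done
    # Strip the leading space left by the concatenation above.
    echo "${bad# }"
}

# Example flag string (an assumption, not taken from a real project).
RESULT=$(check_flags "-g -O0 -static -fno-omit-frame-pointer")
echo "discouraged switches: $RESULT"
```

For the example flag string, the helper reports -O0 and -static and leaves the other two switches alone.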

Compiling for the User-Mode Sampling and Tracing Analysis

For successful user-mode sampling and tracing analysis (Hotspots and Threading) of your executable and all shared libraries, use the following switches to properly walk through the call stack:

  • Use -g to generate the symbol information and enable the source code analysis.

  • Use -fno-omit-frame-pointer to keep frame pointers in the generated code and enable frame pointer analysis.

    NOTE:

    There are other options that may add frame pointers to your binary as a side effect, for example: -fexceptions (default for C++) or -O0. To make sure the executable (and shared libraries) have this information, use the objdump -h <binary> command and make sure you see the .eh_frame_hdr section there.
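
The two switches and the objdump check from the note can be sketched as follows. The names app.c and app are hypothetical, and has_eh_frame_hdr is an illustrative helper rather than an existing tool. The check runs only if a binary named app exists in the current directory.

```shell
#!/bin/sh
# Switches for user-mode sampling and tracing builds, as listed above.
STACK_FLAGS="-g -fno-omit-frame-pointer"

# app.c and app are hypothetical file names.
echo "gcc $STACK_FLAGS -O2 -o app app.c"

# has_eh_frame_hdr BINARY: succeeds if `objdump -h` lists an .eh_frame_hdr section.
has_eh_frame_hdr() {
    objdump -h "$1" 2>/dev/null | grep -q '\.eh_frame_hdr'
}

# Run the check only when the binary actually exists.
if [ -f ./app ]; then
    if has_eh_frame_hdr ./app; then
        echo "app: .eh_frame_hdr section found"
    else
        echo "app: .eh_frame_hdr section missing"
    fi
fi
```

The same check can be repeated for each shared library that the application loads.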

User-mode sampling and tracing analysis types work better with dynamic versions of the following libraries:

| Library | Dynamic Version (Recommended) | Static Version (Not Recommended) |
| --- | --- | --- |
| OpenMP Runtime (supplied by the Intel Compiler) | libiomp5.so or libguide40.so | libiomp5.a or libguide4.a |
| POSIX Thread | libpthread.so | libpthread.a |
| C Runtime | libc.so | libc.a |
| C++ Runtime | libstdc++.so | libstdc++.a |
| Intel Libm | libm.so | libm.a |
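
To see which of these libraries a binary picks up dynamically, the output of ldd can be filtered for the library names in the table. The sketch below assumes a hypothetical binary named app. The helper list_dynamic_runtimes is illustrative only and covers a subset of the listed names (libguide40.so is omitted). It relies on grep -o, which is available in GNU and BusyBox grep but is not required by POSIX.

```shell
#!/bin/sh
# list_dynamic_runtimes: read `ldd` output on stdin and print the runtime
# libraries from the table above that appear in it, one per line.
list_dynamic_runtimes() {
    grep -oE 'lib(iomp5|pthread|c|stdc\+\+|m)\.so' | sort -u
}

# app is a hypothetical binary name; the check runs only if it exists.
if [ -x ./app ]; then
    ldd ./app | list_dynamic_runtimes
fi
```

A library from the table that is missing from this output is either unused or linked statically, which is worth checking against the limitations of statically linked libraries.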

User-mode sampling and tracing collection has the following limitations for analyzing statically linked libraries/functions:

  • The static version of the OpenMP runtime library supplied by the Intel Compiler does not provide the necessary instrumentation for the Threading analysis type.

  • Call Stack mode cannot properly distinguish user code from system functions.

  • User-mode sampling and tracing collection cannot execute unless various C Runtime functions are exported. There are multiple ways to do this; for example, use the -u option of the GCC compiler:

    • -u malloc

    • -u free

    • -u realloc

    • -u getenv

    • -u setenv

    • -u __errno_location

If your application creates POSIX threads (either explicitly or via the static OpenMP library or some other static library), you also need to explicitly export the following additional functions in the same way:

  • -u pthread_key_create

  • -u pthread_key_delete

  • -u pthread_setspecific

  • -u pthread_getspecific

  • -u pthread_spin_init

  • -u pthread_spin_destroy

  • -u pthread_spin_lock

  • -u pthread_spin_trylock

  • -u pthread_spin_unlock

  • -u pthread_mutex_init

  • -u pthread_mutex_destroy

  • -u pthread_mutex_trylock

  • -u pthread_mutex_lock

  • -u pthread_mutex_unlock

  • -u pthread_cond_init

  • -u pthread_cond_destroy

  • -u pthread_cond_signal

  • -u pthread_cond_wait

  • -u _pthread_cleanup_push

  • -u _pthread_cleanup_pop

  • -u pthread_setcancelstate

  • -u pthread_self

  • -u pthread_yield

The easiest way to do this is by creating a file with the above options and passing it to gcc or ld. For example:

gcc -static mysource.cpp @Cdefs @Pdefs

where Cdefs is a file with options for the required C functions and Pdefs is a file with the options for the required POSIX functions.
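
The two option files can be generated with a short script such as the sketch below. It writes Cdefs and Pdefs into the current directory, one -u option per line, using the symbol lists given above. The script itself is an illustration and not part of VTune Profiler.

```shell
#!/bin/sh
# Write the required C runtime symbols to Cdefs, one "-u symbol" per line.
for sym in malloc free realloc getenv setenv __errno_location; do
    printf '%s %s\n' -u "$sym"
done > Cdefs

# Write the required POSIX thread symbols to Pdefs in the same format.
for sym in pthread_key_create pthread_key_delete pthread_setspecific \
           pthread_getspecific pthread_spin_init pthread_spin_destroy \
           pthread_spin_lock pthread_spin_trylock pthread_spin_unlock \
           pthread_mutex_init pthread_mutex_destroy pthread_mutex_trylock \
           pthread_mutex_lock pthread_mutex_unlock pthread_cond_init \
           pthread_cond_destroy pthread_cond_signal pthread_cond_wait \
           _pthread_cleanup_push _pthread_cleanup_pop pthread_setcancelstate \
           pthread_self pthread_yield; do
    printf '%s %s\n' -u "$sym"
done > Pdefs

# The files are then passed to the compiler driver as shown above.
echo "gcc -static mysource.cpp @Cdefs @Pdefs"
```

Keeping the symbol lists in files makes it easy to reuse them across build targets instead of repeating the options on every link line.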
