Compiler Switches for Intel® VTune™ Amplifier XE for Linux*

 

Introduction:
Intel® VTune™ Amplifier XE for Linux* can analyze most native binaries. However, some settings make analysis easier.

Useful Settings for Intel VTune Amplifier XE for Linux:

Switch

Purpose

-g
(highly recommended)

Intel VTune Amplifier XE uses the symbols to associate addresses with source lines.

Additionally, this is one of two ways to let "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits) walk the call stack properly (see note 1).

"Release" Build
(e.g. -O2)
(highly recommended)

The time taken to execute a section of code may change if you do not build with your normal production switches (that is, not -O0). This can lead you to analyze and try to optimize a section of code that is not a performance problem.

-shared-intel
-shared-libgcc (see note 2)
(recommended)

These switches make it easier for Intel VTune Amplifier XE to run "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits).

These settings allow Intel VTune Amplifier XE to differentiate libm and C runtime calls from your code via the Call Stack Mode.

-debug inline-debug-info
(Intel Compiler on Linux)

This switch enables the Intel Compiler for Linux to associate the symbols for inlined functions with the inlined function rather than with the caller.

This mode is the default for GCC 4.1 and higher.

See -fno-inline below for more information.
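
As a rough sketch, the recommended switches above could be combined into a single GCC build line. The file names myapp.c and myapp and the variable names are placeholders introduced for illustration, and -O2 stands in for whatever optimization level your production build uses. With the Intel Compiler, -shared-intel would take the place of -shared-libgcc.

```shell
# Placeholder names (assumptions): substitute your own source and output.
SRC=${SRC:-myapp.c}
OUT=${OUT:-myapp}

# -g              symbols, so addresses can be mapped to source lines
# -O2             stand-in for your normal production optimization level
# -shared-libgcc  link libgcc dynamically
VTUNE_CFLAGS="-g -O2 -shared-libgcc"
BUILD_CMD="gcc $VTUNE_CFLAGS $SRC -o $OUT"

# Always show the command; run it only if gcc and the source file are present.
echo "$BUILD_CMD"
if command -v gcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    $BUILD_CMD
fi
```

Keeping the switches in one variable makes it easier to apply the same set to every shared library in the application, not just the main executable.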

Useful Setting for applications using Intel® Threading Building Blocks:

Switch

Purpose

-DTBB_USE_THREADING_TOOLS
(recommended)

Defining this enables full support of Intel TBB for Intel VTune Amplifier XE. Note: This macro is automatically set if you compile with -D_DEBUG or -DTBB_USE_DEBUG.

Without TBB_USE_THREADING_TOOLS set, Intel VTune Amplifier XE will not properly identify concurrency issues related to using TBB constructs.


Useful Settings for OpenMP* Applications compiled with the Intel® Compiler for Intel VTune Amplifier XE:

Switch

Purpose

-openmp
(highly recommended)

Without this switch Intel VTune Amplifier XE will not identify parallel regions due to OpenMP pragmas.

-openmp-link dynamic (see note 2)
(recommended)

This default setting of the Intel Compiler selects the dynamic version of the OpenMP runtime libraries, which has been instrumented for Intel VTune Amplifier XE.


Settings not recommended for use with Intel VTune Amplifier XE:

Switch

Purpose

"Debug" Build
(e.g. -O0)
(not recommended)

Note: Using any switch that changes the performance of your application compared to a "Release" build may dramatically alter the profile that Intel VTune Amplifier XE reports, potentially causing you to analyze and attempt to optimize a section of code that is not a performance problem when compiled as your "Release" binary.

-tcheck
(do not use)

This setting is an alternative method of instrumentation for Intel® Thread Checker. It adds overhead that distorts the performance analysis. Intel VTune Amplifier XE does not use this switch.

-static
-static-libgcc (see note 2)

These switches can prevent Intel VTune Amplifier XE from being able to run "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits) (see note 2).

Note: With the Intel Compiler, specifying -fast enables -static.

-static-intel (see note 2)

This default setting of the Intel Compiler prevents Call Stack Mode in "User mode sampling and tracing analysis" from properly identifying functions from the statically linked Intel libraries as System Functions.

-openmp-link static (see note 2)
(do not use)

In Intel® Compiler 11.0 and Intel® Composer, this setting selects the static version of the OpenMP runtime libraries. This version of the OpenMP runtime library does not contain the necessary instrumentation for Intel VTune Amplifier XE.

-tprofile
(do not use)

This setting is an alternative method of instrumentation for Intel® Thread Profiler. It adds overhead that distorts the performance analysis. Intel VTune Amplifier XE does not use this switch.

-openmp_stubs
(do not use)

This setting will prevent OpenMP codes from actually being parallel.

-msse4a, -m3dnow
(do not use)

Binaries that use instructions not supported by Intel processors may cause unpredictable behavior in Intel VTune Amplifier XE.

-debug [parallel | extended | emit-column | expr-source-pos | semantic-stepping | variable-locations]
(not recommended)

Intel VTune Amplifier XE works best with -debug full (the default when using -g). The other options, including parallel, extended, emit-column, expr-source-pos, semantic-stepping, and variable-locations, are not supported by Intel VTune Amplifier XE.

See -debug inline-debug-info for more information.

-coarray

Concurrency and Locks and Waits analysis will not properly identify locks that prevent scaling in Coarray Fortran.

-fno-inline
-fno-inline-functions
(Sometimes Useful)

Intel VTune Amplifier XE Update 5 added a feature that allows viewing inlined functions, provided the compiler emits symbols for them.

Requires:
* GCC 4.1 or later
* Intel Composer XE 2011 SP1 or later with -debug inline-debug-info

With older compilers, these switches prevent the compiler from inlining functions, so Intel VTune Amplifier XE can associate samples and instrumented APIs with the callee rather than the caller. This gives a more complete call stack and lets you see the source code for samples and instrumented APIs in functions that would otherwise be inlined.

Note: Using any of these switches may dramatically change the performance of your program, potentially causing you to analyze and attempt to optimize a section of code that is not a performance problem. Use these switches as an aid to understanding inlining, but beware of using them to determine the hotspot in a released application.

Notes:
1) "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits) needs one of two features in the executable and all shared libraries in your application to walk the call stack properly:

a) Symbols: Use -g. Note: this option also allows you to view source code.

b) Frame pointers: Use -fno-omit-frame-pointer.

Note: Other options may add frame pointers or unwind information to your binary as a side effect, for example -fexceptions (which is the default for C++) or -O0. To make sure the executable (and shared libraries) have this information, use the objdump -h <binary> command. You should see an .eh_frame_hdr section in the output.
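
The objdump check described above could be scripted roughly as follows. This is a minimal sketch: the function names has_eh_frame_hdr and check_binary are hypothetical, and it assumes objdump from GNU binutils is on the PATH.

```shell
# Reads section headers (as printed by `objdump -h`) on stdin and succeeds
# if an .eh_frame_hdr section is listed.
has_eh_frame_hdr() {
    grep '\.eh_frame_hdr' >/dev/null
}

# Hypothetical helper: report the result for one executable or shared library.
check_binary() {
    if objdump -h "$1" | has_eh_frame_hdr; then
        echo "$1: .eh_frame_hdr present"
    else
        echo "$1: .eh_frame_hdr missing"
        return 1
    fi
}
```

For example, check_binary ./myapp (a placeholder path) would report on the main executable; the same call can be repeated for each shared library the application loads.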

2) User mode sampling and tracing analysis (Hotspots, Concurrency, and Locks and Waits) works better with dynamic versions of the following libraries:

  • OpenMP Runtime Library as supplied by an Intel Compiler
    (libiomp5.so or libguide40.so)
  • POSIX Thread library (libpthread.so)
  • C Runtime Library (libc.so)
  • C++ Runtime Library (libstdc++.so)
  • Intel's Libm library (libm.so)

User mode sampling and tracing analysis (Hotspots, Concurrency, and Locks and Waits) does not work as well with the static version of the following libraries:

  • OpenMP Runtime Library as supplied by an Intel Compiler
    (libiomp5.a or libguide4.a)
  • POSIX Thread library (libpthread.a)
  • C Runtime Library (libc.a)
  • C++ Runtime Library (libstdc++.a)
  • Intel's Libm library (libm.a)
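
One way to see which of these libraries a binary picks up dynamically is to scan the output of ldd. The sketch below is an illustration only: the function name report_dynamic_libs is hypothetical, and the library names are taken from the lists above. A library reported as not found may be statically linked or simply unused by the application, so the output is a starting point rather than a verdict.

```shell
# Reads `ldd` output on stdin and prints, for each library named in the
# lists above, whether a shared (.so) version is among the dependencies.
report_dynamic_libs() {
    deps=$(cat)
    for lib in libiomp5 libguide40 libpthread libc libstdc++ libm; do
        if printf '%s\n' "$deps" | grep -F "$lib.so" >/dev/null; then
            echo "$lib: dynamic"
        else
            echo "$lib: not found as a shared dependency"
        fi
    done
}
```

Typical usage would be ldd ./myapp | report_dynamic_libs, where ./myapp is a placeholder for your executable.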

Statically linking the libraries and functions that User mode sampling and tracing analysis uses causes the following issues:

a) The static version of the OpenMP runtime library as supplied by an Intel Compiler does not contain the necessary instrumentation for Concurrency, and Locks and Waits.

b) Call Stack Mode in "User mode sampling and tracing analysis" will not properly distinguish User Code from System Functions.

c) "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits) (see note 1) will be unable to run unless various C Runtime functions are exported. There are multiple ways to do this; one is to use the -u option of the GCC compiler:

-u malloc
-u free
-u realloc
-u getenv
-u setenv
-u __errno_location

If your application creates POSIX threads (either explicitly, through the static OpenMP library, or through some other static library), there are some additional functions that you need to specify in the same way:

-u pthread_key_create
-u pthread_key_delete
-u pthread_setspecific
-u pthread_getspecific
-u pthread_spin_init
-u pthread_spin_destroy
-u pthread_spin_lock
-u pthread_spin_trylock
-u pthread_spin_unlock
-u pthread_mutex_init
-u pthread_mutex_destroy
-u pthread_mutex_trylock
-u pthread_mutex_lock
-u pthread_mutex_unlock
-u pthread_cond_init
-u pthread_cond_destroy
-u pthread_cond_signal
-u pthread_cond_wait
-u _pthread_cleanup_push
-u _pthread_cleanup_pop
-u pthread_setcancelstate
-u pthread_self
-u pthread_yield

The easiest way to do this is by creating a file with the above options and passing it to gcc or ld.

Example:

gcc -static mysource.cpp @Cdefs @Pdefs

Here, Cdefs is a file containing the options for the C functions listed above, and Pdefs is a file containing the options for the POSIX functions listed above.
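
The two option files could be generated with a short script along these lines. This is a sketch, not a tested recipe: the helper name write_defs and the OUTDIR variable are hypothetical, and the symbol lists are copied from the -u options above.

```shell
# Output directory: defaults to a fresh temporary directory (an assumption of
# this sketch); set OUTDIR=. to write the files next to your sources.
OUTDIR=${OUTDIR:-$(mktemp -d)}

# Write one "-u symbol" line per function name given as an argument.
write_defs() {
    for sym in "$@"; do
        printf '%s %s\n' -u "$sym"
    done
}

# C runtime functions.
write_defs malloc free realloc getenv setenv __errno_location \
    > "$OUTDIR/Cdefs"

# POSIX thread functions.
write_defs pthread_key_create pthread_key_delete \
    pthread_setspecific pthread_getspecific \
    pthread_spin_init pthread_spin_destroy pthread_spin_lock \
    pthread_spin_trylock pthread_spin_unlock \
    pthread_mutex_init pthread_mutex_destroy pthread_mutex_trylock \
    pthread_mutex_lock pthread_mutex_unlock \
    pthread_cond_init pthread_cond_destroy pthread_cond_signal \
    pthread_cond_wait \
    _pthread_cleanup_push _pthread_cleanup_pop \
    pthread_setcancelstate pthread_self pthread_yield \
    > "$OUTDIR/Pdefs"

# Show the resulting build line from the example above.
echo "gcc -static mysource.cpp @$OUTDIR/Cdefs @$OUTDIR/Pdefs"
```

Generating the files from a single list keeps the options in sync if functions need to be added later.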

More Information:

This article addresses the switches that developers are most likely to have concerns about. Most switches will work with Intel VTune Amplifier XE for Linux, but not every switch or switch combination is tested (there are a lot of switches!). If you have information regarding other switches, please add a comment to this article. If you have a question regarding a particular switch, please submit an issue to the Intel VTune Amplifier XE forum.

Versions:
Intel® VTune™ Amplifier XE 2011 for Linux*
Intel® C++ and Fortran Compiler for Linux 11.x, 12.x
GNU C/C++ Compiler 3.4.6

For more complete information about compiler optimizations, see our Optimization Notice.