Compiler Switches for Intel® VTune™ Amplifier XE for Linux*

 

Introduction:
Intel® VTune™ Amplifier XE for Linux* can analyze most native binaries. However, some settings make analysis easier.

Useful Settings for Intel VTune Amplifier XE for Linux:

Switch

Purpose

-g
(highly recommended)

Intel VTune Amplifier XE uses the symbols to associate addresses with source lines.

Additionally, symbols are one of the two ways to let "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits) properly walk the call stack (see Note 1).

"Release" Build
(e.g., -O2)
(highly recommended)

The time required to execute a section of code may change if you do not use your normal production switches (for example, if you build with -O0 instead). This could cause you to analyze and attempt to optimize a section of code that is not a performance problem.

-shared-intel
-shared-libgcc (see Note 2)
(recommended)

These switches make it easier for Intel VTune Amplifier XE to run "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits).

These settings allow Intel® VTune™ Amplifier XE to differentiate libm and C runtime calls from your code via the 'call stack' mode.

-debug inline-debug-info
(Intel Compiler on Linux)

This switch enables the Intel® C++ Compiler for Linux* to attribute the symbols for an inlined function to the inlined function itself rather than to its caller.

This mode is the default for GCC 4.1 and higher.

See -fno-inline below for more info.
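As a rough illustration of the table above, the following sketch collects the recommended switches into a small helper and prints the flag set for either compiler family. The function name profiling_flags and the choice of -O2 are assumptions made for illustration; substitute your own production switches.

```shell
# Hypothetical helper: print the profiling-friendly switches from the table
# for a given compiler driver name. -O2 stands in for your release switches.
profiling_flags() {
    case "$1" in
        icc|icpc)
            # Intel compilers: also request shared Intel libraries and
            # symbols for inlined functions
            echo "-g -O2 -shared-intel -shared-libgcc -debug inline-debug-info"
            ;;
        *)
            # GCC and others: symbols, release optimization, shared libgcc
            echo "-g -O2 -shared-libgcc"
            ;;
    esac
}

profiling_flags gcc
profiling_flags icc
```

A build line could then look like: gcc $(profiling_flags gcc) -o myapp myapp.c, where myapp and myapp.c are placeholder names.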

 

Useful Settings for applications using Intel® Threading Building Blocks:

Switch

Purpose

-DTBB_USE_THREADING_TOOLS
(recommended)

Defining this macro enables full Intel TBB support in VTune™ Amplifier XE. Note: This macro is automatically set if you compile with -D_DEBUG or -DTBB_USE_DEBUG.

Without TBB_USE_THREADING_TOOLS set, VTune™ Amplifier XE will not properly identify concurrency issues related to the use of TBB constructs.


Useful Settings for OpenMP* Applications compiled with the Intel® Compiler for Intel VTune Amplifier XE:

Switch

Purpose

-openmp
(highly recommended)

Without this switch, Intel VTune™ Amplifier XE will not identify the parallel regions created by OpenMP pragmas.

-openmp-link dynamic (see Note 2)
(recommended)

This setting, the default on the Intel Compiler, selects the dynamic version of the OpenMP runtime library, which has been instrumented for VTune™ Amplifier XE.


Settings not recommended for use with Intel® VTune™ Amplifier XE:

Switch

Purpose

"Debug" Build
(e.g., -O0)
(not recommended)

Note: Using any switch that changes the performance of your application compared to a release build may dramatically impact the profile that VTune™ Amplifier XE reports, potentially causing you to analyze and attempt to optimize a section of code that is not a performance problem in the release build.

-tcheck
(do not use)

This setting is an alternative method of instrumentation for Intel® Thread Checker; it will cause overhead, altering the performance analysis. VTune™ Amplifier XE does not use this switch.

-static
-static-libgcc (see Note 2)

These switches can prevent VTune™ Amplifier XE from running "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits).

Note: On the Intel Compiler, specifying -fast enables -static.

-static-intel (see Note 2)

This setting, the default on the Intel Compiler, links the Intel-provided libraries statically. As a result, call-stack mode in "User mode sampling and tracing analysis" cannot properly identify functions from those libraries as system functions.

-openmp-link static (see Note 2)
(do not use)

In Intel® Compiler 11.0 and Intel® Composer, this setting chooses the static version of the OpenMP runtime libraries. This version of the OpenMP runtime library does not contain the necessary instrumentation for VTune™ Amplifier XE.

-tprofile
(do not use)

This setting is an alternative method of instrumentation for the Intel® Thread Profiler; it will cause overhead, altering the performance analysis. Intel VTune Amplifier XE does not use this switch.

-openmp_stubs
(do not use)

This setting prevents OpenMP code from actually running in parallel.

-msse4a, -m3dnow
(do not use)

Binaries that use instructions not supported by Intel processors may cause unpredictable behavior in Intel VTune Amplifier XE.

-debug [parallel | extended | emit-column | expr-source-pos | semantic-stepping | variable-locations]
(not recommended)

VTune™ Amplifier XE works best with -debug full (the default when using -g). Other options, including parallel, extended, emit-column, expr-source-pos, semantic-stepping, and variable-locations, are not supported by Intel® VTune™ Amplifier XE.

See -debug inline-debug-info above for more info.

-coarray

Concurrency and Locks and Waits Analysis will not properly identify locks which prevent scaling in Coarray Fortran.

-fno-inline
-fno-inline-functions
(sometimes useful)

VTune Amplifier XE Update 5 added a feature that allows viewing inlined functions when the compiler supports symbols for inlined functions.

Requires:
* GCC 4.1 or later
* Intel Composer XE 2011 SP1 or later + -debug inline-debug-info

If you are using older compilers, these switches prevent the compiler from inlining functions, allowing Intel VTune Amplifier XE to associate samples and instrumented APIs with the callee and not the caller. This gives a more complete call stack and lets you see the source code of samples and instrumented APIs in functions that would be inlined without the switch.

Note: Using any one of these switches may dramatically impact the performance of your program, potentially causing you to analyze and attempt to optimize a section of code that is not a performance problem. Use these switches as an aid to understand inlining, but beware of using them to determine the hotspot in a released application.

 

Notes:
1) "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits) needs one of two features on the executable and all shared libraries in your application to properly walk the call stack:

a) Symbols: Use -g. Note: this option also allows you to view source code.

b) Frame pointers: Use -fno-omit-frame-pointer.

Note: Other options may add this frame information to your binary as a side effect, for example -fexceptions (the default for C++) or -O0. To make sure the executable (and shared libraries) have this information, use the objdump -h <binary> command. You should see an .eh_frame_hdr section in the output.
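The check described in Note 1 can be scripted. The sketch below assumes binutils objdump is installed; the function name report_stack_info is hypothetical. The additional .debug_info check is an assumption that the binary uses DWARF debug information, which is what -g typically produces on Linux.

```shell
# Hypothetical helper: report call-stack-related sections in an ELF binary.
# Returns 0 if at least one section is present, 1 if none is,
# and 2 if the section headers could not be read.
report_stack_info() {
    bin=$1
    headers=$(objdump -h "$bin" 2>/dev/null) || {
        echo "$bin: could not read section headers"
        return 2
    }
    found=1
    for section in .eh_frame_hdr .debug_info; do
        # objdump -h prints the section name surrounded by spaces
        if printf '%s\n' "$headers" | grep -q -F -- " $section "; then
            echo "$bin: $section present"
            found=0
        else
            echo "$bin: $section missing"
        fi
    done
    return $found
}
```

For example, report_stack_info ./myapp (where myapp is a placeholder) prints one line per section; run it on the executable and on each shared library the application loads.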

2) User mode sampling and tracing analysis (Hotspots, Concurrency, and Locks and Waits) works better with dynamic versions of the following libraries:

  • OpenMP Runtime Library as supplied by an Intel Compiler
    (libiomp5.so or libguide40.so)
  • POSIX Thread library (libpthread.so)
  • C Runtime Library (libc.so)
  • C++ Runtime Library (libstdc++.so)
  • Intel's Libm library (libm.so)

User mode sampling and tracing analysis (Hotspots, Concurrency, and Locks and Waits) does not work as well with the static version of the following libraries:

  • OpenMP Runtime Library as supplied by an Intel Compiler
    (libiomp5.a or libguide4.a)
  • POSIX Thread library (libpthread.a)
  • C Runtime Library (libc.a)
  • C++ Runtime Library (libstdc++.a)
  • Intel's Libm library (libm.a)
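One way to see which of these libraries a binary pulls in dynamically is to inspect it with ldd. The sketch below assumes a Linux system with ldd available; the function name dynamic_runtime_libs is hypothetical, and the pattern simply mirrors the library names listed above. A library that does not appear in the output is either unused or linked statically.

```shell
# Hypothetical helper: list the runtime libraries named above that a binary
# loads dynamically. Returns non-zero if none are found or ldd fails.
# Only run ldd on binaries you trust, since it may execute code from them.
dynamic_runtime_libs() {
    ldd "$1" 2>/dev/null |
        grep -E 'libiomp5|libguide|libpthread|libc\.so|libstdc\+\+|libm\.so'
}
```

For example, dynamic_runtime_libs ./myapp (where myapp is a placeholder) prints one line per matching shared library.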

Statically linking the libraries and functions that "User mode sampling and tracing analysis" uses causes the following issues:

a) The static version of the OpenMP runtime library as supplied by an Intel Compiler does not contain the necessary instrumentation for Concurrency and Locks and Waits analysis.

b) Call-stack mode in "User mode sampling and tracing analysis" will not properly distinguish user code from system functions.

c) "User mode sampling and tracing analysis" (Hotspots, Concurrency, and Locks and Waits) will be unable to run unless various C runtime functions are exported. There are multiple ways to do this; one is to use the -u option of the GCC compiler:

-u malloc
-u free
-u realloc
-u getenv
-u setenv
-u __errno_location

If your application creates POSIX threads (either explicitly or through the static OpenMP library or some other static library), there are additional functions that you need to export in the same way:

-u pthread_key_create
-u pthread_key_delete
-u pthread_setspecific
-u pthread_getspecific
-u pthread_spin_init
-u pthread_spin_destroy
-u pthread_spin_lock
-u pthread_spin_trylock
-u pthread_spin_unlock
-u pthread_mutex_init
-u pthread_mutex_destroy
-u pthread_mutex_trylock
-u pthread_mutex_lock
-u pthread_mutex_unlock
-u pthread_cond_init
-u pthread_cond_destroy
-u pthread_cond_signal
-u pthread_cond_wait
-u _pthread_cleanup_push
-u _pthread_cleanup_pop
-u pthread_setcancelstate
-u pthread_self
-u pthread_yield

The easiest way to do this is by creating a file with the above options and passing it to gcc or ld.

Example:

gcc -static mysource.cpp @Cdefs @Pdefs

Here, Cdefs is a file with the options for the C functions listed above, and Pdefs is a file with the options for the POSIX functions listed above.
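A minimal sketch of how the two option files could be generated, assuming a POSIX shell. The helper name write_defs is hypothetical; the file names Cdefs and Pdefs and the symbol lists are taken from the text above.

```shell
# Hypothetical helper: write one "-u symbol" option per line to a file.
write_defs() {
    out=$1
    shift
    for sym in "$@"; do
        printf '%s\n' "-u $sym"
    done > "$out"
}

# C runtime functions listed above
write_defs Cdefs malloc free realloc getenv setenv __errno_location

# POSIX thread functions listed above
write_defs Pdefs \
    pthread_key_create pthread_key_delete \
    pthread_setspecific pthread_getspecific \
    pthread_spin_init pthread_spin_destroy pthread_spin_lock \
    pthread_spin_trylock pthread_spin_unlock \
    pthread_mutex_init pthread_mutex_destroy pthread_mutex_trylock \
    pthread_mutex_lock pthread_mutex_unlock \
    pthread_cond_init pthread_cond_destroy \
    pthread_cond_signal pthread_cond_wait \
    _pthread_cleanup_push _pthread_cleanup_pop \
    pthread_setcancelstate pthread_self pthread_yield

# The files can then be passed to the compiler as in the example above:
# gcc -static mysource.cpp @Cdefs @Pdefs
```

Keeping the symbols in files rather than on the command line keeps the build line short and makes the lists easy to reuse across projects.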

More Information:

This article addresses the switches that developers are most likely to have concerns about. Most switches will work with Intel VTune Amplifier XE for Linux, but not every switch or switch combination is tested (there are a lot of switches!). If you have information regarding other switches, please add a comment to this article. If you have a question regarding a particular switch, please submit an issue to the Intel VTune Amplifier XE forum.

Versions:
Intel® VTune™ Amplifier XE 2011 for Linux*
Intel® C++ and Fortran Compiler for Linux 11.x, 12.x
GNU C/C++ Compiler 3.4.6

For more complete information about compiler optimizations, see the Optimization Notice.