The Intel® Performance Tuning Utility (Intel® PTU) is a cross-platform performance analysis tool set. Along with traditional features such as identifying the hottest modules and functions of an application, tracking call sequences, and identifying performance-critical source code, Intel PTU offers new, more powerful capabilities for data collection, analysis, and visualization. For experienced users, Intel PTU exposes the processor hardware event counters for in-depth analysis of memory system performance, architectural tuning, and more. It associates performance issues with the source code. If you do not have symbol sources for an analyzed application, Intel PTU represents data with basic block granularity and provides a graph of the function execution flow (control flow graph) to navigate the disassembly. The Intel Performance Tuning Utility is available for both Windows* and Linux* operating systems.
The Intel Performance Tuning Utility offers:
- Statistical Call Graph
  - Profiles with low overhead to detect where time is spent in your application
- Event Based Sampling
  - Uses the processor’s onboard performance monitoring hardware to get a detailed look into performance issues
- Basic Block Analysis
  - Displays hotspots with basic block granularity and generates a control flow graph for advanced analysis of application, even without the source code
- Events over IP graph
  - Generates a histogram of performance events distributed over application code
- Loop Analysis
  - Identifies loops and recursion in your application to aid optimization
- Result difference
  - Compares the results of multiple runs to measure changes in performance
- Data Access Profiling
  - Identifies memory hotspots and relates them to code hotspots
- Heap Profiler
  - Identifies dynamic memory usage by application. Can help identify memory leaks
Update 5 of Intel® Performance Tuning Utility 4 introduces the following new features and enhancements:
- Intel® Microarchitecture Code Name Sandy Bridge support, including AVX instruction support in Data Profiling
- Updated events for Intel® Atom™ processor
- New predefined profile configurations
- More flexibility in Result Difference view (module aliasing)
- Enhanced Project and Hotspot analysis configuration
- Start collection paused, resume automatically after a given time delay
- Enable and control trigger-based event multiplexing
- Enhanced Ratio definition formulas: use ratios as operands, Min/Max operations
- Ability to integrate with Intel® Performance Bottleneck Analyzer
- Several bug fixes in collection and analysis
- This version of Intel Performance Tuning Utility is not backward compatible with previous versions: it may be unable to open or view data they collected. In some cases, re-converting the database helps.
- You can install Intel PTU as a regular user if you only need the Statistical Call Graph, Exact Call Graph, and Heap Profiling collections, or only want to view collection results. To enable the Sampling and Data Access Profiling collections, you must be a system administrator. Once Intel PTU is properly installed (see INSTALL.txt for details), regular users can use all the features. Enabling Statistical Call Graph collection on a Windows host requires system administrator privileges because it is driver-based.
- Enabling the Sampling and Data Access Profiling collections requires driver installation and may affect the Sampling collector in Intel VTune™ Performance Analyzer on Linux Itanium®-based systems. On these systems it is not guaranteed that this version of Intel PTU and the VTune analyzer can share the same sampling driver. It is also not guaranteed that the VTune analyzer can properly read sampling data files (*.tb5) generated by this version of Intel PTU. If you experience problems with sampling collection or with viewing its results in the VTune analyzer or Intel PTU, make sure each product uses the driver it is shipped with. See INSTALL.txt to learn how to run the proper driver.
- Many capabilities of the Intel Performance Tuning Utility are at a "prototype" level of maturity and are expected to develop continuously in further updates and releases. Any feedback is appreciated.
|Intel® Xeon® processor||+|
|Intel® Xeon® DP processor||+||+|
|Intel® Xeon® MP processor||+||+|
|Intel® Pentium® M processor||+|
|Intel® Core™ Duo processor||+|
|Intel® Core™ 2 Duo processor||+||+|
|45nm Intel® Core™ 2 Duo processor||+||+|
|Intel® Xeon® processor 3xxx, 5xxx, 7xxx series||+||+|
|Intel® Itanium® 2 processor||+|
|Intel® Itanium® 2 processor series 9000||+|
|Intel® Itanium® 2 processor series 9300||+|
|Intel® Core™ i7, i5, i3 processors||+||+|
|32 nm Intel® Core™ i7 processor||+||+|
|32 nm Intel® Xeon® processor||+||+|
|Intel® Microarchitecture Code Name Sandy Bridge processor (Intel® Core™ Processor 2xxx Series)||+||+|
To view the full list of currently supported processors, enter:
|Command line collector and viewer||> 500 MB||> 500 MB|
|Loop profiling enabled||> 1GB||> 1GB|
|Graphical User Interface||> 1 GB||> 1 GB|
Disk Space Requirements
|Total (archive file, its extracted files, and all installed components)||300-400 MB|
Operating System Requirements
|Microsoft* Windows* XP Professional Service Pack 2, 3||+|
|Microsoft* Windows* XP Professional x64 Edition Service Pack 2||+|
|Microsoft* Windows* Server 2003 Enterprise Edition Service Pack 1||+|
|Microsoft* Windows* Server 2003 Enterprise x64 Edition Service Pack 2||+|
|Microsoft* Windows* Server 2008||+|
|Microsoft* Windows* Vista* (Ultimate, Enterprise)||+|
|Microsoft* Windows* Vista* Service Pack 1||+|
|Microsoft* Windows* 7 (Ultimate)||+|
|Microsoft* Windows* 7 (Ultimate) Service Pack 1||+||+|
|Red Hat* Fedora* 9 (kernel 2.6.25-14.fc9)||+|
|Red Hat* Fedora* 10 (kernel 2.6.27.5-117.fc10)||+||+|
|Red Hat* Fedora* 11||+||+|
|Red Hat* Fedora* 12||+||+|
|Red Flag* Linux* 5.0 DC Server (kernel 2.6.9-11)||+|
|Red Hat* Enterprise Linux* Advanced Server 3.0 Update 6 (kernel 2.4.21-37)||+|
|Red Hat* Enterprise Linux* Advanced Server 4.0 Update 5 (kernel 2.6.9)||+||+||+|
|Red Hat* Enterprise Linux* Advanced Server 5.3 (2.6.18-128.el5)||+||+||+|
|Red Hat* Enterprise Linux* Advanced Server 5.4||+||+|
|Red Hat* Enterprise Linux* Advanced Server 5.5||+||+|
|SUSE* Linux* Enterprise Server 9 Service Pack 3 (kernel 2.6.5)||+||+||+|
|SUSE* Linux* Enterprise Server 10 (kernel 2.6.16.21-0.8)||+|
|SUSE* Linux* Enterprise Server 10 Service Pack 2 (kernel 2.6.16.60-0.21)||+||+|
|SUSE* Linux* Enterprise Server 10 Service Pack 3||+||+|
|SUSE* Linux* Enterprise Server 11||+||+|
|Turbolinux* 10 (kernel 2.6.9-5.15)||+|
|Ubuntu* 8.10 (kernel 2.6.27-7-generic)||+||+|
The Intel Performance Tuning Utility works with ALL compilers that follow industry standard object code formats. It was tested on applications built with the following compilers:
- GCC* 3.2, 3.3, 3.4, 4.0
- Intel® C++ Compiler 10.1
- Intel® C++ Compiler 11.1
- Microsoft* Visual C++* 2005
- Microsoft* Visual C++* 2008
- Microsoft* Visual C++* 2010
Java Environment Requirements
- For IA-32 and Intel® 64 platforms: Eclipse* 3.4.2, EMF* 2.4.2, and GEF* 3.4.2
- For Itanium®-based platforms: Eclipse* 3.2, EMF* 2.2, and GEF* 3.2
See <eclipse_home>/readme/readme_eclipse.html (Running Eclipse chapter) for the list of JVMs supported by Eclipse. The Intel PTU package includes all the components listed above.
- Readme file provides product overview information, lists package content and technical support sites.
- Installation Guide describes the steps required to install Intel Performance Tuning Utility.
- Release Notes lists the systems tested for compatibility with Intel Performance Tuning Utility and describes known issues and product limitations.
- Command-line help provides short command reference and usage modes. To view the command-line help for the Intel Performance Tuning Utility commands, enter:
- User Guide provides full-scale product description including GUI and command-line reference and usage models.
- Reference Guide provides reference information about instructions, events, and penalties for the supported processors. To access the Reference Guide, go to the Eclipse* Help menu > Help Contents and select the Intel® Performance Tuning Utility book from the table of contents.
Related Products and Services
- The Intel® Academic Community provides training for developers on leading-edge software development technologies. Training consists of online and instructor-led courses covering all Intel architectures, platforms, tools, and technologies.
- The Intel® Compilers enable software to run at top speeds and fully support the latest Intel® processors. Compatible with other tools you use, the Intel compilers integrate into popular development environments and feature source and binary compatibility with other widely-used compilers.
- The Intel® High Performance Computing tools increase performance and optimization throughout your development lifecycle.
- The Intel® Math Kernel Library provides developers of scientific and engineering software with a set of linear algebra, fast Fourier transforms and vector math functions optimized for the latest Intel processors.
- The Intel® Integrated Performance Primitives consist of cross-platform tools to build high performance software for several Intel architectures and several operating systems.
Known Problems and Limitations
- To see the correct functions (their names and boundaries) in the Hotspot and Source views, you need a symbol file on Windows or full symbol information in the Linux executable. Do not pass the -s option to the ld command and do not run the strip command on the executable. For best results, use the -g compiler option on Linux and /Zi on Windows so that debug information is available.
- Intel Performance Tuning Utility does not work correctly for executables that have non-English characters in their names and sources (e.g. non-English comments). Do not use non-English characters in the path to Intel PTU.
- An application may hang when collection is started under MC (Midnight Commander) on Linux and the collector starts the application using the -- option. This happens because the application conflicts with MC over console (tty) usage. Make sure you run collection outside MC.
- You cannot invoke the Source View on the Linux system for the experiment collected on the Windows system.
- Note that this version of the Intel Performance Tuning Utility applies search directories for symbol files (both predefined and user-defined) only when drilling down to Source View. To use symbol files for sampling, stack sampling, or data profiling views, place the symbol and binary files in the same directory and use the Re-convert command in the GUI or the --re-convert option on the command line.
GUI Problems and Limitations
- Depending on the OS, the Eclipse GUI may not work because of a crash in the JRE. This was observed on Fedora 12 systems. It is recommended to run Eclipse as follows:
- Depending on the GTK+ version installed on the system, the Eclipse GUI may not work because dialog controls are not functional. This was observed on Fedora 12, Ubuntu 9.10, and Ubuntu 10.04 systems. It is recommended to run Eclipse as follows:
- Intel PTU normally works with collected data up to 500 MB in size (database size). Larger data sets can slow down the GUI response dramatically. Use the -i option to regulate the size (and level of detail) of Statistical Call Graph collection results.
- Statistical Call Graph results do not contain the sampled event and Sample After Value (SAV) used as collection configuration parameters. Pay attention to this information while setting up Profiler Configuration or refer to the used Profile Configuration.
- Occasionally a message about insufficient JVM memory is displayed. This happens because the Java* Virtual Machine allocates a fixed amount of memory when Eclipse* starts. You can increase the value using the -vmargs -Xmx512M option when starting Eclipse. In this example, 512 MB are allocated.
- For Itanium-based systems, the Control Flow Graph (part of the Source View) does not show branches and cycles.
- Linux Intel 64 version of Intel PTU GUI may fail to start on some operating systems (e.g. on SLES* 10). Replace the eclipse directory under the Intel 64 version of Intel PTU with the eclipse directory from the IA-32 version of Intel PTU.
- Statistical Call Graph GUI view may work incorrectly for the stacks with more than 1000 items. Such stacks are usually generated in case of recursive calls.
Statistical Call graph collection Problems and Limitations
- Intel Performance Tuning Utility supports only time-based Statistical Call Graph profiling on Linux and measures processor usage only. For example, if your application does a lot of I/O operations, this is not visible in the results. Timer interval is limited by OS timer granularity and cannot be lower than configured in OS.
- Intel Performance Tuning Utility Statistical Call Graph cannot profile statically linked executables.
- The Loop profiler analyzes modules compiled for the native architecture only. For example, if modules are compiled for IA-32 architecture it is not possible to detect loops on Intel 64 architecture.
- On Windows, you can collect Statistical Call Graph data for only one hardware event.
- Intel Performance Tuning Utility Statistical Call Graph (SCG) on the Linux platform depends on unwinding information (the unwind table) encoded within the executable, or on the use of frame pointers. By default, GCC and the Intel C compiler do not generate an unwind table for C programs. However, they use EBP as a frame pointer in each function, which is enough for stack unwinding. When optimization options are used, compilers prefer to generate ESP-based frames, and unwinding information is then necessary to walk the stack correctly. Unwinding information is located in the .debug_frame and .eh_frame sections; the SCG stack walking algorithm uses the unwind information from the .eh_frame section. The .eh_frame_hdr section is a segment section available through the program header table; it helps to locate the .eh_frame section in the address space of the loaded program. Without the .eh_frame_hdr section, the SCG stack walking algorithm cannot locate .eh_frame, because it does not parse the raw binary on disk.
There are known issues with stack unwinding on Linux platforms on IA-32 and Intel 64 architectures, caused by incomplete unwind information:
- The program is compiled with optimization options, for example -fomit-frame-pointer on GCC or -O2 on the Intel C Compiler. As a result, the compiler can generate ESP-based frames, and if unwind information is incomplete or absent, stack unwinding does not work.
- Some versions of GCC do not generate the .eh_frame_hdr section on IA-32 architecture even if the -fexceptions option is used.
- The compiler may generate invalid FDEs (Frame Description Entries) in the unwind table for some address ranges. As a result, Intel Performance Tuning Utility provides an incorrect return address when unwinding from such an address range.
- A function on the stack may be skipped. A sample may fall in the prolog of the hot function, right before the frame pointer is initialized. If unwind information for the given sample is absent and the previous frame is an EBP-based frame, the caller of the hotspot is skipped during stack unwinding.
- The caller of a hotspot from glibc (memset, memcpy) or the math library (sin, pow) may be skipped on Linux IA-32. glibc and the math library have incomplete unwinding information for many optimized functions, and those functions do not use frame pointers. As a result, the caller function is skipped when unwinding the stack, because the frame pointer points to the function above the caller of the hotspot.
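The skipped-caller case in the list above can be illustrated with a toy model of frame-pointer unwinding. This Python sketch is not the Intel PTU algorithm: the addresses, the return-address labels, and the walk_frame_pointers helper are made up for illustration, and it assumes the conventional IA-32 layout in which a frame's EBP points at the saved EBP and the return address sits one word above it.

```python
def walk_frame_pointers(memory, ebp, word=4):
    """Follow the saved-EBP chain and collect the return addresses found."""
    return_addresses = []
    while ebp:  # in this model a saved EBP of 0 marks the outermost frame
        return_addresses.append(memory[ebp + word])
        ebp = memory[ebp]
    return return_addresses


# Toy stack for the call chain main -> caller -> hot (hypothetical values).
memory = {
    0x1000: 0,                # main's frame: saved EBP (end of chain)
    0x1004: "ret_to_start",   # main's return address
    0x0F00: 0x1000,           # caller's frame: saved EBP -> main's frame
    0x0F04: "ret_to_main",    # caller's return address
    0x0E00: 0x0F00,           # hot's frame: saved EBP -> caller's frame
    0x0E04: "ret_to_caller",  # hot's return address
}

# Sample taken in the body of hot(): EBP already points at hot's own frame.
after_prologue = walk_frame_pointers(memory, ebp=0x0E00)

# Sample taken in the prolog of hot(): EBP still points at caller's frame,
# so the first return address found leads into main and caller is skipped.
in_prologue = walk_frame_pointers(memory, ebp=0x0F00)

print(after_prologue)
print(in_prologue)
```

In the second walk the return address into caller never appears, which is why the hot function looks as if it had been called directly from main.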
To generate unwind information, use the -fasynchronous-unwind-tables option for GCC and the -fexceptions option for the Intel C compiler. To make sure that your executable (and shared libraries) have this information, use the objdump -h <binary> command. You should see the .eh_frame_hdr section there. For C++ programs, exception handling tables are generated by default; however, if you switched off exception handling by using the -fno-exceptions option, you need to force generation of exception handling tables or frame pointers. To do this in GCC use -fp options, in ICC you may use only the
If this does not help, reduce the optimization level (if possible).
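The objdump -h check can be scripted when many binaries and shared libraries need to be verified. The following is a minimal Python sketch, not part of Intel PTU. The function names are hypothetical, and it assumes the usual objdump -h table layout, in which each section row starts with a numeric index followed by the section name.

```python
import subprocess


def section_names(objdump_output):
    """Collect section names from the text printed by `objdump -h`."""
    names = set()
    for line in objdump_output.splitlines():
        fields = line.split()
        # Section rows look like: "<index> <name> <size> <vma> ..."
        if len(fields) >= 3 and fields[0].isdigit():
            names.add(fields[1])
    return names


def has_eh_frame_hdr(binary_path):
    """Run objdump on a binary and report whether .eh_frame_hdr is present."""
    result = subprocess.run(
        ["objdump", "-h", binary_path],
        capture_output=True, text=True, check=True,
    )
    return ".eh_frame_hdr" in section_names(result.stdout)
```

Calling has_eh_frame_hdr on the executable and on each shared library it loads gives a quick yes/no answer before starting a collection; it requires objdump to be on the PATH.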
- Intel Performance Tuning Utility Statistical Call Graph on the Windows platform depends on FPO (Frame Pointer Omission) data located within the PDB file, or on the use of frame pointers. Stack unwinding can be improved if a PDB file exists. Use the symchk utility (part of the Debugging Tools for Windows package) to load the PDB file for a system binary and add the location of the PDB file in
Example of loading PDB file for kernel32.dll:
symchk /s srv*c:\symbols*http://msdl.microsoft.com/download/symbols C:\winnt\system32\kernel32.dll
If this does not help, reduce the optimization level (if possible).
- The displayed number of samples for functions in Statistical Call graph results may be incorrect in some cases. The known problem is that in the Caller/Callee view self and total samples for recursive functions could be incorrect.
- The Statistical Call Graph feature might not work on Linux systems starting with kernel 2.6.12 (all modern systems are affected).
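The first limitation in this section notes that time-based profiling on Linux measures processor usage only, so time spent waiting on I/O does not show up. The general distinction between processor time and elapsed time can be seen in a few lines of Python. This is only an illustration of the concept, not of how Intel PTU samples: the measure helper is hypothetical, and time.sleep stands in for a blocking I/O call.

```python
import time


def measure(work):
    """Return (elapsed wall-clock seconds, processor seconds) for work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def io_like_wait():
    # Stand-in for blocking I/O: the process waits without using the CPU.
    time.sleep(0.2)


wall, cpu = measure(io_like_wait)
print(f"wall-clock: {wall:.3f} s, processor: {cpu:.3f} s")
```

The wall-clock figure covers the whole wait, while the processor figure stays close to zero; a profiler that attributes only processor time would report almost nothing for this function.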
Sampling collection problems and limitations
- When launching Intel PTU on Red Hat* operating systems, you may receive error messages like the following: cannot restore segment prot after reloc: Permission denied. This might happen due to the conflict with the SELinux security feature. Intel PTU supports SELinux only if the current policy is Targeted. To avoid conflicts, disable SELinux as follows:
  - in the local /etc/sysconfig/selinux file, set SELINUX=disabled
  - in the local grub.conf files, add the selinux=0 kernel argument
- On some Intel® Core™ i7 processor-based systems with C-states enabled, sampling may cause the system to hang due to a known hardware issue (see errata AAJ134 in http://download.intel.com/design/processor/specupdt/320836.pdf). To avoid this, disable C-states in the BIOS (e.g., via the Cn(ACPI Cn) report to OS option) before sampling with Intel PTU on Intel® Core™ i7 processor-based systems [#DPD200149603].
- Opening the event configuration dialog for sampling collection may take a long time, especially for the Intel Itanium 2 processor. This happens because of the large number of events and the duplication of event modifiers for each event in the XML file passed from the command line to the GUI.
- Sampling may not work on the Intel Itanium 2 processor with old Linux kernels (for example, TurboLinux10). You may request updates from the Linux distribution vendor.
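For the SELinux item at the top of this list, the current settings can be inspected before launching a collection. The sketch below is a hedged Python example, not an Intel PTU tool. It assumes the common KEY=value layout of /etc/sysconfig/selinux with the SELINUX and SELINUXTYPE keys, and the helper names are hypothetical. The rule it applies (a conflict is possible unless SELinux is disabled or the policy is targeted) is a simplified reading of the note above.

```python
def parse_selinux_config(text):
    """Parse KEY=value lines, ignoring comments and blank lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def selinux_may_conflict(settings):
    """Heuristic: flag a possible conflict unless disabled or targeted."""
    mode = settings.get("SELINUX", "").lower()
    policy = settings.get("SELINUXTYPE", "").lower()
    return mode != "disabled" and policy != "targeted"


def load_selinux_config(path="/etc/sysconfig/selinux"):
    """Read and parse the configuration file (path is the one named above)."""
    with open(path, encoding="utf-8") as handle:
        return parse_selinux_config(handle.read())
```

A call such as selinux_may_conflict(load_selinux_config()) returns True when the configuration is worth reviewing before sampling.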
Heap Profiler, exact Call Graph, Call Count problems and limitations
- Call Graph/Call Count and Heap profiler analyze modules compiled for the native architecture only. For example, if modules are compiled for IA-32 architecture it is not possible to detect loops on Intel 64 architecture.
- No Java-based application profiling.
- Call Graph/Call Count and Heap profiler cannot profile self-modifying code.
- Call Graph, Call Count and Heap Profiler may not work on applications which contain SSE4 instructions.
- On Windows systems, you cannot profile multi-process applications using the Heap Profiler, exact Call Graph, or Call Count tools.
- Heap Profiler does not stop collection after clicking the stop button under Eclipse*. To finish the data collection close the application under profiling.
- Heap Profiler does not produce data if the application under profiling is terminated with CTRL+C.
- On Linux systems running on IA-32 architecture, Heap Profiler may show an incomplete stack for application memory allocations/deallocations if it is launched with the --exact=no option on the application compiled with the
- On Windows you cannot run Call Graph/Call Count and Heap Profiler analysis on systems protected by the McAfee Host Intrusion Prevention* antivirus software. Make sure you disable this software first.
- The trace mode of Heap Profiler can generate gigabytes of results. Make sure you have enough disk space.
- 'Trace Children' mode in the Heap Profiler on Linux handles only situations when executable is called after fork. Profiling of application calling fork without executable or an executable without fork will not produce results.
- The non-exact (or fast) mode of Heap Profiler is available on Linux operating systems on IA-32 architecture only.
- The non-exact mode of the Heap Profiler depends on exception handling information encoded within the executable. To ensure your executable (and shared libraries) have this information, use the objdump -h <binary> command. You should see .eh_frame_hdr there. Several GCC compilation options affect the presence and content of this section. Possible solutions are:
  - The -fno-exceptions GCC option turns off generation of exception-related code and exception handling tables. If you use this option, enable generation of unwind tables using the
  - GCC has a bug that shows up when the -fomit-frame-pointer switch is used. For some reason, GCC removes .eh_frame_hdr. To work around this bug you will need to use the
- No Red Hat Fedora 10, Ubuntu 8.10 (and newer Linux systems) and Microsoft Windows 7 support.
Data Access Profiler problems and limitations
- There is a list of systems below where the data profiling is possible:
Processor Windows Linux IA-32
Intel® Core™ 2 Duo processor + + - +
45nm Intel® Core™ 2 Duo processor + + - +
Intel® Core™ i7, i5, i3 processors + + - +
Intel® Itanium® 2 processor series 9000 + +
- On systems with the Intel® Core™ 2 Duo processor, some memory load instructions may use the same register as source and destination, for example mov [rax], ax. If samples fall on such instructions, they are ignored by the data profiling view because it is impossible to calculate the data address of the load in this case.
Feedback and Technical Support
Your feedback is very important to us. To report an issue and receive a technical answer for the tools provided in this product, visit the web site where you got the package. You can learn about the discussion forum possibilities from that web page. We do not provide technical support for the tools inside this product.
Diagnostic and Logging
While running, the Intel Performance Tuning Utility logs the experiment workflow. Log files are created in the current user's temporary data directory. For example, to reach the log location, type cd %TEMP%/ptu-log-%USERNAME% or type %TEMP%/ptu-log-%USERNAME% in the Explorer address bar and press Enter.
The ptu-log-<username> directory contains the history of all executed commands in the file history.txt, and folders with command processing details. To provide the response team with information about a problem, archive the experiment and ptu-log-<username> directories and send them along with the problem report.
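Locating and archiving the log directory can be automated. The following Python sketch is an assumption-laden example rather than a supported tool: the helper names are hypothetical, it assumes that Python's tempfile.gettempdir() resolves to the same temporary directory Intel PTU uses (on Windows it consults TEMP), and it assumes the login name matches the <username> part of the directory name.

```python
import getpass
import os
import shutil
import tempfile


def ptu_log_dir(temp_dir=None, user=None):
    """Build the expected ptu-log-<username> path in the temp directory."""
    temp_dir = temp_dir or tempfile.gettempdir()
    user = user or getpass.getuser()
    return os.path.join(temp_dir, f"ptu-log-{user}")


def archive_logs(log_dir, archive_base):
    """Zip the log directory; returns the path of the created archive."""
    # archive_base should be outside log_dir so the archive does not
    # end up inside the directory it is packing.
    return shutil.make_archive(archive_base, "zip", root_dir=log_dir)
```

For example, archive_logs(ptu_log_dir(), "ptu-report") would create ptu-report.zip in the current directory, ready to attach to a problem report together with the archived experiment.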
Intel PTU uses SWTChart 0.6.0 which is free open source software distributed under Eclipse Public License v1.0 (EPL). SWTChart is used in the binary form only, no source modifications were made. Its binaries and the complete source code can be downloaded from http://swtchart.org.