What should I tune first? - Quickly locate code taking a lot of time
Hotspots analysis gives you a sorted list of the functions using a lot of CPU time. This is where tuning will give you the biggest benefit. Click [+] for the call stacks. Double click to see the source.
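The idea of a sorted hotspot list can be illustrated with Python's built-in cProfile module. (Note that cProfile instruments every call rather than sampling, unlike VTune Amplifier's collectors; this is only a sketch of what a hotspot ranking looks like, with made-up `hot` and `cold` functions.)

```python
import cProfile
import io
import pstats

def hot():
    # Deliberately expensive: this loop should dominate CPU time.
    total = 0
    for i in range(500_000):
        total += i * i
    return total

def cold():
    # Cheap by comparison.
    return sum(range(100))

profiler = cProfile.Profile()
profiler.enable()
hot()
cold()
profiler.disable()

# Sort functions by total time, like a hotspots view.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("tottime").print_stats(5)
report = out.getvalue()
print(report)
```

In the printed report, `hot` lands at the top of the list because its loop consumes nearly all the CPU time, which is exactly the signal a hotspots view surfaces.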
Analyze Results Faster: See the Profiling Data on your Source
A double click from the function list takes you to the hottest spot in the function.
Profile Python* and Mixed Python, C++, C, and Fortran
Profile pure Python or Python with native code extensions. Get accurate data from low-overhead sampling techniques that don’t slow your code like the heavy instrumentation in many Python profilers. Get source line detail including call stacks. Using native extensions to improve performance? Unlike Python-only profilers, you can profile and tune native C, C++, and Fortran, too.
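The low-overhead sampling idea can be sketched in pure Python: a background thread periodically inspects the main thread's current stack frame and counts which function it lands in. (This is a toy illustration of statistical sampling, not VTune Amplifier's collector; `sys._current_frames` is a CPython implementation detail, and `busy_work` is an invented workload.)

```python
import collections
import sys
import threading
import time

def sample(main_ident, counts, stop, interval=0.001):
    # Periodically record which function the main thread is executing.
    while not stop.is_set():
        frame = sys._current_frames().get(main_ident)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)

def busy_work(seconds):
    # Spin for a fixed wall-clock time so the sampler observes this function.
    end = time.perf_counter() + seconds
    x = 0
    while time.perf_counter() < end:
        x += 1
    return x

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample, args=(threading.get_ident(), counts, stop))
sampler.start()
busy_work(0.3)
stop.set()
sampler.join()

# The sampled profile ranks functions by how often they were observed.
print(counts.most_common(3))
```

Because the profiled code is never modified, overhead stays proportional to the sampling rate rather than to how many function calls the program makes, which is why sampling scales better than instrumentation on call-heavy code.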
Quickly See Three Keys to Performance on Modern Processors
Use the high-performance computing (HPC) analysis to get a fast overview of three critical metrics for modern hardware performance: CPU utilization (for both thread and MPI parallelism), memory access, and FPU utilization (FLOPS). Then drill down with more in-depth analysis on each one.
Threaded Performance Is Critical in Today’s Multicore World
Intel® VTune™ Amplifier has a built-in understanding of parallel programming models, including Intel® Threading Building Blocks and OpenMP* 4.0, which makes it easy to see and understand multithreading concepts such as task begin and end, synchronization, and wait time. Locks and waits analysis (first image below) is one example of how this is useful. Visualization on the timeline (second image below) lets you easily see lock contention (lots of yellow transitions), load imbalance, and inadvertent serialization, all common causes of poor parallel performance.
Quickly Find Common Causes of Slow Threaded Code with Locks and Waits Analysis
Waiting too long on a lock while the cores are underutilized during the wait is a common cause of slow performance in parallel programs. Profiles like basic hotspots and locks and waits use a software collector that works on both Intel® processors and compatible processors.
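The "waiting on a lock while a core sits idle" pattern can be demonstrated with a toy Python example that measures each thread's wait time at a shared lock. (This only illustrates the behavior a locks and waits analysis reports; the thread names and hold times are invented for the demo.)

```python
import threading
import time

lock = threading.Lock()
wait_times = {}

def worker(name, hold_seconds):
    # Measure how long this thread waits to acquire the shared lock.
    start = time.perf_counter()
    with lock:
        wait_times[name] = time.perf_counter() - start
        # Simulate work done while holding the lock.
        time.sleep(hold_seconds)

# The first thread grabs the lock and holds it; the second must wait.
t1 = threading.Thread(target=worker, args=("first", 0.1))
t1.start()
time.sleep(0.02)  # give t1 time to acquire the lock first
t2 = threading.Thread(target=worker, args=("second", 0.0))
t2.start()
t1.join()
t2.join()

print(wait_times)
```

The second thread's wait time is roughly the first thread's hold time: a core that could have been computing spends that interval blocked, which is precisely the kind of underutilization the analysis highlights.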
Find the Answer Faster: Mine the Data with Timeline Filtering
Select a time range in the timeline to filter out data (e.g., application startup) that masks the information you need. When you select and filter in the timeline, the grid that lists functions using a lot of CPU time updates to show the list filtered for the selected time. Yellow lines above show transitions. A high density of transitions may indicate lock contention and poor parallel performance. Turn off CPU time marking to diagnose issues with spin locks. And see just when threads are running or waiting and quickly spot inadvertent serialization.
Easy Profiling of Remote Systems: License Only Required on Host, Not Target
You can easily collect data on your current host or a remote system. Or collect data using the command line on the remote system and import the data for analysis locally. Collectors not installed on the remote Linux system? No problem, Intel VTune Amplifier can do that for you.
For the best performance, avoid Virtual Network Computing's (VNC) slow graphics. Run the UI locally on Windows*, Linux*, or macOS* and import data from the remote target. No license is required for collecting data, which makes for a simple, lightweight install on remote Linux* or Windows* systems. A license is required to view or analyze the data collected.
Tune Drivers: Get High Resolution with Low Overhead
Intel® processors have an on-chip performance monitoring unit (PMU). In addition to basic hotspots analysis that works on both Intel and compatible processors, Intel VTune Amplifier has an "advanced hotspots" analysis that uses the PMU to collect data with very low overhead. System-wide analysis lets you analyze drivers. Increased resolution (~1 ms versus ~10 ms) can find hotspots in small functions that run quickly.
Bandwidth and Memory Analysis Made Easy
Use the Memory Access analysis to identify memory-related issues, like:
- Bandwidth-limited accesses. Quickly see a timeline of DRAM and Intel QPI bandwidth for your program. The consumers of memory bandwidth will generally vary as your program runs. By viewing the bandwidth in a graph, you can see where in your application memory usage spikes. Filter by selecting the area in the timeline where the spike occurs and see only the code that was active at that time. This lets you isolate the individual contributors to bandwidth consumption and tune effectively.
- Identify the code source and memory objects that are using bandwidth. As a general rule, a structure of arrays is more cache friendly than an array of structures. But it all depends upon how your program is accessing the data. Quickly identify data structures that can be reorganized to consume less bandwidth.
For Linux targets, Memory Access analysis can be configured to attribute performance events to memory objects (data structures). You can see the parts of your code that are contributing to memory issues. Sorting results by average latency helps to prioritize your tuning efforts for maximum impact.
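The array-of-structures versus structure-of-arrays distinction above can be sketched in Python. (In native code the payoff is cache behavior: the SoA layout stores each field contiguously, so a loop over one field touches far fewer cache lines; the point data here is invented for illustration.)

```python
from array import array

N = 10_000

# Array of structures: each element bundles all fields together.
aos = [(float(i), float(i) * 2.0) for i in range(N)]  # (x, y) pairs

# Structure of arrays: one contiguous, homogeneous buffer per field.
soa_x = array("d", (float(i) for i in range(N)))
soa_y = array("d", (float(i) * 2.0 for i in range(N)))

def sum_x_aos(points):
    # Must step through every (x, y) pair just to read x.
    return sum(p[0] for p in points)

def sum_x_soa(xs):
    # Reads only the x field, stored contiguously.
    return sum(xs)

# Both layouts hold the same data; only the memory organization differs.
assert sum_x_aos(aos) == sum_x_soa(soa_x)
```

Which layout wins still depends on the access pattern: if every field of a record is used together, the AoS form can be the cache-friendly one, which is why measuring with Memory Access analysis matters before reorganizing.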
Opportunities Highlighted For Faster, Easier Analysis
Cells are highlighted in pink when there is a potential tuning opportunity. Hover over a highlighted cell for tuning suggestions.
Easier, More Effective OpenMP* and MPI Multirank Tuning
The summary report quickly gets you the top four answers you need to effectively improve OpenMP* performance. Additional details for each region are available by clicking the links.
Quickly See How to Improve OpenMP* Performance
Detailed data for each OpenMP* region highlights tuning opportunities.
Easier Multi-Rank Analysis of MPI and OpenMP*
Intel VTune Amplifier’s summary view is enriched with a table of the top MPI ranks that will benefit from improved OpenMP performance.
For hybrid MPI and OpenMP applications, it is important to explore OpenMP inefficiency along with MPI communication between ranks. The lower the communication spin time, the more the rank was executing (versus spinning) and the more impact OpenMP tuning will have on the application elapsed time. Use Intel® Trace Analyzer and Collector to tune MPI and select ranks with low communication spin times for further analysis in Intel VTune Amplifier. Intel VTune Amplifier can be installed on a cluster.
Storage Device Analysis for Hard Disk Drives (HDD), Serial ATA (SATA), or Non-Volatile Memory Express Solid-State Drives (NVMe SSD)
Are You I/O Bound or CPU Bound? Explore imbalance between I/O operations (async and sync) and compute. See when the CPU is waiting for I/O and see storage accesses mapped to the source code.
Easier OpenCL™ and GPU Profiling
When tuning OpenCL on newer processors, the GPU Architecture Diagram makes it easier to understand GPU hardware metrics.
Analyze GPU and Platform Data
On newer Intel processors, optionally collect GPU and platform data for tuning OpenCL and media applications. Correlate GPU and CPU activities. New: simplified hotspot analysis setup and detection of the OpenCL 2.0 shared virtual memory (SVM) usage type.
No Special Compilers: Use your Regular Build
Use a production build with symbols from your normal compiler. Low collection overhead means accurate results you can count on.
Automate Using the Command Line
Use the included command line interface to automate regression analysis. It also permits a lightweight install on remote systems for simple remote collection.
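A regression run can be scripted around the command-line collector. The sketch below assembles an `amplxe-cl` invocation (the classic VTune Amplifier CLI name); the exact analysis names and flags vary by version, so treat this wrapper and its helper names as assumptions for illustration.

```python
import shutil
import subprocess

def build_collect_cmd(analysis, result_dir, app_cmd):
    # Assemble an amplxe-cl collection command without running it,
    # e.g. amplxe-cl -collect hotspots -result-dir r001 -- ./myapp
    return (["amplxe-cl", "-collect", analysis,
             "-result-dir", result_dir, "--"] + list(app_cmd))

def run_collect(analysis, result_dir, app_cmd):
    # Run the collection, failing clearly if the tool is not installed.
    cmd = build_collect_cmd(analysis, result_dir, app_cmd)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("amplxe-cl not found on PATH")
    return subprocess.run(cmd, check=True)

print(build_collect_cmd("hotspots", "r001", ["./myapp"]))
```

Dropping a call like `run_collect("hotspots", "nightly_r001", ["./myapp"])` into a nightly script gives a result directory per run that can then be reported on or imported into the GUI for analysis.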
Tune drivers, kernel modules, and multiprocess apps.
Auto Detect DirectX* Frames
Got a slow spot in your Windows* game play? You don't just want to know where you are spending a lot of time; you want to know where you are spending a lot of time while the frame rate is slow. Intel VTune Amplifier can automatically detect DirectX* frames and filter results to show you what is happening in slow frames. Not using DirectX? Just define the critical region using the API, and frame analysis becomes a powerful tool for analyzing latency.
Low Overhead Java* Profiling
Analyze Java or mixed Java and native code. Results are mapped to the original Java source. Unlike some Java profilers that instrument the code, Intel VTune Amplifier uses low overhead statistical sampling with either a hardware or software collector. Hardware collection has extremely low overhead because it uses the on-chip performance monitoring hardware.
Analyze User Tasks
The task annotation API is used to annotate your source so Intel VTune Amplifier can display which tasks are executing. For example, if you label the stages of your pipeline, they will be marked in the timeline and hovering will reveal details. This makes profiling data much easier to understand.
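The effect of task annotation can be sketched with a generic context manager that records a begin/end span per named stage. (This is only an illustration of the begin/label/end pattern, not VTune Amplifier's actual API; the C-level ITT calls are `__itt_task_begin` and `__itt_task_end`, and the pipeline stages here are invented.)

```python
import time
from contextlib import contextmanager

TASK_LOG = []  # (name, duration) records, analogous to task spans on a timeline

@contextmanager
def task(name):
    # Mark the beginning and end of a named task, like a pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        TASK_LOG.append((name, time.perf_counter() - start))

# Label the stages of a toy pipeline.
with task("decode"):
    data = [i * 2 for i in range(1000)]
with task("transform"):
    data = [d + 1 for d in data]
with task("encode"):
    result = sum(data)

print(TASK_LOG)
```

With the stages labeled, a timeline view can attribute time to "decode" versus "encode" rather than to anonymous functions, which is what makes the profiling data easier to interpret.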
Tune for Intel® Xeon Phi™ Products
Hardware profiling is supported for Intel® Xeon Phi™ products and can be launched from the graphical user interface. It can collect advanced hotspots and advanced event data, and has time markers for correlating data across multiple cards.
New Features with Every Update
Visit What’s New? for a more complete list of the newest features. Check back occasionally, since we constantly add new features in product updates. One year of updates is included with your initial purchase or support renewal.