HPC Performance Characterization View
Use the HPC Performance Characterization viewpoint to estimate CPU usage, memory efficiency, and floating-point utilization for compute-intensive or throughput applications. Compute-intensive or throughput applications should use hardware resources efficiently for the duration of their elapsed time. Use the HPC Performance Characterization analysis as a starting point for optimizing application performance and runtime.
Follow these steps to interpret the performance data provided in the HPC Performance Characterization viewpoint:
Use the Analyzing an OpenMP* and MPI Application tutorial to review basic steps for tuning a hybrid application. The tutorial is available from the Intel Developer Zone at https://software.intel.com/en-us/itac-vtune-mpi-openmp-tutorial-lin. You can also find a webinar about HPC Performance Characterization analysis at https://software.intel.com/en-us/videos/hpc-applications-need-high-performance-analysis.
1. Define a Performance Baseline
Start by exploring the Summary window, which provides general information about your application execution. Key areas for optimization include the elapsed time and the floating-point operations per second counts (single precision, double precision, and legacy x87). Red text indicates an area of potential optimization. Hover over a flag to learn more about how to improve your code.
Use the Elapsed Time and GFLOPS values as a baseline for comparison of versions before and after optimization.
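The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration: the `RunSummary` type and all numbers are hypothetical, and the values are assumed to be copied by hand from the Summary window of two results.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Values read from the Summary window of one result (hypothetical type)."""
    elapsed_s: float  # Elapsed Time, in seconds
    gflops: float     # GFLOPS, as reported


def compare(baseline: RunSummary, optimized: RunSummary) -> dict:
    """Return the speedup and the relative GFLOPS change versus the baseline."""
    return {
        "speedup": baseline.elapsed_s / optimized.elapsed_s,
        "gflops_change_pct": 100.0 * (optimized.gflops - baseline.gflops) / baseline.gflops,
    }


# Hypothetical numbers, for illustration only.
baseline = RunSummary(elapsed_s=120.0, gflops=40.0)
optimized = RunSummary(elapsed_s=80.0, gflops=60.0)
result = compare(baseline, optimized)
print(f"speedup: {result['speedup']:.2f}x, GFLOPS change: {result['gflops_change_pct']:+.1f}%")
```

Keeping both values matters because a shorter elapsed time with unchanged GFLOPS usually points to removed work, while higher GFLOPS points to better use of the floating-point units.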

2. Determine Optimization Opportunities
Review the Summary window to find the key optimization opportunities for your application. Performance metrics that can be improved are marked in red. Issues identified could include Effective Physical Core Utilization, Memory Bound, Vectorization, or a combination of these. The following sections provide suggested next steps for each performance aspect:
CPU Utilization

- Explore the Effective Physical Core Utilization metric as a measure of the parallel efficiency of the application. A value of 100% means that the application code execution uses all available physical cores. If the value is less than 100%, it is worth looking at the second-level metrics to discover reasons for parallel inefficiency.
- Learn about opportunities to use the logical cores. In some cases, using logical cores leads to application concurrency increases and overall performance improvements.
- For some Intel® processors, such as Intel® Xeon Phi™ or Intel Atom®, or systems where Intel Hyper-Threading Technology (Intel HT Technology) is OFF or absent, the metric breakdown between physical and logical core utilization is not available. In these cases, a single Effective CPU Utilization metric is displayed to show parallel execution efficiency.
- For applications that do not use OpenMP or MPI runtime libraries:
  - Review the Effective CPU Utilization Histogram, which displays the Elapsed Time of your application, broken down by CPU utilization levels.
  - Use the data in the Bottom-up and Top-down Tree windows to identify the most time-consuming functions in your application by CPU utilization. Focus on the functions with the largest CPU time and a low CPU utilization level as your candidates for optimization (for example, parallelization).
- For applications with Intel OpenMP*:
  - Compare the serial time to the parallel region time. If the serial portion is significant, consider options to minimize serial execution, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines, serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be minimized as much as possible. Look at serial hotspots to define candidates for further parallelization.
  - Review the OpenMP Potential Gain to estimate the efficiency of OpenMP parallelization in the parallel part of the code. The Potential Gain metric estimates the difference in elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving OpenMP parallelism. If Potential Gain for a region is significant, you can go deeper and select the link on a region name to navigate to the Bottom-up window, which opens with the OpenMP Region dominant grouping and the region of interest selected.
  - Consider running the Threading analysis when multiple locks are used in one parallel construct, to find the performance impact of a particular lock.
- For MPI applications: Review the MPI Imbalance metric, which shows the CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks on the profiling node. Issue detection for this metric is based on the minimal MPI Busy Wait time across ranks. If the minimal MPI Busy Wait time is not significant, then the rank with the minimal time most likely lies on the critical path of application execution. In this case, review the CPU utilization metrics for this rank.
- For hybrid MPI + OpenMP applications: The sub-section MPI Rank on Critical Path shows OpenMP efficiency metrics such as Serial Time (outside of any OpenMP region), Parallel Region Time, and OpenMP Potential Gain. If the minimal MPI Busy Wait time is significant, it can be the result of a suboptimal communication schema between ranks or an imbalance triggered by another node. In this case, use Intel® Trace Analyzer and Collector for in-depth analysis of the communication schema.
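The CPU utilization guidance above relies on three small pieces of reasoning: the Amdahl's Law bound on scaling, the idea behind Potential Gain, and the choice of the rank with the minimal MPI Busy Wait time. The Python sketch below illustrates each one. It is not the formula VTune Profiler uses internally: the function names are hypothetical, the inputs are made-up numbers, and the idealized region time is assumed to be the total useful CPU time divided evenly among threads.

```python
from __future__ import annotations


def amdahl_speedup(serial_fraction: float, n_threads: int) -> float:
    """Upper bound on speedup when only the parallel part scales (Amdahl's Law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


def potential_gain(region_elapsed_s: float, useful_cpu_s: list[float]) -> float:
    """Measured region time minus an idealized region time.

    Assumption: the idealized time is the total useful CPU time spread
    evenly over the threads, with zero runtime overhead.
    """
    ideal_elapsed_s = sum(useful_cpu_s) / len(useful_cpu_s)
    return region_elapsed_s - ideal_elapsed_s


def rank_with_min_busy_wait(busy_wait_s: dict[int, float]) -> int:
    """Rank with the least MPI Busy Wait time, the likely critical-path rank."""
    return min(busy_wait_s, key=busy_wait_s.get)


# Hypothetical inputs, for illustration only.
speedup_bound = amdahl_speedup(serial_fraction=0.05, n_threads=64)
gain_s = potential_gain(region_elapsed_s=10.0, useful_cpu_s=[9.0, 6.0, 7.0, 6.0])
critical_rank = rank_with_min_busy_wait({0: 4.0, 1: 0.2, 2: 3.5})

print(f"Amdahl bound with 5% serial time on 64 threads: {speedup_bound:.1f}x")
print(f"Potential gain for the region: {gain_s:.1f} s")
print(f"Rank to inspect first: {critical_rank}")
```

With these inputs, a 5% serial fraction caps the speedup on 64 threads at well under a quarter of the thread count, which is why the serial time comparison comes first in the list above.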
GPU Utilization
GPU utilization metrics display when:
- Your application makes use of a GPU.
- Your system is configured to collect GPU data. See Set Up System for GPU Analysis.
Under Elapsed Time, the GPU section presents an overview of how your application offloads work to the GPU.

- The Time metric indicates if the GPU was idle at any point during data collection. A value of 100% implies that your application offloaded work to the GPU throughout the duration of data collection. Anything lower presents an opportunity to improve GPU utilization.
- The IPC Rate metric indicates the average number of instructions per cycle processed by the two FPU pipelines of Intel® Integrated Graphics. For your workload to fully utilize the floating-point capability of the GPU, the IPC Rate should be close to 2.
Next, look into GPU Utilization when Busy. This section can help you understand if your workload can use the GPU more efficiently.

Ideally, your GPU utilization should be 100%. If GPU Utilization when Busy is below 100%, there were cycles where the GPU was stalled or idle.
- EU State breaks down the activity of GPU execution units. Check here to see if they were stalled or idle when processing your workload.
- Occupancy is a measure of the efficiency of GPU thread scheduling. A value below 100% suggests that you should tune the sizes of the work items in your workload. Consider running the GPU Offload analysis, which provides insight into computing tasks running on the GPU as well as additional GPU-related performance metrics.
If your application offloads code via Intel OpenMP*, check the Offload Time section:
- The Offload Time metric displays the total duration of the OpenMP offload regions in your workload. If Offload Time is below 100%, consider offloading more code to the GPU.
- The Compute, Data Transfer, and Overhead metrics help you understand what constitutes the Offload Time. Ideally, the Compute portion should be 100%. If the Data Transfer component is significant, try to transfer less data between the host and the GPU.
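To make the Offload Time breakdown concrete, the following Python sketch turns three component times into percentages. It assumes that Compute, Data Transfer, and Overhead add up to the Offload Time; the function name and the sample times are hypothetical.

```python
def offload_breakdown(compute_s: float, data_transfer_s: float, overhead_s: float) -> dict:
    """Return each component as a percentage of the total offload time.

    Assumption: the three components sum to the Offload Time.
    """
    total_s = compute_s + data_transfer_s + overhead_s
    parts = {
        "Compute": compute_s,
        "Data Transfer": data_transfer_s,
        "Overhead": overhead_s,
    }
    return {name: 100.0 * value / total_s for name, value in parts.items()}


# Hypothetical times in seconds, for illustration only.
shares = offload_breakdown(compute_s=6.0, data_transfer_s=3.0, overhead_s=1.0)
for name, pct in shares.items():
    print(f"{name}: {pct:.0f}%")
```

In this made-up case, 30% of the offload time goes to moving data, so reducing host-to-GPU transfers would be a reasonable first step.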
In the Top OpenMP Offload Regions section, review the breakdown of offload and GPU metrics by OpenMP offload region. Focus on regions that take up a significant portion of the Offload Time.
For applications that use OpenMP offload, the Bottom-up window displays additional information.

- Group by OpenMP Offload Region. In this grouping, the grid displays:
  - OpenMP Offload Time metrics
  - Instance Count
  - GPU metrics
- The timeline view displays ruler markers that indicate the span of OpenMP Offload Regions and OpenMP Offload Operations within those regions.
Memory Bound
- A high Memory Bound value might indicate that a significant portion of execution time was lost while fetching data. The section shows the fraction of cycles that were lost in stalls being served in different cache hierarchy levels (L1, L2, L3) or fetching data from DRAM. For last-level cache misses that lead to DRAM, it is important to distinguish whether the stalls were caused by a memory bandwidth limit, since bandwidth-bound stalls can require different optimization techniques than latency-bound stalls. VTune Profiler shows a hint about identifying this issue in the DRAM Bound metric issue description. This section also offers the percentage of accesses to a remote socket compared to a local socket to see if memory stalls can be connected with NUMA issues.
- For Intel® Xeon Phi™ processors formerly code named Knights Landing, there is no way to measure memory stalls to assess memory access efficiency in general. Therefore, Back-end Bound stalls that include memory-related stalls are shown instead as a high-level characterization metric. The second-level metrics focus particularly on memory access efficiency.
- A high L2 Hit Bound or L2 Miss Bound value indicates that a high ratio of cycles was spent handling L2 hits or misses.
- The L2 Miss Bound metric does not take into account data brought into the L2 cache by the hardware prefetcher. However, in some cases the hardware prefetcher can generate significant DRAM/MCDRAM traffic and saturate the bandwidth. The Demand Misses and HW Prefetcher metrics show the percentages of all L2 cache input requests that are caused by demand loads or the hardware prefetcher.
- A high DRAM Bandwidth Bound or MCDRAM Bandwidth Bound value indicates that a large percentage of the overall elapsed time was spent with high bandwidth utilization. A high DRAM Bandwidth Bound value is an opportunity to run the Memory Access analysis to identify data structures that can be allocated in high bandwidth memory (MCDRAM), if it is available.
- The Bandwidth Utilization Histogram shows how much time the system bandwidth was utilized at a certain value (Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High, Medium, and Low. The thresholds are calculated based on benchmarks that measure the maximum value. You can also set the thresholds by moving the sliders at the bottom of the histogram. The modified values are applied to all subsequent results in the project.
- Switch to the Bottom-up window and review the Memory Bound columns in the grid to determine optimization opportunities.
- If your application is memory bound, consider running a Memory Access analysis for deeper metrics and the ability to correlate these metrics with memory objects.
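The idea behind the Bandwidth Utilization Histogram and the Bandwidth Bound metrics can be sketched as follows. This is a simplified model, not the tool's implementation: the sample format, the threshold values, and the function name are assumptions made for illustration.

```python
from __future__ import annotations


def bandwidth_histogram(
    samples: list[tuple[float, float]], low_max_gb_s: float, medium_max_gb_s: float
) -> dict[str, float]:
    """Accumulate time per utilization level.

    samples: (duration in seconds, observed bandwidth in GB/s) pairs.
    The two thresholds stand in for the sliders under the histogram.
    """
    time_by_level = {"Low": 0.0, "Medium": 0.0, "High": 0.0}
    for duration_s, bandwidth_gb_s in samples:
        if bandwidth_gb_s <= low_max_gb_s:
            level = "Low"
        elif bandwidth_gb_s <= medium_max_gb_s:
            level = "Medium"
        else:
            level = "High"
        time_by_level[level] += duration_s
    return time_by_level


# Hypothetical samples and thresholds, for illustration only.
samples = [(1.0, 5.0), (2.0, 40.0), (0.5, 70.0), (1.5, 75.0)]
hist = bandwidth_histogram(samples, low_max_gb_s=20.0, medium_max_gb_s=60.0)
bandwidth_bound_pct = 100.0 * hist["High"] / sum(hist.values())

print(hist)
print(f"time at high bandwidth utilization: {bandwidth_bound_pct:.0f}% of elapsed time")
```

Moving a threshold changes only how samples are binned, which is why adjusted sliders affect how later results in the project are categorized.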
Vectorization
Vectorization and GFLOPS metrics are supported on Intel® microarchitectures formerly code named Ivy Bridge, Broadwell, and Skylake. Limited support is available for Intel® Xeon Phi™ processors formerly code named Knights Landing. The metrics are not currently available on 4th Generation Intel processors. Expand the Details section on the analysis configuration pane to view the processor family available on your system.
- The Vectorization metric represents the percentage of packed (vectorized) floating-point operations. 0% means that the code is fully scalar, while 100% means that the code is fully vectorized. The metric does not take into account the actual vector length used by the code for vector instructions. As a result, if the code is fully vectorized but uses a legacy instruction set that fills only half of the vector length, the Vectorization metric still shows 100%. Low vectorization means that a significant fraction of floating-point operations are not vectorized. Use Intel® Advisor to understand possible reasons why the code was not vectorized. The second-level metrics allow for rough estimates of the amount of floating-point work with a particular precision and show the actual vector length of vector instructions with that precision. A partial vector length can indicate legacy instruction set usage and an opportunity to recompile the code with a modern instruction set, which can lead to additional performance improvement. Relevant metrics might include:
  - Instruction Mix
  - FP Arithmetic Instructions per Memory Read or Write
- The Top Loops/Functions with FPU Usage by CPU Time table shows the top functions that contain floating-point operations sorted by CPU time and allows for a quick estimate of the fraction of vectorized code, the vector instruction set used in the loop/function, and the loop type.
- For Intel® Xeon Phi™ processors (formerly code named Knights Landing), the following FPU metrics are available instead of FLOP counters:
  - SIMD Instructions per Cycle
  - Fraction of packed SIMD instructions versus scalar SIMD instructions per cycle
  - Vector instruction set for loops, based on static analysis
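The Vectorization metric described above can be illustrated with a short Python sketch. It assumes that the metric is the share of packed floating-point operations among all floating-point operations, which is how this section describes it; the operation counts and the width labels are hypothetical.

```python
from __future__ import annotations


def vectorization_pct(fp_ops_by_width: dict[str, int]) -> float:
    """Percentage of packed FP operations; the vector width is ignored on purpose."""
    total = sum(fp_ops_by_width.values())
    if total == 0:
        return 0.0
    packed = total - fp_ops_by_width.get("scalar", 0)
    return 100.0 * packed / total


# Hypothetical operation counts, for illustration only.
legacy_build = {"scalar": 0, "128-bit": 1000, "256-bit": 0, "512-bit": 0}
mixed_build = {"scalar": 300, "128-bit": 200, "256-bit": 500, "512-bit": 0}

legacy_pct = vectorization_pct(legacy_build)
mixed_pct = vectorization_pct(mixed_build)
print(f"legacy 128-bit only: {legacy_pct:.0f}%")
print(f"mixed scalar and packed: {mixed_pct:.0f}%")
```

The first case reports full vectorization even though every operation uses a narrow vector, which is the situation where the second-level vector length metrics become useful.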
Intel® Omni-Path Fabric Usage
Intel® Omni-Path Fabric (Intel® OP Fabric) metrics are available for analysis of compute nodes equipped with the Intel OP Fabric interconnect. They help you understand whether MPI communication has bottlenecks caused by reaching interconnect hardware limits. The section shows two aspects of interconnect usage: bandwidth and packet rate. Both bandwidth and packet rate are split into outgoing and incoming data because the interconnect is bi-directional. A bottleneck can be connected with one of the directions.

- Outgoing and Incoming Bandwidth Bound metrics show the percent of elapsed time that an application spent in communication close to or reaching the interconnect bandwidth limit.
- Bandwidth Utilization Histogram shows how much time the interconnect bandwidth was utilized at a certain value (Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High, Medium, and Low.
- Outgoing and Incoming Packet Rate metrics show the percent of elapsed time that an application spent in communication close to or reaching the interconnect packet rate limit.
- Packet Rate Histogram shows how much time the interconnect packet rate was at a certain value and provides thresholds to categorize the packet rate as High, Medium, and Low.
3. Analyze Source
Double-click the function you want to optimize to view its related source code file in the Source/Assembly window. You can open the code editor directly from Intel® VTune™ Profiler and edit your code (for example, to minimize the number of calls to the hotspot function).
4. Analyze Process/Thread Affinity
If the results show inefficient core utilization or NUMA effects, it can be helpful to know if and how threads are pinned to processor cores.
Thread pinning, or affinity, can be applied by parallel runtimes (such as MPI), by using environment variables, or by using APIs from parallel runtimes or the operating system. Use the Collect thread affinity knob in the VTune Profiler GUI or -knob collect-affinity=true in the command line to activate affinity collection for the HPC Performance Characterization analysis. With this option enabled, it is possible to generate a thread affinity command line report that shows thread pinning to sockets, physical cores, and logical cores. Note that affinity information is collected at the end of the thread lifetime, so the resulting data may not show the whole picture for dynamic affinity that changes during the thread lifetime.
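As one illustration of affinity applied through operating system APIs, the Python sketch below reads the CPU set of the calling thread and tries to pin it to a single logical CPU. It assumes a Linux system, where `os.sched_getaffinity` and `os.sched_setaffinity` are available; on other platforms it falls back to listing all CPUs and skips pinning. The helper names are hypothetical.

```python
import os


def allowed_cpus() -> set:
    """Logical CPUs the calling thread may run on (Linux), else all CPUs."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))


def pin_to(cpus: set) -> bool:
    """Try to restrict the calling thread to the given CPUs; report success."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, cpus)
        return True
    except OSError:
        return False


before = allowed_cpus()
first_cpu = min(before)
pinned = pin_to({first_cpu})
print(f"allowed before: {sorted(before)}")
print(f"pinned to CPU {first_cpu}: {pinned}, allowed now: {sorted(allowed_cpus())}")
pin_to(before)  # restore the original CPU set
```

Because the affinity here changes during the lifetime of the thread, a run like this is also an example of the dynamic case that the note above warns about.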
A preview HTML report is available to see process/thread affinity along with thread CPU execution and remote accesses. Use the following command to generate the preview HTML report:
vtune -report affinity -format=html -r <result_dir>
This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production release. It is available for your use in the hopes that you will provide feedback on its usefulness and help determine its future. Data collected with a preview feature is not guaranteed to be backward compatible with future releases.
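The collection knob and the report command from this step can be assembled programmatically, for example in a job script. The Python sketch below only builds and prints the command lines; it does not run them. The analysis type name `hpc-performance` is not stated in this section and is assumed here, and the result directory `r001hpc` and the application `./my_app` are hypothetical.

```python
import shlex


def collect_command(result_dir: str, app_argv: list, collect_affinity: bool = True) -> list:
    """Build a collection command line with optional affinity collection."""
    cmd = ["vtune", "-collect", "hpc-performance"]  # analysis type name is an assumption
    if collect_affinity:
        cmd += ["-knob", "collect-affinity=true"]
    cmd += ["-r", result_dir, "--"] + list(app_argv)
    return cmd


def affinity_report_command(result_dir: str) -> list:
    """Build the preview HTML affinity report command shown in this step."""
    return ["vtune", "-report", "affinity", "-format=html", "-r", result_dir]


# Hypothetical result directory and application.
collect_cmd = collect_command("r001hpc", ["./my_app"])
report_cmd = affinity_report_command("r001hpc")
print(shlex.join(collect_cmd))
print(shlex.join(report_cmd))
```

Passing the lists to `subprocess.run` would execute them on a system where the `vtune` command is on the PATH.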

5. Explore Other Analysis Types
- Run a Memory Access analysis to view more detail about cache bound and memory bound issues affecting the performance of your application.
- Use Intel® Advisor to analyze the application for vectorization optimization opportunities.