Intel® VTune™ Amplifier

HPC Performance Characterization View

Use the HPC Performance Characterization viewpoint to estimate CPU usage, memory efficiency, and floating-point utilization for compute-intensive or throughput applications. Compute-intensive or throughput applications should use hardware resources efficiently for the duration of their elapsed time. Use the HPC Performance Characterization analysis as a starting point for optimizing application performance and runtime.

To interpret the performance data provided in the HPC Performance Characterization viewpoint, you may follow the steps below:

  1. Define a Performance Baseline

  2. Determine Optimization Opportunities

  3. Analyze Source

  4. Analyze Process/Thread Affinity
  5. Explore Other Analysis Types

Tip

Use the Analyzing an OpenMP* and MPI Application tutorial to review basic steps for tuning a hybrid application. The tutorial is available from the Intel Developer Zone at https://software.intel.com/en-us/itac-vtune-mpi-openmp-tutorial-lin. A webinar explaining the HPC Performance Characterization analysis is also available at https://software.intel.com/en-us/videos/hpc-applications-need-high-performance-analysis.

1. Define a Performance Baseline

Start with exploring the Summary window that provides general information on your application execution. Key areas for optimization include the elapsed time and floating-point operation per second counts (single precision, double precision, and legacy x87). Red text indicates an area of potential optimization. Hover over a flag to learn more about how to improve your code.

Use the Elapsed Time and GFLOPS values as a baseline for comparison of versions before and after optimization.

2. Determine Optimization Opportunities

Review the Summary window to find the key optimization opportunities for your application. Performance metrics that can be improved are marked in red. Issues identified could include Effective Physical Core Utilization, Memory Bound, Vectorization, or a combination of these. The following sections provide suggested next steps for each performance aspect:

CPU Utilization

Memory Bound

Vectorization

Note

Vectorization and GFLOPS metrics are supported on Intel® microarchitectures formerly code named Ivy Bridge, Broadwell, and Skylake. Limited support is available for Intel® Xeon Phi™ processors formerly code named Knights Landing. The metrics are not currently available on 4th Generation Intel processors. Expand the Details section on the analysis configuration pane to view the processor family available on your system.

Intel® Omni-Path Fabric Usage

Intel® Omni-Path Fabric (Intel® OP Fabric) metrics are available for analysis of compute nodes equipped with Intel OP Fabric interconnect. They help to understand if MPI communication has bottlenecks connected with reaching interconnect hardware limits. The section shows two aspects interconnect usage: bandwidth and packet rate. Both bandwidth and packet rate split the data into outgoing and incoming data because the interconnect is bi-directional. A bottleneck can be connected with one of the directions.

3. Analyze Source

Double-click the function you want to optimize to view its related source code file in the Source/Assembly window. You can open the code editor directly from the Intel® VTune™ Amplifier and edit your code (for example, minimizing the number of calls to the hotspot function).

4. Analyze Process/Thread Affinity

If the results show inefficient core utilization or NUMA effects, it can be helpful to know if and how threads are pinned to processor cores.

The thread pinning or affinity can be applied by parallel runtimes (such as MPI), by using environment variables, or by using APIs from parallel runtimes or the operating system. Use the knob Collect thread affinity in the VTune Amplifier GUI or -knob collect-affinity=true in the command line to activate affinity collection for the HPC Performance Characterization analysis. With this option enabled it is possible to generate a thread affinity command line report that shows thread pinning to sockets, physical cores, and logical cores. Note that affinity information is collected at the end of the thread lifetime, so the resulting data may not show the whole issue for dynamic affinity that is changed during the thread lifetime.

A preview HTML report is available to see process/thread affinity along with thread CPU execution and remote accesses. Use the following command to generate the preview HTML report:

amplxe-cl -report affinity -format=html -r <result_dir>

Note

This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production release. It is available for your use in the hopes that you will provide feedback on its usefulness and help determine its future. Data collected with a preview feature is not guaranteed to be backward compatible with future releases. Please send your feedback to parallel.studio.support@intel.com or to intelsystemstudio@intel.com.

5. Explore Other Analysis Types

See Also