Metrics Reference

This section provides a complete list of metrics supported by Application Performance Snapshot with their descriptions. If data for a metric is available in the statistics files, it is displayed in the analysis summary on the command line and in the HTML report. Note that some metrics are platform-specific, and some are available only if the application uses MPI or OpenMP*.

Elapsed Time

Execution time of the specified application, in seconds.

SP GFLOPS

Number of single precision floating point operations executed per second, in billions (giga-FLOPS). All double precision operations are counted as two single precision operations. The SP GFLOPS metric is available only for 3rd Generation Intel® Core™ processors and 5th and 6th Generation Intel processors.
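
For example, using hypothetical counts rather than values reported by the tool: an application that retires 4e11 single precision and 1e11 double precision operations in 50 seconds is credited with

  4e11 + 2 * 1e11 = 6e11 single precision operations
  6e11 / 50 s / 1e9 = 12 SP GFLOPS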

CPI (Cycles per Instruction Retired) Rate

The average number of cycles each retired instruction took. A CPI of 1 is considered acceptable for high performance computing (HPC) applications, but different application domains have different expected values. The CPI value tends to be greater when there are long-latency memory, floating-point, or SIMD operations, non-retired instructions due to branch mispredictions, or instruction starvation at the front end.
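
For example, with hypothetical counter values: a run that retires 8e10 instructions in 1e11 CPU cycles has

  CPI = 1e11 cycles / 8e10 instructions = 1.25

which is slightly above the 1.0 guideline for HPC applications.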

CPU Utilization

This metric helps evaluate the parallel efficiency of your application. It estimates the utilization of all the logical CPU cores in the system by your application. 100% utilization means that your application keeps all the logical CPU cores busy for the entire time that it runs. Note that the metric does not distinguish between useful application work and the time that is spent in parallel runtimes.
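
As a rough illustration with hypothetical numbers: on a system with 16 logical cores, an application that runs for 100 seconds of elapsed time and accumulates 1200 seconds of CPU time across all its threads shows

  CPU Utilization = 1200 s / (16 cores * 100 s) = 75%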

Serial Time

Time spent by the application outside any OpenMP region in the master thread during collection. This directly impacts application Collection Time and scaling. High values might signal a performance problem to be solved via code parallelization or algorithm tuning.

MPI Time

Time spent inside the MPI library. Values above 15% might warrant further exploration of MPI communication efficiency. High MPI time can be caused by long wait times inside the library, heavy active communication, or non-optimal settings of the MPI library. See the MPI Imbalance metric to determine whether the application has a load balancing problem.
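
The sketch below is a minimal, self-contained illustration of one common way to keep wait time inside the MPI library low: posting nonblocking transfers and overlapping them with independent computation. It is not part of Application Performance Snapshot, and the ranks, tags, and buffer sizes are illustrative placeholders.

    /* Rank pairs exchange a buffer with nonblocking calls and overlap the
     * transfer with local work. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        enum { N = 1 << 20 };
        static double sendbuf[N], recvbuf[N], local[N];
        int peer = rank ^ 1;                 /* pair ranks 0<->1, 2<->3, ... */

        if (size >= 2 && peer < size) {
            MPI_Request reqs[2];
            MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

            /* Independent local work proceeds while the messages are in
             * flight, so this time is not spent waiting inside MPI. */
            double sum = 0.0;
            for (int i = 0; i < N; ++i)
                sum += local[i] * 2.0 + 1.0;

            /* Block only when the exchanged data is actually needed. */
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            if (rank == 0)
                printf("overlap done, checksum %f\n", sum);
        }

        MPI_Finalize();
        return 0;
    }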

MPI Imbalance

Mean unproductive wait time per process spent in the MPI library calls when a process is waiting for data.

OpenMP Imbalance

The metric indicates the percentage of elapsed time that your application wastes at OpenMP* synchronization barriers because of load imbalance.
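
The following is a minimal sketch of the effect, assuming a loop with strongly varying per-iteration cost; it is not APS output. With the default static schedule, some threads finish early and wait at the loop's implicit barrier, which APS reports as OpenMP Imbalance; a dynamic schedule is one common way to shrink that wait. Build with OpenMP enabled (e.g. -qopenmp with the Intel compiler or -fopenmp with GCC).

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 100000 };
        static double result[N];

        /* Work per iteration varies widely (i % 1000 inner steps), so
         * equal-sized static chunks finish at very different times.
         * schedule(dynamic) hands out small chunks on demand and evens
         * out the load across threads. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < N; ++i) {
            double x = 0.0;
            for (int k = 0; k < i % 1000; ++k)   /* uneven work */
                x += sin((double)k);
            result[i] = x;
        }

        printf("result[N-1] = %f\n", result[N - 1]);
        return 0;
    }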

Memory Stalls

This metric indicates how memory subsystem issues affect performance. It measures the fraction of pipeline slots that could be stalled due to demand memory load and store instructions. See the second-level metrics to determine whether the application is cache- or DRAM-bound and to check NUMA efficiency.

Cache Stalls

This metric indicates how often the machine was stalled on the L1, L2, and L3 caches. While cache hits are serviced much more quickly than accesses to DRAM, they can still incur a significant performance penalty. This metric also includes coherence penalties for shared data.

DRAM Stalls

This metric indicates how often the CPU was stalled on the main memory (DRAM) because of demand loads or stores.

DRAM Bandwidth

The metrics in this section indicate the extent of high DRAM bandwidth utilization by the system during elapsed time. They include:

  • Average Bandwidth - Average memory bandwidth used by the system during elapsed time.
  • Peak - Maximum memory bandwidth used by the system during elapsed time.
  • Bound - The portion of elapsed time during which the utilization of memory bandwidth was above a 70% threshold value of the theoretical maximum memory bandwidth for the platform.

Some applications can execute in phases that use memory bandwidth in a non-uniform manner. For example, an application that has an initialization phase may use more memory bandwidth initially. Use these metrics to identify how the application uses memory through the duration of execution.
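
For example, on a hypothetical platform with a theoretical maximum of 100 GB/s, the 70% threshold is 70 GB/s. If measured bandwidth exceeds that threshold for 12 seconds of a 60-second run, then

  Bound = 12 s / 60 s = 20%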

NUMA: % of Remote Accesses

In non-uniform memory architecture (NUMA) machines, memory requests that miss the last-level cache may be serviced by either local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. It is recommended to keep as much frequently accessed data local as possible. This metric indicates the percentage of remote accesses; the lower the value, the better.
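
The sketch below illustrates the common first-touch tuning technique: on Linux, a page is typically placed on the NUMA node of the thread that first writes it (an assumption about typical OS policy, not APS functionality). Initializing data with the same threads and schedule that later use it keeps accesses local and lowers the remote-access percentage. Build with OpenMP enabled (e.g. -qopenmp or -fopenmp).

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1 << 24;                    /* 128 MB of doubles */
        double *a = malloc(n * sizeof *a);
        if (!a) return 1;

        /* First touch: each thread writes the pages it will use later,
         * so those pages land on that thread's local NUMA node. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; ++i)
            a[i] = 0.0;

        /* The same static schedule lets each thread reuse its local pages. */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (size_t i = 0; i < n; ++i)
            sum += a[i] + 1.0;

        printf("sum = %f\n", sum);
        free(a);
        return 0;
    }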

FPU Utilization

This metric represents how intensively your program uses the floating point unit (FPU). A value of 100% means that the FPU is fully loaded: on every cycle of application execution it retires a full-capacity vector instruction, using fused multiply-add (FMA) where the hardware supports it.

SP FLOPs per Cycle

Number of single precision (SP) floating point operations (FLOPs) per clocktick. This metric shows the efficiency of both vector code generation and execution. The maximum number of FLOPs per cycle depends on your hardware platform. For an FMA-capable FPU, the maximum number of FLOPs per cycle is double the maximum vector capacity. All double precision operations are converted to two single precision operations.
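
As an illustration, assuming a core with 256-bit vector registers and one FMA unit: a 256-bit register holds 8 single precision elements, and an FMA performs 2 FLOPs per element, so the peak is 8 * 2 = 16 SP FLOPs per cycle; a core with two such FMA units peaks at 32.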

Vector Capacity Usage

This metric represents how fully the floating point computations in the application are vectorized. A value of 100% means that all floating point instructions are vectorized with the full vector capacity.
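
The loop below is a minimal sketch of code that most vectorizing compilers turn into packed FMA instructions, which drives Vector Capacity Usage toward 100%. It is not tied to a specific compiler, and the build flags mentioned in the comment are only examples.

    #include <stdio.h>

    #define N 1024

    /* restrict promises the arrays do not overlap, removing one common
     * obstacle to vectorization. Build with optimization and a vector ISA
     * enabled, e.g. "-O2 -xHost" (Intel compiler) or "-O3 -march=native" (GCC). */
    void triad(float *restrict a, const float *restrict b,
               const float *restrict c, float s) {
        for (int i = 0; i < N; ++i)
            a[i] = b[i] * s + c[i];            /* multiply-add -> packed FMA */
    }

    int main(void) {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { b[i] = (float)i; c[i] = 1.0f; }
        triad(a, b, c, 2.0f);
        printf("a[10] = %f\n", a[10]);
        return 0;
    }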

FP Instruction Mix

% of Packed FP Instr

This metric represents the percentage of packed floating point instructions among all floating point instructions. The percentages of 128-bit, 256-bit, and 512-bit packed instructions are also reported as separate metrics.

% of Scalar FP Instr

This metric represents the percentage of scalar floating point instructions.

FP Arith/Mem Rd Instr. Ratio

This metric represents the ratio between arithmetic floating point instructions and memory read instructions. A value less than 0.5 might indicate unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.
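
For example, with hypothetical counts: a vectorized loop that issues 1e9 packed floating point arithmetic instructions and 3e9 memory read instructions has a ratio of 1e9 / 3e9 = 0.33, below the 0.5 guideline; extra read instructions of this kind can appear when unaligned vector accesses are split into multiple loads.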

FP Arith/Mem Wr Instr. Ratio

This metric represents the ratio between arithmetic floating point instructions and memory write instructions. A value less than 0.5 might indicate unaligned data access for vector operations, which can negatively impact the performance of vector instruction execution.

Back-end Stalls

(Available for 2nd Generation Intel® Xeon Phi™ Processors (formerly code named Knights Landing))

Superscalar processors can be conceptually divided into the "front-end", where instructions are fetched and decoded into the operations that constitute them, and the "back-end", where the required computation is performed. During each cycle, the front-end generates up to two of these operations, places them into pipeline slots, and moves them through the back-end. The actual number of retired pipeline slots containing useful work rarely reaches this maximum. This can be because the back-end was not prepared to accept more operations of a certain kind ("back-end bound" execution). Back-end bound execution may be due to long-latency operations or other contention for execution resources, such as too many operations being directed to a single execution port.

L2 Hit Bound

Percentage of CPU cycles spent on data fetches that miss the L1 but hit the L2. This metric includes coherence penalties for shared data. If contested accesses or data sharing are indicated as likely issues, address them first. Otherwise, consider the same performance tuning as you would apply for an L2-missing workload.

L2 Miss Bound

Percentage of CPU cycles spent waiting for L2 load misses to be serviced. Possible optimizations are to reduce the data working set size, improve data access locality, block and consume data in chunks that fit in the L2 cache, or better exploit the hardware prefetchers.
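
The following is a minimal sketch of the blocking technique mentioned above, not APS functionality; the matrix size and block size are illustrative tuning parameters. Processing the data in small tiles keeps the working set of the inner loops resident in the L2 cache.

    #include <stdio.h>

    #define N     2048
    #define BLOCK 64                  /* tile edge; tune so a tile fits in L2 */

    static float a[N][N], b[N][N];

    /* Blocked transpose: both the source and destination tiles fit in L2,
     * so the strided accesses hit cache instead of missing to DRAM. */
    void blocked_transpose(void) {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; ++i)
                    for (int j = jj; j < jj + BLOCK; ++j)
                        b[j][i] = a[i][j];
    }

    int main(void) {
        a[3][5] = 42.0f;
        blocked_transpose();
        printf("b[5][3] = %f\n", b[5][3]);
        return 0;
    }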

SIMD Instructions per Cycle

(Available for 2nd Generation Intel® Xeon Phi™ Processors (formerly code named Knights Landing))

This metric represents how intensively your program uses the FPU. Fused multiply-add (FMA) is counted as two instructions.

SIMD Instruction Mix

% of Packed SIMD Instr

This metric represents the percentage of packed SIMD instructions.

% of Scalar SIMD Instr

The metric represents the percentage of scalar SIMD instructions.

Intel® Omni-Path Fabric Interconnect Bandwidth and Packet Rate

(Available for compute nodes equipped with Intel® Omni-Path Fabric (Intel® OP Fabric) and with the Intel® VTune™ Amplifier drivers installed)

Average interconnect bandwidth and packet rate per compute node, broken down into outgoing and incoming values. High values close to the interconnect limit might lead to higher latency in network communications.

For more complete information about compiler optimizations, see our Optimization Notice.