Intel® VTune™ Amplifier

Summary Report

Like the Summary window in the GUI, the summary report provides overall performance data for your target. Intel® VTune™ Amplifier automatically generates the summary report when data collection completes. To disable this report, add the -no-summary option to your command when performing a collect or collect-with action.

Use the following syntax to generate the Summary report from a preexisting result:

$ amplxe-cl -report summary -result-dir <result_path>
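
When the report is needed from a script rather than an interactive shell, the same command can be wrapped programmatically. The following is a minimal Python sketch, not part of the product. The helper names build_summary_command and run_summary_report are hypothetical, and the sketch assumes that amplxe-cl is on the PATH and writes the report to standard output.

```python
import subprocess


def build_summary_command(result_dir):
    """Return the argument list for the summary report command shown above."""
    return ["amplxe-cl", "-report", "summary", "-result-dir", result_dir]


def run_summary_report(result_dir):
    """Run the report command and return its text output.

    Assumes amplxe-cl is installed and on the PATH. Raises
    subprocess.CalledProcessError if the tool exits with an error.
    """
    completed = subprocess.run(
        build_summary_command(result_dir),
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


# Example call (requires a real result directory, so it is left commented out):
# print(run_summary_report("r000hs"))
```

Passing the arguments as a list avoids shell quoting problems when the result path contains spaces.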

The summary report output depends on the collection type:

User-Mode Sampling and Tracing Collection Summary Report

For User-Mode Sampling and Tracing Collection results, the summary report includes the following sections:

Example 1: User-Mode Sampling Hotspots Summary

This example generates the summary report for the r000hs Hotspots analysis result on Windows*:

> amplxe-cl -report summary -r r000hs

Elapsed Time: 1.857s
CPU Time: 10.069s
    Effective Time: 10.069s
    Idle: 0.000s
    Poor: 1.294s
    Ok: 6.381s
    Ideal: 2.395s
    Over: 0s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 9
Paused Time: 0s

Top Hotspots
Function   Module      CPU Time
---------  ----------  --------
multiply1  matrix.exe   10.069s

Collection and Platform Info
    Application Command Line: C:\temp\samples\en\C++\matrix_vtune_amp_xe\matrix\vc14\Win32\Release\matrix.exe 
    Operating System: Microsoft Windows 10
    Computer Name: my-computer
    Result Size: 5 MB 
    Collection start time: 09:41:57 06/09/2018 UTC
    Collection stop time: 09:41:58 06/09/2018 UTC
    Collector Type: Event-based counting driver,User-mode sampling and tracing
    CPU
        Name: Intel® Processor code named Skylake
        Frequency: 4.008 GHz
        Logical CPU Count: 8

Example 2: Threading Summary

This example generates a summary report for the Threading analysis result r003tr. The summary portion of the report shows that the multithreaded target spent 64 seconds waiting, with an average concurrency of only 1.073:

$ amplxe-cl -report summary -r r003tr
Summary
-------
Average Concurrency:  1.073
Elapsed Time:         13.911
CPU Time:             11.031
Wait Time:            64.468
Average CPU Usage:    0.768

To identify the cause of the wait, view the result in the GUI performance pane, or generate a performance report.

Hardware Event-based Sampling Collection Summary Report

For Hardware Event-based Sampling Collection results, the summary report includes the following information (if available):

For some analysis types, the command-line summary report provides an issue description for each metric that exceeds its predefined threshold. To omit issue descriptions from the summary report, you can, for example, add the -report-knob show-issues=false option, as in Example 4.

Example 3: Hardware Event-Based Sampling Hotspots Summary

This example generates the summary report for the r001hs Hotspots analysis (hardware event-based sampling mode) result on Windows* OS.

> amplxe-cl -report summary -r r001hs
Elapsed Time: 3.986s
    CPU Time: 1.391s
    CPI Rate: 0.860
    Wait Time: 65.023s
    Inactive Time: 14.819s
    Total Thread Count: 25
    Paused Time: 0s

Hardware Events
Hardware Event Type                  Hardware Event Count  Hardware Event Sample Count  Events Per Sample
-----------------------------------  --------------------  ---------------------------  -----------------
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE             24,832,593                            8  1000030          
CPU_CLK_UNHALTED.REF_TSC                    3,471,208,416                          120  24000000         
CPU_CLK_UNHALTED.REF_XCLK                      43,877,874                           14  1000030          
CPU_CLK_UNHALTED.THREAD                     3,903,569,890                          127  24000000         
FP_ARITH_INST_RETIRED.SCALAR_DOUBLE           943,046,424                           14  20000030         
INST_RETIRED.ANY                            4,536,715,682                          140  24000000         
UOPS_EXECUTED.THREAD                        5,282,967,942                           72  20000030         
UOPS_RETIRED.RETIRE_SLOTS                   5,587,595,565                           76  20000030         
Collection and Platform Info
    Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe C:\samples\tachyon\dat\balls.dat
    Operating System: Microsoft Windows 10
    Computer Name: My Computer
    Result Size: 13 MB 
    Collection start time: 12:12:52 24/07/2018 UTC
    Collection stop time: 12:13:03 24/07/2018 UTC
    Collector Type: Event-based sampling driver
    CPU
        Name: Intel® Processor code named Skylake ULT
        Frequency: 2.496 GHz
        Logical CPU Count: 4

Use the Elapsed Time metric as your performance baseline: compare it against the Elapsed Time of later runs to evaluate the effect of your optimizations.
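
A baseline comparison of this kind can be scripted. The sketch below is an illustration only: the function names are hypothetical, the baseline text is taken from Example 3, and the "optimized" report is an invented placeholder, not measured data. It assumes the report contains a line of the form "Elapsed Time: <seconds>", with or without a trailing "s", as in the examples on this page.

```python
import re

# Matches "Elapsed Time: 3.986s" as well as "Elapsed Time:         13.911".
ELAPSED_RE = re.compile(
    r"^[ \t]*Elapsed Time:[ \t]*([0-9]+(?:\.[0-9]+)?)s?[ \t]*$",
    re.MULTILINE,
)


def parse_elapsed_seconds(report_text):
    """Return the Elapsed Time value of a summary report, in seconds."""
    match = ELAPSED_RE.search(report_text)
    if match is None:
        raise ValueError("no Elapsed Time line found in report")
    return float(match.group(1))


def speedup(baseline_report, optimized_report):
    """Return baseline elapsed time divided by optimized elapsed time."""
    return parse_elapsed_seconds(baseline_report) / parse_elapsed_seconds(optimized_report)


# First lines of the Example 3 report, used as the baseline.
BASELINE = "Elapsed Time: 3.986s\n    CPU Time: 1.391s\n"
# Hypothetical report after an optimization; these numbers are made up.
OPTIMIZED = "Elapsed Time: 1.993s\n    CPU Time: 1.200s\n"

print(f"speedup: {speedup(BASELINE, OPTIMIZED):.2f}x")
```

A value above 1 means the later run finished faster than the baseline.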

Example 4: HPC Performance Characterization Summary

This command generates the summary report for the HPC Performance Characterization analysis result and skips issue descriptions:

$ amplxe-cl -report summary -r r001hpc -report-knob show-issues=false
Elapsed Time: 23.182s
GFLOPS: 14.748
Effective Physical Core Utilization: 58.0%
    Effective Logical Core Utilization: 13.920 Out of 24 logical CPUs
    Serial Time: 0.069s (0.3%)
    Parallel Region Time: 23.113s (99.7%)
        Estimated Ideal Time: 14.010s (60.4%)
        OpenMP Potential Gain: 9.103s (39.3%)
Memory Bound: 0.446
    Cache Bound: 0.175
    DRAM Bound: 0.216
    NUMA: % of Remote Accesses: 38.3%
FPU Utilization: 2.7%
    GFLOPS: 14.748
        Scalar GFLOPS: 4.801
        Packed GFLOPS: 9.947
Collection and Platform Info
    Application Command Line: ./sp.B.x
    User Name: vtune
    Operating System: 3.10.0-327.el7.x86_64 NAME="Red Hat Enterprise Linux Server" VERSION="7.2 (Maipo)" ID="rhel" ID_LIKE="fedora" VERSION_ID="7.2" PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/"  REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.2 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.2"
    Computer Name: nntvtune235
    Result Size: 1 GB
    Collection start time: 19:04:30 13/06/2017 UTC
    Collection stop time: 19:04:53 13/06/2017 UTC
    CPU
        Name: Intel® Xeon® E5/E7 v2 Processor code named Ivytown
        Frequency: 2.694 GHz
        Logical CPU Count: 24
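
The time and GFLOPS figures in this sample are consistent with one another, which can help when reading the report. The Python sketch below checks three relationships that hold for the numbers above. These are observations about this sample output, not documented metric definitions, and the parse_metrics helper is a hypothetical name introduced for illustration.

```python
import re

# Selected lines copied from the Example 4 report above.
SAMPLE = """\
Elapsed Time: 23.182s
GFLOPS: 14.748
    Serial Time: 0.069s (0.3%)
    Parallel Region Time: 23.113s (99.7%)
        Estimated Ideal Time: 14.010s (60.4%)
        OpenMP Potential Gain: 9.103s (39.3%)
        Scalar GFLOPS: 4.801
        Packed GFLOPS: 9.947
"""


def parse_metrics(report_text):
    """Map each 'Name: value' line to the leading number of its value.

    Lines without a numeric value are skipped. If a name repeats,
    the first occurrence is kept.
    """
    metrics = {}
    for line in report_text.splitlines():
        name, sep, value = line.strip().partition(": ")
        number = re.match(r"[0-9][0-9,]*(?:\.[0-9]+)?", value)
        if sep and number:
            metrics.setdefault(name, float(number.group(0).replace(",", "")))
    return metrics


metrics = parse_metrics(SAMPLE)

# Serial and parallel time together account for the elapsed time.
time_sum = metrics["Serial Time"] + metrics["Parallel Region Time"]
# Potential gain matches parallel region time minus estimated ideal time.
gain = metrics["Parallel Region Time"] - metrics["Estimated Ideal Time"]
# Scalar and packed GFLOPS add up to the total GFLOPS.
gflops_sum = metrics["Scalar GFLOPS"] + metrics["Packed GFLOPS"]

print(round(time_sum, 3), metrics["Elapsed Time"])
print(round(gain, 3), metrics["OpenMP Potential Gain"])
print(round(gflops_sum, 3), metrics["GFLOPS"])
```

Each printed pair should agree to three decimal places for this sample. Other results may differ because of rounding in the report.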

Example 5: Memory Access Summary

This command generates the summary report for the Memory Access analysis result collected on Windows and shows issue descriptions:

> amplxe-cl -report summary -r r001macc

Elapsed Time: 7.917s
    CPU Time: 6.473s
    Memory Bound: 21.9% of Pipeline Slots
     | The metric value is high. This may indicate that a significant fraction
     | of execution pipeline slots could be stalled due to demand memory load
     | and stores. Explore the metric breakdown by memory hierarchy, memory
     | bandwidth information, and correlation by memory objects.
     |
        L1 Bound: 8.0% of Clockticks
         | This metric shows how often machine was stalled without missing the
         | L1 data cache. The L1 cache typically has the shortest latency.
         | However, in certain cases like loads blocked on older stores, a load
         | might suffer a high latency even though it is being satisfied by the
         | L1.
         |
        L2 Bound: 3.0% of Clockticks
        L3 Bound: 5.0% of Clockticks
         | This metric shows how often CPU was stalled on L3 cache, or contended
         | with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
         | improves the latency and increases performance.
         |
        DRAM Bound: 4.1% of Clockticks
            DRAM Bandwidth Bound: 0.4% of Elapsed Time
            Memory Latency: 0.000
    Loads: 10,137,704,122
    Stores: 3,208,896,264
    LLC Miss Count: 1,750,105
    Average Latency (cycles): 11
    Total Thread Count: 21
    Paused Time: 0s
System Bandwidth
    Max DRAM System Bandwidth: 15 GB 

Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average Bandwidth  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -----------------  ---------------------------------------------
DRAM, GB/sec      15                          11.300              2.836                                           0.4%
Collection and Platform Info
    Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe "C:\samples\tachyon\dat\balls.dat" 
    Operating System: Microsoft Windows 10
    Computer Name: My Computer
    Result Size: 31 MB 
    Collection start time: 09:33:44 07/06/2017 UTC
    Collection stop time: 09:33:52 07/06/2017 UTC
    CPU
        Name: Intel® Processor code named Skylake ULT
        Frequency: 2.496 GHz
        Logical CPU Count: 4

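The Bandwidth Utilization section in the summary report shows the following metrics, which appear as the table columns in the example above: Platform Maximum, Observed Maximum, Average Bandwidth, and % of Elapsed Time with High BW Utilization.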

This information is provided for all kinds of bandwidth domains you have in the result (DRAM, MCDRAM, QPI, and so on).
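
A row of the Bandwidth Utilization table can be read programmatically to compare the observed peak against the platform maximum. The following Python sketch is an illustration under stated assumptions: parse_bandwidth_row is a hypothetical helper, and it assumes that columns are separated by two or more spaces, as in the Example 5 output.

```python
import re


def parse_bandwidth_row(row):
    """Split one Bandwidth Utilization table row into named fields.

    Assumes five columns separated by runs of two or more spaces, so a
    domain name such as "DRAM, GB/sec" (single space) stays in one cell.
    """
    cells = re.split(r"\s{2,}", row.strip())
    domain, platform_max, observed_max, average, high_bw_pct = cells
    return {
        "domain": domain,
        "platform_max": float(platform_max),
        "observed_max": float(observed_max),
        "average": float(average),
        "high_bw_pct": float(high_bw_pct.rstrip("%")),
    }


# Data row copied from the Example 5 report.
ROW = "DRAM, GB/sec      15                          11.300              2.836                                           0.4%"

row = parse_bandwidth_row(ROW)
# Fraction of the platform maximum reached at the observed peak.
peak_fraction = row["observed_max"] / row["platform_max"]
print(f'{row["domain"]}: peak at {peak_fraction:.1%} of platform maximum')
```

For this sample, the observed maximum is about three quarters of the platform maximum, while the average bandwidth is much lower.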
