Summary Report

Like the Summary window in the GUI, the summary report provides overall performance data for your target. Intel® VTune™ Amplifier automatically generates the summary report when data collection completes. To disable this report, use the no-summary option in your command when performing a collect or collect-with action.

Use the following syntax to generate the Summary report from a preexisting result:

$ amplxe-cl -report summary -result-dir <result_path>

The summary report output depends on the collection type:

User-Mode Sampling and Tracing Collection Summary Report

For User-Mode Sampling and Tracing Collection results, the summary report includes the following sections:

  • Collection and Platform Information

  • CPU Information

  • Summary per basic analysis metrics

Example 1: User-Mode Sampling Hotspots Summary

This example generates the summary report for the r000hs Hotspots analysis result on Windows*:

> amplxe-cl -report summary -r r000hs   
      
Elapsed Time: 1.857s
CPU Time: 10.069s
    Effective Time: 10.069s
    Idle: 0.000s
    Poor: 1.294s
    Ok: 6.381s
    Ideal: 2.395s
    Over: 0s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 9
Paused Time: 0s

Top Hotspots
Function   Module      CPU Time
---------  ----------  --------
multiply1  matrix.exe   10.069s

Collection and Platform Info
    Application Command Line: C:\temp\samples\en\C++\matrix_vtune_amp_xe\matrix\vc14\Win32\Release\matrix.exe 
    Operating System: Microsoft Windows 10
    Computer Name: my-computer
    Result Size: 5 MB 
    Collection start time: 09:41:57 06/09/2018 UTC
    Collection stop time: 09:41:58 06/09/2018 UTC
    Collector Type: Event-based counting driver,User-mode sampling and tracing
    CPU
        Name: Intel(R) Processor code named Skylake
        Frequency: 4.008 GHz
        Logical CPU Count: 8

Example 2: Threading Summary

This example generates a summary report for the Threading analysis result r003tr. The summary portion of the report shows that the multithreaded target spent 64 seconds waiting, with an average concurrency of only 1.073:

$ amplxe-cl -report summary -r r003tr
Summary
-------
Average Concurrency:  1.073
Elapsed Time:         13.911
CPU Time:             11.031
Wait Time:            64.468
Average CPU Usage:    0.768

To identify the cause of the wait, view the result in the GUI performance pane, or generate a performance report.
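
As a rough illustration of reading these numbers programmatically, the sketch below parses the `Name: value` lines of the Summary block shown above and relates Wait Time to Elapsed Time. The helper name `parse_summary` is hypothetical and not part of the VTune Amplifier tooling, and the sketch assumes the values are plain numbers in seconds.

```python
# Sample text copied from the Threading summary above.
REPORT = """\
Summary
-------
Average Concurrency:  1.073
Elapsed Time:         13.911
CPU Time:             11.031
Wait Time:            64.468
Average CPU Usage:    0.768
"""

def parse_summary(text):
    """Return {metric name: float} for every 'Name: number' line."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip the heading and the dashed rule
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # value is not a plain number; ignore the line
    return metrics

metrics = parse_summary(REPORT)
# Wait Time exceeds Elapsed Time here, which suggests that it is
# accumulated across threads rather than measured on a single timeline.
wait_per_elapsed = metrics["Wait Time"] / metrics["Elapsed Time"]
print(f"Wait Time is {wait_per_elapsed:.1f}x Elapsed Time")
```

A ratio well above 1 combined with a concurrency close to 1 is consistent with threads that mostly wait on each other.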

Hardware Event-based Sampling Collection Summary Report

For Hardware Event-based Sampling Collection results, the summary report includes the following information (if available):

  • Collection and Platform information
  • Microarchitecture Exploration metrics
  • CPU information
  • GPU information
  • Summary per basic analysis metrics
  • Event summary
  • Uncore Event summary

For some analysis types, the command-line summary report provides an issue description for each metric that exceeds its predefined threshold. To omit issue descriptions from the summary report, do one of the following:

  • Use the -report-knob show-issues=false option when generating the report, for example: $ amplxe-cl -report summary -r r001hpc -report-knob show-issues=false

  • Use the -format=csv option to view the report in the CSV format, for example: $ amplxe-cl -report summary -r r001hpc -format=csv
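
When reports are generated from a script, the two options above can be assembled into a command line. The following is a minimal sketch that uses only the options shown in this section; the function name `summary_command` and its parameters are hypothetical.

```python
import shlex

def summary_command(result_dir, show_issues=True, csv=False):
    """Build the argument list for an amplxe-cl summary report."""
    argv = ["amplxe-cl", "-report", "summary", "-r", result_dir]
    if not show_issues:
        argv += ["-report-knob", "show-issues=false"]
    if csv:
        argv.append("-format=csv")
    return argv

argv = summary_command("r001hpc", show_issues=False)
# Print the command the way it would be typed in a shell.
print(" ".join(shlex.quote(arg) for arg in argv))

# To actually run it (requires amplxe-cl on PATH):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when the result path contains spaces.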

Example 3: Hardware Event-Based Sampling Hotspots Summary

This example generates the summary report for the r001hs Hotspots analysis (hardware event-based sampling mode) result on Windows* OS:

> amplxe-cl -report summary -r r001hs
Elapsed Time: 3.986s
    CPU Time: 1.391s
    CPI Rate: 0.860
    Wait Time: 65.023s
    Inactive Time: 14.819s
    Total Thread Count: 25
    Paused Time: 0s

Hardware Events
Hardware Event Type                  Hardware Event Count  Hardware Event Sample Count  Events Per Sample
-----------------------------------  --------------------  ---------------------------  -----------------
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE             24,832,593                            8  1000030          
CPU_CLK_UNHALTED.REF_TSC                    3,471,208,416                          120  24000000         
CPU_CLK_UNHALTED.REF_XCLK                      43,877,874                           14  1000030          
CPU_CLK_UNHALTED.THREAD                     3,903,569,890                          127  24000000         
FP_ARITH_INST_RETIRED.SCALAR_DOUBLE           943,046,424                           14  20000030         
INST_RETIRED.ANY                            4,536,715,682                          140  24000000         
UOPS_EXECUTED.THREAD                        5,282,967,942                           72  20000030         
UOPS_RETIRED.RETIRE_SLOTS                   5,587,595,565                           76  20000030         
Collection and Platform Info
    Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe C:\samples\tachyon\dat\balls.dat
    Operating System: Microsoft Windows 10
    Computer Name: My Computer
    Result Size: 13 MB 
    Collection start time: 12:12:52 24/07/2018 UTC
    Collection stop time: 12:13:03 24/07/2018 UTC
    Collector Type: Event-based sampling driver
    CPU
        Name: Intel(R) Processor code named Skylake ULT
        Frequency: 2.496 GHz
        Logical CPU Count: 4
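
As a worked check on the output above, dividing the CPU_CLK_UNHALTED.THREAD count by the INST_RETIRED.ANY count reproduces the reported CPI Rate. The sketch below assumes that this ratio is how the metric is derived, which the report itself does not state; the helper name `event_counts` is hypothetical.

```python
# Two rows copied from the Hardware Events table above.
TABLE = """\
CPU_CLK_UNHALTED.THREAD                     3,903,569,890                          127  24000000
INST_RETIRED.ANY                            4,536,715,682                          140  24000000
"""

def event_counts(text):
    """Map event name -> Hardware Event Count (the second column)."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # Strip the thousands separators before converting.
            counts[fields[0]] = int(fields[1].replace(",", ""))
    return counts

counts = event_counts(TABLE)
cpi = counts["CPU_CLK_UNHALTED.THREAD"] / counts["INST_RETIRED.ANY"]
print(f"CPI: {cpi:.3f}")  # prints "CPI: 0.860"
```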

Use the Elapsed Time metric as your performance baseline to evaluate the effect of your optimizations.
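
One way to use Elapsed Time as a baseline is to extract it from the summary report of each run and compare the values. This is a minimal sketch: the helper name `elapsed_seconds` is hypothetical, and the "optimized" value is invented for illustration rather than measured.

```python
import re

def elapsed_seconds(report_text):
    """Extract the Elapsed Time value, in seconds, from a summary report."""
    match = re.search(r"^Elapsed Time:\s*([\d.]+)s", report_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Elapsed Time line found")
    return float(match.group(1))

# First lines of the r001hs report above.
baseline = elapsed_seconds("Elapsed Time: 3.986s\n    CPU Time: 1.391s\n")
# Hypothetical value for a re-run after an optimization; not a measured result.
optimized = elapsed_seconds("Elapsed Time: 3.200s\n")
print(f"Speedup: {baseline / optimized:.2f}x")
```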

Example 4: HPC Performance Characterization Summary

This command generates the summary report for the HPC Performance Characterization analysis result and skips issue descriptions:

$ amplxe-cl -report summary -r r001hpc -report-knob show-issues=false
Elapsed Time: 23.182s
GFLOPS: 14.748
Effective Physical Core Utilization: 58.0%
    Effective Logical Core Utilization: 13.920 Out of 24 logical CPUs
    Serial Time: 0.069s (0.3%)
    Parallel Region Time: 23.113s (99.7%)
        Estimated Ideal Time: 14.010s (60.4%)
        OpenMP Potential Gain: 9.103s (39.3%)
Memory Bound: 0.446
    Cache Bound: 0.175
    DRAM Bound: 0.216
    NUMA: % of Remote Accesses: 38.3%
FPU Utilization: 2.7%
    GFLOPS: 14.748
        Scalar GFLOPS: 4.801
        Packed GFLOPS: 9.947
Collection and Platform Info
    Application Command Line: ./sp.B.x
    User Name: vtune
    Operating System: 3.10.0-327.el7.x86_64 NAME="Red Hat Enterprise Linux Server" VERSION="7.2 (Maipo)" ID="rhel" ID_LIKE="fedora" VERSION_ID="7.2" PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/"  REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.2 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.2"
    Computer Name: nntvtune235
    Result Size: 1 GB
    Collection start time: 19:04:30 13/06/2017 UTC
    Collection stop time: 19:04:53 13/06/2017 UTC
    Name: Intel(R) Xeon(R) E5/E7 v2 Processor code named Ivytown
    Frequency: 2.694 GHz
    Logical CPU Count: 24
    CPU
        Name: Intel(R) Xeon(R) E5/E7 v2 Processor code named Ivytown
        Frequency: 2.694 GHz
        Logical CPU Count: 24
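
Several values in this report are related by simple arithmetic, which allows a quick consistency check. The sketch below uses numbers copied from the output above; the relationships it checks are inferred from those numbers and are not stated by the report itself.

```python
import math

# Values copied from the report above (seconds unless noted).
elapsed = 23.182
serial, parallel = 0.069, 23.113
ideal, omp_gain = 14.010, 9.103
gflops, scalar, packed = 14.748, 4.801, 9.947
logical_used, logical_total = 13.920, 24

def close(a, b, tol=0.002):
    """True when a and b agree to within the report's rounding."""
    return math.isclose(a, b, abs_tol=tol)

checks = {
    "serial + parallel == elapsed": close(serial + parallel, elapsed),
    "ideal + OpenMP gain == parallel": close(ideal + omp_gain, parallel),
    "scalar + packed == GFLOPS": close(scalar + packed, gflops),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")

# 13.920 of 24 logical CPUs works out to the same 58.0% that the
# report shows on the utilization line above it.
print(f"Logical core utilization: {logical_used / logical_total:.1%}")
```

Read this way, the OpenMP Potential Gain is the part of Parallel Region Time that exceeds the Estimated Ideal Time.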

Example 5: Memory Access Summary

This command generates the summary report for the Memory Access analysis result collected on Windows and shows issue descriptions:

> amplxe-cl -report summary -r r001macc
Elapsed Time: 7.917s
    CPU Time: 6.473s
    Memory Bound: 21.9% of Pipeline Slots
     | The metric value is high. This may indicate that a significant fraction
     | of execution pipeline slots could be stalled due to demand memory load
     | and stores. Explore the metric breakdown by memory hierarchy, memory
     | bandwidth information, and correlation by memory objects.
     |
        L1 Bound: 8.0% of Clockticks
         | This metric shows how often machine was stalled without missing the
         | L1 data cache. The L1 cache typically has the shortest latency.
         | However, in certain cases like loads blocked on older stores, a load
         | might suffer a high latency even though it is being satisfied by the
         | L1.
         |
        L2 Bound: 3.0% of Clockticks
        L3 Bound: 5.0% of Clockticks
         | This metric shows how often CPU was stalled on L3 cache, or contended
         | with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
         | improves the latency and increases performance.
         |
        DRAM Bound: 4.1% of Clockticks
            DRAM Bandwidth Bound: 0.4% of Elapsed Time
            Memory Latency: 0.000
    Loads: 10,137,704,122
    Stores: 3,208,896,264
    LLC Miss Count: 1,750,105
    Average Latency (cycles): 11
    Total Thread Count: 21
    Paused Time: 0s
System Bandwidth
    Max DRAM System Bandwidth: 15 GB 

Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average Bandwidth  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -----------------  ---------------------------------------------
DRAM, GB/sec      15                          11.300              2.836                                           0.4%
Collection and Platform Info
    Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe "C:\samples\tachyon\dat\balls.dat" 
    Operating System: Microsoft Windows 10
    Computer Name: My Computer
    Result Size: 31 MB 
    Collection start time: 09:33:44 07/06/2017 UTC
    Collection stop time: 09:33:52 07/06/2017 UTC
    CPU
        Name: Intel(R) Processor code named Skylake ULT
        Frequency: 2.496 GHz
        Logical CPU Count: 4

The Bandwidth Utilization section in the summary report shows the following metrics:

  • Platform Maximum: Expected maximum bandwidth for the system. This value can be automatically estimated using a micro-benchmark at the start of analysis or hard-coded based on theoretical bandwidth limits.

  • Observed Maximum: Maximum bandwidth observed during the analysis. If the value is close to the Platform Maximum, your workload is probably bandwidth-limited.

  • Average Bandwidth: Average bandwidth utilization during the analysis.

  • % of Elapsed Time with High BW Utilization: Percentage of Elapsed time spent heavily utilizing system bandwidth.

This information is provided for every bandwidth domain present in the result (DRAM, MCDRAM, QPI, and so on).
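
To make the comparison between Observed Maximum and Platform Maximum concrete, the sketch below computes their ratio for the DRAM row of Example 5. The function name `bandwidth_headroom` and the 0.9 cutoff for "close to the Platform Maximum" are assumptions chosen for illustration, not values defined by VTune Amplifier.

```python
def bandwidth_headroom(platform_max, observed_max, near=0.9):
    """Return (ratio, is_near_limit) for one bandwidth domain.

    near is a hypothetical cutoff for "close to the Platform Maximum".
    """
    ratio = observed_max / platform_max
    return ratio, ratio >= near

# DRAM row from the Bandwidth Utilization table in Example 5, in GB/sec.
ratio, near_limit = bandwidth_headroom(platform_max=15, observed_max=11.300)
print(f"Observed/Platform: {ratio:.0%}, near limit: {near_limit}")
```

With these numbers the peak reaches about three quarters of the platform maximum, and the report shows high utilization for only 0.4% of Elapsed Time, so this workload does not look bandwidth-limited.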
