Like the Summary window in the GUI, the summary report provides overall performance data for your target. Intel® VTune™ Amplifier automatically generates the summary report when data collection completes. To disable this report, add the no-summary option to your command when performing a collect or collect-with action.
Use the following syntax to generate the Summary report from a preexisting result:
$ amplxe-cl -report summary -result-dir <result_path>
The summary report output depends on the collection type:
User-mode Sampling and Tracing Collection Summary Report
For User-Mode Sampling and Tracing Collection results, the summary report includes the following sections:
- Collection and Platform information
- Summary per basic analysis metrics
Example 1: User-Mode Sampling Hotspots Summary
This example generates the summary report for the r000hs Hotspots analysis result on Windows*:
> amplxe-cl -report summary -r r000hs
Elapsed Time: 1.857s
CPU Time: 10.069s
Effective Time: 10.069s
Idle: 0.000s
Poor: 1.294s
Ok: 6.381s
Ideal: 2.395s
Over: 0s
Spin Time: 0s
Overhead Time: 0s
Total Thread Count: 9
Paused Time: 0s

Top Hotspots
Function   Module      CPU Time
---------  ----------  --------
multiply1  matrix.exe  10.069s

Collection and Platform Info
Application Command Line: C:\temp\samples\en\C++\matrix_vtune_amp_xe\matrix\vc14\Win32\Release\matrix.exe
Operating System: Microsoft Windows 10
Computer Name: my-computer
Result Size: 5 MB
Collection start time: 09:41:57 06/09/2018 UTC
Collection stop time: 09:41:58 06/09/2018 UTC
Collector Type: Event-based counting driver,User-mode sampling and tracing
CPU Name: Intel(R) Processor code named Skylake
Frequency: 4.008 GHz
Logical CPU Count: 8
Example 2: Threading Summary
This example generates a summary report for the Threading analysis result r003tr. The summary portion of the report shows that the multithreaded target spent 64 seconds waiting, with an average concurrency of only 1.073:
$ amplxe-cl -report summary -r r003tr
Summary
-------
Average Concurrency: 1.073
Elapsed Time: 13.911
CPU Time: 11.031
Wait Time: 64.468
Average CPU Usage: 0.768
To identify the cause of the wait, view the result in the GUI performance pane, or generate a performance report.
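When the summary is consumed by a script rather than read on screen, a single metric such as Wait Time can be pulled out of the text. The following is a minimal sketch, assuming the report prints one "Name: value" pair per line; summary_metric is a hypothetical helper written for this illustration and is not part of amplxe-cl.

```shell
#!/bin/sh
# summary_metric NAME: read summary report text on stdin and print the value
# of the first line whose label is exactly NAME. Hypothetical helper for
# illustration only; it is not provided by amplxe-cl.
summary_metric() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)              # ignore leading indentation
            label = name ":"
            if (index(line, label) == 1) {        # line starts with "NAME:"
                value = substr(line, length(label) + 1)
                sub(/^[ \t]+/, "", value)         # trim space after the colon
                sub(/[ \t\r]+$/, "", value)       # trim trailing space or CR
                print value
                exit
            }
        }'
}

# Sample text copied from the Threading example. With the real tool, pipe the
# report instead:
#   amplxe-cl -report summary -r r003tr | summary_metric "Wait Time"
report='Summary
-------
Average Concurrency: 1.073
Elapsed Time: 13.911
CPU Time: 11.031
Wait Time: 64.468
Average CPU Usage: 0.768'

printf '%s\n' "$report" | summary_metric "Wait Time"            # prints 64.468
printf '%s\n' "$report" | summary_metric "Average Concurrency"  # prints 1.073
```

Matching on the full label followed by a colon keeps "CPU Time" from matching "Average CPU Usage", and taking everything after the first label keeps values that themselves contain colons, such as timestamps, intact.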
Hardware Event-based Sampling Collection Summary Report
For Hardware Event-based Sampling Collection results, the summary report includes the following information (if available):
- Collection and Platform information
- Microarchitecture Exploration metrics
- CPU information
- GPU information
- Summary per basic analysis metrics
- Event summary
- Uncore Event summary
For some analysis types, the command-line summary report provides an issue description for metrics that exceed the predefined threshold. To omit issue descriptions from the summary report, add the -report-knob show-issues=false option to the command, as shown in Example 4.
Example 3: Hardware Event-Based Sampling Hotspots Summary
This example generates the summary report for the r001hs Hotspots analysis result (hardware event-based sampling mode) on Windows* OS:
> amplxe-cl -report summary -r r001hs
Elapsed Time: 3.986s
CPU Time: 1.391s
CPI Rate: 0.860
Wait Time: 65.023s
Inactive Time: 14.819s
Total Thread Count: 25
Paused Time: 0s

Hardware Events
Hardware Event Type                  Hardware Event Count  Hardware Event Sample Count  Events Per Sample
-----------------------------------  --------------------  ---------------------------  -----------------
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE   24,832,593            8                            1000030
CPU_CLK_UNHALTED.REF_TSC             3,471,208,416         120                          24000000
CPU_CLK_UNHALTED.REF_XCLK            43,877,874            14                           1000030
CPU_CLK_UNHALTED.THREAD              3,903,569,890         127                          24000000
FP_ARITH_INST_RETIRED.SCALAR_DOUBLE  943,046,424           14                           20000030
INST_RETIRED.ANY                     4,536,715,682         140                          24000000
UOPS_EXECUTED.THREAD                 5,282,967,942         72                           20000030
UOPS_RETIRED.RETIRE_SLOTS            5,587,595,565         76                           20000030

Collection and Platform Info
Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe C:\samples\tachyon\dat\balls.dat
Operating System: Microsoft Windows 10
Computer Name: My Computer
Result Size: 13 MB
Collection start time: 12:12:52 24/07/2018 UTC
Collection stop time: 12:13:03 24/07/2018 UTC
Collector Type: Event-based sampling driver
CPU Name: Intel(R) Processor code named Skylake ULT
Frequency: 2.496 GHz
Logical CPU Count: 4
Use the Elapsed Time metric as your performance baseline when evaluating the effect of your optimizations.
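A baseline comparison of this kind can be scripted. The sketch below is only an illustration: elapsed_seconds and speedup are hypothetical helpers, the first summary reuses the Elapsed Time from the example above, and the second summary is invented to stand in for a run collected after an optimization.

```shell
#!/bin/sh
# elapsed_seconds: read summary report text on stdin and print the Elapsed Time
# value as a bare number (a trailing "s" unit is dropped if present).
# Hypothetical helper, not part of amplxe-cl.
elapsed_seconds() {
    awk '
        /^[ \t]*Elapsed Time:/ {
            sub(/^[ \t]*Elapsed Time:[ \t]*/, "")   # drop the label
            sub(/s[ \t\r]*$/, "")                   # drop the unit suffix
            print
            exit
        }'
}

# speedup BASELINE CURRENT: print BASELINE / CURRENT with two decimals.
# Fails (non-zero status) when CURRENT is not a positive number.
speedup() {
    awk -v base="$1" -v cur="$2" 'BEGIN {
        if (cur + 0 <= 0) exit 1                    # avoid division by zero
        printf "%.2f\n", base / cur
    }'
}

# Baseline values taken from the hardware event-based sampling example.
baseline='Elapsed Time: 3.986s
CPU Time: 1.391s'
# Invented values for a later run, used only to exercise the helpers.
optimized='Elapsed Time: 1.993s
CPU Time: 1.204s'

base=$(printf '%s\n' "$baseline" | elapsed_seconds)
cur=$(printf '%s\n' "$optimized" | elapsed_seconds)
echo "baseline ${base}s, current ${cur}s, speedup $(speedup "$base" "$cur")x"
# prints: baseline 3.986s, current 1.993s, speedup 2.00x
```

With real results, the two here-strings would be replaced by the output of amplxe-cl -report summary for the baseline result directory and for the result collected after the change.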
Example 4: HPC Performance Characterization Summary
This command generates the summary report for the HPC Performance Characterization analysis result and skips issue descriptions:
$ amplxe-cl -report summary -r r001hpc -report-knob show-issues=false
Elapsed Time: 23.182s
GFLOPS: 14.748
Effective Physical Core Utilization: 58.0%
Effective Logical Core Utilization: 13.920 Out of 24 logical CPUs
Serial Time: 0.069s (0.3%)
Parallel Region Time: 23.113s (99.7%)
Estimated Ideal Time: 14.010s (60.4%)
OpenMP Potential Gain: 9.103s (39.3%)
Memory Bound: 0.446
Cache Bound: 0.175
DRAM Bound: 0.216
NUMA: % of Remote Accesses: 38.3%
FPU Utilization: 2.7%
GFLOPS: 14.748
Scalar GFLOPS: 4.801
Packed GFLOPS: 9.947

Collection and Platform Info
Application Command Line: ./sp.B.x
User Name: vtune
Operating System: 3.10.0-327.el7.x86_64 NAME="Red Hat Enterprise Linux Server" VERSION="7.2 (Maipo)" ID="rhel" ID_LIKE="fedora" VERSION_ID="7.2" PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.2 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.2"
Computer Name: nntvtune235
Result Size: 1 GB
Collection start time: 19:04:30 13/06/2017 UTC
Collection stop time: 19:04:53 13/06/2017 UTC
CPU Name: Intel(R) Xeon(R) E5/E7 v2 Processor code named Ivytown
Frequency: 2.694 GHz
Logical CPU Count: 24
Example 5: Memory Access Summary
This command generates the summary report for the Memory Access analysis result collected on Windows and shows issue descriptions:
> amplxe-cl -report summary -r r001macc
Elapsed Time: 7.917s
CPU Time: 6.473s
Memory Bound: 21.9% of Pipeline Slots
 | The metric value is high. This may indicate that a significant fraction
 | of execution pipeline slots could be stalled due to demand memory load
 | and stores. Explore the metric breakdown by memory hierarchy, memory
 | bandwidth information, and correlation by memory objects.
 |
L1 Bound: 8.0% of Clockticks
 | This metric shows how often machine was stalled without missing the
 | L1 data cache. The L1 cache typically has the shortest latency.
 | However, in certain cases like loads blocked on older stores, a load
 | might suffer a high latency even though it is being satisfied by the
 | L1.
 |
L2 Bound: 3.0% of Clockticks
L3 Bound: 5.0% of Clockticks
 | This metric shows how often CPU was stalled on L3 cache, or contended
 | with a sibling Core. Avoiding cache misses (L2 misses/L3 hits)
 | improves the latency and increases performance.
 |
DRAM Bound: 4.1% of Clockticks
DRAM Bandwidth Bound: 0.4% of Elapsed Time
Memory Latency: 0.000
Loads: 10,137,704,122
Stores: 3,208,896,264
LLC Miss Count: 1,750,105
Average Latency (cycles): 11
Total Thread Count: 21
Paused Time: 0s

System Bandwidth
Max DRAM System Bandwidth: 15 GB

Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average Bandwidth  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -----------------  ---------------------------------------------
DRAM, GB/sec      15                11.300            2.836              0.4%

Collection and Platform Info
Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe "C:\samples\tachyon\dat\balls.dat"
Operating System: Microsoft Windows 10
Computer Name: My Computer
Result Size: 31 MB
Collection start time: 09:33:44 07/06/2017 UTC
Collection stop time: 09:33:52 07/06/2017 UTC
CPU Name: Intel(R) Processor code named Skylake ULT
Frequency: 2.496 GHz
Logical CPU Count: 4
The Bandwidth Utilization section in the summary report shows the following metrics:
- Platform Maximum: Expected maximum bandwidth for the system. This value is either estimated automatically with a micro-benchmark at the start of the analysis or hard-coded from theoretical bandwidth limits.
- Observed Maximum: Maximum bandwidth observed during the analysis. If the value is close to the Platform Maximum, your workload is probably bandwidth-limited.
- Average Bandwidth: Average bandwidth utilization during the analysis.
- % of Elapsed Time with High BW Utilization: Percentage of Elapsed Time spent heavily utilizing system bandwidth.
These metrics are reported for every bandwidth domain present in the result (DRAM, MCDRAM, QPI, and so on).
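The comparison between Observed Maximum and Platform Maximum can be made concrete with a small calculation. The sketch below uses the DRAM row from the Memory Access example (Platform Maximum 15, Observed Maximum 11.300). The helper names are hypothetical, and the 80% cut-off for "close to the Platform Maximum" is an assumption chosen for this illustration, not a threshold defined by VTune Amplifier.

```shell
#!/bin/sh
# bw_percent PLATFORM_MAX OBSERVED_MAX: print Observed Maximum as a percentage
# of Platform Maximum, with one decimal. Hypothetical helper.
bw_percent() {
    awk -v plat="$1" -v obs="$2" 'BEGIN {
        if (plat + 0 <= 0) exit 2                  # reject a missing or zero maximum
        printf "%.1f\n", 100 * obs / plat
    }'
}

# likely_bandwidth_limited PLATFORM_MAX OBSERVED_MAX [THRESHOLD_PERCENT]
# Succeeds (exit status 0) when the observed peak reaches the threshold.
# The 80 percent default is an assumption of this sketch, not a VTune rule.
likely_bandwidth_limited() {
    awk -v plat="$1" -v obs="$2" -v thr="${3:-80}" 'BEGIN {
        if (plat + 0 <= 0) exit 2
        if (100 * obs / plat >= thr + 0) exit 0
        exit 1
    }'
}

# DRAM row of the Memory Access example, values in GB/sec.
echo "DRAM peak is $(bw_percent 15 11.300)% of the platform maximum"
if likely_bandwidth_limited 15 11.300; then
    echo "workload is probably bandwidth-limited"
else
    echo "bandwidth is probably not the main limiter"
fi
# prints:
#   DRAM peak is 75.3% of the platform maximum
#   bandwidth is probably not the main limiter
```

For this result the observed peak reaches about three quarters of the platform maximum, and the report itself shows high bandwidth utilization for only 0.4% of the elapsed time, so both signals point away from a bandwidth limit.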