Use Application Performance Snapshot for a quick view into a shared memory application's use of available hardware (CPU, FPU, and memory). Application Performance Snapshot analyzes the time your application spends in OpenMP imbalance, its memory access efficiency, and its FPU usage. After analysis, it displays basic performance enhancement opportunities for systems using Intel® platforms. Use this tool as a first step in application performance analysis to get a simple snapshot of key optimization areas and to learn about profiling tools that specialize in particular aspects of application performance.
Application Performance Snapshot is available as a free product download from the Intel® Developer Zone at https://software.intel.com/performance-snapshot and is also available pre-installed as part of Intel® Parallel Studio or Intel® VTune™ Amplifier.
Before running the tool, you need to set up your environment:
Open a command prompt.
Set the appropriate environment variables to run the tool.
- Pre-installed with Intel Parallel Studio: Run <install-dir>\amplxe-vars.bat, where <install-dir> is the location where Intel® VTune™ Amplifier is installed. For example:
  "C:\Program Files (x86)\IntelSWTools\VTune Amplifier 2018\amplxe-vars.bat"
- Downloaded from the Intel Developer Zone: Add the path for the directory to which you extracted the tool to the command line session environment: set PATH=%PATH%;<install-dir>.
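If you drive the tool from a script rather than an interactive prompt, the PATH step for the downloaded package can be mirrored in Python. The following is a minimal sketch, not part of the product: the function name `env_with_tool_on_path` and the example directory are hypothetical, and it only prepares an environment for a child process. It does not modify your current session.

```python
import os


def env_with_tool_on_path(install_dir, base_env=None):
    """Return a copy of the environment with install_dir appended to PATH.

    install_dir is the directory the tool was extracted to (an assumption
    supplied by the caller). base_env defaults to the current process
    environment; the original mapping is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("PATH", "")
    # Equivalent of: set PATH=%PATH%;<install-dir>
    env["PATH"] = current + os.pathsep + install_dir if current else install_dir
    return env
```

For example, `env_with_tool_on_path(r"C:\tools\aps")` (a hypothetical extraction directory) returns a dictionary you can pass as the `env` argument of `subprocess.run`.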
Analyzing Shared Memory Applications
Run the following command:
aps.bat <my app> <app parameters>
where <my app> is the location of your application and <app parameters> are your application parameters.
Application Performance Snapshot launches the application and runs the data collection.
If it is the first time you are running the tool, it installs the appropriate drivers prior to beginning data collection.
Use the -u option to uninstall the driver. If you use both Application Performance Snapshot and Intel VTune Amplifier, uninstalling the driver can impact VTune Amplifier data collection.
After the analysis completes, a report appears in the command window. You can also open an HTML report with the same information in a supported browser. The path to the HTML report is shown in the command window.
Analyze the data shown in the report. See the metric descriptions below or hover over a metric in the HTML report for more information.
Determine appropriate next steps based on result analysis. Common next steps may include application tuning or using another performance analysis tool for more detailed information, such as Intel VTune Amplifier or Intel Advisor.
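The launch step in the procedure above can also be scripted. The sketch below assumes only the command form documented here (aps.bat followed by the application and its parameters). The helper names `build_aps_command` and `run_aps` are hypothetical, and the sketch expects aps.bat to already be reachable through PATH.

```python
import shutil
import subprocess


def build_aps_command(app, app_params=()):
    """Build the documented command line: aps.bat <my app> <app parameters>."""
    return ["aps.bat", app, *app_params]


def run_aps(app, app_params=(), env=None):
    """Launch the application under aps.bat and return the exit code.

    env is an optional environment mapping for the child process; when it
    is given, its PATH is used to locate aps.bat.
    """
    search_path = env.get("PATH") if env else None
    launcher = shutil.which("aps.bat", path=search_path)
    if launcher is None:
        raise FileNotFoundError(
            "aps.bat was not found on PATH; set up the environment first"
        )
    cmd = build_aps_command(app, app_params)
    cmd[0] = launcher  # use the resolved full path to the batch file
    # The text report is printed to the console by the tool itself.
    return subprocess.run(cmd, env=env, check=False).returncode
```

Calling `run_aps("my_app.exe", ["--size", "1024"])` (a hypothetical application and parameters) corresponds to typing `aps.bat my_app.exe --size 1024` at the prompt.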
Quick Metrics Reference
The following metrics are collected with Application Performance Snapshot. Additional detail about each of these metrics is available in the Intel VTune Amplifier online help.
Elapsed Time: Execution time of the specified application, in seconds.
SP GFLOPS: Number of single-precision giga-floating-point operations performed per second. Each double-precision operation is counted as two single-precision operations. SP GFLOPS metrics are only available for 3rd Generation Intel® Core™ processors, 5th Generation Intel processors, and 6th Generation Intel processors.
Cycles per Instruction Retired (CPI): The average number of cycles that each executed instruction took. A CPI of 1 is considered acceptable for high performance computing (HPC) applications, but expected values vary between application domains. The CPI value tends to be greater when there are long-latency memory, floating-point, or SIMD operations, non-retired instructions due to branch mispredictions, or instruction starvation at the front end.
CPU Utilization: Estimate of the utilization of all logical CPU cores on the system by your application. Use this metric to help evaluate the parallel efficiency of your application. A utilization of 100% means that your application keeps all of the logical CPU cores busy for the entire time that it runs. Note that the metric does not distinguish between useful application work and the time that is spent in parallel runtimes.
Memory Bound: The percentage of potential processor execution pipeline slots lost while the application was fetching data. Stalls while fetching data are usually caused by load instructions that block execution until the load completes. In less common cases, incomplete stores put back-pressure on the pipeline, which causes it to stall. Any value over 20% requires additional investigation.
FPU Utilization: The effective FPU usage while the application was running. Use the FPU Utilization value to evaluate the vector efficiency of your application. The value is calculated by estimating the percentage of operations that are performed by the FPU. A value of 100% means that the FPU is fully loaded. Any value over 50% requires additional analysis. FPU metrics are only available for 3rd Generation Intel Core processors, 5th Generation Intel processors, and 6th Generation Intel processors.
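To make the arithmetic behind some of these metrics concrete, here is a small sketch in Python. It follows the definitions given above: CPI as cycles divided by retired instructions, SP GFLOPS with each double-precision operation counted as two single-precision operations, and the 20% Memory Bound guideline. The function names and raw counts are hypothetical inputs for illustration. The tool derives its own values from hardware counters, so this is not how the tool is implemented.

```python
def cpi(cycles, instructions_retired):
    """Average cycles per retired instruction."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return cycles / instructions_retired


def sp_gflops(sp_ops, dp_ops, elapsed_seconds):
    """Single-precision GFLOPS; each double operation counts as two singles."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (sp_ops + 2 * dp_ops) / elapsed_seconds / 1e9


def memory_bound_needs_investigation(memory_bound_percent):
    """Apply the guideline above: values over 20% deserve a closer look."""
    return memory_bound_percent > 20.0
```

For instance, 2,000 cycles spent on 1,000 retired instructions gives a CPI of 2.0, which is above the value of 1 described as acceptable for HPC applications.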
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright 2016-2017 Intel Corporation
This software and the related documents are Intel copyrighted materials, and your use of them is governed by the express license under which they were provided to you (License). Unless the License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or transmit this software or the related documents without Intel's prior written permission.
This software and the related documents are provided as is, with no express or implied warranties, other than those that are expressly stated in the License.