Intel® Advisor 2021.3
- Offload Modeling:
- GPU-to-GPU performance modeling (feature preview). The Offload Modeling perspective introduces a new GPU-to-GPU performance model. With this model, you can analyze your Data Parallel C++ (DPC++), OpenMP* target, or OpenCL™ application running on a graphics processing unit (GPU) and model its performance on a different GPU platform. Use this workflow to understand how you can improve your application performance and check whether you can get a higher speedup by offloading the application to a different GPU platform. The GPU-to-GPU performance modeling is based on the following:
- The compute throughput model estimates time limited by compute throughput, based on the GPU kernel instruction mix, the compute throughput capabilities of the target GPU, and workload decomposition.
- The memory throughput model estimates cache and memory traffic for the target GPU configuration and, based on this data, the time limited by cache/memory bandwidth.
- The memory latency model estimates the latency of memory read instructions based on the number of such instructions in the kernel.
- The atomic throughput model estimates time limited by atomic throughput based on hardware counters of atomic accesses on the baseline device.
- The data transfer model estimates the offload overhead for transferring data between the host and GPU devices.
- New recommendations for effectively offloading your code from CPU to GPU. The Offload Modeling perspective introduces recommendations for offloading code regions to a GPU, along with performance bottleneck analytics and actionable advice for resolving the bottlenecks when you offload your code from a CPU to a GPU. The recommendations are reported in a new Recommendations pane in the Accelerated Regions report and include the following:
- Recommendations for offloading code regions, with a summary of the modeled performance
- A recommendation for DPC++/OpenMP reduction pattern optimization for code regions recommended for offloading (see the sketch after this list)
- A recommendation for algorithmic constraints optimization for code regions recommended for offloading
- Recommendations for code regions not recommended for offloading, with the reasons why the region is not expected to get a high speedup and suggestions to refactor the code
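For context, a hedged sketch of the kind of DPC++ reduction pattern such a recommendation targets is shown below. The buffer and kernel names are placeholders, and the exact reduction API available depends on your compiler version; this is an illustration, not code from the Intel Advisor documentation.

```cpp
// Minimal DPC++/SYCL reduction sketch (illustrative only).
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1 << 20;
  std::vector<float> data(N, 1.0f);
  float sum = 0.0f;

  sycl::queue q;
  {
    sycl::buffer<float> data_buf(data.data(), sycl::range<1>(N));
    sycl::buffer<float> sum_buf(&sum, sycl::range<1>(1));

    q.submit([&](sycl::handler& h) {
      sycl::accessor in(data_buf, h, sycl::read_only);
      // sycl::reduction lets the runtime choose an efficient reduction strategy
      // instead of serializing all updates through one memory location.
      auto sum_reduction = sycl::reduction(sum_buf, h, sycl::plus<float>());
      h.parallel_for(sycl::range<1>(N), sum_reduction,
                     [=](sycl::id<1> i, auto& acc) { acc += in[i]; });
    });
  }  // sum_buf goes out of scope here and writes the result back to sum
  return sum > 0.0f ? 0 : 1;
}
```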
- GPU Roofline:
- Expandable per-kernel instances for the GPU Roofline. The GPU Roofline Insights perspective introduces a new kernel visualization feature that breaks down a kernel into instances grouped by workload parameters (global and local sizes). If the kernel was executed with different workloads or work groups, you can compare performance characteristics for the different executions. The feature is shown in the following panes of the GPU Roofline report:
- In the GPU grid, you can expand a source kernel to see its sub-rows. Each sub-row is a group of kernel instances executed with the same kernel properties.
- In the GPU Roofline chart, if the kernel was executed with different properties, it has a + (plus) icon near the kernel dot. The parent dot corresponds to the source compute task. Click the plus icon to expand it and see kernel instances that visualize how performance depends on workload parameters.
- Select the source compute task or an instance task from the grid or from the chart to see detailed metrics in the Details pane.
- Potential integer operations (INTOP) extended with logical operations. When measuring the number of integer operations for the GPU Roofline, Intel Advisor now counts logical operations, such as AND, OR, and XOR, as potential integer operations. This reflects the actual performance of the profiled application on the GPU Roofline chart by showing hotspots with logical operations closer to a performance boundary (see the sketch after this list). For more information about operations counted for the GPU Roofline, see Examine Bottlenecks on GPU Roofline Chart.
- New GPU Roofline interpretation hints in the kernel details. Intel Advisor provides hints for memory-bound code to increase application performance and remove memory subsystem bottlenecks. See Examine Kernel Details for details.
- Memory metric grid improvements for the GPU Roofline. In the GPU Roofline report, the memory columns of the GPU grid now provide a better and clearer view of memory metrics:
- Memory metrics are grouped by memory subsystem.
- A new CARM column group includes CARM traffic and L3 cache line utilization metrics.
See Accelerator Metrics for details.
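As a rough, hypothetical illustration of the code this INTOP change affects, the kernel below does only bitwise AND, OR, and XOR work; with logical operations counted as potential integer operations, such a kernel's dot sits closer to its real performance boundary on the GPU Roofline chart. This sketch is not taken from the Intel Advisor documentation, and all names are placeholders.

```cpp
// Hypothetical DPC++/SYCL kernel dominated by logical (bitwise) operations,
// which are now counted toward the INTOP total on the GPU Roofline.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1 << 20;
  std::vector<unsigned> a(N, 0xA5A5A5A5u), b(N, 0x5A5A5A5Au), out(N);

  sycl::queue q;
  {
    sycl::buffer<unsigned> a_buf(a.data(), sycl::range<1>(N));
    sycl::buffer<unsigned> b_buf(b.data(), sycl::range<1>(N));
    sycl::buffer<unsigned> out_buf(out.data(), sycl::range<1>(N));

    q.submit([&](sycl::handler& h) {
      sycl::accessor in_a(a_buf, h, sycl::read_only);
      sycl::accessor in_b(b_buf, h, sycl::read_only);
      sycl::accessor res(out_buf, h, sycl::write_only, sycl::no_init);
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        // AND, OR, and XOR all contribute to the potential integer operation count.
        res[i] = (in_a[i] & in_b[i]) | (in_a[i] ^ in_b[i]);
      });
    });
  }
  return 0;
}
```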
Intel® Advisor 2021.2
- New Source view for the Offload Modeling and GPU Roofline Insights perspectives. The Offload Modeling and GPU Roofline Insights reports now include a full-screen Source view with syntax highlighting in a separate tab. Use it to explore application source code and related metrics. For the GPU Roofline Insights perspective, the Source view also includes the Assembler view, which you can view side by side with the source. To switch to the Source view, double-click a kernel from the main report.
- New Details pane with in-depth GPU kernel analytics for the GPU Roofline Insights perspective. The GPU Roofline Regions report now includes a new Details pane, which provides in-depth kernel execution metrics for a single kernel, such as execution time on GPU, work size and SIMD width, a single-kernel Roofline highlighting the distance to the nearest roof (performance limit), floating-point and integer operation summary, memory and cache bandwidth, EU occupancy, and instruction mix summary.
- Offload Modeling:
- Data transfer estimations with data reuse on GPU. The Offload Modeling perspective introduces a new data reuse analysis, which provides more accurate estimations of data transfer costs. Data reuse analysis detects groups of regions that can reuse the same memory objects on the GPU. It also shows which kernels can benefit from data reuse and how it impacts application performance. Data reuse can decrease the data transfer tax because when two or more kernels use the same memory object, it needs to be transferred only once (see the sketch below).
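To illustrate the idea with a hypothetical sketch (names are placeholders; this is not code from the Intel Advisor documentation): when two kernels read the same buffer, data reuse analysis can model the buffer as transferred to the GPU once rather than once per kernel.

```cpp
// Hypothetical sketch of data reuse between two DPC++/SYCL kernels.
// Both kernels read the same buffer data_buf, so with data reuse taken into
// account it only needs to be transferred to the GPU once, lowering the
// estimated data transfer tax.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1 << 20;
  std::vector<float> data(N, 2.0f), out1(N), out2(N);

  sycl::queue q;
  {
    sycl::buffer<float> data_buf(data.data(), sycl::range<1>(N));
    sycl::buffer<float> out1_buf(out1.data(), sycl::range<1>(N));
    sycl::buffer<float> out2_buf(out2.data(), sycl::range<1>(N));

    // Kernel 1: first consumer of data_buf.
    q.submit([&](sycl::handler& h) {
      sycl::accessor in(data_buf, h, sycl::read_only);
      sycl::accessor res(out1_buf, h, sycl::write_only, sycl::no_init);
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { res[i] = in[i] * 2.0f; });
    });

    // Kernel 2: reuses the same memory object; no second host-to-device copy
    // of data_buf is required.
    q.submit([&](sycl::handler& h) {
      sycl::accessor in(data_buf, h, sycl::read_only);
      sycl::accessor res(out2_buf, h, sycl::write_only, sycl::no_init);
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) { res[i] = in[i] + 1.0f; });
    });
  }
  return 0;
}
```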
- Command line use cases for each Intel Advisor perspective. Several new topics explain how to run each Intel Advisor perspective from the command line. Use these topics to understand which steps to run for each perspective, the recommended options to consider at each step, and the different ways to view the results. See the command line topic for each perspective for details.
- Guidance on how to check if you need to run the Dependencies analysis for the Offload Modeling perspective. Information about loop-carried dependencies can be very important for deciding whether a loop is profitable to run on a GPU (see the sketch below). Intel Advisor can use different resources to get this information, including the Dependencies analysis. The analysis adds a high overhead to your application and is optional for the Offload Modeling workflow. A new topic describes a recommended strategy that you can use to Check How Assumed Dependencies Affect Modeling and decide whether you need to run the Dependencies analysis.
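As a reminder of what the Dependencies analysis looks for, the hypothetical pair of loops below contrasts a loop-carried dependency with independent iterations. This is an illustration only, not an example from the user guide.

```cpp
#include <cstddef>
#include <vector>

// Loop-carried dependency: iteration i reads the value written by iteration i - 1,
// so the iterations cannot run in parallel as written. If Offload Modeling must
// assume such a dependency, the loop may not look profitable to offload.
void prefix_sum(std::vector<float>& a) {
  for (std::size_t i = 1; i < a.size(); ++i)
    a[i] += a[i - 1];
}

// No loop-carried dependency: each iteration touches only its own element,
// so the loop can be safely parallelized and offloaded.
void scale(std::vector<float>& a, float factor) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] *= factor;
}
```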
Intel® Advisor 2021.1
- Data Parallel C++ (DPC++):
- Implemented support for Data Parallel C++ (DPC++) code performance profiling on CPU and GPU targets.
- Implemented support for the oneAPI Level Zero specification for DPC++ applications.
- Introduced a new and improved Intel Advisor user interface (UI) that includes:
- New look and feel for multiple tabs and panes, for example, the Workflow pane and Toolbars
- Offload Modeling and GPU Roofline workflows integrated into the GUI
- A new notion of perspective, which is a complete analysis workflow that you can customize to manage the accuracy/overhead trade-off. Each perspective collects performance data, but processes and presents it differently so that you can look at it from different points of view depending on your goal. Intel Advisor includes the Offload Modeling, GPU Roofline Insights, Vectorization and Code Insights, CPU / Memory Roofline Insights, and Threading perspectives.
To switch back to the old UI, set the ADVISOR_EXPERIMENTAL=advixe_gui environment variable.
- Renamed executables and environment scripts:
- advixe-cl is renamed to advisor.
- advixe-gui is renamed to advisor-gui.
- advixe-python is renamed to advisor-python.
- advixe-vars.[c]sh and advixe-vars.bat are renamed to advisor-vars.[c]sh and advisor-vars.bat, respectively.
See the Command Line Interface for details and sample command lines. The previous command line interface and executables are supported for backward compatibility.
- Offload Modeling:
- Introduced the Offload Modeling perspective (previously known as Offload Advisor), which you can use to prepare your code for efficient GPU offload even before you have the hardware. Identify the parts of your code that can be efficiently offloaded to a target device, estimate the potential speedup, and locate bottlenecks.
- Introduced data transfer analysis as an addition to the Offload Modeling perspective. The analysis reports data transfer costs estimated for offloading to a target device, the estimated amount of memory your application uses per memory level, and hints for data transfer optimizations.
- Introduced strategies to manage kernel invocation taxes (or kernel launch taxes) when modeling performance: do not hide invocation taxes, hide all invocation taxes except the first one, or hide a part of the invocation taxes. For more information, see Manage Invocation Taxes.
- Added support for modeling application performance for the Intel® Iris® Xe MAX graphics.
- Introduced the Memory-Level Roofline feature (previously known as Integrated Roofline, a tech preview feature). Memory-Level Roofline collects metrics for all memory levels and allows you to identify memory bottlenecks at different levels of the memory subsystem (L1, L2, L3, or DRAM).
- Added a limiting memory level roof to the Roofline guidance and recommendations, which improves recommendation accuracy.
- Added single-kernel Roofline guidance for all memory levels to the Code Analytics pane, with dots for multiple levels of the memory subsystem and highlighting of the limiting roof.
- Introduced the GPU Roofline Insights perspective. GPU Roofline visualizes the actual performance of GPU kernels against hardware-imposed performance limitations. Use it to identify the main factor limiting your application performance and get recommendations for effective memory vs. compute optimization. The GPU Roofline report supports float and integer data types and reports metrics for all memory levels.
- Added support for profiling GPU workloads that run on the Intel® Iris® Xe MAX graphics and building a GPU Roofline for them.
- Flow Graph Analyzer:
- Added rules to the Static Rule-check engine to detect issues such as unnecessary copies during buffer creation, host pointer accessor usage in a loop (see the sketch below), and multiple builds/compilations of the same kernel when it is invoked multiple times.
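For instance, the host pointer accessor rule targets patterns like the hypothetical sketch below, where a host accessor created inside a loop forces the runtime to synchronize the buffer between device and host on every iteration. Names and functions here are illustrative and are not taken from the Flow Graph Analyzer documentation.

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Anti-pattern: constructing a host accessor on every loop iteration triggers
// repeated device-to-host synchronization of the buffer.
float sum_slow(sycl::buffer<float, 1>& buf, std::size_t n) {
  float total = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    sycl::host_accessor acc(buf, sycl::read_only);
    total += acc[i];
  }
  return total;
}

// Preferred: create the host accessor once, outside the loop.
float sum_fast(sycl::buffer<float, 1>& buf, std::size_t n) {
  sycl::host_accessor acc(buf, sycl::read_only);
  float total = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    total += acc[i];
  return total;
}
```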
- Introduced a PDF version of the Intel Advisor User Guide. Click Download as PDF at the top of this page to use the PDF version.
- Introduced a new user guide structure that focuses on the new UI and reflects the usage flow to improve usability.