Explore Performance Gain from GPU-to-GPU Modeling (Preview)
- In the Program Metrics pane, compare the Time on Baseline GPU with the Time on Target GPU and examine the Average Speedup to understand whether the GPU kernels in your application perform better on the target GPU. Time on Baseline GPU includes only the execution time of GPU kernels and ignores the CPU parts of your application. Time on Target GPU includes the estimated execution time of GPU kernels on the target plus offload taxes. In the pie chart on the right, review the components of the time on the target GPU to see where the GPU kernels spend most of their time: executing on the target GPU (Estimated Time on GPU), transferring data between the host device and the target GPU (Data Transfer Tax), or scheduling kernels on the target GPU (Kernel Launch Tax).
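The relationship between these metrics can be sketched as follows. This is an illustrative calculation with made-up numbers, not an Intel Advisor API:

```python
# Sketch of how the Program Metrics values relate (illustrative names and
# numbers, not an Intel Advisor API). Times are in seconds.
time_on_baseline_gpu = 12.0      # measured kernel time on the baseline GPU

# Time on Target GPU = estimated kernel time on the target + offload taxes
estimated_time_on_gpu = 3.5      # estimated kernel execution on the target
data_transfer_tax = 1.0          # host <-> target data transfers
kernel_launch_tax = 0.5          # scheduling kernels on the target
time_on_target_gpu = estimated_time_on_gpu + data_transfer_tax + kernel_launch_tax

average_speedup = time_on_baseline_gpu / time_on_target_gpu
print(f"Time on Target GPU: {time_on_target_gpu:.1f} s")
print(f"Average Speedup:    {average_speedup:.1f}x")
```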
- In the Offloads Bounded by pane, examine what the GPU kernels are potentially bounded by on the target GPU. The parameters with the highest percentages show where the GPU kernels spend the most time. Review the detailed metrics for these parameters in other tabs to understand whether you need to optimize your application for them.
- In the Target Device Configuration pane (in the top right), examine the target GPU parameters that were used to model the GPU kernel performance. You can also use the sliders to adjust the parameters and create a custom configuration. To model performance for the custom configuration, save the configuration settings from the report and rerun the performance modeling step with analyze.py for the new configuration file. For details about using custom configurations, go to the Configuration tab and review the comments.
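A saved custom configuration is a TOML file. The exact parameter names depend on your Intel Advisor version and are documented in the comments of the Configuration tab, so the keys below are hypothetical placeholders that only sketch the idea:

```toml
# Hypothetical sketch of a custom target-device configuration file.
# The key names below are placeholders, not a verified schema; check the
# comments in the Configuration tab of your report for the real ones.
[device]
name = "My Custom GPU"
frequency_ghz = 1.5    # value adjusted with a slider in the report
eu_count = 512         # value adjusted with a slider in the report
```

After saving the file, rerun the performance modeling step with analyze.py, pointing it at this configuration file, to produce a report for the custom device.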
- In the Top Offloaded pane, review the top five GPU kernels with the highest absolute offload gain (in seconds) estimated on the target GPU. The gain is calculated as (time measured on the baseline GPU - time estimated on the target GPU). This pane considers all GPU kernels in your application and might also show kernels with an estimated speedup of less than 1. For each kernel in the pane, you can review the speedup, the time on the baseline and target GPUs, the main bounded-by parameters, and the estimated amount of data transferred.
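The ranking in this pane can be sketched as follows. The kernel names and timings are made up for illustration; only the gain formula comes from the text above:

```python
# Sketch of how the Top Offloaded pane ranks kernels (illustrative data,
# not an Intel Advisor API). Gain = baseline time - estimated target time.
kernels = {
    # name: (time_on_baseline_gpu_s, time_on_target_gpu_s)
    "matmul":    (8.0, 2.0),
    "reduce":    (1.0, 1.5),   # estimated speedup < 1: still listed
    "stencil":   (4.0, 1.0),
    "transpose": (0.5, 0.4),
}

gains = {name: base - target for name, (base, target) in kernels.items()}
# Sort by absolute gain in seconds and keep the top five.
top_offloaded = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)[:5]
for name, gain in top_offloaded:
    print(f"{name}: {gain:+.1f} s")
```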
- In the metrics table, examine the detailed performance metrics for the GPU kernels. The Measured column group shows metrics measured on the baseline GPU. Other column groups show metrics estimated for the target GPU. You can expand column groups to see more metrics. For example, to find a potential bottleneck, examine the Offload Information column group, focusing on the Bounded by and Total Execution Time by metrics. For details about the bounding factor, scroll right to the column group corresponding to the value reported in the Bounded by column, for example, L3 Cache, DRAM, or LLC. Expand the column group and examine the Total column. The utilization is calculated as the Bandwidth Utilization, a ratio of the average memory level bandwidth to its peak bandwidth. A high value means that the kernel uses this memory level heavily, making it a potential bottleneck. You can also review the following data to find bottlenecks:
- If you see high cache or memory bandwidth utilization (for example, in the L3 Cache, SLM, or LLC column groups), consider optimizing cache/memory traffic to improve performance.
- If you see high latency in the Offload Information column group, consider optimizing cache/memory latency by scheduling enough parallel work for this kernel to increase thread occupancy.
- If you see a high data transfer tax in the Overhead column group, consider optimizing data transfers or using unified shared memory (USM).

You can also focus on the most interesting data to analyze your problem using the sort and filter controls:
- To filter data in a column, hover over the column title and click the menu icon, or click the Custom filter button on the right. In the filter tab, deselect values you want to hide from the table or specify filter criteria. For example, you can apply a filter to the Hierarchy column to see only specific kernels of interest and hide all other kernels.
- To configure the table metrics, click the Column configurator button on the right and select the columns to show in the table and/or deselect the columns or column groups to hide. For example, if you want to analyze how well your application uses memory resources on the target GPU, you can show only the memory-related column groups.
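The bandwidth-utilization check described for the metrics table can be sketched as follows. The memory levels come from the text; the bandwidth numbers are invented for illustration:

```python
# Sketch of the Bandwidth Utilization calculation: the ratio of the average
# bandwidth a kernel achieves on a memory level to that level's peak
# bandwidth. A level with high utilization is close to saturation and is a
# likely bottleneck. All numbers are illustrative, not tool output.
peak_bandwidth_gb_s    = {"L3 Cache": 880.0, "LLC": 220.0, "DRAM": 68.0}
average_bandwidth_gb_s = {"L3 Cache": 180.0, "LLC":  40.0, "DRAM": 61.0}

utilization = {
    level: average_bandwidth_gb_s[level] / peak_bandwidth_gb_s[level]
    for level in peak_bandwidth_gb_s
}
bottleneck = max(utilization, key=utilization.get)
print(f"Likely bottleneck: {bottleneck} ({utilization[bottleneck]:.0%})")
```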
- In the Source pane on the right, see the source code associated with a kernel, if available. Select a kernel in the metrics table to see its source.
- In the Memory objects pane on the right, see the details about the memory objects transferred between the host device and the target GPU for a kernel. Select a kernel in the metrics table to see its memory objects data. Examine this pane if you see a high data transfer tax for a kernel. The pane includes two parts:
- The memory object table lists all memory objects accessed by the kernel, with details about each object, such as its size, transfer direction (only to the host, only to the target, or from the host to the target and back), and object type. You can also apply filters to the table columns to focus on the most interesting data. Intel Advisor uses this data to estimate the data transfer traffic and data transfers for each kernel.
- The memory objects histogram shows the distribution of memory objects by size. A large number of small objects may result in high latency for the kernel, and high latency might cause a high data transfer tax.
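A size distribution like the one in this histogram can be sketched with `collections.Counter`. The object sizes and bucket boundaries below are illustrative, not the tool's actual bucketing:

```python
from collections import Counter

# Sketch of the memory-objects histogram: bucket objects by size and count
# them. Many small objects can mean many small transfers, which raises
# latency and, in turn, the data transfer tax. Sizes are illustrative.
object_sizes_bytes = [64, 128, 64, 256, 4096, 64, 512, 1 << 20, 128]

def bucket(size: int) -> str:
    """Map an object size in bytes to an illustrative histogram bucket."""
    if size < 1024:
        return "< 1 KB"
    if size < 1 << 20:
        return "1 KB - 1 MB"
    return ">= 1 MB"

histogram = Counter(bucket(s) for s in object_sizes_bytes)
for label in ("< 1 KB", "1 KB - 1 MB", ">= 1 MB"):
    print(f"{label:>11}: {histogram[label]}")
```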