The Intercept Layer for OpenCL* Applications is an open-source tool that you can use to profile oneAPI designs at the system level. Although it is not part of the Intel® oneAPI Base Toolkit installation, it is freely available on GitHub*.
This tool serves the following purposes:
Intercept host calls before they reach the device to gather performance data and log host calls.
Provide data for visualizing the calls over time, separated into queued, submitted, and execution phases, so that you can better understand the execution.
Identify gaps (using visualization) in the runtime that may be leading to inefficient execution and throughput drops.
The Intercept Layer for OpenCL* Applications tool has a different purpose than the Intel® FPGA Dynamic Profiler for DPC++, which provides information about the kernels themselves and helps optimize the hardware. You can use the two tools together to optimize both host-side and device-side execution.
The Intercept Layer has several options for capturing different aspects of the host run, and these options are described in its documentation. The call-logging and device timeline features print information about the calls made by the host during execution.
You can view visualizations of this data using either of the following methods:
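As a rough sketch of how these features might be switched on, the Intercept Layer can read its controls from environment variables, which on Linux carry a CLI_ prefix. The helper below builds such an environment in Python. The control names used here (CallLogging, DevicePerformanceTimelineLogging, ChromePerformanceTiming) are believed to match the project's controls documentation, but verify them there before relying on them. The function name intercept_layer_env is hypothetical.

```python
import os


def intercept_layer_env(base_env=None):
    """Return a copy of the environment with Intercept Layer controls enabled.

    Assumption: the control names below match the Intercept Layer's
    documented controls; check the project's documentation to confirm.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CLI_CallLogging"] = "1"                       # log host API calls
    env["CLI_DevicePerformanceTimelineLogging"] = "1"  # log device timeline data
    env["CLI_ChromePerformanceTiming"] = "1"           # emit trace-viewer JSON
    return env


if __name__ == "__main__":
    # Print the settings as shell exports so they can be inspected or copied.
    for name, value in sorted(intercept_layer_env({}).items()):
        print(f"export {name}={value}")
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the host application, provided the Intercept Layer itself is already installed and loaded as its documentation describes.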
Use JSON files generated by the Intercept Layer for OpenCL Applications that contain device timeline information. You can open these JSON files in the Google* Chrome trace event profiling tool, which provides a visualization of the data.
Use the Python script included with the Intercept Layer for OpenCL Applications, which parses the timeline information into a Microsoft* Excel file and presents it both as a table and as a bar graph.
Use the visualized data to identify gaps in the runtime where events are waiting for something else to finish executing. While it is not possible to eliminate all the gaps, you might be able to eliminate gaps caused by dependencies that can be avoided.
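Gaps can also be located programmatically rather than by eye. The following is a minimal sketch, assuming the timeline file is well-formed JSON in the Chrome trace event format and that device work appears as complete events ("ph": "X") with "ts" and "dur" fields in microseconds. The function names and the sample data are hypothetical, and a file that is not strictly valid JSON would need cleanup before parsing.

```python
import json


def load_complete_events(text):
    """Parse trace-event JSON and keep only complete ("X") events."""
    data = json.loads(text)
    # The trace event format allows a bare array or an object with "traceEvents".
    events = data["traceEvents"] if isinstance(data, dict) else data
    return [e for e in events if e.get("ph") == "X" and "ts" in e and "dur" in e]


def find_gaps(events, min_gap_us=0.0):
    """Return (idle_start, idle_end) intervals during which no event is running."""
    gaps = []
    busy_until = None
    for event in sorted(events, key=lambda e: e["ts"]):
        start = event["ts"]
        end = start + event["dur"]
        if busy_until is not None and start - busy_until > min_gap_us:
            gaps.append((busy_until, start))
        # Track the latest end time seen so overlapping events are not counted as gaps.
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


if __name__ == "__main__":
    # Hypothetical sample: two kernel runs separated by 150 us of idle time.
    sample = json.dumps([
        {"name": "kernel", "ph": "X", "ts": 0, "dur": 100},
        {"name": "kernel", "ph": "X", "ts": 250, "dur": 100},
        {"name": "process_name", "ph": "M"},
    ])
    for idle_start, idle_end in find_gaps(load_complete_events(sample)):
        print(f"idle from {idle_start} us to {idle_end} us")
```

Raising min_gap_us filters out small, unavoidable gaps so that only the larger ones, which are more likely to come from avoidable dependencies, are reported.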
Double buffering allows host-data processing and host transfers to the device-side buffer to occur in parallel with the kernel execution on the FPGA device. This parallelization is useful when the host performs any combination of the following actions between consecutive kernel runs:
Writes to the device buffer
Running host and device actions in parallel removes the execution gaps between kernels, because the kernels no longer have to wait for the host to finish its operations. You can clearly see the benefits of double buffering in the visualizations generated from the Intercept Layer output.
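The effect on total run time can be estimated with a simple model. The sketch below is an idealized illustration, not a measurement: it assumes every iteration has the same host preparation time and the same kernel time, that exactly two buffers are used, and that there is no other overhead. The function names are hypothetical.

```python
def serial_time(iterations, host_time, kernel_time):
    """Total time when each kernel run waits for the host step before it."""
    return iterations * (host_time + kernel_time)


def double_buffered_time(iterations, host_time, kernel_time):
    """Total time when the host prepares the next buffer during kernel execution.

    Only the first host step and the last kernel run are not overlapped;
    each step in between costs the slower of the two activities.
    """
    if iterations <= 0:
        return 0
    return host_time + (iterations - 1) * max(host_time, kernel_time) + kernel_time


if __name__ == "__main__":
    # Hypothetical numbers: 4 iterations, 2 ms of host work, 3 ms per kernel run.
    n, host_ms, kernel_ms = 4, 2, 3
    print("serial:", serial_time(n, host_ms, kernel_ms), "ms")
    print("double-buffered:", double_buffered_time(n, host_ms, kernel_ms), "ms")
```

With these sample numbers the model gives 20 ms for the serial case and 14 ms with double buffering. In the second case the kernel runs back to back after the first host step, which is the gap-free pattern to look for in the timeline visualization.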