Developer Guide

Just-In-Time Compilation in DPC++

The Intel® oneAPI DPC++ Compiler converts a DPC++ program into an intermediate language called SPIR-V and stores it in the binary produced by the compilation process. The advantage of producing this intermediate code instead of device-specific machine code is that the program can run on any supported hardware platform: the SPIR-V code is translated into the assembly code of the platform at runtime. This translation of the intermediate code present in the binary is called just-in-time (JIT) compilation. JIT compilation can happen on demand at runtime, and there are multiple ways to control it. By default, all the SPIR-V code present in the binary is translated upfront at the beginning of the execution of the first offloaded kernel.
#include <CL/sycl.hpp>
#include <array>
#include <chrono>
#include <iostream>

// Array type and data size for this example.
constexpr size_t array_size = (1 << 16);
typedef std::array<int, array_size> IntArray;

void VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
                IntArray &sum) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto e = q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::noinit);

    h.parallel_for(num_items,
                   [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();
}

void VectorAdd2(sycl::queue &q, const IntArray &a, const IntArray &b,
                IntArray &sum) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto e = q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::noinit);

    h.parallel_for(num_items,
                   [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();
}

void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++)
    a[i] = i;
}

int main() {
  sycl::default_selector d_selector;

  IntArray a, b, sum;

  InitializeArray(a);
  InitializeArray(b);

  sycl::queue q(d_selector, sycl::property::queue::enable_profiling{});

  std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  auto start = std::chrono::steady_clock::now();
  VectorAdd1(q, a, b, sum);
  auto end = std::chrono::steady_clock::now();
  std::cout << "Initial Vector add1 successfully completed on device - took "
            << (end - start).count() << " nano-secs\n";

  start = std::chrono::steady_clock::now();
  VectorAdd1(q, a, b, sum);
  end = std::chrono::steady_clock::now();
  std::cout << "Second Vector add1 successfully completed on device - took "
            << (end - start).count() << " nano-secs\n";

  start = std::chrono::steady_clock::now();
  VectorAdd2(q, a, b, sum);
  end = std::chrono::steady_clock::now();
  std::cout << "Initial Vector add2 successfully completed on device - took "
            << (end - start).count() << " nano-secs\n";

  start = std::chrono::steady_clock::now();
  VectorAdd2(q, a, b, sum);
  end = std::chrono::steady_clock::now();
  std::cout << "Second Vector add2 successfully completed on device - took "
            << (end - start).count() << " nano-secs\n";
  return 0;
}
When the program above is compiled using the command below (assuming that the name of the source file is example.cpp):
dpcpp -O3 -o example example.cpp
and run, the output will show that the first call to VectorAdd1 takes much longer than the other kernel calls in the program. The difference is the cost of JIT compilation, which is invoked when VectorAdd1 is executed for the first time.
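The first-call versus repeat-call comparison described above can be factored into a small timing helper. The following is a minimal sketch in standard C++ only; the names TimeNanoseconds, FirstVsRepeat, and MeasureFirstAndRepeat are hypothetical and are not part of SYCL or DPC++. It also converts the elapsed time to nanoseconds explicitly, rather than assuming the tick period of std::chrono::steady_clock.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of SYCL or DPC++): runs `work` once and
// returns the elapsed wall-clock time in nanoseconds.
template <typename Work>
std::int64_t TimeNanoseconds(Work &&work) {
  const auto start = std::chrono::steady_clock::now();
  work();
  const auto end = std::chrono::steady_clock::now();
  // duration_cast makes the unit explicit instead of relying on the
  // clock's native tick period.
  return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
      .count();
}

// Timings of two back-to-back runs of the same piece of work.
struct FirstVsRepeat {
  std::int64_t first_ns;   // first run: includes any one-time cost such as JIT
  std::int64_t repeat_ns;  // second run: the one-time cost is already paid
};

// Runs `work` twice and records both timings.
template <typename Work>
FirstVsRepeat MeasureFirstAndRepeat(Work &&work) {
  FirstVsRepeat result{};
  result.first_ns = TimeNanoseconds(work);
  result.repeat_ns = TimeNanoseconds(work);
  return result;
}
```

In the example program, a call such as MeasureFirstAndRepeat([&] { VectorAdd1(q, a, b, sum); }) would be expected to report a much larger first_ns than repeat_ns when the kernel is JIT compiled, assuming the lambda blocks until the kernel has finished, as VectorAdd1 does through q.wait().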
The overhead of JIT compilation at runtime can be avoided by ahead-of-time (AOT) compilation, which is enabled by the appropriate switches on the compile line. With AOT compilation, the binary contains the actual assembly code of the platform selected during compilation instead of the SPIR-V intermediate code. The advantage is that the code does not need to be JIT compiled from SPIR-V to assembly during execution, which removes that cost from the first kernel launch. The disadvantage is that the code can no longer run on any platform other than the one for which it was compiled.
The example above can be compiled for a Gen9 GPU with AOT code generation using the following command:
dpcpp -O3 -o example example.cpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xsycl-target-backend=spir64_gen-unknown-unknown-sycldevice "-device skl"
When this compiled program is run, the output shows that all the kernel calls take about the same amount of time, unlike before, where the first kernel call took much longer because of JIT compilation.
If the application contains multiple kernels, one can force eager JIT compilation or lazy JIT compilation using compile-time switches. Eager JIT compilation invokes the JIT compiler on all the kernels in the binary at the beginning of execution, while lazy JIT compilation invokes the JIT compiler on a kernel only when that kernel is actually called during execution. When certain kernels are never called, lazy JIT compilation has the advantage of not translating code that is never executed. The JIT compilation mode can be selected during compilation using the following option:
-fsycl-device-code-split=<value>
where <value> is one of:
  • per_kernel: generates code to do JIT compilation of a kernel only when it is called
  • per_source: generates code to do JIT compilation of all kernels in the source file when any of the kernels in the source file are called
  • off: the default, which does eager JIT compilation of all kernels in the application
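The effect of the three values can be illustrated with a toy model that tracks which kernel calls would trigger JIT compilation. This is only a sketch of the behavior described in the list above, written in standard C++; SplitMode, Kernel, ModuleOf, and CallsThatPayJit are hypothetical names, and nothing here reflects how the DPC++ runtime is actually implemented.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model only: these types are illustrative and are not part of SYCL
// or DPC++.
enum class SplitMode { kOff, kPerSource, kPerKernel };

struct Kernel {
  std::string name;         // kernel name, e.g. "VectorAdd1"
  std::string source_file;  // file that defines it, e.g. "vec1.cpp"
};

// Returns an identifier for the group of kernels that is JIT compiled
// together with `k` under the given mode.
std::string ModuleOf(const Kernel &k, SplitMode mode) {
  switch (mode) {
    case SplitMode::kPerKernel:
      return "kernel:" + k.name;  // every kernel is compiled on its own
    case SplitMode::kPerSource:
      return "source:" + k.source_file;  // one group per source file
    case SplitMode::kOff:
      break;
  }
  return "application";  // a single group for the whole application
}

// For each call in `calls`, in order, reports whether that call is the
// first one to touch its group and therefore pays the JIT compilation cost.
std::vector<bool> CallsThatPayJit(const std::vector<Kernel> &calls,
                                  SplitMode mode) {
  std::set<std::string> compiled;  // groups that have already been compiled
  std::vector<bool> pays;
  pays.reserve(calls.size());
  for (const Kernel &k : calls) {
    // insert().second is true only when the group was not compiled before.
    pays.push_back(compiled.insert(ModuleOf(k, mode)).second);
  }
  return pays;
}
```

In this model, with VectorAdd1 and VectorAdd2 defined in different source files and each called twice in a row, kOff marks only the very first call as paying for JIT compilation, while kPerSource and kPerKernel mark the first call of each kernel. The model captures only when the cost is paid, not how large it is.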
If the above program, split across the source files vec1.cpp, vec2.cpp, and main.cpp, is compiled with the per_kernel setting:
dpcpp -O3 -o example vec1.cpp vec2.cpp main.cpp -fsycl-device-code-split=per_kernel
and run, then the timings of the kernel executions show that the first invocations of VectorAdd1 and VectorAdd2 take longer, while the second invocations take less time because they do not pay the cost of JIT compilation.
In the example above, we can put VectorAdd1 and VectorAdd2 in separate files and compile them with and without the per_source option to see the impact on the execution times of the kernels. When compiled with
dpcpp -O3 -o example vec1.cpp vec2.cpp main.cpp -fsycl-device-code-split=per_source
and run, the execution times of the kernels will show that the JIT compilation cost for each source file is paid at the first invocation of a kernel from that file, while subsequent invocations of kernels from the same file do not pay it. But when the program is compiled with
dpcpp -O3 -o example vec1.cpp vec2.cpp main.cpp
and run, the execution times of the kernels will show that the JIT compilation cost for the whole application is paid upfront at the first kernel invocation, and all subsequent kernel invocations do not pay the cost of JIT compilation.
