Developer Guide

Using the timers

The standard C++ chrono library can be used in DPC++ to track time with varying degrees of precision. The following example shows how to use the chrono timer classes to time a kernel execution from the host side.
#include <CL/sycl.hpp>
#include <array>
#include <chrono>
#include <iostream>

using namespace sycl;

// Array type and data size for this example.
constexpr size_t array_size = (1 << 16);
typedef std::array<int, array_size> IntArray;

// Adds a and b on the device and returns the elapsed host-side time in microseconds.
double VectorAdd(queue &q, const IntArray &a, const IntArray &b, IntArray &sum) {
  range<1> num_items{a.size()};

  buffer a_buf(a);
  buffer b_buf(b);
  buffer sum_buf(sum.data(), num_items);

  // Start the host-side timer just before the kernel is submitted.
  auto t1 = std::chrono::steady_clock::now();

  q.submit([&](handler &h) {
    // Input accessors
    auto a_acc = a_buf.get_access<access::mode::read>(h);
    auto b_acc = b_buf.get_access<access::mode::read>(h);
    // Output accessor
    auto sum_acc = sum_buf.get_access<access::mode::write>(h);

    h.parallel_for(num_items, [=](id<1> i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();

  // Stop the timer once all submitted work has completed.
  auto t2 = std::chrono::steady_clock::now();

  return std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
}

void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++)
    a[i] = i;
}

int main() {
  default_selector d_selector;

  IntArray a, b, sum;

  InitializeArray(a);
  InitializeArray(b);

  queue q(d_selector);

  std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  double t = VectorAdd(q, a, b, sum);

  std::cout << "Vector add successfully completed on device - took " << t
            << " u-secs\n";

  return 0;
}
Note that this timing is measured purely from the host side. The kernel may start executing on the device well after the host submits it. DPC++ also provides a profiling capability that reports how long a kernel took to execute on the device, as the following example shows.
#include <CL/sycl.hpp>
#include <array>
#include <iostream>

using namespace sycl;

// Array type and data size for this example.
constexpr size_t array_size = (1 << 16);
typedef std::array<int, array_size> IntArray;

// Adds a and b on the device and returns the kernel execution time in nanoseconds.
double VectorAdd(queue &q, const IntArray &a, const IntArray &b, IntArray &sum) {
  range<1> num_items{a.size()};

  buffer a_buf(a);
  buffer b_buf(b);
  buffer sum_buf(sum.data(), num_items);

  event e = q.submit([&](handler &h) {
    // Input accessors
    auto a_acc = a_buf.get_access<access::mode::read>(h);
    auto b_acc = b_buf.get_access<access::mode::read>(h);
    // Output accessor
    auto sum_acc = sum_buf.get_access<access::mode::write>(h);

    h.parallel_for(num_items, [=](id<1> i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();

  // Difference between the device timestamps recorded for the kernel's end and start.
  return (e.get_profiling_info<info::event_profiling::command_end>() -
          e.get_profiling_info<info::event_profiling::command_start>());
}

void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++)
    a[i] = i;
}

int main() {
  default_selector d_selector;

  IntArray a, b, sum;

  InitializeArray(a);
  InitializeArray(b);

  // Profiling must be enabled on the queue for the event to carry timestamps.
  queue q(d_selector, property::queue::enable_profiling{});

  std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  double t = VectorAdd(q, a, b, sum);

  std::cout << "Vector add successfully completed on device - took " << t
            << " nano-secs\n";

  return 0;
}
When these examples are run, the time reported by the chrono timer may well be much larger than the time reported by DPC++ profiling. This is because DPC++ profiling measures only the kernel execution on the device and does not include any data transfer time between the host and the offload device.
