Developer Guide

Using Standard Library Functions in DPC++ Kernels

Some, but not all, standard C++ functions can be called inside DPC++ kernels. See Chapter 18 (Libraries) of Data Parallel C++ for an overview of supported functions. A simple example is provided here to illustrate what happens when an unsupported function is called from a DPC++ kernel. The following program generates a sequence of random numbers using the rand() function:
// Compile:
// dpcpp -D{HOST|CPU|GPU} -std=c++17 -fsycl external_rand.cpp -o external_rand

#include <CL/sycl.hpp>
#include <cstdlib> // rand, srand, RAND_MAX
#include <ctime>   // time
#include <iostream>
#include <random>

#define N 5

extern SYCL_EXTERNAL int rand(void);

int main(int argc, char **argv) {
  // Select the device at compile time with -DHOST, -DCPU, or -DGPU
#if defined HOST
  sycl::queue Q(sycl::host_selector{});
#elif defined CPU
  sycl::queue Q(sycl::cpu_selector{});
#elif defined GPU
  sycl::queue Q(sycl::gpu_selector{});
#else
#error "Define one of HOST, CPU, or GPU when compiling"
#endif

  std::cout << "Running on: "
            << Q.get_device().get_info<sycl::info::device::name>()
            << std::endl;

  // Attempt to use rand() inside a DPC++ kernel
  auto test1 = sycl::malloc_shared<float>(N, Q.get_device(), Q.get_context());

  srand((unsigned)time(NULL));
  Q.parallel_for(N, [=](auto idx) {
     test1[idx] = (float)rand() / (float)RAND_MAX;
   }).wait();

  // Show the random number sequence
  for (int i = 0; i < N; i++)
    std::cout << test1[i] << std::endl;

  // Cleanup
  sycl::free(test1, Q.get_context());
}
The program can be compiled to execute the DPC++ kernel on the host (i.e., the SYCL host_selector), the CPU (i.e., cpu_selector), or the GPU (i.e., gpu_selector). It compiles without errors for all three targets and runs correctly on the host and CPU devices, but fails when run on the GPU:
$ dpcpp -DHOST -std=c++17 -fsycl external_rand.cpp -o external_rand
$ ./external_rand
Running on: Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz
0.572586
0.691008
0.451763
0.793325
0.000884

$ dpcpp -DCPU -std=c++17 -fsycl external_rand.cpp -o external_rand
$ ./external_rand
Running on: Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz
0.141417
0.821271
0.898045
0.218854
0.304283

$ dpcpp -DGPU -std=c++17 -fsycl external_rand.cpp -o external_rand
$ ./external_rand
Running on: Intel(R) Graphics Gen9 [0x3e96]
terminate called after throwing an instance of 'cl::sycl::compile_program_error'
what(): The program was built for 1 devices
Build program log for 'Intel(R) Graphics Gen9 [0x3e96]':
error: undefined reference to `rand()'
error: backend compiler failed build.
-11 (CL_BUILD_PROGRAM_FAILURE)
Aborted
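The failure occurs during just-in-time (JIT) compilation because of an undefined reference to rand(). Declaring the function SYCL_EXTERNAL only tells the compiler that a device definition will be supplied elsewhere. No device implementation of rand() exists for the GPU, so the reference cannot be resolved when the kernel is built for that device.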
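One way to see why a library generator can work where rand() cannot: a generator written as plain integer arithmetic, with its definition visible in the same translation unit, leaves no external symbol for the device compiler to resolve. The following is a minimal host-only sketch in standard C++, not the oneDPL implementation; lcg_next and matches_std_minstd are hypothetical names introduced for illustration. It implements the "minimal standard" linear congruential recurrence that std::minstd_rand also uses and compares the two. A function of this shape should be callable from a kernel, since it involves only integer arithmetic, though this sketch runs on the host only.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper (not part of SYCL, oneDPL, or oneMKL): one step of the
// "minimal standard" linear congruential generator,
//   x_{n+1} = (48271 * x_n) mod (2^31 - 1).
// The state must lie in [1, 2^31 - 2]; a state of 0 would stay 0 forever.
inline std::uint32_t lcg_next(std::uint32_t state) {
  assert(state != 0u && state < 2147483647u);
  // Widen to 64 bits so the product cannot overflow before the modulo.
  return static_cast<std::uint32_t>(
      (static_cast<std::uint64_t>(state) * 48271u) % 2147483647u);
}

// Compare n steps of the hand-written recurrence against std::minstd_rand,
// which implements the same recurrence in the C++ standard library.
// The seed is assumed to lie in [1, 2^31 - 2].
inline bool matches_std_minstd(std::uint32_t seed, int n) {
  std::minstd_rand reference(seed);
  std::uint32_t state = seed;
  for (int i = 0; i < n; ++i) {
    state = lcg_next(state);
    if (state != reference()) return false;
  }
  return true;
}
```

For example, matches_std_minstd(12345u, 1000) compares the first 1000 values of both generators for the seed 12345.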
Fortunately, the oneAPI libraries provide alternatives to many standard C++ functions, including those that generate random numbers. The following example shows equivalent functionality using the Intel® oneAPI DPC++ Library (oneDPL) and the Intel® oneAPI Math Kernel Library (oneMKL):
#include <CL/sycl.hpp>
#include <cstdint> // std::uint32_t
#include <ctime>   // time, clock, CLOCKS_PER_SEC
#include <iostream>
#include <string>  // std::stoi
#include <oneapi/dpl/random>
#include <oneapi/mkl/rng.hpp>

int main(int argc, char **argv) {
  // Number of values to generate (at least 20, so that the first and last
  // ten can be printed)
  unsigned int N = (argc == 1) ? 20 : std::stoi(argv[1]);
  if (N < 20)
    N = 20;

  // Generate sequences of random numbers in [0.0, 1.0) using oneDPL and
  // oneMKL
  sycl::queue Q(sycl::gpu_selector{});
  std::cout << "Running on: "
            << Q.get_device().get_info<sycl::info::device::name>()
            << std::endl;

  auto test1 = sycl::malloc_shared<float>(N, Q.get_device(), Q.get_context());
  auto test2 = sycl::malloc_shared<float>(N, Q.get_device(), Q.get_context());

  std::uint32_t seed = (unsigned)time(NULL); // Get RNG seed value

  // oneDPL random number generator on GPU device
  clock_t start_time = clock(); // Start timer

  Q.parallel_for(N, [=](auto idx) {
     // Initialize RNG engine; each work-item passes its own index
     oneapi::dpl::minstd_rand rng_engine(seed, idx);
     // Set RNG distribution
     oneapi::dpl::uniform_real_distribution<float> rng_distribution;
     // Generate one value of the RNG sequence
     test1[idx] = rng_distribution(rng_engine);
   }).wait();

  clock_t end_time = clock(); // Stop timer
  std::cout << "oneDPL took " << float(end_time - start_time) / CLOCKS_PER_SEC
            << " seconds to generate " << N
            << " uniformly distributed random numbers." << std::endl;

  // oneMKL random number generator on GPU device
  start_time = clock(); // Start timer

  // Initialize RNG engine
  oneapi::mkl::rng::mcg31m1 engine(Q, seed);
  // Set RNG distribution
  oneapi::mkl::rng::uniform<float, oneapi::mkl::rng::uniform_method::standard>
      rng_distribution(0.0, 1.0);
  // Generate RNG sequence
  oneapi::mkl::rng::generate(rng_distribution, engine, N, test2).wait();

  end_time = clock(); // Stop timer
  std::cout << "oneMKL took " << float(end_time - start_time) / CLOCKS_PER_SEC
            << " seconds to generate " << N
            << " uniformly distributed random numbers." << std::endl;

  // Show first ten random numbers from each method
  std::cout << std::endl
            << "oneDPL"
            << "\t"
            << "oneMKL" << std::endl;
  for (unsigned int i = 0; i < 10; i++)
    std::cout << test1[i] << " " << test2[i] << std::endl;

  // Show last ten random numbers from each method
  std::cout << "..." << std::endl;
  for (unsigned int i = N - 10; i < N; i++)
    std::cout << test1[i] << " " << test2[i] << std::endl;

  // Cleanup
  sycl::free(test1, Q.get_context());
  sycl::free(test2, Q.get_context());
}
The necessary oneDPL and oneMKL functions are declared in <oneapi/dpl/random> and <oneapi/mkl/rng.hpp>, respectively. The oneDPL and oneMKL examples perform the same sequence of operations: get a random number seed from the clock, initialize a random number engine, select the desired random number distribution, then generate the random numbers. The oneDPL code performs device offload explicitly using a DPC++ kernel, in which each work-item constructs its own engine from the shared seed and its index. In the oneMKL code, the oneapi::mkl::rng functions handle the device offload implicitly.
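The per-work-item pattern in the oneDPL kernel can be mimicked on the host with the standard library. The sketch below assumes that the second constructor argument in oneapi::dpl::minstd_rand(seed, idx) acts as an offset that skips idx values of the stream, and models that with std::minstd_rand plus discard(). The names draw_for_index, draw_sequential, and check_skip_ahead are hypothetical helpers, not oneDPL functions. Under that assumption, every index draws a distinct element of one shared sequence, which is why the work-items do not all produce the same number even though they share a seed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Host-only stand-in for the kernel body: each index builds its own engine
// from the shared seed, skips idx values, and draws the next one.
// Assumption: this mirrors the effect of oneapi::dpl::minstd_rand(seed, idx).
inline std::uint_fast32_t draw_for_index(std::uint32_t seed, std::size_t idx) {
  std::minstd_rand engine(seed);
  engine.discard(idx); // skip the values that belong to lower indices
  return engine();
}

// Reference: a single engine advanced sequentially n times.
inline std::vector<std::uint_fast32_t> draw_sequential(std::uint32_t seed,
                                                       std::size_t n) {
  std::minstd_rand engine(seed);
  std::vector<std::uint_fast32_t> values(n);
  for (auto &v : values)
    v = engine();
  return values;
}

// Self-check: the independent per-index draws reproduce the sequential stream.
inline void check_skip_ahead(std::uint32_t seed, std::size_t n) {
  const auto reference = draw_sequential(seed, n);
  for (std::size_t i = 0; i < n; ++i)
    assert(draw_for_index(seed, i) == reference[i]);
}
```

Calling check_skip_ahead(2024u, 20) compares twenty independently computed values with the sequential stream for the seed 2024. The sketch works on raw engine output; mapping those integers to floating-point values is left to the distribution object, as in the example above.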
