Learn SYCL* in an Hour (Maybe Less)

In this piece, I will introduce all the key things to know to program in C++ with SYCL*.

I obviously will cover only the bare essentials; therefore, I’m not teaching everything that you would find in a 500-page book on SYCL. If you tell me that surprises you, I will assume you are joking.

We will learn the basics (1-2-3) and have a small working program that we can use to explore more as we wish.

What Is SYCL?

C++ with SYCL offers us the ability to use accelerators from a C++ program, regardless of vendor (e.g., NVIDIA, AMD, Intel, etc.) or architecture (e.g., GPU, CPU, FPGA, DSP, etc.). To do so, we just need a C++ compiler that supports SYCL and a runtime that supports our accelerator (most often a driver, such as an OpenCL* runtime from the vendor). SYCL 2020 (the current standard) allows for support beyond just OpenCL. This freedom has allowed SYCL implementations to find the best path to hardware in numerous ways, including PTX for NVIDIA, ROCm/HIP for AMD, OpenCL for many vendors, SPIR-V* for Intel, OpenMP* for multiple vendors, etc. The Intel® compiler uses some of these paths. A SYCL compiler project (AdaptiveCpp), led by Heidelberg University, is well known for pioneering a number of innovative paths as well. There are many options for getting SYCL support for various accelerators.

SYCL Is C++

SYCL is designed for C++ and will feel very comfortable to C++ programmers. Nevertheless, SYCL can be learned and used with minimal C++ knowledge.

Follow Along with Instructions

To follow along and get the most out of the experience, see the instructions under “Learn SYCL in an Hour (Maybe Less)”. They explain how to access the Intel® Developer Cloud for free, on a system already configured with multiple GPUs and the software we need, and they tell you where to fetch the code examples from GitHub*.

There Are Three Keys to SYCL

SYCL solves three problems for us by providing these capabilities:

  1. Find what accelerator(s) are available at runtime.
  2. Share data with accelerator(s).
  3. Offload computational work to accelerator(s).

I will also mention that SYCL has many additional features that we will want and appreciate, including support for C++ error handling even for offloaded computations, and built-in support for reduction operations. However, these are things we can learn later, as needed, after we solidly understand the three keys.

Finding/Choosing Accelerator(s)

Our goal in finding/choosing an accelerator is to get a connection to it so we can move on to sharing data and offloading code. In SYCL terms, this means getting a queue.

A queue connects us to a specific accelerator. We can create as many queues as we like, and different queues may point to the same accelerator if we like. In the example program, I show off by trying to fill an array of queues with handles to every accelerator on the machine. That might mean it gets only one, or it might get many. If you follow along with my instructions on the Intel Developer Cloud, you will probably get four (at least that is what I get as I write this).

SYCL offers us a lot of control to find and select from the accelerator(s) that are available at runtime. We should start off simple and just say:

sycl::queue q;

That will give us the handle q, which we happily use for all our data sharing and offloading needs. In this simple case, the SYCL runtime will simply pick an accelerator for us (hopefully the best one available at runtime).
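
If we would rather do the picking ourselves, SYCL 2020 has ready-made selectors. The selector names below are standard SYCL 2020, but note that a selector throws an exception if no matching device exists:

// ask specifically for a GPU
sycl::queue gpuQ{sycl::gpu_selector_v};

// or ask for a CPU device
sycl::queue cpuQ{sycl::cpu_selector_v};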

Of course, I would normally get rid of the sycl:: prefixes with a using namespace sycl; declaration early on. However, for these examples, I leave them explicit to help highlight where we are using SYCL while we are learning.

It is important to note that SYCL will always have a device available. This is incredibly useful for writing a simple program that will always work. In a system with no accelerators, the host (a CPU in all the implementations that I have seen) will be used.

If we want to know what device we connected to, we can simply print the name out:

std::cout << "Running on "
          << q.get_device().get_info<sycl::info::device::name>();
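
If we want to go further and look at everything available, the way the example program's array of queues does, a sketch like this works (sycl::device::get_devices is standard SYCL 2020; what it returns depends on the runtimes installed):

// one queue per device found at runtime
std::vector<sycl::queue> queues;
for (const auto& dev : sycl::device::get_devices())
  queues.push_back(sycl::queue{dev});

for (auto& q : queues)
  std::cout << q.get_device().get_info<sycl::info::device::name>() << "\n";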

Sharing Data with Accelerator(s)

Sharing data is easy with SYCL. We can use USM (Unified Shared Memory) with memory allocations that look much like malloc calls, and memory allocated that way is magically shared between the host and accelerators. This requires an accelerator that supports USM in hardware, which is not much of an issue because all modern GPUs, CPUs, and FPGAs can support USM. SYCL also supports explicit buffers, which are likewise magically shared between the host and accelerators, but without allowing regular pointers to work across the host and accelerators. For now, I will just recommend using USM unless you know you want to use buffers.
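
To make “magically shared” concrete, here is a minimal USM sketch (the queue and count here are just placeholders I made up; malloc_shared, the queue parallel_for shortcut, and sycl::free are standard SYCL 2020). The same pointer works on the host and in the kernel:

sycl::queue q;
constexpr size_t count = 100;

// allocate memory that both the host and the device can touch
int* data = sycl::malloc_shared<int>(count, q);

// fill it on the device...
q.parallel_for(count, [=](sycl::id<1> i) { data[i] = i; }).wait();

// ...then read the very same pointer on the host
std::cout << data[count - 1] << "\n";

sycl::free(data, q);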

In the example, the program uses buffers for the job that computes digits of pi.

That code looks like this:

std::array<int, 200> d4;
sycl::buffer outD4(d4);	// this is a sycl buffer

If you want to convert to USM, comment out the two lines shown above for d4 and use this instead:

// this is a sycl USM memory allocation
auto d4 = (int *)sycl::malloc_shared( sizeof(int)*200, myQueue2 );

And get rid of the accessors (outAccessor and myD4) since USM can just use pointers. You can replace the declarations with these macros instead (being lazy and not wanting to edit the actual usages):

#define outAccessor d4
#define myD4 d4
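
One detail the macro trick does not cover: unlike a buffer, a USM allocation is not cleaned up automatically. Once the job is done, free it with the standard SYCL call (myQueue2 being the queue from the snippet above):

sycl::free(d4, myQueue2);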

Offloading Computational Work to Accelerator(s)

We can simply write some code and say, “Run this code on the accelerator.” This is a simple Hello World! that runs on an accelerator.

q.submit([&](sycl::handler& cg) {
           auto os = sycl::stream{128, 128, cg};
           cg.single_task(
                            [=]() { os << "Hello World!\n"; });
         });

The submit says we have work to offload. The single_task is used to specify a single thing to run; in this case, a print of Hello World! The single_task can specify a function to offload, but often we just specify it inline using a C++ lambda function (as I have done in this case).

Since accelerators are used to gain performance through parallelization, we need something a little more involved if we want to do anything for higher performance. This is when we need to know that SYCL emphasizes a style of programming that invokes a kernel in parallel. This is the same programming style you find in CUDA and OpenCL. The idea is simple: We write a kernel that can be simple serial code to operate on one piece of data, and then we invoke it in parallel such that a kernel is invoked on each data element independently.

In the sample code, we actually do a blur in parallel. Understanding that code is not particularly hard, and once we understand it, SYCL really starts to make sense. Here is a simpler look at going parallel: making Hello World! itself run in parallel:

// this is the entire program

#include <sycl/sycl.hpp>
int main(int argc, char* argv[]) {
  sycl::queue q;
  std::cout << "Running on "
            << q.get_device().get_info<sycl::info::device::name>()
            << "\n";
  q.submit([&](sycl::handler& cg) {
             auto os = sycl::stream{1024, 1024, cg};
             cg.parallel_for(10, [=](sycl::id<1> myid)
                 {
                   os << "Hello World! My ID is " << myid << "\n";
                 });

         });
}
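
To build and run this yourself with the Intel® oneAPI DPC++/C++ Compiler (other SYCL compilers have their own invocations), save it as, say, hello.cpp and do:

icpx -fsycl hello.cpp -o hello
./hello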

Running on my laptop, under WSL, it printed this:

Running on Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Hello World! My ID is {5}
Hello World! My ID is {0}
Hello World! My ID is {7}
Hello World! My ID is {2}
Hello World! My ID is {1}
Hello World! My ID is {8}
Hello World! My ID is {6}
Hello World! My ID is {9}
Hello World! My ID is {3}
Hello World! My ID is {4}

The Example Program

Now you know plenty to go off and explore. Keep in mind that a queue connects to an accelerator, and uses of the queue will then be either to set up data sharing or to offload computations. That should be enough to puzzle out the sample program and start changing it to experiment and learn more.

The sample program that I provided online is my own mess. It is intended only to show off the basics with absolutely no regard for writing an effective parallel program. I think all the other learning resources for SYCL in the world try to show well-considered parallel programming examples. I wanted to be different and inspire fun exploring to get started.

The sample program does three different jobs. Each job will run on one of three different accelerators when available; the jobs will all happily run on the host if that is all we have, or on whatever mix of accelerators we find at runtime.

I pulled together this code so that I could show a variety of things that are easy to change and understand. I believe that has high value when starting out. You will even discover that part of the code has #ifdefs to switch between using buffers and USM for sharing in one of the examples, if you wander over to the Exercise_02_... subdirectory.

Two notable extras in the code that I tossed in to encourage more playing while learning valuable techniques:

  • I create an array of queues and load them up with all the accelerators that I can find. Then I take the first, second, or third accelerator for the three different jobs (modulo the actual number of accelerators we found at runtime). Do not be thrown by the seeming complexity of this silly code: It is not doing much more than creating a queue with sycl::queue q;. It does hint at more fun you can have in selecting your favorite accelerators at runtime, and possibly matching them to custom algorithms if you like. There is a lot of fun available, but none of it is necessary to learn SYCL.
  • I set up the queues with profiling selected. (I assume it is available, which it usually is, but this limits where the program can run unless we spend a little more effort and add logic to use profiling only on devices that support it.) The profiling option lets us gather information about the actual run times on the accelerator. This is more valuable than wall-clock time when tuning our kernels, because it reduces the noise caused by other code when all we are doing is tuning the kernel itself. Together, kernel timing and wall-clock timing give us great data while we tune an application. There is a minimal sketch of queue profiling right after this list.
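
Here is that sketch. The property and the event queries below are standard SYCL 2020; the kernel itself is just a placeholder:

sycl::queue q{sycl::property::queue::enable_profiling{}};

// submit a (placeholder) kernel and keep the event it returns
sycl::event e = q.parallel_for(1000, [=](sycl::id<1>) { /* kernel work */ });
e.wait();

// profiling times are reported in nanoseconds
auto start = e.get_profiling_info<sycl::info::event_profiling::command_start>();
auto stop  = e.get_profiling_info<sycl::info::event_profiling::command_end>();
std::cout << "Kernel time: " << (stop - start) << " ns\n";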

Once we are comfortable with SYCL, we can learn much more from the resources listed at sycl.tech, including a book that I co-authored, some examples and tutorials, and much more.

Summary

We learned two key things: (1) SYCL fundamentally addresses three things for using accelerators from C++, and (2) we have a silly little program we can get lost in for hours changing and experimenting with as our first SYCL program.

I believe it is highly valuable to know as many programming models/languages as we can, so we can use the appropriate tool for the job. If our job is writing C++ that uses accelerators, and the kernel style of expressing parallelism makes sense for our algorithm, and we want our application to be highly portable across vendors and/or architectures, then knowing SYCL could well be valuable.

Learning SYCL is not hard at all. Mastering effective parallel programming … Well, that is a whole different matter.😊

Happy coding!