Partner Training for Intel® oneAPI

Use these courses to get up to speed on oneAPI Data Parallel C++ (DPC++) code and how to use oneAPI Toolkits and components to achieve cross-platform, heterogeneous compute.

Get Started

  1. Choose a course from the list in the order that works for you.
  2. Complete the quiz at the end of each course.
  3. Pass each quiz with a minimum score of 80%.
  4. Receive your certification to train and consult on oneAPI.

Title | Requirement
Introducing oneAPI: A Unified, Cross-Architecture Performance Programming Model | Mandatory
Intel® DevCloud Tutorial |
Migrate Your Existing CUDA Code to DPC++ Code | Mandatory
DPC++ Program Structures | Mandatory
DPC++ New Features | Mandatory
Develop in a Heterogeneous Environment with Intel® oneAPI Math Kernel Library | Optional
Intel® oneAPI Threading Building Blocks: Optimizing for NUMA Architectures | Optional
Customize Your Workloads with FPGAs | Optional

Introducing oneAPI: A Unified, Cross-Architecture Performance Programming Model

The drive for compute innovation is as old as computing itself, with each advancement built upon what came before. In 2019 and 2020, a primary focus of next-gen compute innovation has been to enable increasingly complex workloads to run on multiple architectures, including CPUs, GPUs, FPGAs, and AI accelerators.

Historically, writing and deploying code for a CPU and a GPU or other accelerator has required separate code bases, libraries, languages, and tools. oneAPI was created to solve this challenge.

Kent Moffat, software specialist and Intel senior product manager, presents:

  • An overview of oneAPI: what it is, what it includes, and why it was created
  • How this initiative, driven by Intel, simplifies development through a common tool set that enables more code reuse
  • How developers can immediately take advantage of oneAPI in their development, from free toolkits to the Intel® DevCloud environment



Intel® DevCloud Tutorial

Develop, run, and optimize your Intel® oneAPI solution in the Intel® DevCloud—a free development sandbox to learn about and program oneAPI cross-architecture applications. Get full access to the latest Intel CPUs, GPUs, and FPGAs, Intel® oneAPI Toolkits, and the new programming language, Data Parallel C++ (DPC++).

Some of the lessons and training materials use the Intel DevCloud as a platform to host the training and to practice what you've learned.

What is the Intel® DevCloud?


Migrate Your Existing CUDA* Code to DPC++ Code

In this video, Intel senior software engineers, Sunny Gogar and Edward Mascarenhas, show you how to use the Intel DPC++ Compatibility Tool to perform a one-time migration that ports both kernels and API calls. In addition, you will learn the following:

  • An overview of the DPC++ language—its origins and benefits to developers
  • A description of the Intel DPC++ Compatibility Tool and how it works
  • Real-world examples to get you grounded on the migration concept, process, and expectations
  • A hands-on demo using Jupyter* Notebook to show the serial steps involved, including what a complete migration to DPC++ looks like, as well as cases where manual porting is required to port CUDA all the way to DPC++ code

Intel DPC++ Compatibility Tool


Intel® oneAPI Data Parallel C++ Program Structures

This module introduces DPC++ program structure and focuses on important SYCL* classes to write basic DPC++ code to offload to accelerator devices.

  • Explain the SYCL fundamental classes
  • Use device selection to offload kernel workloads
  • Decide when to use basic parallel kernels and NDRange kernels
  • Create a host accessor
  • Build a sample DPC++ application through hands-on lab exercises


Video Transcript
Slide 1:
This video walks through the DPC++ Program Structure module, which is part of the oneAPI Essentials modules.

Slide 2:
Agenda: We will cover the basic SYCL classes, see what device offloading is and look at the different types of device selectors, talk about SYCL buffers, accessors, and command group handlers, walk through the anatomy of a basic DPC++ program, talk about basic parallel kernels and ND-range kernels, and finally talk about synchronization in a DPC++ program.

Slide 3:
Learning objectives: By the end of this lecture, attendees will understand the basic SYCL classes, what device offloading is and how to use device selection to offload work to a device, the usage of basic parallel kernels and ND-range kernels, and synchronization using host accessors and buffer destruction, and will finally build a sample DPC++ application through a hands-on exercise.

Slide 4: 
DPC++: Let me talk about Data Parallel C++. It is a standards-based, cross-architecture language. It’s a new, open language that is based mainly on ISO C++ standards, and it takes advantage of the existing Khronos SYCL standard, which already has data parallelism and heterogeneous computing support. On top of that, we’ve added some community extensions that extend the existing standard to make use of advanced hardware capabilities and to simplify the language.
So, what is Data Parallel C++?
It’s mainly C++ plus SYCL, plus some community extensions. It’s based on modern C++, and it incorporates SYCL, which helps with heterogeneous programming.

Slide 5: 
DPC++ Hello World: Let me introduce a basic DPC++ hello world program. It is single source, which means host and device code are in the same source file, as shown in the code.
We are using familiar C++ plus added library functionality such as:
• A queue that targets work at a device
• malloc_shared, which allocates memory that can be accessed on both host and device
• parallel_for, which submits the task for parallel execution on the device

Slide 6:
We will be talking about some of the basic SYCL classes in the following slides.

Slide 7: 
Let’s talk about the device class. The device class lets you query information about the device that you want to offload work to, such as its name, vendor, compute units, memory size, and more.

Slide 8:
The device_selector class enables runtime selection of the device on which to execute.
The sample code shows standard selectors such as default_selector, where the system selects the best available device for you; you can also target a particular device type, such as a GPU using gpu_selector or a CPU using cpu_selector.

Slide 9:
Let’s talk about the queue class.
A queue is used to submit command groups to be executed on a device. Command groups are submitted using the submit member function.

Slide 10:
Each queue maps to one device, for example gpu_selector{} for targeting a GPU or cpu_selector{} for targeting a CPU. You can also target a vendor-specific device by creating a custom_selector class derived from device_selector.

Slide 11:
The kernel class encapsulates methods and data for executing code on the device.
Kernel invocations are executed in parallel by calling the parallel_for function, and the kernel is executed for each element of the range.
Let’s go to the Jupyter notebooks and run some basic samples to see more details about this.
The same concepts are explained here, but with working code examples. Let’s go to the device selector section and run this example.
We selected gpu_selector here, so let’s run this. We see that gpu_selector selects the device shown.

Slide 12:
The main takeaway from this slide is that application scope and command group scope contain the code that executes on the host, while kernel scope is the code that executes on the device. The full capabilities of C++ are available at application and command group scope. At kernel scope, there are limitations on the accepted C++.

Slide 13:
In the following slides, we introduce the concept of data parallel kernels and discuss the different kinds of parallel kernels in detail. Parallel kernels allow multiple instances of an operation to execute in parallel. They are used to offload a for-loop to a device and execute its iterations in parallel. We call the parallel_for function for this purpose.

Slide 14:
There are two ways to write a basic parallel kernel: using range and id, or using range and item (item gives you more information). The range class describes the iteration space of the parallel execution, and the id class indexes an individual instance of a kernel within that execution.
If you need the range value in your kernel code, you can use the item class instead of the id class and query it for the range, as shown here. The item class represents an individual instance of a kernel function and exposes additional functions to query properties of the execution range.
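
To make the roles of range, id, and item concrete without requiring a DPC++ compiler, here is a minimal host-only sketch in standard C++. The names parallel_for_model and double_all are hypothetical and are not SYCL APIs; the loop runs sequentially, whereas a real device may run the kernel instances in parallel and in any order.

```cpp
#include <cstddef>
#include <vector>

// Host-only stand-in for a 1-D parallel_for (hypothetical, not a SYCL API):
// calls the kernel once per index in [0, range).
template <typename Kernel>
void parallel_for_model(std::size_t range, Kernel kernel) {
    for (std::size_t id = 0; id < range; ++id) {
        // id alone plays the role of the id class;
        // id together with range plays the role of the item class.
        kernel(id, range);
    }
}

// Example: double every element, one kernel instance per element.
std::vector<int> double_all(std::vector<int> data) {
    parallel_for_model(data.size(), [&data](std::size_t id, std::size_t /*range*/) {
        data[id] *= 2;
    });
    return data;
}
```

Calling double_all({1, 2, 3}) visits the indices 0, 1, and 2 once each, which is the same iteration space a one-dimensional range of size 3 describes.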

Slide 15:
Basic parallel kernels are an easy way to parallelize a for-loop, but if you want performance optimization at the hardware level, use an ND-range kernel. ND-range kernels also express parallelism, but the difference is that they enable low-level performance tuning by providing access to local memory and by mapping executions to compute units on the hardware.
As the image shows, the entire iteration space is divided into groups called work-groups, which are made up of work-items, and this grouping allows control of resource usage and load-balanced work distribution.

Slide 16:
ND-range kernels use the nd_range and nd_item classes, where we specify the work-group size to balance load and tune for performance.
An nd_range represents a grouped execution range using two instances of the range class: one denoting the global execution range, and another denoting the local execution range of each work-group.
An nd_item is the ND-range counterpart of the item class we talked about before; it also gives the execution range of the kernel and the item's index within that range.
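
The relationship between the global range, the work-group size, and an individual work-item's indices comes down to integer division. The sketch below models it in one dimension in standard C++; NdIndex, decompose, and work_group_count are hypothetical names rather than SYCL APIs, and the sketch assumes the global size is a multiple of the work-group size.

```cpp
#include <cstddef>

// Hypothetical 1-D stand-in for the indices an nd_item exposes.
struct NdIndex {
    std::size_t global_id;  // position in the whole iteration space
    std::size_t group_id;   // which work-group the work-item belongs to
    std::size_t local_id;   // position inside that work-group
};

// Splits a global index using the work-group (local) size.
// Assumes local_size > 0.
NdIndex decompose(std::size_t global_id, std::size_t local_size) {
    return NdIndex{global_id, global_id / local_size, global_id % local_size};
}

// Number of work-groups for a given global and local size.
// Assumes global_size is a multiple of local_size.
std::size_t work_group_count(std::size_t global_size, std::size_t local_size) {
    return global_size / local_size;
}
```

For example, with a global size of 1024 and a work-group size of 64 there are 16 work-groups, and the work-item with global index 70 is item 6 of work-group 1.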

Slide 17:
Let’s talk about the buffer model.
A buffer is a 1-, 2-, or 3-dimensional array that is shared between the host and devices.
Accessors access buffer data on the host or inside a kernel, and communicate data dependencies between the application and different kernels.
If two kernels use the same buffer, the second kernel needs to wait for the first kernel to complete in order to avoid race conditions.

Slide 18:
Include the header CL/sycl.hpp to write a DPC++ program.
Also use the sycl namespace to make the code more programmer friendly.

Slide 19:
In this slide we will do a code walkthrough of a simple DPC++ application. We follow steps 1 to 6 to understand the building blocks of the DPC++ program.
In the first step, we create the device queue q. A queue constructed with no arguments uses the default selector.
In the second step, we create buffers for a, b, and c.
In the third step, we submit a command group for execution. This part of the code runs asynchronously.
In the fourth step, we create accessors to access the buffer data on the device.
In step 5, we send a kernel that executes on the device. This is a lambda function.
Step 6 is the actual kernel code that is to be executed on the device.
All the kernel invocations are executed in parallel, once for each element of the range, and each has access to the id that we talked about before in the parallel kernel functions.

Slide 20:
A custom device_selector class with a user-defined heuristic can be used to give devices different priorities.
The code shows how a particular vendor’s GPU can be given priority. We will do a simple lab session on this.
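
As a rough illustration of how such a heuristic ranks devices, here is a host-only sketch in standard C++. DeviceInfo, score_device, and select_device are hypothetical names, not SYCL types or functions, and the vendor strings are placeholders. The sketch assumes the usual selector convention that the highest score wins and a negative score rejects a device.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of a device; not a SYCL type.
struct DeviceInfo {
    std::string name;
    std::string vendor;
    bool is_gpu;
};

// User heuristic: higher scores are preferred.
int score_device(const DeviceInfo& dev, const std::string& preferred_vendor) {
    if (dev.is_gpu && dev.vendor == preferred_vendor) return 3;  // preferred vendor's GPU
    if (dev.is_gpu) return 2;                                    // any other GPU
    return 1;                                                    // fall back to a non-GPU device
}

// Returns the index of the highest-scoring device, or -1 if none is acceptable.
// A device with a negative score would never be chosen.
int select_device(const std::vector<DeviceInfo>& devices,
                  const std::string& preferred_vendor) {
    int best_index = -1;
    int best_score = -1;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const int s = score_device(devices[i], preferred_vendor);
        if (s > best_score) {
            best_score = s;
            best_index = static_cast<int>(i);
        }
    }
    return best_index;
}
```

Given a CPU and a GPU from two different vendors, the preferred vendor's GPU is picked when present; otherwise the first GPU found wins, and a CPU is chosen only when no GPU is available.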

Slide 21:
We will now do a hands-on lab for complex multiplication in the Jupyter notebook. We follow the steps outlined in the notebook to create a custom device selector and to see how to pass custom class objects, and the computation on them, to the device in a DPC++ application.

As mentioned, we will follow steps 1, 2, and 3 in the code below.
We go to STEP 1 and uncomment the code below it.
Here we created a custom device selector, and we are setting the rating for the devices.
STEP 2: We uncomment the line below to set the write accessor.
STEP 3: We uncomment the line below to call the complex_mul function that computes the multiplication. Let’s run it and see the results.
Let’s transition back to the slides.

Slide 23:
Let’s talk about asynchronous execution in a SYCL application. A SYCL application has two parts: the host code and the graph of kernel executions. The host code submits work to the device, and the kernel executions and data movements execute asynchronously.

Slide 24:
A fundamental concept in the SYCL execution model is a graph of nodes. Each node in this graph contains an action to be performed on a device, with the most common action being a data parallel device kernel invocation.
The example code shows the host code and the graph of kernel executions. The sample enqueues kernels to the graph and keeps going, while the graph is executed asynchronously to the host program.

Slide 25:
In this slide we see an example where four kernels are submitted. Kernel 2 depends on the completion of kernel 1 and waits until kernel 1 has completed. Kernel 4 depends on the completion of kernels 2 and 3 and waits until both are complete.
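
One way to picture how a runtime can honor such dependencies is as a topological ordering of the task graph. The sketch below is a host-only model in standard C++ using Kahn's algorithm; execution_order is a hypothetical helper, and this is not a description of how a SYCL runtime is actually implemented. Tasks are numbered from 0, so kernels 1 to 4 on the slide become tasks 0 to 3.

```cpp
#include <queue>
#include <utility>
#include <vector>

// Returns one valid execution order for tasks 0..task_count-1, where each pair
// (before, after) means "after" must wait for "before".
std::vector<int> execution_order(int task_count,
                                 const std::vector<std::pair<int, int>>& deps) {
    std::vector<std::vector<int>> successors(task_count);
    std::vector<int> pending(task_count, 0);  // unfinished prerequisites per task
    for (const auto& d : deps) {
        successors[d.first].push_back(d.second);
        ++pending[d.second];
    }

    std::queue<int> ready;  // tasks whose prerequisites are all complete
    for (int t = 0; t < task_count; ++t) {
        if (pending[t] == 0) ready.push(t);
    }

    std::vector<int> order;
    while (!ready.empty()) {
        const int t = ready.front();
        ready.pop();
        order.push_back(t);
        for (int next : successors[t]) {
            if (--pending[next] == 0) ready.push(next);
        }
    }
    return order;  // shorter than task_count if the dependencies contain a cycle
}
```

With the dependencies (0, 1), (1, 3), and (2, 3), tasks 0 and 2 are ready immediately and could overlap on a real device, while task 3 becomes ready only after both 1 and 2 have finished.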

Slide 26:
In the following slides we talk about different types of synchronization.
Host accessor: Creating a host accessor is a blocking call that synchronizes the data back to the host.
Buffer destruction is the other way of synchronizing the data.

Slide 27:
This slide shows example code that creates a host accessor to synchronize the data. Buffer ‘buf’ owns the data stored in vector ‘v’. We create the host accessor ‘b’, which is a blocking call and synchronizes the data back to the host.

Slide 28:
The other way to synchronize data is buffer destruction. The buffer is created in a separate function scope, and once execution advances beyond this scope, the buffer is destroyed automatically and the data is synchronized back to the host.
Let’s run these examples live in the Jupyter notebook.
Let’s go to the host accessor section and run the examples.
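
Synchronization on buffer destruction follows the familiar C++ pattern of doing work in a destructor. The sketch below is a simplified host-only model in standard C++; BufferModel and host_unchanged_while_buffer_alive are hypothetical names, not SYCL APIs, and the private copy merely stands in for data that lives on a device.

```cpp
#include <vector>

// Host-only stand-in for a buffer: works on a private copy and writes it back
// to the original vector when it goes out of scope.
class BufferModel {
public:
    explicit BufferModel(std::vector<int>& host) : host_(host), device_copy_(host) {}
    ~BufferModel() { host_ = device_copy_; }  // "synchronize on destruction"
    BufferModel(const BufferModel&) = delete;
    BufferModel& operator=(const BufferModel&) = delete;

    std::vector<int>& data() { return device_copy_; }

private:
    std::vector<int>& host_;        // the host data the buffer was created from
    std::vector<int> device_copy_;  // stands in for the device-side data
};

// Adds one to every element through the buffer and reports whether the host
// vector was still unchanged while the buffer was alive.
bool host_unchanged_while_buffer_alive(std::vector<int>& v) {
    const std::vector<int> before = v;
    BufferModel buf(v);
    for (int& x : buf.data()) x += 1;  // stands in for the kernel
    return v == before;                // evaluated before buf is destroyed
}
```

The function returns true because the comparison happens while the buffer is still in scope; the updated values appear in v only after the function returns and the buffer has been destroyed.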

Slide 29:
In this module you learned about the basic SYCL classes, device selection and how to use the device selector classes, how to create a custom device selector derived from the device_selector class, the difference between basic parallel kernels and nd_range kernels, the DPC++ program structure (buffers, accessors, and the command group handler, with a code anatomy), and finally how to synchronize data between host and device code using host accessors and buffer destruction.


New Features of Data Parallel C++

This module introduces some of the new extensions added to DPC++ like Unified Shared Memory (USM), in-order queues, and Sub-Groups. This module will be updated when new extensions are added to the public releases.

  • Use new Data Parallel C++ (DPC++) features, such as Unified Shared Memory, to simplify programming
  • Understand implicit and explicit ways of moving memory using USM
  • Solve data dependency between kernel tasks in an optimal way
  • Understand the advantages of using Sub-Groups in Data Parallel C++ (DPC++)
  • Take advantage of Sub-Group collectives in NDRange kernel implementation
  • Use Sub-Group Shuffle operations to avoid explicit memory operations


Video Transcript
● This is Data Parallel C++ New Features Module
● In this module we will learn what’s new in the Data Parallel C++ Language

Slide 2
● In this module, we will cover some of the new features in Data Parallel C++ like Unified Shared Memory and Sub-Groups
● We will also look at some hands-on code samples for USM and handling data dependency when using USM
● We will look at some examples of Sub-Group collectives and shuffle operations

Slide 3
● Learning Objectives for this Module
● Use new DPC++ features like Unified Shared Memory to simplify heterogeneous programming
● Understand advantages of using Sub-Groups in DPC++

Slide 4
● A quick recap of What is Data Parallel C++
● Data Parallel C++ is C++ plus SYCL* standard plus some extensions
● It is based on modern C++
● Standards-based and cross-architecture
● Uses SYCL standard for data parallelism and heterogeneous programming

Slide 5
● DPC++ extends the SYCL standard
● With 2 main goals, to enhance productivity and enhance performance
● Enhance productivity by simplifying programming and reducing verbosity and programmer burden
● Enhance performance by giving programmers control over program execution and enabling hardware-specific features
● DPC++ is an open collaboration feeding into the SYCL standard
● Open source implementation
● The extensions aim to become core SYCL or Khronos* extensions.

Slide 6
● Some of the new features we will look at are Unified Shared Memory and Sub-Groups
● The main goals of DPC++ New Features are to simplify programming and achieve performance by exposing hardware features

Slide 7
● Let's look at Unified Shared Memory, or U-S-M
● Unified Shared Memory is a pointer-based approach to the memory model in DPC++
● Something that is familiar to C++ programmers

Slide 8
● But why use Unified Shared Memory?
● The SYCL standard provides a buffer memory abstraction, which is powerful and elegantly expresses data dependencies
● However
● For programmers new to heterogeneous programming or SYCL, replacing all pointers and arrays with SYCL buffers can be a burden and requires mastery of SYCL buffer concepts
● USM provides a pointer-based alternative in DPC++ that is complementary to buffers
● Simplifies porting to an accelerator
● Gives programmers the desired level of control for data movement between host and device

Slide 9
● Let's look at a developer view of Unified Shared Memory
● On the left is how you would usually look at memory from the GPU and CPU point of view; with a USM allocation, the CPU and GPU can reference the same memory object.

Slide 10
● Let's look at a simple example using Unified Shared Memory
● The code initializes an array on the host, modifies it on the device, and makes the result available on the host
● [Click] You can see that the USM allocation is done using the malloc_shared function
● _shared indicates that the allocation is shared between the host and a device; passing the parameter q tells which device can access this memory allocation
● [Click] The host has access to this data array and can initialize it
● [Click] The same data array can be modified in the device code submitted using parallel_for
● [Click] Once device computation is complete, the same data array can be accessed on the host again to print the result.
● This is a simple USM example; you can see that it makes dealing with memory simple when programming for heterogeneous devices
● We can do a quick comparison with the same example using SYCL buffers

Slide 11
● Here we have the same example, but using the SYCL buffer method
● [Click] Here we allocate memory on host
● [Click] Host initializes the array
● [Click] We have to create a SYCL buffer
● [Click] In the command group submitted to device, we define an accessor to the buffer specifying properties
● [Click] the accessor is modified
● [Click] the buffer is copied back to host on destruction
● [Click] the host can print the output
● As you can see, the code is a lot more complicated and has a lot of new concepts that are not familiar to a C++ programmer
● USM just simplifies programming using familiar pointer concepts for heterogeneous programming

Slide 12
● Here is a quick look at how USM allocation is done
● [Click] Use malloc_shared and pass the q parameter, which has information about the device that will have access to the memory
● You can also use a more familiar C/C++ style malloc call
● [Click] It's like a malloc call, but instead of malloc we use malloc_shared, and we pass an extra parameter, a sycl::queue, which tells which device the allocation is shared with

Slide 13
● Now let's look at the different types of USM
● USM supports both explicit and implicit models for managing memory
● There are 3 types, device allocation, host allocation and shared allocation.
● Device allocation is used when explicit control of data movement is required
● Host and shared allocation move the data implicitly 
● Let's look at an example of each

Slide 14
● The example shows how USM explicit data movement works
● [Click] Here we use malloc_device to allocate memory on device
● The host will not have access to this array
● [Click] We have to use the memcpy method to copy memory from the host to the device allocation
● [Click] We can then do the computation on the device, using the device allocation
● [Click] Once the computation is complete, we have to explicitly copy the memory back to the host using memcpy again
● This method allows you to control when you want data moved from host to device and device to host.
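
The four explicit steps (allocate, copy in, compute, copy out) can be modeled on the host alone. The sketch below is standard C++ with no SYCL dependency; double_on_device_model is a hypothetical name, and the separately allocated array only stands in for memory that a real device allocation would place on the device.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only model of explicit data movement: the "device" array is a separate
// allocation that the rest of the host code never touches directly.
std::vector<int> double_on_device_model(const std::vector<int>& host_in) {
    const std::size_t n = host_in.size();
    std::vector<int> host_out(n);
    if (n == 0) return host_out;

    int* device_data = new int[n];                               // stands in for malloc_device
    std::memcpy(device_data, host_in.data(), n * sizeof(int));   // host -> device copy
    for (std::size_t i = 0; i < n; ++i) device_data[i] *= 2;     // stands in for the kernel
    std::memcpy(host_out.data(), device_data, n * sizeof(int));  // device -> host copy
    delete[] device_data;                                        // stands in for freeing the allocation
    return host_out;
}
```

Nothing reaches host_out until the second copy runs, which mirrors the point of the slide: with a device allocation, the programmer decides exactly when data moves in each direction.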

Slide 15
● Here is an example of USM implicit data movement.
● [click] Notice that we use malloc_shared and we don't do any memcpy
● [click] The data array is directly accessed on the device code.
● [click] Memory is implicitly copied to device and back to host after computation is complete.

Slide 16
● When to use USM?
● How do you decide whether to use USM or SYCL buffers?
● SYCL Buffers are powerful and elegant
● If you are familiar with SYCL buffer programming model, use it
● USM provides a familiar pointer based alternative
● Useful when porting C++ code to DPC++ by minimizing changes
● Use shared allocation to get functionality quickly, but note that shared allocation is not intended to provide peak performance out of the box
● Use explicit USM allocation for more controlled data movement.

Hands-on exercise Implicit and Explicit USM

Slide 17
● Next, let's look at how to handle data dependencies between different kernel tasks.
● If two or more tasks are submitted to the device using the same USM allocation, we have to make sure that the kernel functions do not overwrite the same memory allocation
● There are various ways to handle data dependencies between kernel functions, such as waiting on events, using the depends_on method, and others
● Let's look at some examples

Slide 18
● [click] The example below has 3 tasks submitted to the device,
● [click] all 3 tasks modify the same USM allocation.
● [click] after every task, wait() is used to ensure data dependency is maintained
● Note that using wait() will block execution on host
● Using wait() works, but there are better ways to handle dependency

Slide 19
● Another way to specify data dependency between tasks is to use the depends_on method, which lets the command group handler know that the specified event must be complete before the task can execute.
● The example here is the same as before
● [click] all 3 tasks modify the same USM allocation.
● [click] The event from the first task is passed to the second task using the depends_on method, which delays execution of the second task until the first task is complete
● [click] the same is done to specify the dependency between tasks 2 and 3.

Slide 20
● Another new way to specify data dependency for tasks is using the new in_order queue property
● [click] Again we have the same example as before with 3 tasks
● [click] we specify the in_order() queue property
● This makes sure that all 3 tasks are executed sequentially
● Note that execution will not overlap even if the tasks have no data dependency

Slide 21
● Let's look at another case where there are 2 USM allocations: the first 2 tasks use 2 different data allocations, and the third task depends on the first 2
● [click] there are 2 USM allocations - data1 and data2
● [click] task 1 modifies data1, task2 modifies data2, these 2 executions can overlap
● [click] the third task specifies dependency using depends_on method using a list of events

Slide 22
● There is also a simplified way of specifying dependency as a parameter of parallel_for
● [click] same example as before
● [click] 2 USM allocations
● [click] dependency events are specified as a parameter of parallel_for
● This makes the code much simpler

Hands-on for data dependency
● Show 4 examples
● That is USM

Slide 24
● Next new feature is Sub-Groups
● Sub-groups enable programming to lower-level hardware
● On many modern hardware platforms, a subset of work-items within a work-group are executed simultaneously or with additional scheduling guarantees.
● These subsets of work-items are called sub-groups
● Leveraging sub-groups will help to map execution to lower-level hardware and may help in achieving higher performance.

Slide 25
● ND-range kernel execution helps group executions so that they map to hardware resources.
● All work-items are grouped into work-groups, and the work-items in each work-group are grouped into sub-groups, as shown in the pictures
● Grouping executions helps to optimize computation on the hardware and tune the application for performance.

Slide 26
● Here are the definitions
● Work-item represents the individual instance of a kernel function
● Work-group - the entire iteration space is divided into smaller groups called work-groups; work-items within a work-group are scheduled on a single compute unit on the hardware
● Sub-group - a subset of work-items within a work-group that are executed simultaneously and may be mapped to vector hardware.

Slide 27
● Here is an example of how work-groups and sub-groups are mapped to graphics hardware, in this case Intel Gen11 graphics hardware
● Work-group is mapped to a compute unit
● Sub-group is mapped to vector hardware

Slide 29
● Why are sub-groups important?
● Work-items in a sub-group can communicate directly using shuffle operations, without explicit memory operation
● Work-items in a sub-group can synchronize using sub-group barriers and guarantee memory consistency using memory fences
● Work-items in a sub-group have access to sub-group collectives, providing fast optimized implementations of commonly used functions

Slide 30
● Let's look at the sub_group class
● The sub_group handle can be obtained from the nd_item using the get_sub_group() method
● Once you have the sub_group handle, you can query for more information about the sub-group and apply other operations.
● In the code we have an nd_range kernel defined, and we use the nd_item to get the sub-group handle

Slide 31
● The code shows how to query for sub-group information which is useful to optimize computation
● You can query for sub-group size, number of sub-groups within the work-group, index of sub-group within the work-group
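
The three quantities mentioned above are related by simple integer arithmetic. The sketch below models them in standard C++; SubGroupInfo and describe are hypothetical names, not SYCL APIs, and the sketch assumes work-items are assigned to sub-groups in contiguous blocks, which is a simplification because the real mapping is decided by the implementation and hardware.

```cpp
#include <cstddef>

// Hypothetical stand-in for the values a sub_group object can report.
struct SubGroupInfo {
    std::size_t sub_group_id;     // index of the sub-group within the work-group
    std::size_t local_id;         // index of the work-item within its sub-group
    std::size_t sub_group_count;  // number of sub-groups in the work-group
};

// Assumes sub_group_size > 0 and contiguous assignment of work-items to sub-groups.
SubGroupInfo describe(std::size_t work_item_local_id, std::size_t work_group_size,
                      std::size_t sub_group_size) {
    SubGroupInfo info;
    info.sub_group_id = work_item_local_id / sub_group_size;
    info.local_id = work_item_local_id % sub_group_size;
    // Round up so that a partially filled last sub-group is still counted.
    info.sub_group_count = (work_group_size + sub_group_size - 1) / sub_group_size;
    return info;
}
```

Under these assumptions, work-item 37 in a work-group of 64 with a sub-group size of 16 is item 5 of sub-group 2, and the work-group holds 4 sub-groups.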

Slide 32
● One of the most useful features of sub-groups is the ability to communicate directly with individual work-items without explicit memory operations
● This will enable removing work-group local memory usage and avoid unnecessary repeated access to global memory
● The code shows how elements can be swapped using the shuffle_xor method on the sub-group.
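
The data movement behind an XOR shuffle is easy to model on the host: each lane receives the value held by the lane whose index differs from its own by the XOR mask. The sketch below is standard C++; shuffle_xor_model is a hypothetical name, not the SYCL function, and it assumes the number of lanes is a power of two with the mask smaller than that number, so that every partner lane exists.

```cpp
#include <cstddef>
#include <vector>

// Host-only model: lane i receives the value held by lane (i XOR mask).
// Assumes values.size() is a power of two and mask < values.size().
std::vector<int> shuffle_xor_model(const std::vector<int>& values, std::size_t mask) {
    std::vector<int> result(values.size());
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        result[lane] = values[lane ^ mask];
    }
    return result;
}
```

With a mask of 1, adjacent lanes swap values, so {10, 11, 12, 13} becomes {11, 10, 13, 12}; with a mask of 2, lanes two apart swap, giving {12, 13, 10, 11}. No intermediate array in shared memory is needed to express the exchange, which is the property the slide highlights.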

Slide 33
● Another important feature of sub-groups is the ability to make use of highly optimized implementations of common functions
● These implementations help increase developer productivity and make it possible to generate highly optimized code for target devices.
● The code shows how library functions can be used on sub-groups.
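
As an illustration of what a sub-group collective computes, the sketch below models a sum across each sub-group in standard C++. sub_group_sum_model is a hypothetical name, not a SYCL function; the sketch assumes contiguous assignment of work-items to sub-groups and says nothing about how a device implements the operation efficiently.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-only model of a sub-group sum: every work-item ends up with the sum of
// the values held by the work-items in its own sub-group.
// Assumes sub_group_size > 0.
std::vector<int> sub_group_sum_model(const std::vector<int>& values,
                                     std::size_t sub_group_size) {
    std::vector<int> result(values.size());
    for (std::size_t start = 0; start < values.size(); start += sub_group_size) {
        const std::size_t end = std::min(start + sub_group_size, values.size());
        const int sum = std::accumulate(values.begin() + start, values.begin() + end, 0);
        for (std::size_t i = start; i < end; ++i) result[i] = sum;  // same result in every lane
    }
    return result;
}
```

For six work-items holding 1 to 6 and a sub-group size of 2, the result is {3, 3, 7, 7, 11, 11}: each pair of lanes sees its own pair's total.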

● Show sub-groups in notebooks
● And that's how sub-groups work

Slide 35
● Summary
● DPC++ is based on modern C++ and SYCL standards
● Extends SYCL with new features
● New features are developed through community project
● Available on github
● Feel free to open an issue or submit a pull request

Slide 36
● That's the end of this module on Data Parallel C++ New Features.


Develop in a Heterogeneous Environment with Intel® oneAPI Math Kernel Library

Peter Caday, math algorithm engineer at Intel, discusses how oneMKL enables developers to program with GPUs beyond the traditional CPU-only support.

Topics include:

  • An overview of how to improve your math library experience by developing once for GPUs and CPUs
  • How industry-leading oneMKL enables developers to program with GPUs beyond the traditional CPU-only support
  • A walk-through of a GPU-specific example of oneMKL API call from the Data Parallel C++ (DPC++) language to demonstrate the new, streamlined development process for linear algebra, random number generators, and more

Developing in a Heterogeneous Environment with Intel® oneAPI Math Kernel Library


Intel® oneAPI Threading Building Blocks: Optimizing for NUMA Architectures

Threading Building Blocks (TBB) is a high-level C++ template library for parallel programming that was originally developed as a composable, scalable solution for multicore platforms. Separately, in the realm of high-performance computing, multisocket Non-Uniform Memory Access (NUMA) systems are typically used with OpenMP*.

Increasingly, many independent software components require parallelism within a single application, especially in AI and video processing and rendering domains. In such environments, performance may degrade without allowing for composability with other components.

The result is that many developers have pulled TBB into NUMA environments—a complex task for even the most seasoned programmers.

Intel is working to simplify the approach. This training:

  • Explores the basic features of NUMA systems
  • Explains the causes of performance degradation on systems with several NUMA nodes
  • Explains how to eliminate exceptions that appear on NUMA systems using TBB interfaces



Customize Your Workloads with FPGAs

This course teaches you how to configure FPGAs into custom solutions to speed up key workloads using Intel oneAPI Toolkits. At the end of this course, you will be able to:

  • Write DPC++ code to target an FPGA
  • Understand the flow to target DPC++ code to an FPGA
  • Understand how your code is compiled into an FPGA design incorporating a Custom Compute Pipeline
  • Understand and be able to write your kernel scope code as a task
  • Examine an FPGA optimization report and analyze many performance bottlenecks
  • List several techniques to optimize your kernel scope code

For this course, please contact your Intel® representative to schedule instructor-led training.

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804