Partner Training Program for oneAPI

Use these courses to get up to speed on oneAPI Data Parallel C++ (DPC++) code and how to use oneAPI toolkits and components to achieve cross-platform, heterogeneous compute.

A Unified, Standards-Based Programming Model

Take advantage of a software model that is flexible, familiar, and portable.

Essentials Training

Title | Requirement
Introducing oneAPI: A Unified, Cross-Architecture Performance Programming Model | Mandatory
Intel® DevCloud Tutorial | Mandatory
Migrate Your Existing CUDA* Code to DPC++ Code | Mandatory
DPC++ Program Structures | Mandatory
DPC++ New Features | Mandatory
Develop in a Heterogeneous Environment with Intel® oneAPI Math Kernel Library | Optional
Intel® oneAPI Threading Building Blocks: Optimize for NUMA Architectures | Optional
Customize Your Workloads with FPGAs | Optional

Introducing oneAPI: A Unified, Cross-Architecture Performance Programming Model

The drive for compute innovation is as old as computing itself, with each advancement built upon what came before. In 2019 and 2020, a primary focus of next-gen compute innovation was enabling increasingly complex workloads to run on multiple architectures, including CPUs, GPUs, FPGAs, and AI accelerators.

Historically, writing and deploying code for a CPU and a GPU or other accelerator has required separate code bases, libraries, languages, and tools. oneAPI was created to solve this challenge.

Kent Moffat, software specialist and Intel senior product manager, presents:

  • An overview of oneAPI—what it is, what it includes, and why it was created
  • How this initiative, driven by Intel, simplifies development through a common tool set that enables more code reuse
  • How developers can immediately take advantage of oneAPI in their development, from free toolkits to the Intel® DevCloud environment

Intel® DevCloud Tutorial

Develop, run, and optimize your Intel® oneAPI solution in the Intel® DevCloud—a free development sandbox to learn about and program oneAPI cross-architecture applications. Get full access to the latest Intel CPUs, GPUs, and FPGAs, Intel® oneAPI Toolkits, and the new programming language, DPC++.

Some of the lessons and training materials use the Intel DevCloud as a platform to host the training and to practice what you've learned.

Migrate Your Existing CUDA* Code to DPC++ Code

In this video, Intel senior software engineers Sunny Gogar and Edward Mascarenhas show you how to use the Intel DPC++ Compatibility Tool to perform a one-time migration that ports both kernels and API calls. In addition, you will learn the following:

  • An overview of the DPC++ language—its origins and benefits to developers
  • A description of the Intel DPC++ Compatibility Tool and how it works
  • Real-world examples to get you grounded on the migration concept, process, and expectations
  • A hands-on demo using Jupyter* Notebook that walks through the sequence of steps involved, including what a complete migration to DPC++ looks like, as well as cases where manual porting is required to bring CUDA code all the way to DPC++

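To make the migration concrete, here is a hedged before-and-after sketch: the CUDA kernel appears in a comment, followed by a hand-written DPC++ equivalent of the kind of code the Compatibility Tool emits. The actual generated code uses nd_range and explicit index arithmetic, and the names here (scale, n) are illustrative only.

```cpp
#include <sycl/sycl.hpp>

// Original CUDA (for reference):
//   __global__ void scale(float* x, float a, int n) {
//     int i = blockIdx.x * blockDim.x + threadIdx.x;
//     if (i < n) x[i] *= a;
//   }
//   scale<<<blocks, threads>>>(d_x, 2.0f, n);

// Simplified DPC++ equivalent; x is assumed to be a USM device or
// shared allocation created with sycl::malloc_device/malloc_shared.
void scale(sycl::queue& q, float* x, float a, int n) {
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    x[i] *= a;  // CUDA's thread-index arithmetic becomes an implicit work-item id
  }).wait();
}
```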

Intel® oneAPI Data Parallel C++ Program Structures

This module introduces DPC++ program structure and focuses on the important SYCL* classes used to write basic DPC++ code that offloads work to accelerator devices.

  • Explain the SYCL fundamental classes
  • Use device selection to offload kernel workloads
  • Decide when to use basic parallel kernels and NDRange kernels
  • Create a host Accessor
  • Build a sample DPC++ application through hands-on lab exercises (a minimal sketch follows this list)
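
For a concrete look at the classes this module covers, here is a minimal sketch, assuming a recent oneAPI compiler with SYCL 2020 headers: a queue selects a device, buffers and accessors manage data for a basic parallel kernel, and a host accessor reads the result back.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  // The queue targets the default device (CPU, GPU, or other accelerator);
  // pass a specific selector such as sycl::gpu_selector_v to choose explicitly.
  sycl::queue q;

  constexpr size_t N = 1024;
  std::vector<int> a(N, 1), b(N, 2), c(N, 0);

  {
    // Buffers manage host<->device data movement.
    sycl::buffer<int> bufA(a.data(), sycl::range<1>(N));
    sycl::buffer<int> bufB(b.data(), sycl::range<1>(N));
    sycl::buffer<int> bufC(c.data(), sycl::range<1>(N));

    q.submit([&](sycl::handler& h) {
      // Accessors declare how the kernel uses each buffer.
      sycl::accessor A(bufA, h, sycl::read_only);
      sycl::accessor B(bufB, h, sycl::read_only);
      sycl::accessor C(bufC, h, sycl::write_only);

      // Basic parallel kernel: one work-item per element.
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        C[i] = A[i] + B[i];
      });
    });

    // A host accessor waits for the kernel and exposes results on the host.
    sycl::host_accessor result(bufC, sycl::read_only);
    std::cout << "c[0] = " << result[0] << "\n";  // prints 3
  }
}
```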

New Features of Data Parallel C++

This module introduces some of the newer extensions added to DPC++, such as Unified Shared Memory (USM), in-order queues, and Sub-Groups. The module will be updated as new extensions are added to public releases.

  • Use new DPC++ features, such as Unified Shared Memory, to simplify programming
  • Understand implicit and explicit ways of moving memory using USM
  • Solve data dependency between kernel tasks in an optimal way
  • Understand the advantages of using Sub-Groups in DPC++
  • Take advantage of Sub-Group collectives in NDRange kernel implementation
  • Use Sub-Group Shuffle operations to avoid explicit memory operations
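
A hedged sketch of two of these extensions, assuming a SYCL 2020-conformant oneAPI compiler: a USM shared allocation with an in-order queue (no buffers, accessors, or explicit dependencies), followed by an NDRange kernel that uses a Sub-Group collective.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // In-order queue: kernels run in submission order, so the USM pointer
  // can be read on the host after a simple wait().
  sycl::queue q{sycl::property::queue::in_order()};

  constexpr size_t N = 256;
  // USM shared allocation: one pointer valid on both host and device.
  int* data = sycl::malloc_shared<int>(N, q);
  for (size_t i = 0; i < N; ++i) data[i] = 1;

  q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
    data[i] *= 2;  // implicit data movement; no buffers or accessors needed
  });

  // Sub-groups: work-items in an NDRange kernel combine values
  // without explicit memory traffic.
  q.parallel_for(
      sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{64}},
      [=](sycl::nd_item<1> item) {
        auto sg = item.get_sub_group();
        // Collective: sum the values held by this sub-group's work-items.
        int sum = sycl::reduce_over_group(sg, data[item.get_global_id(0)],
                                          sycl::plus<>());
        if (sg.leader()) data[item.get_global_id(0)] = sum;
      });
  q.wait();

  std::cout << "data[0] = " << data[0] << "\n";
  sycl::free(data, q);
}
```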

Develop in a Heterogeneous Environment with Intel® oneAPI Math Kernel Library

Peter Caday, math algorithm engineer at Intel, presents the GPU support in the Intel® oneAPI Math Kernel Library (oneMKL), which extends the library beyond its traditional CPU-only scope.

Topics include:

  • An overview of how to improve your math library experience by developing once for GPUs and CPUs
  • How industry-leading oneMKL enables developers to program with GPUs beyond the traditional CPU-only support
  • A walk-through of a GPU-specific oneMKL API call made from DPC++ to demonstrate the new, streamlined development process for linear algebra, random number generators, and more
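
For a flavor of that walk-through, here is a hedged sketch of a oneMKL matrix multiply (GEMM) invoked from DPC++; the buffer-based API shown follows recent oneMKL releases, the sizes are arbitrary, and the program must be linked against oneMKL's SYCL interfaces.

```cpp
#include <sycl/sycl.hpp>
#include "oneapi/mkl.hpp"  // oneMKL DPC++ interfaces
#include <vector>

int main() {
  sycl::queue q;  // default device: CPU or GPU, no code change required

  const std::int64_t m = 4, n = 4, k = 4;
  std::vector<float> A(m * k, 1.0f), B(k * n, 1.0f), C(m * n, 0.0f);

  sycl::buffer<float> a_buf(A.data(), sycl::range<1>(A.size()));
  sycl::buffer<float> b_buf(B.data(), sycl::range<1>(B.size()));
  sycl::buffer<float> c_buf(C.data(), sycl::range<1>(C.size()));

  // C = alpha * A * B + beta * C, dispatched to whatever device q targets.
  oneapi::mkl::blas::column_major::gemm(
      q,
      oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
      m, n, k,
      1.0f, a_buf, m,
      b_buf, k,
      0.0f, c_buf, m);

  sycl::host_accessor result(c_buf, sycl::read_only);
  // Each element of C is now 4.0f (a row of ones times a column of ones).
}
```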

Intel® oneAPI Threading Building Blocks: Optimizing for NUMA Architectures

Threading Building Blocks (TBB) is a high-level C++ template library for parallel programming that was originally developed as a composable, scalable solution for multicore platforms. Separately, in the realm of high-performance computing, multisocket Non-Uniform Memory Access (NUMA) systems are typically used with OpenMP*.

Increasingly, many independent software components require parallelism within a single application, especially in the AI, video processing, and rendering domains. In such environments, performance may degrade when components cannot compose their parallelism with one another.

The result is that many developers have pulled TBB into NUMA environments—a complex task for even the most seasoned programmers.

Intel is working to simplify the approach. This training:

  • Explores the basic features of NUMA systems
  • Explains the causes of performance degradation on systems with several NUMA nodes
  • Explains how to use TBB interfaces to eliminate the exceptions that appear on NUMA systems
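
As a taste of the interfaces involved: oneTBB (2021 and later) exposes the NUMA topology through tbb::info and lets you pin work to a node with a constrained task_arena. A hedged sketch follows, assuming the hwloc-backed TBBBind library is available at run time.

```cpp
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>
#include <cstddef>
#include <vector>

int main() {
  // Query the NUMA topology (needs the hwloc-backed TBBBind library).
  std::vector<oneapi::tbb::numa_node_id> nodes = oneapi::tbb::info::numa_nodes();

  std::vector<oneapi::tbb::task_arena> arenas;
  std::vector<oneapi::tbb::task_group> groups(nodes.size());
  arenas.reserve(nodes.size());

  // One arena per NUMA node, its worker threads pinned to that node.
  for (auto id : nodes)
    arenas.emplace_back(oneapi::tbb::task_arena::constraints(id));

  // Submit work so each piece runs, and allocates memory, on its own node.
  for (std::size_t i = 0; i < arenas.size(); ++i)
    arenas[i].execute([&groups, i] {
      groups[i].run([] {
        oneapi::tbb::parallel_for(0, 1000, [](int) { /* node-local work */ });
      });
    });

  // Wait for every node's work to finish.
  for (std::size_t i = 0; i < arenas.size(); ++i)
    arenas[i].execute([&groups, i] { groups[i].wait(); });
}
```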

Intel® AI Analytics Toolkit

Title | Requirement
Introduction to the Intel® AI Analytics Toolkit | Mandatory
Intel® Optimized AI Frameworks | Mandatory
Introduction to Intel® oneAPI Deep Neural Network Library (oneDNN) | Optional
Introduction to Intel® oneAPI Collective Communications Library (oneCCL) | Optional

Introduction to the Intel® AI Analytics Toolkit

This course details the toolkit components and shows how to install them to get started. Components include Intel® Optimization for TensorFlow*, PyTorch* Optimized for Intel® Technology, XGBoost, and scikit-learn*.

  • Identify the personas targeted
  • Learn the basics of machine learning packages optimized by Intel
  • Learn about available code samples and how to run them
  • Learn how the toolkit improves performance using the latest Intel hardware

Intel® Optimized AI Frameworks

Learn about the accelerations for TensorFlow and PyTorch, and how to use them on Intel® Xeon® Scalable processors.

  • Review the Intel AI software portfolio
  • Understand what oneDNN is and how it's used to accelerate DL frameworks
  • Identify the environment variables to tune at execution time
  • Get an overview of Intel® Deep Learning Boost (Intel® DL Boost) and data type considerations
  • Learn about low-precision inference
  • Understand Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Vector Neural Network Instructions (VNNI), Intel DL Boost, INT8, and BFloat16

Introduction to Intel® oneAPI Deep Neural Network Library (oneDNN)

This session introduces how to use various oneDNN binary releases inside oneAPI toolkits, and how to port a oneDNN example from pure CPU support to both CPU and GPU support by using DPC++.

  • Learn how to compile a oneDNN sample with different releases via batch jobs on the Intel® DevCloud for oneAPI
  • Learn how to program oneDNN with a simple sample
  • Learn how to collect Intel® VTune™ Profiler data for CPU and GPU runs and compare performance results
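
For orientation, here is a hedged, minimal oneDNN sketch using the v3.x-style C++ API (releases bundled with earlier toolkits differ slightly). Switching the engine kind from cpu to gpu is the heart of the CPU-to-GPU port the session walks through.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
#include <vector>

int main() {
  // Engine kind is the key line for a GPU port:
  // dnnl::engine::kind::cpu -> dnnl::engine::kind::gpu.
  dnnl::engine eng(dnnl::engine::kind::cpu, 0);
  dnnl::stream s(eng);

  // A 1x32 float tensor holding negative values.
  dnnl::memory::desc md({1, 32}, dnnl::memory::data_type::f32,
                        dnnl::memory::format_tag::nc);
  std::vector<float> data(32, -1.0f);
  dnnl::memory mem(md, eng, data.data());

  // ReLU applied in place through the eltwise primitive.
  auto pd = dnnl::eltwise_forward::primitive_desc(
      eng, dnnl::prop_kind::forward_inference,
      dnnl::algorithm::eltwise_relu, md, md, /*alpha=*/0.0f);
  dnnl::eltwise_forward(pd).execute(
      s, {{DNNL_ARG_SRC, mem}, {DNNL_ARG_DST, mem}});
  s.wait();  // data now holds zeros
}
```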

Introduction to Intel® oneAPI Collective Communications Library (oneCCL)

This session introduces how to use different oneCCL binary releases inside oneAPI toolkits, and how to port a oneCCL example from pure CPU support to both CPU and GPU support. Learn how to gather performance data on oneCCL by using the Intel® VTune™ Profiler.

  • Learn how to program and compile a oneCCL sample with different releases via batch jobs on the Intel® DevCloud for oneAPI
  • Learn how to port a oneCCL sample from a CPU-only version to CPU and GPU version by using DPC++
  • Learn how to collect Intel® VTune™ Profiler data for CPU and GPU runs
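
A heavily hedged, single-process sketch of the oneCCL C++ API; real multi-rank jobs broadcast the main KVS address to all processes (commonly via MPI) before creating the communicator, and the sizes here are arbitrary.

```cpp
#include <oneapi/ccl.hpp>
#include <vector>

int main() {
  ccl::init();

  // Single-process sketch: rank 0 of a one-rank communicator.
  auto kvs = ccl::create_main_kvs();
  auto comm = ccl::create_communicator(1, 0, kvs);

  std::vector<float> send(128, 1.0f), recv(128, 0.0f);

  // Sum across all ranks; with one rank, recv simply mirrors send.
  ccl::allreduce(send.data(), recv.data(), send.size(),
                 ccl::reduction::sum, comm).wait();
}
```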

Intel® Distribution of OpenVINO™ Toolkit

Title | Requirement
Introduction to the Intel Distribution of OpenVINO Toolkit | Mandatory
Get Started with the Intel Distribution of OpenVINO Toolkit on Intel® DevCloud | Mandatory
Introduction to the Post-Training Optimization Tool | Mandatory
Write Once and Deploy Inference across Intel® Architectures | Mandatory
Introduction to the Deep Learning Workbench | Mandatory

Introduction to the Intel Distribution of OpenVINO Toolkit

This toolkit was designed specifically to help developers deploy AI-powered solutions across the heterogeneous landscape—combinations of CPU, GPU, VPU, FPGA—with write once, deploy anywhere flexibility.

In this course, you will:

  • Understand cross-architecture deployment of your applications and solutions with little to no code rewriting
  • Understand the Model Optimizer and Inference Engine Workflow
  • Learn the various plugins and heterogeneous execution modes available within the Inference Engine
  • Learn performance optimization for improved throughput using the Intel Distribution of OpenVINO™ toolkit
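
A hedged sketch of the Model Optimizer to Inference Engine flow in the 2020-era C++ API (newer OpenVINO releases replace this with ov::Core); the model paths and device string are placeholders.

```cpp
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core ie;

    // Load an IR model produced by the Model Optimizer
    // ("model.xml"/"model.bin" are placeholder paths).
    InferenceEngine::CNNNetwork network =
        ie.ReadNetwork("model.xml", "model.bin");

    // The device string selects the plugin: "CPU", "GPU", "MYRIAD",
    // or a heterogeneous fallback list such as "HETERO:GPU,CPU".
    InferenceEngine::ExecutableNetwork exec =
        ie.LoadNetwork(network, "CPU");

    // Create a request and run synchronous inference.
    InferenceEngine::InferRequest request = exec.CreateInferRequest();
    request.Infer();
}
```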

Get Started with the Intel Distribution of OpenVINO Toolkit on Intel® DevCloud

The Intel® DevCloud for the Edge comes preinstalled with the Intel Distribution of OpenVINO toolkit to help developers run inference on a range of compute devices.

In this course, you will:

  • Understand the workflow of Model Optimizer and Inference Engine on a classification sample application
  • Learn how to submit a job request on your chosen hardware
  • Deploy a classification sample on Intel DevCloud for the Edge

Introduction to the Post-Training Optimization Tool

Explore this tool that is included with the Intel Distribution of OpenVINO toolkit. Get an overview of the tool, its features, and the techniques used for model optimization. Learn how the Post-Training Optimization Tool is applied as a sample model is quantized from 32-bit to 8-bit precision in a Jupyter* Notebook.

  • Understand quantization techniques
  • Understand optimization flow with the Intel Distribution of OpenVINO toolkit
  • Understand the Post-Training Optimization Tool and its features
  • Experience how to quantize a model with the tool
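
As background for that quantization step, here is a toy C++ sketch of the scale and zero-point arithmetic that maps FP32 values to INT8. It is not the Post-Training Optimization Tool's implementation, which works per layer using calibration data, but it shows the core idea.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

int main() {
  std::vector<float> weights = {-1.5f, -0.2f, 0.0f, 0.7f, 2.3f};

  // Choose a scale and zero-point so [min, max] spans the INT8 range.
  auto [lo, hi] = std::minmax_element(weights.begin(), weights.end());
  float scale = (*hi - *lo) / 255.0f;
  int zero_point = -128 - static_cast<int>(std::round(*lo / scale));

  std::vector<std::int8_t> quantized;
  for (float w : weights) {
    int q = static_cast<int>(std::round(w / scale)) + zero_point;
    quantized.push_back(static_cast<std::int8_t>(std::clamp(q, -128, 127)));
  }
  // Dequantize with: w ~ scale * (q - zero_point)
}
```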

Write Once and Deploy Inference across Intel® Architectures

In this webinar, technical consulting engineer Munara Tolubaeva showcases the Intel Distribution of OpenVINO toolkit and its core role in AI application and solution development.

  • How to use the toolkit to develop and deploy AI deep learning applications across Intel® architecture—CPUs, CPUs with Intel® Processor Graphics, Intel® Movidius™ VPUs, and FPGAs
  • Cross-architecture deployment of your applications and solutions with little to no code rewriting
  • Innovative improvements from the hardware and software stacks

Introduction to the Deep Learning Workbench

This course demonstrates how different command line components of the toolkit can be used in a web GUI- based Deep Learning Workbench environment.

  • Understand OpenVINO and its functionalities at a high level
  • Understand Deep Learning Workbench capabilities
  • Understand how to launch and start using the Deep Learning Workbench environment

High-Performance Computing

Title | Requirement
Improve MPI Application Performance with the Intel® oneAPI HPC Toolkit (Beta) | Mandatory
OpenMP* GPU Offload Basics | Mandatory
Offload Your Code from CPU to GPU and Optimize It | Mandatory
Profile DPC++ and GPU Workload Performance | Mandatory
Find and Debug Threading and Memory Errors at the Source | Mandatory
Introduction to Intel® Cluster Checker | Optional

Improve MPI Application Performance with the Intel® oneAPI HPC Toolkit (Beta)

This course includes:

  • How to use Intel® MPI Library with an existing MPI program
  • How to collect MPI performance data using Intel® Trace Analyzer and Collector
  • Definitions of the analysis charts provided by Intel® Trace Analyzer and Collector for understanding MPI application behavior
  • How to use the Message Checking library in Intel® Trace Analyzer and Collector
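
For context, here is a minimal MPI program of the kind the course instruments; with Intel MPI Library, compile with the mpiicpc wrapper and add the -trace option to mpirun to collect data for Intel® Trace Analyzer and Collector (a hedged sketch; exact flags vary by release).

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank contributes its rank number; the sum lands on every rank.
  int local = rank, total = 0;
  MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  std::printf("rank %d of %d: total = %d\n", rank, size, total);
  MPI_Finalize();
}
```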

OpenMP* GPU Offload Basics

Learn how to accelerate your C, C++, or Fortran code on GPUs with OpenMP using either the Intel® Fortran Compiler or the Intel® C++ Compiler, which are part of the Intel oneAPI HPC Toolkit (Beta).

  • Use the target directive to offload execution to the GPU
  • Efficiently manage data communication between the GPU device and the host CPU
  • Increase the performance of the application by using offload-specific constructs that leverage the types of parallelism the GPU offers
  • Practice the OpenMP offload concepts with hands-on exercises using either Fortran or C++
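
For the C++ track, a hedged sketch of the core pragmas; with the Intel compilers, offload builds typically use flags along the lines of -fiopenmp -fopenmp-targets=spir64, though exact options vary by release.

```cpp
#include <cstdio>

int main() {
  constexpr int N = 1024;
  float a[N], b[N];
  for (int i = 0; i < N; ++i) { a[i] = static_cast<float>(i); b[i] = 0.0f; }

  // "target" offloads the loop to the device; "map" controls data movement;
  // "teams distribute parallel for" exposes the GPU's hierarchical parallelism.
  #pragma omp target teams distribute parallel for map(to: a[0:N]) map(from: b[0:N])
  for (int i = 0; i < N; ++i)
    b[i] = 2.0f * a[i];

  std::printf("b[1] = %f\n", b[1]);  // prints 2.000000
}
```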

Offload Your Code from CPU to GPU and Optimize It

Locating and removing bottlenecks is an inherent challenge for every application developer. And it’s made more complex when porting an app to a new platform, such as from a CPU to a GPU. Developers must also identify which parts of the code benefit from offloading.

In this training, software optimization expert Kevin O’Leary discusses how Intel® Advisor (Beta) helps developers remove these new CPU-to-GPU porting obstacles.

The course covers:

  • Offload Advisor—a command-line feature of the beta product that projects performance speed-up on accelerators and estimates offload overhead
  • GPU Roofline Analysis—a technical preview that identifies bottlenecks in GPU-ported code and shows how close its performance is to system maximums
  • A walk-through of a matrix multiplication example to learn how the above features can help optimize application efficiency for GPUs

Profile DPC++ and GPU Workload Performance

In this webinar, technical consulting engineer Vladimir Tsymbal demonstrates how to analyze and optimize offload performance using Intel® VTune™ Profiler (Beta), a performance analysis tool that takes the guesswork out of cross-architecture performance improvements.

Using a sample application written in DPC++, Vladimir demonstrates how Intel VTune Profiler (Beta) can be used to:

  • Profile DPC++ code running on both host and GPU processors
  • Collect the right data and turn it into rich, interpretable analysis
  • Identify the hot spots in your compute kernels, including which hot spots are key areas for optimization
  • Show how the GPU resources are being used and locate hardware bottlenecks

Find and Debug Threading and Memory Errors at the Source

Join Intel technical consulting engineer Mayank Tiwari to learn about Intel® Inspector, a dynamic memory and threading error debugger for C, C++, and Fortran.

In this session, learn how to:

  • Locate and debug non-deterministic threading errors such as data races, deadlocks, and lock hierarchy violations
  • Detect memory errors such as leaks, corruption, and invalid accesses
  • Diagnose errors faster using the tool’s powerful analysis and debug features
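
To make the threading-error class concrete, here is a small, deliberately broken C++ program; Intel Inspector's threading analysis flags the unsynchronized increment as a data race at its source line.

```cpp
#include <iostream>
#include <thread>

int counter = 0;  // shared state with no synchronization

void work() {
  for (int i = 0; i < 100000; ++i)
    ++counter;  // data race: two threads write concurrently
}

int main() {
  std::thread t1(work), t2(work);
  t1.join();
  t2.join();
  // The final value is nondeterministic; a mutex or std::atomic<int>
  // would fix the race that Intel Inspector reports here.
  std::cout << counter << "\n";
}
```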

Intel® Cluster Checker

Quickly validate the configuration of an HPC system using Intel Cluster Checker, which looks for and identifies potential uniformity concerns at the software and hardware levels. It can also compare performance across the cluster to identify slower systems and offer potential remedies.

This course covers:

  • How to use Intel Cluster Checker to identify and resolve system issues
  • Best practices for when and how to use Intel Cluster Checker