OpenCL™ Tutorials

These tutorials work with the supplied sample code to demonstrate important features in this release. They can be found in the Intel Software Documentation Library repository.

  • OpenCL™ JumpStart Tutorial: This tutorial shows how to use the OpenCL Wizard in Microsoft Visual Studio* to create an image processing application that performs Sobel edge detection on a given image, starting from an OpenCL project template.
  • Runtime Generated FFT for Intel® Processor Graphics: Highlights techniques to optimize FFT. 
  • Optimizing Simple OpenCL Kernels: Modulate Kernel Optimization: Robert Ioffe describes a consistent series of optimizations that improve OpenCL kernel performance on Intel® Iris™ Graphics and Intel® Iris™ Pro Graphics using Intel® SDK for OpenCL™ Applications 2013. The optimizations are general in nature, and developers can apply them to a broad set of OpenCL™ kernels. After studying them, developers will understand the fundamentals of tuning Intel® Iris™ Graphics for compute purposes. We start with a simple Modulate kernel.
  • Optimizing Simple OpenCL Kernels: Sobel Kernel Optimization: The second part of the series applies the same general optimization techniques to a Sobel kernel, again targeting Intel® Iris™ Graphics and Intel® Iris™ Pro Graphics with Intel® SDK for OpenCL™ Applications 2013.
  • OpenCL™ - Introduction for HPC Programmers: OpenCL™ is an important new standard for heterogeneous computing. With OpenCL, a software developer can write a single program that runs on everything from cell phones to nodes in a supercomputer. To reach its full potential, however, OpenCL needs to deliver more than portability: it needs to deliver "performance portability". In this presentation, we discuss the "performance portability" of OpenCL programs.
  • Video Motion Estimation Using OpenCL™ Technology: The Video Motion Estimation (VME) tutorial provides step-by-step guidelines on using Intel’s motion estimation extension for the OpenCL™ standard. The motion estimation extension includes a set of host-callable functions for frame-based Video Motion Estimation.
  • Advanced Video Motion Estimation Tutorial: The Advanced Video Motion Estimation (VME) tutorial provides step-by-step guidelines on using Intel’s advanced motion estimation extension for the OpenCL™ standard. The advanced motion estimation extension includes a set of host-callable functions for frame-based Video Motion Estimation.
  • Using OpenCL™ 2.0 sRGB Image Format: The OpenCL™ 2.0 standard provides built-in sRGB image format support. The feature handles conversion from sRGB to RGB values, which both shortens development time and improves kernel performance: you no longer need to implement the conversion algorithm in your kernel.
  • Using OpenCL™ 2.0 Work-group Functions: OpenCL 2.0 introduces several useful new built-ins called “work-group functions”. These built-ins provide popular parallel primitives that operate at the work-group level. This article is a short introduction to work-group functions and their usage, backed with performance data gathered on an Intel® HD Graphics OpenCL device.
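    As an illustrative sketch (not taken from the tutorial), a reduction that previously required a hand-written local-memory tree can be expressed with a single work-group built-in in OpenCL C 2.0:

```c
// OpenCL C 2.0 kernel sketch: sum all input values within each work-group
// using the work_group_reduce_add built-in (compile with -cl-std=CL2.0).
__kernel void block_sums(__global const int *in, __global int *out)
{
    int sum = work_group_reduce_add(in[get_global_id(0)]);
    if (get_local_id(0) == 0)
        out[get_group_id(0)] = sum;   // one partial sum per work-group
}
```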
  • Using OpenCL™ 2.0 Atomics: The goal of this tutorial is to provide a short introduction to the new OpenCL™ 2.0 atomics functionality and to discuss some caveats in their usage and their applicability to various GPU programming tasks.
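    A minimal sketch (our own example, not from the tutorial) of the C11-style atomics added in OpenCL C 2.0:

```c
// OpenCL C 2.0 kernel sketch: count elements above a threshold with a
// C11-style atomic. Relaxed memory order suffices here because only the
// final total matters, not the ordering of individual increments.
__kernel void count_above(__global const float *data, float threshold,
                          __global atomic_int *counter)
{
    if (data[get_global_id(0)] > threshold)
        atomic_fetch_add_explicit(counter, 1,
                                  memory_order_relaxed,
                                  memory_scope_device);
}
```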
  • OpenCL 2.0 Shared Virtual Memory Code Sample: This sample demonstrates the fundamentals of using Shared Virtual Memory (SVM) capabilities in OpenCL™ applications. The SVM Basic code sample uses the OpenCL 2.0 APIs to query SVM support and manage SVM allocations for the selected OpenCL 2.0 device. The sample code implements an algorithm to demonstrate pointer sharing between host and device with OpenCL SVM features. Advanced topics like the use of atomics within SVM allocations and associated performance considerations are outside the scope of this tutorial.
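    The host-side flow can be sketched as follows (a fragment under stated assumptions, not the sample's actual code: `context`, `queue`, and `kernel` are assumed to be created elsewhere, on a device with coarse-grained SVM support):

```c
/* Host-side SVM sketch: host and device dereference the same pointer. */
float *buf = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE,
                                 1024 * sizeof(float), 0);
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, buf,
                1024 * sizeof(float), 0, NULL, NULL);
buf[0] = 42.0f;                           /* host writes through the pointer */
clEnqueueSVMUnmap(queue, buf, 0, NULL, NULL);
clSetKernelArgSVMPointer(kernel, 0, buf); /* same pointer used on the device */
```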
  • Device Self-enqueue in OpenCL 2.0: Recursive Algorithm Example. Sierpiński Carpet: Device kernels can enqueue kernels to the same device with no host interaction. This enables flexible work-scheduling paradigms, avoids transferring execution control and data between the device and the host, and often relieves host-processor bottlenecks.
  • Device Self-enqueue and Work-Group Scan Functions in OpenCL 2.0: Iterative Algorithm Example. GPU-Quicksort: This tutorial shows how to use two powerful features of OpenCL™ 2.0: the enqueue_kernel functions, which allow you to enqueue kernels from the device, and work_group_scan_exclusive_add and work_group_scan_inclusive_add, two of a new set of work-group functions added to OpenCL 2.0 to facilitate scan and reduce operations across the work-items of a work-group. This tutorial demonstrates these features on our very own GPU-Quicksort implementation in OpenCL, which, to our knowledge, is the first implementation of that algorithm in OpenCL. The tutorial shows an important design pattern of enqueueing kernels with an NDRange of size 1 to perform housekeeping and scheduling operations previously reserved for the CPU.
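    The device-side enqueue pattern described above can be sketched like this (an illustrative example, not the GPU-Quicksort code):

```c
// OpenCL C 2.0 sketch: a parent kernel launches a single-work-item child
// kernel on the same device, with no host round-trip
// (compile with -cl-std=CL2.0; requires a default device queue).
__kernel void parent(__global int *data)
{
    if (get_global_id(0) == 0) {
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D(1),        // NDRange of size 1: a
                       ^{ data[0] += 1; });  // "scheduler"-style child kernel
    }
}
```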
  • Using Image2D From Buffer Extension: This sample demonstrates how to connect a buffer-based kernel and an image-based kernel into a pipeline using the cl_khr_image2d_from_buffer extension. The feature is supported as an extension in OpenCL™ 1.2 and became core functionality in OpenCL 2.0, so any 2.0 device must support it. It enables creating OpenCL image objects on top of OpenCL buffer objects without extra copying, providing dual API access to the same piece of memory. Once an image is created, you can use image features such as interpolation and border checking in one kernel, while continuing to access the same physical memory as a regular OpenCL buffer in another kernel.
  • OpenCL™ 2.0 Non-Uniform Work-Groups: OpenCL 2.0 introduces “non-uniform work-groups”, which allow an OpenCL 2.0 runtime to divide an NDRange in a way that produces non-uniform work-group sizes in any dimension.
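    In practice (our own sketch, not from the tutorial) this means the global size no longer has to be a multiple of the local size; the runtime creates a smaller trailing work-group, so the kernel needs no manual bounds check for the ragged edge:

```c
// OpenCL C 2.0 sketch (compile with -cl-std=CL2.0): inside the smaller
// trailing work-group, get_local_size() reports the actual (reduced) size,
// while get_enqueued_local_size() still reports the requested size.
__kernel void scale(__global float *data, float factor)
{
    data[get_global_id(0)] *= factor;  // global id never exceeds global size
}
```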
  • The Generic Address Space in OpenCL™ 2.0: One of the new features of OpenCL 2.0 is the generic address space. Prior to OpenCL 2.0, the programmer had to specify the address space of the data a pointer points to when the pointer was declared or passed as an argument to a function. In OpenCL 2.0 the pointer itself remains in the private address space, but the pointed-to data now defaults to the generic address space, meaning the pointer can point to any of the named address spaces within it. This feature requires you to set a flag to turn it on, so OpenCL C 1.2 programs continue to compile with no changes.
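    A brief sketch of what this buys you (our own example, not from the tutorial): one helper function can now accept both __global and __local data, where OpenCL C 1.2 required a separate overload per address space:

```c
// OpenCL C 2.0 sketch (compile with -cl-std=CL2.0): 'p' is an
// unqualified, i.e. generic, pointer, so sum3 works on any address space.
static float sum3(const float *p)
{
    return p[0] + p[1] + p[2];
}

__kernel void demo(__global float *g, __local float *l)
{
    size_t lid = get_local_id(0);
    l[lid] = g[get_global_id(0)];      // stage data into local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0)
        g[get_group_id(0)] = sum3(g) + sum3(l);  // same helper, both spaces
}
```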
  • SPIR in OpenCL 2.0: Using SPIR for fun and profit with Intel® OpenCL™ Code Builder: In this short tutorial, we give a brief introduction to Khronos SPIR. We touch on the differences between the SPIR binary format and Intel’s proprietary intermediate binary, and then demonstrate a couple of ways to create SPIR binaries using the SDK tools.
  • Getting Started with OpenCL™ on Android* OS: The OpenCL™ Basic Tutorial for Android* OS provides guidelines on using OpenCL in Android applications. The tutorial is an interactive image processing Android application.

    The main focus of the tutorial is to show how to use OpenCL in an Android application, how to start writing OpenCL code, and how to link to the OpenCL runtime. The tutorial shows a typical sequence of OpenCL API calls and the general workflow to get a simple image processing kernel running with an animation on an OpenCL device. Advanced topics like efficient data sharing or Android OpenCL performance best-known methods (BKMs) are outside the scope of this tutorial.
  • Simple Optimizations of OpenCL™ Code: The Simple Optimizations sample demonstrates simple ways of measuring the performance of OpenCL™ kernels in an application. It covers the basics of profiling and important caveats, such as using a dedicated “warm-up” run. It also demonstrates several simple optimizations; some are rather CPU-specific (like mapping buffers), while others are more general (like using relaxed math). The sample also shows how to employ OpenCL profiling events.
  • Performance Debugging Intro: You can measure the performance of OpenCL kernels in many ways. For example, you can use host-side timing mechanisms like QueryPerformanceCounter or rdtsc. Still, those “wall-clock” measurements might not provide any insight into the actual cost breakdown. We show you several methods of measuring OpenCL kernel performance.
  • OpenCL™ and OpenGL* Interoperability Tutorial: OpenCL™ and OpenGL* are two common APIs that support efficient interoperability. OpenCL is specifically crafted to increase computing efficiency across platforms, and OpenGL is a popular graphics API. This tutorial provides an overview of basic methods for resource sharing and synchronization between these two APIs, supported by performance numbers and recommendations. A few advanced interoperability topics are also introduced, along with references.
  • OpenCL™ and OpenGL* Interoperability Sample: This sample gives an overview of basic methods for texture sharing and synchronization between the two APIs, backed with performance numbers and recommendations. Finally, a few advanced interoperability topics are also covered in this document, along with further references.
  • Sharing Surfaces between OpenCL™ and OpenGL* 4.3 on Intel® Processor Graphics using implicit synchronization: This tutorial demonstrates the creation of a texture in OpenGL* 4.3 that has a sub-region updated by an OpenCL™ C kernel running on Intel® Processor Graphics with Microsoft Windows*. One example use is a real-time computer vision application where we want to run a feature detector over an image in OpenCL but render the final output to the screen in real time with the detected features clearly marked. In this case you want the expressiveness of the OpenCL C kernel language for compute but the rendering capabilities of the OpenGL API for compatibility with your existing pipeline. Another example might be a dynamically generated procedural texture created in OpenCL and used as a texture when rendering a 3D object in the scene. Finally, imagine post-processing an image with OpenCL after rendering the scene using the 3D pipeline; this could be useful for color conversion, resampling, or compression in some scenarios.
  • Sharing Surfaces between OpenCL™ and DirectX* 11 on Intel® Processor Graphics: This tutorial demonstrates how to share surfaces between OpenCL™ and DirectX* 11 with Intel® Processor Graphics on Microsoft Windows*, using the surface sharing extension in OpenCL. The goal is to provide access to the expressiveness of the OpenCL C kernel language and the rendering capabilities of the DirectX* 11 API.
  • Intel® Processor Graphics Optimization: Discover how to optimize OpenCL™ kernels for the Intel® Graphics device with the Image Processing sample, based on the Sobel filter algorithm. The optimization tips described in this tutorial are also applicable to other image processing algorithms targeting the Intel Graphics OpenCL device.
  • Media Resource Sharing: This tutorial demonstrates how to use Microsoft Direct3D* and Intel® SDK for OpenCL™ Applications together. Specifically, the provided sample integrates data processing using the OpenCL technology on Intel® Processor Graphics with rendering via Microsoft DirectX*, featuring interoperability with no extra memory copies.

    This tutorial also demonstrates how to use Microsoft DirectX* Video Acceleration (DXVA) and Intel SDK for OpenCL Applications together for efficient post-processing and fast rendering. The sample relies on DXVA for hardware-accelerated rendering, while the Intel OpenCL implementation is used for post-processing. Specifically, the sample demonstrates how to:
    • Create a shared DXVA* surface so it can be effectively shared with the OpenCL technology
    • Use the OpenCL technology to perform post processing on the surface before rendering to the screen with DXVA.
  • Using Basic Capabilities of Multi-Device Systems with OpenCL™: The Multi-Device Basic tutorial is an example of utilizing the capabilities of a multi-device system.

    This tutorial’s sample targets systems with multiple Intel® Xeon Phi™ coprocessor devices. The guidelines and methods provided in this sample are also applicable to multi-device systems with CPU and GPU devices, or a CPU and one Intel® Xeon Phi™ coprocessor device.

    The Multi-Device Basic sample provides an example of three basic scenarios with simultaneous utilization of multiple devices under the same system:
    • System-level
    • Multi-context
    • Shared context
  • OpenCL™ API Debugger Tutorial: OpenCL™ API Debugger is a Microsoft Visual Studio* plug-in that enables debugging of OpenCL applications by allowing developers to monitor and understand their OpenCL environment.

    In this tutorial we show one important use of the API Debugger: debugging and finding the root cause of an application failure when only the kernel source code and the application binary are available (no host-side source code).
  • Intel® VTune™ Amplifier XE: Getting started with OpenCL™ performance analysis on Intel® HD Graphics: The VTune Amplifier tracks the overall GPU activity (graphics, media, and compute), collects the Intel® Integrated Graphics hardware metrics, details OpenCL™ activity on the GPU, and then presents them correlated with the CPU processes and threads. Figure 1 shows CPU and GPU activities presented by the VTune Amplifier.
  • Performance Tuning of OpenCL™ Applications on Intel® Xeon Phi™ Coprocessor using Intel® VTune™ Amplifier XE 2013/2015: Intel® SDK for OpenCL™ Applications provides a development environment for OpenCL applications on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors for both Windows* and Linux* operating systems. The latest SDK includes development tools, a runtime, and support for optimization tools. In addition, recent releases of the Intel® VTune™ Amplifier XE provide essential functionality for tuning OpenCL applications on Intel Xeon Phi coprocessors, including source-level analysis of OpenCL kernels. This article provides a basic workflow for profiling OpenCL applications on Intel Xeon Phi coprocessors and some examples of performance analysis.
  • How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics: This tutorial shows how to minimize the memory footprint of applications and reduce the amount of buffer copying in the shared physical memory system of Intel® processors with Intel® Processor Graphics.
For more complete information about compiler optimizations, see our Optimization Notice.