Introduction to OpenCL™

December 20, 2011 11:00 PM PST


Open Computing Language (OpenCL™) provides a framework for writing programs in a C-like language that can run on heterogeneous cores such as CPUs, GPUs, or specialized hardware. This white paper provides a brief introduction to the OpenCL framework.

Before the arrival of OpenCL, developers had to choose a vendor-specific GPU programming option and develop code for a given GPU family using a technology such as NVIDIA’s* CUDA, AMD’s Stream SDK, or shader languages. Once written, these kernels would run only on devices from that vendor, and often would not run optimally on later generations of devices from the same vendor unless recompiled. These languages were also not supported across operating systems.

OpenCL allows developers to write programs that can run on various device architectures and can take advantage of instructions that may become available in the future, thanks to built-in just-in-time compilation. OpenCL is supported on various operating systems and on various CPUs such as x86, ARM*, IBM CELL BE*, and IBM POWER*. OpenCL is also more than an SDK or a set of APIs for programming target devices: it is backed by runtime systems that compile programs just in time to make the best use of the hardware.

The OpenCL specification does not have APIs to support load balancing or task scheduling and prioritization. It does not present a unified view of memory, and it does not enforce data consistency across heterogeneous devices residing in non-coherent memory domains. Developers programming for high-performance computing need another way to provide such features.

OpenCL programs have their resource requirements built in, and have no way of altering them dynamically at runtime based on application requirements or on whatever else is running on a given heterogeneous core at the time of execution.

OpenCL™ Components

The OpenCL framework has three components. The first is the OpenCL front-end compiler and language extensions. Second are platform APIs that let the user define a given compute context consisting of various devices with varying capabilities where computation is going to take place. Finally, there are runtime APIs that include a back-end compiler to manage program execution, memory transfers between devices, and synchronization.

Language Extensions to ISO C99

OpenCL enables CPU-GPU heterogeneous computing by providing language extensions to the C language. These extensions enable processing based on vector data types, which can be translated to x86 SIMD instructions such as Intel® SSE or Intel® Advanced Vector Extensions (Intel® AVX), as well as to similar instructions specific to GPUs and DSPs. OpenCL programs go through two-phase compilation. The first phase converts programs to an intermediate representation (IR). The second phase, which takes place at runtime (also called JIT or just-in-time compilation), translates the IR to the actual machine instructions of the target device. This enables programs written earlier to take advantage of new hardware features without being rebuilt by the developer. OpenCL also adds a half data type for reduced storage and precision, and sets minimum floating-point precision requirements based on IEEE 754-2008. This requirement helps keep results consistent across devices.

Because programs can run on various architectures with varying resources (for example, a limited number of registers) and capabilities, the OpenCL language restricts certain resource-intensive language features such as recursion, bit fields, variable-length arrays, variadic macros and functions, various standard C library functions, and storage-class specifiers such as extern, static, auto, and register.
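
One practical consequence of the recursion restriction is that a recursive algorithm has to be restated with loops before it can go into a kernel. The following is a minimal sketch in plain host C, not kernel code; the function names are hypothetical and the summation is only an illustrative stand-in for whatever algorithm is being ported.

```c
#include <stddef.h>

/* Recursive pairwise sum: fine in host C, but this shape is not
 * allowed inside an OpenCL kernel because it calls itself. */
float sum_recursive(const float *data, size_t n)
{
    if (n == 0) return 0.0f;
    if (n == 1) return data[0];
    size_t half = n / 2;
    return sum_recursive(data, half) + sum_recursive(data + half, n - half);
}

/* Loop-based restatement: an in-place tree reduction.
 * Note that it overwrites the input array with partial sums. */
float sum_iterative(float *data, size_t n)
{
    if (n == 0) return 0.0f;
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* Fold each element at distance `stride` into its left partner. */
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            data[i] += data[i + stride];
        }
    }
    return data[0];   /* the total accumulates in the first element */
}
```

The loop form uses only fixed, known-in-advance storage, which is the property the language restrictions are trying to preserve on devices with limited registers and no call stack.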

Implementations may not implement all language extensions. Support for the half data type is still limited, and the first implementations from various vendors did not support the float3 and int3 data types. Please refer to the list of supported extensions and the vendor documentation for a full picture of supported features.

Platform APIs Defining Compute Domain

This layer enables programmers to find out what they have at their disposal, carve out various computation domains (called contexts in OpenCL), set up data structures for data sharing between various devices, compile and submit programs called kernels by creating device-specific queues within a context, and finally manage synchronization.

A typical system may have several OpenCL implementations (called platforms) installed. Programmers query the platforms, get their capabilities, get the devices supported by each implementation, and then create a compute context from the devices of interest. A compute context cannot span implementations. There is at present no way to split a device across contexts either (i.e., there is no way to specify using 2 cores out of 16 available cores for context A and the rest for context B). Developers may create as many contexts as they like, but communication across contexts is not possible. The main program managing it all is called the host program. The host program may create separate contexts on multiple platforms to utilize devices as it sees fit.

A context also unifies the memory domain and synchronization facilities available to a program. Within a context, command queues are used to submit work to devices and to transfer data to and from a device. Programmers can create as many command queues as they like for a given device; however, each command queue is tied to exactly one device. The cores within a device are called compute units (CUs) in OpenCL.


Compute units can be further divided into processing elements (PEs). This distinction is useful for GPUs, where a compute unit represents a streaming multiprocessor or a SIMD engine core. These cores contain multiple streaming processors or shader (ALU) units.

OpenCL Programming Paradigms - Task & Data Parallelism
OpenCL programming supports the task parallel and data parallel paradigms for CPUs, and only the data parallel paradigm for GPUs.

Task Parallelism
Task parallelism is achieved by submitting many tasks. The framework runs each task by dedicating a core to it. Programmers can submit multiple tasks to a queue.

If tasks are unrelated and there is no dependency among tasks, it is better to create as many command queues as there are CPU cores, and then submit tasks to those queues to achieve better CPU utilization.

If a given implementation does not allow out-of-order execution of commands submitted to a queue and a single queue is used, tasks submitted to a queue are not going to run on all available cores at the same time, resulting in lower CPU utilization.

Data Parallelism
On the other hand, if similar work needs to be performed on a large dataset, OpenCL provides a way to write programs in scalar fashion, which is then executed by multiple cores in parallel over the large dataset.


Figure 2.0 Scalar C version
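
The original figure is not reproduced here. As a stand-in, the following is a minimal sketch of what a scalar C version could look like, assuming the operation is the color inversion named later in the text, that pixels are stored as packed RGBA floats normalized to [0, 1], and that alpha is left untouched. The function name and layout are assumptions, not the author's code.

```c
#include <stddef.h>

/* Scalar color inversion over an image of packed RGBA float pixels.
 * Assumed layout: 4 floats per pixel, rows stored contiguously. */
void invert_scalar(const float *in, float *out, size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {        /* outer loop over rows */
        for (size_t x = 0; x < width; ++x) {     /* inner loop over columns */
            size_t p = 4 * (y * width + x);      /* offset of pixel (x, y) */
            out[p + 0] = 1.0f - in[p + 0];       /* red */
            out[p + 1] = 1.0f - in[p + 1];       /* green */
            out[p + 2] = 1.0f - in[p + 2];       /* blue */
            out[p + 3] = in[p + 3];              /* alpha copied unchanged */
        }
    }
}
```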

Written with OpenCL C vector extensions and supporting functions, the same program translates to the following OpenCL color conversion code.

The first thing you will notice is that the outer loop is gone and the index is retrieved using get_global_id(0). We will go over these in detail in future white papers. The OpenCL code is written as if it were scalar code. Data types such as float4 are used like native C types. The compiler expands these types to the corresponding Intel SSE/Intel AVX instructions at runtime. We will go over runtime details in our next series of white papers.


Figure 3.0 OpenCL Version
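
The original figure is not reproduced here either. The sketch below shows, under the same assumptions as before (color inversion, RGBA float pixels, alpha preserved), what such a kernel could look like. The OpenCL C source is held in a string, which is the form in which a host program would hand it to the runtime for just-in-time compilation. The kernel name invert, the float4_host type, and the emulation functions are all hypothetical. The emulation runs the work-items one after another on the host purely to illustrate the model; a real OpenCL runtime would schedule them in parallel.

```c
#include <stddef.h>

/* OpenCL C kernel source: one work-item inverts one float4 pixel.
 * There is no loop; the index comes from get_global_id(0). */
const char *const invert_kernel_source =
    "__kernel void invert(__global const float4 *in,\n"
    "                     __global float4 *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    float4 p = in[i];\n"
    "    out[i] = (float4)(1.0f - p.x, 1.0f - p.y, 1.0f - p.z, p.w);\n"
    "}\n";

/* Host-side stand-in for OpenCL's built-in float4 (illustration only). */
typedef struct { float x, y, z, w; } float4_host;

/* Body of one work-item; gid plays the role of get_global_id(0). */
static void invert_work_item(size_t gid, const float4_host *in,
                             float4_host *out)
{
    float4_host p = in[gid];
    out[gid].x = 1.0f - p.x;
    out[gid].y = 1.0f - p.y;
    out[gid].z = 1.0f - p.z;
    out[gid].w = p.w;            /* alpha copied unchanged */
}

/* Sequential emulation of a one-dimensional launch with
 * global_size work-items (global_size = width * height). */
void invert_emulated(const float4_host *in, float4_host *out,
                     size_t global_size)
{
    for (size_t gid = 0; gid < global_size; ++gid) {
        invert_work_item(gid, in, out);
    }
}
```

Compared with the scalar version, the loop has moved out of the kernel and into the runtime: the kernel describes the work for a single index, and the launch size determines how many indices exist.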

In the OpenCL version of Color Inversion, there will be (width*height) work-items (threads) running in parallel, each operating on a single float4 (one pixel) at a time. If this code is run on CPUs, you might want to create a version that divides the image among the available cores and then runs a loop on each core to process all the pixels in its stride.
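
The CPU-oriented variant can be sketched as follows. This is plain host C that emulates the body of one such work-item; worker_id stands in for get_global_id(0) and num_workers for the number of cores. It assumes contiguous chunks of packed RGBA float pixels, and the function name and chunking scheme are assumptions rather than the author's code.

```c
#include <stddef.h>

/* Inverts only the pixels that belong to one worker.
 * The image is split into num_workers contiguous chunks. */
void invert_chunk(const float *in, float *out, size_t num_pixels,
                  size_t worker_id, size_t num_workers)
{
    if (num_workers == 0) return;

    /* Ceiling division so the chunks cover every pixel even when
     * num_pixels is not a multiple of num_workers. */
    size_t chunk = (num_pixels + num_workers - 1) / num_workers;
    size_t begin = worker_id * chunk;
    size_t end = begin + chunk;
    if (begin > num_pixels) begin = num_pixels;  /* worker has no pixels */
    if (end > num_pixels) end = num_pixels;      /* last chunk may be short */

    for (size_t i = begin; i < end; ++i) {
        out[4 * i + 0] = 1.0f - in[4 * i + 0];   /* red */
        out[4 * i + 1] = 1.0f - in[4 * i + 1];   /* green */
        out[4 * i + 2] = 1.0f - in[4 * i + 2];   /* blue */
        out[4 * i + 3] = in[4 * i + 3];          /* alpha copied unchanged */
    }
}
```

Launching one work-item per core with this body trades the (width*height) fine-grained work-items for a handful of long-running ones, which tends to suit CPUs better.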

Runtime execution can be directed not to oversubscribe cores by setting the workgroup size to the number of cores. There is no way to limit the cores used by a given kernel or to provide affinity hints.

Data transfer between devices is achieved by submitting read and write commands to the queue. This also gives the host program a way to know when the kernel has completed all of its processing, so that it can process the data further.

Commands submitted to the command queue carry a qualifier that tells whether a command is a blocking or non-blocking one. The host program does not wait for non-blocking commands to finish before proceeding forward. OpenCL provides excellent support for callback functions and events. Programmers can use these events and callback functions to manage work effectively.

Command queues also provide extensive support for profiling. Developing kernels first on CPUs is easier, as the tools are a lot more mature and you can use debuggers to see how your kernels are executing.

For image processing and media application-related needs, OpenCL also provides samplers that can be created at the context level. There are also image data types as well as address space qualifiers primarily used to target different caches on GPUs.

Runtime APIs/Compilers to Manage Compute Domain

OpenCL runtime allows for building and compiling kernels, executing them on target devices and managing contexts and command queues, events, and callbacks, as well as user events that can synchronize work between various devices through callbacks.

OpenCL uses dynamic compilation. The host program and other libraries used are compiled statically.

At runtime, OpenCL kernels are first converted to Low Level Virtual Machine (LLVM) intermediate representation (IR). This intermediate representation is then sent to back-end compilers of target devices to create final machine code.

Runtime program build/compilation management is done using program objects. Programmers can create a program object from source, or can provide a precompiled binary to create a program object for a given target device. After this step, specific kernel functions (OpenCL functions with the __kernel qualifier) can be used to create kernel objects.

To execute these kernels, programmers first set the arguments of the kernel object and then submit the kernel to a command queue for any device within the context. The OpenCL framework does not provide a standalone compiler. The only way to compile OpenCL kernels is to write a small program that uses OpenCL APIs to build the program.

If you pre-compile binaries for devices, there is no guarantee that they will continue to work with future generations of hardware. It is best to build from source at load time for maximum compatibility across devices and device generations.

OpenCL APIs are thread safe (in OpenCL 1.1, with the exception of clSetKernelArg), so developers can create as many threads as they like and then manage their device-specific tasks from within those threads. Developers can also use callbacks within and across threads as long as the callbacks are thread safe.

Basic POSIX threading and OpenCL task/data parallelism are at times complementary. Read and write commands can be queued using threads in host programs along with setting of kernel arguments. If various kernels are independent, this can be achieved on the host program using POSIX threads. When OpenCL kernels are executed, OpenCL will utilize all supported configured devices. This way the developer can get maximum device utilization using all possible options.

PCI latencies are usually about a thousand times higher than the latency of a CPU accessing memory. This means that for smaller kernels, it is best to use the CPU and set your workgroup size to the number of cores.
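
The trade-off can be made concrete with a back-of-the-envelope estimate. The sketch below is a deliberately simple cost model, not a measured one: the bus_model type, the function names, and the assumption that each transfer costs a fixed latency plus bytes divided by bandwidth are all illustrative. Real figures would have to come from profiling the actual system.

```c
#include <stdbool.h>

/* Hypothetical description of the host-to-device bus. */
typedef struct {
    double transfer_latency_s;     /* fixed cost paid by every transfer */
    double bandwidth_bytes_per_s;  /* sustained transfer rate */
} bus_model;

/* Estimated time to upload the inputs, run the kernel on the device,
 * and download the results. */
double offload_time_s(const bus_model *bus, double bytes_in,
                      double bytes_out, double device_compute_s)
{
    double upload = bus->transfer_latency_s
                  + bytes_in / bus->bandwidth_bytes_per_s;
    double download = bus->transfer_latency_s
                    + bytes_out / bus->bandwidth_bytes_per_s;
    return upload + device_compute_s + download;
}

/* Offloading only pays off when the whole round trip beats
 * simply running the kernel on the CPU. */
bool offload_pays_off(const bus_model *bus, double bytes_in,
                      double bytes_out, double device_compute_s,
                      double host_compute_s)
{
    return offload_time_s(bus, bytes_in, bytes_out, device_compute_s)
         < host_compute_s;
}
```

For a small kernel the two fixed latency terms dominate the estimate, so the CPU wins even if the device computes faster; for a large one the compute saving can outweigh the transfer cost.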


If data parallelism can help a given problem, it is best to enqueue reads and writes while the GPU is busy executing kernels. This way the developer can partly hide PCI latencies.

For CPUs, it is best to use more queues so that CPU cores are not waiting for data to arrive before the cores can start computation. For GPUs, it is best to use a single queue, as all of the queues are going to communicate to the GPU through the device driver, which may have limited support to handle multiple command queues simultaneously.

As there is no support for prioritization on command queues, users can use multiple queues for CPU devices and use user events to trigger computations.


Conclusion

OpenCL programming for CPUs provides a great way to write programs and is geared to future programming models.

OpenCL for data parallel needs is easy to program and manage, as the developer only needs to focus on functionality and not on threading APIs. Programs will run fine on various architectures and on various operating systems as long as OpenCL implementations exist for that platform.

For task parallel needs, OpenCL provides limited support, as there is no way to manage priorities or to manage cores so that threads are not oversubscribed. It is best to use OpenCL task parallelism when the tasks at hand are fairly agnostic to prioritization and each task can run efficiently on a single core. It is best to use out-of-order command queue properties in that case so that the command queue can manage running these tasks in parallel.

Programming in OpenCL is easy, as debug tool chains are very mature and it is easy to understand the code that is generated for optimizations. Vector data types provide an intuitive way to program, and the C-like language is easy to learn and maintain.

You are free to download the Intel® OpenCL 1.1 SDK at http://whatif.intel.com

About the Author

Vinay Awasthi works as an Application Engineer for the Apple* Enabling Team at Intel at Santa Clara. Vinay has a Master’s Degree in Chemical Engineering from Indian Institute of Technology, Kanpur. Vinay enjoys mountain biking and scuba diving in his free time.