This white paper is the second in a series of white papers on OpenCL. It describes the status of the Intel OpenCL implementation and the tools available to developers using the Intel OpenCL SDK.
The Beta of Intel's implementation of the OpenCL 1.1 specification for CPUs (Core 2 Duo (Penryn) or later) can be downloaded from https://software.intel.com/en-us/intel-opencl. It is still evolving into a mature product, so feel free to try it and send us feedback. At present it runs only on Linux* (64-bit only), Windows 7* (with SP1), and Windows Vista* (32- and 64-bit).
At the time of writing, the Intel implementation is the only one that implements out-of-order queues. It also allows multiple work-items per work-group on CPUs, and it includes preview support for the device fission extension (not fully validated). We will cover the benefits of these options later in this white paper. The implementation also ships with an OpenCL offline compiler. This compiler lets you inspect the assembly instructions and the intermediate representation (IR) of your OpenCL kernels immediately, without plugging them into a program or calling any APIs to get the IR. Developers can also use this tool to check that their kernels compile correctly.
Intel OpenCL 1.1 Beta Tools
The Intel OpenCL implementation comes with an OpenCL 1.1 compliant SDK and an offline compiler (32- and 64-bit on Windows, 64-bit on Linux). Programs built with OpenCL 1.1 can also be analyzed using VTune™ Amplifier XE. VTune is not part of this distribution; however, you can get a trial version from /en-us/articles/intel-vtune-amplifier-xe. OpenCL programs can be compiled in Microsoft Visual Studio* 2008 or 2010 using either the Microsoft or the Intel compilers. A trial version of the Intel compilers is available from /en-us/articles/intel-parallel-studio-xe/. Intel Graphics Performance Analyzers 4.0 (/en-us/articles/intel-gpa) can also be used to analyze OpenCL applications; the user's guide explains how.
Simple OpenCL Kernel Development
The OpenCL toolchain from Intel includes an offline compiler. Let us write a simple kernel using OpenCL vector data types and view its assembly and LLVM intermediate representation with this tool.
Figure 1.0 Intel Offline compiler.
Using the Intel Offline Compiler, you can compile kernels, create binaries, and save them in .ir files to load later (skipping the build step for a faster response). See the OpenCL SDK User's Guide for more details.
Now we will go over the key areas of the generated assembly. The first thing you may notice is that the generated assembly uses the following calls.
call ___ocl_svml_h8_acoshf4
call ___ocl_svml_h8_acoshf16
The Intel implementation uses the Intel® Math Kernel Library to generate optimized code, so every OpenCL kernel that uses math functions can take advantage of Intel® MKL functions. If you use only float4, you will see calls only to the 4-wide variant, i.e. no 16-wide acosh operations. It is good to use as wide a data type as possible.
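To put rough numbers on the effect of vector width, the sketch below counts how many vector math calls are needed to cover a buffer at a given width. It is plain Python arithmetic, not SDK code: the helper name `vector_calls` and the 4096-element buffer are illustrative assumptions, and a lower call count does not by itself guarantee a proportional speedup.

```python
def vector_calls(n_floats, width):
    """Count the width-wide vector math calls needed to cover n_floats values.

    Hypothetical helper for illustration only; it models call counts,
    not measured performance.
    """
    if n_floats < 0 or width <= 0:
        raise ValueError("n_floats must be >= 0 and width must be > 0")
    return -(-n_floats // width)  # ceiling division


if __name__ == "__main__":
    for width in (4, 16):  # float4 versus float16
        print(width, vector_calls(4096, width))
```

Under this simple model a float16 kernel issues a quarter of the math calls that a float4 kernel issues for the same buffer.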
Another thing you will notice is that the generated assembly also lists spills and reloads in comments. This gives you an idea about data alignment and loading.
Intel® Graphics Performance Analyzer v 4.0 for OpenCL (CPU) Kernels
The OpenCL SDK Tracer tool packaged with the earlier Alpha version of the SDK is deprecated and has been replaced by Intel Graphics Performance Analyzer 4.0 (download it from /en-us/articles/intel-gpa). This tool provides platform-level execution profiles of multiple threads, along with their tasks and memory operations. The Intel OpenCL SDK works with Intel® GPA once the CL_CONFIG_USE_GPA environment variable is set to True. Several tools are packaged under Intel GPA; for OpenCL we will use only Intel GPA Monitor (to collect traces) and Intel GPA Platform Analyzer (to view traces). The other tools under Intel GPA are for analyzing DirectX 11 game performance.
Intel GPA Platform Analyzer aligns clocks across all cores to provide a uniform timeline for CPU and GPU workloads and for running tasks and threads.
We will first run acosh16_cpu with WorkGroup Size set to 2 and Local Size set to 1. You will need to configure profiles and related settings beforehand; please refer to Section 7.3.2 of the Intel OpenCL SDK User's Guide for details.
Figure 2.0 Intel GPA Monitor Program (Run from System Tray).
Here is the top-level view of this run as shown in the Intel GPA Platform Analyzer tool.
Figure 2.0 Intel GPA Platform Analyzer 4.0 Beta
As you would expect, there are 2 tasks. Each task's statistics are displayed on the right-hand side. This tool can track execution time as well as write and read enqueue commands. On CPUs, the enqueue times are very small compared with execution times, so try zooming in to see the vertical bars representing those tasks.
Figure 3.0 Intel GPA Platform Analyzer 4.0 Beta Individual Tasks (zoomed in view)
This task analyzer can process multiple kernels along with their respective write enqueue and read enqueue commands. Here is a trace view with multiple kernels. This is very handy when you want to compare multiple implementations of various kernels.
You will also notice that even though the number of work-groups goes up, the number of threads created to execute the tasks does not. In this way the implementation uses fewer threads to execute multiple independent tasks.
Figure 4.0 Intel GPA Platform Analyzer 4.0 Beta Trace with multiple kernels
For details, please refer to the Intel Graphics Performance Analyzers Getting Started Guide and Section 7.3.3, "Generating a Trace File", of the Intel OpenCL SDK User's Guide.
The Intel OpenCL SDK 1.1 Beta at present runs only on CPUs. Intel GPA v 4.0 is a graphics performance analyzer that also supports Intel GPUs and DirectX 11 game performance analysis. Here we are using it to gather a system-level view of OpenCL kernels running on CPUs.
Amplifier XE Analysis
The Intel OpenCL toolchain is implemented on top of the Intel® Threading Building Blocks APIs. This means that users can also analyze their implementation using the Amplifier XE tools packaged with Intel Parallel Studio XE. Run Amplifier XE in administrator mode. To profile OpenCL kernels, you will also need to set ENABLE_JITPROFILING=1 and CL_CONFIG_USE_VTUNE=True. Please refer to the Intel OpenCL SDK User's Guide (Section 7.1.1) for further details.
Here is the output of the same program analyzed using Amplifier XE.
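As a minimal sketch of how these settings might be applied from a script, the Python helper below builds an environment containing the variables named in this paper and starts a program with it. The variable names and values come from the text; the helper names `profiling_env` and `run_with_profiling` are hypothetical. Whether the variables are better set system-wide, in the shell that starts the profiler, or in the tool's own launch settings is not covered here, so treat this as one possible approach.

```python
import os
import subprocess

# Variable names and values are the ones quoted in this paper.
PROFILING_ENV = {
    "gpa": {"CL_CONFIG_USE_GPA": "True"},
    "vtune": {"ENABLE_JITPROFILING": "1", "CL_CONFIG_USE_VTUNE": "True"},
}


def profiling_env(tool, base=None):
    """Return a copy of the base environment with one tool's settings added."""
    if tool not in PROFILING_ENV:
        raise ValueError("unknown tool: " + tool)
    env = dict(os.environ if base is None else base)
    env.update(PROFILING_ENV[tool])
    return env


def run_with_profiling(command, tool):
    """Start command (a list of arguments) as a child process with the settings applied."""
    return subprocess.run(command, env=profiling_env(tool), check=False)
```

For example, `run_with_profiling(["./acosh16_cpu"], "vtune")` would start the sample program with both VTune-related variables set; the executable path is an assumption.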
Figure 5.0 Amplifier XE analyzing OpenCL programs.
This gives the developer a comprehensive view of the various OpenCL commands as well as a top-level view of CPU utilization while the kernels run. There is also a summary view where users can quickly see the top hotspots of a typical OpenCL program. For small kernels, hotspot analysis may show program creation and program build as the most expensive functions.
Using the Intel Offline Compiler, you can skip the build phase at load time and simply load precompiled .ir kernel files. This option, however, cannot take advantage of just-in-time compilation: if a newer compiler starts using new instructions available on a given platform, those instructions will not be available to an already compiled binary. Please refer to Section 4.5 of the Intel OpenCL SDK User's Guide for a full description. This section describes how to collect CPU performance counters and perform hotspot analysis on OpenCL kernels.
Performance Tips for CPU as target device
The best source of performance tips is the Optimization Guide packaged with the Intel OpenCL SDK distribution.
In short: use vector data types, do not use a lot of work-groups, and try to create as many work-groups as there are logical processors in the system. Refer to Section 2.7, "Workgroup Size Considerations", of the Performance Guide for a detailed description.
Supply more work to each thread by using loops and smaller work-group sizes, as shown in the following example.
Figure 6.0 Simple OCL kernel
Figure 7.0 CPU Optimized OpenCL Kernel
In the second example, we reduce the group size to the number of logical cores and then, based on the global id, perform more work in each kernel invocation. For 4096 items on a dual-core machine, we use a work group size of 2 and set group_size4 to 2048, whereas in the example shown in Figure 6.0 we use a work group size of 4096.
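The index arithmetic behind this restructuring can be modeled in plain Python. The sketch below is not the kernel code from the figures: it is a sequential stand-in that calls a function once per global id, treats every item as a scalar, and uses acosh only because it is the function discussed earlier. The names `simple_kernel`, `looped_kernel`, `enqueue`, and `run_both` are hypothetical; `group_size4` follows the name used in the text.

```python
import math


def simple_kernel(gid, src, dst):
    """One work-item handles exactly one element."""
    dst[gid] = math.acosh(src[gid])


def looped_kernel(gid, src, dst, group_size4):
    """One work-item loops over a contiguous chunk of group_size4 elements."""
    start = gid * group_size4
    for i in range(start, start + group_size4):
        dst[i] = math.acosh(src[i])


def enqueue(kernel, global_size, *args):
    """Stand-in for a kernel launch: call the kernel once per global id."""
    for gid in range(global_size):
        kernel(gid, *args)


def run_both(n_items=4096, logical_cores=2):
    """Run both variants on the same input and return the two output lists."""
    if n_items % logical_cores:
        raise ValueError("this sketch assumes the items divide evenly")
    src = [1.0 + i for i in range(n_items)]  # acosh needs x >= 1
    out_simple = [0.0] * n_items
    out_looped = [0.0] * n_items
    enqueue(simple_kernel, n_items, src, out_simple)
    enqueue(looped_kernel, logical_cores, src, out_looped,
            n_items // logical_cores)
    return out_simple, out_looped
```

With the defaults, the first variant is invoked 4096 times and the second only twice, each invocation covering 2048 items, and both produce identical output.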
For CPUs, it is better to use single-dimension workloads, because that way CPU caches can mask data latencies very effectively.
If precision can be traded for speed, use the -cl-fast-relaxed-math build option. This option generates code that is not IEEE 754 compliant. See Section 4.9 of the Intel OpenCL SDK User's Guide.
The Intel OpenCL implementation supports 2 work-items per work-group as well as the out-of-order execution model. These options allow much better CPU utilization than in-order queues with a single work-item per work-group, so feel free to try them.
The Intel OpenCL SDK is installable client driver (ICD) compliant. This means it can coexist with other OpenCL implementations that are ICD compliant.
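A toy scheduling model gives a feel for why out-of-order execution can help. The sketch below assumes that every command is an independent task that keeps only one core busy, that an in-order queue runs such commands strictly one after another, and that an out-of-order queue may hand each command to whichever core becomes free first. These are simplifying assumptions, not a description of the Intel runtime's scheduler, and the function names are hypothetical.

```python
import heapq


def in_order_makespan(durations):
    """Total time when commands run strictly one after another."""
    return sum(durations)


def out_of_order_makespan(durations, cores):
    """Total time when each command starts on the earliest available core."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    finish_times = [0] * cores  # min-heap of per-core finish times
    heapq.heapify(finish_times)
    for duration in durations:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)
```

For four commands lasting 4, 3, 2, and 1 time units, this model gives 10 units in order and 5 units out of order on two cores.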
The newly implemented device fission extension supports various modes for carving up compute devices, using either NUMA affinity or a number of cores to create sub-devices. For best performance with NUMA affinity, allocate memory using the CL_AFFINITY_DOMAIN_NUMA_EXT property to keep memory local to the node running the code. Please refer to Section 4.4 of the Intel OpenCL SDK User's Guide for full details. Device fission is a great way to reserve some CPUs for low-latency tasks or kernels while using the remaining cores for general-purpose OpenCL computations.
This version also supports cl_khr_fp64 (double precision, on Linux and Windows) and the OpenCL/OpenGL buffer sharing extension (cl_khr_gl_sharing, Windows only).
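The idea of reserving cores by count can be sketched with a small Python model. It only illustrates the bookkeeping: with the real extension the runtime decides which physical cores back each sub-device, and the function name `partition_by_counts` and the core numbering below are assumptions made for illustration.

```python
def partition_by_counts(total_units, counts):
    """Split compute-unit indices 0..total_units-1 into consecutive groups.

    Hypothetical model: each entry in counts is the size of one sub-device.
    Units left over after all counts are served stay unassigned.
    """
    if any(count <= 0 for count in counts):
        raise ValueError("every sub-device needs at least one compute unit")
    if sum(counts) > total_units:
        raise ValueError("counts exceed the available compute units")
    groups, start = [], 0
    for count in counts:
        groups.append(list(range(start, start + count)))
        start += count
    return groups
```

On an 8-core machine, `partition_by_counts(8, [1, 7])` models one core set aside for a low-latency kernel and seven cores left for general-purpose work.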
Feel free to download Intel® SDK for OpenCL™ Applications at https://software.intel.com/en-us/intel-opencl.
About the Author
Vinay Awasthi works as an Application Engineer for the Apple Enabling Team at Intel in Santa Clara. Vinay has a Master's degree in Chemical Engineering from the Indian Institute of Technology, Kanpur. Vinay enjoys mountain biking and scuba diving in his free time.