Get Started with Intel® SDK for OpenCL™ Applications 2019 on Linux* OS with Training Sample

ID 673113
Updated 12/21/2018
Version Latest
Public

author-image

By

Overview

This article demonstrates getting started with OpenCL™ Tools developer components in Intel® System Studio 2019 initial release and Update 1 release on Linux* OS. This article also applies to the Intel® SDK for OpenCL Applications 2019 as a standalone . The walkthrough assumes developer targets of both Intel® CPU and Intel® Graphics Technology. Training sample source code is attached for download via upper right orange button and at the article footer.

The article walks through:

  • Part 1:
    • Prerequisites and Installation
    • Build
    • Program execution
  • Part 2:
    • IDE integration
    • Disk Provisions
    • Developer Tips
  • Part 3:
    • Explanation of training sample applications and execution
    • Offline Compilation
    • More resources

The specific platform used in this example is Ubuntu* OS 16.04.4 (stock Linux* OS kernel 4.13) on an Intel® Core™ i7-6770HQ processor (system model NUC6i7KYK). The Intel® Core™ i7-6770HQ processor has Intel® Iris™ Pro Graphics 580. More information on this platform and other platforms can be viewed at ark.intel.com. The article walkthrough is *not* exclusive to this specific hardware model.

New developers are highly encouraged to develop on a system with both intended OpenCL™ implementations and hardware available. However, in the general case, target hardware is not absolutely required for building OpenCL™ applications. Intel® CPU and Intel® Graphics Technology hardware is required to use all the developer tool features and it is required to support compiling device kernels.

 

Part 1

 

Prerequisites & Installation

  • Review the Intel® System Studio 2019 release notes.
  • Review the release notes for the OpenCL™ developer tools. Observe supported platforms.
  • Download and install OpenCL™ runtimes:
    • *2019 Initial Release* Download the Intel® CPU Runtime for OpenCL™ Applications
      • *2019 Update 1* users have Intel® CPU Runtime for OpenCL™ Applications installed automatically with the SDK.
      • Review the appropriate release notes. Observe supported platforms.
      • This CPU runtime does not require any Intel® Graphics Technology hardware. It serves as a production OpenCL™ implementation useful for:
        • Backing production applications using OpenCL™.
        • Reference development for targeting other types of devices, such as Intel® Graphics Technology or Intel® FPGA.
      • Example Prerequisites Setup:
        #edit apt properties to configure proxy as needed
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo gedit /etc/apt/apt.conf
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo apt update
        #extract
        tutorial@2019OCL-Dev-Host:~/Downloads$ tar -xvf l_opencl_p_18.1.0.013.tgz
        tutorial@2019OCL-Dev-Host:~/Downloads$ cd l_opencl_p_18.1.0.013/
        
        #prerequisites from repository noted in runtime release notes
        #consider sudo apt-get --simulate install <packages> to observe system changes first!
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ sudo apt-get install libnuma1 zlib1g libxml2 lsb-core
        Reading package lists... Done
        Building dependency tree       
        Reading state information... Done
        libnuma1 is already the newest version (2.0.11-1ubuntu1.1).
        zlib1g is already the newest version (1:1.2.8.dfsg-2ubuntu4.1).
        The following additional packages will be installed:
          alien at autotools-dev debhelper debugedit dh-strip-nondeterminism
          gcc-6-base:i386 libc6:i386 libfile-stripnondeterminism-perl libgcc1:i386
          libmail-sendmail-perl librpm3 librpmbuild3 librpmio3 librpmsign3 libsigsegv2
          libsys-hostname-long-perl lsb-invalid-mta lsb-security m4 ncurses-term pax
          po-debconf rpm rpm-common rpm2cpio s-nail zlib1g:i386
        
      • Example Runtime install:
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ sudo ./install_GUI.sh
        #Review license agreement.
        #Review collection consent. Consent helps improve future products.
        #(Optional) Select custom install location
        
        #Check installation by inspecting icd file residency
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ ls -l /etc/OpenCL/vendors/
        total 0
        lrwxrwxrwx 1 root root 42 Dec 17 17:27 intel64.icd -> /etc/alternatives/opencl-intel-runtime-icd
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ ls -l /etc/alternatives/opencl-intel-runtime-icd
        lrwxrwxrwx 1 root root 75 Dec 17 17:27 /etc/alternatives/opencl-intel-runtime-icd -> /opt/intel//opencl_compilers_and_libraries_18.1.0.013/linux/etc/intel64.icd
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ cat /opt/intel//opencl_compilers_and_libraries_18.1.0.013/linux/etc/intel64.icd
        /opt/intel/opencl_compilers_and_libraries_18.1.0.013/linux/compiler/lib/intel64_lin/libintelocl.so
        
      • The CPU Runtime sets up symbolic links to the ICD loader library libOpenCL.so. Advanced users: if the development system prefers a different libOpenCL.so, you may wish to ensure the alternate libOpenCL.so is first in the search order for your dynamic linker.
        #Check libOpenCL.so system references... ldconfig -v and ls -l can show references are symlinked and which libOpenCL.so files will be seen by the dynamic linker.
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ sudo updatedb
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ locate *libOpenCL.so*
        /etc/alternatives/opencl-libOpenCL.so
        /etc/alternatives/opencl-libOpenCL.so.1
        /etc/alternatives/opencl-libOpenCL.so.2.0
        /opt/intel/opencl_compilers_and_libraries_18.1.0.013/linux/compiler/lib/intel64_lin/libOpenCL.so
        /opt/intel/opencl_compilers_and_libraries_18.1.0.013/linux/compiler/lib/intel64_lin/libOpenCL.so.1
        /opt/intel/opencl_compilers_and_libraries_18.1.0.013/linux/compiler/lib/intel64_lin/libOpenCL.so.2.0
        /usr/lib/x86_64-linux-gnu/libOpenCL.so
        /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
        /usr/lib/x86_64-linux-gnu/libOpenCL.so.2.0
        /var/lib/dpkg/alternatives/opencl-libOpenCL.so
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ ls -l /usr/lib/x86_64-linux-gnu/libOpenCL.so
        lrwxrwxrwx 1 root root 37 Dec 17 17:27 /usr/lib/x86_64-linux-gnu/libOpenCL.so -> /etc/alternatives/opencl-libOpenCL.so
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ ls -l /etc/alternatives/opencl-libOpenCL.so
        lrwxrwxrwx 1 root root 97 Dec 17 17:27 /etc/alternatives/opencl-libOpenCL.so -> /opt/intel//opencl_compilers_and_libraries_18.1.0.013/linux/compiler/lib/intel64_lin/libOpenCL.so
        
    • Download the Intel® Graphics Runtime for OpenCL™ Driver "NEO"
      • This runtime enables OpenCL™ kernels to target Intel® Iris™ Pro Graphics 580 on the Intel® Core™ i7-6770HQ processor in this example, or supported Intel® Graphics Technology available on other Intel® platforms.
      • Review the release notes on the GitHub portal. The README, FAQ, and LIMITATIONS documents are particularly useful. Observe supported platforms. 
      • Install the appropriate runtime package(s). This graphics runtime is available as a prebuilt install package(s) for the distribution used in this example from the GitHub portal releases page.
        #After downloading latest release from the releases page:
        tutorial@2019OCL-Dev-Host:~/Downloads/l_opencl_p_18.1.0.013$ cd ..
        tutorial@2019OCL-Dev-Host:~/Downloads$ ls -l intel-*.deb
        -rw-rw-r-- 1 tutorial tutorial    82078 Dec 11 17:38 intel-gmmlib_18.4.0.348_amd64.deb
        -rw-rw-r-- 1 tutorial tutorial  8678088 Dec 11 17:38 intel-igc-core_18.48.1124_amd64.deb
        -rw-rw-r-- 1 tutorial tutorial 13492006 Dec 11 17:38 intel-igc-opencl_18.48.1124_amd64.deb
        -rw-rw-r-- 1 tutorial tutorial   528816 Dec 11 17:38 intel-opencl_18.48.11934_amd64.deb
        
        #Consider simulation sudo apt-get --simulate install <packages> to check the configuration!
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo apt-get install ./intel-gmmlib_18.4.0.348_amd64.deb ./intel-igc-core_18.48.1124_amd64.deb ./intel-igc-opencl_18.48.1124_amd64.deb ./intel-opencl_18.48.11934_amd64.deb
        [sudo] password for tutorial: 
        Reading package lists... Done
        Building dependency tree       
        Reading state information... Done
        Note, selecting 'intel-gmmlib' instead of './intel-gmmlib_18.4.0.348_amd64.deb'
        Note, selecting 'intel-igc-core' instead of './intel-igc-core_18.48.1124_amd64.deb'
        Note, selecting 'intel-igc-opencl' instead of './intel-igc-opencl_18.48.1124_amd64.deb'
        Note, selecting 'intel-opencl' instead of './intel-opencl_18.48.11934_amd64.deb'
        
        #Optional: For cl_intel_media extensions
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo apt-get install libva1 libva-drm1 libdrm2
        Reading package lists... Done
        Building dependency tree       
        Reading state information... Done
        The following additional packages will be installed:
          i965-va-driver libdrm-amdgpu1 libdrm-common libllvm6.0 mesa-va-drivers
          va-driver-all
        
        #check install
        tutorial@2019OCL-Dev-Host:~/Downloads$ cat /etc/OpenCL/vendors/intel.icd 
        /usr/local/lib/libigdrcl.so
        
    • Install OpenCL™ Tools prerequisites:
      • Make sure to see the install notes section of the release notes for the OpenCL™ tools. That section is the official source for validated package prerequisites.
      • The Eclipse* IDE plugin has a Java Runtime Environment requirement. Java 8 (1.8) Runtime Environment or higher is expected to satisfy the plugin prerequisite. Intel® System Studio 2019: OpenCL™ Tools and the Eclipse* IDE should automatically deploy the Java Runtime Environment. In other cases, when manual install is required here is an example installation:
        tutorial@2019OCL-Dev-Host:~/Downloads$ export http_proxy=http://<proxy-server>:<proxy-port>
        tutorial@2019OCL-Dev-Host:~/Downloads$ export https_proxy=http://<proxy-server>:<proxy-port>
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo -E add-apt-repository ppa:webupd8team/java
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo apt-get update
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo apt-get install oracle-java8-installer
        
      • libicu55 prerequisite install example: 
        #Install the character internationalization library mentioned in the release notes
        #set the proxy for wget as necessary
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo gedit /etc/wgetrc
        tutorial@2019OCL-Dev-Host:~/Downloads$ wget http://security.ubuntu.com/ubuntu/pool/main/i/icu/libicu55_55.1-7ubuntu0.4_amd64.deb
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo apt-get install ./libicu55_55.1-7ubuntu0.4_amd64.deb
        
      • For mono use the latest guidance from https://www.mono-project.com. Here is an example used on the test system:
        sudo apt-key adv --keyserver-options http-proxy=http://<proxy-server>:<proxy-port> --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF
        sudo apt install apt-transport-https
        echo "deb https://download.mono-project.com/repo/ubuntu stable-xenial main" | sudo tee /etc/apt/sources.list.d/mono-official-stable.list
        sudo apt update
        sudo apt-get install mono-devel ca-certificates-mono
      • dkms dependency example install:
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo apt-get install dkms
      • libwebkitgtk IDE rendering dependency:
        tutorial@2019OCL-Dev-Host:~/Downloads/system_studio_2019$ sudo apt-get install libwebkitgtk-3.0-0
      • Add the user to the video group so user programs are privileged for the Intel® Graphics Runtime for OpenCL™ Driver
        tutorial@2019OCL-Dev-Host:~/Downloads$ sudo usermod -a -G video tutorial
        

Install Intel® System Studio 2019 (or Intel® SDK for OpenCL™ Applications 2019 standalone)

  • Click the Register & Download button from the Intel® System Studio 2019 portal downloads page.
  • Select 'Linux*' for host OS. Select 'Linux* and Android*' for target OS.
  • Click the Add button for the OpenCL™ Tools line item.
    • The Eclipse* IDE is automatically added to the installer manifest when OpenCL™ Tools is added.
    • Optional: Add Intel® C++ Compiler and Intel® Vtune™ Amplifier. These products have useful features for heterogeneous and general development. Intel® Threading Building Blocks is also added as a prerequisite. These other components are not required in this article.
  • Click 'continue'
  • Click 'download'
    • Note any extraction instructions. Configured downloads may come with a .json file to be referenced by the installer.
  • Extract the installation package. Execute the installer with sudo ./install.sh as sudoer user: 
    tutorial@2019OCL-Dev-Host:~/Downloads$ tar -xvf system_studio_2019.tar.gz
    tutorial@2019OCL-Dev-Host:~/Downloads$ cd system_studio_2019
    tutorial@2019OCL-Dev-Host:~/Downloads/system_studio_2019$ sudo ./install.sh
  • Set proxy as necessary
  • Install to 'this computer'
  • Review and accept license terms
  • Choose your preferred software improvement collection option. Consent greatly helps Intel® improve this product.
  • (Optional) Select deployment folder
  • (Optional) This package has been configured with the Intel® C++ Compiler and Intel® Vtune™ Amplifier in addition to the OpenCL™ Tools (Intel® SDK for OpenCL™ Applications). The additional tools are very useful for OpenCL™ developers, but strictly speaking are not required for OpenCL™ Tools features to be used.

 OpenCL™ Tools Installer Dialog

  • Errata: If you install Intel® System Studio 2019 initial release, ensure OpenCL™ 2.1 Experimental runtime for Intel® CPU Device component is selected. This tool is required for Eclipse* IDE plugin functionality. Starting with Update 1 and later, future releases deprecate and remove the Experimental runtime. The Experimental runtime is replaced by Intel® CPU Runtime for OpenCL™ Applications 18.1 and newer.
  • The Windriver* and Android* screens are unrelated to this article walkthrough.
  • After the components are installed, the 'Launch Intel(R) System Studio checkbox' can be deselected as the next section of this walkthrough demonstrates building from the command line.

 

Build

Two example source sets are in the .tar.gz archive attached to this article. They are entitled GPUOpenCLProjectForLinux and CPUOpenCLProjectForLinux. These sources match the two implementations installed earlier.

These example build commands demonstrate the two main build requirements: inclusion of OpenCL™ headers, and linking the libOpenCL.so (ICD loader library) runtime. The first program executes it's kernel on Intel® Graphics Technology. The second second program executes it's kernel on Intel® CPU and not for Intel® Graphics Technology. Here are command line examples to build the host side applications with Intel® System Studio 2019 redistributed headers and libraries:

#GFX
g++ -L/opt/intel/system_studio_2019/opencl/SDK/lib64 -I/opt/intel/system_studio_2019/opencl/SDK/include GPUOpenCLProjectforLinux.cpp  utils.cpp -Wno-deprecated-declarations -lOpenCL -o GPUOpenCLProjectforLinux

 

#CPU
g++ -L/opt/intel/system_studio_2019/opencl/SDK/lib64 -I/opt/intel/system_studio_2019/opencl/SDK/include CPUOpenCLProjectforLinux.cpp  utils.cpp -Wno-deprecated-declarations -lOpenCL -o CPUOpenCLProjectforLinux

 

The filesystem paths used in this example build reflect default Intel® System Studio 2019 initial release install locations. These may vary for the standalone SDK or custom tool installations.

Notice -Wno-deprecated-declarations is used in the build. This allows this build to proceed using deprecated function calls that can allow for OpenCL™ 1.2 API implementations in addition to the version 2.1 API implementations from Intel®. Developers may wish to maintain portability in a different manner than this example, however the example source regions relevant for this toggle are discussed in the code walkthrough section of this article.

Compiling the device side OpenCL-C program offline is a helpful practice, although not the only option. In this article we use offline compilation only for developer feedback:

/opt/intel/opencl/bin/ioc64 –device=gpu –input=TemplateGFX.cl –output=TemplateGFX.log

The ioc64 (or ioc32) offline compiler picks an OpenCL™ implementation with which to compile with the -device=<device> toggle. The ‘gpu’ device switch sends the kernel source through Intel® Graphics Compute Runtime for OpenCL™ Driver. The kernel is compiled for Intel® Graphics Technology.

/opt/intel/opencl/bin/ioc64 –device=cpu –input=TemplateCPU.cl –output=TemplateCPU.log

The ‘cpu’ switch sends the kernel through Intel® CPU Runtime for OpenCL™ Applications compiling the kernel for Intel® x86_64. The output log file shows build feedback. Note that the contents of cpu/TemplateCPU.cl and gfx/TemplateGFX.cl kernel sources are not the same. 

In this article, both CPU and Graphics kernels will also ultimately compile and execute at runtime.

See /opt/intel/opencl/bin/ioc64 -help for more information about offline compiler capabilities. They are also pasted in the source walkthrough section of the article for convenience.

Note for 2019 Initial Release: For CPU runtime builds, ioc64 may show a segfault if both the Intel® CPU Runtime for OpenCL™ Applications and OpenCL™ 2.1 Experimental runtime for Intel® CPU Device are simultaneously referenced in ICD files. Developers are highly encouraged to target Intel® CPU Runtime for OpenCL™ Applications over the Experimental runtime. Programs can effectively ignore the Experimental Runtime by setting aside the .icd file associated with it. Consider moving /etc/OpenCL/vendors/intel_exp64.icd out of the vendors folder to a backup location as a workaround.

 

Run

Execute the test application on Intel® Graphics Technology application and check the output:

./GPUOpenCLProjectforLinux

Output:

tutorial@2019OCL-Dev-Host:~/Downloads/ocl-tools-walkthrough-20181221/gfx$ ./GPUOpenCLProjectforLinux
Selected device type is CL_DEVICE_TYPE_GPU
Number of available platforms: 2
Selected platform: Intel(R) OpenCL HD Graphics
NDRange performance counter time 1.744000 ms.

Execute the test application on Intel® CPU and check the output:

./CPUOpenCLProjectforLinux

Output:

tutorial@2019OCL-Dev-Host:~/Downloads/ocl-tools-walkthrough-20181221/cpu$ ./CPUOpenCLProjectforLinux
Selected device type is CL_DEVICE_TYPE_CPU
Number of available platforms: 2
Selected platform: Intel(R) OpenCL HD Graphics
clGetDeviceIDs() returned CL_DEVICE_NOT_FOUND for platform Intel(R) OpenCL HD Graphics.
Selected platform: Intel(R) CPU Runtime for OpenCL(TM) Applications
NDRange performance counter time 0.904000 ms.


Device selection error feedback is presented in the output here to show no CPU device is available underneath the 'Intel(R) OpenCL HD Graphics' platform as expected.

Developer tips and a sample explanation are later on in the walkthrough. Newer developers in particular may gain insight into how OpenCL™ applications function.

 

Part 2:

 

IDE facilities

The OpenCL™ tools IDE plugin allows a quick view of OpenCL™ capabilities for the target system. These capabilities help developers understand how to target OpenCL™ devices. Launch the Eclipse based developer environment from the launcher icon placed on the Ubuntu* desktop or from the iss_ide_eclipse-launcher.sh script in the Intel® System Studio 2019 install directory. After launching the Intel® System Studio 2019 developer environment, go to Tools -> Intel Code Builder for OpenCL API -> Platform Info...:

Platform Information Menu Item

Three platforms and three devices are shown with drop down dialogs are shown on the example system here for completeness. They are 'Experimental Host Only OpenCL 2.1 Platform', 'Intel(R) CPU Runtime for OpenCL(TM) Applications', and 'Intel(R) HD OpenCL Graphics'.

In 2019 Update 1, Experimental runtime is removed. It is replaced by Intel® CPU Runtime for OpenCL™ Applications.

Closed Drop Downs

The first platform is the experimental platform. This platform has one device: the Experimental 2.1 OpenCL™ CPU device. This implementation is selected as an installed component through the Intel® System Studio 2019 Initial Release installer dialogs. This platform is for nonproduction purposes only. Developers should target Intel® CPU Runtime for OpenCL™ Applications version 18.1 or newer as a replacement for the Experimental runtime. The Experimental Runtime is required to run the IDE plugin in 2019 initial release and earlier.

The second platform is the the Intel® CPU Runtime for OpenCL™ Applications. It exposes the CPU target device. 

The third platform, Intel® HD OpenCL Graphics, exposes the Intel® Graphics Technology target device.

Open Drop Downs

Platform and Device names exposed here are expected to match interrogated parameters in developer programs from clGetPlatformInfo(...) and clGetDeviceInfo(...) OpenCL™ API calls. These names are subject to change depending on the operating system, version of the Intel® implementation, and underlying hardware.

Note: Production OpenCL™ platforms interrogated from Intel® implementations are consistently expected to contain the string "Intel" when using CL_PLATFORM_NAME with the clGetPlatformInfo(...) OpenCL™ API call.

Right clicking on a device will allow detailed device properties to be viewed.

Properties Dialog

For completeness, here are the device properties for the CPU device.

CPU Runtime Device Properties

Here are the properties for the Intel® Graphics Technology Device.

Graphics Device Properties

These properties match the various property bitflags discernable with clGetDeviceInfo(...) OpenCL™ API call. Some examples where properties are useful at runtime are: reading device OpenCL™ standard capabilities, sizing constraints for memory or OpenCL™ kernel work-items, and OpenCL™ device extentions.

The platform experimental runtime is shown here for completeness, but in Intel® System Studio 2019 initial release, it's highly recommended to target Intel® CPU Runtime for OpenCL Applications 18.1 as opposed to the Experimental Runtime . Using both experimental and Intel® CPU Runtime for OpenCL™ Applications in the same process is known to show issues leading to program segfaults. The Experimental Runtime was removed in 2019 Update 1.

 

Disk Provisions

Intel® System Studio 2019: OpenCL™ Tools come with standard headers and libraries from Khronos*. Key examples are the cl.h header and the OpenCL™ ICD interrogator loader library, libOpenCL.so. This header and library are needed to build and execute OpenCL™ programs. The libOpenCL.so library interrogates and routes OpenCL™ API function calls to a chosen OpenCL™ implementation. In our example programs, the routines create contexts and execute work on OpenCL™ devices. The main getting started guide diagrams the modular relationship of OpenCL™ implementations and the OpenCL™ ICD interrogator library.

Alternate versions of the OpenCL™ ICD interrogator library are available from third parties. One such example is the ocl-icd package available from various system package managers. The key to using any ICD library effectively is to ensure it’s OpenCL™ capability can support desired OpenCL™ implementation capabilities. For example:

  • Intel® Graphics Compute Runtime for OpenCL™ Driver offers an OpenCL™ 2.1 implementation for Intel® Iris™ Pro Graphics 580 on the Intel® Core™ i7-6770HQ Processor. The OpenCL™ ICD interrogator library included with Intel® System Studio 2019: OpenCL™ Tools is 2.1 capable.
  • Current Intel® Atom™ Processors’ Graphics hardware may support only up to OpenCL™ version 1.2, but the OpenCL™ 2.1 ICD interrogator library should still be able to resolve the legacy specification’s features.

The OpenCL™ implementation installers put their ICD loader library references in /etc/OpenCL/vendors. At runtime the ICD loader library uses these ICD files to route OpenCL™ API calls through the intended OpenCL™ implementation. The contents of the /opt/OpenCL/vendors folder are useful for understanding the deployments available on a system. Developers may observe ICD reference text files from multiple OpenCL™ device vendors in this folder. Example:

tutorial@2019OCL-Dev-Host:~$ ls -l /etc/OpenCL/vendors/
total 8
lrwxrwxrwx 1 root root 42 Nov 30 12:28 intel64.icd -> /etc/alternatives/opencl-intel-runtime-icd
-rw-r--r-- 1 root root 77 Oct 25 04:52 intel_exp64.icd
-rw-r--r-- 1 1020 1024 28 Oct 12 06:17 intel.icd
tutorial@2019OCL-Dev-Host:~$ ls -l /etc/alternatives/opencl-intel-runtime-icd
lrwxrwxrwx 1 root root 75 Nov 30 12:28 /etc/alternatives/opencl-intel-runtime-icd -> /opt/intel//opencl_compilers_and_libraries_18.1.0.013/linux/etc/intel64.icd
tutorial@2019OCL-Dev-Host:~$ cat /opt/intel//opencl_compilers_and_libraries_18.1.0.013/linux/etc/intel64.icd
/opt/intel/opencl_compilers_and_libraries_18.1.0.013/linux/compiler/lib/intel64_lin/libintelocl.so
tutorial@2019OCL-Dev-Host:~$ cat /etc/OpenCL/vendors/intel.icd
/usr/local/lib/libigdrcl.so
tutorial@2019OCL-Dev-Host:~$ cat /etc/OpenCL/vendors/intel_exp64.icd
/opt/intel/system_studio_2019/opencl/exp-runtime-2.1/lib64/libintelocl_2_1.so

In the above filesystem:

  • intel64.icd maps to Intel® CPU Runtime for OpenCL™ Applications.
  • intel.icd maps to Intel® Graphics Compute Runtime for OpenCL™ Driver.
  • intel_exp64.icd file is shown for completeness and it maps to the experimental 2.1 runtime included with Intel® System Studio 2019 initial release. Again, programs are recommended to target either Intel® CPU Runtime for OpenCL™ Applications 18.1 OR the Experimental Runtime, but not both. Targeting Intel® CPU Runtime for OpenCL™ Applications 18.1 or newer is strongly recommended where applicable. Using 2019 Update 1 or newer versions of the developer tools is highly recommended as the Experimental Runtime is deprecated and removed.

Developer Tips

Intel® recommends keeping the two main bottlenecks in mind for heterogeneous development:

  1. Minimizing offload transfer.
    • OpenCL™ devices may operate within a different memory domains. Minimizing transfers in these cases is crucial for performance.
    • OpenCL™ devices may share a memory domain with the host processor. The API typically exposes ways to avoid copying data all together in these cases. Such a shared configuration is typical of current Intel® Graphics Technology capable processors.
  2. Target device topology. Devices differ in number of compute units and other device characteristics. So, how to best schedule work to devices may change depending on device.

For information on the using the Intel® System Studio 2019: OpenCL™ tools or Intel SDK for OpenCL™ Applications 2019 standalone… see the Developer Guide.

For more on getting the most out of OpenCL™ programming for the CPU only implementation see the OpenCL™ Developer Guide for Intel® Core™ and Intel® Xeon processors.

For a a video instructional guide on general programming considerations for Intel® Graphics Technology check the video, "What Intel® Processor Graphics GEN9 Unlocks in OpenCL*", located at techdecoded.intel.io. Searching techdecoded.intel.io for 'opencl' yields other related content.

For a comprehensive look at Intel® Graphics Technology hardware, see this compute architecture overview document for Gen 9. Note that a manual attempt to walk through how OpenCL™ kernels and generated assembly is greatly obfuscated by the scheduling employed by the OpenCL™ runtime implementation.

Intel® FPGA products guidance can be accessed through fpgasoftware.intel.com.

  

Part 3

 

Explanation of sample source

The test application adds two input values together in an OpenCL™ kernel for each work-item, the result is available on the host after the kernel execution.

The Kernel - Intel® Graphics Technology

The texture sampler facility is used to read two texture sampler elements. The elements are added and the result is written to the output image. The kernel uses the read_imageui(…) and write_imageui(...) function calls available from the OpenCL-C language to use the device’s texture sampler facility. In the Intel® Graphics Technology implementation case there are multiple hardware texture samplers per device. This source file is stored on disk as TemplateGFX.cl for consumption at run time by the OpenCL™ host program.

constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

__kernel void Add(read_only image2d_t imageA, read_only image2d_t imageB, write_only image2d_t imageC)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);

    uint A = read_imageui(imageA, sampler, (int2)(x, y)).x;
    uint B = read_imageui(imageB, sampler, (int2)(x, y)).x;

    write_imageui(imageC, (int2)(x, y), A + B);
}

Note: There are more compute elements than texture samplers on current Intel® Graphics Technology hardware. Texture samplers are preferable for interpolation tasks. For simple reads or writes, operating on straight forward OpenCL™ buffers may be preferable. Memory staged for image2d_t objects may be reorganized by the OpenCL™ implementation for interpolation prior to kernel launch. Consider the opaque reorganization's cost to be ammortized by interpolation operations.

The Host Side Application - Intel® Graphics Technology

The host side application has all the OpenCL™ host API calls to set up a context, pick a target device, compile and execute the device kernel, and stage kernel input and output memory.

This example is not the only way to create an OpenCL™ host program, as production applications may wish to use different interrogation procedures, OpenCL™ API event facilities, or even different command queue methodology.  C++ developers may wish to consider the OpenCL™ C++ Wrapper API for use in C++ programs. This example is written in C++ and uses the OpenCL™ API and some C99 library routines to focus the walkthrough on base API usage.

Starting in main() function, notice the application picks a 1024 x 1024 array size for the image data. The sample doesn't read in and operate on a real image. Random data will be generated to fill the image data. However, the sizing for the input is critical later on to control the number of OpenCL™ work-items that are launched for the kernel.

cl_uint arrayWidth  = 1024;
cl_uint arrayHeight = 1024;

The FindOpenCLPlatform(...) routine finds an Intel® platform that has a device type that we specify, in this case we're looking for a platform with a GPU device. The program uses the clGetPlatformIDs(...) API call to get the number of OpenCL™ platforms and an array of OpenCL™ platform id's.

    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clGetplatform_ids() to get num platforms returned %s.\n", TranslateOpenCLError(err));
        return NULL;
    }
    LogInfo("Number of available platforms: %u\n", numPlatforms);

    if (0 == numPlatforms)
    {
        LogError("Error: No platforms found!\n");
        return NULL;
    }

    std::vector<cl_platform_id> platforms(numPlatforms);

    // Now, obtains a list of numPlatforms OpenCL platforms available
    // The list of platforms available will be returned in platforms
    err = clGetPlatformIDs(numPlatforms, &platforms[0], NULL);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clGetplatform_ids() to get platforms returned %s.\n", TranslateOpenCLError(err));
        return NULL;
    }

Each platform's devices are checked to see if the platform contains our preferred device type using clGetDeviceIDs(...) API call: 

    // Check if one of the available platform matches the preferred requirements
    for (cl_uint i = 0; i < numPlatforms; i++)
    {
        bool match = true;
        cl_uint numDevices = 0;

        // If the preferredPlatform is not NULL then check if platforms[i] is the required one
        // Otherwise, continue the check with platforms[i]
        if ((NULL != preferredPlatform) && (strlen(preferredPlatform) > 0))
        {
            // In case we're looking for a specific platform
            match = CheckPreferredPlatformMatch(platforms[i], preferredPlatform, platformStr);
        }

        // match is true if the platform's name is the required one or don't care (NULL)
        if (match)
        {
            // Obtains the number of deviceType devices available on platform
            // When the function failed we expect numDevices to be zero.
            // We ignore the function return value since a non-zero error code
            // could happen if this platform doesn't support the specified device type.
            err = clGetDeviceIDs(platforms[i], deviceType, 0, NULL, &numDevices);
            if (CL_SUCCESS != err)
            {
                LogError("clGetDeviceIDs() returned %s for platform %s.\n", TranslateOpenCLError(err), platformStr.c_str());
            }

            if (0 != numDevices)
            {
                // There is at list one device that answer the requirements
                return platforms[i];
            }
        }
    }

A new OpenCL™ context is created on the matching platform. The clCreateContextFromType(...) OpenCL™ API call creates the context associated with a device of our selected type. The clGetContextInfo(...) OpenCL™ API call allows the program to recover the selected device's id:

    // Create context with device of specified type.
    // Required device type is passed as function argument deviceType.
    // So you may use this function to create context for any CPU or GPU OpenCL device.
    // The creation is synchronized (pfn_notify is NULL) and NULL user_data
    cl_context_properties contextProperties[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)platformId, 0};
    ocl->context = clCreateContextFromType(contextProperties, deviceType, NULL, NULL, &err);
    if ((CL_SUCCESS != err) || (NULL == ocl->context))
    {
        LogError("Couldn't create a context, clCreateContextFromType() returned '%s'.\n", TranslateOpenCLError(err));
        return err;
    }

    // Query for OpenCL device which was used for context creation
    err = clGetContextInfo(ocl->context, CL_CONTEXT_DEVICES, sizeof(cl_device_id), &ocl->device, NULL);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clGetContextInfo() to get list of devices returned %s.\n", TranslateOpenCLError(err));
        return err;
    }

The GetPlatformAndDeviceVersion(...) function interrogates the given platform and device to understand OpenCL™ feature capability. The function allows for some flexibility in using OpenCL™ 1.2, OpenCL™ 2.0, or OpenCL™ 2.1 API calls as appropriate for target devices. The function also informs the program if OpenCL-C kernel language 2.0 features are supported.

    // Read the platform's version string
    // The read value returned in platformVersion
    err = clGetPlatformInfo(platformId, CL_PLATFORM_VERSION, stringLength, &platformVersion[0], NULL);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clGetplatform_ids() to get CL_PLATFORM_VERSION returned %s.\n", TranslateOpenCLError(err));
        return err;
    }

    if (strstr(&platformVersion[0], "2.0") != NULL)
    {
        ocl->platformVersion = OPENCL_VERSION_2_0;
    }

    if (strstr(&platformVersion[0], "2.1") != NULL)
    {
        ocl->platformVersion = OPENCL_VERSION_2_1;
    }

                      

    // Read the device's version string
    // The read value returned in deviceVersion
    err = clGetDeviceInfo(ocl->device, CL_DEVICE_VERSION, stringLength, &deviceVersion[0], NULL);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clGetDeviceInfo() to get CL_DEVICE_VERSION returned %s.\n", TranslateOpenCLError(err));
        return err;
    }

    if (strstr(&deviceVersion[0], "2.0") != NULL)
    {
        ocl->deviceVersion = OPENCL_VERSION_2_0;
    }

    if (strstr(&deviceVersion[0], "2.1") != NULL)
    {
        ocl->deviceVersion = OPENCL_VERSION_2_1;
    }

 

    // Read the device's OpenCL C version string
    // The read value returned in compilerVersion
    err = clGetDeviceInfo(ocl->device, CL_DEVICE_OPENCL_C_VERSION, stringLength, &compilerVersion[0], NULL);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clGetDeviceInfo() to get CL_DEVICE_OPENCL_C_VERSION returned %s.\n", TranslateOpenCLError(err));
        return err;
    }

 

Back in the SetupOpenCL(...) funtion, the program creates a command queue with the device context. Command queue creation changed between OpenCL™ 1.2 API and OpenCL™ 2.0 API, so there are helper macros to pick the correct API calls. A mismatch in command queue creation is a frequent build breaker for new developers. OpenCL™ API code developers review in the wild may not be explicitly labeled for it's OpenCL™ revision. The build example in the article uses -Wno-deprecated-declarations to easily leverage the older style command queue OpenCL™ API call:

    // Create command queue.
    // OpenCL kernels are enqueued for execution to a particular device through special objects called command queues.
    // Command queue guarantees some ordering between calls and other OpenCL commands.
    // Here you create a simple in-order OpenCL command queue that doesn't allow execution of two kernels in parallel on a target device.
#if defined(CL_VERSION_2_0) || defined(CL_VERSION_2_1)
    if (OPENCL_VERSION_2_0 == ocl->deviceVersion || OPENCL_VERSION_2_1 == ocl->deviceVersion )
    {
        const cl_command_queue_properties properties[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
        ocl->commandQueue = clCreateCommandQueueWithProperties(ocl->context, ocl->device, properties, &err);
    } 
    else {
        // default behavior: OpenCL 1.2
        cl_command_queue_properties properties = CL_QUEUE_PROFILING_ENABLE;
        ocl->commandQueue = clCreateCommandQueue(ocl->context, ocl->device, properties, &err);
    } 
#else
    // default behavior: OpenCL 1.2
    cl_command_queue_properties properties = CL_QUEUE_PROFILING_ENABLE;
    ocl->commandQueue = clCreateCommandQueue(ocl->context, ocl->device, properties, &err);
#endif

Note: const cl_command_queue_properties properties[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0}; The CL_QUEUE_PROFILING_ENABLE flag is useful for debugging. It can help provide useful information in concert with OpenCL™ events. Refer to the Khronos* documentation on using this feature for more information: clCreateCommandQueueWithProperties(...) clGetEventProfilingInfo(...). In production circumstances, enabling command queue profiling is often undesirable due to queue serialization.

Back in our main(...) function our input and output buffers are created. The aligned allocations assume 4K page size. This enables us to get the zero copy behavior to eliminate unnecessary copying. We take advantage of the unified host and device memory domain. See the Intel® article on zero copy for OpenCL for more information:

    // allocate working buffers. 
    // the buffer should be aligned with 4K page and size should fit 64-byte cached line
    cl_uint optimizedSize = ((sizeof(cl_int) * arrayWidth * arrayHeight - 1)/64 + 1) * 64;
    cl_int* inputA;
    cl_int* inputB;
    cl_int* outputC;
    if(posix_memalign((void**)&inputA, 4096, optimizedSize) != 0)
    {
        LogError("Error: posix_memalign failed to allocate buffer.\n");
        return -1;
    }
    if(posix_memalign((void**)&inputB, 4096, optimizedSize) != 0)
    {
        LogError("Error: posix_memalign failed to allocate buffer.\n");
        return -1;
    }
    if(posix_memalign((void**)&outputC, 4096, optimizedSize) != 0)
    {
        LogError("Error: posix_memalign failed to allocate buffer.\n");
        return -1;
    }

 

In int CreateBufferArguments(ocl_args_d_t *ocl, cl_int* inputA, cl_int* inputB, cl_int* outputC, cl_uint arrayWidth, cl_uint arrayHeight) image objects are created. A description data structure is configured to denote the image object parameters. An image object is created for the 2 inputs and 1 output. The results are 1-channel images with bitdepth of 32-bits. Data is stored with unsigned integer values. CL_MEM_USE_HOST_PTR allows buffers to exhibit zero copy behavior:

    // Define the image data-type and order -
    // one channel (R) with unit values
    format.image_channel_data_type = CL_UNSIGNED_INT32;
    format.image_channel_order     = CL_R;

    // Define the image properties (descriptor)
    desc.image_type        = CL_MEM_OBJECT_IMAGE2D;
    desc.image_width       = arrayWidth;
    desc.image_height      = arrayHeight;
    desc.image_depth       = 0;
    desc.image_array_size  = 1;
    desc.image_row_pitch   = 0;
    desc.image_slice_pitch = 0;
    desc.num_mip_levels    = 0;
    desc.num_samples       = 0;
#if defined(CL_VERSION_2_0) || defined(CL_VERSION_2_1)
    desc.mem_object        = NULL;
#else
    desc.buffer            = NULL;
#endif

    // Create first image based on host memory inputA
    ocl->srcA = clCreateImage(ocl->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, &format, &desc, inputA, &err);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clCreateImage for srcA returned %s\n", TranslateOpenCLError(err));
        return err;
    }

    // Create second image based on host memory inputB
    ocl->srcB = clCreateImage(ocl->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, &format, &desc, inputB, &err);
    if (CL_SUCCESS != err)
    {

 

Next, the program creates the kernel program object. The kernel program "Template.cl" is read in from source on disk. The source is then associated with the program object and context.

    // Upload the OpenCL C source code from the input file to source
    // The size of the C program is returned in sourceSize
    char* source = NULL;
    size_t src_size = 0;
    err = ReadSourceFromFile("Template.cl", &source, &src_size);
    if (CL_SUCCESS != err)
    {
        LogError("Error: ReadSourceFromFile returned %s.\n", TranslateOpenCLError(err));
        goto Finish;
    }

    // And now after you obtained a regular C string call clCreateProgramWithSource to create OpenCL program object.
    ocl->program = clCreateProgramWithSource(ocl->context, 1, (const char**)&source, &src_size, &err);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clCreateProgramWithSource returned %s.\n", TranslateOpenCLError(err));
        goto Finish;
    }

The program is built for the target context and device with clBuildProgram(...). Build feedback is recorded and is often very useful for scheduling and sizing kernel execution. Build feedback is also useful for debugging. The empty string is the argument where the OpenCL-C revision for the kernel program source or other preprocessor variables would be specified. Example "-cl-std=CL2.0". The OpenCL-C revision may or may not be the same as the OpenCL™ API revision. See the Khronos* official reference for more information:

    // Build the program
    // During creation a program is not built. You need to explicitly call build function.
    // Here you just use create-build sequence,
    // but there are also other possibilities when program consist of several parts,
    // some of which are libraries, and you may want to consider using clCompileProgram and clLinkProgram as
    // alternatives.
    err = clBuildProgram(ocl->program, 1, &ocl->device, "", NULL, NULL);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clBuildProgram() for source program returned %s.\n", TranslateOpenCLError(err));

        // In case of error print the build log to the standard output
        // First check the size of the log
        // Then allocate the memory and obtain the log from the program
        if (err == CL_BUILD_PROGRAM_FAILURE)
        {
            size_t log_size = 0;
            clGetProgramBuildInfo(ocl->program, ocl->device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);

            std::vector<char> build_log(log_size);
            clGetProgramBuildInfo(ocl->program, ocl->device, CL_PROGRAM_BUILD_LOG, log_size, &build_log[0], NULL);

            LogError("Error happened during the build of OpenCL program.\nBuild log:%s", &build_log[0]);
        }
    }

A kernel object is created. This kernel object is associated with the desired function, Add(...), in the kernel source file via character string "Add" in the second argument:

    // Program consists of kernels.
    // Each kernel can be called (enqueued) from the host part of OpenCL application.
    // To call the kernel, you need to create it from existing program.
    ocl.kernel = clCreateKernel(ocl.program, "Add", &err);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clCreateKernel returned %s\n", TranslateOpenCLError(err));
        return -1;
    }

Kernel arguments are bound. This binding maps image objects to parameters for the Add(...) kernel function. Now, the program can feed the proper host side data to the intended kernel parameters:

cl_uint SetKernelArguments(ocl_args_d_t *ocl)
{
    cl_int err = CL_SUCCESS;

    err  =  clSetKernelArg(ocl->kernel, 0, sizeof(cl_mem), (void *)&ocl->srcA);
    if (CL_SUCCESS != err)
    {
        LogError("error: Failed to set argument srcA, returned %s\n", TranslateOpenCLError(err));
        return err;
    }

    err  = clSetKernelArg(ocl->kernel, 1, sizeof(cl_mem), (void *)&ocl->srcB);
    if (CL_SUCCESS != err)
    {
        LogError("Error: Failed to set argument srcB, returned %s\n", TranslateOpenCLError(err));
        return err;
    }

    err  = clSetKernelArg(ocl->kernel, 2, sizeof(cl_mem), (void *)&ocl->dstMem);
    if (CL_SUCCESS != err)
    {
        LogError("Error: Failed to set argument dstMem, returned %s\n", TranslateOpenCLError(err));
        return err;
    }

    return err;
}

The built kernel program is now associated with the context for our Intel® Graphics Technology device and memory arguments are bound. The kernel program is enqueued on the command queue for execution. The enqueuue specifies the 2 dimensional hard coded size assigned at the beginning of the example. clFinish(...) OpenCL™ API call is used to block host program execution until the enqueued kernel has finished executing. The main performance timer measuring kernel performance begins and ends after these operations:

    // Define global iteration space for clEnqueueNDRangeKernel.
    size_t globalWorkSize[2] = {width, height};


    // execute kernel
    err = clEnqueueNDRangeKernel(ocl->commandQueue, ocl->kernel, 2, NULL, globalWorkSize, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
    {
        LogError("Error: Failed to run kernel, return %s\n", TranslateOpenCLError(err));
        return err;
    }

    // Wait until the queued kernel is completed by the device
    err = clFinish(ocl->commandQueue);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clFinish return %s\n", TranslateOpenCLError(err));
        return err;
    }

Next, the output is mapped back to a host pointer for result verification. This mapping operation avoids a copy and exploits the shared memory domain between the host and Intel® Graphics Technology hardware on this platform.

    cl_int *resultPtr = (cl_int *)clEnqueueMapImage(ocl->commandQueue, ocl->dstMem, true, CL_MAP_READ, origin, region, &image_row_pitch, &image_slice_pitch, 0, NULL, NULL, &err);

    if (CL_SUCCESS != err)
    {
        LogError("Error: clEnqueueMapBuffer returned %s\n", TranslateOpenCLError(err));
        return false;
    }

    // Call clFinish to guarantee that output region is updated
    err = clFinish(ocl->commandQueue);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clFinish returned %s\n", TranslateOpenCLError(err));
    }

The result is also computed on the host and compared to the result from the OpenCL™ device for verification. The source code that follows in GPUProjectforLinux.cpp is for tear down and clean-up purposes.

Intel® CPU

The Intel® CPU version of the source is mostly similar. Below we look at the key differences.

The Kernel - Intel® CPU

The CPU version does not show texture sampler usage in the sample. Hardware texture sampler access is only offered through the Intel® Graphics Technology implementation. The CPU runtime implements texture sampling functionality in software. The CPU the kernel in this example for the CPU target accesses kernel data through raw pointers. The data index desired for this work-item is calculated by a basic 2-dimensional to 1-dimensional transformation.

__kernel void Add(__global int* pA, __global int* pB, __global int* pC)
{
    const int x     = get_global_id(0);
    const int y     = get_global_id(1);
    const int width = get_global_size(0);

    const int id = y * width + x;

    pC[id] = pA[id] + pB[id];
}

The Host Side Application - Intel® CPU

The sample uses OpenCL™ basic buffer objects and not OpenCL™ images for preparing kernel data:

    // Create new OpenCL buffer objects
    // As these buffer are used only for read by the kernel, you are recommended to create it with flag CL_MEM_READ_ONLY.
    // Always set minimal read/write flags for buffers, it may lead to better performance because it allows runtime
    // to better organize data copying.
    // You use CL_MEM_COPY_HOST_PTR here, because the buffers should be populated with bytes at inputA and inputB.

    ocl->srcA = clCreateBuffer(ocl->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, inputA, &err);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clCreateBuffer for srcA returned %s\n", TranslateOpenCLError(err));
        return err;
    }

    ocl->srcB = clCreateBuffer(ocl->context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, inputB, &err);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clCreateBuffer for srcB returned %s\n", TranslateOpenCLError(err));
        return err;
    }

    // If the output buffer is created directly on top of output buffer using CL_MEM_USE_HOST_PTR,
    // then, depending on the OpenCL runtime implementation and hardware capabilities, 
    // it may save you not necessary data copying.
    // As it is known that output buffer will be write only, you explicitly declare it using CL_MEM_WRITE_ONLY.
    ocl->dstMem = clCreateBuffer(ocl->context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * arrayWidth * arrayHeight, outputC, &err);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clCreateBuffer for dstMem returned %s\n", TranslateOpenCLError(err));
        return err;
    }

The validation step maps the OpenCL™ result buffer to a host program pointer at the end of the program. clEnqueueMapBuffer(...) OpenCL™ API call is used:

    // Enqueue a command to map the buffer object (ocl->dstMem) into the host address space and returns a pointer to it
    // The map operation is blocking
    cl_int *resultPtr = (cl_int *)clEnqueueMapBuffer(ocl->commandQueue, ocl->dstMem, true, CL_MAP_READ, 0, sizeof(cl_uint) * width * height, 0, NULL, NULL, &err);

    if (CL_SUCCESS != err)
    {
        LogError("Error: clEnqueueMapBuffer returned %s\n", TranslateOpenCLError(err));
        return false;
    }

    // Call clFinish to guarantee that output region is updated
    err = clFinish(ocl->commandQueue);
    if (CL_SUCCESS != err)
    {
        LogError("Error: clFinish returned %s\n", TranslateOpenCLError(err));
    }

 

On error handing

Error handling, even with sandbox OpenCL™ development, triages many issues before they start. See the source examples for the TranslateOpenCLError(...) function definition to observe the various detailed error codes. These error codes are set by the OpenCL™ standard and are defined in the Khronos* standard headers. Always handle OpenCL™ API return values. Even for sandbox programs basic error handling is advised.

 

Offline compilation

The ioc64 offline compiler front end has a few different output format options for compiled kernel intermediates. These can be stored on disk and consumed by the OpenCL™ API at a later time In some cases linked to other compiled kernel objects for future builds. SPIR and SPIR-V provide useful targeting for OpenCL™ as well as non OpenCL™ kernels. SPIR provides a necessary level of kernel obfuscation for developers who wish not to distribute kernels in source text format. This avoids distributions as a text file (.cl) or constant string within the host binary. For developer reference, the -help usage output of ioc64 from 2019 initial release is provided here:

tutorial@2019OCL-Dev-Host:~$ ioc64 -help

usage: ioc64 <ARGUMENTS>
Intel(R) SDK for OpenCL(TM) - offline compiler command line, version 7.0.0.3099
Copyright (C) 2018 Intel Corporation.  All rights reserved.

ARGUMENTS:
    -cmd=<command>                   - Command to be performed:
                                      'build' create executable IR from source
                                       code (default if none specified)
                                      'compile' create compiled object IR from
                                       source code
                                      'link' create executable IR / library
                                       from object IR and libraries
    -input=<input_file_path>         - Build the OpenCL(TM) code given in
                                       <input_file_path> (use with the 'build'
                                       & 'compile' commands)
    -binary="<binary_files_paths>"   - Build/Link binary files, comma separated if
                                       more than one (use with 'build' & 'link' command)
    -version                         - Show tool version
    -help                            - Show available commands
    -device=<device_type>            - Set target device type:
                                       'cpu' for Intel(R) CPU device
                                       'gpu' for Intel(R) Graphics device (default)
                                       'fpga_fast_emu' Intel(R) FPGA Emulation Platform for OpenCL(TM) (preview)
                                       'cpu_2_1' for Intel(R) CPU on Experimental OpenCL(TM) 2.1 Platform
    -targetos=<os>                   - Set target operating system if it is
                                       different from current:
                                       (supported in 32-bit version only)
                                       'android' (use with 'cpu' device only)
    -simd=<instruction_set_arch>     - Set target instruction set architecture
                                       (use with 'cpu' or 'fpga_fast_emu' devices only):
                                       'sse42' for Intel(R) Streaming SIMD Extensions 4.2
                                       'avx' for Intel(R) Advanced Vector Extensions
                                       'avx2' for Intel(R) Advanced Vector Extensions 2
                                       'skx' for Intel(R) Advanced Vector Extensions 512
    -output[=<output_file_path>]     - Write the build log to <output_file_path>
    -asm[=<file_path>]               - Generate assembly code
    -llvm[=<file_path>]              - Generate llvm code
    -llvm-spir32[=<file_path>]       - Generate llvm spir code (32-bit)
    -llvm-spir64[=<file_path>]       - Generate llvm spir code (64-bit)
    -ir[=<file_path>]                - Generate intermediate binary file
    -spir32[=<file_path>]            - Generate SPIR (32-bit) binary file
    -spir64[=<file_path>]            - Generate SPIR (64-bit) binary file
    -spirv32[=<file_path>]           - Generate SPIR-V (32-bit) binary file
    -spirv64[=<file_path>]           - Generate SPIR-V (64-bit) binary file
    -txt-spirv32[=<file_path>]       - Generate SPIR-V (32-bit) code file
    -txt-spirv64[=<file_path>]       - Generate SPIR-V (64-bit) code file
    -bo="<build_options>"            - Add OpenCL(TM) build options
                             

Note: Future updates to ioc64 no longer support 'cpu_2_1' device toggle.

A build log may report the x86_64 instruction set architecture extensions used for the kernel. As of December 2018 and Intel® CPU Runtime for OpenCL™ Applications 18.1, avx512 generated kernels can be generated and executed on capable Intel® processors. See the -simd=<instruction_set_arch> toggle for more information.

More

For discussion on OpenCL™ developer components in Intel® System Studio 2019 and the Intel® SDK for OpenCL™ Applications 2019, join the community at the Intel® forums.

For more on CPU and Intel® Graphics Technology performance analysis for heterogeneous OpenCL™ programs, see Intel® VTune™ Amplifier, available as a standalone or as an Intel® System Studio 2019 component.

The OpenCL™ intercept layer instrumentation tool is a useful and lightweight. It can track API calls and provide some straightforward performance metrics.

*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.