Intel provides a new integrated OpenCL development experience

OpenCL support at Intel is now going mainstream, with full integration into Intel’s portfolio of software development suites. The features of the Intel® SDK for OpenCL™ Applications are now integrated into various development tools under the new name OpenCL™ Code Builder.

The different solutions are tailored to the target development environments:

clEnqueueReadBuffer transfers truncated data on HD4600

In certain cases clEnqueueReadBuffer doesn't transfer all the required data when executed on HD4600. System: Win7 x64, driver version, 32-bit application.

It seems that when the destination buffer is page-aligned and the transfer length is not a multiple of 4 KB, only a multiple of 4 KB is transferred. Sample code:
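The poster's sample is not reproduced in this listing. As a rough sketch of how the hypothesis could be checked on the host side, the helpers below assume 4 KB pages and a destination buffer that is prefilled with a sentinel before the read. The names bytes_if_page_truncated, prefill_sentinel and first_mismatch are hypothetical and not part of OpenCL. After the blocking clEnqueueReadBuffer returns, a first mismatch at exactly the rounded-down length would be consistent with the report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 4 KB pages, as described in the report. */
#define HYP_PAGE_BYTES 4096u

/* Bytes that would arrive if only whole 4 KB pages were copied. */
size_t bytes_if_page_truncated(size_t requested)
{
    return requested - (requested % HYP_PAGE_BYTES);
}

/* Fill the destination with a sentinel before the read so that bytes the
 * transfer never touched can be recognised afterwards. Pick a sentinel
 * that does not occur in the tail of the expected data. */
void prefill_sentinel(unsigned char *dst, size_t len, unsigned char sentinel)
{
    assert(dst != NULL);
    memset(dst, sentinel, len);
}

/* Offset of the first byte where dst differs from expected,
 * or len if every byte matches. */
size_t first_mismatch(const unsigned char *dst,
                      const unsigned char *expected, size_t len)
{
    size_t i;
    for (i = 0; i < len; ++i) {
        if (dst[i] != expected[i]) {
            return i;
        }
    }
    return len;
}
```

If first_mismatch returns the requested length, the whole transfer arrived; a result equal to bytes_if_page_truncated(length) points at the truncation described above.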

Intel Phi does not write data to buffer when clEnqueueReadBuffer is called on CentOS

Hello everyone!

I'm running into a problem where data is not being written to my buffer when the kernels finish. I've tested my kernel in isolation in Eclipse running in Ubuntu on an Intel i5 CPU and it seems to output the correct results. When I move it over to CentOS I can't get printf statements to return from the kernel and my output buffers are never written to. Here is an example of my code:

double * coef_elts = (double *) calloc(p * voxels, sizeof(double));

Low throughput - how to diagnose?


I've only recently started programming OpenCL on my Ivy Bridge GT2 (16 EU) laptop, but the results don't look promising for my use case. To narrow things down, I started with a very basic kernel that traverses a buffer holding 2D image data:

__kernel void image_scaling(__global const char* in, __global char* out, int inputStride)
{
    /* Linear index of this work item's pixel: row * stride + column. */
    unsigned int idx = get_global_id(1) * inputStride + get_global_id(0);

    char input_value = in[idx];

    out[idx] = input_value + 50;
}
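One way to start diagnosing throughput for a kernel like this is to compare its effective memory bandwidth with what the device can sustain. The sketch below is a host-side reference in plain C and assumes one byte read and one byte written per work item, as in the kernel above. The function names are hypothetical. The elapsed time is assumed to come from OpenCL event profiling (clGetEventProfilingInfo reports nanoseconds).

```c
#include <assert.h>
#include <stddef.h>

/* Host reference of the kernel: adds 50 to every pixel of a
 * width x height image whose rows are stride bytes apart. */
void image_scaling_ref(const char *in, char *out,
                       size_t width, size_t height, size_t stride)
{
    size_t x, y;
    for (y = 0; y < height; ++y) {
        for (x = 0; x < width; ++x) {
            size_t idx = y * stride + x;
            out[idx] = (char)(in[idx] + 50);
        }
    }
}

/* Effective bandwidth in GB/s for one kernel run.
 * Each work item moves 2 bytes (1 read + 1 write); bytes per
 * nanosecond is numerically equal to GB/s. */
double effective_bandwidth_gbs(size_t width, size_t height, double elapsed_ns)
{
    assert(elapsed_ns > 0.0);
    return (2.0 * (double)width * (double)height) / elapsed_ns;
}
```

A result far below the platform's memory bandwidth suggests the byte-wide accesses, rather than the arithmetic, are the limiting factor.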

Broadwell 64-bit ulong performance regression on min/max/compare

Intel IGP CL team,

I'm seeing a huge performance regression between Haswell and Broadwell when comparing 64-bit ulongs or performing min/max operations.

Please check the number of instructions that Broadwell is generating for the two ulong (64-bit) "compare-exchange" sequences below.

The 64-bit compare-exchange sequences run half as fast on Broadwell as on Haswell.

The 32-bit compare-exchanges appear to be correct (they're very fast).
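The poster's two sequences are not included in this listing. For readers unfamiliar with the term, a compare-exchange here is the sorting-network step that orders a pair of values. The plain C sketch below shows two common ways to write it for 64-bit unsigned values, one with min/max-style selects and one with a comparison and swap. It illustrates the operation only and is not the poster's code or the code the compiler generates.

```c
#include <stdint.h>

/* Compare-exchange via min/max: afterwards *a <= *b. */
void cmpxchg_minmax(uint64_t *a, uint64_t *b)
{
    uint64_t lo = (*a < *b) ? *a : *b;
    uint64_t hi = (*a < *b) ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Compare-exchange via comparison and conditional swap:
 * afterwards *a <= *b. */
void cmpxchg_swap(uint64_t *a, uint64_t *b)
{
    if (*a > *b) {
        uint64_t tmp = *a;
        *a = *b;
        *b = tmp;
    }
}
```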

OpenCL™ Device Fission for CPU Performance


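Device Fission is a feature of the OpenCL™ specification that gives OpenCL programmers more power and control over which compute units run OpenCL commands. Fundamentally, Device Fission allows a device to be subdivided into one or more sub-devices, which, when used properly, can provide a significant performance benefit, especially when running on the CPU.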
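In OpenCL 1.2 the subdivision is requested through clCreateSubDevices, and with the CL_DEVICE_PARTITION_EQUALLY property the device is split into as many sub-devices of n compute units as fit, with any remaining compute units left unused. The plain C sketch below only works out that arithmetic ahead of the API call. The struct and function names are hypothetical, and the article itself may describe a different partitioning scheme.

```c
#include <assert.h>

/* Result of planning an equal partition (hypothetical helper type). */
typedef struct {
    unsigned sub_devices;   /* number of sub-devices that would be created */
    unsigned unused_units;  /* compute units left out of every sub-device */
} equal_partition_plan;

/* Given the device's compute unit count and the requested units per
 * sub-device, compute how an equal partition would divide the device. */
equal_partition_plan plan_equal_partition(unsigned max_compute_units,
                                          unsigned units_per_sub_device)
{
    equal_partition_plan plan;
    assert(units_per_sub_device > 0);
    plan.sub_devices = max_compute_units / units_per_sub_device;
    plan.unused_units = max_compute_units % units_per_sub_device;
    return plan;
}
```

For example, an 8-unit CPU split into sub-devices of 3 units yields 2 sub-devices and leaves 2 units idle, so a divisor of the unit count is usually the better choice.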

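The Intel® SDK for OpenCL™ Applications is a comprehensive software development environment for OpenCL applications on Intel® architecture-based platforms. The SDK lets developers build OpenCL applications that target Intel® CPUs running Windows* and Linux* operating systems.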

Kernel debugging to find an erroring work item


While debugging with Code Builder in Visual Studio, I have a clEnqueueNDRangeKernel call failing for a certain kernel with a bad memory access. To find out which work item (of 512) is causing the error, I have had to gradually (and awkwardly) limit the participating work items until I find the item(s) requesting the bad access. I find myself in this position reasonably often: is there a more graceful way to do this?
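One way to make the manual narrowing less awkward is to bisect the work-item range, which takes about 10 runs for 512 items instead of many ad hoc ones. The C sketch below is only an outline under stated assumptions: the callback range_fails is hypothetical and would, in a real setup, launch the kernel over a sub-range using the global_work_offset and global_work_size arguments of clEnqueueNDRangeKernel and report whether the run failed. It also assumes the failure is detected reliably without taking down the process, and that work items do not depend on each other or on their work-group layout. fake_range_fails is a stand-in for trying the search without a device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: runs work items [offset, offset + count) and
 * returns nonzero if that run fails. */
typedef int (*range_fails_fn)(size_t offset, size_t count, void *ctx);

/* Bisect [0, total) to find the lowest-numbered failing work item.
 * Returns total if the full range does not fail. */
size_t find_first_failing_item(size_t total, range_fails_fn range_fails,
                               void *ctx)
{
    size_t lo = 0;
    size_t hi = total;
    assert(total > 0);
    if (!range_fails(lo, hi - lo, ctx)) {
        return total;
    }
    /* Invariant: the range [lo, hi) contains a failing item. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (range_fails(lo, mid - lo, ctx)) {
            hi = mid;   /* a failing item is in the lower half */
        } else {
            lo = mid;   /* lower half is clean, so look in the upper half */
        }
    }
    return lo;
}

/* Stand-in for a real launch: ctx points to the index of the one
 * work item that is pretended to be faulty. */
int fake_range_fails(size_t offset, size_t count, void *ctx)
{
    size_t bad = *(const size_t *)ctx;
    return bad >= offset && bad < offset + count;
}
```

An alternative that avoids repeated launches is to have the kernel bounds-check its own index and record the offending global ID in a debug buffer, at the cost of modifying the kernel.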


GL CL interop on CPU

I have a question regarding OpenCL / OpenGL interoperation when run on an Intel CPU.

I just don't quite understand the purpose of the cl_khr_gl_sharing extension on a CPU.

When a shared GL-CL context is created, must the GL device context and CL context both be associated with the same device?
Or am I able to have the GL context associated with my GPU, and the CL context associated with my CPU?
