Accelerate Performance Using OpenCL* with Intel® HD Graphics




Recently, Intel announced the release of the Intel® SDK for OpenCL Applications 2013 with certified OpenCL* 1.2 support on 3rd and 4th generation Intel® Core™ processors with the Intel® Iris Graphics and Intel® HD Graphics families. Visual computing applications such as content creation, home movies, music, and personal images will benefit from OpenCL general-purpose programmability together with access to both the CPU and the hardware acceleration capabilities of Intel® Iris Graphics and Intel® HD Graphics, increasing performance and improving battery life.

In fact, with the launch of the 3rd generation Intel® Core™ processors, Intel has made it possible for software developers to heterogeneously program both CPU and Intel HD Graphics through the OpenCL* framework.

This article documents lessons learned while accelerating video processing with OpenCL* on the 3rd generation Intel® Core™ processors. Common video effects in a commercial application were optimized with OpenCL, and then evaluated on Intel® HD Graphics 4000. Using OpenCL, key video effects were sped up by as much as 2.3x, and with further tuning for Intel® HD Graphics 4000, additional performance gains of up to 4.3x were achieved.

Many of the lessons learned and documented in this article are also applicable to the newer versions of Intel® Graphics Processors.


  1. Introduction

  2. Analysis of the Title

    1. General Optimizations

  3. Performance Case Studies

    1. Performance Case Study 1 – Lookup Tables

      Lookup Tables in OpenCL Kernels

      OpenCL Code on Kernels with LUTs

      Performance Optimization

    2. Performance Case Study 2 – Software Bilinear Interpolation

      Kernel Pseudo Code for Bilinear Interpolation

      Performance Optimization

    3. Performance Case Study 3 – Lens Flare

      Texture Traffic Overhead Using Multiple Kernels

      Performance Optimization

      Pseudo Code for Lens Flare

  4. Conclusion

  5. Acknowledgements


Introduction

A popular software title for professional video editing was updated in 2011 to accelerate video processing effects with OpenCL. Intel used this as an opportunity to test compatibility and runtime performance of OpenCL on the Intel® 3rd Generation Core Processor with Intel HD Graphics 4000, and to determine whether there were opportunities to enhance performance further. During development of the initial application release, over 60 video effects were accelerated with OpenCL, for which over 120 OpenCL kernels were implemented. With so many effects accelerated with OpenCL, it was essential to functionally test every OpenCL kernel and to assess its performance.

This article outlines lessons learned and some optimization techniques used while testing and assessing the performance of the OpenCL kernels.  In addition, performance and/or bottleneck issues found with some kernels and their solutions are outlined.

This article assumes the reader is familiar with the OpenCL programming model [1].  This article refers to the Intel® HD Graphics 4000 OpenCL as HDG or HDG OpenCL throughout.  The HDG OpenCL capability was first introduced with the Intel 3rd Generation Core Family of Processors codenamed Ivy Bridge.

Analysis of the Title

A set of workloads was released as part of the application's release kit to highlight video processing acceleration with OpenCL. The release kit consisted of seven workloads, each designed to test different video effects that are accelerated with OpenCL. These seven workloads were used for performance and functional analysis throughout the testing of the HDG implementation of OpenCL. A number of issues encountered in the OpenCL compiler and runtime were resolved along the way. HDG runtime challenges were also observed, and it took time to understand them before steps could be taken to address the performance bottlenecks. Figure 1 compares the initial performance to the optimized performance observed with HDG today.

The application release kit workloads were developed to determine increase in playback performance and decrease in render time. The kit is divided into 7 workloads, each showcasing different video effects. These included the following video effects, all implemented with OpenCL kernels.

Workload 1: includes a fade and a key framed Gaussian blur effect.

Workload 2: includes two crossfades, two simultaneous animated titles, a key framed black & white effect, and slow motion.

Workload 3: includes a key framed picture-in-picture sequence with clock wipe transition and an animated title. All of this is happening over a transition from the slow motion background image (using the Cross Effect transition) to the next clip.

Workload 4: includes an iris transition, a 2-track composite shot with 2 levels of secondary color correction and key framed cookie cutter effect, and a crossfade to a slow motion clip.

Workload 5: includes a PNG file which uses a Chroma key filter to composite it over a generated media lower third and a slow motion clip with a key framed Bump Map filter.

Workload 6: includes a two-track composite created by a Mask Generator filter on the top track’s event. It also features a key framed black & white effect, slow motion, and fades.

Workload 7: includes a PNG file with transparency over a clip with a key framed sepia effect, a key framed lens flare effect, and fades.

Figure 1. Performance Improvements

Figure 1 shows how early OpenCL performance compared to the performance observed with the majority of the OpenCL kernels optimized for HDG.

General Optimizations

This section outlines lessons learned and general optimizations that produced measurable performance improvements in OpenCL kernels. These optimizations were integrated into the release of the workloads shown in Figure 1.

  • Use 4x2 Chroma blocks when converting color formats, e.g., YUV to RGBA. This yields as much as 20% better performance compared to processing 2x2 Chroma blocks.
  • Use fmin(a, b) instead of min(a, b) for float data types.
  • Use native built-in functions cautiously. Most native functions yield better performance, but not all. For example, to raise y to the power 3.0f, use y * y * y instead of pow(y, 3.0f) or even native_powr(y, 3.0f). The code generated for the multiplication is more than 2 ms faster on HDG than either built-in function.
  • Use bit-wise operations for Boolean comparisons whenever possible; e.g., use if (y & x) instead of if (y && x). The two forms are equivalent only when x and y each hold 0 or 1, for example the results of scalar comparisons. This optimization improves performance especially if the kernel is big, usually 4K instructions or bigger.
  • Use multiplication and the truncate function trunc(…) to get the fractional part of a product instead of fmod. The built-in fp = fmod(x * y, 1.0f) is slow compared to fp = x * y; fp = fp - trunc(fp).
  • Eliminate arithmetic on invariant values in kernel code. Move such computations to host code whenever possible. E.g., in y = a * b + c, if c is the only value that varies, compute a * b in host code and pass the result to the kernel as an argument.
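
Three of the guidelines above can be sketched in plain C, which is close enough to OpenCL C to show the shape of each rewrite. This is a minimal illustration, not code from the application: the function names are hypothetical, and a cast stands in for the OpenCL trunc(…) built-in so that the sketch has no math-library dependency.

```c
#include <assert.h>

/* Cube by repeated multiplication, the form preferred over pow(y, 3.0f). */
float cube(float y)
{
    return y * y * y;
}

/* Fractional part of x * y: multiply, then subtract the truncated value.
   In an OpenCL kernel the truncation would be trunc(fp); the cast used
   here is equivalent only while fp fits in a long. */
float frac_of_product(float x, float y)
{
    float fp = x * y;
    assert(fp > -2.0e9f && fp < 2.0e9f);   /* keep the cast well defined */
    return fp - (float)(long)fp;
}

/* Invariant hoisting: the host computes ab = a * b once and passes it in,
   so the per-pixel work shrinks to a single addition. */
float offset_with_hoisted_product(float ab, float c)
{
    return ab + c;
}
```

Whether each rewrite actually pays off depends on the device and compiler, so the gains quoted above should be re-measured rather than assumed on other hardware.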

For additional OpenCL optimization guidelines, refer to the Intel OpenCL Optimization guide.

Performance Case Studies

This section documents three case studies of OpenCL kernels optimized for HDG:  1) use of shared local memory, 2) use of defined filters in the HDG OpenCL engine, as well as 3) transfer and handling of texture data on HDG.  The OpenCL kernels used in these studies came from a professional video editing application.  The optimizations outlined in the article were scheduled to release in subsequent application updates.

Performance Case Study 1 – Lookup Tables

This case study documents why and how lookup tables (LUT) present performance bottlenecks in OpenCL kernels. The study examines the performance of the “color curves” video effect program where three lookup tables are used.  Analysis of HDG OpenCL capabilities determined that HDG Shared Local Memory (SLM) would speed up kernels that use LUTs. This case study proves that use of HDG SLM can in fact speed up the performance of OpenCL kernels that use LUTs. The case study also shows how simple it is to further accelerate video processing using SLM in OpenCL kernels where appropriate.

Lookup Tables in OpenCL Kernels

Processing lookup table data in OpenCL kernels in most cases creates a performance bottleneck. This is due to the large number of data transfers for lookup table data that occur between main memory and the memory available for the OpenCL device. The more the kernel is “compute” bound (limited by the number of computations and not the number of pixels rendered), the more severe the performance impact is. This is especially important since kernels should always be designed to be “compute” bound to maximize performance as hardware becomes more capable. If an OpenCL kernel is not compute bound, the kernel program should be redesigned or the algorithm should not be programmed using OpenCL at all. In general, lookup tables should be avoided in OpenCL kernels if at all possible because LUTs preclude a kernel from being compute bound. Avoiding lookup tables and whether the kernel is compute bound or not are topics for another paper. For now, we’ll examine how lookup tables can manifest as performance bottlenecks and what can be done to prevent bottlenecks.

Lookup tables will almost always create a performance bottleneck when the lookup table data is large, generally more than 256 bytes.  As lookup table data is being accessed by hundreds of OpenCL HW threads, lots of data transfers occur between main system memory and HDG memory.  Inherently, these data transfers between the two system memories incur latency and/or collisions.  The data transfer latency and access hits slow the HDG OpenCL compute engine, which in turn prevents the kernel from running optimally.

OpenCL Code on Kernels with LUTs

Consider the kernel code from the “color curves” video effect in Figure 2. Notice that in addition to other parameters, the kernel also has three parameters (global pointers) for lookup tables: lutR, lutG, and lutB. The OpenCL keyword “__global” implies that the memory comes from the global memory pool, which in the HDG architecture usually means cache memory. Data held in this memory has to travel to the kernel along a slow path. The two uchar4 initializers show the LUT data being used: the code indexes into the tables to retrieve LUT values. The indexes are computed from the incoming image pixel data (not shown here). Notice that each kernel thread retrieves six values from system memory. The data transfer latency is compounded because a large number of OpenCL hardware threads run for the kernel, all trying to hit the same memory address space.

__kernel void ColorCurves(…,  __global uchar* lutR,  __global uchar* lutG,   __global uchar* lutB)
       …… code omitted
        uchar4 vlow = {lutR[blow.s0], lutG[blow.s1], lutB[blow.s2], 0};
        uchar4 vhigh = {lutR[bhigh.s0], lutG[bhigh.s1], lutB[bhigh.s2], 0};
      …… code omitted


Figure 2. Partial Code for Color Curves Video Effect

So what can be done to avoid bottlenecks when using lookup tables? In most cases the answer is as easy as using the HDG local memory.  Local memory is also referred to as SLM (Shared Local Memory) because variables which use the “__local” prefix are allocated in local memory and local memory is shared by all work-items in a work-group. For more details on OpenCL semantics refer to the OpenCL Specification.

Performance Optimization

If at all possible, lookup table data should be copied to shared local memory. Using SLM prevents excessive shuttling of lookup table data between kernel threads thus greatly minimizing data transfer penalties.  With the memory latency removed or minimized, the kernel compute throughput will no longer be bogged down and should show substantially better performance.  The code in Figure 3 illustrates how to use SLM for lookup tables to prevent performance bottlenecks in this OpenCL kernel.

typedef struct {            // Lookup table data packed into one buffer to minimize clSetKernelArg(…) calls.
    uchar lutR[256];
    uchar lutG[256];
    uchar lutB[256];
} KernelLutRGB;

__kernel void ColorCurves(…,  __global const KernelLutRGB* lutRGB)     // kernel takes one lookup table buffer
      …… code omitted for simplicity

      __local uchar lutR[256];
      __local uchar lutG[256];
      __local uchar lutB[256];

      int init = get_local_id(0)+get_local_size(0)*get_local_id(1);    // flattened ID of this work-item
      int step = get_local_size(0)*get_local_size(1);                  // number of work-items in the group

      for (int i = init; i < 256; i += step) {     // each work-item copies its share of the tables
            lutR[i] = lutRGB->lutR[i];
            lutG[i] = lutRGB->lutG[i];
            lutB[i] = lutRGB->lutB[i];
      }
      barrier(CLK_LOCAL_MEM_FENCE);                // wait until the whole work-group has finished copying
      …… code omitted
      uchar4 vlow = {lutR[blow.s0], lutG[blow.s1], lutB[blow.s2], 0};
      uchar4 vhigh = {lutR[bhigh.s0], lutG[bhigh.s1], lutB[bhigh.s2], 0};
      …… code omitted


Figure 3. Partial Code with SLM for Color Curves Video Effect

  • get_local_size(0|1) – each call returns the local work size specified when the kernel was enqueued. When the local work size is not specified, the OpenCL runtime selects the most appropriate one based on the global work size. These values remain constant across all kernel HW threads. In this case the local work size was not specified and the HDG OpenCL engine selected [64, 8], so get_local_size(0) always returned 64 and get_local_size(1) returned 8.
  • get_local_id(0|1) – each call returns a value within the range of the local work size. get_local_id(0) returns a value from 0 to 63, while get_local_id(1) returns a value from 0 to 7.

The barrier(CLK_LOCAL_MEM_FENCE) call blocks all kernel HW threads in a work-group until every thread has copied its chunk of the global lookup table data to local memory. The kernel code still indexes through the lookup tables, but those tables now reside in local memory. Because the data resides in local memory, the data transfer latency is avoided, which speeds up the compute part of the kernel and thus achieves much better performance. Removing the performance bottleneck in kernels that use lookup tables is often as simple as using SLM.
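
The init/step indexing used for the copy can be checked on the CPU. The following is a minimal sketch that emulates one work-group in plain C, under the assumption that the work-group is two-dimensional as in the text; simulate_group_copy and its parameters are hypothetical names, and the loops over lx and ly stand in for get_local_id(0) and get_local_id(1).

```c
#include <assert.h>

enum { LUT_SIZE = 256 };

/* Emulates the strided copy performed by every work-item of one work-group.
   copies[i] counts how many work-items wrote entry i. Returns the number
   of entries that were written exactly once. */
int simulate_group_copy(int size_x, int size_y, const unsigned char *src,
                        unsigned char *dst, int *copies)
{
    assert(size_x > 0 && size_y > 0);
    for (int i = 0; i < LUT_SIZE; i++)
        copies[i] = 0;

    int step = size_x * size_y;                 /* get_local_size(0) * get_local_size(1) */
    for (int ly = 0; ly < size_y; ly++) {       /* stands in for get_local_id(1) */
        for (int lx = 0; lx < size_x; lx++) {   /* stands in for get_local_id(0) */
            int init = lx + size_x * ly;
            for (int i = init; i < LUT_SIZE; i += step) {
                dst[i] = src[i];
                copies[i]++;
            }
        }
    }
    /* On the device, the work-group barrier would sit here. */

    int once = 0;
    for (int i = 0; i < LUT_SIZE; i++)
        once += (copies[i] == 1);
    return once;
}
```

With the [64, 8] work-group reported above there are 512 work-items for 256 entries, so the first 256 work-items copy one entry each and the rest copy nothing. A smaller group such as [16, 4] has each work-item copy four entries. In both cases every entry is written exactly once, which is why no entry is missed or duplicated regardless of the local size the runtime picks.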

Table 1 shows the performance measured on the color curves effect after it was optimized to use SLM. A stand-alone application was written, and a single image (1440x1080) was used to assess the performance of the OpenCL kernel for the color curves effect. The stand-alone application's host code looped the execution of the OpenCL kernel 100 times. The kernel with no SLM took about 76.5 milliseconds to process the effect on the image, while the kernel optimized with SLM took only 18 milliseconds. The kernel is therefore roughly 4.3x faster with SLM than without it (76.5 / 18 ≈ 4.25).

Table 1. Performance of Color Curves with and without SLM

Color Curves run type          Elapsed Time (ms)
OpenCL kernel with no SLM      76.5
OpenCL kernel with SLM         18


Performance was measured on an Intel® 3rd generation Core processor with HDG OpenCL and compared to the same processor running the kernel on four logical threads. For additional system details, refer to the system information in Appendix A.

As observed in this study, using SLM clearly benefits the runtime performance of OpenCL kernels. However, using SLM comes with some restrictions. It is important to understand the limitations and capabilities of the OpenCL device hardware when programming OpenCL kernels to use SLM. For HDG OpenCL, as a rule of thumb, do not use SLM if the LUT size is larger than 4K bytes.

Figure 4 is a screenshot of the color curve (infrared) effect, showing the output after the different colors have been computed.

Figure 4. Sample Video Output of the Color Curves Effect

Performance Case Study 2 – Software Bilinear Interpolation

The OpenCL 1.1 specification supports the bilinear interpolation (BLI) algorithm through the required CLK_FILTER_LINEAR sampler filter. According to the specification, a read image function called with a sampler that uses the CLK_FILTER_LINEAR filter returns the bilinearly interpolated pixel.

Early on in the effort to accelerate video effects with HDG OpenCL, the HDG BLI capability was found to produce incorrect results.  Due to these incorrect results, the video editing app programmers ended up writing their own bilinear interpolation functionality in OpenCL kernel code. The incorrect results issue was promptly fixed in the HDG graphics driver and the performance of both BLI implementations was compared. This case study shows that HDG hardware implementation for BLI achieves better performance than that of the functionality written with OpenCL.

Kernel Pseudo Code for Bilinear Interpolation

The code for the page peel video effect is shown in Figure 5 and it illustrates both the software implementation for BLI as well as the use of the CLK_FILTER_LINEAR to achieve bilinear interpolation with HDG hardware.

As the code shows, the software implementation of BLI relies on four calls to the “read image” built-in function plus additional arithmetic. In contrast, the second listing in Figure 5 uses the CLK_FILTER_LINEAR filter in a sampler. The sampler is passed as an argument to the “read image” function, which tells HDG to use hardware BLI. Note that the ScaledSampler sampler is defined so that the CLK_FILTER_LINEAR filter is used, and that with this sampler only one call to “read image” is needed. The two listings in Figure 5 differ mainly in the interpolation function that the kernel calls.

Software Version of BLI

__constant sampler_t UnscaledSampler = …;
__constant sampler_t ScaledSampler = …;      // defined with the CLK_FILTER_LINEAR filter

float4 swBilinearInterpolation(input, coords)
{
   float2 fVal  = …;   // derived from incoming coords
   int2 iCoords = …;   // derived from incoming coords
   float y0 = 1.0f - fVal.y;
   float x0 = 1.0f - fVal.x;

   // four reads, weighted and summed in kernel code
   float4 pixel = y0 * x0 * read_imagef(input, UnscaledSampler, iCoords) +
                  fVal.y * x0 * read_imagef(input, UnscaledSampler, iCoords + (int2)(0, 1)) +
                  y0 * fVal.x * read_imagef(input, UnscaledSampler, iCoords + (int2)(1, 0)) +
                  fVal.y * fVal.x * read_imagef(input, UnscaledSampler, iCoords + (int2)(1, 1));
   return pixel;
}

__kernel void PagePeel(inputA, inputB, outDst, …)
{
   uint cx = get_global_id(0);
   uint cy = get_global_id(1);
   int2 icoord = (int2)(cx, cy);

   float4 dst;
   float4 aPixel = read_imagef(inputA, UnscaledSampler, icoord);

   float fXSrc = … X progress of the effect;
   float fYSrc = … Y progress of the effect;
   float4 bPixel = swBilinearInterpolation(inputB, (float2)(fXSrc+0.5f, fYSrc+0.5f));

   dst = mix(aPixel, bPixel, …);
   write_imagef(outDst, icoord, dst);
}

Using HW Version of BLI

float4 hwBilinearInterpolation(input, coords)
{
   // one read; the sampler performs the interpolation in hardware
   float4 pixel = read_imagef(input, ScaledSampler, coords);
   return pixel;
}

__kernel void PagePeel(inputA, inputB, outDst, …)
{
   uint x = get_global_id(0);
   uint y = get_global_id(1);
   int2 coord = (int2)(x, y);
   float4 aPixel = read_imagef(inputA, UnscaledSampler, coord);

   float fXSrc = … progress based calculated X coordinate;
   float fYSrc = … progress based calculated Y coordinate;

   float4 biPixel = hwBilinearInterpolation(inputB, (float2)(fXSrc, fYSrc));

   float4 dst = mix(aPixel, biPixel, …);
   write_imagef(outDst, coord, dst);
}

Figure 5. Partial OpenCL Code for a PagePeel Transition Effect

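The OpenCL 1.1 specification documents the CLK_FILTER_LINEAR filter as a weighted sum that calculates the interpolated pixel result as follows:

T = (1 - y) * (1 - x) * T_i0j0
   + y * (1 - x) * T_i1j0
   + (1 - y) * x * T_i0j1
   + y * x * T_i1j1

Here T_i0j0, T_i1j0, T_i0j1, and T_i1j1 are the four texels surrounding the sample point, and x and y are the fractional interpolation weights applied along the j and i directions, respectively. In this kernel the sampling coordinates are computed per work-item, starting from the thread IDs returned by get_global_id(0) and get_global_id(1).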
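
The weighted sum above can be written out directly in C as a small reference function. This is only a sketch of the formula as printed in this article, using a scalar value per texel; the function and parameter names are hypothetical, and it says nothing about how the hardware sampler is implemented internally.

```c
#include <assert.h>

/* Weighted sum of the four neighbouring texels, term by term as in the
   formula above: y weights the i direction and x weights the j direction. */
float bilinear_weighted_sum(float t_i0j0, float t_i1j0,
                            float t_i0j1, float t_i1j1,
                            float x, float y)
{
    assert(x >= 0.0f && x <= 1.0f && y >= 0.0f && y <= 1.0f);
    return (1.0f - y) * (1.0f - x) * t_i0j0
         + y * (1.0f - x) * t_i1j0
         + (1.0f - y) * x * t_i0j1
         + y * x * t_i1j1;
}
```

At the corners (weights of 0 or 1) the function returns one texel unchanged, and at x = y = 0.5 it returns the average of all four. The four weights always sum to 1, so the result never leaves the range of the input texels.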

Performance Optimization

With the HDG BLI capability fixed, the performance of the software implementation of BLI was compared to the performance of the HDG hardware implementation. As expected, HDG hardware BLI outperformed the software version.

When optimizing an OpenCL kernel for performance on HDG, a kernel that uses software BLI is unlikely to perform optimally. The good news is that this performance issue can easily be avoided by using HDG's own BLI instead of an implementation written in OpenCL. BLI is commonly used in several video effects, some of which were used to gauge the performance deltas and the consistency of the output. The HDG BLI consistently outperformed the software BLI implementation. Depending on the video effect, HDG BLI was between 1.25x and 1.3x faster than the software function.

Figure 6 is a screenshot of the page peel effect being used in a video transition scenario.

Figure 6. Sample Video Output with Page Transition Effect

For those curious about the performance of the C/C++ and OpenCL implementations for the page peel effect, Table 2 shows the performance achieved in frames per second.  The HDG OpenCL implementation showed a 2.5x better performance over the CPU C/C++ implementation, and achieved an additional 1.25x with use of HDG HW bilinear interpolation.

Page peel run type                     FPS
Native C/C++
OpenCL (Software BLI)
OpenCL (HW bilinear interpolation)

Table 2. Performance of page peel effect with software and hardware BLI

For additional test system details, refer to the system information in Appendix A.

Performance Case Study 3 – Lens Flare

Although OpenCL best practices and optimization guidelines suggest writing kernels with as few instructions as possible, there are exceptions to this advice. This case study explores performance shortcomings of the lens flare video effect, which required six kernels, and compares its performance against a monolithic one-kernel solution.

Texture Traffic Overhead Using Multiple Kernels

Multiple OpenCL kernels are usually viewed as an optimal design solution for video effects where multiple independent video elements are added to the video output. In practice, a single kernel would minimize texture traffic overhead and it might be a better solution in terms of performance.  This case study highlights the lens flare video effect which uses six kernels.  Each kernel was designed to draw a lens effect element: poly, ring, circle, diffused ray, thin ring, and sunburst onto the video output.

Depending on the lens flare effect setting, a particular kernel would be executed multiple times to draw multiple instances of the same element on the same video frame. This required multiple passes over the same image, and each pass reads and writes the video frame again, which creates texture traffic overhead. This traffic overhead was determined to slow down the processing of the video effect. A one-kernel approach was proposed to eliminate the redundant texture traffic and improve performance. This case study outlines the performance results with the one-kernel solution.
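
The traffic argument can be made concrete with a small CPU-side model in C. This is a simplified sketch, not the application's code: the frame is a flat array of floats updated in place, apply_element is a hypothetical stand-in for drawing one flare element, and the counters only tally pixel loads and stores rather than modeling the HDG texture path.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t reads;    /* frame pixels loaded */
    size_t writes;   /* frame pixels stored */
} Traffic;

/* Hypothetical stand-in for one flare element: adjusts the pixel color. */
static float apply_element(float color, int element)
{
    return color + (float)(element + 1);
}

/* Multi-pass model: every element is a separate pass over the whole frame. */
Traffic draw_multi_pass(float *frame, size_t pixels, int elements)
{
    Traffic t = {0, 0};
    assert(frame != NULL);
    for (int e = 0; e < elements; e++) {
        for (size_t p = 0; p < pixels; p++) {
            float color = frame[p];
            t.reads++;
            frame[p] = apply_element(color, e);
            t.writes++;
        }
    }
    return t;
}

/* Single-pass model: read each pixel once, apply all elements, write once. */
Traffic draw_single_pass(float *frame, size_t pixels, int elements)
{
    Traffic t = {0, 0};
    assert(frame != NULL);
    for (size_t p = 0; p < pixels; p++) {
        float color = frame[p];
        t.reads++;
        for (int e = 0; e < elements; e++)
            color = apply_element(color, e);
        frame[p] = color;
        t.writes++;
    }
    return t;
}
```

Both functions produce the same frame, but the multi-pass version touches every pixel once per element while the single-pass version touches it once in total. The model leaves out the cost of the per-element branching that a combined kernel needs, which is one plausible reason the measured gains depend on the number of elements.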

Performance Optimization

To consolidate six kernels into one, the unique code was taken from each of six kernels and turned into six functions which are called from the main kernel.  A specific function would be called within a loop to draw multiple elements as needed. Surprisingly, not all of the settings of the lens flare effect showed performance improvement with this approach. In fact, two of the settings showed minor performance degradation. Table 3 includes the performance metrics observed with the six kernels and the one-kernel implementations. The one-kernel solution sped up three of the five settings while decreasing performance of the other two settings.

Lens Flare run type          6-kernels FPS    1-kernel FPS
Setting 1 – 8 elements
Setting 2 – 13 elements
Setting 3 – 9 elements
Setting 4 – 23 elements
Setting 5 – 16 elements

Table 3. One-Kernel vs. Multiple Kernel Performance Metrics

Table 3 shows that as the number of elements to draw increases, the one-kernel implementation achieves better performance. It also shows that if the number of elements to draw is fewer than 10, the six-kernel implementation yields better performance. At the time of this article's publication, further optimization of the functions engaged for setting 1 and setting 3 was still being pursued. It is still possible that the one-kernel solution might perform better even for lens flare effects with fewer than 10 elements.

Figure 7 is a screenshot of the video with setting 5 of lens flare effect.

Figure 7. Sample Video Output with Lens Flare Effect

Pseudo Code for Lens Flare

The pseudo code below includes both host and OpenCL code for the six kernels as well as for the one-kernel implementations of the lens flare effect. Figure 8 shows the host code and Figure 9 shows the OpenCL kernel code. Some code is omitted to simplify and to help illustrate key code changes.

Host code (Six Kernels)

hostCodeFunction(…)  {

   flareElements = N;      // N elements to draw

   size_t sizeBuf = sizeof(flarePropBuf);
   fPropBuf = clCreateBuffer(clContext, ..., sizeBuf, &flarePropBuf, …);

   for (long i = 0; i < flareElements; i++)  {

       cl_kernel kernel = kernelAry[i];        // multiple kernels

       status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
       status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);

       status = clSetKernelArg(kernel, argN, sizeof(cl_mem), &fPropBuf);

       size_t global[2] = {nWidth, nHeight};

       // one enqueue, and one pass over the frame, per element
       clEnqueueNDRangeKernel(cmdQueue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
   }
}

Host code (One Kernel)

hostCodeFunction(…)  {

   flareElements = N;      // N elements to draw

   // gather the properties of every element into one array
   for (long i = 0; i < flareElements; i++)  {
       ElemProps[i].eType = flareProp[i].eType;
       ElemProps[i].data1 = flareProp[i].data1;
       ElemProps[i].dataN = flareProp[i].dataN;
   }

   size_t sizeBuf = sizeof(ElemProps);
   flarePropBuf = clCreateBuffer(clContext, …, sizeBuf, &ElemProps, &status);

   cl_kernel kernel = kernelHandle;     // one kernel

   status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &src);
   status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst);

   status = clSetKernelArg(kernel, 4, sizeof(cl_mem), &flarePropBuf);

   size_t global[2] = {nWidth, nHeight};

   // a single enqueue draws all elements
   clEnqueueNDRangeKernel(cmdQueue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
}

Figure 8. Partial Host Code for one-kernel and six-kernel Lens Flare Implementation

OpenCL Code – CL file with Six Kernels

__constant sampler_t  iSampler = …

__kernel void lensflare_e1 (input, output, … , __constant KernelFlareProp* pFlareInfo)
{
   int cx = get_global_id(0);  int cy = get_global_id(1);
   int2 coord = (int2)(cx, cy);
   float4 color = read_imagef(input, iSampler, coord);

   if (within bounds) {
        eval_color = e1Eval (…., pFlareInfo);
        color =  apply_eval_color;
   }
   // write into output
   write_imagef(output, coord, color);
}

__kernel void lensflare_e2 (input, output, …, __constant KernelFlareProp* pFlareInfo)
{
   int cx = get_global_id(0);  int cy = get_global_id(1);
   int2 coord = (int2)(cx, cy);
   float4 color = read_imagef(input, iSampler, coord);

   if (within bounds) {
        eval_color = e2Eval (…., pFlareInfo);
        color =  apply_eval_color;
   }
   // write into output
   write_imagef(output, coord, color);
}

__kernel void lensflare_eN(…..) { … }

OpenCL Code – CL file with One Kernel

__constant sampler_t  iSampler = …

__kernel void lens_flare_evals (input, output, …, __constant KernelFlareProp* pFlareInfo)
{
   int cx = get_global_id(0);  int cy = get_global_id(1);
   int2 coord = (int2)(cx, cy);
   float4 color = read_imagef(input, iSampler, coord);      // frame is read once

   int nElems = (int)pFlareInfo->lensElemCount;

   // Get properties for each element to evaluate
   for (int i = 0; i < nElems; i++)  {
        eType = (int)pFlareInfo->newElemProps[i].eType;
        data1 = pFlareInfo->newElemProps[i].data1;

        dataN = pFlareInfo->newElemProps[i].dataN;

        if (within bounds) {
             if (eType == e1)  {
                  eval_color = e1Eval (…., pFlareInfo);
             } else if (eType == e2)  {
                  eval_color = e2Eval (…, pFlareInfo);
             } else if (….) {
             } // end of eType
             color =  apply_eval_color;
        }
   } // end of for loop
   // write into output
   write_imagef(output, coord, color);                      // frame is written once
}

Figure 9. Partial OpenCL Kernel Code for one-kernel and six-kernel Lens Flare Implementation



Conclusion

In summary, video and image processing can be accelerated with OpenCL, and further optimization can be achieved on HDG with additional work. For most applications, excellent performance improvements will be observed on Intel® 3rd Generation Core Processors with HD Graphics running OpenCL when compared to equivalent C/C++ code. Should the performance of a given kernel not improve as expected, consider the optimization techniques outlined in this paper. Refer to the Intel OpenCL optimization guide for detailed optimization techniques and methods to tune OpenCL kernels for HDG.

Appendix A: System Information

Test system information:

System Item
  • Core i7-3667U @ 2.0GHz
  • Intel® HD Graphics 4000 @ 350 MHz Core Clock
  • 4 GB
  • 1696 MB

System Item
  • Windows 8 Pro x64
  • Graphics Driver
  • Video BIOS

Application Configuration

System Item
  • Graphics Settings
  • 1440x1080 UI Preview Display

Appendix B: Tools

Tools used for the analysis.

  • Intel® OpenCL 2013 SDK Kernel Analyzer
  • Intel® Graphics Performance Analyzer 4.0

Appendix C: References

  1. Getting Started with OpenCL
  2. Latest OpenCL SDK, available from Intel® Native Development Environment (INDE) or from Intel® Media Server Studio
  3. Intel OpenCL Optimization Guide
  4. Intel OpenCL SDK 2013
  5. OpenCL 1.1 and 1.2 Specifications


Acknowledgements

Thanks to Justin Landon, Bradley Werth, Jun De Vega, and others for reviewing the article and for their valuable contributions and suggestions.
