Accelerate Performance Using OpenCL* with Intel® HD Graphics


Abstract


Recently, Intel announced the release of the Intel® SDK for OpenCL Applications 2013 with certified OpenCL* 1.2 support on 3rd and 4th generation Intel® Core™ processors with Intel® Iris Graphics and the Intel® HD Graphics family. Visual computing applications such as content creation, home movies, music, and personal images benefit from OpenCL's general-purpose programmability combined with access to the power of the CPU and the hardware acceleration capabilities of Intel® Iris Graphics and Intel® HD Graphics to increase performance and improve battery life.

In fact, with the launch of the 3rd generation Intel® Core™ processors, Intel has made it possible for software developers to heterogeneously program both CPU and Intel HD Graphics through the OpenCL* framework.

This article documents lessons learned while accelerating video processing with OpenCL* on the 3rd generation Intel® Core™ processors. Common video effects in a commercial application were optimized with OpenCL, and then evaluated on Intel® HD Graphics 4000. Using OpenCL, key video effects were sped up by as much as 2.3x, and with further tuning for Intel® HD Graphics 4000, additional performance gains of up to 4.3x were achieved.

Many of the lessons learned and documented in this article are also applicable to the new versions of Intel® Iris Graphics and Intel® HD Graphics family.

Contents


  1. Introduction

  2. Analysis of the Title

    1. General Optimizations

  3. Performance Case Studies

    1. Performance Case Study 1 – Lookup Tables

      Lookup Tables in OpenCL Kernels

      OpenCL Code on Kernels with LUTs

      Performance Optimization

    2. Performance Case Study 2 – Software Bilinear Interpolation

      Kernel Pseudo Code for Bilinear Interpolation

      Performance Optimization

    3. Performance Case Study 3 – Lens Flare

      Texture Traffic Overhead Using Multiple Kernels

      Performance Optimization

      Pseudo Code for Lens Flare

  4. Conclusion

  5. About the Author

Introduction


A popular software title for professional video editing was updated to accelerate video processing effects with OpenCL. Intel used this as an opportunity to test the compatibility and runtime performance of OpenCL on 3rd generation Intel® Core™ processors with Intel® HD Graphics 4000, and to determine whether there were opportunities to enhance performance further. During development of the initial application release, over 60 video effects were accelerated with OpenCL, for which over 120 OpenCL kernels were implemented. With so many effects accelerated with OpenCL, it was essential to functionally test and assess the performance of every OpenCL kernel.

This article outlines lessons learned and some optimization techniques used while testing and assessing the performance of the OpenCL kernels. In addition, performance and/or bottleneck issues found with some kernels and their solutions are outlined.

This article assumes the reader is familiar with the OpenCL programming model.

Analysis of the Title


The application's release kit included seven workloads designed to test different video effects that can be accelerated with OpenCL. These workloads were used in performance and functional analysis throughout the testing of the Intel® HD Graphics implementation of OpenCL. Issues encountered in the OpenCL compiler and runtime were resolved. Some runtime challenges took time to understand before steps could be taken to address and optimize the performance bottlenecks. Figure 1 compares the initial performance to the optimized performance observed today.

The application release kit workloads were developed to determine increases in playback performance and decreases in render times. The different video effects showcased in the seven workloads, all implemented with OpenCL kernels, are:

Workload 1: a fade and a key framed Gaussian blur effect.

Workload 2: two crossfades, two simultaneous animated titles, a key framed black & white effect, and slow motion.

Workload 3: a key framed picture-in-picture sequence with a clock wipe transition and an animated title, all happening over a transition from the slow-motion background image (using the Cross Effect transition) to the next clip.

Workload 4: an iris transition, a 2-track composite shot with 2 levels of secondary color correction and key framed cookie cutter effect, and a crossfade to a slow motion clip.

Workload 5: a PNG file composited over a generated-media lower third using a Chroma key filter, and a slow-motion clip with a key framed Bump Map filter.

Workload 6: a two-track composite created by a Mask Generator filter on the top track’s event. It also features a key framed black & white effect, slow motion, and fades.

Workload 7: a PNG file with transparency over a clip with a key framed sepia effect, a key framed lens flare effect, and fades.


Figure 1. Performance Improvements

Figure 1 shows how early performance compared to the performance observed with the OpenCL kernels after optimization.

General Optimizations


This section outlines lessons learned and general optimizations for creating measurable performance improvements in OpenCL kernels. These optimizations were integrated into the release of the workloads shown in Figure 1.

      • Use 4x2 Chroma blocks when converting color formats, e.g., YUV to RGBA. This yields as much as 20% better performance compared to processing 2x2 Chroma blocks.
      • Use fmin(a, b) instead of min(a, b) for float data types.
      • Use native built-in functions cautiously. Most native functions yield better performance, but not all. For example, use y * y * y instead of pow(y, 3.0f) or even native_powr(y, 3.0f); the code generated for the multiplications is more than 2 ms faster on Intel® HD Graphics than either built-in function.
      • Use bit-wise operations for Boolean comparisons whenever possible; e.g., use if (y & x) instead of if (y && x). This optimization improves performance especially if the kernel is big, usually 4K instructions or more.
      • Use multiplication and the truncate function trunc(…) to get the fractional part of a value instead of fmod(…). The built-in fp = fmod(x * y, 1.0f) is slow compared to fp = x * y; fp = fp - trunc(fp).
      • Eliminate arithmetic operations on invariant variables in kernel code and move those computations to host code whenever possible. For example, for y = a * b + c, if c is the only variant, compute a * b in host code and pass the value of a * b as an argument to the kernel. (A sketch illustrating these last two items appears after this list.)
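
The hypothetical kernel below sketches the last two items, assuming a simple one-dimensional float buffer. The kernel name, parameters, and the pre-computed value passed from the host are illustrative only and are not taken from the application discussed in this article.

__kernel void FractionExample(__global const float* in,
                              __global float* out,
                              float scale,          // hypothetical per-frame constant
                              float a_times_b)      // a * b pre-computed in host code
{
    int gid = get_global_id(0);

    // Fractional part via multiply + trunc instead of fmod(x, 1.0f).
    float x  = in[gid] * scale;
    float fp = x - trunc(x);

    // The invariant product a * b arrives as a kernel argument,
    // so no per-work-item multiplication is needed here.
    out[gid] = fp + a_times_b;
}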

For additional OpenCL optimization guidelines, refer to the Intel® SDK for OpenCL* Applications 2013 OpenCL* Optimization Guide.

Performance Case Studies


Three case studies of OpenCL kernels optimized for Intel® HD Graphics are detailed in this section: 1) use of shared local memory, 2) use of the built-in sampler filters (hardware bilinear interpolation), and 3) transfer and handling of texture data. The OpenCL kernels used in these studies came from a professional video editing application.

Performance Case Study 1 – Lookup Tables


This case study documents why and how lookup tables (LUT) present performance bottlenecks in OpenCL kernels. The study examines the performance of the “color curves” video effect program where three lookup tables are used. Analysis of OpenCL capabilities determined that Intel® HD Graphics Shared Local Memory (SLM) would speed up kernels that use LUTs. This case study proves that the use of SLM can in fact speed up the performance of OpenCL kernels that use LUTs. The case study also shows how simple it is to further accelerate video processing using SLM in OpenCL kernels where appropriate.

Lookup Tables in OpenCL Kernels

Processing lookup table data in OpenCL kernels in most cases creates a performance bottleneck. This is due to the large number of data transfers for lookup table data that occur between main memory and the memory available for the OpenCL device. The more the kernel is “compute” bound (limited by the number of computations and not the number of pixels rendered), the more severe the performance impact is. This is especially important since kernels should always be designed to be “compute” bound to maximize performance as hardware becomes more capable. If an OpenCL kernel is not compute bound, the kernel program should be redesigned or the algorithm should not be programmed using OpenCL at all. In general, lookup tables should be avoided in OpenCL kernels if at all possible because LUTs preclude a kernel from being compute bound. Avoiding lookup tables and whether the kernel is compute bound or not are topics for another paper. For now, we’ll examine how lookup tables can manifest as performance bottlenecks and what can be done to prevent bottlenecks.

Lookup tables will almost always create a performance bottleneck when the lookup table data is large, generally more than 256 bytes. As lookup table data is accessed by hundreds of OpenCL HW threads, many data transfers occur between main system memory and graphics memory. Inherently, these data transfers between the two memories incur latency and/or collisions. The data transfer latency and access contention slow the OpenCL compute engine, which in turn prevents the kernel from running optimally.

OpenCL Code on Kernels with LUTs

Consider the kernel code from the “color curves” video effect in Figure 2. Notice that in addition to other parameters, the kernel has three parameters (global pointers) for lookup tables: lutR, lutG, and lutB. The OpenCL keyword “__global” indicates that the memory comes from the global memory pool, which in the Intel® HD Graphics architecture usually means cache memory; data held in this memory has to be transferred to the kernel along a slow path. The code in Figure 2 indexes through the tables to retrieve LUT data, with the indexes computed from the incoming image pixel data (not shown here). Notice there are six values to be retrieved by each kernel thread from system memory. The data transfer latency is compounded as the large number of OpenCL hardware threads running the kernel try to hit the same memory address space.


__kernel void ColorCurves(…, 
__global uchar* lutR,  __global uchar* lutG,   __global uchar* lutB)
{
       …… code omitted
        uchar4 vlow = {lutR[blow.s0], lutG[blow.s1], lutB[blow.s2], 0};
        uchar4 vhigh = {lutR[bhigh.s0], lutG[bhigh.s1], lutB[bhigh.s2], 0};
      …… code omitted
}

Figure 2. Partial Code for Color Curves Video Effect

So what can be done to avoid bottlenecks when using lookup tables? In most cases the answer is as simple as using local memory. Local memory is also referred to as SLM (shared local memory): variables declared with the “__local” prefix are allocated in local memory, which is shared by all work-items in a work-group. For more details on OpenCL semantics, refer to the OpenCL Specification.

Performance Optimization

If at all possible, lookup table data should be copied to shared local memory. Using SLM avoids repeated fetches of the same table data from global memory, greatly reducing data transfer penalties. With the memory latency removed or minimized, the kernel's compute throughput is no longer bogged down and should show substantially better performance. The code in Figure 3 illustrates how to use SLM for lookup tables to prevent performance bottlenecks in this OpenCL kernel.

typedef struct {            // Lookup table data encapsulated into one buffer to minimize clSetKernelArg(…) calls.
    uchar lutR[256];
    uchar lutG[256];
    uchar lutB[256];
} KernelLutRGB;

__kernel void ColorCurves(…,  __global KernelLutRGB* lutRGB)                  // kernel now takes a single lookup table buffer
{
      …… code omitted for simplicity

      __local uchar lutR[256];
      __local uchar lutG[256];
      __local uchar lutB[256];

      int init = get_local_id(0)+get_local_size(0)*get_local_id(1);  
      int step = get_local_size(0)*get_local_size(1);

      for (int i = init; i < 256; i += step)
      {
            lutR[i] = lutRGB->lutR[i];
            lutG[i] = lutRGB->lutG[i];
            lutB[i] = lutRGB->lutB[i];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
  
      …… code omitted
     uchar4 vlow = {lutR[blow.s0], lutG[blow.s1], lutB[blow.s2], 0};
     uchar4 vhigh = {lutR[bhigh.s0], lutG[bhigh.s1], lutB[bhigh.s2], 0};
     …… code omitted
}

Figure 3. Partial Code with SLM for Color Curves Video Effect

      • get_local_size(0|1) – each function returns the corresponding dimension of the local work size specified for the kernel execution. When the local work size is not specified, the OpenCL runtime selects the most appropriate value based on the global work size. These values remain constant across all kernel HW threads. In this case the local work size was not specified, the OpenCL engine selected [64, 8], so get_local_size(0) always returned 64 and get_local_size(1) returned 8.
      • get_local_id(0|1) – each function returns a value within the range of the local work size: get_local_id(0) returns a value between 0–63, while get_local_id(1) returns a value between 0–7.

The barrier(CLK_LOCAL_MEM_FENCE) call blocks all kernel HW threads until every thread has copied its corresponding chunk of the global lookup table data to local memory. With the [64, 8] local work size selected here, step is 512, so each of the 256 entries per table is copied by exactly one work-item and the remaining work-items skip the loop. The kernel code still indexes through the lookup tables, but those tables now reside in local memory. Because the data resides in local memory, the data transfer latency is avoided, which speeds up the compute part of the kernel and thus achieves much better performance. Removing the performance bottleneck on kernels that use lookup tables is often as simple as using SLM.
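
On the host side, the three tables can be packed into the KernelLutRGB structure and passed to the kernel as a single buffer, which is what the comment in Figure 3 means by minimizing clSetKernelArg(…) calls. The following is a minimal sketch of that setup; it assumes the OpenCL context and kernel objects already exist, and the variable names are hypothetical.

// Pack the three 256-entry tables into one structure and create a single buffer.
KernelLutRGB lutData;
/* ... fill lutData.lutR, lutData.lutG, and lutData.lutB ... */

cl_int err = CL_SUCCESS;
cl_mem lutBuf = clCreateBuffer(context,
                               CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(KernelLutRGB), &lutData, &err);

// One kernel argument covers all three lookup tables.
err = clSetKernelArg(colorCurvesKernel, lutArgIndex, sizeof(cl_mem), &lutBuf);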

Table 1 shows the performance measured for the color curves effect, which was optimized to use SLM. A stand-alone application was written, and a single image (1440x1080) was used to assess the performance of the OpenCL kernel for the color curves effect. The stand-alone application host code looped the execution of the OpenCL kernel 100 times. The kernel with no SLM took about 76.5 milliseconds to process the effect on the image, while the kernel optimized with SLM took only 18 milliseconds. The SLM version is therefore 4.3x faster than the kernel not using SLM.

Table 1. Performance of Color Curves with and without SLM

Color Curves run type         Elapsed Time (ms)
OpenCL kernel with no SLM     76.5
OpenCL kernel with SLM        18.0

Performance was measured on a 3rd generation Intel® Core™ processor with Intel® HD Graphics 4000 and compared to the same processor running the kernel on four logical threads. For additional system details, refer to the system information in Appendix A.

As observed in this study, using SLM clearly benefits the runtime performance of OpenCL kernels. However, using SLM comes with some restrictions. It is important to understand the OpenCL device hardware limitations and capabilities when programming OpenCL kernels to use SLM. As a rule of thumb, do not use SLM if the LUT size is larger than 4K bytes.
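
As a run-time sanity check, the host code can query how much local memory the device exposes before committing to the SLM path. This is a minimal sketch using the standard clGetDeviceInfo query; the surrounding variables (device, KernelLutRGB) are assumed to exist.

// Query the local (shared) memory size of the OpenCL device.
cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(localMemSize), &localMemSize, NULL);

// Fall back to the global-memory path if the packed LUTs do not fit.
int useSlm = (sizeof(KernelLutRGB) <= localMemSize);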

Figure 4 is a screenshot of the color curve (infrared) effect, showing the output after the different colors have been computed.


Figure 4. Sample Video Output of the Color Curves Effect

Performance Case Study 2 – Software Bilinear Interpolation

The OpenCL 1.1 specification supports the bilinear interpolation (BLI) algorithm through the required CLK_FILTER_LINEAR filter. When the read image function is called with a sampler that uses the CLK_FILTER_LINEAR filter, it returns the exact equivalent of a bilinearly interpolated pixel.

Early in the effort to accelerate video effects with OpenCL, the Intel® HD Graphics BLI capability was found to produce incorrect results. Because of this, the video editing application's programmers wrote their own bilinear interpolation functionality in OpenCL kernel code. The incorrect-results issue was promptly fixed in the graphics driver, and the performance of both BLI implementations was then compared. This case study shows that the hardware implementation of BLI achieves better performance than the equivalent functionality written in OpenCL kernel code.

Kernel Pseudo Code for Bilinear Interpolation

The code for the page peel video effect, shown in Figure 5, illustrates both the software implementation of BLI and the use of CLK_FILTER_LINEAR to achieve bilinear interpolation with Intel® HD Graphics hardware.

As the code shows, the software implementation of BLI relies on four calls to the read_imagef built-in function plus additional arithmetic. In contrast, the second part of Figure 5 shows the use of the CLK_FILTER_LINEAR filter in a sampler. The sampler is passed as an argument to read_imagef, which instructs the hardware to perform BLI. Note that the ScaledSampler sampler is defined with the CLK_FILTER_LINEAR filter, and with this sampler only one call to read_imagef is needed. Figure 5 shows the differences between the two implementations.

Software Version of BLI

__constant sampler_t
UnscaledSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
__constant sampler_t
ScaledSampler   = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_LINEAR;

float4 swBilinearInterpolation( input, coords )
{
   float2 fVal  = …; // fractional part, derived from incoming coords
   int2 iCoords = …; // integer part, derived from incoming coords
   float y0 = 1.0f - fVal.y;
   float x0 = 1.0f - fVal.x;

   float4 pixel = y0 * x0         * read_imagef(input, UnscaledSampler, iCoords) +
                  fVal.y * x0     * read_imagef(input, UnscaledSampler, iCoords + (int2)(0, 1)) +
                  y0 * fVal.x     * read_imagef(input, UnscaledSampler, iCoords + (int2)(1, 0)) +
                  fVal.y * fVal.x * read_imagef(input, UnscaledSampler, iCoords + (int2)(1, 1));
   return pixel;
}
__kernel void PagePeel( inputA, inputB, outDst, … )
{
    uint cx = get_global_id(0);
    uint cy = get_global_id(1);
    int2 icoord = (int2)(cx, cy);

    float4 dst;
    float4 aPixel = read_imagef(inputA, UnscaledSampler, icoord);

    float fXSrc = …; // X progress of the effect
    float fYSrc = …; // Y progress of the effect
    float4 bPixel = swBilinearInterpolation(inputB, (float2)(fXSrc + 0.5f, fYSrc + 0.5f));

    dst = mix(aPixel, bPixel, …);
    write_imagef(outDst, icoord, dst);
}

Using HW Version of BLI

float4 hwBilinearInterpolation( input, coords )
{
   float4 pixel = read_imagef(input, ScaledSampler, coords);   // hardware BLI via CLK_FILTER_LINEAR
   return pixel;
}
__kernel void PagePeel( inputA, inputB, outDst, … )
{
    uint x = get_global_id(0);
    uint y = get_global_id(1);
    int2 coord = (int2)(x, y);
    …
    float4 aPixel = read_imagef(inputA, UnscaledSampler, coord);
    …

    float fXSrc = …; // progress-based X coordinate
    float fYSrc = …; // progress-based Y coordinate

    float4 biPixel = hwBilinearInterpolation(inputB, (float2)(fXSrc, fYSrc));
    …

    float4 dst = mix(aPixel, biPixel, …);
    write_imagef(outDst, coord, dst);
}

Figure 5. Partial OpenCL Code for a PagePeel Transition Effect

The OpenCL 1.1 specification defines the CLK_FILTER_LINEAR filter as calculating the interpolated pixel result as follows:

T = (1 - y) * (1 - x) * Ti0j0
   + y * (1 - x) * Ti1j0
   + (1 - y) * x * Ti0j1
   + y * x * Ti1j1

where x and y are the fractional coordinate weights derived from the thread IDs returned by get_global_id(0) and get_global_id(1), respectively.
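
For reference, the same weighting can be written as a small scalar helper. This is only an illustration of the formula above for a single channel; it is not code from the application.

// Single-channel reference implementation of the weighting shown above.
// t00, t10, t01, and t11 correspond to Ti0j0, Ti1j0, Ti0j1, and Ti1j1;
// x and y are the fractional weights.
float bilinearWeight(float t00, float t10, float t01, float t11, float x, float y)
{
    return (1.0f - y) * (1.0f - x) * t00 +
                   y  * (1.0f - x) * t10 +
           (1.0f - y) *         x  * t01 +
                   y  *         x  * t11;
}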

Performance Optimization

With the BLI capability fixed, performance of the software implementation for BLI was compared to the performance of the hardware implementation of BLI. As expected, the hardware BLI greatly outperformed the software version.

When optimizing an OpenCL kernel for performance, if software BLI is being used, the kernel is unlikely to perform optimally. The good news is that this performance issue can easily be avoided by using the Intel® HD Graphics hardware BLI instead of an OpenCL-programmed implementation. BLI is commonly used in several video effects, some of which were used to gauge the performance deltas and verify consistency of the output. The hardware BLI consistently outperformed the software BLI implementation; depending on the video effect, hardware BLI was between 1.25x and 1.3x faster than the software function.

Figure 6 is a screenshot of the page peel effect being used in a video transition scenario.


Figure 6. Sample Video Output with Page Transition Effect

For those curious about the performance of the C/C++ and OpenCL implementations of the page peel effect, Table 2 shows the performance achieved in frames per second. The OpenCL implementation showed 2.5x better performance than the CPU C/C++ implementation, and achieved an additional 1.25x speedup with hardware bilinear interpolation.

Page peel run type                    FPS
Native C/C++                          9.8
OpenCL (Software BLI)                 24.6
OpenCL (HW bilinear interpolation)    30.7

Table 2. Performance of page peel effect with software and hardware BLI

For additional test system details, refer to the system information in Appendix A.

Performance Case Study 3 – Lens Flare

Even though OpenCL best practices and optimization guidelines suggest programming kernels with as few instructions as possible, there are exceptions to this advice. This case study explores the performance shortcomings of the lens flare video effect, which required six kernels, and compares that against a monolithic one-kernel solution.

Texture Traffic Overhead Using Multiple Kernels

Multiple OpenCL kernels are usually viewed as an optimal design for video effects where multiple independent video elements are added to the video output. In practice, a single kernel minimizes texture traffic overhead and may be the better-performing solution. This case study highlights the lens flare video effect, which uses six kernels. Each kernel was designed to draw one lens effect element (poly, ring, circle, diffused ray, thin ring, or sunburst) onto the video output.

Depending on the lens flare effect setting, a particular kernel could be executed multiple times to draw multiple instances of the same element on the same video frame. This required multiple passes over the same image, and the resulting texture traffic was determined to slow down the processing of the video effect. A one-kernel approach was proposed to eliminate the extra texture traffic and improve performance; this case study outlines the performance results of that solution.

Performance Optimization

To consolidate six kernels into one, the unique code was taken from each of the six kernels and turned into six functions called from the main kernel. A specific function is called within a loop to draw as many instances of each element as needed. Surprisingly, not all of the lens flare settings showed a performance improvement with this approach; in fact, two of the settings showed minor performance degradation. Table 3 shows the performance metrics observed with the six-kernel and one-kernel implementations. The one-kernel solution sped up three of the five settings while slightly decreasing the performance of the other two.

Lens Flare run type        6-kernel FPS    1-kernel FPS    Scale
Setting 1 – 8 elements     16.9            14.2            0.84
Setting 2 – 13 elements    12.6            15.7            1.25
Setting 3 – 9 elements     17.8            16.7            0.94
Setting 4 – 23 elements    8.9             14.4            1.62
Setting 5 – 16 elements    11.8            15.7            1.33

Table 3. One-Kernel vs. Multiple Kernel Performance Metrics

Table 3 shows that as the number of elements to draw increases, the one-kernel implementation achieves better performance. It also shows that if the number of elements to draw is less than 10, the six-kernel implementation yields better performance. At the time of this article's publication, further optimization of the functions used in setting 1 and setting 3 was still being pursued, so it is possible that the consolidated one-kernel solution will eventually perform better even for lens flare effects with fewer than 10 elements.

Figure 7 is a screenshot of the video with setting 5 of the lens flare effect.


Figure 7. Sample Video Output with Lens Flare Effect

Pseudo Code for Lens Flare

The pseudo code below includes both host and OpenCL code for the six kernels as well as for the one-kernel implementations of the lens flare effect. Figure 8 shows the host code and Figure 9 shows the OpenCL kernel code. Some code is omitted to simplify and to help illustrate key code changes.

Host code (Six Kernels)

hostCodeFunction(…)  {

  flareElements = N elements to draw;

  size_t sizeBuf = sizeof(flarePropBuf);
  fPropBuf = clCreateBuffer(clContext, …, sizeBuf, &flarePropBuf, …);

  size_t global[2] = {nWidth, nHeight};

  for (long i = 0; i < flareElements; i++)  {

      cl_kernel kernel = kernelAry[i];        // one kernel per flare element

      status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
      status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
      status = clSetKernelArg(kernel, argN, sizeof(cl_mem), &fPropBuf);

      clEnqueueNDRangeKernel(cmdQueue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
  }
}

Host code (One Kernel)

hostCodeFunction(…)  {

   flareElements = N elements to draw;

   for (long i = 0; i < flareElements; i++)  {
       ElemProps[i].eType = flareProp[i].eType;
       ElemProps[i].data1 = flareProp[i].data1;
       …
       ElemProps[i].dataN = flareProp[i].dataN;
   }

   size_t sizeBuf = sizeof(ElemProps);
   flarePropBuf = clCreateBuffer(clContext, …, sizeBuf, &ElemProps, &status);

   cl_kernel kernel = kernelHandle;     // one kernel

   status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &src);
   status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst);
   status = clSetKernelArg(kernel, 4, sizeof(cl_mem), &flarePropBuf);

   size_t global[2] = {nWidth, nHeight};

   clEnqueueNDRangeKernel(cmdQueue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
}

Figure 8. Partial Host Code for one-kernel and six-kernel Lens Flare Implementation

OpenCL Code – CL file with Six Kernels

__constant sampler_t  iSampler = …;

__kernel
void lensflare_e1( input, output, …, __constant KernelFlareProp* pFlareInfo )
{
   int cx = get_global_id(0);  int cy = get_global_id(1);
   int2 coord = (int2)(cx, cy);
   float4 color = read_imagef(input, iSampler, coord);

   if (within bounds) {
        eval_color = e1Eval(…, pFlareInfo);
        color = apply_eval_color;
   }
   // write into output
   write_imagef(output, coord, color);
}

__kernel
void lensflare_e2( input, output, …, __constant KernelFlareProp* pFlareInfo )
{
   int cx = get_global_id(0);  int cy = get_global_id(1);
   int2 coord = (int2)(cx, cy);
   float4 color = read_imagef(input, iSampler, coord);

   if (within bounds) {
        eval_color = e2Eval(…, pFlareInfo);
        color = apply_eval_color;
   }
   // write into output
   write_imagef(output, coord, color);
}

__kernel void lensflare_eN(…) { … }

OpenCL Code – CL file with One Kernel

__constant sampler_t  iSampler = …;

__kernel
void lens_flare_evals( input, output, …, __constant KernelFlareProp* pFlareInfo )
{
   int cx = get_global_id(0);  int cy = get_global_id(1);
   int2 coord = (int2)(cx, cy);
   float4 color = read_imagef(input, iSampler, coord);

   int nElems = (int)pFlareInfo->lensElemCount;

   // Get the properties of each element and evaluate it
   for (int i = 0; i < nElems; i++)  {
        eType = (int)pFlareInfo->newElemProps[i].eType;
        data1 = pFlareInfo->newElemProps[i].data1;
        …
        dataN = pFlareInfo->newElemProps[i].dataN;

        if (within bounds) {
             if (eType == e1) {
                  eval_color = e1Eval(…, pFlareInfo);
             } else if (eType == e2) {
                  eval_color = e2Eval(…, pFlareInfo);
             } else if (…) {
                  …
             }  // end of eType
             color = apply_eval_color;
        }
   }  // end of for loop
   // write into output
   write_imagef(output, coord, color);
}

Figure 9. Partial OpenCL Kernel Code for one-kernel and six-kernel Lens Flare Implementation

Conclusion


In summary, video and image processing can be accelerated with OpenCL, and further gains can be achieved on Intel® HD Graphics with additional tuning. For most applications, excellent performance improvements will be observed on 3rd generation Intel® Core™ processors with Intel® HD Graphics 4000 running OpenCL when compared to equivalent C/C++ code. Should the performance of a given kernel not improve as expected, consider the optimization techniques outlined in this paper, and refer to the Intel® SDK for OpenCL* Applications 2013 OpenCL* Optimization Guide for detailed techniques and methods for tuning OpenCL kernels on Intel® HD Graphics.

Appendix A: System Information


Test system information:

Hardware

System Item    Value
CPU            Core i7-3667U @ 2.0 GHz
GPU            Intel® HD Graphics 4000 @ 350 MHz core clock
Memory         4 GB
Max DVMT       1696 MB

Software

System Item        Value
OS                 Windows 8 Pro x64
Graphics Driver    9.18.10.3070
Video BIOS         2137.0

Application Configuration

System Item          Value
Graphics Settings    High
Resolution           1440x1080 UI Preview Display

Appendix B: Tools


Tools used for the analysis.

      • Intel® SDK for OpenCL* Applications 2013 - Kernel Builder
      • Intel® Graphics Performance Analyzers 2013

Appendix C: References


    1. Intel® SDK for OpenCL* Applications 2013: http://software.intel.com/en-us/vcsource/tools/opencl-sdk-2013
    2. OpenCL code samples: http://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-samples-getting-started
    3. Intel® SDK for OpenCL* Applications 2013 OpenCL* Optimization Guide: http://software.intel.com/sites/products/documentation/ioclsdk/2013/OG/index.htm
    4. OpenCL 1.1 and 1.2 Specifications: http://software.intel.com/en-us/articles/opencl-resource-by-the-khronos-group

About the Author


Eli Hernandez is an Application Engineer in the Consumer Client and Power Enabling Group at Intel Corporation, where he works with customers to optimize their software for power efficiency and for the best performance on Intel hardware and software technologies. He currently focuses on the performance of applications that use OpenCL on Intel processor graphics, as well as on SIMD optimization with the latest IA instruction sets and threaded parallel coding, applied mostly to media applications such as video and audio codecs and video effect processing.

Intel, Core, and the Intel logo are trademarks of Intel Corporation in the US and/or other countries.
OpenCL and the OpenCL logo are trademarks of Apple Inc. and are used by permission of Khronos.
Copyright © 2012 Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Please refer to the Optimization Notice page for more information regarding performance and optimization in Intel software products.