OpenCL VEBox Minimal Samples

By Adam W Herr,

Published:02/07/2017   Last Updated:02/07/2017

VEBox Minimal  Samples

This article covers the VEBox minimal samples package.   Download here.

Please note: this package is intended for use in Linux, SRB4/Media Server Studio 2017 R2 and forward.  Windows support will be added as VEBox features become available in public drivers.

 

Introduction

The Video Enhancement pipeline (VEBox) is a stand-alone silicon block within Intel GPU hardware specialized for frame processing tasks.  It executes independently from the rest of the GPU.  Workloads combining VEBox and regular kernel execution can radically improve performance by exploiting concurrent execution between the VEBox and EU hardware.

New extensions have been added to Intel's OpenCL implementation to give developers greater access to this hardware.  Currently they are only available in Linux SRB4 and forward.  These are preview extensions and must be separately enabled.

export OCL_EnablePreviewFeatures=1

While the extensions are new, the VEBox hardware is not experimental. It has been present in many generations of Intel GPUs, going back to 2nd Generation Core/Sandybridge.

VEBox Samples Overview

Historically, VEBox hardware has been used exclusively by Intel media drivers and Intel Media SDK. The new OpenCL VEBox extensions provide low-level access to the entire VEBox pipeline, making it available to developers for the first time as a directly-accessible compute resource to use in applications. While many stages are highly configurable, the algorithms implemented by the VEBox cannot be significantly altered. However, what the VEBox lacks in flexibility it makes up for in high performance and independent execution. The pipeline provides a suite of image processing operations including color correction/conversion, noise reduction, demosaic, deinterlace and many other image enhancement stages.  These operations are a more configurable version of many of the operations in Intel® Media SDK VPP.

These extensions are built on the existing OpenCL kernel, queue, event, image and enqueue idioms. The API is exposed through a set of built-in kernels, a VEBox-specific accelerator object, and a special command queue property. The VEBox pipeline state is configured by the user with the desired stage configurations.   Built-in kernels are enqueued onto a VEBox-enabled command queue, passing the accelerator object and input/output images as kernel arguments. Synchronization of VEBox execution is accomplished via the regular cl_event mechanisms.

The samples open a sequence of NV12 video frames and passes them through the VEBox where they are operated on by a simple VEBox pipeline including conversion to RGBA. Resulting images are written in a sequence of uncompressed RGBA frames. This samples capture the common themes for all OpenCL VEBox applications and should be easily modified to support more sophisticated pipelines and applications.

Running the Samples and Viewing Results:

The package contains two samples:

  • adaptive_contrast_enhancement
  • total color correction

There is also a "common" directory and an example input file in "media".

Both samples can be built by running make.

Before running, set OCL_EnablePreviewFeatures=1. 

Here is the console output for Adaptive Contrast Enhancement, as run on an Intel(R) Core(TM) i7-6500U processor:

$ ./sample_ACE 
OpenCL VEBox: adaptive contrast enhancement Searching for required extensions...
  found:cl_intelx_video_enhancement
  found:cl_intelx_video_enhancement_camera_pipeline
  found:cl_intelx_video_enhancement_color_pipeline
  found:cl_intel_planar_yuv
  found:cl_intel_packed_yuv
Using VEBox interface Version 1
Hardware Has 1 VEBox Instances

If you see a message saying "VEBox extensions not found" instead, chances are you don't have the right hardware/driver or preview features are not correctly enabled. See here for details.

The application should output a file in the current work directory called output_ace.rgba. You'll need an application that can view raw video frames in order to see the results. One approach is to use ffplay command from ffmpeg to playback video sequences and compare changes.

To view the original video:

$ ffplay -f rawvideo -pix_fmt nv12 -video_size 176x96 ../media/test_stream_176x96.nv12 

To view the generated video:

$ ffplay -f rawvideo -pix_fmt rgba -video_size 176x96 output_ace.rgba

You can generate a side by side view like this

ffmpeg -y \
       -s 176x96 -pix_fmt nv12 -f rawvideo -i ../media/test_stream_176x96.nv12 \
       -s 176x96 -pix_fmt rgba -f rawvideo -i output_ace.rgba \
       -filter_complex hstack -pix_fmt rgba -f rawvideo output_stacked.rgba

which would be played back with

$ ffplay -f rawvideo -pix_fmt rgba -video_size 352x96 output_stacked.rgba

Output for Adaptive Contrast Enhancement (ACE) should show enhanced contrast and brightness:

(original on left, ACE output on right.)

 

Output for Total Color Correction should show green desaturization with full saturization of other colors.

 

Sample Overview

These samples demonstrate how developers can utilize VEBox hardware, which is a dedicated high-performance offload engine (see here for more information). The code demonstrates the semantics for configuring pipeline stages (in this case ACE and TCC) and the use of built-in kernels to invoke hardware. There is also an implicit conversion from NV12 to RGBA in within the processing pipeline.  

While this is a walk through of the minimal ACE and TCC samples, the steps to use the VEBox extensions will be similar in any application. Working with VEBox requires some boilerplate code. The intent of this section is to explore the interface and explain how it works. We will do this by working through a sequence of steps that are documented both here and in the sample code. This overview only covers the key OpenCL APIs, so some of the details that appear in the sample code (reading and writing files, etc. ) are omitted. 

The main programs for the VEBox samples are very short.  Basically, they just set up the VEBox stages used and their parameters, then run the VEBox pipeline.

int main(int argc, const char ** argv)
{
    // Use ACE to boost the luma channel
    cl_ve_ace_attrib_intel ace;
    ace.enabled = CL_TRUE;
    ace.skin_threshold = CL_VE_ACE_SKIN_THRESHOLD_DEFAULT_INTEL; // not using skin threshold in this example 
    ace.level = CL_VE_ACE_LEVEL_MAX_INTEL;
    ace.strength = CL_VE_ACE_STRENGTH_DEFAULT_INTEL + 1;
    cl_ve_attrib_desc_intel attrib[] = {{ CL_VE_ACCELERATOR_ATTRIB_ACE_INTEL, &ace }};
    cl_ve_desc_intel desc = {1, attrib};

    return RunVEBoxPipeline(INPUT_VIDEO_WIDTH, INPUT_VIDEO_HEIGHT, CL_NV12_INTEL, INPUT_VIDEO_FILE, CL_RGBA, OUTPUT_VIDEO_FILE, &desc);
}

The pipeline code is implemented in common/cl_helpers.h.

The first step is to obtain an Intel GPU device and check if the hardware+driver supports the VEBox interface with the VEBox extensions.

...
    clGetPlatformIDs( 1, &my_platform, NULL );
    clGetDeviceIDs(my_platform, CL_DEVICE_TYPE_GPU, 1, &my_device, NULL );
    if(!RequiredExtensions(my_device, VEBoxExtensions()))
...

Here we assume that the first GPU device is an Intel device. Once we have obtained a cl_device_id object, we can used this to determine of hardware supports the VEBox interface:

...
err = clGetDeviceInfo(my_device, CL_DEVICE_VE_VERSION_INTEL, sizeof(ve_ver), &ve_ver, NULL);
...

The err variable will be set to CL_SUCCESS if VEBox is available. The CL_DEVICE_VE_VERSION_INTEL attribute indicates the VEBox interface supported by hardware+driver. The version number may impact the availability of particular features in the VEBox pipeline for this hardware and/or driver version. 

In certain hardware SKUs there may be multiple VEBoxes. The number of VEBox instances available may be obtained using the CL_DEVICE_VE_ENGINE_COUNT_INTEL attribute. This is purely informational since multi-VEBox configurations share a common hardware front-end that splits images across n-number of VEBoxes. A single VEBox invocation will spread work across all available hardware instances.  

...
err = clGetDeviceInfo(my_device, CL_DEVICE_VE_ENGINE_COUNT_INTEL, sizeof(num_veboxes), &num_veboxes,0);
...

The next step is to create a VEBox accelerator object, which represents the complete state configuration the pipeline. The VEBox accelerator object is based on the existing cl_intel_accelerator extension, which is describe in detail here. This is an opaque object that stores the user-specified pipeline state, statistical data (used in adaptive filters) and related hardware state information. Since the accelerator interface is itself an extension, we'll need to obtain the function pointers for the interface. This is done by calling the LoadAcceleratorAPI, a helper function also defined in common/cl_helpers.h.

...
LoadAcceleratorAPI(my_platform);
...

Next we need a cl_context. We don't do anything specially here, just create it in the usual way:

...
cl_context_properties cps[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)my_platform, 0};
my_context = clCreateContext(cps, 1, &my_device, NULL, NULL, &err);
...

The next step is to create a command queue that can talk to the VEBox's dedicated command stream. Command queues of this type do not support regular kernel execution and follow the regular in-order command queue semantics. To create one simply pass the CL_QUEUE_VE_ENABLE_INTEL attribute to the clCreateCommandQueueWithProperties function:

...
cl_command_queue my_ve_q = NULL;
const cl_queue_properties properties[] = { CL_QUEUE_VE_ENABLE_INTEL, CL_TRUE, 0};
my_ve_q = clCreateCommandQueueWithProperties( my_context, my_device,  properties, &err );
...

Since we can only enqueue built-in VEBox kernels on my_ve_q, we need to also create a normal command queue that can handle things like memory transfers.

...
cl_command_queue my_q = NULL;
my_q = clCreateCommandQueueWithProperties( my_context, my_device,  NULL, &err );
...

 

Now we can create an accelerator object with the pipeline configuration we want. The accelerator is created by calling the __clCreateAcceleratorINTEL function with the CL_ACCELERATOR_TYPE_VE_INTEL type and passing the cl_ve_desc_intel structure describing the pipeline state as set up in the main function. Here is what it looks like in code: 

 

accel = __clCreateAcceleratorINTEL( my_context, CL_ACCELERATOR_TYPE_VE_INTEL, sizeof( cl_ve_desc_intel ), desc, &err );

 

To reduce implementation complexity, this pipeline depends on the driver to detect that the input image is CL_NV12_INTEL and the output image is CL_RGBA with automatic color space conversion. This default behavior can be overriden with an explicit stage configuration, but that is not shown here to minimize sample complexity.   

Now that we have a configured accelerator object, we need to get the built-in kernels that allow us to execute on the VEBox. There are three built-in kernels, but for now we're only going to consider the ve_enhance_kernel, which covers the broadest set of use cases. A cl_kernel object is obtained using OpenCL's built-in kernel mechanism:

...

my_ve_program = clCreateProgramWithBuiltInKernels( my_context, 1, &my_device, "ve_enhance_intel", &err );
my_ve_enhance_k = clCreateKernel( my_ve_program, "ve_enhance_intel", &err );

...

The enhance kernel has the following signature:

ve_enhance_intel (
    sampler_t              accelerator,	// Pipeline state
    int                    flags,		// Flags for this enqueue
    __read_only image2d_t  current_input,	// Input image
    __write_only image2d_t current_output	// Output Image
);

This kernel is pretty simple, the accelerator argument is set to the accelerator object we just created. The flags argument has special meaning for certain stages in the pipeline. The current_input and current_output arguments are image objects that have been created using the clCreateImage function. The VEBox supports various image formats, check the spec for details. Setting arguments uses the regular OpenCL idioms:  

...
cl_int flags = i == 0 ? CL_VE_FIRST_FRAME_INTEL : 0;
clSetKernelArg( my_ve_enhance_k, 0, sizeof( cl_accelerator_intel * ), &accel);
clSetKernelArg( my_ve_enhance_k, 1, sizeof( cl_int ), &flags );
clSetKernelArg( my_ve_enhance_k, 2, sizeof( cl_mem* ), &current_input);
clSetKernelArg( my_ve_enhance_k, 3, sizeof( cl_mem* ), &current_output);
...

The VEBox stages are implemented as adaptive filters. During execution, statistics are gathered for each frame and stored inside the accelerator object for use on the next frame. We assume the above code is in a loop over multiple frames. In this case we set the flags argument to CL_VE_FIRST_FRAME_INTEL to reset the accelerator's statistics and put hardware in "first frame" mode when the frame index (e.g. i) is 0. Subsequent enqueues clear this flag to ensure proper adaptive filtering. It is good practice to always set this flag for the first invocation of VEBox.

Now that we have all the necessary pieces ready to go, we can actually enqueue VEBox work. This is done using the regular clEnqueueNDRangeKernel function, though the ND Range is interpreted differently. VEBox processing occurs on 64-pixel wide columns, so the driver interprets the offset and workgroup size as the first and last pixel column to process. Here are the  constraints on clEnqueueNDRangeKernel when invoking VEBox kernels:

  • global workgroup size must be 1
  • global_work_size[0] must be greater than or equal to 64
  • if global_work_offset is NULL, processing starts at the leftmost pixel of the input image
  • if global_work_offset[0] is provided, it must be divisible by 64 and less than global_work_size[0]
  • The local workgroup size must be set to NULL. 

This may sound complicated, but it is really a matter of setting the offset[0] to 0 and the global_size[0] to the width of the image, like this:

... 
const size_t off[ 1 ] = { 0 }; // Start with the leftmost pixel

const size_t gws[ 1 ] = { width }; // The process to image width

clEnqueueNDRangeKernel( my_ve_q, my_ve_enhance_k, 1, off, gws, NULL, 0, NULL, NULL);

...

The call returns immediately.  Work is added to the VEBox queue . This will run concurrently with work on other parts of the GPU.  The required synchronization to access output can be achieved by either blocking the host on the VEBox command queue by calling clFinish, or by working with the cl_event object returned from the clEnqueueNDRangeKernel function. If a cl_event object is used, this completion object may be used in anywhere in the OpenCL API as a dependency.  

The full set of hardware operations in the VEBox pipeline is below. While the pipeline is fixed in terms of available stages and order, operations can be configured with parameters including whether each stage is on or off. This means that the time to execute a pipeline with one stage active is very close to the time to execute a pipeline executing all possible stages.

There are some pathway choices in the pipeline. For example, in the DN/DI sequence deinterlace OR demosaic/color correction can run but not both.

These samples are intended to be a starting point. All stages are not yet covered. Please watch for more documentation/ examples on this subject.

References

https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol09-media_vebox.pdf

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804