D3D9 Media Surface Sharing Between Intel® Quick Sync Video and OpenCL* on Intel® HD Graphics



Intel has defined an extension to OpenCL* version 1.0 and newer that allows applications to directly access images embedded in Microsoft DirectX* 9 (DX9) media surfaces without first copying them. For OpenCL v1.2, the Khronos standards organization has defined a standardized extension for the same purpose. Intel® Quick Sync Video (QSV) can decode video images into DX9 surfaces. The Intel and Khronos extensions can significantly improve the performance of applications that use OpenCL to process QSV video frames. These benefits apply to a range of applications, including video enhancement during playback, video editing with effects, and video generation and encoding.

OpenCL and QSV Media Surface Sharing

QSV supports video decode and encode acceleration on Intel® processor graphics and can be accessed by applications using the Intel® Media SDK. The Intel Media SDK and the Intel® SDK for OpenCL Applications are distributed as part of both Intel® Media Server Studio [1] and the Intel® Integrated Native Development Experience (Intel® INDE) [2]. OpenCL [3] is a framework for developing applications that take advantage of heterogeneous platforms. In the case considered in this article, these platforms consist of a CPU and a programmable graphics unit, with parts of the application executing on each to achieve improved processing efficiency and performance. The Intel-defined extension to OpenCL v1.0 (also usable with subsequent versions) enables direct access to DX9 media surfaces created by QSV.

QSV media surface sharing with OpenCL can avoid many surface copies between QSV and OpenCL kernels. If an application tries to apply an OpenCL filter to QSV video without surface sharing, it will need to copy QSV output surfaces to OpenCL image objects so that the OpenCL filter can consume the data. This copy will get more time-consuming as the resolution of input video increases, as there is more data to copy.

Figure 1 shows an example of this, in which OpenCL is used to modify a video stream. Video frames decoded by QSV into DX9 surfaces are copied from the GPU to the CPU, then copied to an OpenCL memory object, copied from CPU to GPU, processed in OpenCL with the output copied back to the CPU, then copied into a DX9 surface and finally copied back to the GPU for video encode by QSV. With shared DX9 surfaces, none of these copies will be needed.

Figure 1: Video transcode and processing pipeline without surface sharing

Because Intel processor graphics shares memory with the CPU, image surfaces might only need to be copied during conversion to a system memory buffer. Once a system memory pointer is available, it can be used to create an OpenCL buffer around that same system memory. The reverse is also true: after OpenCL processing completes into an OpenCL buffer backed by system memory, that buffer can be copied and converted to a DX9 buffer. So on Intel processor graphics, the total could be reduced to two memory copies. Still, it would be better to eliminate all the extra copies.

Using the Intel DirectX 9 media surface sharing extension of OpenCL, a DX9 image surface in the NV12 format (used internally by QSV) can be converted directly into an NV12 OpenCL image2D memory object without copying, and vice-versa, as shown in Figure 2. This avoids the extra data copies needed to convert video frames moving between QSV and an OpenCL filter. So for a transcode application, the shared media surface extension can save at least two extra memory copies.

Figure 2: Video transcode and processing pipeline using MSDK/OCL surface sharing extensions

OpenCL DirectX 9 Media Sharing APIs

Intel has defined a vendor-specific extension for OpenCL version 1.0 (and beyond), called cl_intel_dx9_media_sharing [3]. That OpenCL extension API includes four functions that allow applications to share DirectX 9 media surfaces with OpenCL:

  • clGetDeviceIDsFromDX9INTEL
  • clCreateFromDX9MediaSurfaceINTEL
  • clEnqueueAcquireDX9ObjectsINTEL
  • clEnqueueReleaseDX9ObjectsINTEL

Using this extension, OpenCL can execute kernels that read and/or write memory objects that are also DirectX 9 resources. An OpenCL 2D image object can be created from a DirectX 9 media surface resource. OpenCL memory objects may be created from DirectX 9 objects if and only if the OpenCL context has been created from a DirectX 9 device. That is, the OpenCL clCreateContext or clCreateContextFromType function must be called with a properties list that includes the property CL_CONTEXT_D3D9EX_DEVICE_INTEL paired with a pointer to the IDirect3DDevice9 created for QSV. (More information can be found in "Intel Extension for DirectX 9 Media Sharing" [4].)

clGetDeviceIDsFromDX9INTEL: This function queries the OpenCL devices corresponding to a DirectX 9 device on a particular platform. The application should create the OpenCL context from that DirectX 9 device.

clCreateFromDX9MediaSurfaceINTEL: This function creates an OpenCL 2D image object from a DirectX 9 media surface, or from a single plane of a DirectX 9 media surface.

Applications can query the properties of a media surface object created from a DirectX 9 resource using the clGetMemObjectInfo and clGetImageInfo functions.

clEnqueueAcquireDX9ObjectsINTEL: This function acquires OpenCL memory objects that have been created from DirectX 9 resources. Such memory objects must be acquired before they can be used by any OpenCL commands queued to a command-queue. This function provides synchronization, guaranteeing that any DirectX calls made before clEnqueueAcquireDX9ObjectsINTEL complete execution before any subsequent OpenCL commands submitted to the command-queue begin.

clEnqueueReleaseDX9ObjectsINTEL: This function releases OpenCL memory objects that have been created from DirectX 9 resources. It provides synchronization, guaranteeing that any calls to the DirectX 9 resource made after the call to clEnqueueReleaseDX9ObjectsINTEL will not start executing until all events in this API's event list are complete and all work already submitted to the command-queue has finished executing.

To use this extension API, the application must first obtain a pointer to each of the corresponding functions. For OpenCL v1.0 and v1.1, use the standard OpenCL function clGetExtensionFunctionAddress, passing it the extension function name as a string. The file cl_ext.h in the Intel® SDK for OpenCL Applications distribution contains function pointer typedefs (with "_fn" appended to the API name) that can be used to correctly cast the returned pointers.

In OpenCL version 1.2 and newer, Khronos has added an equivalent standard extension [5]. Using that extension, the equivalents of the above API names become:

  • clGetDeviceIDsFromDX9MediaAdapterKHR
  • clCreateFromDX9MediaSurfaceKHR
  • clEnqueueAcquireDX9MediaSurfacesKHR
  • clEnqueueReleaseDX9MediaSurfacesKHR

Function prototypes for these APIs can be found in the cl_DX9_media_sharing.h file available from Khronos.org [6]. However, OpenCL v1.2 deprecated clGetExtensionFunctionAddress in favor of a new API function, clGetExtensionFunctionAddressForPlatform, as described in the Khronos.org document "OpenCL 1.2 Extensions Specification" [5]. clGetExtensionFunctionAddressForPlatform requires both a platform ID and the extension API name as a string.

For the KHR DX9 media surface sharing extension, as with the Intel DX9 media surface sharing extension, QSV media surfaces can only be shared into an OpenCL context created using the property CL_CONTEXT_ADAPTER_D3D9EX_KHR paired with a pointer to the IDirect3DDevice9 created for QSV.

For both the Intel and KHR extensions, the OpenCL functions clGetPlatformInfo and clGetDeviceInfo can be used to get the list of named extensions supported by an OpenCL implementation, to verify that the desired extension is supported: either cl_intel_dx9_media_sharing for the Intel vendor-specific extension to OpenCL v1.0 or v1.1, or cl_khr_dx9_media_sharing for the OpenCL v1.2 Khronos standard extension. It is also worth noting that other DirectX surface sharing extensions have been created, and are also documented on the Khronos.org site.

Intel® Media SDK User Functions

Intel Media SDK provides a USER class of functions that allows user-defined functions to participate in the QSV pipeline. This lets the application integrate OpenCL-based kernels into the QSV pipeline as a "plug-in." In the case considered here, the USER class of functions is implemented with OpenCL kernels.

The application needs to do the following to use an OpenCL plug-in inside Intel Media SDK:

  • Initialize an OpenCL plug-in, registering a set of callback functions (described below) through the MFXVideoUSER_Register function. The Intel Media SDK invokes these callback functions at the appropriate times in the QSV pipeline.
  • Once initialized, the application can use the OpenCL plug-ins through the Intel Media SDK function MFXVideoUSER_ProcessFrameAsync to process data. The function returns a sync point for result synchronization, which is similar to other Intel Media SDK async functions.
  • Close the OpenCL plug-in by unregistering it via the MFXVideoUSER_Unregister function.

The application needs to include “mfxplugin.h” in addition to other Intel Media SDK files.

The application needs to implement the following callback functions, which are registered through MFXVideoUSER_Register:

  • PluginInit: Intel Media SDK calls this function to initialize the plug-in components and allocate internal resources.
  • PluginClose: Intel Media SDK calls this function to close the plug-in components and free the internal resources.
  • GetPluginParam: Intel Media SDK calls this function to obtain plug-in configuration parameters.
  • Submit: Intel Media SDK calls this function to check the validity of I/O parameters and submit a task to the SDK for execution.
  • Execute: Intel Media SDK calls this function to execute the submitted task after resolving all input data dependencies.
  • FreeResources: Intel Media SDK calls this function when task execution finishes, or to cancel a queued task.

The Intel Media SDK includes a document, "Intel Media Software Development Kit – Extensions for User-Defined Functions, API Version 1.3" [7], that provides details about all of these APIs.

QSV and OpenCL Media Sharing Extension Flow

There are multiple use cases in which QSV and OpenCL can interoperate.

Figure 3: Intel® QSV Decoder and OpenCL* interoperability

For example, Figure 3 shows the flow of a decoder application, where an OpenCL filter is applied before the frames are displayed. Figure 4 shows the flow of a video transcoding application that uses QSV to decode and encode the video, applying OpenCL filters to the decoded frames before encoding them.

Figure 4: QSV and OpenCL* interoperability

In each of these cases, there are four distinct steps: initial setup, frame decoding, OpenCL filtering, and final video frame processing.

  • Initial Setup: In this step the application initializes the QSV components using Intel Media SDK and performs DX9 surface allocation. It uses the Intel Media SDK function MFXVideoUSER_Register to register the OpenCL plug-in. After that, the application can initialize the other components and create OpenCL surfaces with the clCreateFromDX9MediaSurfaceINTEL function, mapping them to the DX9 surfaces created by Intel Media SDK.
  • Frame Decoding: After initializing, the application can start decoding frames. When a decoded frame is available, the application calls the MFXVideoUSER_ProcessFrameAsync function. This stage can be omitted if the application only generates video frames for QSV encode.
  • OpenCL Filtering: At this point Intel Media SDK calls the "Submit" and then "Execute" plug-in callback functions to execute the OpenCL kernel. The "Execute" callback implements the host side of the OpenCL processing: it calls clEnqueueAcquireDX9ObjectsINTEL to lock the frame, then standard OpenCL functions such as clSetKernelArg and clEnqueueNDRangeKernel to run the OpenCL kernels that process the frame, followed by clEnqueueReleaseDX9ObjectsINTEL.
  • Frame Encoding / Frame Rendering: After frame processing, the DX9 surfaces are ready for the next stage, which could be encoding or display depending on the application.

Use Cases

OpenCL can be used to accelerate a range of video processing applications, and using the Intel or Khronos DX9 media sharing extensions helps ensure the best possible performance by minimizing surface copy overhead.

Examples of applications using a combination of OpenCL and QSV include video editing, in which OpenCL is used to apply special effects or video transitions (such as blending two decoded video frames, and encoding the result); video playback with image enhancement prior to rendering for display; and algorithmic video synthesis, for example creating motion video by manipulating and animating still images and subsequently encoding the resulting sequence of frames.  

Again, it is important to note that the minimum number of image surface copies can be achieved only by having OpenCL accept and process frames in NV12 format from QSV, and/or deliver frames to QSV in the same format. If color conversion is required (for example, NV12 to or from an RGB format), a developer may be able to merge it into the OpenCL processing pipeline as part of the first (or last) kernel that processes the image2D object, thereby avoiding an added pass through memory. If QSV is asked to do the color conversion as part of its post-processing, it becomes a separate processing stage and an additional pass through memory.

Example Code

The best example code for understanding how to integrate OpenCL-based plug-ins into a QSV pipeline is provided as sample code, available on the Intel® Developer Zone [8]. The Intel Media SDK code samples also include an OpenCL sample, but that sample does not use the DX9 media surface sharing extension.

About the Author

Tom Craver is an Intel application engineer in Chandler, Arizona. He is currently focused on the performance of applications using OpenCL on Intel processor graphics, and has an extensive background in SIMD, OpenCL, and threaded parallel coding for performance, primarily for media applications such as audio and video codecs and video effect processing.


  1. Intel® Media Server Studio is available for download at https://software.intel.com/en-us/intel-media-server-studio/try-buy
  2. Intel® Integrated Native Development Experience (Intel® INDE) is available for download at https://software.intel.com/en-us/intel-inde
  3. OpenCL Specifications, versions 1.0 and later, are available at www.Khronos.org under the "OpenCL" and "Specs & Headers" subsections
  4. The Intel extension for DirectX 9 media sharing, for OpenCL version 1.0 and newer, is at www.Khronos.org under the "OpenCL" and "Specs & Headers" subsections in the Extension Specifications section, listed as cl_intel_dx9_media_sharing
  5. The OpenCL 1.2 Extensions Specification, with the DirectX 9 media sharing extension for OpenCL version 1.2 and newer, is at www.Khronos.org under the "OpenCL" and "Specs & Headers" subsections, in the list of specification files for OpenCL v1.2 and newer
  6. The file cl_DX9_media_sharing.h, with the Khronos DirectX 9 media sharing extension for OpenCL version 1.2 and newer, is available at www.Khronos.org under the "OpenCL" and "Specs & Headers" subsections
  7. "Intel Media Software Development Kit – Extensions for User-Defined Functions" is distributed with the Intel Media SDK, version 1.3 or newer
  8. Sample code for integrating OpenCL into an Intel Media SDK (QSV) pipeline is available at http://software.intel.com/en-us/vcsource/samples/opencl-and-intel-media-sdk and in the Media SDK samples linked from the Intel Media SDK file download.htm located in <install-folder>\samples
  9. "Intel Media Software Development Kit – Reference Manual," version 1.3 or newer, is distributed with the Intel Media SDK and located in <install-folder>\doc\mediasdk-man.pdf


Manish K.:

Can you share the sample code which includes the encoder as well? The code at the link contains only the decoder and OpenCL.

Manish K.:

Hi Tom,

Is this the right way to make the pipeline?
Please copy the below code in a notepad.

mfxStatus CDecodingPipeline::DecodeOneFrame(int Width, int Height, IDirect3DSurface9 *pDstSurface, IDirect3DDevice9* pd3dDevice)
{
    mfxU16 nOCLSurfIndex = 0;
    mfxFrameSurface1_OCL* pOutSurface = NULL; // output surface
    mfxStatus stsOut = MFX_ERR_NONE;

    if (m_Tasks[m_TaskIndex].m_DecodeSync || m_Tasks[m_TaskIndex].m_OCLSync || m_Tasks[m_TaskIndex].m_EncodeSync)
    {   // wait until the task is finished, then copy the result texture to the back buffer
        mfxStatus sts = MFX_ERR_NONE;

        // wait for the previously submitted tasks
        sts = m_mfxSession.SyncOperation(m_Tasks[m_TaskIndex].m_DecodeSync, MSDK_DEC_WAIT_INTERVAL);
        MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
        pOutSurface = m_Tasks[m_TaskIndex].m_pDecodeOutSurface;

        sts = m_mfxSession.SyncOperation(m_Tasks[m_TaskIndex].m_OCLSync, MSDK_VPP_WAIT_INTERVAL);
        MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
        pOutSurface = m_Tasks[m_TaskIndex].m_pOCLOutSurface;

        sts = m_mfxSession.SyncOperation(m_Tasks[m_TaskIndex].m_EncodeSync, MSDK_ENC_WAIT_INTERVAL);
        MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);

        // unlock surfaces that are no longer in use
        if (m_Tasks[m_TaskIndex].m_pDecodeOutSurface && m_Tasks[m_TaskIndex].m_pDecodeOutSurface->Data.Locked)
            { /* ... */ }
        if (m_Tasks[m_TaskIndex].m_pOCLOutSurface && m_Tasks[m_TaskIndex].m_pOCLOutSurface->Data.Locked)
            { /* ... */ }
        if (m_Tasks[m_TaskIndex].m_pEncodeOutSurface && m_Tasks[m_TaskIndex].m_pEncodeOutSurface->Data.Locked)
            { /* ... */ }
    }

    // clear the sync task for further use
    m_Tasks[m_TaskIndex].m_DecodeSync = 0;
    m_Tasks[m_TaskIndex].m_pDecodeOutSurface = 0;
    m_Tasks[m_TaskIndex].m_OCLSync = 0;
    m_Tasks[m_TaskIndex].m_pOCLOutSurface = 0;
    m_Tasks[m_TaskIndex].m_EncodeSync = 0;
    m_Tasks[m_TaskIndex].m_pEncodeOutSurface = 0;

    {   // feed the decoder
        mfxSyncPoint DecodeSyncPoint = 0;
        static mfxU16 nDecoderSurfIndex = 0; // index of free surface
        mfxStatus sts = MFX_ERR_NONE;
        m_pmfxDecodeSurfaceLast = NULL; // reset the current decoder surface to get a new one from the decoder

        while (MFX_ERR_NONE <= sts || MFX_ERR_MORE_DATA == sts || MFX_ERR_MORE_SURFACE == sts || MFX_WRN_DEVICE_BUSY == sts)
        {   // loop until the decoder reports that it received a request for a new frame
            if (MFX_WRN_DEVICE_BUSY == sts)
            {
                Sleep(1); // just wait and then repeat the same call to DecodeFrameAsync
            }
            else if (MFX_ERR_MORE_DATA == sts)
            {   // read more data into the input bit stream
                sts = m_FileReader.ReadNextFrame(&m_mfxBS);
            }
            else if (MFX_ERR_MORE_SURFACE == sts || MFX_ERR_NONE == sts)
            {   // find a new working/output surface in m_pmfxDecodeSurfaces
                nDecoderSurfIndex = GetFreeSurfaceIndex(m_pmfxDecodeSurfaces, m_mfxDecoderResponse.NumFrameActual, nDecoderSurfIndex);
                if (MSDK_INVALID_SURF_IDX == nDecoderSurfIndex)
                    return MFX_ERR_MEMORY_ALLOC;
            }

            // send the request to the decoder
            sts = m_pmfxDEC->DecodeFrameAsync(/* ... */);

            // ignore warnings if output is available;
            // if there is no output and no action is required, just repeat the same call
            if (MFX_ERR_NONE < sts && DecodeSyncPoint)
                sts = MFX_ERR_NONE;

            if (MFX_ERR_NONE == sts)
            {   // the decoder returned a sync point; fill the current task and switch to OCL plug-in feeding
                m_Tasks[m_TaskIndex].m_DecodeSync = DecodeSyncPoint;
                m_Tasks[m_TaskIndex].m_pDecodeOutSurface = m_pmfxDecodeSurfaceLast;
                break;
            }
        }

        if (MFX_ERR_NONE != sts)
        {
            printf("ERROR: Decoder returned error %d!\n", sts);
            stsOut = sts;
        }
    }

    if (m_pOCLPlugin && m_pOCLPlugin->m_OCLFlag)
    {   // OpenCL part
        mfxSyncPoint OCLSyncPoint = 0;
        mfxStatus sts = MFX_ERR_NONE;
        // get the index of the output surface for the OCL plug-in
        nOCLSurfIndex = GetFreeSurfaceIndex(m_pmfxOCLSurfaces, m_mfxOCLResponse.NumFrameActual);

        mfxHDL pOutSurf = &m_pmfxOCLSurfaces[nOCLSurfIndex];
        mfxHDL inp = m_pmfxDecodeSurfaceLast;

        // OCL filter
        sts = MFXVideoUSER_ProcessFrameAsync(m_mfxSession, &inp, 1, &pOutSurf, 1, &OCLSyncPoint);

        if (MFX_WRN_DEVICE_BUSY == sts)
            Sleep(1); // just wait and then repeat the same call

        // ignore warnings if output is available
        if (MFX_ERR_NONE < sts && OCLSyncPoint)
            sts = MFX_ERR_NONE;

        if (MFX_ERR_NONE != sts)
        {
            printf("ERROR: OpenCL filter returned error %d!\n", sts);
            return sts;
        }

        m_Tasks[m_TaskIndex].m_OCLSync = OCLSyncPoint;
        m_Tasks[m_TaskIndex].m_pOCLOutSurface = &m_pmfxOCLSurfaces[nOCLSurfIndex];
    }

    {   // feed the encoder
        static mfxU16 nEncSurfIdx = 0; // index of free surface
        mfxSyncPoint EncSyncP = 0;
        mfxStatus sts = MFX_ERR_NONE;

        for (;;)
        {
            // at this point the surface for the encoder contains either a frame from the file
            // or a frame processed by the OCL plug-in
            sts = m_pmfxENC->EncodeFrameAsync(NULL, &m_pmfxOCLSurfaces[nOCLSurfIndex], &m_mfxEncBS, &EncSyncP);

            if (MFX_ERR_NONE < sts && !EncSyncP)
            {   // repeat the call on a warning with no output
                if (MFX_WRN_DEVICE_BUSY == sts)
                    MSDK_SLEEP(1); // wait if the device is busy
            }
            else if (MFX_ERR_NONE < sts && EncSyncP)
            {
                sts = MFX_ERR_NONE; // ignore warnings if output is available
                break;
            }
            else if (MFX_ERR_NOT_ENOUGH_BUFFER == sts)
            {
                sts = AllocateSufficientBuffer(&m_mfxEncBS);
                MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
                printf("\n BUFFER allocated");
            }
            else
            {
                break;
            }
        }

        if (MFX_ERR_MORE_DATA == sts)
            sts = MFX_ERR_NONE;

        if (MFX_ERR_NONE == sts)
        {   // the encoder returned a sync point; fill the current task
            m_Tasks[m_TaskIndex].m_EncodeSync = EncSyncP;
            m_Tasks[m_TaskIndex].m_pEncodeOutSurface = &m_pmfxEncSurfaces[nEncSurfIdx];
        }
    }

    // increment the task index to point to the next task
    m_TaskIndex = (m_TaskIndex + 1) % SYNC_BUF_SIZE;
    return stsOut;
}
