Using Intel® Media Server Studio with OpenCL* Framework

Contents

Contents
About the Sample
Introduction and Motivation
Intel® Media SDK and OpenCL* Interoperability
Pipeline Description
Conclusion
Acknowledgements
References

About the Sample

This sample demonstrates how to extend a typical Intel® Media SDK decoding pipeline with an OpenCL* framework to perform simple post-processing on the decoded frames. It will demonstrate:

  • How to seamlessly utilize a custom OpenCL kernel while decoding and get the benefit of hardware acceleration.
  • Details on sharing surfaces between the Media SDK and OpenCL kernels using the cl_khr_dx9_media_sharing_extension
  • A simple OpenCL Kernel to flip the video orientation

(You can access the Intel® Media SDK for mobile/client usages, or for desktop/servers via Intel® Media Server Studio.)

Introduction and Motivation

The Intel Media SDK provides a feature rich media pipeline to application developers that can be extended to meet the demands of even the most seasoned professionals. Media pipelines can be extended with propriety algorithms using a variety of different methods. One of the original ways was to leverage the Media SDK’s USER plugin infrastructure. While this method is suitable for some applications, it may be a bit “heavy” for simple tasks such as post-processing with an OpenCL kernel. This white paper demonstrates an alternative that is lightweight and suited towards OpenCL due to the inherent surface sharing capabilities of the Intel® OpenCL SDK for Applications.

Intel Media SDK and OpenCL Interoperability

The Intel Media SDK provides fast decoding and encoding of video streams, while OpenCL is good for the pre or post processing of the video frames. OpenCL interoperates with Media specific APIs through the use of dedicated extensions to avoid expensive copies between GPU and CPU. The sample uses “DX9 Media surface sharing” OpenCL extensions. These extensions allow applications to use media surface as OpenCL memory object, which allows efficient sharing of data between OpenCL and DX9 adapter APIs. If this extension is supported, an OpenCL image object can be created from a media surface and the OpenCL API can be used to execute kernels that read and/or write memory objects that are media surfaces. Note that OpenCL memory objects may be created from the adapter media surface if and only if the OpenCL context was created from that adapter [1]. To check if this extension is supported the string “cl_khr_dx9_media_sharing” will be present in the CL_PLATFORM_EXTENSIONS. The interfaces for this extension are provided in cl_d3d9.h header file (Note: The media sharing extensions are no longer available in cl_dx9_media_sharing.h header file). For a list of other extensions supported by OpenCL, refer to https://www.khronos.org/registry/cl/specs/opencl-1.2-extensions.pdf

Pipeline Description

Media SDK is a framework that enables media applications by providing APIs for ease of development. These APIs are optimized for the underlying hardware/accelerators and provide good abstractions for most of the heavy-duty media algorithm implementations. So, as a developer, we need to understand the sequence these APIs should be called in to set-up the media pipeline and say go. The Figure below shows the basic structure of the Media SDK application. Let’s discuss each of these stages in detail.

Initialize Session

  • Initialize Media SDK session and Create a decoder: MFXInit() function initializes the media session for an implementation specified and for the version available. It’s recommended to use MFX_IMPL_HARDWARE or MFX_IMPL_AUTO_ANY if unsure about underlying driver support.
    
    		mfxIMPL impl = MFX_IMPL_AUTO_ANY;
    		mfxVersion ver = {0, 1};
    		MFXVideoSession mfxSession;
    		sts = mfxSession.Init(impl, &ver);
    		MFXVideoDECODE mfxDEC(mfxSession);
    
    		
  • Create D3D9 Device Context: OpenCL functions allow applications to use media surface as OpenCL memory objects, thus enabling efficient data sharing between OpenCL and media surface APIs. Using the OpenCL APIs we can execute the kernels that read and/or write memory objects that are also media surfaces. OpenCL memory objects may be created from media surfaces if and only if the OpenCL context has been created from the media adapter [1]. So let’s create a D3D9 device context and a device handle pointing to this device. Make sure that surfaces for this device are allocated with shared handles. This is done by setting CreateShareHandle= true, the last argument in the function CreateHWDevice.
    		
    		mfxHDL deviceHandle;
    		sts = CreateHWDevice(mfxSession, &deviceHandle, NULL, true);
    		MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);   
    		mfxStatus CreateHWDevice(mfxSession session, mfxHDL* deviceHandle, HWND window, bool bCreateSharedHandles)
    		{
    			// If window handle is not supplied, get window handle from coordinate 0,0
    			if (window == NULL) {
    				POINT point = {0, 0};
    				window = WindowFromPoint(point);
    			}
    
    			g_bCreateSharedHandles = bCreateSharedHandles;
    
    			HRESULT hr = Direct3DCreate9Ex(D3D_SDK_VERSION, &pD3D9);
    			if (!pD3D9 || FAILED(hr)) return MFX_ERR_DEVICE_FAILED;
    
    			RECT rc;
    			GetClientRect(window, &rc);
    
    			D3DPRESENT_PARAMETERS D3DPP;
    			memset(&D3DPP, 0, sizeof(D3DPP));
    			D3DPP.Windowed                   = true;
    			D3DPP.hDeviceWindow              = window;
    			D3DPP.Flags                      = D3DPRESENTFLAG_VIDEO;
    			D3DPP.FullScreen_RefreshRateInHz = D3DPRESENT_RATE_DEFAULT;
    			D3DPP.PresentationInterval       = D3DPRESENT_INTERVAL_ONE;
    			D3DPP.BackBufferCount            = 1;
    			D3DPP.BackBufferFormat           = D3DFMT_A8R8G8B8;
    			D3DPP.BackBufferWidth            = rc.right - rc.left;
    			D3DPP.BackBufferHeight           = rc.bottom - rc.top;
    			D3DPP.Flags                     |= D3DPRESENTFLAG_LOCKABLE_BACKBUFFER;
    			D3DPP.SwapEffect                 = D3DSWAPEFFECT_DISCARD;
    
    			hr = pD3D9->CreateDeviceEx( GetIntelDeviceAdapterNum(session),
    										D3DDEVTYPE_HAL,
    										window,
    										D3DCREATE_SOFTWARE_VERTEXPROCESSING | D3DCREATE_MULTITHREADED | D3DCREATE_FPU_PRESERVE,
    										&D3DPP,
    										NULL,
    										&pD3DD9);
    			if (FAILED(hr)) return MFX_ERR_NULL_PTR;
    
    			hr = pD3DD9->ResetEx(&D3DPP, NULL);
    			if (FAILED(hr)) return MFX_ERR_UNDEFINED_BEHAVIOR;
    
    			hr = pD3DD9->Clear(0, NULL, D3DCLEAR_TARGET, D3DCOLOR_XRGB(0, 0, 0), 1.0f, 0);
    			if (FAILED(hr)) return MFX_ERR_UNDEFINED_BEHAVIOR;
    
    			UINT resetToken = 0;
    
    			hr = DXVA2CreateDirect3DDeviceManager9(&resetToken, &pDeviceManager9);
    			if (FAILED(hr)) return MFX_ERR_NULL_PTR;
    
    			hr = pDeviceManager9->ResetDevice(pD3DD9, resetToken);
    			if (FAILED(hr)) return MFX_ERR_UNDEFINED_BEHAVIOR;
    
    			*deviceHandle = (mfxHDL)pDeviceManager9;
    
    			return MFX_ERR_NONE;
    		}
    
    		

In order to provide the device manager to Media SDK, use the function MFXVideoCore_SetHandle. This sets the system handle that the SDK implementation might need.


		sts = mfxSession.SetHandle(DEVICE_MGR_TYPE, deviceHandle);
		virtual mfxStatus SetHandle(mfxHandleType type, mfxHDL hdl) { return MFXVideoCORE_SetHandle(m_session, type, hdl); }

		
  • OpenCL Initialization: Make sure to pass the address of the D3D device created. To verify if surface sharing is supported on the platform, searching for the string “cl_khr_dx9_media_sharing” in the CL_PLATFORM_EXTENSIONS.
    		
    		// --- OCL processing initialization
    		OCLProcess oclProcess;
    		cl_int clSts = oclProcess.OCLInit(GetDevice());
    		MSDK_CHECK_RESULT(clSts, CL_SUCCESS, clSts);
    		 clSts = clGetPlatformInfo(m_clPlatform, CL_PLATFORM_EXTENSIONS, sizeof(str), str, &len);
    		 if (NULL == strstr(str, "cl_khr_dx9_media_sharing") || CL_SUCCESS != clSts) {
    
    		

Now we have to hook up to the D3D extension. Calling the OpenCL API function clGetExtensionFunctionAddressForPlatform returns the address of the extension function name for a given platform. The interfaces for theses extensions are in the cl_d3d9.h header file which we should include in our application.

		
		#define EXT_DECLARE(_name) _name##_fn _name
		#define EXT_INIT(_p, _name) _name = (_name##_fn) clGetExtensionFunctionAddressForPlatform((_p), #_name); res &= (_name != NULL);

		EXT_DECLARE(clGetDeviceIDsFromDX9MediaAdapterKHR);
		EXT_DECLARE(clCreateFromDX9MediaSurfaceKHR);
		EXT_DECLARE(clEnqueueAcquireDX9MediaSurfacesKHR);
		EXT_DECLARE(clEnqueueReleaseDX9MediaSurfacesKHR);
		inline int InitDX9MediaFunctions(cl_platform_id platform) // get DX9 sharing functions
		{
			bool res = true;
			EXT_INIT(platform,clGetDeviceIDsFromDX9MediaAdapterKHR);
			EXT_INIT(platform,clCreateFromDX9MediaSurfaceKHR);
			EXT_INIT(platform,clEnqueueAcquireDX9MediaSurfacesKHR);
			EXT_INIT(platform,clEnqueueReleaseDX9MediaSurfacesKHR);
			return res;
		}
		
		

clGetDeviceIDsFromDX9MediaAdapterKHR queries a media adapter for any associated OpenCL devices. Adapters with associated OpenCL devices can enable media sharing between two. Create Context with the D3D device.

		
		clSts = clGetDeviceIDsFromDX9MediaAdapterKHR(   m_clPlatform,
						1,
						&type,
						(void**)&m_pD3DDevice,
						CL_ALL_DEVICES_FOR_DX9_MEDIA_ADAPTER_KHR,
						sizeof(devices)/sizeof(devices[0]),
						devices,
						&numDevices);
			m_clContext = clCreateContext(props, numDevices, devices, NULL, NULL, &clSts);
		
		

Until now we have D3D device context initialized in the OpenCL Context. Next step is to create Media Resources as OpenCL Image Objects so we can share the resources - Use the function clCreateFromDX9MediaSurfaceKHR to create an OpenCL image object from a media surface.

		
		oclProcess.OCLPrepare(  mfxVideoParams.mfx.FrameInfo.CropW,
											mfxVideoParams.mfx.FrameInfo.CropH, 
											mfxResponse.mids,	// Input surfaces (D3D surfaces & shared handles)
											numSurfaces,		// Number of input surfaces
											outputSurfaceMids,	// Output surfaces (D3D surfaces & shared handles)
											1);					// Number of output surfaces

		// Y plane
					m_pOCLBuffers[i][idx].OCL_Y = clCreateFromDX9MediaSurfaceKHR(m_clContext, 0, CL_ADAPTER_D3D9EX_KHR, &info, 0, &clSts);

		// UV plane
					m_pOCLBuffers[i][idx].OCL_UV = clCreateFromDX9MediaSurfaceKHR(m_clContext, 0, CL_ADAPTER_D3D9EX_KHR, &info, 1, &clSts);

		

Set Parameters

In this stage we set all the required video parameters for the decoder and initialize the Media SDK decoder with those parameters. It is optionally validating the provided video parameters. If an SDK function operates on video memory surface at both input and output, the application must specify the access pattern IOPattern at the initialization in MFX_IOPATTERN_IN_VIDEO_MEMORY for input and MFX_IOPATTERN_OUT_VIDEO_MEMORY for output. We are setting the CODEC parameter to AVC.

		
		mfxVideoParam mfxVideoParams;
		memset(&mfxVideoParams, 0, sizeof(mfxVideoParams));
		mfxVideoParams.mfx.CodecId = MFX_CODEC_AVC;
		mfxVideoParams.IOPattern = MFX_IOPATTERN_OUT_VIDEO_MEMORY
			
		// Initialize the Media SDK decoder
		sts = mfxDEC.Init(&mfxVideoParams);
		MSDK_IGNORE_MFX_STS(sts, MFX_WRN_PARTIAL_ACCELERATION);
		MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
		
		

We call the DecoderHeader() that parses in the input bit stream, searching for header and fills the mfxVideoParam structure with appropriate values, such as resolution and frame rate. The application can then pass the resulting mfxVideoParam structure to the MFXVideoDECODE_Init function for decoder initialization. You can call this function at any time before or after decoder initialization.

		
		sts = mfxDEC.DecodeHeader(&mfxBS, &mfxVideoParams);
		MSDK_IGNORE_MFX_STS(sts, MFX_WRN_PARTIAL_ACCELERATION);
		MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
		
		

Query

Now that we have initialized the session, set the parameters and queried their support, we have to allocate the surface buffers that will be used by the SDK pipeline. QueryIOSurf function returns the minimum and suggested number of frame surface required for decoding. Where the surfaces are allocated is very important. Using the underlying hardware gives the best performance. Thus allocating buffers in the video memory is very crucial for best performance of the SDK as it eliminates the copying from system memory to video memory. During the session initialization, use video memory for IO for input and output to specify the usage.Query the number of required surfaces for the decoder.

		
		mfxFrameAllocRequest Request;
		memset(&Request, 0, sizeof(Request));
		sts = mfxDEC.QueryIOSurf(&mfxVideoParams, &Request);
		MSDK_IGNORE_MFX_STS(sts, MFX_WRN_PARTIAL_ACCELERATION);
		MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
		
		

Allocate Surfaces

During allocation of surfaces, use SDK provided alloc functions instead of memset() or new() that would allocate the buffers in the system memory.

		
		// Allocate required surfaces
		mfxFrameAllocResponse mfxResponse;
		sts = mfxAllocator.Alloc(mfxAllocator.pthis, &Request, &mfxResponse);
		MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);

		mfxU16 numSurfaces = mfxResponse.NumFrameActual;

		// Allocate surface headers (mfxFrameSurface1) for decoder
		mfxFrameSurface1** pmfxSurfaces = new mfxFrameSurface1*[numSurfaces];
		MSDK_CHECK_POINTER(pmfxSurfaces, MFX_ERR_MEMORY_ALLOC);       
		for (int i = 0; i < numSurfaces; i++)
		{       
			pmfxSurfaces[i] = new mfxFrameSurface1;
			memset(pmfxSurfaces[i], 0, sizeof(mfxFrameSurface1));
			memcpy(&(pmfxSurfaces[i]->Info), &(mfxVideoParams.mfx.FrameInfo), sizeof(mfxFrameInfo));
			pmfxSurfaces[i]->Data.MemId = mfxResponse.mids[i]; // MID (memory id) represent one D3D NV12 surface  
		}  
	
		

Find Free Surface

The SDK uses the surfaces allocated and initialized to do the processing. So, to begin processing, use the GetFreeSurface() function to get a surface that is free (not locked by other process) and can be used for the processing. This functionality is as simple as find an unlocked surface for use.

		
		nIndex = GetFreeSurfaceIndex(pmfxSurfaces, numSurfaces); // Find free frame surface
		
		

After finding the free surface, read the bit stream (for decoder), or the frame (for encoder) into the surface and pass the surface for processing.

Processing Loop

It’s recommended to use asynchronous calls to process frames. In a loop that terminates when the input is empty, fire asynchronous functions for either decode/encode/VPP. This way, you are processing multiple frames in parallel thus improving the performance significantly. Also the post processing on the RAW video frames via OpenCL is done.

		
		while (MFX_ERR_NONE <= sts || MFX_ERR_MORE_DATA == sts || MFX_ERR_MORE_SURFACE == sts)          
			{
				if (MFX_WRN_DEVICE_BUSY == sts)
					Sleep(1); // Wait if device is busy, then repeat the same call to DecodeFrameAsync

				if (MFX_ERR_MORE_DATA == sts)
				{
					sts = ReadBitStreamData(&mfxBS, fSource); // Read more data into input bit stream
					MSDK_BREAK_ON_ERROR(sts);            
				}

				if (MFX_ERR_MORE_SURFACE == sts || MFX_ERR_NONE == sts)
				{
					nIndex = GetFreeSurfaceIndex(pmfxSurfaces, numSurfaces); // Find free frame surface 
					MSDK_CHECK_ERROR(MFX_ERR_NOT_FOUND, nIndex, MFX_ERR_MEMORY_ALLOC);
				}
				
				// Decode a frame asychronously (returns immediately)
				//  - If input bitstream contains multiple frames DecodeFrameAsync will start decoding multiple frames, and remove them from bitstream
				sts = mfxDEC.DecodeFrameAsync(&mfxBS, pmfxSurfaces[nIndex], &pmfxOutSurface, &syncp);

				// Ignore warnings if output is available, 
				// if no output and no action required just repeat the DecodeFrameAsync call
				if (MFX_ERR_NONE < sts && syncp) 
					sts = MFX_ERR_NONE;

				if (MFX_ERR_NONE == sts)
					sts = mfxSession.SyncOperation(syncp, 60000); // Synchronize. Wait until decoded frame is ready

		if (MFX_ERR_NONE == sts)
				{
					++nFrame;

					// OCL processing
					int idx = GetSurfaceIdxFromMemId(pmfxSurfaces, numSurfaces, pmfxOutSurface->Data.MemId);
					clSts = oclProcess.OCLProcessSurface(idx, 0);
					MSDK_CHECK_RESULT(clSts, CL_SUCCESS, clSts);
					// ---

		#ifdef ENABLE_OUTPUT
					sts = WriteRawFrameB((IDirect3DSurface9*)outputSurfaceMids[0], mfxVideoParams.mfx.FrameInfo.CropW, mfxVideoParams.mfx.FrameInfo.CropH, fSink);
					MSDK_BREAK_ON_ERROR(sts);

					printf("Frame number: %dr", nFrame);
		#endif
				}
	
		
  • Lock and enqueue OpenCL Kernels for processing: We have created the memory objects from the Media surface. Now we have to acquire the OpenCL image objects created from Media surfaces before putting them in the command queue. Use the function clEnqueueAcquireDX9MediaSurfacesKHR to lock the surface and then enqueue the kernel for processing. Once the processing is done release the lock before they may be accessed by media adapter (D3D9 I our case). Accessing a media surface while its corresponding OpenCL memory object is acquired is an OpenCL error and may results in data corruption.
		
		clSts = clEnqueueAcquireDX9MediaSurfacesKHR(m_clQueue, 4, m_oclSurfaces, 0, NULL, NULL);
			if (clSts) printf("clEnqueueAcquireDX9MediaSurfacesKHR err=%dn", clSts);

		// Enqueue kernel processing of Y plane
			clSts = clSetKernelArg(m_clKernel, 0, sizeof(cl_mem), &m_pOCLBuffers[OCL_IN][midIdxIn].OCL_Y);  // In
			if (clSts) printf("clSetKernelArg1 err=%dn", clSts);
			clSts = clSetKernelArg(m_clKernel, 1, sizeof(cl_mem), &m_pOCLBuffers[OCL_OUT][midIdxOut].OCL_Y); // Out
			if (clSts) printf("clSetKernelArg2 err=%dn", clSts);
			clSts = clSetKernelArg(m_clKernel, 2, sizeof(cl_int), &m_frameSizeY[1]); // Frame height
			if (clSts) printf("clSetKernelArg3 err=%dn", clSts);

			clSts = clEnqueueNDRangeKernel(m_clQueue, m_clKernel, 2, NULL, m_GlobalWorkSizeY, m_LocalWorkSizeY, 0, NULL, NULL);
			if (clSts) printf("clEnqueueNDRangeKernel Y err=%dn", clSts);

			// Enqueue kernel processing of UV plane
			clSts = clSetKernelArg(m_clKernel, 0, sizeof(cl_mem), &m_pOCLBuffers[OCL_IN][midIdxIn].OCL_UV);  // In
			if (clSts) printf("clSetKernelArg1 err=%dn", clSts);
			clSts = clSetKernelArg(m_clKernel, 1, sizeof(cl_mem), &m_pOCLBuffers[OCL_OUT][midIdxOut].OCL_UV); // Out
			if (clSts) printf("clSetKernelArg2 err=%dn", clSts);
			clSts = clSetKernelArg(m_clKernel, 2, sizeof(cl_int), &m_frameSizeUV[1]); // Frame height
			if (clSts) printf("clSetKernelArg3 err=%dn", clSts);

			clSts = clEnqueueNDRangeKernel(m_clQueue, m_clKernel, 2, NULL, m_GlobalWorkSizeUV, m_LocalWorkSizeUV, 0, NULL, NULL);
			if (clSts) printf("clEnqueueNDRangeKernel UV err=%dn", clSts);

			clSts = clEnqueueReleaseDX9MediaSurfacesKHR(m_clQueue, 4, m_oclSurfaces, 0, NULL, NULL);
			if (clSts) printf("clEnqueueReleaseDX9MediaSurfacesKHR err=%dn", clSts);
		
		

Drain and Cleanup

With the exception that the input parameter to the FrameAsync function is NULL, we are draining the pipeline at this stage. For pseudo-code, all you need to do is replace the first parameter for FrameAsync function with NULL in the while loop above. Once the pipeline draining is done, we deallocate all the buffers used and close the file handles if any.

		
		// Decode a frame asychronously (returns immediately)
		sts = mfxDEC.DecodeFrameAsync(NULL, pmfxSurfaces[nIndex], &pmfxOutSurface, &syncp);
			mfxDEC.Close();
			// mfxSession closed automatically on destruction

				mfxDEC.Close();
			// mfxSession closed automatically on destruction

				for (int i = 0; i < numSurfaces; i++)
				 delete pmfxSurfaces[i];
			MSDK_SAFE_DELETE_ARRAY(pmfxSurfaces);
			MSDK_SAFE_DELETE_ARRAY(mfxBS.Data);

			// Decoder surfaces are deallocated automatically, no need to call mfxAllocator.Free(...)
			//mfxAllocator.Free(mfxAllocator.pthis, &mfxResponse);

			fclose(fSource);
			fclose(fSink);

			CleanupHWDevice();

			// --- OCL release
			if(outputSurfaceMids[0] != NULL)
				((IDirect3DSurface9*)outputSurfaceMids[0])->Release();
			oclProcess.OCLRelease();
			// --- End OCL release
		
		

Pre/Post Processing:

The class OCLProcess is a generic processing class. This class can be used for both pre and post processing. All the steps described above are mostly the setup process and doing the surface sharing. This need to be done only once. To call your specific kernel file and the function within the kernel, all we have to do is set these global values. “ocl_flip.cl” is the kernel file name and the kernel program name is “Flip”

		
		const char* g_kernelFileName =          "ocl_flip.cl";
		const char* g_kernelFunctionName =      "Flip";
	
		

Conclusion

Simplicity of sharing the Media Surface with OpenCL buffer is the key takeaway from this paper. There are other methods of sharing the media surfaces that Media SDK supports which is also known as user defined plugin model. But it involves lot of heavy lifting that the ISV has to do.

Acknowledgements

Many people helped in making this white paper possible; providing the source code, reviewing, providing feedback and encouragement. Thanks to Sravanthi Kota Venkata, Eric Sardella, Antony Pabon and Jeff McAllister.

References

  1. https://www.khronos.org/registry/cl/specs/opencl-1.2-extensions.pdf
  2. https://software.intel.com/en-us/intel-media-server-studio
  3. https://software.intel.com/en-us/articles/opencl-and-intel-media-sdk
  4. https://software.intel.com/en-us/articles/framework-for-developing-applications-using-media-sdk

About the Author

Sudhakar Draksharapu is an Application Engineer in Intel's Software and Services Group, Developer Relations Division, and work with customers to enable and integrate their solutions with the Media Server Studio. His Domain expertise lies is in Media SDK, Optimization techniques and OpenCL technologies.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Kronos

有关编译器优化的更完整信息,请参阅优化通知