H264 decoder: HW acceleration slower than software implementation on Sandy Bridge system

I used sample_decode.exe to decode an H264 (720p) stream with and without the "-hw" option on a Sandy Bridge system (3 GHz, 8 GB, Windows 7 64-bit). The test uses system memory (not D3D surfaces). The results are as follows:
Software implementation(without "-hw"): 1497 fps.
Hardware implementation(with "-hw"): 556 fps.
My question is: why is the performance with hardware acceleration slower than with the software implementation?

Thank you in advance!



No answers yet. Maybe the question was not very clear.

Actually, the test results for both settings with system memory are very fast because they count only the time spent in the decoding functions (DecodeFrameAsync and m_mfxSession.SyncOperation(syncp, DEC_WAIT_INTERVAL)) and do not include any stream read/write functions. The test measures only the performance of the decoding functions.
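The measurement described here, timing only the decode calls and leaving stream I/O out, might look roughly like the following. This is a minimal C++ sketch, not the sample_decode code: `measure_fps` and its two callbacks are hypothetical names. In a real test, `decode_one` would wrap the DecodeFrameAsync and SyncOperation calls, and `read_input` would wrap the bitstream read.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper: accumulates the time spent inside decode_one() only,
// so the work done by read_input() does not count toward the frame rate.
// Returns frames per second of decode time, or 0 if no time was measured.
double measure_fps(std::size_t frames,
                   const std::function<void()>& read_input,
                   const std::function<void()>& decode_one) {
    using clock = std::chrono::steady_clock;
    clock::duration decode_time{0};
    for (std::size_t i = 0; i < frames; ++i) {
        read_input();                         // stream I/O, not timed
        const auto start = clock::now();
        decode_one();                         // decode + sync, timed
        decode_time += clock::now() - start;
    }
    const double seconds = std::chrono::duration<double>(decode_time).count();
    return seconds > 0.0 ? static_cast<double>(frames) / seconds : 0.0;
}
```

Because file reads are excluded, the resulting fps figures are much higher than what a complete playback pipeline would reach.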

With HW acceleration and system memory, the decoder performance drops somewhat, I think because the decoder needs to copy the decoded data from GPU memory to system memory. The memory copy operation may affect the performance. Am I right?

I also tested the decoder with both the hw and d3d options; the performance is far faster than with any other combination of options.

Because our decoder application has to use system memory and HW acceleration (it cannot use D3D surfaces), we still need to improve the performance. Any suggestions?



Sorry for the delay. I was actually writing the response to the original post as you posted the 2nd message. To answer your question: yes, the decoder must copy the data from the GPU back to system memory. There will be an inherent performance penalty when not using D3D surfaces.
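To give a feel for the size of that copy, here is a back-of-the-envelope calculation in C++. It assumes the decoder outputs NV12 (12 bits per pixel) and that frames are tightly packed, with no pitch padding or height alignment; the function names are hypothetical and not part of the Media SDK.

```cpp
#include <cstdint>

// NV12 stores a full-resolution 8-bit Y plane followed by an interleaved
// UV plane at half vertical resolution: width * height * 3 / 2 bytes.
// Assumes even dimensions and no row padding (pitch == width).
constexpr std::uint64_t nv12_frame_bytes(std::uint64_t width,
                                         std::uint64_t height) {
    return width * height * 3 / 2;
}

// Bytes per second that must move from video memory to system memory
// when every decoded frame is copied back at the given frame rate.
constexpr std::uint64_t copy_bytes_per_second(std::uint64_t width,
                                              std::uint64_t height,
                                              std::uint64_t fps) {
    return nv12_frame_bytes(width, height) * fps;
}

// A 720p NV12 frame is 1,382,400 bytes under these assumptions.
static_assert(nv12_frame_bytes(1280, 720) == 1382400, "720p NV12 frame size");
```

Under these assumptions, the 556 fps measured with "-hw" corresponds to about 768 MB/s of copy traffic (`copy_bytes_per_second(1280, 720, 556)`), which is on top of the decoding work itself.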



Thanks for your answer. How does the decoder with HW acceleration copy data from the GPU to system memory: as one copy of the whole frame, or as many copies of macroblock data?

Thank you in advance!


You're welcome! The Media SDK always uses complete frames when copying data. Hope this helps.
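As an illustration of what a whole-frame copy involves, the sketch below copies one NV12 frame from a pitched source (rows may carry padding) into a tightly packed destination buffer. It uses plain memcpy and does not reproduce the accelerated path the Media SDK uses internally; `copy_nv12_frame` and its parameters are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies a complete NV12 frame, plane by plane.
//   src_y     - start of the Y plane (height rows, src_pitch bytes apart)
//   src_uv    - start of the interleaved UV plane (height / 2 rows)
//   src_pitch - distance in bytes between source rows (>= width)
//   dst       - tightly packed output, at least width * height * 3 / 2 bytes
// Assumes an even height.
void copy_nv12_frame(const std::uint8_t* src_y, const std::uint8_t* src_uv,
                     std::size_t src_pitch, std::size_t width,
                     std::size_t height, std::uint8_t* dst) {
    // Y plane: one byte per pixel, copy only the visible part of each row.
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * width, src_y + row * src_pitch, width);
    }
    // UV plane: half as many rows, each still `width` bytes (U/V interleaved).
    std::uint8_t* dst_uv = dst + width * height;
    for (std::size_t row = 0; row < height / 2; ++row) {
        std::memcpy(dst_uv + row * width, src_uv + row * src_pitch, width);
    }
}
```

Copying row by row is what lets the destination drop the source padding; when source and destination pitch match, each plane could be copied with a single call instead.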




Hi Eric,

I'm considering adding support for DXVA decoding to my DirectShow video renderer. Due to quality concerns, I need to do the chroma upsampling and color conversion myself, though. Because of that, I fear I have to transfer the decoded data from GPU memory back to system memory. I've done some benchmarks and found that the GPU RAM -> system RAM transfer runs at only about 5 fps (H55) up to 20 fps (G45) for 1080p content. I've done the transfer by doing a simple LockRect(READ_ONLY) on the DXVA NV12 D3D surface. Now I'm wondering if the Media SDK uses a faster method for GPU RAM -> system RAM transfer? If so, how fast can the Media SDK transfer decoded 1080p frames with Intel GPUs? And is the Media SDK transfer method available for use outside of the Media SDK, too?

(Just for your information: DxvaNv12Surface.LockRect(READ_ONLY) is painfully slow with ATI GPUs, too, while I get up to 600fps with NVidia GPUs. Weird stuff...)

Thanks, Mathias.

Hi Mathias,
Intel Media SDK does use its own accelerated method for GPU RAM -> system RAM transfer on Intel architectures. This method is accessible through the mfxCoreInterface::CopyFrame function, which is part of the specific API extension for user plugins. Please check out mediasdkusr_man.pdf under \doc for details on how to get access to this function and use it.
As for performance figures, they may vary, so I suggest you benchmark on your particular system. All I can say is that it will be way faster than LockRect.
Best regards,
Nina

Thank you, Nina, that's very helpful! :)

Nina, if I understand correctly, HW enc/dec on SNB exists in a form,
similar to HW enc/dec on Atom 6xx family, i.e. as strong HW core, such
If this is true, is any postprocessing in the transcoding pipeline implemented as a GPU kernel with zero buffer copies (or near zero :)?
And if so, how can I load my own postprocessing GPU kernels into this pipeline with the same zero-buffer-copy approach?

With Sandy Bridge it's not possible to write custom processing filters on the GPU.
With future Intel platforms, access to GPU acceleration will become available through Intel OpenCL.

Hi, Nina

I have been testing memcpy() from video memory to system memory after H264 decoding (1080p).

However, CPU usage varies widely with the number of streams (Windows 7 32-bit, i5-2400).

With 1 to 5 streams, CPU usage is 1~2%.

With 6 or more streams, CPU usage is 90%. (I think HW decoding falls back to SW decoding.)

I saw in your earlier answers that CopyFrame and CopyBuffer use less CPU than memcpy().

CopyFrame and CopyBuffer are accessed through the mfxCoreInterface of mfxPlugin.

How can I use CopyFrame and CopyBuffer without mfxPlugin?

Best regards,
