Decoder input bitstream (and general performance) optimization

Is there anything in particular we should be aware of when trying to get the
highest performance out of hardware decoding for low-latency streaming media?
We're trying to run quite a few such decoding sessions in parallel (>10), and
I'm noticing peculiar differences in the decoder's ability to keep up depending
on how the input mfxBitstream is managed.

I had thought that circular-buffer-style management of the bitstream read
offset pointer (along with matching write logic) would be an improvement over
the method used in the example code, which copies any unread data back to the
start of the buffer as often as possible. It seems like only moving pointers
should be far more efficient than constantly memcpying encoded video data, but
instead the best-performing solution we've found is a double-buffered approach,
similar to the one used in the example except with a second input bitstream.
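
For concreteness, here is a rough sketch of what I mean by the example-style
approach, using the standard mfxBitstream fields; the helper names
(CompactBitstream, AppendFrame) are mine, not from the SDK samples:

    #include <cstring>
    #include "mfxvideo.h"   // mfxBitstream and related types

    // Move any unread bytes back to the start of the buffer so DataOffset returns to 0.
    static void CompactBitstream(mfxBitstream& bs)
    {
        if (bs.DataOffset > 0 && bs.DataLength > 0)
            std::memmove(bs.Data, bs.Data + bs.DataOffset, bs.DataLength);  // regions may overlap
        bs.DataOffset = 0;
    }

    // Append one complete encoded frame after the unread data; fails if it doesn't fit.
    static bool AppendFrame(mfxBitstream& bs, const mfxU8* frame, mfxU32 size)
    {
        CompactBitstream(bs);
        if (bs.DataLength + size > bs.MaxLength)
            return false;
        std::memcpy(bs.Data + bs.DataLength, frame, size);
        bs.DataLength += size;
        return true;
    }

The circular-buffer variant just advances DataOffset on reads and wraps the
write position instead of ever calling memmove; the double-buffered variant
keeps two such bitstreams and alternates between them.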

Any idea why this is the case? Is there some reason decoding operations perform
better when they start from the beginning of the mfxBitstream buffer rather
than at some offset into it?

And why does the guide recommend decoding all remaining frames from the input
buffer before adding more? Shouldn't write operations appending to the end of a
region of memory be transparent to the decoder, which is simply reading from
its own (earlier) part of that region? It's not as though there's any sort of
inherent lock on the bitstream data, since it's randomly accessible.
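
In other words, I would expect something like the following to be valid between
calls: new data is written after the unread region and only DataLength grows,
while the decoder's read position (DataOffset) is untouched. This is just a
sketch of the pointer arithmetic involved, not something I'm claiming the SDK
guarantees:

    // Append without compacting: write after the unread region, leave DataOffset alone.
    static bool AppendInPlace(mfxBitstream& bs, const mfxU8* frame, mfxU32 size)
    {
        mfxU32 end = bs.DataOffset + bs.DataLength;   // first free byte after unread data
        if (end + size > bs.MaxLength)
            return false;                             // no room left at the tail of the buffer
        std::memcpy(bs.Data + end, frame, size);
        bs.DataLength += size;                        // decoder now sees the extra data
        return true;
    }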

Again, this is all in a low-latency setup, with AsyncDepth set to 1 and the
bitstream flag set to indicate that a complete frame is available every time
DecodeFrameAsync is called. That said, increasing AsyncDepth and clearing that
flag doesn't seem to improve throughput at the expense of latency, as I would
have expected. Maybe that's because of the large number of simultaneous
decoding sessions, which leaves the hardware no headroom to work ahead on any
one of them.
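
For completeness, each per-frame submission looks roughly like this; session
setup, bitstream filling, and surface management are handled elsewhere, and
with AsyncDepth = 1 we sync immediately rather than pipelining:

    // One low-latency decode submission: a complete frame per call, synced right away.
    // (MFX_ERR_MORE_DATA / MFX_ERR_MORE_SURFACE handling omitted for brevity.)
    mfxStatus DecodeOneFrame(mfxSession session, mfxBitstream& bs,
                             mfxFrameSurface1* work, mfxFrameSurface1** out)
    {
        mfxSyncPoint syncp = nullptr;
        bs.DataFlag |= MFX_BITSTREAM_COMPLETE_FRAME;   // a whole frame is present in the buffer

        mfxStatus sts = MFXVideoDECODE_DecodeFrameAsync(session, &bs, work, out, &syncp);
        if (sts == MFX_ERR_NONE && syncp)
            sts = MFXVideoCORE_SyncOperation(session, syncp, 60000 /* ms timeout */);
        return sts;
    }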

Any advice, insights, or assistance would be appreciated!



Hi James,

Based on your description, there is no reason performance should differ depending on how you manage the bitstream buffer. The sample implementation just showcases one way of loading data into the buffer and processing it. Likewise, there is no specific reason the bitstream buffer must be drained before more data is added; that recommendation simply stems from how the sample was designed, which is a simplistic approach.

I suspect the reason for the performance difference lies elsewhere.

Media SDK does not limit the number of simultaneous ongoing sessions. The HW resources are shared equally between the running Media SDK workloads.
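
As a rough illustration (error handling and MFXClose omitted), one independent
hardware session can simply be initialized per stream:

    #include <vector>
    #include "mfxvideo.h"

    // Open one hardware-accelerated session per stream.
    std::vector<mfxSession> OpenSessions(int count)
    {
        std::vector<mfxSession> sessions;
        mfxVersion ver = {{0, 1}};                    // request API 1.0 or later
        for (int i = 0; i < count; ++i) {
            mfxSession s = nullptr;
            if (MFXInit(MFX_IMPL_HARDWARE_ANY, &ver, &s) == MFX_ERR_NONE)
                sessions.push_back(s);
        }
        return sessions;
    }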

If you can provide some more details about your architecture, configuration, and bottlenecks, we may be able to help you assess this further.

