IPP7 H264VideoDecoder high CPU with threading

This issue is reproducible in the simple_player example. Take an H.264 video clip and play it in simple_player with -t0, then play it again with -t1. The clip I used for testing is the Serenity trailer from http://www.h264info.com/clips.html.

I'm using IPP 7.0.6 with the 7.0.6 sample code. I'm on a quad-core i7, and with threads set to 0 (internally this would be 8) my CPU sits at roughly 20% for the entirety of the clip. That is nearly 2 of my 8 logical CPUs (counting hyperthreading) fully maxed. If I take the same clip and play it with threads set to 1, it sits at 1% to 2% CPU with some rare spikes up to 5%. I would expect slightly higher CPU with threading, but nothing more than an extra 1-2%.

To make matters worse, if I enable threading in our application, which decodes and displays several different cameras at a time (up to 48), then once enough cameras are being displayed it sits consistently at 99% CPU and makes the entire PC unusable until you close the application. In IPP 6.1.6 this worked fine. Granted, if I force the threads to be limited to 2, then everything is much more usable and it actually performs better than 6.1.6 with threads set to 0. So if it weren't for things going crazy with 8 threads, it would actually be a nice performance bump.
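
As a rough illustration of limiting threads per decoder, here is a minimal sketch. It assumes the application knows how many decoders are active. ThreadsPerDecoder and all of its parameters are hypothetical names, not part of IPP or the UMC sample code, and the cap of 2 is just the value that happened to work in the test above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an IPP/UMC function): choose how many internal
// threads each decoder should request so that the total stays close to the
// number of hardware threads.
std::size_t ThreadsPerDecoder(std::size_t hardwareThreads,
                              std::size_t activeDecoders,
                              std::size_t cap) {
    assert(cap > 0);
    if (hardwareThreads == 0 || activeDecoders == 0) {
        return 1;  // nothing known about the machine or load: stay serial
    }
    // Fair share of the machine per decoder, never below one thread.
    const std::size_t fairShare =
        std::max<std::size_t>(1, hardwareThreads / activeDecoders);
    // Never exceed the caller's cap, even when few decoders are active.
    return std::min(fairShare, cap);
}
```

The result would then be passed as the decoder's thread count in place of 0, so that a single camera still gets a small speedup while a full wall of cameras falls back to one thread each.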

Also, has there ever been any thought of implementing a thread pool feature so that decoders could share threads in situations like ours? That way our potential 48 H264VideoDecoders could share just 8 threads rather than creating 384 threads (48 decoders x 8 threads each).
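
A minimal sketch of the shared-pool idea, using only the C++ standard library. SharedDecodePool is a hypothetical class and does not exist in IPP or the UMC samples; it assumes each decoder could hand its slice-level work to the pool as a callable instead of owning its own worker threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical shared pool: a fixed set of workers serves every decoder.
class SharedDecodePool {
public:
    explicit SharedDecodePool(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { Run(); });
        }
    }

    // Finishes all queued tasks, then joins the workers.
    ~SharedDecodePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) {
            t.join();
        }
    }

    SharedDecodePool(const SharedDecodePool&) = delete;
    SharedDecodePool& operator=(const SharedDecodePool&) = delete;

    // Any decoder may submit work; exactly one sleeping worker is woken.
    void Submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

    std::size_t WorkerCount() const { return threads_.size(); }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Sleep until there is work or the pool is shutting down.
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return;  // shutting down and the queue is drained
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> threads_;
    bool stopping_ = false;
};
```

With something like this, 48 decoders would each call Submit on one pool constructed with 8 workers, so the process would hold 8 decoding threads in total rather than 384.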

Hello,

There are some known cases where introducing the internal OpenMP threading may increase CPU usage. More discussion of this problem can be found here:

http://software.intel.com/en-us/articles/high-cpu-usage-and-intel-ipp-th...

If one CPU core is enough to decode the video and you are already using threading at a higher level, I suggest disabling the internal threading in the H.264 decoder and using the serial code to decode the video.

Thanks,
Chao

That document helps, but I really think what it comes down to is the section in the user manual under "Avoiding Nested Parallelization." With that in mind, it seems like the IPP sample code is sort of breaking your own rules, since the H.264 decoder uses multiple threads by default and does not use OpenMP. Granted, you can use the colorspace conversion and resizing without an H.264 decoder, but it feels like this should be handled internally rather than being bad by default.

I finally got around to testing this out and looking into it more. I tried the suggestions of calling ippSetNumThreads(1) and setting the environment variable KMP_BLOCKTIME=0, and neither of them worked in the sample application. I'm pretty certain there is a bug in the new H.264 multi-threading code introduced in the IPP 7 sample code. I'm curious why no one else is reporting it, though.

Something seems a little odd to me in the function bool TaskBrokerTwoThread::GetNextTaskInternal(H264Task *pTask) at umc_h264_task_broker.cpp:1780. The logic, as I read it, says that if there are no tasks to process (!m_FirstAU), m_isExistMainThread is set (always true), and this isn't the main thread (the check after !pTask->m_iThreadNumber), then it calls AwakeThreads. Why does one of the worker threads need to wake the other threads up to signal that there is no data? On top of that, this happens in a loop, so it's going to wake up all the other threads and then go to sleep because there is no work. The other threads are then going to do the exact same thing, hence the problem. Perhaps I'm missing something, but that definitely feels wrong to me. Unfortunately I haven't followed the logic far enough to know what that call is trying to achieve. Removing the call drops the CPU usage and you still get additional performance from threading, so that definitely seems to be the right track.
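
The effect described above can be shown with a toy model. This is a single-threaded simulation of my reading of the logic, not the actual UMC code: CountIdleWakeups and its parameters are hypothetical names, worker 0 stands in for the main thread, and each "round" is one pass in which every awake worker finds no work and goes back to sleep.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (an assumption, not the UMC implementation): count how many
// times workers wake up and find nothing to do while the task queue is empty.
std::size_t CountIdleWakeups(std::size_t workers,
                             std::size_t rounds,
                             bool idleWorkerWakesPeers) {
    std::vector<bool> awake(workers, true);  // everyone is woken once at start
    std::size_t wakeups = 0;
    for (std::size_t r = 0; r < rounds; ++r) {
        std::vector<bool> next(workers, false);
        for (std::size_t i = 0; i < workers; ++i) {
            if (!awake[i]) {
                continue;
            }
            ++wakeups;  // woke up, looked for a task, found none
            // Modelled behaviour: a non-main worker with no work wakes
            // every other worker before it goes back to sleep.
            if (idleWorkerWakesPeers && i != 0) {
                for (std::size_t j = 0; j < workers; ++j) {
                    if (j != i) {
                        next[j] = true;
                    }
                }
            }
        }
        awake = next;  // only workers that were signalled run next round
    }
    return wakeups;
}
```

In this model, 8 workers over 10 rounds produce 80 idle wake-ups when idle workers wake their peers, against 8 when they simply go back to sleep. The real cost depends on scheduler timing, but the shape matches the constant CPU load seen with -t0.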

At first I thought taking that code out resulted in lower performance, but the slowdown turned out to come from code I had added to work around another IPP 7 issue. Unlike IPP 6, when you hit the maximum number of frames and pass in the next frame, IPP 7 returns the first decoded frame but does not read in any of the new data. The IPP 6 samples would return the first decoded frame AND queue up the new data. Our application needs to keep track of how many frames are buffered so that our frame stepping is more accurate, so this change is a little annoying.

Also, for comparison, in the IPP 6.1.6 sample code AwakeThreads was only called when the main thread was in GetNextTaskInternal. Not that it's safe to compare the old code with the new code, since a lot has changed, but that particular function still seems fairly similar.
