H.264 encoder extremely slow in 7.1.1

I have just upgraded my rather old code from IPP 5.3 to 7.1.1. Turned out to be a huge job due to the API changes, but that's OK, it happens. And my code ended up being much smaller and cleaner as many features I had to try and emulate are now in the sample code.

My problem now is that I am getting very high CPU usage and very slow frame rates, the two being, of course, closely related. First, I am using the "max slice size" option, as I am trying to send RFC 3984 compliant packetisation mode zero RTP packets. Second, I am encoding a y4m file to minimise any possible interactions with cameras etc. Finally, I am using constant bit rate set to 2Mbps. My test code (effectively) takes a YUV420P frame from the file, feeds it through the codec, splits the output into separate RTP packets/NALUs by searching for the start codes, and finishes by throwing away the result. The build is using VS2012, 32 bit.
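For readers unfamiliar with the post-processing step described above, here is a minimal sketch of splitting an Annex B byte stream into NAL units by scanning for start codes, then flagging any unit that would not fit in a single RTP packet (which is what packetisation mode zero requires). It is written in Python for brevity and is not the poster's actual code; the function names and the 1400-byte payload limit are assumptions chosen for illustration.

```python
import re

# Annex B start code: 0x000001. A 4-byte start code is the same pattern
# preceded by one extra zero byte, so a single pattern finds both forms.
START_CODE = re.compile(rb"\x00\x00\x01")


def split_nal_units(stream: bytes) -> list:
    """Return the NAL units (without start codes) found in an Annex B stream."""
    matches = list(START_CODE.finditer(stream))
    units = []
    for i, match in enumerate(matches):
        begin = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(stream)
        # A NAL unit never ends in 0x00, so trailing zeros belong to the
        # next (4-byte) start code or to stream padding.
        units.append(stream[begin:end].rstrip(b"\x00"))
    return units


def oversized_units(units, max_payload=1400):
    """Indices of NAL units too large for one RTP packet.

    max_payload is a hypothetical budget (MTU minus IP/UDP/RTP headers).
    """
    return [i for i, unit in enumerate(units) if len(unit) > max_payload]
```

If `oversized_units` returns anything, the encoder's slice size limit is not being honoured for that frame, and mode zero packetisation is not possible without a different limit.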

A CIF sized image maxes out the single thread and yields 10fps. HD720 is about 4fps and HD1080 is about 2fps.

That is seriously non-linear for a start. The HD stuff, I can sort of understand eating CPU for breakfast, but really, CIF should be able to do 30fps, with time to spare. Even in one thread.
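To put a number on "non-linear", one rough approach is to convert the reported frame rates into macroblocks per second. The sketch below assumes encoding cost scales roughly with the number of 16x16 macroblocks, which is only an approximation, and uses the standard frame sizes for CIF, 720p and 1080p; the helper names are made up for this example.

```python
import math

# (width, height, reported fps) taken from the figures quoted in the post.
REPORTED = {
    "CIF": (352, 288, 10),
    "HD720": (1280, 720, 4),
    "HD1080": (1920, 1080, 2),
}


def macroblocks(width, height):
    """Number of 16x16 macroblocks, rounding partial blocks up."""
    return math.ceil(width / 16) * math.ceil(height / 16)


def macroblocks_per_second(width, height, fps):
    """Encoder throughput expressed in macroblocks per second."""
    return macroblocks(width, height) * fps


if __name__ == "__main__":
    for name, (w, h, fps) in REPORTED.items():
        print(name, macroblocks_per_second(w, h, fps))
```

Under those assumptions the reported figures work out to 3,960 macroblocks per second for CIF against 14,400 for HD720 and 16,320 for HD1080, so the small frame is being encoded at roughly a quarter of the throughput of the large ones.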

I have played with the num_slices and m_iThreads parameters, as well as resolutions and CBR bit rates, and nothing seems to make much difference.

Can anyone think of something I am doing wrong?

Oh yeah, this is on a relatively old i7, but I got 10 times this performance with my old code and IPP 5.3 two years ago.

Robert.

I had a very similar problem while migrating from IPP 6.1 to IPP 7.1. Performance decreased 4 times.

The solution was to manually set the number of threads inside IPP to 1:

ippSetNumThreads(1)

After this, performance became normal.

Pavel V.Vlasov (Intel) wrote:

Hi Robert,

"max slice size" mode doesn't support threading, so I recommend completely disabling OpenMP in the project properties. There are some performance problems with OpenMP when the number of threads is limited.

Have a nice day.

>>...Finally, I am using constant bit rate set to 2Mbps. My test code...
>>
>>...The build is using VS2012, 32 bit.

I understand that you're using the Microsoft C++ compiler. Could you post the command-line options for review?

Quote:

Roman T. wrote:

The solution was to manually set the number of threads inside IPP to 1:

ippSetNumThreads(1)

After this, performance became normal.

Unfortunately, this made no difference, thanks anyway.

Robert.

Quote:

Pavel V.Vlasov (Intel) wrote:

"max slice size" mode doesn't support threading, so I recommend completely disabling OpenMP in the project properties. There are some performance problems with OpenMP when the number of threads is limited.

Thank you for the suggestion, that certainly helps. CIF now gives me 30fps and uses 50% of one core. However, HD720 is still only 7.5fps and HD1080 is only up to 3fps. At least the ratios look a bit more linear. :-)

This, along with observations from Task Manager, implies that we are now doing the full encoding inside one thread. To get HD you will need to use multiple threads; the CPU is just not fast enough in raw clock speed. There must be a way to do both. Someone out there must be doing HD encoding to RTP, surely!

A further question: how does one turn off OpenMP when using the build.pl script? To test the above, I just did it manually in the VS2012 IDE.

Robert.

Quote:

Sergey Kostrov wrote:

>>...Finally, I am using constant bit rate set to 2Mbps. My test code...
>>
>>...The build is using VS2012, 32 bit.

I understand that you're using the Microsoft C++ compiler. Could you post the command-line options for review?

I use what is generated by the script supplied with the IPP samples: perl build.pl --cmake=audio-video-codecs,ia32,vc2012,d,mt,release

/GS /TP /analyze- /W3 /Zc:wchar_t /I"C:/Program Files (x86)/Intel/Composer XE 2013/ipp/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/codec/video/h264/enc/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/codec/video/common/cc/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/io/umc/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/core/umc/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/core/vm/include" /I"C:/Work/IPP-Codecs/ipp-samples/sources/audio-video-codecs/core/vm_plus/include" /Zi /Gm- /Od /Fd"C:/Work/IPP-Codecs/ipp-samples/__cmake/audio-video-codecs.ia32.vc2012.d.mt/__lib/debug/h264_enc.pdb" /fp:fast /D "WIN32" /D "_WINDOWS" /D "_DEBUG" /D "IA32" /D "WINDOWS" /D "_SBCS" /D "_WIN32" /D "_WIN32_WINNT=0x501" /D "CMAKE_INTDIR=\"debug\"" /errorReport:prompt /WX- /Zc:forScope /GR /Gd /Oy- /MDd /openmp- /Fa"debug" /EHsc /Fo"h264_enc.dir\debug\" /Fp"h264_enc.dir\debug\h264_enc.pch"

Robert.
Pavel V.Vlasov (Intel) wrote:

This, along with observations from Task Manager, implies that we are now doing the full encoding inside one thread. To get HD you will need to use multiple threads; the CPU is just not fast enough in raw clock speed. There must be a way to do both. Someone out there must be doing HD encoding to RTP, surely!

Our encoder's threading implementation relies mainly on slicing, and the slice size limiter implementation conflicts with it.

A further question: how does one turn off OpenMP when using the build.pl script? To test the above, I just did it manually in the VS2012 IDE.

perl build.pl --cmake=audio-video-codecs,ia32,vc2012,d,st,release - in the cmake script, OpenMP is tied to the st/mt key for threaded libs

Have a nice day.

Quote:

Pavel V.Vlasov (Intel) wrote:

This, along with observations from Task Manager, implies that we are now doing the full encoding inside one thread. To get HD you will need to use multiple threads; the CPU is just not fast enough in raw clock speed. There must be a way to do both. Someone out there must be doing HD encoding to RTP, surely!

Our encoder's threading implementation relies mainly on slicing, and the slice size limiter implementation conflicts with it.

So, what you appear to be telling me is, it is impossible to do RFC 3984 compliant packetisation, except for low resolution. That is extremely disappointing!

While the IPP sample code may not allow it, it should be possible to mix the two modes. I cannot think of any reason why you could not divide the video frame into mega-slices of X scan lines each, with each one producing a sequence of "real" NAL slices up to the max size limit. The mega-slices could be encoded in parallel. Sure, you might sometimes get a tiny slice of one macroblock at the end of a mega-slice, but that is a small price to pay for it to be possible.
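The mega-slice idea above can be illustrated with a toy model. This is only a sketch of the partitioning logic, not anything the IPP samples provide: the per-macroblock "coded sizes" stand in for real encoding, all names are hypothetical, and Python threads are used purely to show the structure (they would not give a real speed-up here).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_bands(mb_rows, bands):
    """Partition macroblock rows 0..mb_rows-1 into contiguous bands."""
    base, extra = divmod(mb_rows, bands)
    result, start = [], 0
    for i in range(bands):
        size = base + (1 if i < extra else 0)
        if size:
            result.append(range(start, start + size))
        start += size
    return result


def pack_band(mb_sizes, max_slice_bytes):
    """Greedily group consecutive macroblocks into slices under the limit.

    A single macroblock larger than the limit still gets its own slice.
    """
    slices, current, used = [], [], 0
    for size in mb_sizes:
        if current and used + size > max_slice_bytes:
            slices.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        slices.append(current)
    return slices


def encode_frame(mb_sizes_by_row, bands, max_slice_bytes):
    """Pack each band independently (in parallel) and concatenate in order."""
    def encode_band(rows):
        flat = [size for row in rows for size in mb_sizes_by_row[row]]
        return pack_band(flat, max_slice_bytes)

    band_rows = split_into_bands(len(mb_sizes_by_row), bands)
    with ThreadPoolExecutor(max_workers=bands) as pool:
        per_band = list(pool.map(encode_band, band_rows))
    return [s for band in per_band for s in band]
```

Because each band starts a fresh slice, the last slice of a band can be short, which is the "tiny slice" cost mentioned in the post.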

Quote:

Pavel V.Vlasov (Intel) wrote:

A further question: how does one turn off OpenMP when using the build.pl script? To test the above, I just did it manually in the VS2012 IDE.

perl build.pl --cmake=audio-video-codecs,ia32,vc2012,d,st,release - in the cmake script, OpenMP is tied to the st/mt key for threaded libs

I saw that, but assumed it would then link with the single-threaded run time, as in completely single threaded: no mutexes, nothing. I don't want that. If you are saying that it only disables OpenMP, then very good.

Robert.

>>...I use what is generated by the IPP samples supplied script: perl build.pl --cmake=audio-video-codecs,ia32,
>>vc2012,d,mt,release
>>...

You're using a Debug Configuration with all C++ compiler optimizations turned off (!).

Here is a consolidated set of compiler command line options related to Debug Configuration:

/Od - turns off all C++ compiler optimizations
...
/D "_DEBUG" - debug versions of CRT memory management functions ( slower ) will be used, like _malloc_dbg and _free_dbg
...
/D "CMAKE_INTDIR=\"debug\""
...
/MDd - debug versions of the run-time library will be used ( slower )
...
/openmp- - OpenMP is not used
...
/Fa"debug"
/Fo"h264_enc.dir\debug\"
/Fp"h264_enc.dir\debug\h264_enc.pch"
...

Apologies, I should have made it clear. Of course I have tried the release, fully optimised version. It makes some difference, maybe 10-20%, but I need a factor of 10!

Turning on OpenMP made it slower, especially on smaller images.

Here is the optimised command line, stripped of the /I arguments to make it smaller:

/GS /TP /analyze- /W3 /Zc:wchar_t /Gm- /O2 /fp:fast /D "WIN32" /D "_WINDOWS" /D "IA32" /D "WINDOWS" /D "_SBCS" /D "_WIN32" /D "_WIN32_WINNT=0x501" /D "CMAKE_INTDIR=\"release\"" /errorReport:prompt /WX- /Zc:forScope /GR /Gd /Oy- /MD /openmp- /Fa"release" /EHsc

Robert.
Pavel V.Vlasov (Intel) wrote:

So, what you appear to be telling me is, it is impossible to do RFC 3984 compliant packetisation, except for low resolution. That is extremely disappointing!

I can only recommend tuning the encoding parameters through the par file, such as iQuality, iRefFramesNum, bEntropyMode, iSearchX/Y and others. Decreasing the quality and the number of reference frames can give a significant speed-up.

Have a nice day.

>>>>...I use what is generated by the IPP samples supplied script: perl build.pl --cmake=audio-video-codecs,ia32,
>>>>vc2012,d,mt,release
>>>>...
>>
>>You're using a Debug Configuration with all C++ compiler optimizations turned off (!).

Let me know if you're interested in sets of Release Configuration command line options for Intel and Microsoft C++ compilers.

>>>>You're using a Debug Configuration with all C++ compiler optimizations turned off (!).
>>
>>Let me know if you're interested in sets of Release Configuration command line options for Intel and Microsoft C++ compilers.

Here is a set of compiler options for Intel C++ compiler:

[ Intel C++ compiler ]

/c
/nologo
/O3
/D "NDEBUG"
/Ot
/Oy
/GF
/MT
/fp:fast=2
/Wp64
/Qopenmp
/TP
/Zi
/W5
/Oi
/GS-
/Gr
/Qfp-speculation:fast
/Qopt-matmul
/Qparallel
/Qstd=c99
/Qstd=c++0x
/Qrestrict

Here is a set of compiler options for Microsoft C++ compiler:

[ Microsoft C++ compiler ]

/c
/nologo
/O2
/D "NDEBUG"
/Ot
/Oy
/GF
/MT
/fp:fast
/Wp64
/openmp
/TP
/Zi
/W4
/Gm
/EHsc
/errorReport:prompt

I tried all those compiler flags and a few more besides. No significant effect. The only compiler flag that made a difference was turning OFF OpenMP, which gave a five-fold speed-up. Now that is totally non-intuitive! Actually, it smells like a bug ...

Robert.

>>>>...My test code (effectively) takes a YUV420P frame from the file, feeds it through the codec, then splits it up into
>>>>separate RTP/NALU's by searching for the start codes etc, and finishes by throwing away the result...
>>
>>...Actually, it smells like a bug ...

I would suggest comparing execution times for different pieces of your test code using both versions of the IPP library, in order to identify exactly which block of code is responsible for the negative performance impact.
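One way to act on that suggestion is to wrap each stage of the test pipeline (read frame, encode, split into NALUs) in a timer and compare the totals between builds. The sketch below shows the idea in Python for brevity; the actual test code in this thread is C++, and the class and stage names here are invented for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Stages sorted by total time, slowest first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)
```

Usage would look like `with timer.stage("encode"): ...` around each call; running the same input through both library versions and comparing `timer.report()` shows which stage accounts for the slowdown.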

Quote:

Pavel V.Vlasov (Intel) wrote:

I can only recommend tuning the encoding parameters through the par file, such as iQuality, iRefFramesNum, bEntropyMode, iSearchX/Y and others. Decreasing the quality and the number of reference frames can give a significant speed-up.

As far as I can see, m_iQuality is not used by anything other than mpeg2. Perhaps you mean m_QualitySpeed? It was zero, which is the fastest. I tried several of those fields; some made no difference, some made small differences, but nothing like the order of magnitude needed.

Now here is a simple comparison: using x264, I get VGA at 30fps and use 7%-9% of total CPU on my 2.6GHz i7. Doing HD720 starts to hit limits, getting 26fps and around 30% of the CPU. On the same hardware, for VGA, IPP can't even get 5fps and uses 12% of the CPU, basically maxing out one core. This cannot be right.
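The gap in that comparison can be made concrete as frames per second per percent of total CPU. This is a back-of-the-envelope sketch using only the figures quoted above; taking the pessimistic 9% for x264, the optimistic 5fps for IPP, and 8 logical CPUs for the i7 are all assumptions.

```python
def fps_per_cpu_percent(fps, cpu_percent):
    """Frames per second delivered per percent of total CPU used."""
    return fps / cpu_percent


# VGA figures from the post: worst case for x264, best case for IPP.
x264_vga = fps_per_cpu_percent(30, 9)
ipp_vga = fps_per_cpu_percent(5, 12)

# How many times more efficient x264 is under these assumptions.
efficiency_ratio = x264_vga / ipp_vga

# Share of total CPU that one fully busy logical CPU represents,
# assuming 8 logical CPUs (4 cores with Hyper-Threading).
one_core_share = 100 / 8

if __name__ == "__main__":
    print(efficiency_ratio, one_core_share)
```

Even with the assumptions tilted in IPP's favour, x264 comes out about 8 times more efficient at VGA, and the 12% figure matches one logical CPU out of eight (12.5%) running flat out.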

Robert.

>>...Using x264, I get VGA at 30fps and use 7%-9% of total CPU on my 2.6GHz i7. Doing HD720 starts to hit limits getting 26fps and
>>around 30% of the CPU. On the same hardware, for VGA, IPP can't even get 5fps and uses 12% of the CPU...

You need to start looking at specific pieces of code. Since your test application could have hundreds or thousands of lines of code, such simple performance evaluations cannot help ( it is not a waste of time, but close to that ). Overall, every piece of code needs to be evaluated for both IPP versions and for both formats.

You appear to be telling me I have to debug the Intel proprietary code?

Right.

Wading through thousands of lines of unfamiliar, and I have to say, not very "pretty" code is not going to happen.

I will tell my boss that IPP is a bust and we should negotiate a license for x264.

Robert.

>>...I tried all those compiler flags and few more besides. No significant effect...

That looks very strange, because applications always run slower in Debug Configurations. Here is a very small example; take a look at these two tests:

[ Debug Configuration - /Od ]
...
Calculating...
... Pass 01 - Completed: 8.20400 secs
... Pass 02 - Completed: 6.96900 secs
... Pass 03 - Completed: 6.93700 secs
... Pass 04 - Completed: 6.93800 secs
... Pass 05 - Completed: 6.92200 secs
...

[ Release Configuration - /O3 ]
...
Calculating...
... Pass 01 - Completed: 2.82800 secs
... Pass 02 - Completed: 2.18800 secs
... Pass 03 - Completed: 2.17100 secs
... Pass 04 - Completed: 2.17200 secs
... Pass 05 - Completed: 2.18800 secs
...

As you can see, in the Release Configuration the tests execute ~3.2x faster, and the code was the same.

Quote:

Sergey Kostrov wrote:

>>...I tried all those compiler flags and few more besides. No significant effect...

That looks very strange, because applications always run slower in Debug Configurations. Here is a very small example; take a look at these two tests:

....

As you can see, in the Release Configuration the tests execute ~3.2x faster, and the code was the same.

I agree, the optimised version should be faster. So why isn't it?

As I think I mentioned earlier, I can see 8 threads started, each using 5% of the CPU. This means it is spending a very large amount of time spinning its wheels and not doing anything.

I really hoped someone would know what silly thing I might have done that causes a 10-fold decrease in speed.

Robert.
