IPP H264 performance

peter@moonlight.com.ru

Hi!

I've evaluated the H264 routines from the IPP 4.0 trial on a PC and found that performance doesn't increase compared to my C code. Since I used the merged libs and called the w7_ and a6_ routines directly, as well as px_, this basically means that all three contain equivalent code.
So, my questions:
1. I'm sure that optimized H264 routines will be available very soon for all platforms. Could you give me any hint when?
2. Will these routines be available for regular XScale and WMMX?
3. Is there any way to participate in pre-release code testing?

Thanks in advance!

Peter

Ying Song (Intel)

Hi, Peter,


To get general H.264 performance figures, you can run the Intel IPP performance benchmark tool "perfsys", located in the directory ipp40\tools\perfsys. Run ps_ippvc.exe to get H.264 performance data on your target system.

We will consider H.264 support for Intel XScale and WMMX in future releases as well. You may periodically check our web site at http://www.intel.com/software/products/ipp for updates.

If you are interested in participating in the pre-release test, please submit a request under Intel IPP products via Intel Premier Support.

Thanks,
Ying S
Intel Corp.

peter@moonlight.com.ru

Thanks! I'll put in a request.

marc_ba

Hello,


If you do run this test, it would be great if you could share the results here when you have time.


Thanks a lot


Marc


peter@moonlight.com.ru

I've run the tests. Nothing new. See my results in the attachment. You can find similar results in tools/perfsys/data. For instance, look at the worst-case horizontal quarter-pixel interpolation for luma, and at the regular 8x8 IDCT for comparison.

ps_ippvcpx.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,35,px,0.719
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,60,e,4.84
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,54,e,4.34
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,41,e,3.32
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,55,e,4.45
-----------------------

ps_ippvca6.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,11,px,0.236
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.45
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,53,e,4.25
-----------------------

ps_ippvcw7.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,10,px,0.214
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,4.89
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,4.12
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.43
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,4.25
-----------------------

ps_ippvct7.csv:
CPU,Intel Pentium 4 Processor HT 1x2128 MHz, L1=8/12K, L2=1024K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,12,px,0.365
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,25,e,3.03
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,41,e,5.03
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,5.08
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,41,e,5.03
-----------------------

my box:
CPU,Intel Pentium 4 Processor HT 2x2594 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,10,px,0.256
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,6.04
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,5.13
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,4.24
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,5.2
-----------------------

To my mind, these results are unambiguous.
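
The comparison above can also be checked mechanically. The following Python snippet is only an illustrative sketch, not part of the perfsys tool. It copies the px and w7 rows quoted in this post (both from the 3192 MHz machine) and assumes that the last CSV field is a time-like figure where lower is better. The thread does not document the column meanings, so that is an assumption, and the `parse` helper and its key layout are hypothetical.

```python
import csv
import io

# Rows copied from the ps_ippvcpx.csv and ps_ippvcw7.csv excerpts in this post.
PX = """\
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,35,px,0.719
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,60,e,4.84
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,54,e,4.34
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,41,e,3.32
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,55,e,4.45
"""

W7 = """\
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,10,px,0.214
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,4.89
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,4.12
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.43
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,4.25
"""


def parse(text):
    """Map (function name, parameter fields) to the last numeric column.

    Assumption: fields 1..8 identify the test case and the last field is
    the measured figure (lower is better). The nLps field is left out of
    the key because it differs between the px and w7 files.
    """
    result = {}
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].startswith("ippi"):
            continue  # skip the CPU header, "..." and separator lines
        key = (row[0], tuple(row[1:9]))
        result[key] = float(row[-1])
    return result


px = parse(PX)
w7 = parse(W7)

# Ratio px / w7 per test case: well above 1 means w7 is faster,
# close to 1 means both builds perform about the same.
ratios = {key: px[key] / w7[key] for key in px if key in w7}

for (name, params), ratio in ratios.items():
    print(f"{name} {','.join(params)}: px/w7 = {ratio:.2f}")
```

Under that assumption, the ratio is roughly 3.4 for ippiDCTInv_8x8 and stays within about 6% of 1.0 for the four ippiInterpolateLuma_H264 rows, which is the pattern described above.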

vladimir-dudnik (Intel)

You can see that the performance of the ippiDCT function has improved on the latest architectures. This is because the function was tightly hand-optimized at the assembly level. Yes, you are right, ippiInterpolateLuma_H264 does not show a performance gain. This is because the function was initially optimized in C code; we are now working on optimizing it at the assembly level. You will see improved performance in the next version of the libraries.


Regards,
Vladimir


Deleted user

Did you improve the performance of ippiInterpolateLuma_H264 in your Ver. 4.1 release? I tried the Ver. 4.1 lib on a P4 and did not see any improvement.

Thanks.

peter@moonlight.com.ru

I've tried the 4.1 beta and can confirm that luma and chroma interpolation, as well as deblocking, were MMX Ext and SSE2 optimized. Dequant was not MMX enhanced. I haven't tried the 4.1 release for x86 yet, but the 4.1 release for XScale doesn't differ much from the 4.1 beta for XScale. At least I haven't noticed any differences in the H264 part.

vladimir-dudnik (Intel)

Hi,

You are right, this function has 15 different branches inside. Each branch has its own special conditions and was optimized separately. We are still working on some of the branches and hope to improve this function in the future.

Regards,
Vladimir
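
One plausible reading of the "15 different branches" is that they correspond to the 15 fractional quarter-pel positions in H.264 luma interpolation: the 4x4 grid of (dx, dy) offsets minus the integer position (0, 0). The thread does not confirm this mapping, so the sketch below is only an illustration of that counting. The `classify` helper is hypothetical and says nothing about how the library is actually structured.

```python
# Illustration only: enumerate the quarter-pel offsets (dx, dy), each in 0..3.
# Mapping these offsets to the library's internal branches is an assumption.


def classify(dx: int, dy: int) -> str:
    """Classify a quarter-pel offset on the H.264 luma sample grid."""
    if dx == 0 and dy == 0:
        return "integer"  # full-pel sample, no filtering needed
    if dx % 2 == 0 and dy % 2 == 0:
        return "half"     # half-pel sample produced by the 6-tap filter
    return "quarter"      # average of two neighbouring full/half-pel samples


offsets = [(dx, dy) for dy in range(4) for dx in range(4)]
fractional = [o for o in offsets if classify(*o) != "integer"]

counts = {}
for offset in fractional:
    kind = classify(*offset)
    counts[kind] = counts.get(kind, 0) + 1

print(len(fractional), counts)
```

Counted this way there are 3 half-pel and 12 quarter-pel positions, 15 in total, each of which needs a different combination of filtering and averaging.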
