Significant slowdown of Intel IPP

Hi Experts:

    I think this is a very tough question. The code is listed below, where chunks = 94 and fftLen = 8192.

for (int i = 0; i < chunks; i++)
{
    // ...
    ippsAdd_32f(data2, data3 + i * fftLen, data2, fftLen);
}

This piece of code exists in two projects but behaves quite differently. In the first project it costs only about 0.2 ms, but in the second project it costs about 1 ms.

I tried changing the code in the second project to:

for (int i = 0; i < chunks; i++)
{
    // ...
    ippsAdd_32f(data2, data3, data2, fftLen);
}

The elapsed time of the second project then dropped from 1 ms to 0.2 ms.

I can understand that moving the data takes time. But I am confused about why everything is fine in the first project.
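
For reference, here is a minimal self-contained version of the two variants (the buffer setup is my simplification; the real projects do more work inside the loop, and timing code is omitted):

#include <ipps.h>

int main(void)
{
    const int chunks = 94, fftLen = 8192;
    Ipp32f* data2 = ippsMalloc_32f(fftLen);          /* 32 KB accumulator     */
    Ipp32f* data3 = ippsMalloc_32f(chunks * fftLen); /* ~3 MB of input chunks */
    ippsZero_32f(data2, fftLen);
    ippsSet_32f(1.0f, data3, chunks * fftLen);

    /* Variant 1 (slow in the second project): walks through all of data3 */
    for (int i = 0; i < chunks; i++)
        ippsAdd_32f(data2, data3 + i * fftLen, data2, fftLen);

    /* Variant 2 (fast): re-reads the same first chunk every time */
    for (int i = 0; i < chunks; i++)
        ippsAdd_32f(data2, data3, data2, fftLen);

    ippsFree(data3);
    ippsFree(data2);
    return 0;
}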

I appreciate your expert view on that.

Best Regards,

Sun Cao

My guess is that the order of calculations and the data flow are different in these two projects. The performance of Add (or any other function) depends heavily on data locality: L0, MLC, or LLC. So I think that in the first case data3 is closer to L0 than in the second. When you remove the travelling through data3, you have all the data in L0 starting from the 2nd iteration, and therefore ideal performance.

Use the following numbers for a rough estimate: load latency for data in L0 (32 KB) is 4-5 clocks; MLC (256 KB), 10-12 clocks; LLC (2 MB per core), 25-36 clocks; and an LLC miss costs about 200 clocks.
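
For the sizes in the original post (assuming 4-byte Ipp32f), a rough footprint check:

data2: 8192 * 4 bytes = 32 KB, fits in L0
data3: 94 * 8192 * 4 bytes ≈ 2.9 MB, bigger than a 2 MB LLC slice

So the offset variant streams almost 3 MB of cold data on every pass, while the fixed-pointer variant reuses the same 32 KB from the 2nd iteration on.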

regards, Igor

Hi Igor:

    I can understand that data locality affects performance. But that does not explain why the first project is fine.

Best Regards,

Sun Cao

>>...I can understand that data locality affects performance. But that does not explain why the first project is fine.

Here are a couple of pieces of advice:

1. Check the alignment of your data
2. Check the project settings
3. You are not taking into account the time spent calculating the offset:
...
[ 1 ms ]   ippsAdd_32f(data2, data3 + i * fftLen, data2, fftLen);
...
[ 0.2 ms ] ippsAdd_32f(data2, data3, data2, fftLen);
...
Also, try to declare a local fftLen variable as close as possible to the call to the ippsAdd_32f function, or use the constant 8192 instead (if fftLen does not change).
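
For example, here is a sketch of points 1 and 3 combined: allocate with ippsMalloc_32f, which returns aligned memory, and advance a pointer instead of recomputing the offset each iteration (names follow your post; error handling omitted):

Ipp32f* data2 = ippsMalloc_32f(fftLen);          /* aligned allocation */
Ipp32f* data3 = ippsMalloc_32f(chunks * fftLen);
/* ... fill data2 and data3 ... */

const Ipp32f* pChunk = data3;
for (int i = 0; i < chunks; i++)
{
    /* ... */
    ippsAdd_32f(data2, pChunk, data2, fftLen); /* no i*fftLen multiply  */
    pChunk += fftLen;                          /* advance to next chunk */
}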

Hi Sun Cao,

as I've already said above, I guess that these two projects are different and have a different order of calculations and different data flows, so some other data evicts the data3 vector from the cache. There is not enough information to give you another answer. The only way to analyze this more deeply is for you to provide a reproducer that shows the two performances with the 5x difference. Which IPP library do you use: threaded or not?
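
If you are not sure which ipps library you link against, the version string will tell you. A minimal sketch:

#include <stdio.h>
#include <ipps.h>

int main(void)
{
    /* Name and Version identify the linked ipps library build */
    const IppLibraryVersion* v = ippsGetLibVersion();
    printf("%s %s\n", v->Name, v->Version);
    return 0;
}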

regards, Igor

>>...The only way to analyze this more deeply is for you to provide a reproducer that shows the two performances with the 5x difference...

Sun Cao,

Do you have VTune Amplifier XE? If yes, run a Hotspots analysis, or another one, to understand what could be wrong.
