Performance difference between static and dynamic linking

I am using the ippiResize_8u_C3R function. I did some testing with dynamic linking and measured the performance. When I tried static linking via the ipp*merged and ipp*emerged libraries, the performance was about 3 times slower. I stepped into the disassembly in both cases on a P4 processor. The dynamic version calls a function using SSE2 instructions and xmm registers, which is what I would expect. The static version calls a function named t7_ippiResize_8u_C3R which does not use SSE2 instructions. I have tried using ippStaticInitCpu to force selection of the CPU and stepping into the disassembly. It calls the px*, a6*, w7* and t7* versions of the function correctly, depending on CPU type; however, they all run slowly. In summary, it appears the static merged libraries contain different t7* code than the dynamic libraries, and it is not as well optimized. ippiResize_8u_C3R is the only function I have tried so far.


Hello,

It looks a bit strange; we use the same code for the dynamic and static libraries. The only difference is that the dynamic version of IPP uses OpenMP threading in some functions, but that does not seem to be the case here: Resize does not contain OpenMP threading. Could you also specify:
- what version of IPP, exactly which processor, and which OS did you use to test this function?
- what kind of interpolation did you use in the function?
- what image sizes did you test on?
- did you use aligned memory (allocated with IPP functions)?

Regards,
Vladimir

OK, here are some more details:
- IPP version 5.1 for Windows (installer was w_ipp_ia32_p_5.1.017.exe)
- Pentium 4 with H/T (ippGetCpuType returns ippCpuP4HT2)
- Windows XP Professional SP2
- Using ippiResize_8u_C3R with IPPI_INTER_CUBIC interpolation
- Image size was 1024 x 1024 x 24 bit, resizing to 600 x 600
- Using 16-byte-aligned memory allocated by my own functions
- Dynamic test version links ippcore.lib, ippi.lib and ipps.lib. Uses ippit7-5.1.dll.
- Static test version links ippcorel.lib, ippimerged.lib, ippiemerged.lib, ippsmerged.lib and ippsemerged.lib. I call ippStaticInit at the start.
- Performance difference is about 3 times slower for the static test

Hope this helps.
Richard

Are you sure you're linking to the correct versions of your functions? If you link with the *emerged libraries (i.e. use dynamic dispatching with static linking), do you still get such poor performance? I'm using that method on Windows and Linux, and I haven't seen a noticeable difference between that and dynamic linking. You could be inadvertently linking with the plain x86 functions instead of the MMX/SSE/etc. optimized functions.
Good luck,
Scott

Hello Richard.

could you please print out the output from the ippiGetLibVersion function for the static linking case? Code like this should be enough (please insert it after the ippStaticInit call):

const IppLibraryVersion* ippi = ippiGetLibVersion();

printf("Intel Integrated Performance Primitives\n");
printf("  version: %s, [%d.%d.%d.%d]\n",
    ippi->Version, ippi->major, ippi->minor, ippi->build, ippi->majorBuild);
printf("  name: %s\n", ippi->Name);
printf("  date: %s\n", ippi->BuildDate);

You can test ippi->Name string to see that static dispatcher chooses right cpu-specific code.

Regards,
Vladimir

OK. The output from the statically linked version is:

Intel Integrated Performance Primitives
version: v5.1, [5.1.217.80]
name: ippiw7l.lib
date: Mar 1 2006

Doing the same for the dynamically linked version (without ippStaticInit) gives:

Intel Integrated Performance Primitives
version: v5.1, [5.1.217.80]
name: ippiw7-5.1.dll
date: Feb 28 2006

I have copied the first lines of the disassembly when I call ippiResize_8u_C3R for each case; I might as well include this for info. The static case is:

_w7_ippiResize_8u_C3R@68:
004983B0 push ebp
004983B1 mov ebp,esp
004983B3 and esp,0FFFFFFC0h
004983B6 sub esp,80h
004983BC mov ecx,dword ptr [ebp+8]
004983BF fld qword ptr [ebp+38h]
004983C2 fld qword ptr [ebp+40h]
004983C5 mov edx,dword ptr [ebp+48h]
004983C8 cmp edx,8
004983CB je _w7_ippiResize_8u_C3R@68+0B6h (498466h)
004983D1 fld qword ptr ds:[5411A8h]
004983D7 mov eax,dword ptr [ebp+10h]
004983DA fxch st(2)
004983DC fstp qword ptr [esp+30h]
004983E0 fstp qword ptr [esp+38h]
004983E4 fst qword ptr [esp+40h]
004983E8 fstp qword ptr [esp+48h]
004983EC mov dword ptr [esp],ecx
004983EF mov ecx,dword ptr [ebp+0Ch]
...etc

The dynamic version calls :

00ACDDB8 push ebp
00ACDDB9 mov ebp,esp
00ACDDBB and esp,0FFFFFFC0h
00ACDDBE push edi
00ACDDBF sub esp,7Ch
00ACDDC2 movsd xmm3,mmword ptr [ebp+38h]
00ACDDC7 movsd xmm4,mmword ptr [ebp+40h]
00ACDDCC mov eax,dword ptr [ebp+48h]
00ACDDCF cmp eax,8
00ACDDD2 je 00ACDE6D
00ACDDD8 movsd xmm2,mmword ptr ds:[0D885C0h]
00ACDDE0 mov edx,dword ptr [ebp+0Ch]
00ACDDE3 mov ecx,dword ptr [ebp+10h]
00ACDDE6 mov edi,dword ptr [ebp+14h]
00ACDDE9 movsd xmm0,mmword ptr [ebp+18h]
00ACDDEE movsd xmm1,mmword ptr [ebp+20h]
00ACDDF3 movsd mmword ptr [esp+30h],xmm3
00ACDDF9 movsd mmword ptr [esp+38h],xmm4
00ACDDFF movsd mmword ptr [esp+40h],xmm2
00ACDE05 movsd mmword ptr [esp+48h],xmm2
...etc

Hi,
I know this is an old thread, but I didn't see any response to it. I am using IPP 5.1.1, and my application uses IIR and FFT functions from the Signal Processing library. Previously I was using dynamic (DLL) linking, but recently I figured out how to do static linking with the emerged and merged lib files. This is preferable since I don't have to ship so many different DLLs with my app.

But I found that there is a slight performance decrease when I use the static emerged/merged libs, about 10% slower than the DLL version.

I am testing on a Core Duo, and I verified through ippGetLibVersion() that in the static lib case, I am using the t7 version functions.

This was kind of disappointing, as I was hoping to use the merged static linkage.

Is there any difference in the performance of the static merged libs vs the DLL libs? Has this been fixed in particular versions?

As a side note, I noticed that at least for my program, which only uses IIR and FFT functions, there was no performance improvement going from the generic px version to t7 version. Is this expected?

Thanks,
Ching-Wei

Hello,

could you please specify exactly which functions you use? It would be nice if you could provide a simple test case which demonstrates the performance issue; we expect that the difference between the PX and T7 variants of these functions should be at least 2-4X for reasonable signal lengths.

Regards,
Vladimir

It hasn't changed much in 5.2.57

_w7_ippiResize_8u_C3R@68:
00: 55 push ebp
01: 8B EC..mov ebp,esp
03: 83 E4..and esp,0FFFFFFC0h
06: 81 EC..sub esp,80h
0C: 8B 4D..mov ecx,dword ptr [ebp+8]
0F: DD 45..fld qword ptr [ebp+38h]
12: DD 45..fld qword ptr [ebp+40h]
15: 8B 55..mov edx,dword ptr [ebp+48h]
18: 83 FA..cmp edx,8
1B: 0F 84..je 000000B2
21: 8B 45..mov eax,dword ptr [ebp+10h]
24: D9 EE..fldz
26: 89 0C..mov dword ptr [esp],ecx
29: 8B 4D..mov ecx,dword ptr [ebp+0Ch]
2C: 89 4C..mov dword ptr [esp+4],ecx
30: 89 44..mov dword ptr [esp+8],eax
34: 8B 45..mov eax,dword ptr [ebp+14h]
37: 89 44..mov dword ptr [esp+0Ch],eax
3B: 8B 4D..mov ecx,dword ptr [ebp+18h]
3E: 8D 44..lea eax,[esp+10h]
42: 89 08..mov dword ptr [eax],ecx
44: 8B 4D..mov ecx,dword ptr [ebp+1Ch]
47: 89 48..mov dword ptr [eax+4],ecx
4A: 8B 4D..mov ecx,dword ptr [ebp+20h]
4D: 89 48..mov dword ptr [eax+8],ecx
50: 8B 4D..mov ecx,dword ptr [ebp+24h]
53: 89 48..mov dword ptr [eax+0Ch],ecx
56: 8B 45..mov eax,dword ptr [ebp+28h]
59: D9 CA..fxch st(2)
5B: DD 5C..fstp qword ptr [esp+30h]
5F: DD 5C..fstp qword ptr [esp+38h]
63: DD 54..fst qword ptr [esp+40h]
67: DD 5C..fstp qword ptr [esp+48h]

which is nearly the same as _a6_'s and, at first look, the same as _px_'s. _t7_ and _v8_ are the same, too. x87 code, you know, not SSE2. Same with the _??_ippiResize_8u_C1R code. I can imagine there being more routines like that. Bummer, dude.

There is optimized code for the ippiResize function. Of course, it depends a lot on the parameters you use. So, what was your image size, did you try zooming or decimating, and what was the interpolation parameter? Did you call the ippStaticInit function at the beginning of your application (in the case of static linkage)?

Actually, with the IPP static libraries you can compare the performance of different processor-specific code by calling ippStaticInitCpu with the desired processor type as a parameter. I think you should see a performance difference between the PX and T7 code.

Please let us know if you still have a performance issue.

Regards,
Vladimir

Hi Vladimir,

Thanks for your response. Unfortunately I don't have time to give you a test case... But I can tell you I am using ippsIIR_64f and ippsIIR_64f_I (both with N=256), and for the FFT I am using ippsFFTFwd_RToCCS_32f with N=2048.

I'm not expecting too much support, since I don't have time to dive into this too far. I just wanted to get a quick sense of:

1) Why do I get a slight performance decrease when I use the static linking vs the DLL linking? Is this unexpected?

2) Why don't I see much difference between the px and t7 versions? This question may have more to do with my particular code: I may not be using IPP-aligned buffers, my N is only 256 in the IIR case, or there could be other bottlenecks, etc. But I'm just wondering...

Thanks!!!

-Ching-Wei

"Actually, with IPP static libraies you can compare performance of different processor specific code with call ippStaticInitCpu with desired processor type as a parameter. I think you should see performance difference between PX and T7 code."

I don't think so. Here's the ever-lovin' proof that px is a6 is w7 is (t7) is v8 (all static builds are using the px code, at least in the imaging functions I've looked at -- not many, but so far I'm batting .000).

ipp 5.2.57 static

_w7_ownpiDecimateSuper:
00: push ebp
01: mov ebp,esp
03: and esp,0FFFFFFC0h
:
90: fldz
92: fld qword ptr [ebp+38h]
95: fcomp st(1)
97: fnstsw ax
99: sahf
9A: jbe 00000995
A0: fld qword ptr [ebp+40h]
A3: fcomp st(1)
A5: fnstsw ax
A7: sahf
A8: jbe 00000995
AE: lea eax,[ecx+edx]
B1: cmp edi,eax
B3: jge 000000B9
B5: mov edx,edi
B7: sub edx,ecx
B9: mov eax,dword ptr [ebp+1Ch]
BC: lea edi,[eax+ebx]
BF: cmp esi,edi
C1: jge 000000C8
C3: mov ebx,esi
C5: sub ebx,dword ptr [ebp+1Ch]
C8: fld qword ptr [_2il0floatpacket.1]
CE: mov dword ptr [esp+78h],edx
D2: fild dword ptr [esp+78h]

and then there is

ipp 5.2.57 w7 dll ( 5,902,336 bytes : ippiw7_5.2.dll )

ownpiDecimateSuper:
E174: push ebp
E175: mov ebp,esp
E177: and esp,0FFFFFFC0h
:
E208: movsd xmm0,mmword ptr [ebp+38h]
E20D: pxor xmm3,xmm3
E211: comisd xmm0,xmm3
E215: jbe 1033EB46
E21B: movsd xmm0,mmword ptr [ebp+40h]
E220: comisd xmm0,xmm3
E224: jbe 1033EB46
E22A: mov eax,dword ptr [ebp+18h]
E22D: movsd xmm1,mmword ptr ds:[10523468h]
E235: mov dword ptr [esp+6Ch],edi
E239: mov edi,dword ptr [ebp+10h]
E23C: mov dword ptr [esp+70h],edx
E240: mov ecx,esi
E242: mov dword ptr [esp+74h],ebx
E246: mov ebx,dword ptr [ebp+20h]
E249: lea edx,[eax+ebx]
E24C: sub ecx,eax
E24E: cmp esi,edx
E250: cmovge ecx,ebx
E253: cvtsi2sd xmm0,ecx
E257: mulsd xmm0,mmword ptr [ebp+38h]
E25C: mov ebx,edi
E25E: addsd xmm0,xmm1
E262: mov edx,dword ptr [ebp+24h]
E265: mov esi,dword ptr [ebp+1Ch]
E268: lea eax,[esi+edx]
E26B: sub ebx,esi
E26D: cmp edi,eax
E26F: cmovge ebx,edx
E272: cvtsi2sd xmm2,ebx
E276: mulsd xmm2,mmword ptr [ebp+40h]

In other words, the static imaging library, at least the resize and perhaps more, was compiled for straight x87 math, no SSEx, for a6, w7, t7, and v8. All are the same code as the _px path. Denmark usually doesn't smell rotten, but . . .

Resize is in the critical path, so it running at half speed or worse makes quite a difference.

Hello,

I see two issues here

1. There was an error in IPP 5.1 which caused PX code to be used in the static version of the ippiResize function. This bug was fixed in IPP 5.2.

2. There are several execution branches inside ippiResize. Those branches are optimized in different ways, depending on how much we can gain from SSE vs x87. Supersampling interpolation is not optimized with SSE, but for the other interpolations you should be able to clearly see the difference between PX and T7 code. Did you try other interpolations?

Regards,
Vladimir

Why does the DLL use SSE2 code while the very same API in the static library uses x87? The two examples already given should be reason enough to check why this is going on. Super is already very slow, so having to always do it in x87 is not what IPP is all about. Is it? The w7 DLL uses SSE2 for Super. Why not the static library? (As well as any other APIs, for that matter.) How can this possibly be the plan? What gain is there in making the w7 DLL have better code than the w7 static lib? Yes, some decimates are SSE2 in the w7 static lib, but most are x87. But even if everything except resize/super were SSE2, that pair IS x87 in the static lib and SSE2 in the w7 DLL. Why? These are the very same APIs (never mind "branches" -- this is the same code path, one to the static lib and one to the DLL; the path is the same, the code is different). Why the code is different is the question. Or better: why not fix the static library to use SSE2 for those APIs where the DLL is already using SSE2 -- never mind why it isn't doing so now.

I know I've said the same thing several times.

Also, the problem this thread started with back in May 2006 is still there in 5.2.57, as I have already shown with the disasm. What I've done is carry on with that same report, because it seems it was not fixed: I have already shown that the original report is as valid today as it was in May 2006. OK, then, perhaps this thread will simply die like it did last year. As long as I know why it's slow I can deal with it. Many won't ever know, unless they read this particular topic.

Adios

Hi, Adios,

For the latest IPP 5.2 release, what performance difference do you find now?

Also, what arguments are used for ippiResize (roi, factors, etc.)? If you have test code that can show the performance problem, you can submit it to the premier support website (https://premier.intel.com).

Below are some performance results we measured for the function. Dynamic and static code have very similar performance.

function    data  chan  size     inter  factor     w7,static  w7,dynamic  s/d
ippiResize  8u    C1R   64x64    super  0.66 0.66  142        144         0.99
ippiResize  8u    C1R   256x256  super  0.66 0.66  138        140         0.99
ippiResize  8u    C1R   720x480  super  0.66 0.66  138        141         0.98
ippiResize  8u    C3R   64x64    super  0.66 0.66  86         86          1.00
ippiResize  8u    C3R   256x256  super  0.66 0.66  85         85          1.00
ippiResize  8u    C3R   720x480  super  0.66 0.66  85         85          1.00
ippiResize  8u    C4R   64x64    super  0.66 0.66  85         85          1.00
ippiResize  8u    C4R   256x256  super  0.66 0.66  83         84          0.99
ippiResize  8u    C4R   720x480  super  0.66 0.66  84         84          1.00
ippiResize  8u    AC4R  64x64    super  0.66 0.66  87         87          1.00
ippiResize  8u    AC4R  256x256  super  0.66 0.66  85         85          1.00
ippiResize  8u    AC4R  720x480  super  0.66 0.66  86         86          1.00
ippiResize  8u    P3R   64x64    super  0.66 0.66  142        144         0.99
ippiResize  8u    P3R   256x256  super  0.66 0.66  138        140         0.99
ippiResize  8u    P3R   720x480  super  0.66 0.66  139        141         0.99
ippiResize  8u    P4R

I can't use DLLs for projects. I cannot, and have not, claimed that the DLLs are faster (I would assume well-done SSE2 code is faster than x87 code, however). I have claimed, and proven beyond any doubt, that the DLL code uses SSE2 while the static library code uses x87, for resize/super, and I would venture many other routines.

Why Intel has released separate DLLs for different CPUs is a mystery, then, if, as you attempt to show with your table from nowhere, the performance is the same regardless of the generated code.

I will say it again. The w7 static library is using 100% x87 code for resize/super (at least those, but likely much more). The w7 DLL is using SSE2 for resize/super. I have SHOWN this to be the case. There is no denying it. If you want to throw out a table that claims x87 code runs the same as SSE2, it makes me wonder whether you have run the testing correctly, or are using data from some internal release. It is a simple matter to LOOK at the generated code. Is it that no one available understands disassembly? Find someone and let him consider this problem.

Hi,

I'm using IPP 5.2.057 and I experience exactly the same problem. When I link my app with the static libs it runs about 3 times slower than when it is linked with the dynamic libs. Yes, I'm calling ippStaticInit(). My configuration is Win 2003 SP2 32bit, Athlon64.

Is it planned for that problem to be fixed?

Smith

Hi Smith,

we are investigating the issue reported for the ippiResize function. Could you please provide the exact function name and parameters you use when you see the 3X performance difference between the dynamic and static versions of IPP?

Keep in mind that the IPP dynamic libraries incorporate internal threading in some functions, which is not the case for the static libraries, so you can see different performance on multi-core systems.

Regards,
Vladimir

Hi,

Thanks for the quick response. It turned out that the problem was on my side. One of my tests was running slowly (3x) because it was not calling ippStaticInit() :-(. I fixed it and now it is running only about 20% slower than with the dynamic libs. I cannot say exactly why it is running 20% slower because there are a lot of IPP calls (it's a video transcoding application). Maybe it is because of the threading; 20% is OK.

Smith

Thanks all for pointing out the issue; we are trying to investigate the reason for it, but unfortunately we were not able to reproduce it.

For your reference I've attached simple test case for ippiResize function and results we got on Intel Core 2 Duo system.

We would appreciate it if you could reproduce the performance issue with this simple test (in that case please provide us as many details as possible to help us identify the reason).

The attachment is a simple command-line application which calls the ippiResize function and prints out information about the IPP version and performance.

So, the performance measured for the DLL is:
PX - 240 processor clocks per output image pixel
W7 - 96 processor clocks per output image pixel

and for the IPP static libraries:
PX - 241 processor clocks per output image pixel
W7 - 96 processor clocks per output image pixel

Please see attached report for the more details.

Regards,
Vladimir

Attachment: resize.zip (49.22 KB)
