Windows vs. Linux performance

pvonkaenel

Hi,

We've started porting our video processing pipeline from Windows to Linux, and we're seeing that many of the Linux IPP routines are slower than their Windows versions. This is particularly true of the resizers, such as ippiResizeFilter_8u_C1R() with the ippResizeFilterLanczos filter option.

General question: is it expected that the Linux IPP routines perform the same as their Windows equivalents?

Note that I've performed the timing test on two identical HW platforms which have dual Xeon X5680 3.33GHz CPUs.

Thanks,

Peter

Sergey Kostrov

>>... I've performed the timing test on two identical HW platforms which have dual Xeon X5680 3.33GHz CPUs.

Could you post your data, please? What about a test-case? Thanks in advance.

pvonkaenel

Hi Sergey,

Thanks for responding. I can work on distilling an example from our system code, but before I start, do you expect there to be performance differences between the Linux and Windows versions of IPP routines?

Thanks,
Peter

iliyapolak

There should be some differences between those two OSs because of their different architectures and implementations.

pvonkaenel

How much of a difference would you expect? I've just finished running some loop timings on the Lanczos resizer I mentioned earlier: on Windows the loop runs in 3 minutes 36 seconds, and on Linux it takes 4 minutes 2 seconds. This seems like a big difference to me. Regardless of the OS, don't they both have the same asm instructions available? Do the asm implementations differ between Linux and Windows, or do they share the same low-level code?

Thanks,
Peter

Chuck De Sylva (Intel)

It would be great if we had a test case, as Sergey mentioned. It will help with debugging if there are any issues.

Sergey Kostrov

>>... I've just finished running some loop timings on the Lanczos resizer I mentioned earlier, and for Windows the loop runs in
>>3 minutes 36 seconds, and under Linux it's 4 minutes 2 seconds. This seems like a big difference to me...

I wouldn't expect absolutely identical numbers, and some of the difference in times could be caused by:

- different C/C++ compilers
- different optimization options selected for C/C++ compilers
- significant differences in OSs ( as Iliya mentioned )
- different workload of OSs ( services, network support, etc )

So, there are many things that affect performance on both platforms.

What compilers did you use?

I have lots of test cases, and tests compiled with the MS or Intel C/C++ compilers for Windows almost always outperform tests compiled with the MinGW C/C++ compiler for Windows. Many tests compiled with a legacy Borland C/C++ compiler ( 15+ year-old technology ) outperform all the modern C/C++ compilers (!) mentioned above.

In your case the test on Linux is ~10% slower than the test on Windows, and overall that matches my numbers.

iliyapolak

>>>How much of a difference would you expect. I've just finished running some loop timings on the Lanczos resizer I mentioned earlier, >>>
It is hard to say exactly how much of a difference you can expect. Such a difference could be described as a function of many variables which are tightly coupled to the Linux internal architecture.

pvonkaenel

Thanks for the feedback. I'm currently working on putting together an isolated sample which demonstrates what I'm seeing. In the meantime, I can answer some of the questions. Under Windows I'm using the Visual Studio 2010 C++ compiler with default release build optimization settings. Under Linux we're using gcc with -O2. However, I would not expect the compiler to affect the speed much since a majority of the work is being performed within the IPP calls which should resolve to hand coded asm (correct?).

In both test cases there are 24 logical cores available, and I've made sure to have a minimal load from other processes. Only core OS services should be running besides the test.

Peter

Sergey Kostrov

>>...I would not expect the compiler to affect the speed much since a majority of the work is being performed within the IPP calls which
>>should resolve to hand coded asm (correct?).

Yes, that is correct. Here are a couple more comments:

- Try to boost the priority of your application to 'High' or 'Real-Time' on both platforms in order to preempt as many as possible of the other processes and threads running at the same time. I always do this when measuring the performance of some piece(s) of code.

- Try to set ( force ) a process / thread affinity mask to one CPU

Note: For these two cases I could provide two small examples for Windows, but for Linux you'll need to work out how to do the same.

- Try to use the same -O2 optimization option with Visual Studio 2010

Sergey Kostrov

By the way, do you have the same BIOS settings on both computers? The most important settings are as follows:

- Intel Hyper-Threading Technology
- Intel TurboBoost Technology
- Intel SpeedStep Technology

iliyapolak

>>> Try to boost a priority of your application to 'High' or 'Real-Time' on both platforms in order to preempt as many as possible processes and threads existing / working at the same time. I always do this when measuring performance of some piece(s) of code.>>>

The exact implementation of the scheduler and dispatcher on Linux could differ from that of Windows. Moreover, kernel code activity (mostly interrupt-driven drivers) could also affect fine-grained time measurements on Linux.
So I think that you cannot directly compare those two OSs. Even a few dozen different asm instructions (compiled directly from the kernel source) that have no counterpart in Windows components could pollute the results of such a comparison.

Here you can see a detailed comparison between those two OSs: http://widefox.pbworks.com/w/page/8042290/Architecture
For example, if you look at the scheduler latency results you can see that the Linux scheduler has much lower latency than its Windows counterpart.

pvonkaenel

Thanks for all the input. At this point I think I need a better understanding of Linux to make sure I'm making a fair comparison, and then I'll try different system-level tricks on Linux to match my Windows performance. I do, however, think you've answered my main question: yes, we should expect to see IPP performance differences between Windows and Linux.

Thanks for the help,
Peter

Sergey Kostrov

Here are my notes:

>>...Linux scheduler has much lower latency than its Windows counterpart...

The scheduling engine for the Windows NT family of OSs was designed in the 1990s by one of the best experts in multi-processing, who came from the VAX world. That design was first introduced in the early versions of Windows NT ( 1.x, 2.x, 3.x, or so ).

Take a look at:
http://widefox.pbworks.com/w/page/8042322/Scheduler

>>Kernel Comparison: Linux (2.6.28) versus Windows (Vista SP1)
>>...
>>...
>>Timeslice - Multiprocessor
>>Scheduler - Multiprocessor (timeslice)   Linux          Windows
>>timeslice - range                        10ms-200ms     15ms-180ms (Client), 180ms (Server)
>>...
>>...
>>timeslice - default                      100ms          30ms, 60ms, 90ms (Client), 180ms (Server)
>>...
>>...
>>Performance
>>Scheduler (performance)                  Linux          Windows
>>scheduling latency (average)             0.009ms        2ms
>>scheduling latency (worst)               0.3ms          16ms
>>...

- The report favours Linux by default but I don't defend Windows

- The report compares the latest Linux kernel version with some older release(s) of Windows:

...
Compared Version: Linux Q1 2009 vs Windows Q1 2008
Initial Release: Q4 2008 vs Windows Q1 2007
Latest Release: Q3 2011 vs Windows Q1 2011
...

- Latencies are NOT necessarily deterministic.

One of the IDZ users ran my test case on his computer with one of the latest Intel CPUs and reported that a switch from thread A to thread B was completed in 38 nanoseconds.

Personally, I wouldn't worry about performance differences on different OSs if numbers for some test(s) differ by less than ~10%.

Sergey Kostrov

>>...One of the IDZ users ran my test case on his computer with one of the latest Intel CPUs and reported that a switch from thread A to thread B
>>was completed in 38 nanoseconds

Forum topic: Synchronizing Time Stamp Counter
Web-link: http://software.intel.com/en-us/forums/topic/332570

iliyapolak

>>>The scheduling engine for the Windows NT family of OSs was designed in the 1990s by one of the best experts in multi-processing, who came from the VAX world.>>>
Was it Dave Cutler?

>>>- The report compares the latest Linux kernel version with some older release(s) of Windows:>>>
You are right, I forgot to mention it in my post.

Chuck De Sylva (Intel)

Another thing you can do is wrap timer calls around just the IPP code to narrow it down in both the Linux and Windows cases.

Sergey Kostrov

>>>The scheduling engine for the Windows NT family of OSs was designed in the 1990s by one of the best experts in multi-processing,
>>who came from the VAX world.
>>
>>Was it Dave Cutler?

That is possible. I don't remember his name, but I remember reading about it in a book about the history of Microsoft ( possibly written by Bill Gates or somebody else from Microsoft ).

Sergey Kostrov

>>...Windows vs. Linux...

By the way, in the mid-1990s everybody was comparing Windows vs. OS/2 Warp. Does anybody remember it? Unfortunately, OS/2 Warp has not survived.

PS: Some time in June 1995, two months before the release of Windows 95, I had a very exciting dialog about the Windows and OS/2 OSs with a very experienced system software developer, and I will reproduce our conversation later...

pvonkaenel

Quote:

Chuck De Sylva (Intel) wrote:

Another thing you can do is wrap timer calls around just the IPP code to narrow it down in both the Linux and Windows cases.

That's how I tend to time IPP routines. I put a timer just around the call, run it in a long loop, and then compute a moving average. One thing I've noticed (at least under Windows) is that if you have a lot of other things happening between the IPP calls, you can get some IPP slowdown. It's as if the call needs to be warmed up and then kept active. I would guess this has to do with cache misses, but I've even seen this where the image content changes from call to call.

Peter

pvonkaenel

Quote:

Sergey Kostrov wrote:

>>...Windows vs. Linux...

By the way, in the mid-1990s everybody was comparing Windows vs. OS/2 Warp. Does anybody remember it? Unfortunately, OS/2 Warp has not survived.

PS: Some time in June 1995, two months before the release of Windows 95, I had a very exciting dialog about the Windows and OS/2 OSs with a very experienced system software developer, and I will reproduce our conversation later...

I really liked OS/2 (even back to version 1.3). While everyone else at work was using Windows for Workgroups or Solaris, I used OS/2 since I could use Windows apps and Hummingbird to access the SPARC stations. At the time there was a good book about the inner workings of OS/2, the name of which I forget.

Sergey Kostrov

>>...One thing I've noticed (at least under Windows), is that if you have a lot of other things happening in between the IPP call,
>>you can get some IPP slowdown...

That is why I recommended boosting the priority of the test process during tests. But be careful when an IPP function uses OpenMP threads, because they are executed at 'Normal' priority. If you boost the priority of your main thread to 'High' or 'Real-Time', then the IPP OpenMP threads will be preempted and your test will run slower. So, you need to disable OpenMP threads for IPP functions.

Here is an example of boosting the priority of a thread on a Windows platform:
...
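// Raise the current thread to time-critical priority for the measurement
::SetThreadPriority( ::GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );

uiTicksStart = SysGetTickCount();    // SysGetTickCount() is a helper not shown in this post
//...
// Some processing
//...
uiTicksEnd = SysGetTickCount();

// Restore normal priority once the timed section is done
::SetThreadPriority( ::GetCurrentThread(), THREAD_PRIORITY_NORMAL );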
...
