Puzzling openMP speed differences between new Linux and old Windows

Hello Forum members,

I run identical multi-threaded (OpenMP) Fortran programs on an 'old' Windows machine using Intel Visual Fortran Compiler XE 12.1.0.233 [Intel(R) 64] on Windows 7, and on a 'new' Red Hat Linux server using Fortran Composer XE 2013.1.117 (run on Red Hat Server 6.3).

When I run my code with only 1 thread, the Linux build is faster (as expected, since it is a newer and faster machine). However, as I increase the thread count toward 20, the Windows machine executes the code faster than the Linux machine. My guess is that I made a mistake somewhere when invoking the compiler under Linux; the Linux box should be faster at every thread count.

Here are some more details about these puzzling results. Run time in seconds (W is the Windows box, L is the Linux box):
=============================
iter = 1 (parallelized section gets executed only once)
=============================
1 Threads: W = 292, L = 202 (Linux beats windows, as expected)
2 Threads: W = 242, L = 152 (Linux beats windows, as expected)
3 Threads: W = 208, L = 132 (Linux beats windows, as expected)
4 Threads: W = 196, L = 123 (Linux beats windows, as expected)
10 Threads: W = 109, L = 91 (Linux beats windows, as expected)
20 Threads: W = 82, L = 80 (why is Windows faster?)
=============================

With more iterations, the parallelized section gets executed more often. So now we have:
iter = 2
=============================
4 Threads: W = 324, L = 225 (Linux beats windows, as expected)
10 Threads: W = 202, L = 166 (Linux beats windows, as expected)
20 Threads: W = 138, L = 143 (Why??)
=============================
iter = 3
=============================
4 Threads: W = 471, L = 332 (Linux beats windows, as expected)
10 Threads: W = 294, L = 246 (Linux beats windows, as expected)
20 Threads: W = 192, L = 205 (why??)
=============================
iter = 5
=============================
10 Threads: W = 473, L = 395 (Linux beats windows, as expected)
20 Threads: W = 304, L = 350 (why??)
40 Threads: W = N/A, L = 307
60 Threads: W = N/A, L = 310
=============================

Here are the exact details about how I compile the code on the two machines.

[A] Windows compiler: Intel(R) Visual Fortran Compiler XE 12.1.0.233 [Intel(R) 64]
==============================================================================
Intel Xeon CPU X5680 @ 3.33 GHz (24 GB RAM), 24 threads total

Compiler command line (from within Visual Studio):
/nologo /O3 /QxHost /Qopt-prefetch=3 /Qipo /recursive /Qopenmp /warn:none /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /check:none /libs:static /threads /c

Linker command line (from within Visual Studio):
/OUT:"x64\Release\Console1.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"C:\AAPapers\JungChambersTran\ObamaII\Fortran\Health_l_Theta0\Console1\Console1\x64\Release\Console1.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /STACK:2100000000,2100000000 /IMPLIB:"C:\AAPapers\JungChambersTran\ObamaII\Fortran\Health_l_Theta0\Console1\Console1\x64\Release\Console1.lib"

[B] Linux compiler: Composer XE 2013.1.117 (run on Red Hat Server 6.3)
==========================================================
Intel Xeon CPU E5-4620 @ 2.20 GHz (128 GB RAM), 64 threads total

Call sequence from command line:

source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64

ifort -nologo -O3 -xhost -opt-prefetch=3 -ipo -check none -openmp -openmp-link=static -threads para_module.f90 grid.f90 sortAlgorithms_module.f90 f_medIncome_module.f90 f_solver_module.f90 f_steadystate_module.f90 main.f90 -o myprog.out

ulimit -s unlimited
chmod +x myprog.out
KMP_STACKSIZE=400m ./myprog.out

Does anybody know why the more powerful Linux machine is slower than a 2-year-old Windows box when the thread count gets close to 20? Which compiler options do I need to set to make the Linux version run faster?

Thanks a lot.

Juergen

If you don't set KMP_AFFINITY, you are relying entirely on the scheduler to place the threads. I suspect the Windows scheduler would have trouble as well on a 4-socket machine. You should try a setting which distributes the threads evenly across the sockets and cores.
If you don't want to use hooks into the Intel OpenMP runtime, taskset might do the job (it has no close equivalent on Windows).
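A minimal sketch of both suggestions, assuming the Intel OpenMP runtime; the binary name ./myprog.out is taken from the original post, and the actual run lines are left commented:

```shell
# Ask the Intel OpenMP runtime to spread threads across sockets;
# "verbose" prints the resulting binding at startup so the placement
# can be confirmed.
export KMP_AFFINITY="verbose,granularity=fine,scatter"

# Alternative without Intel-specific hooks: pin the whole process to a
# fixed CPU set with taskset.
# taskset -c 0-19 ./myprog.out    # uncomment on the real machine
echo "KMP_AFFINITY=$KMP_AFFINITY"
```

The verbose output from the runtime is the quickest way to see whether threads actually landed where intended.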

Hi Juergen,

>>...Does anybody know as to why the more powerful Linux machine is slower than a 2 year old Windows box when the thread count
>>goes close to 20?

I want to be as neutral as possible in that Everlasting Dispute over which task scheduler is the best.

If you can easily reproduce these numbers, and Windows outperforms Linux every time the thread count exceeds 20, it demonstrates that either the task scheduler on Windows does a better job than the one on Linux, or the OpenMP implementation on Windows is better than the one on Linux, or something else is hurting performance on Linux and you need to find out what. I know that many readers of my post who are fans of Linux could say the opposite simply because they Love Linux.

We recently had a couple of discussions related to that subject (performance / task scheduler); please take a look at:

Forum topic: Windows vs. Linux performance
Web-link: http://software.intel.com/en-us/forums/topic/341938

Forum topic: Synchronizing Time Stamp Counter
Web-link: http://software.intel.com/en-us/forums/topic/332570

I wouldn't worry about performance differences if the numbers for some tests differ by less than ~10%. Such variation is inevitable, and sometimes an old computer or an old compiler can still show what it is capable of.

Another hint:
...
KMP_STACKSIZE=400m -> 400 megabytes
...
Did you set the stack size to 400m in both cases? Do you really need a stack of that size?
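For comparability, a sketch of keeping the thread stacks consistent on the Linux side (the 20m value is an assumption to be tuned, and the commented lines come from the original run script):

```shell
# KMP_STACKSIZE controls only the OpenMP worker-thread stacks; the main
# thread's stack comes from ulimit on Linux (or the /STACK linker
# option on Windows). Keep both machines consistent before comparing.
export KMP_STACKSIZE=20m
# ulimit -s unlimited              # as in the original run script
# ./myprog.out                     # uncomment on the real machine
echo "KMP_STACKSIZE=$KMP_STACKSIZE"
```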

>>...Does anybody know as to why the more powerful Linux machine is slower than a 2 year old Windows box when the thread count
>>goes close to 20?
As Sergey and I stated, there is a very large dependency on the internal architecture of the particular OS. Bear in mind also that kernel-mode overhead has to be added to the equation. Some drivers in Linux run in user mode, so they spawn their own threads, which also have to be scheduled. So you cannot get your answer quickly by performing only a few multithreading-related tests. Even with a single machine under test you cannot predict the OS and hardware activity exactly, and such unpredictable activity can shift your results up or down.

I agree with Sergey that excessive KMP_STACKSIZE would be an issue as you increase the number of threads. I have never seen an application benefit from more than KMP_STACKSIZE=40m. I didn't catch that you might not be using KMP_STACKSIZE consistently between your tests.

Thanks guys.
I did reduce the stack size to 20m; you were right, 400m was excessive. However, there were no speed differences, and the tables above are still accurate. I next started playing with the KMP_AFFINITY environment variable. So far, no speed improvement on the Linux machine. Any tips on how to set KMP_AFFINITY to make this faster?
This new Linux box must be faster than the two-year-old Windows box, especially as the thread count increases; that was the whole reason for investing in this new server. Thanks guys. J.

KMP_AFFINITY=scatter would spread the threads as widely as possible, presumably balancing the work across CPUs.
KMP_AFFINITY=compact would pack the threads into the minimum number of cores and CPUs, using HyperThreading if it is enabled.
KMP_AFFINITY=compact,1,1 would space the threads out 1 per core, using contiguous cores.

Depending on the nature of your test, any of those might prove effective, provided that any other jobs present are pinned to a non-conflicting set of cores.
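The three bindings above can be compared with a quick shell sweep; ./myprog.out is the binary name from the original post, and the run line is left commented since only the real machine can execute it:

```shell
# Try each suggested binding in turn and time the run.
for aff in scatter compact compact,1,1; do
    export KMP_AFFINITY="$aff"
    echo "testing KMP_AFFINITY=$aff"
    # time ./myprog.out           # uncomment on the real machine
done
```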

>>...This new Linux box must be faster than the two year old Windows box, especially when the thread count increases...

I suggest you spend some time implementing a new, as-simple-as-possible test case (for Windows and Linux) that creates different numbers of threads, does some simple processing, and does not use OpenMP (!).

As I already mentioned, the '...OpenMP implementation on Windows...' could be '...better than on Linux...', and you need to take that into account. So try to remove the vendor-specific software components currently used in your existing test case.

If there's interest in testing the ability of libiomp5 to deal with such situations on a 4-socket server on Linux, there are plenty of alternate OpenMP libraries freely available for trial, including libgomp. I don't know how you can say the OpenMP implementation on Windows is better if you don't test under similar conditions and don't attempt to isolate the effects of the OS scheduler. There is no doubt that the current Windows scheduler is greatly improved over earlier ones, and it may even succeed in spreading the threads out across Westmere cores. Whether the Linux scheduler is stumbling over HyperThreading across 4 sockets could be checked simply by disabling HT by means other than setting affinity, if you have the privilege to do so.

>>...The question of whether the linux scheduler is stumbling over HyperThreading across 4 sockets...
I think the actual code being executed should also be accounted for in the performance penalties when HT is involved.
In the case of a heavily loaded thread with floating-point data and instructions that also contain interdependencies, HT may not be very helpful in achieving instruction-level parallelism.

>>...I run identical multi threaded (openMP) fortran programs on an 'old' windows machine using the Visual Fortran Compiler XE 12.1.0.233
>>[Intel(R) 64] run on Windows 7 and a 'new' red hat linux server using Fortran Composer_xe_2013.1.117 (run on Red Hat Server 6.3)...

I read the initial post again and I see that two different versions of Fortran compiler are used, that is 12.1.0.233 vs. 2013.1.117.

juejung,

You need to try identical versions of the Fortran compiler.

Gentlemen,

The difference in the speed of my OpenMP code at 20 threads was due to private allocatable vectors inside the parallel loop. For some reason the old 12.1.0.233 compiler with -O3 under Windows handles this better than Fortran Composer XE 2013.1.117 with the -O1, -O2, or -O3 flags under Linux. After I replaced the allocatable vectors with alternate code not requiring allocatable arrays, the speed differences disappeared. I'm not sure whether this large speed difference with allocatable arrays inside an OpenMP loop is due to the Linux OS or to the newer compiler version. In any case, thanks for all your comments. J.
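For anyone hitting the same wall, here is a minimal sketch of the pattern described above (the actual program is not shown in the thread, so the array sizes and loop body are invented for illustration). The slow variant allocates a private vector on every iteration inside the parallel loop, so every thread hits the heap allocator repeatedly; the fast variant hoists the allocation so each thread allocates once:

```fortran
program alloc_demo
   implicit none
   integer :: i, n
   real(8), allocatable :: work(:)
   real(8) :: total
   n = 2000
   total = 0.0d0

   ! Slow pattern: allocate/deallocate per iteration inside the region.
   ! The heap allocator can serialize the threads as the count grows.
   !$omp parallel do private(work) reduction(+:total)
   do i = 1, n
      allocate(work(n))
      work = 1.0d0
      total = total + sum(work)
      deallocate(work)
   end do
   !$omp end parallel do

   ! Faster pattern: allocate once per thread, outside the loop.
   !$omp parallel private(work)
   allocate(work(n))
   !$omp do reduction(+:total)
   do i = 1, n
      work = 1.0d0
      total = total + sum(work)
   end do
   !$omp end do
   deallocate(work)
   !$omp end parallel

   print *, 'total =', total
end program alloc_demo
```

Timing the two variants separately at 20 threads should reproduce (or rule out) the allocator contention on a given compiler and OS.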

Thanks for the update!

>>...difference with allocatable arrays inside an openMP loop is due to the linux OS or due to the newer compiler version...

Since you have not changed the version of Linux, I think those two are the most likely reasons.