Hello Forum members,
I run the same multithreaded (OpenMP) Fortran program on two machines: an 'old' Windows machine (Intel Visual Fortran Compiler XE 12.1.0.233 [Intel(R) 64] on Windows 7) and a 'new' Red Hat Linux server (Fortran composer_xe_2013.1.117 on Red Hat Server 6.3).
With 1 thread, the Linux build is faster, as expected since it runs on a newer and faster machine. However, as I increase the thread count to about 20, the Windows machine executes the code faster than the Linux machine. My guess is that I made a mistake somewhere when calling the compiler under Linux, because the Linux box should be faster at every thread count.
Here are some more details about these puzzling results. Run times are in seconds (W is the Windows box, L is the Linux box):
=============================
iter = 1 (parallelized section gets executed only once)
=============================
1 Thread: W = 292, L = 202 (Linux beats Windows, as expected)
2 Threads: W = 242, L = 152 (Linux beats Windows, as expected)
3 Threads: W = 208, L = 132 (Linux beats Windows, as expected)
4 Threads: W = 196, L = 123 (Linux beats Windows, as expected)
10 Threads: W = 109, L = 91 (Linux beats Windows, as expected)
20 Threads: W = 82, L = 80 (Linux only barely ahead; why?)
=============================
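For reference, the iter = 1 timings above can be converted into speedup and parallel efficiency relative to the 1-thread run. The following is a minimal Python sketch, separate from the Fortran program itself. The function names `scaling` and `serial_fraction` are illustrative, and the serial-fraction estimate uses the Karp-Flatt metric, which is an added diagnostic and not something measured in the runs.

```python
# Run times in seconds, copied from the iter = 1 table above.
threads = [1, 2, 3, 4, 10, 20]
windows = [292, 242, 208, 196, 109, 82]
linux = [202, 152, 132, 123, 91, 80]


def scaling(times, thread_counts):
    """Return (threads, speedup, efficiency) relative to the first (1-thread) run."""
    base = times[0]
    return [(p, base / t, base / t / p) for p, t in zip(thread_counts, times)]


def serial_fraction(speedup, p):
    """Karp-Flatt estimate of the serial fraction; only defined for p > 1."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


for name, times in (("Windows", windows), ("Linux", linux)):
    for p, speedup, efficiency in scaling(times, threads):
        line = f"{name} {p:2d} threads: speedup {speedup:.2f}, efficiency {efficiency:.0%}"
        if p > 1:
            line += f", serial fraction ~ {serial_fraction(speedup, p):.2f}"
        print(line)
```

At 20 threads this gives a speedup of about 3.6 on Windows and about 2.5 on Linux relative to one thread, so neither machine scales close to linearly, and the Linux build scales worse.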
With more iterations, the parallelized section is executed more often. The results then are:
iter = 2
=============================
4 Threads: W = 324, L = 225 (Linux beats Windows, as expected)
10 Threads: W = 202, L = 166 (Linux beats Windows, as expected)
20 Threads: W = 138, L = 143 (why is Windows faster?)
=============================
iter = 3
=============================
4 Threads: W = 471, L = 332 (Linux beats Windows, as expected)
10 Threads: W = 294, L = 246 (Linux beats Windows, as expected)
20 Threads: W = 192, L = 205 (why is Windows faster?)
=============================
iter = 5
=============================
10 Threads: W = 473, L = 395 (Linux beats Windows, as expected)
20 Threads: W = 304, L = 350 (why is Windows faster?)
40 Threads: W = N/A, L = 307
60 Threads: W = N/A, L = 310
=============================
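The 20-thread rows of the four tables above can be summarized with a straight-line fit of run time against iter. This is a minimal Python sketch under the assumption that run time is roughly a fixed cost plus a constant cost per iteration, which has not been checked against the code. The helper name `fit_line` is illustrative.

```python
# 20-thread run times in seconds, copied from the iter = 1, 2, 3, 5 tables above.
iters = [1, 2, 3, 5]
windows_20 = [82, 138, 192, 304]
linux_20 = [80, 143, 205, 350]


def fit_line(x, y):
    """Ordinary least-squares fit y ~ intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


for name, times in (("Windows", windows_20), ("Linux", linux_20)):
    fixed, per_iter = fit_line(iters, times)
    print(f"{name}: fixed ~ {fixed:.1f} s, per iteration ~ {per_iter:.1f} s")
```

With these numbers the fit gives roughly 55 s per iteration on Windows and roughly 68 s per iteration on Linux at 20 threads, which suggests that the gap comes from the parallelized section and not from the part that runs once.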
Here are the exact details about how I compile the code on the two machines.
[A] Windows compiler: Intel(R) Visual Fortran Compiler XE 12.1.0.233 [Intel(R) 64]
==============================================================================
Intel Xeon CPU X5680 @ 3.33 GHz (24 GB RAM), 24 hardware threads in total
Compiler command line (from within Visual Studio):
/nologo /O3 /QxHost /Qopt-prefetch=3 /Qipo /recursive /Qopenmp /warn:none /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /check:none /libs:static /threads /c
Linker command line (from within Visual Studio):
/OUT:"x64\Release\Console1.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"C:\AAPapers\JungChambersTran\ObamaII\Fortran\Health_l_Theta0\Console1\Console1\x64\Release\Console1.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /STACK:2100000000,2100000000 /IMPLIB:"C:\AAPapers\JungChambersTran\ObamaII\Fortran\Health_l_Theta0\Console1\Console1\x64\Release\Console1.lib"
[B] Linux compiler: composer_xe_2013.1.117 (run on Red Hat Server 6.3)
==========================================================
Intel Xeon CPU E5-4620 @ 2.20 GHz (128 GB RAM), 64 hardware threads in total
Call sequence from command line:
source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
ifort -nologo -O3 -xhost -opt-prefetch=3 -ipo -check none -openmp -openmp-link=static -threads para_module.f90 grid.f90 sortAlgorithms_module.f90 f_medIncome_module.f90 f_solver_module.f90 f_steadystate_module.f90 main.f90 -o myprog.out
ulimit -s unlimited
chmod +x myprog.out
KMP_STACKSIZE=400m ./myprog.out
Does anybody know why the more powerful Linux machine is slower than a 2-year-old Windows box once the thread count gets close to 20? Which compiler options do I need to set to make the Linux version run faster?
Thanks a lot.
Juergen



