I'm learning OpenMP and the Lattice Boltzmann Method (LBM). I modified the example LBM program 'F90-bgk' slightly and profiled it with VTune Amplifier. The results showed that the 'computeFeq', 'collide', 'stream', and 'computeMacros' subroutines account for most of the CPU time, so I tried to parallelize them with OpenMP. However, the parallel code performs even worse than the serial one. The comparison is as follows:
The serial version:
>ifort -g unsteady.f90
> time ./a.out
7233900.95342814 cells per second
The parallel version:
>ifort -g -openmp unsteady.f90 -o a-mp
> time ./a-mp
1155580.41337423 cells per second
I profiled the parallel program with VTune Amplifier, and the results show that a significant amount of time was spent on waiting and synchronization. I have no idea why the threads take so much time to start and clone.
My OS is openSUSE 12.3, the ifort version is 14.0.0, and the profiler is VTune(TM) Amplifier XE 2013 Update 13. Any help would be appreciated. The source code and the Amplifier project are attached.