I am attempting to speed up a large program. I have identified the hot spots by profiling, but when I parallelize the key loops with OpenMP, those loops run only about 30% faster, rather than the ideal factor of 8.
I created a smaller test program to figure out what is happening. This is the part containing the OpenMP loops (the complete program is attached).
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO PRIVATE(iRadius)
DO iRadius = 1, nRadii
    Dradius(iRadius)=DiffCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3)) ! diffusion coefficient at iRadius
    sRadius(iRadius)=SedCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3)) ! if compression, this will account for it
END DO
!$OMP END DO
!$OMP END PARALLEL

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO PRIVATE(iRadius)
L_R2_Parallel: &
DO iRadius = 1, nRadii
    Z(iRadius)=ZCalc(iRadius, iSpecies)
    G(iRadius, 1)=Dradius(iRadius-1)*dt*A1(iRadius, 1)+B(iRadius, 1)-sRadius(iRadius-1)*omSqRun*dt*A2(iRadius, 1)
    G(iRadius, 2)=Dradius(iRadius)*dt*A1(iRadius, 2)+B(iRadius, 2)-sRadius(iRadius)*omSqRun*dt*A2(iRadius, 2)
    G(iRadius, 3)=Dradius(iRadius+1)*dt*A1(iRadius, 3)+B(iRadius, 3)-sRadius(iRadius+1)*omSqRun*dt*A2(iRadius, 3)
END DO L_R2_Parallel
nThreads=omp_get_num_threads()
!$OMP END PARALLEL
VTune shows that a large amount of time is spent in __kmp_fork_barrier and _kmpc_barrier. I have attached the VTune summary; it also shows a large "spin time", which is mostly the sum of these two. I don't understand why any significant time is spent at barriers, since an even division of the workload in each loop should result in all threads finishing at the same time. Task Manager shows 100% CPU usage while the program is running, as expected.
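One way to check the assumption that the workload is divided evenly is to time each thread's share of a loop with the end-of-loop barrier removed. The following is a minimal, self-contained sketch under stated assumptions, not the attached test program: the program name `barrier_timing`, the array `work`, the loop length `n`, and the `sqrt` loop body are hypothetical stand-ins for the `DiffCoeff`/`SedCoeff` calls. Only the `omp_lib` routines (`omp_get_wtime`, `omp_get_thread_num`, `omp_get_num_threads`, `omp_get_max_threads`) are standard OpenMP.

```fortran
program barrier_timing
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000            ! hypothetical loop length
   double precision, allocatable :: work(:)     ! hypothetical stand-in for Dradius/sRadius
   double precision, allocatable :: tLoop(:)    ! time each thread spends in its share of the loop
   double precision :: t0
   integer :: i, me, nThreads

   allocate(work(n))
   allocate(tLoop(0:omp_get_max_threads()-1))
   tLoop = 0.0d0
   nThreads = 1

   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i, me, t0)
   me = omp_get_thread_num()

   ! SINGLE ends with an implicit barrier, so all threads start timing together.
   !$OMP SINGLE
   nThreads = omp_get_num_threads()
   !$OMP END SINGLE

   t0 = omp_get_wtime()
   !$OMP DO SCHEDULE(STATIC)
   do i = 1, n
      work(i) = sqrt(dble(i))                   ! assumed stand-in for the real per-radius work
   end do
   !$OMP END DO NOWAIT
   ! NOWAIT skips the loop's implicit barrier, so this excludes time spent waiting.
   tLoop(me) = omp_get_wtime() - t0
   !$OMP END PARALLEL

   do i = 0, nThreads - 1
      print '(A,I0,A,ES12.4,A)', 'thread ', i, ': ', tLoop(i), ' s in loop'
   end do
end program barrier_timing
```

If the per-thread times come out nearly equal, load imbalance inside the loop is unlikely to account for the barrier time, and the time spent between parallel regions would be the next thing to examine.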
Compiled under Visual Studio as x64 release build, with option /Qopenmp.
3.4 GHz Haswell (8 logical CPUs), Windows 7 64-bit, Visual Studio 2017 15.6.7, Intel XE2018 update 1.