I had an interesting finding when I played with the following OpenMP code:
    !$OMP PARALLEL SHARED(vals, N2, N1g, dF, F) PRIVATE(value, ind_offset, l0)
    !$OMP DO
    do k = ks, ke
      do ii = iis, iie
        value = vals(ii)
        do j = js, je
          ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
          do i = is, ie                               ! line 08
            l0 = ind_offset + i
            dF(l0) = dF(l0) + value * F(l0 + ii)      ! line 10
          end do
        end do
      end do
    end do
    !$OMP END DO
    !$OMP END PARALLEL
I got different profiling results (wallclock time) for this four-level loop nest when I compiled the code (ifort 13.1.0) with different compiler options:
    Compiler options    Wallclock time (seconds)    Notes
    -O2                 12.43                       serial run
    -O2 -openmp         8.04                        OMP_NUM_THREADS=1
I could not understand why compiling with OpenMP and running on only one thread would change the performance this much, since the code should still execute serially. So I checked the vectorization report and the generated assembly:
    Compiler options    Vectorization report (line 08)                        Assembly checking (line 10)
    -O2                 Not vectorized. Existence of vector dependence        mulsd, addsd
    -O2 -openmp         LOOP WAS VECTORIZED                                   mulpd, addpd
Without '-openmp', I need to add !DIR$ SIMD before the inner loop (line 08) to get it vectorized. From the vectorization report and the assembly, it seems to me that '-openmp' makes vectorization easier (no !DIR$ SIMD needed), so packed vector instructions (mulpd, addpd) are used instead of scalar ones (mulsd, addsd) and the performance improves.
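For reference, here is a minimal sketch of the serial variant with the directive placed directly before the inner loop. The array sizes, loop bounds, and fill values (N3, nii, and the constants) are placeholders chosen only so the program runs on its own; they are not taken from the real application. Compilers other than ifort will likely treat the !DIR$ line as an ordinary comment.

```fortran
program simd_sketch
  implicit none
  ! Hypothetical sizes and bounds, chosen only so the sketch is self-contained.
  integer, parameter :: N1g = 64, N2 = 8, N3 = 4, nii = 3
  integer :: i, j, k, ii, l0, ind_offset
  double precision :: value
  double precision :: vals(nii)
  double precision :: dF(N1g*N2*N3)
  double precision :: F(N1g*N2*N3 + nii)   ! padded so F(l0 + ii) stays in bounds

  ! Placeholder data.
  vals = 1.0d0
  F    = 2.0d0
  dF   = 0.0d0

  do k = 1, N3
    do ii = 1, nii
      value = vals(ii)
      do j = 1, N2
        ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
        ! Ask the Intel compiler to vectorize the next loop even though it
        ! cannot prove on its own that there is no vector dependence.
        !DIR$ SIMD
        do i = 1, N1g
          l0 = ind_offset + i
          dF(l0) = dF(l0) + value * F(l0 + ii)
        end do
      end do
    end do
  end do

  print *, 'sum(dF) =', sum(dF)
end program simd_sketch
```

Note that !DIR$ SIMD only asserts that vectorizing is safe; it does not make the compiler check it, so it is only appropriate when dF and F really do not overlap.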
Is this a correct explanation, and is it something I should expect? Are there any other optimizations enabled by the '-openmp' option?
Many thanks for your time and help.