-openmp results in easier vectorization?

I had an interesting finding when I played with the following OpenMP code:

!$OMP PARALLEL SHARED(vals, N2, N1g, dF, F) PRIVATE(value, ind_offset, l0)
        do k = ks, ke
          do ii = iis, iie
            value = vals(ii)
            do j = js, je
              ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
              do i = is, ie
                l0 = ind_offset + i
                dF(l0) = dF(l0) + value * F(l0 + ii)
              end do
            end do
          end do
        end do

I got different profiling results (wallclock time) for this 4-level loop piece when I compiled (ifort 13.1.0) the code with different compiler options:

Compiler options     Wallclock time (seconds)     Notes
-O2                  12.43                        serial run
-O2 -openmp          8.04                         OMP_NUM_THREADS=1

I could not understand why running the OpenMP build with only one thread would change the performance this much, since it should still execute serially. So I checked the vectorization report and the assembly file:

Compiler options     Vectorization report (line 08)                     Assembly check (line 10)
-O2                  Not vectorized. Existence of vector dependence     mulsd, addsd
-O2 -openmp          LOOP WAS VECTORIZED                                mulpd, addpd

Without '-openmp', I need to add !DIR$ SIMD to vectorize the inner loop (line 08). From the vectorization report and the assembly, it seems to me that '-openmp' makes vectorization easier (no !DIR$ SIMD needed): packed vector instructions (mulpd, addpd) were used instead of scalar ones (mulsd, addsd), and the performance improved.
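
For reference, the !DIR$ SIMD workaround mentioned above can be sketched as a small stand-alone program. This is only a minimal illustration, not the original code: the array size n, the fixed offset ii, and the initial values are made-up assumptions, and the inner loop is reduced to a single level. The directive is Intel-specific; other compilers treat it as a plain comment.

```fortran
program simd_sketch
  implicit none
  integer, parameter :: n  = 1000      ! hypothetical problem size (assumption)
  integer, parameter :: ii = 3         ! hypothetical fixed offset (assumption)
  double precision :: dF(n), F(n + ii)
  double precision :: value
  integer :: i

  ! Arbitrary initial data, chosen only so the program runs.
  F     = 1.0d0
  dF    = 0.0d0
  value = 2.0d0

  ! Ask the Intel compiler to vectorize the loop that follows,
  ! overriding its own "vector dependence" assessment.
  ! The programmer takes responsibility for the loop being safe.
  !DIR$ SIMD
  do i = 1, n
    dF(i) = dF(i) + value * F(i + ii)
  end do

  print *, dF(1), dF(n)
end program simd_sketch
```

Because !DIR$ SIMD overrides the compiler's dependence analysis rather than assisting it, it is only appropriate when dF and F are known not to overlap in a way that makes iterations depend on each other.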

Is this a correct explanation, and is it something I should expect? Are there any other optimizations behind the '-openmp' option?

Many thanks for your time and help.

Best regards,

The non-OpenMP compilation may have become tangled up in fancier multi-level loop optimizations, such as loop interchange to push the k iterations into lower-level loops.

