This is also and OpenMP question.
I'm trying to parallelise this algorithm for the MIC. It does cubic spline interpolation of SourceNum*Ni number of vectors of length Nj. So the interpolation is done along the j direction. There are Ni of these vectors - all independent. Due to a compromise of data layout elsewhere in the program I have to arrange it so that i is the least significant index, not j. This makes the algorithm a bit more complicated.
Basically the k loop and the i loop are both embarrassingly parallel. The middle j loop has a flow dependence and hence is serial. The k loop has a trip count of 3, j - ~180, and i - ~500. What I want to do is have nested parallelism. There's enough threads on Phi that I reckon it'll be worth it. I'm reluctant to implement it as nested parallel do's just because there's the overhead of starting those inner parallel loops.
What I tried already was to make the k loop a parallel do, and to have the j loop spawn tasks of chunks of the i loop. This resulted in horrible bugs and race conditions, however, that I could not figure out. Likely due to my ignorance of how tasks work. Could someone take a fresh look at it and give me some suggestions of how to parallelise it with OMP, please?
NB Evolve_q, TimeSteps, Src, and ddSrc are static global vars inside the containing module.
subroutine InitSourceInterpolation_MIC use omp_lib, only: omp_get_wtime integer i,j,k,Ni,Nj real(dl) :: t0 real(dl), allocatable :: u(:,:) Ni = TimeSteps%npoints Nj = Evolve_q%npoints allocate(u(Ni,Nj)) t0 = omp_get_wtime() do k=1,SourceNum ! Set the upper and lower boundary condtions to be natural (0) do i=1,Ni ddSrc(i,1,k) = 0._dl ddSrc(i,Nj,k) = 0._dl u(i,1) = 0._dl u(i,Nj) = 0._dl end do ! do the decomposition loop of the tridiagonal algorithm for all the systems ! of equations do j=2,Nj-1 do i=1,Ni call spline_j(Evolve_q%points(j-1), Evolve_q%points(j), & Evolve_q%points(j+1), & Src(i,j-1,k), Src(i,j,k), Src(i,j+1,k), & u(i,j-1), ddSrc(i,j-1,k), & u(i,j), ddSrc(i,j,k) ) end do end do ! do the backsubstitution loop of the tridiagonal algorithm do j=Nj-1, 1, -1 do i=1,Ni ddSrc(i,j,k) = ddSrc(i,j,k)*ddSrc(i,j+1,k) + u(i,j) end do end do end do ! clean up deallocate(u) contains pure subroutine spline_j( xMin1, x, xAdd1, & yMin1, y, yAdd1, & uMin1, y2Min1, & u, y2 ) real(dl), intent(in) :: xMin1, x, xAdd1 real(dl), intent(in) :: yMin1, y, yAdd1 real(dl), intent(in) :: uMin1, y2Min1 real(dl), intent(out) :: u, y2 real(dl) :: sig, p, pp sig = (x-xMin1) / (xAdd1-xMin1) p = sig * y2Min1 + 2 pp = (yAdd1-y)/(xAdd1-x) - (y-yMin1)/(x-xMin1) y2 = (sig-1._dl)/p u = ( 6._dl * pp / (xAdd1-xMin1) - sig*uMin1)/p end subroutine spline_j end subroutine InitSourceInterpolation_MIC