Hello everyone,

I am writing code for an n-body simulation that I would like to run on a Xeon Phi. The subroutine I am working on calculates the energy of the system via a pairwise summation. The O(n²) algorithm looks like this: it is simply a double sum over all particle pairs, where np is the number of particles. If two particles are closer than the cutoff distance (dr2 < rcut2), their pair energy is added.

subroutine nearest_int
  implicit none
  double precision :: dx,dy,dz
  double precision :: x1,y1,z1
  double precision :: x2,y2,z2
  double precision :: dr2,dr2i,dr6i,dr12i
  integer :: i,j
  integer :: T1,T2,clock_rate,clock_max

  potential = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
!dir$ offload begin target(mic:0) in(position)
!$omp parallel do schedule(dynamic) reduction(+:potential) default(private) &
!$omp& shared(position,rcut2,np)
  do i = 1,np
    x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
!dir$ simd reduction(+:potential)
    do j = i+1,np
      x2 = position(j)%x; y2 = position(j)%y; z2 = position(j)%z
      dx = x2-x1
      dy = y2-y1
      dz = z2-z1
      dr2 = dx*dx + dy*dy + dz*dz
      if(dr2.lt.rcut2)then
        dr2i = 1.0d0/dr2
        dr6i = dr2i*dr2i*dr2i
        dr12i = dr6i*dr6i
        potential = potential + 4.0d0*(dr12i-dr6i)
      endif
    enddo
  enddo
!$omp end parallel do
!dir$ end offload
  call system_clock(T2,clock_rate,clock_max)
  print*,'elapsed time nint:',real(T2-T1)/real(clock_rate),potential
end subroutine nearest_int

Here, position is an array of structures, as follows:

type atom
  double precision :: x,y,z
end type atom
type(atom), allocatable :: position(:)

Now when I run my code using the O(n²) algorithm, the vectorization intensity is 6.51, which is good given that gather/scatter operations are being applied. Screenshots of the VTune summary are attached below as n2-1.png and n2-2.png. Since O(n²) scales poorly to larger systems, an O(n) algorithm is preferred. To do this, we store in an array the particles that are close (within 1.2*rcut, to be exact) and how many neighbors each particle has. The O(n²) algorithm then transforms into the following, which uses the neighbor-list indices to access the position array.

subroutine nearest_int
  implicit none
  double precision :: dx,dy,dz
  double precision :: x1,y1,z1
  double precision :: x2,y2,z2
  double precision :: dr2,dr2i,dr6i,dr12i
  integer :: i,j
  integer :: T1,T2,clock_rate,clock_max
  integer :: neigh

  potential = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
!dir$ offload begin target(mic:0) in(position,vlistl,numneigh)
!$omp parallel do schedule(dynamic) reduction(+:potential) default(private) &
!$omp& shared(position,neigh_alloc,vlistl,numneigh,rcut2,np)
  do i = 1,np
    x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
!dir$ simd reduction(+:potential)
    do j = 1,numneigh(i)
      neigh = vlistl(j + neigh_alloc*(i-1))
      x2 = position(neigh)%x; y2 = position(neigh)%y; z2 = position(neigh)%z
      dx = x2-x1
      dy = y2-y1
      dz = z2-z1
      dr2 = dx*dx + dy*dy + dz*dz
      if(dr2.lt.rcut2)then
        dr2i = 1.0d0/dr2
        dr6i = dr2i*dr2i*dr2i
        dr12i = dr6i*dr6i
        potential = potential + 4.0d0*(dr12i-dr6i)
      endif
    enddo
  enddo
!$omp end parallel do
!dir$ end offload
  call system_clock(T2,clock_rate,clock_max)
  print*,'elapsed time nint:',real(T2-T1)/real(clock_rate),potential
end subroutine nearest_int

In my code I allocated vlistl as follows:

neigh_alloc = 500
allocate(vlistl(neigh_alloc*np))
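One idea I have been considering but have not fully tested (the scratch arrays xbuf/ybuf/zbuf are hypothetical, not in my current code): packing each particle's neighbor coordinates into contiguous buffers before the SIMD loop, so the inner reduction reads unit-stride data instead of gathering through vlistl on every iteration. Roughly:

```fortran
! Sketch only: gather the neighbor coordinates once per particle i
! into unit-stride buffers, then run the SIMD reduction over
! contiguous data. xbuf/ybuf/zbuf would be thread-private scratch
! arrays of length neigh_alloc.
do i = 1,np
  x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
  do j = 1,numneigh(i)
    neigh   = vlistl(j + neigh_alloc*(i-1))
    xbuf(j) = position(neigh)%x
    ybuf(j) = position(neigh)%y
    zbuf(j) = position(neigh)%z
  enddo
!dir$ simd reduction(+:potential)
  do j = 1,numneigh(i)
    dx  = xbuf(j)-x1
    dy  = ybuf(j)-y1
    dz  = zbuf(j)-z1
    dr2 = dx*dx + dy*dy + dz*dz
    if(dr2.lt.rcut2)then
      dr2i  = 1.0d0/dr2
      dr6i  = dr2i*dr2i*dr2i
      dr12i = dr6i*dr6i
      potential = potential + 4.0d0*(dr12i-dr6i)
    endif
  enddo
enddo
```

This separates the irregular indexed loads from the arithmetic, at the cost of an extra pass over the neighbor data.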

Now, although -vec-report6 tells me that the inner loop is indeed vectorized, I get a vectorization intensity of zero and horrible performance (it barely beats serial, which would make sense if the vectorization intensity really is zero). The screenshots from the VTune analysis are given below in n-1.png and n-2.png. Here are my questions:

1. Why am I getting zero vectorization intensity inside a loop the compiler reports as vectorized, and what can I do to improve my performance here?

2. If I can get this code to work well on a MIC, I expect latency issues at the line neigh = vlistl(j + neigh_alloc*(i-1)) in the O(n) algorithm, and I would like to prefetch there. I know the gather can bring in up to 8 pieces of data (I'm working in DP), each potentially on a different cache line. Can someone tell me the appropriate way to prefetch here to help mask the latency? I have fiddled with placing the following loop immediately after that line, but it didn't change the performance:

do k = 0, 15
  call mm_prefetch(position(vlistl(j + neigh_alloc*(i-1) + k + 8))%x, 1)
enddo

3. I compiled with the -align array64byte flag, so I believe all arrays should be aligned on 64-byte boundaries. Does this mean that the arrays are also padded to a multiple of the cache-line size? If not, how would I do this?
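For context, this is my guess at what manual padding would look like if the flag does not do it (a sketch only; I have not verified that this is necessary or that the rounding is the right approach):

```fortran
! vlistl holds default (4-byte) integers, so one 64-byte cache line
! holds 16 of them. Rounding the per-particle stride up to a
! multiple of 16 would make each particle's neighbor row start on
! a cache-line boundary, given the array itself is 64-byte aligned.
neigh_alloc = ((500 + 15)/16)*16   ! rounds 500 up to 512
allocate(vlistl(neigh_alloc*np))
```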

The simd directive does get the loop to vectorize. I sort my particles so that particles that are close in space are also close in memory. I know the AOS structure isn't as good as SOA for vectorization; however, I was still getting good performance using AOS in the O(n²) algorithm. Also, I know this is the data structure Intel has implemented in various software packages employing this algorithm (LAMMPS, for example, although that is in C++ and not Fortran). The full code compiles with the command below (modules are attached). The subroutine of interest is the sole subroutine in module mod_force.f90. The numneigh and vlistl arrays are created in subroutine build_neighbor_n2 in mod_neighbor.f90 and are allocated in subroutine init_list in mod_neighbor.f90. All arrays are defined as globals in the module global.f90.
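For reference, the SOA layout I would compare against would be something like this (hypothetical names, not what my code currently uses):

```fortran
! Hypothetical SOA version of the position data: three separate
! unit-stride arrays instead of an array of atom structures, so
! loops over consecutive particle indices get unit-stride loads
! rather than stride-3 accesses through position(i)%x.
double precision, allocatable :: pos_x(:), pos_y(:), pos_z(:)
allocate(pos_x(np), pos_y(np), pos_z(np))
```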

ifort -align array64byte -openmp global.f90 get_started.f90 mod_init_posit.f90 mod_neighbor.f90 mod_force.f90 MD.f90 -O3 -o new.out

Sorry for the long question/description. But I wanted to give a decent account of what I have tried already. Any help is appreciated.