How to declare variables local to MIC

Hello, I would like to run a simulation using both the host and the MIC. In a recent question on this forum I received a lot of help from the Intel team (thank you, Intel), and I was shown that an SoA data layout would be most advantageous for me. The issue is that the arrays in this structure need to be allocatable in my code, so I can't transfer them from the host to the MIC. On the host, my algorithm uses an AoS type containing only doubles and integers, so it is bitwise copyable and should be transferable to the MIC. I would therefore like to copy the AoS array to the MIC and then move the AoS data into the SoA variable there. I also have another data type that is constant, but whose values are determined at run time; I would like to have this variable on both the host and the MIC. However, I have no idea how to declare variables solely on the MIC. Also, one of the SoA variables needs to be deallocated and reallocated on the MIC if a condition is met. In the code below, I have written my AoS array as a normal array for simplicity.

1. Read in a variable from file (np). Allocate my AoS variable on the host. Allocate the SoA variable on the MIC.

2. Read in a variable from file (cellT). Calculate list on the host and the MIC.

3. Transfer the AoS data to the MIC; if necessary, deallocate and reallocate tr%start and tr%fin.

I have written the following code showing the procedure, with ample comments to hopefully make it clear. All of the action is in the subroutine offload.

program vec
  !--- all global variables are in modcell_en below
  use ifport
  use modcell_en
  implicit none

  !--- convention array. This would be my AOS in my actual code
  double precision,allocatable :: rnew(:,:)
  integer,allocatable :: start(:), fin(:)

  !--- bs variables
  double precision :: energy
  double precision :: x1,y1,z1,x2,y2,z2,dr2
  integer :: i,j,k,l,count,np,cellT
  integer :: c1s,c1e,c2s,c2e
  
  !--- in my real code, this variable is read in from a file
  !--- represents the total number of atoms in a simulation
  np = 60000
  allocate(rnew(3,np))

  !--- in my real code, this is also read in from file
  cellT = 1000
  list%ncell = cellT
  allocate(start(cellT),fin(cellT))

  !--- allocate arrays on MIC
  call offload(0)

  call srand(10)
  !--- don't care about this. this is just initializing random positions
  do i = 1,np
     rnew(1,i) = 10*rand(); rnew(2,i) = 10*rand(); rnew(3,i) = 10*rand()
  end do

  !--- don't care about this either. just assigning pointers to rnew and part
  count = 0
  do i = 1,cellT
     start(i) = count*60+1
     fin(i)   = (count+1)*60

     count = count + 1
  enddo
  
  
  call get_energy(energy)

 
  stop
end program vec

 

module modcell_en

  !--- global structure of array position variable
  !--- this variable will only be used on the MIC
  !--- Hence, I would like to declare and allocate it on the MIC
  !--- data must be passed to the mic however, to populate this data
  type r
     double precision,allocatable :: x(:),y(:),z(:)
  end type r

  !--- global structure of array variable
  type(r) :: part


  !--- This variable will also only be used on the MIC
  !--- This variable may need to be deallocated and reallocated during the simulation
  !--- information must be passed to the MIC to populate this variable
  type tr
     integer,allocatable ::start(:),fin(:)
  end type tr

  !--- number of cells in simulation
  !--- this variable will be used both on host and MIC BUT IS CONSTANT
  !--- would like to transfer once and have it stay there forever
  type list
     integer :: ncell
  end type list

contains

  subroutine offload(check)
    integer :: check
    integer :: i,j
   
    !--- simulation just started. allocate space on MIC for variables 
    !--- should allocate part and tr here, but dont know how
    !--- should transfer list here, and have it stay there
    if(check.eq.0)then
       !dir$ offload_transfer target(mic:0) in(r:alloc_if(.true.) free_if(.false.)),&
       !dir$                                in(list:alloc_if(.true.) free_if(.false.)),&
       !dir$                                in(start:alloc_if(.true.) free_if(.false.)),&
       !dir$                                in(fin:alloc_if(.true.) free_if(.false.))
    endif


    if(check.eq.1)then
 
       !--- need to transfer data from variable on host to variable on MIC
       !--- np is large in my simulation (~60,000-70,000), so the MIC should vectorize and transfer this easily I hope
       
       !dir$ offload begin target(mic:0) nocopy(r:alloc_if(.false.) free_if(.false.)),&
       !dir$                             nocopy(start:alloc_if(.false.) free_if(.false.)),&
       !dir$                             nocopy(fin:alloc_if(.false.) free_if(.false.))

       !$omp parallel do 
       do i = 1,np
          r%x(i) = r(1,i)
          r%y(i) = r(2,i)
          r%z(i) = r(3,i)
       end do
       do i = 0,list%cellT
          tr%start(i) = start(i)
          tr%fin(i)   = fin(i)
       enddo
       !$omp end parallel do
       !dir$ end offload
       
    endif


    !--- number of cells in simulation has changed
    !--- need to deallocate and reallocate tr%start and tr%fin
    !--- still need to transfer r
    if(check.eq.2)then
       !dir$ offload begin  target(mic:0) in(list:alloc_if(.true.) free_if(.false.)),&
       !dir$                             in(start:alloc_if(.true.) free_if(.false.)),&
       !dir$                             in(fin:alloc_if(.true.) free_if(.false.)),&
       !dir$                             nocopy(r:alloc_if(.false.) free_if(.false.))
       
       !$omp parallel do 
       do i = 1,np
          r%x(i) = r(1,i)
          r%y(i) = r(2,i)
          r%z(i) = r(3,i)
       end do

       do i = 0,list%cellT
          tr%start(i) = start(i)
          tr%fin(i)   = fin(i)
       enddo

       !$omp end parallel do
       !dir$ end offload
    endif

  end subroutine offload


  subroutine get_energy(energy)
    implicit none
    double precision :: energy, pot
    double precision :: x1,y1,z1
    double precision :: x2,y2,z2
    double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
    integer :: i,j,k,l
    integer :: c1s,c1e,c2s,c2e
    integer :: check

    if(condition.eq.1)then
       call offload(1)
    elseif(condition.eq.2)then
       call offload(2)
    endif

    

    !dir$ offload begin target(mic:0) 
    !$omp parallel do reduction(+:energy)
    !---main code I would like to port to MIC in array of structure format
    do i= 1,list%ncell          
       
       c1s = tr%start(i); c1e = tr%fin(i)
       do j = 1,list%ncell
               
          c2s = tr%start(j); c2e = tr%fin(j) 
          do l=c2s,c2e 
             x1 = part%x(l); y1 = part%y(l); z1 = part%z(l)
             do k = c1s,c1e
                           
                x2 = part%x(k); y2 = part%y(k); z2 = part%z(k)
                
                dr2 = (x2-x1)**2+(y2-y1)**2+(z2-z1)**2
                
                if(dr2.lt.2)then
                   dr = sqrt(dr2)
                   dri = 1.0d0/dr
                   dr2i = dri*dri
                   dr6i = dr2i**3
                   dr12i = dr6i**2
                   energy = energy + 4.0d0*(dr12i-dr6i)
                endif
             enddo
          enddo
       enddo
    enddo
    !$omp end parallel do
    !dir$ end offload
    
    
  end subroutine get_energy
  
end module modcell_en

 

Perhaps my question is too long-winded, but I would greatly appreciate any information on how to use SoA-type data on MICs in Fortran (I can't use shared virtual memory), given that such types aren't transferable.

Our apologies for the delay. I will try to have some info posted on your questions soon.
 

I suggest you look at using the INTO modifier. See "Moving Data from One Variable to Another" in the reference guide:

This topic only applies to Intel® Many Integrated Core Architecture (Intel® MIC Architecture).

This topic discusses using the INTO modifier with the OFFLOAD set of directives.

The INTO modifier enables you to transfer data from a variable on the CPU to another on the coprocessor, and the reverse, from a variable on the coprocessor to another on the CPU. Only one item is allowed in the variable-ref list when using the INTO modifier. Thus a one to one correspondence is established between a single source and destination.

When you use INTO with the IN clause, data is copied from the CPU object to the coprocessor object. The ALLOC_IF, FREE_IF, and ALLOC modifiers apply to the INTO expression.

When you use INTO with the OUT clause, data is copied from the coprocessor object to the CPU object. The ALLOC_IF, FREE_IF, and ALLOC modifiers apply to the OUT expression.

The INTO modifier is not allowed with INOUT and NOCOPY clauses.

When you use the INTO modifier, the source expression generates a stream of elements to be copied into the memory ranges specified by the INTO expression. Overlap between the source and destination memory ranges leads to undefined behavior. No ordering can be assumed between transfers from different IN and OUT clauses.

Example

In the following example:

  • The "Partial copy" case copies the first 500 elements of P into elements 501 through 1000 of P1.

  • The "Overlapping copy" case copies the first 600 elements of P into P1, but then tries to copy the last 400 elements of P into P1(100) and beyond, although P1(100) was already initialized by the previous IN clause.

  • The "rank change" is an error because rank 2 data on the coprocessor is being copied back to rank 1 data on the CPU, even though the sizes are the same.

INTEGER :: P (1000), P1 (2000)
INTEGER :: RANK1 (1000), RANK2 (10, 100)

!          Partial copy
!DIR$ OFFLOAD … IN ( P (1:500) : INTO ( P1 (501:1000) ) ) …

!          Overlapping copy; result undefined
!DIR$ OFFLOAD … IN ( P (1:600) : INTO ( P1 (1:600) ) ) … &
&               IN ( P (601:1000) : INTO ( P1 (100:499) ) ) …

!          Rank change is not allowed – error
!DIR$ OFFLOAD … OUT ( RANK2 : INTO ( RANK1 ) ) …

Jim Dempsey

So I am still getting the original error: "A variable used in an offload region must not be of derived type with pointer or allocatable components." [RSOA]

module global
  implicit none
  
  type r
     double precision, allocatable :: x(:),y(:),z(:)
  end type r
  
  !dir$ attributes offload:mic::rSOA
  type(r) :: rSOA

  !dir$ attributes offload:mic::x
  double precision, allocatable :: x(:)
  
  !dir$ attributes offload:mic::y
  double precision, allocatable :: y(:)

  !dir$ attributes offload:mic::z
  double precision, allocatable :: z(:)

 
end module global

 

program MIC

  use global
  use ifport
  
  implicit none
  integer :: i,np

  call seed(10)
  
  np = 60000

  allocate(x(np),y(np),z(np))

  allocate(rSOA%x(np),rSOA%y(np),rSOA%z(np))

  do i = 1,np
     x(i) = rand()*10; y(i) = rand()*10; z(i) = rand()*10
  end do
 
  !---try to transfer data
  !dir$ offload_transfer target(mic:0) in(x(1:np) : into(rSOA%x(1:np)) )
  
 
  stop
end program MIC

 

The idea of INTO is also where I was headed; however, what I have not tried yet is using it with an allocatable component of a derived type. I am also consulting with Development to see what they might recommend for your scenario.

Support for offloading allocatable components of a derived type will be available in our upcoming 15.0 compiler release this year. The current 14.0 compiler issues the error you received.

I confirmed that your small example compiles with the 15.0 compiler, and I believe the data transfers would probably work, but I haven't taken the time to test that yet. We could get you set up to use the Beta compiler for a short time until the release if you are interested.

MIC-only memory allocation is not available in Fortran. However, as long as you allocate the SOA on the CPU you can transfer data from CPU to MIC one array at a time. The example below shows how to transfer from CPU to MIC.

The 15.0 compiler supports this for data transfer CPU -> MIC. An update to 15.0, probably Update 1, will also support data transfer from MIC -> CPU in a similar manner, as well as the ability to specify allocatable array components of derived types in NOCOPY clauses, thus allowing data transferred in one offload to be used in future offloads.

module DTwithAllocatable 
      type DT_Type 
        integer :: integer_var 
        real*8,allocatable :: float_array(:) 
      end type DT_Type 

      !dir$ attributes offload : mic :: DT_Var 
      type(DT_Type) :: DT_Var 
    end module DTwithAllocatable 

    program DT_with_allocatable 
    use DTwithAllocatable 
    implicit none 

    DT_Var%integer_var = 1

    allocate(DT_Var%float_array(4)) 
    DT_Var%float_array(:)=2d0 

    print *,'On host before block 1...' 
    print *, 'CPU ', DT_Var%integer_var 
    print *, 'CPU ', DT_Var%float_array(:) 

! Compiler allows derived-type w/allocatable in IN 
    !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) IN(DT_Var) 
    !DIR$ OFFLOAD BEGIN TARGET(mic:0) IN(DT_Var%float_array) 
      print *,'On phi block 1...' 
      DT_Var%float_array(:)=4d0 
      print *, 'PHI ', DT_Var%integer_var 
      print *, 'PHI ', DT_Var%float_array(:) 
      flush 6
    !DIR$ END OFFLOAD 

#ifdef FUTURE_15_0
! Compiler allows derived-type w/allocatable in NOCOPY 
! so we gain ability to use data persistence from previous offload 
    !DIR$ OFFLOAD BEGIN TARGET(mic:0) NOCOPY (DT_Var) 
      print *,'On phi block 2...' 
      print *, 'PHI ', DT_Var%integer_var 
      print *, 'PHI ', DT_Var%float_array(:) 
      flush 6
    !DIR$ END OFFLOAD 
#endif 

#ifdef FUTURE_15_0 
! Compiler allows derived-type w/allocatable in OUT 
! so we gain ability to transfer allocatable components from MIC to CPU
! and return updated values 
    !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) OUT(DT_Var) 
    !DIR$ OFFLOAD_TRANSFER TARGET(mic:0) OUT (DT_Var%float_array) 
#endif 

    print *,'On host ...' 
    print *, 'CPU ', DT_Var%integer_var 
    print *, 'CPU ', DT_Var%float_array(:) 

    end program DT_with_allocatable

 

OK, I will just have to suck it up and type in the length of the array by hand each time I run the simulation. I don't think the Beta compiler (I believe this is the version 15.0 that Rajiv referred to above) can help me, since my codes run on the supercomputer Stampede, unless they are upgrading very soon. So I think I just have to get rid of the allocatable arrays and specify their lengths by hand. Is there any way to allocate the space on the MIC for the SoA in one offload directive, as below, or do I have to deal with the rSOA memory being allocated and deallocated every time I call the MIC? That could kill my performance.

program MIC

  use global
  use ifport
  
  implicit none
  integer :: i,np

  call seed(10)
  
  np = 60000

  allocate(x(np),y(np),z(np))

  allocate(rSOA%x(np),rSOA%y(np),rSOA%z(np))

  do i = 1,np
     x(i) = rand()*10; y(i) = rand()*10; z(i) = rand()*10
  end do
 
  !---try to transfer data
  !dir$ offload_transfer target(mic:0) in(rSOA%x: alloc_if(.true.) free_if(.false.))

  !dir$ offload_transfer target(mic:0) in(x(1:np) : into(rSOA%x(1:np)) )
  
 
  stop
end program MIC

 

So I just realized that I do need to learn how to copy the SoA with allocatable components; I can't type in the length of the array by hand as I mentioned above. I would like something as follows:

type min
    double precision, allocatable :: minx(:)
end type min
type(min) :: min_array

Now the array is supposed to be as long as the number of cells in my simulation. However, the number of cells fluctuates, so the array has to be allocatable. That's why I initially had this in the following form:

type min
   double precision :: minx
end type min

type(min) , allocatable :: min_array(:)

As the number of cells varies, I could just deallocate and reallocate this AoS and pass it to the MIC easily (I hadn't done any performance testing). The problem is that, now that I have performance tested, I found that using this AoS in one of my MIC loops is killing my performance compared with OpenMP on 16 threads. The scattered memory access with the AoS is actually making the MIC twice as slow. If I comment out the parts with the AoS array, the MIC is about 70% faster, so I need to either convert the AoS to SoA somehow and get it to the MIC, or ditch the MIC :(
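For reference, the host-side AoS to SoA repacking described above could look roughly like the sketch below. It is plain Fortran with no offload directives; the type names cell_aos and cell_soa, the three fields, and the helper repack are hypothetical and only illustrate the idea. The SoA arrays are reallocated only when the cell count changes.

```fortran
program aos_to_soa
  implicit none

  ! AoS element: one record per cell (type and field names are hypothetical)
  type cell_aos
     double precision :: minx, miny, minz
  end type cell_aos

  ! SoA container: one allocatable array per field
  type cell_soa
     double precision, allocatable :: minx(:), miny(:), minz(:)
  end type cell_soa

  type(cell_aos), allocatable :: cells(:)
  type(cell_soa) :: packed
  integer :: i, ncell

  ncell = 5
  allocate(cells(ncell))
  do i = 1, ncell
     cells(i)%minx = 1.0d0 * i
     cells(i)%miny = 2.0d0 * i
     cells(i)%minz = 3.0d0 * i
  end do

  call repack(cells, packed)

  print *, 'minx:', packed%minx
  print *, 'miny:', packed%miny
  print *, 'minz:', packed%minz

contains

  ! Gather each AoS field into its own contiguous array
  subroutine repack(aos, soa)
    type(cell_aos), intent(in)    :: aos(:)
    type(cell_soa), intent(inout) :: soa
    integer :: n

    n = size(aos)

    ! Resize the SoA arrays only when the number of cells has changed
    if (allocated(soa%minx)) then
       if (size(soa%minx) /= n) deallocate(soa%minx, soa%miny, soa%minz)
    end if
    if (.not. allocated(soa%minx)) allocate(soa%minx(n), soa%miny(n), soa%minz(n))

    ! aos%minx is a strided section of the AoS; assignment copies it
    ! into unit-stride storage
    soa%minx = aos%minx
    soa%miny = aos%miny
    soa%minz = aos%minz
  end subroutine repack

end program aos_to_soa
```

The explicit allocate avoids relying on automatic reallocation on assignment, which may depend on compiler options.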

I would think the Beta might be available on Stampede; I would definitely check with their sys-admins/user consultants. If you need to exploit data persistence to avoid allocation/deallocation overhead, though, you would have to wait until the fix that Rajiv noted is available to support that.
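For plain allocatable arrays (as opposed to allocatable components of a derived type), data persistence can be expressed with the ALLOC_IF and FREE_IF modifiers. The following is a minimal, untested sketch under that assumption; the names persist_mod, xs, n and total are hypothetical. A compiler without offload support treats the !dir$ lines as comments, so the same code runs on the host only.

```fortran
module persist_mod
  implicit none
  ! Module-scope array made visible to the coprocessor (hypothetical name)
  !dir$ attributes offload : mic :: xs
  double precision, allocatable :: xs(:)
end module persist_mod

program persist_sketch
  use persist_mod
  implicit none
  integer :: n, step
  double precision :: total

  n = 1000
  allocate(xs(n))
  xs = 0.0d0

  ! First transfer: allocate the coprocessor buffer and keep it alive
  !dir$ offload_transfer target(mic:0) in(xs : alloc_if(.true.) free_if(.false.))

  do step = 1, 3
     xs = dble(step)

     ! Later transfers: reuse the existing buffer, only copy the data
     !dir$ offload_transfer target(mic:0) in(xs : alloc_if(.false.) free_if(.false.))

     total = 0.0d0
     ! Compute on the data already resident on the coprocessor
     !dir$ offload begin target(mic:0) in(n) nocopy(xs : alloc_if(.false.) free_if(.false.)) inout(total)
     total = sum(xs(1:n))
     !dir$ end offload

     print *, 'step', step, 'sum =', total
  end do

  ! Last step: release the coprocessor buffer without copying anything
  !dir$ offload_transfer target(mic:0) nocopy(xs : alloc_if(.false.) free_if(.true.))

  deallocate(xs)
end program persist_sketch
```

The point of the pattern is that the coprocessor allocation happens once, outside the loop, so each iteration pays only for the data copy.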

Conor,

I haven't had time to try this, but here is the general outline:

module your_mod
type YourTypeContainer
real, allocatable :: x(:)
...
end type YourTypeContainer

! experiment with different named containers
type(YourTypeContainer) :: HostContainer
!dir$ attributes offload : mic :: MICContainer
type(YourTypeContainer) :: MICContainer
...
end module your_mod

! then for each allocatable component of the container
!dir$ offload_transfer target(mic:0) IN(HostContainer%x : INTO(MICContainer%x) alloc_if(.true.) free_if(.false.))

Jim Dempsey

So I tried this:

type r 
    double precision, allocatable :: x(:), y(:),z(:)
end type r

type(r) :: rhost
type(r) :: rSOA

!---allocate rhost%x, rhost%y, rhost%z and then move on to data transfer

!dir$ offload_transfer target(mic:0),&
!dir$ in(rhost%x : into(rSOA%x) alloc_if(.true.) free_if(.false.))

I get the error: 'ALLOC_IF or FREE_IF modifier can be specified for arrays or for a pointer or an allocatable scalar' (with an arrow pointing to into(rSOA%x)).

Use:

!dir$ attributes offload : mic :: rSOA 
type(r) :: rSOA 
 

Jim Dempsey

The into(...) must reference a variable/array known in the scope of the MIC. As you had it, it was in the scope of the Host.

The addition of the !dir$ attributes offload : mic :: rSOA

places the rSOA in the scope of the MIC.

Jim Dempsey

Also, in my (untested) example I placed the rSOA in the module data space, not in the function/subroutine local data space.

Quote:

conor p. wrote:

So I tried this:

type r 
    double precision, allocatable :: x(:), y(:),z(:)
end type r

type(r) :: rhost
type(r) :: rSOA

!---allocate rhost%x, rhost%y, rhost%z and then move on to data transfer

!dir$ offload_transfer target(mic:0),&
!dir$ in(rhost%x : into(rSOA%x) alloc_if(.true.) free_if(.false.))

I get the error: 'ALLOC_IF or FREE_IF modifier can be specified for arrays or for a pointer or an allocatable scalar' (with an arrow pointing to into(rSOA%x)).

This code compiles error free with 15.0. With 14.0, you are bumping up against the lack of support for offloading allocatable components of a derived-type.

$ ifort -c t1.F90
t1.F90(12): error #8507: ALLOC_IF or FREE_IF modifier can be specified for arrays or for a pointer or an allocatable scalar   [RSOA]
!dir$ in(rhost%x : into(rSOA%x) alloc_if(.true.) free_if(.false.))
------------------------^
t1.F90(12): error #8545: A variable used in an OFFLOAD region must not be of derived type with pointer or allocatable components.   [RSOA]
!dir$ in(rhost%x : into(rSOA%x) alloc_if(.true.) free_if(.false.))
------------------------^
compilation aborted for t1.F90 (code 1)

 

OK guys, I am going to try to pick your brains on one last thing. I have written a test code to test four possible scenarios.

1. Use one SoA type structure for the position coordinates. I can get rid of the allocatable here and just type in the length by hand. The drawback is that I can't preallocate space with this, and have to suffer the time of allocating and deallocating.

2. All one dimensional arrays with data transfer and memory allocation

3. All one dimensional arrays using preallocated memory and data transfer

4. All one dimensional arrays using just openMP

compilation: ifort -openmp -align array64byte global.f90 MIC.f90 -O3 -o new.out

The timing results (in seconds) were:

case 1: 1.509

case 2: 0.7565

case 3: 0.7251

case 4: 0.8065

My question is: where are my flops? I seem to be barely outperforming OpenMP with 16 threads here. Do you see anything I could do to speed this up? I was expecting the MIC to be faster than that. I originally thought it could be a memory allocation problem, but the third case is not much better than case 2. Perhaps the amount of data I need to send to the MIC is becoming prohibitive? I apologize for the amount of code, but fret not: it's the same loop each time, just with different MIC directives for each test case. If I make the second loop smaller (do j=i+1,1000), eventually the OpenMP loop overtakes the MIC performance.

module global
  implicit none
  
  type r
     double precision :: x(60000),y(60000),z(60000)
  end type r
  
  !dir$ attributes offload:mic:: rSOA
  type(r) :: rSOA

  !dir$ attributes offload:mic::start
  integer, allocatable :: start(:)

  !dir$ attributes offload:mic:: end
  integer, allocatable :: end(:)

  !dir$ attributes offload:mic:: min_x
  double precision, allocatable :: min_x(:) 
  !dir$ attributes offload:mic:: min_y
  double precision, allocatable :: min_y(:)
  !dir$ attributes offload:mic:: min_z
  double precision, allocatable :: min_z(:)

  !dir$ attributes offload:mic:: x
  double precision, allocatable :: x(:) 
  !dir$ attributes offload:mic:: y
  double precision, allocatable :: y(:)
  !dir$ attributes offload:mic:: z
  double precision, allocatable :: z(:)

 
end module global
program MIC

  use ifport
  use global
  implicit none

  double precision :: energy
  double precision :: minx,miny,minz
  double precision :: dx,dy,dz
  double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
  double precision :: x1,y1,z1,x2,y2,z2
  integer :: i,j,k,l,np
  integer :: c1s,c1e,c2s,c2e
  integer :: T1,T2,clock_rate,clock_max
  integer :: count
  
  call seed(10)

  !--- in my real code, this variable is read in from a file
  !--- represents the total number of atoms in a simulation
  np = 60000
  allocate(x(np),y(np),z(np))

 
  call srand(10)
  !--- don't care about this. this is just initializing random positions
  do i = 1,np
     x(i) = 1000*rand(); y(i) = 1000*rand(); z(i) = 1000*rand()
     rSOA%x(i) = x(i) ; rSOA%y(i) = y(i) ; rSOA%z(i) = z(i)
  end do

  allocate(start(1000),end(1000),min_x(1000),min_y(1000),min_z(1000))
  !--- don't care about this either. just assigning pointers to rnew and part
  count = 0
  do i = 1,1000
     start(i) = count*60+1
     end(i)   = (count+1)*60

     min_x(i) = rand() ; min_y(i) = rand(); min_z(i) = rand()

     count = count + 1
  enddo
  
  !-----------------------------------------------------------!
  ! case one: structure of arrays                             !
  !-----------------------------------------------------------!
  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  
  !dir$ offload begin target(mic:0) in(start,end,rSOA)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(rSOA,end,start,min_x,min_y,min_z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = rSOA%x(k); y1 = rSOA%y(k); z1 = rSOA%z(k)
           
           do l = c2s,c2e
              x2 = rSOA%x(l); y2 = rSOA%y(l); z2 = rSOA%z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 1:',real(T2-T1)/real(clock_rate)


  !-----------------------------------------------------------!
  ! case two: arrays with data transfer and allocation        !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.true.) free_if(.false.))

  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 2:',real(T2-T1)/real(clock_rate)


  !-----------------------------------------------------------!
  ! case three: arrays with data transfer and no memory alloc !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.false.) free_if(.false.))

  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 3:',real(T2-T1)/real(clock_rate)


  !-----------------------------------------------------------!
  ! case four: openmp                                         !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
 
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 4:',real(T2-T1)/real(clock_rate)

 
  stop
end program MIC


Tried your program on my system. Had to make some edits first:

Type r had the allocatable attribute but explicitly stated the dimension as (60000); I changed this to (:).

Where you allocate x(np), I added an allocate for rSOA%x(np), and the same for y and z.

You had two undeclared integers: count and end1, so I added those.

Your "!DEC$ offload transfer" was missing the "_" (at least my compiler version requires it).

 

Program compiles, runtimes:

 time case 1:  0.7422000    
 time case 2:  3.3399999E-02
 time case 3:  1.1000000E-03
 time case 4:  2.5800001E-02

I will have to debug to see why my runtimes are so short.

Jim Dempsey

I really apologize for that. I run my codes on the MIC on Stampede, and for some reason I can't copy and paste my code from that terminal to this forum, so I have to retype the codes by hand. I corrected the errors and found the one that was producing your artificially low results, Jim. The code I have posted is now 100% correct. Again, I apologize for that. The new timing results are as follows:

case 1: 1.594

case 2: 1.158

case 3: 1.131

case 4: 0.817

On my system, case 3 (on MIC) is ~25x faster than on host. My host is 1 E5-2620 v2 (6 core/12 thread). I have two Xeon Phi 5110P's, one used for test.

The runtimes on my system are too short to be deemed accurate (margin for error is on order of run time). Investigating...

The very short run times occur because the result, energy, is never used, so the compiler's optimizations eliminated the compute loops as dead code. In other words, your runtimes were for performing a NOP.
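A minimal host-only sketch of one way to avoid this: consume the accumulated result after the timed region, for example by printing it, so the optimizer cannot discard the loops. The program name, array size and loop body here are hypothetical stand-ins for the real kernel.

```fortran
program keep_result_live
  implicit none
  integer, parameter :: n = 2000
  double precision :: x(n), energy, dx
  integer :: i, j, t1, t2, clock_rate

  do i = 1, n
     x(i) = 0.001d0 * i
  end do

  energy = 0.0d0
  call system_clock(t1, clock_rate)

  ! Stand-in for the pair loop being timed
  do i = 1, n
     do j = i + 1, n
        dx = x(j) - x(i)
        energy = energy + dx * dx
     end do
  end do

  call system_clock(t2)

  print *, 'time  :', real(t2 - t1) / real(clock_rate)
  ! Using energy after the timed region keeps the loops from being
  ! removed as dead code
  print *, 'energy:', energy
end program keep_result_live
```

Printing the value also gives a quick sanity check that the different test cases compute the same answer.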

I made some revisions to your code:

module global
  implicit none
  
  type r
     double precision, allocatable :: x(:),y(:),z(:) ! JGD allocated to np
  end type r
  
!dir$ attributes offload:mic:: rSOA
  type(r) :: rSOA

  !dir$ attributes offload:mic::start
  integer, allocatable :: start(:)

  !dir$ attributes offload:mic:: end
  integer, allocatable :: end(:)

  !dir$ attributes offload:mic:: min_x
  double precision, allocatable :: min_x(:) 
  !dir$ attributes offload:mic:: min_y
  double precision, allocatable :: min_y(:)
  !dir$ attributes offload:mic:: min_z
  double precision, allocatable :: min_z(:)

  !dir$ attributes offload:mic:: x
  double precision, allocatable :: x(:) 
  !dir$ attributes offload:mic:: y
  double precision, allocatable :: y(:)
  !dir$ attributes offload:mic:: z
  double precision, allocatable :: z(:)

 
end module global
program MIC

  use ifport
  use global
  implicit none

  double precision :: energy
  double precision :: minx,miny,minz
  double precision :: dx,dy,dz
  double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
  double precision :: x1,y1,z1,x2,y2,z2
  integer :: i,j,k,l,np
  integer :: c1s,c1e,c2s,c2e
  integer :: T1,T2,clock_rate,clock_max
  integer :: count ! JGD

!dec$ define use_filter
!dec$ if defined(use_filter)
  double precision :: filter
!dec$ endif

  call seed(10)

  !--- in my real code, this variable is readin from a file
  !--- represents the total number of atoms in a simulation
  np = 60000
  allocate(rSOA%x(np),rSOA%y(np),rSOA%z(np))
  allocate(x(np),y(np),z(np))

 
  call srand(10)
  !--- dont care about this. this is just initializing random positions
  do i = 1,np
     x(i) = 1000*rand(); y(i) = 1000*rand(); z(i) = 1000*rand()
     rSOA%x(i) = x(i) ; rSOA%y(i) = y(i) ; rSOA%z(i) = z(i)
  end do

  allocate(start(1000),end(1000),min_x(1000),min_y(1000),min_z(1000))
  !--- don't care about this either. just assigning pointers to rnew and part
  count = 0
  do i = 1,1000
     start(i) = count*60+1
     end(i)   = (count+1)*60

     min_x(i) = rand() ; min_y(i) = rand(); min_z(i) = rand()

     count = count + 1
  enddo

!DEC$ IF DEFINED(UseWhenUserDefinedTypesSupportedInOffload)  
!-----------------------------------------------------------!
! case one: structure of arrays                             !
!-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  
  !dir$ offload begin target(mic:0) in(start,end,min_x,min_y,min_z,rSOA) inout(energy)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(rSOA,end,start,min_x,min_y,min_z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = rSOA%x(k); y1 = rSOA%y(k); z1 = rSOA%z(k)
           
           do l = c2s,c2e
              x2 = rSOA%x(l); y2 = rSOA%y(l); z2 = rSOA%z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 1:',real(T2-T1)/real(clock_rate), ' energy:', energy
!DEC$ ENDIF

  !-----------------------------------------------------------!
  ! case two: arrays with data transfer and allocation        !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.true.) free_if(.false.))

  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z) inout(energy)
  !$omp parallel do schedule(dynamic,1) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
!dec$ if defined(use_filter)
              filter = 0.0d0     
              if(dr2.lt.2.0d0) filter = 1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
                 
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
!dec$ else         
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
!dec$ endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 2:',real(T2-T1)/real(clock_rate), ' energy:', energy


  !-----------------------------------------------------------!
  ! case three: arrays with data transfer and no memory alloc   !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.false.) free_if(.false.))

  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z) inout(energy)
  !$omp parallel do schedule(dynamic,1) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
!dec$ if defined(use_filter)
              filter = 0.0d0     
              if(dr2.lt.2.0d0) filter = 1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
                 
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
!dec$ else         
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
!dec$ endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 3:',real(T2-T1)/real(clock_rate), ' energy:', energy


  !-----------------------------------------------------------!
  ! case four: openmp                                         !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
 
  !$omp parallel do schedule(dynamic,1) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)

        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz

!dec$ if defined(use_filter)
              filter = 0.0d0     
              if(dr2.lt.2.0d0) filter = 1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
                 
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
!dec$ else         
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
!dec$ endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 4:',real(T2-T1)/real(clock_rate), ' energy:', energy

 
  stop
end program MIC

Notes:

Added the missing integer :: count.

Added !dec$ define use_filter and the corresponding conditional code, so the program compiles either your way (use_filter not defined) or an alternate way (see code).

Case one is conditionally compiled out because IVF does not support user-defined types in offload (until the upcoming version 15).

In case one, added the missing min_x, min_y, min_z to the offload (but did not add the filter code).

In cases two, three, and four, added a conditional section for use_filter. Also experimented with changing to omp schedule(dynamic,1).

Results:

 !$omp parallel do schedule(dynamic) reduction(+:energy),&
 time case 2:   1.224600      energy:   47.8019349349047     
 time case 3:  0.5719000      energy:   47.8019349349047     
 time case 4:   1.471700      energy:   47.8019349349047     

 !$omp parallel do schedule(dynamic,1) reduction(+:energy),&
 time case 2:   1.198800      energy:   47.8019349349047     
 time case 3:  0.5730000      energy:   47.8019349349047     
 time case 4:   1.464600      energy:   47.8019349349047     

 !dec$ define use_filter
 time case 2:  0.8681000      energy:   47.8019349349047     
 time case 3:  0.2321000      energy:   47.8019349349047     
 time case 4:   1.954400      energy:   47.8019349349047     

The use_filter variant worked significantly better on the MIC (more than 2x for case 3). On the host, performance went down.

Adding -xHost dropped case 4 to 1.92

The use_filter variant drags down computation when none of the vector lanes contain contributing energies, because the arithmetic is still performed for every lane. With its narrower vector, the host has a higher probability that a given vector holds only non-contributing energies.
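To make the two variants concrete, here is a small host-only sketch that strips the inner loop down to a single array of squared distances. The array contents and the names filter_demo, e_branch, and e_filter are made up for illustration; only the arithmetic mirrors the posted code:

```fortran
program filter_demo
  implicit none
  integer, parameter :: n = 8          ! hypothetical: a handful of pairs
  double precision :: dr2(n), e_branch, e_filter
  double precision :: dri, dr2i, dr6i, filter
  integer :: l

  ! Made-up squared distances; only values below 2.0 contribute.
  dr2 = (/ 0.9d0, 1.5d0, 2.5d0, 4.0d0, 1.1d0, 9.0d0, 1.9d0, 3.0d0 /)

  ! Variant 1: branch. Non-contributing pairs skip the arithmetic.
  e_branch = 0.0d0
  do l = 1, n
     if (dr2(l) .lt. 2.0d0) then
        dri  = 1.0d0/sqrt(dr2(l))
        dr2i = dri*dri
        dr6i = dr2i*dr2i*dr2i
        e_branch = e_branch + 4.0d0*(dr6i*dr6i - dr6i)
     end if
  end do

  ! Variant 2: filter. Every pair does the arithmetic and the 0/1 factor
  ! zeroes out the non-contributing results, so the energy update itself
  ! is not guarded by a branch.
  e_filter = 0.0d0
  do l = 1, n
     filter = 0.0d0
     if (dr2(l) .lt. 2.0d0) filter = 1.0d0
     dri  = 1.0d0/sqrt(dr2(l))
     dr2i = dri*dri
     dr6i = dr2i*dr2i*dr2i
     e_filter = e_filter + 4.0d0*(dr6i*dr6i - dr6i)*filter
  end do

  print *, 'branch:', e_branch, ' filter:', e_filter
end program filter_demo
```

Both variants accumulate the same energy. The difference is only in how much work is done for pairs beyond the cutoff, which is where the trade-off between wide and narrow vectors comes from.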

Jim Dempsey

 

There's some kind of black magic going on that I am missing. Here is verbatim the code I ran (I didn't change module global at all), which I believe to be the same as your !dec$ if case, Jim.

program MIC

  use ifport
  use global
  implicit none

  double precision :: energy
  double precision :: minx,miny,minz
  double precision :: dx,dy,dz
  double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
  double precision :: x1,y1,z1,x2,y2,z2
  integer :: i,j,k,l,np
  integer :: c1s,c1e,c2s,c2e
  integer :: T1,T2,clock_rate,clock_max
  integer :: count

  !dec$ define use_filter
  !dec$ if defined(use_filter)
  double precision :: filter
  !dec$ endif
  
  call seed(10)

  !--- in my real code, this variable is readin from a file
  !--- represents the total number of atoms in a simulation
  np = 60000
  allocate(x(np),y(np),z(np))

 
  call srand(10)
  !--- dont care about this. this is just initializing random positions
  do i = 1,np
     x(i) = 1000*rand(); y(i) = 1000*rand(); z(i) = 1000*rand()
     rSOA%x(i) = x(i) ; rSOA%y(i) = y(i) ; rSOA%z(i) = z(i)
  end do

  allocate(start(1000),end(1000),min_x(1000),min_y(1000),min_z(1000))
  !--- don't care about this either. just assigning pointers to rnew and part
  count = 0
  do i = 1,1000
     start(i) = count*60+1
     end(i)   = (count+1)*60

     min_x(i) = rand() ; min_y(i) = rand(); min_z(i) = rand()

     count = count + 1
  enddo
  
  
  !dec$ if defined(USeWhenUserDefinedTypesSupportedInOffload)
  !-----------------------------------------------------------!
  ! case one: structure of arrays                             !
  !-----------------------------------------------------------!
  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  
  !dir$ offload begin target(mic:0) in(start,end,rSOA,min_x,min_y,min_z)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(rSOA,end,start,min_x,min_y,min_z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = rSOA%x(k); y1 = rSOA%y(k); z1 = rSOA%z(k)
           
           do l = c2s,c2e
              x2 = rSOA%x(l); y2 = rSOA%y(l); z2 = rSOA%z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 1:',real(T2-T1)/real(clock_rate)
  print*,'energy case 1:',energy
  !dec$ endif
  
  !-----------------------------------------------------------!
  ! case two: arrays with data transfer and allocation        !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.true.) free_if(.false.))

  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
             
              !dec$ if defined(use_filter)
              filter = 0.0d0
              if(dr2.lt.2.0d0) filter = 1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter

              !dec$ else
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif

              !dec$ endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 2:',real(T2-T1)/real(clock_rate)
  print*,'energy case 2:',energy


  !-----------------------------------------------------------!
  ! case three: arrays with data transfer and no memory alloc   !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.false.) free_if(.false.))

  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              !dec$ if defined(use_filter)
              filter = 0.0d0
              if(dr2.lt.2.0d0) filter =1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
              
              !dec$ else
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i =1.0d0*dri*dri
                 dr6i =dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy= energy + 4.0d0*(dr12i-dr6i)
              endif
              
              !dec$ endif   
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 3:',real(T2-T1)/real(clock_rate)
  print*,'energy case 3:',energy


  !-----------------------------------------------------------!
  ! case four: openmp                                         !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
 
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              !dec$ if defined(use_filter)
              filter = 0.0d0
              if(dr2.lt.2.0d0) filter =1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
              
              !dec$ else
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i =1.0d0*dri*dri
                 dr6i =dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy= energy + 4.0d0*(dr12i-dr6i)
              endif
              
              !dec$ endif   
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 4:',real(T2-T1)/real(clock_rate)
  print*,'energy case 4:',energy

 
  stop
end program MIC

compiled with 

ifort -openmp -align array64byte global.f90 MIC.f90 -O3 -o new.out

results

time case2: 1.444600
energy case2: 47.809193

time case3: 0.7155000
energy case3: 47.80193

time case4: 0.8015
energy case4: 47.80193

The OpenMP code was run with 16 threads, and the MIC was run with 240 threads using KMP_AFFINITY=scatter. Can you tell why your speeds are so much better than mine, Jim?

I did not set MIC_KMP_AFFINITY, nor set KMP_AFFINITY (IOW using defaults)

ifort -openmp -O3 -xHost ...

adding: -align array64byte

two: 0.87

three: 0.23

four: 1.81

A tad better on the OpenMP

Adding KMP_AFFINITY=scatter, and MIC_KMP_AFFINITY=scatter

two: 0.883

three: 0.232

four: 1.799

Essentially the same.

I am using the 15.0.0.024 201450318, but that shouldn't yield 3x better on case 3. It does permit me to compile case 1 without error, but the run time data does not match, so I excluded that from the tests. I haven't updated the copy since then (~3/18); maybe case 1 will work with the newer revisions.

Jim Dempsey

I do note that I am explicitly using inout on energy. This should not make a difference, but like you say "black magic" is happening.

Jim Dempsey

Does the fact that I am running on Stampede give you any possible information here? Maybe somehow the filter directive isn't being recognized?

Conor,

See if you can modify the program to obtain two timings per case: one for the time it takes to copy the data, and a second for the elapsed time of the computation.

Note, the Host and Xeon Phi will not necessarily have synchronized clocks (nor clock rates). Therefore you will have to insert a small piece of code to make the determination of the clock differences.
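As a rough sketch of the two-timings idea, the host-only program below brackets a copy step and a compute step with separate clock readings. The array copy and the sqrt loop are stand-ins (assumptions for illustration) for the offload_transfer and the offloaded omp loop. Both readings of each interval come from the same clock, which is the property to preserve in the real code: if the start and end stamps of an interval are taken on the same device and divided by that device's clock rate, the interval does not depend on the two clocks being synchronized.

```fortran
program split_timing
  implicit none
  integer, parameter :: n = 2000000    ! hypothetical array size
  double precision, allocatable :: src(:), dst(:)
  double precision :: total
  integer(kind=8) :: t1, t2, t3, rate
  integer :: i

  allocate(src(n), dst(n))
  do i = 1, n
     src(i) = dble(i)
  end do

  call system_clock(t1, rate)
  dst = src                      ! stand-in for the offload_transfer step
  call system_clock(t2)

  total = 0.0d0
  do i = 1, n                    ! stand-in for the offloaded compute loop
     total = total + sqrt(dst(i))
  end do
  call system_clock(t3)

  print *, 'copy time   :', dble(t2 - t1)/dble(rate)
  print *, 'compute time:', dble(t3 - t2)/dble(rate)
  print *, 'checksum    :', total   ! keeps the compute loop from being optimized away
end program split_timing
```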

On my system the Xeon Phi is inserted into an x16 slot. If your card is in an x8 slot, the transfer time could be 2x or so different.

A second potential issue is if the Xeon Phi is not located on the PCIe bus attached to the CPU that is running the host part of your program.

Jim Dempsey

Note, a "hack" way to indirectly get a coarse estimate of the transfer overhead (assuming the total data is larger than the combined L2 caches, 32MB):

Add a loop around the offloaded omp loop. Run one test with an iteration count of 1 (essentially the same program as you have now, with a do-once loop). Then make a second run with a loop count of 2. Assuming the data set is large enough that it does not stay in cache, the transfer overhead would be:

one iteration run time - (two iteration run time - one iteration run time)
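The same arithmetic written out as a tiny Fortran helper. The function name transfer_overhead and the two timings are made up for illustration; substitute the measured wall-clock times from the two runs:

```fortran
program overhead_estimate
  implicit none
  double precision :: t_one, t_two

  ! Made-up wall-clock times in seconds, for illustration only.
  t_one = 1.0d0     ! run with the compute loop executed once
  t_two = 1.25d0    ! run with the compute loop executed twice

  print *, 'compute per iteration:', t_two - t_one
  print *, 'estimated overhead   :', transfer_overhead(t_one, t_two)

contains

  ! overhead = one-iteration time minus the cost of one extra iteration
  pure function transfer_overhead(t1, t2) result(overhead)
    double precision, intent(in) :: t1, t2
    double precision :: overhead
    overhead = t1 - (t2 - t1)
  end function transfer_overhead

end program overhead_estimate
```

With these made-up inputs the extra iteration costs 0.25 s, so the estimated overhead is 0.75 s. The estimate lumps together everything that happens once per run (allocation, transfer, offload startup), not just the copy itself.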

Jim Dempsey

 

Here you go. The code I ran is

program MIC

  use ifport
  use global
  use omp_lib
  implicit none

  double precision :: energy
  double precision :: minx,miny,minz
  double precision :: dx,dy,dz
  double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
  double precision :: x1,y1,z1,x2,y2,z2
  integer :: i,j,k,l,np
  integer :: c1s,c1e,c2s,c2e
  integer :: T1,T2,T3,T4,clock_rate,clock_max
  integer :: count
  integer :: num,tid


  !dec$ define use_filter
  !dec$ if defined(use_filter)
  double precision :: filter
  !dec$ endif
  
  call seed(10)

  !--- in my real code, this variable is readin from a file
  !--- represents the total number of atoms in a simulation
  np = 60000
  allocate(x(np),y(np),z(np))

 
  call srand(10)
  !--- dont care about this. this is just initializing random positions
  do i = 1,np
     x(i) = 1000*rand(); y(i) = 1000*rand(); z(i) = 1000*rand()
     rSOA%x(i) = x(i) ; rSOA%y(i) = y(i) ; rSOA%z(i) = z(i)
  end do

  allocate(start(1000),end(1000),min_x(1000),min_y(1000),min_z(1000))
  !--- don't care about this either. just assigning pointers to rnew and part
  count = 0
  do i = 1,1000
     start(i) = count*60+1
     end(i)   = (count+1)*60

     min_x(i) = rand() ; min_y(i) = rand(); min_z(i) = rand()

     count = count + 1
  enddo
  
  
  !dec$ if defined(USeWhenUserDefinedTypesSupportedInOffload)
  !-----------------------------------------------------------!
  ! case one: structure of arrays                             !
  !-----------------------------------------------------------!
  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  
  !dir$ offload begin target(mic:0) in(start,end,rSOA,min_x,min_y,min_z)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(rSOA,end,start,min_x,min_y,min_z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = rSOA%x(k); y1 = rSOA%y(k); z1 = rSOA%z(k)
           
           do l = c2s,c2e
              x2 = rSOA%x(l); y2 = rSOA%y(l); z2 = rSOA%z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 1:',real(T2-T1)/real(clock_rate)
  print*,'energy case 1:',energy
  !dec$ endif
  
  !-----------------------------------------------------------!
  ! case two: arrays with data transfer and allocation        !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.true.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.true.) free_if(.false.))
  call system_clock(T2,clock_rate,clock_max)
  print*,'time for transfer MIC 2:',real(T2-T1)/real(clock_rate)
  
  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z) inout(energy)
  call system_clock(T3,clock_rate,clock_max)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
             
              !dec$ if defined(use_filter)
              filter = 0.0d0
              if(dr2.lt.2.0d0) filter = 1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter

              !dec$  else
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
              !dec$ endif

           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T4,clock_rate,clock_max)
  print*,'timing for MIC case 2:',real(T4-T3)/real(clock_rate)
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'total time case 2:',real(T2-T1)/real(clock_rate)
  print*,'energy case 2:',energy


  !-----------------------------------------------------------!
  ! case three: arrays with data transfer and no memory alloc   !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.false.) free_if(.false.))
  call system_clock(T2,clock_rate,clock_max)
  print*,'time for transfer case 3;',real(T2-T1)/real(clock_rate)
  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z) inout(energy)
  call system_clock(T3,clock_rate,clock_max)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              !dec$ if defined(use_filter)
              filter = 0.0d0
              if(dr2.lt.2.0d0) filter =1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
              
              !dec$ else
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i =1.0d0*dri*dri
                 dr6i =dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy= energy + 4.0d0*(dr12i-dr6i)
              endif
              !dec$ endif   
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T4,clock_rate,clock_max)
  print*,'time for MIC case :',real(T4-T3)/real(clock_rate)
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 3 with vector lanes:',real(T2-T1)/real(clock_rate)
  print*,'energy case 3:',energy


  !-----------------------------------------------------------!
  ! case four: openmp                                         !
  !-----------------------------------------------------------!
  !$omp parallel private(tid)

  tid = omp_get_thread_num()

  if(tid.eq.0)then
     num = omp_get_num_threads()
     print*,'using this many threads',num
  endif

  !$omp end parallel

  energy = 0.0d0

  call system_clock(T1,clock_rate,clock_max)
 
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              !dec$ if defined(use_filter)                                                                                                                                                                    
              filter = 0.0d0
              if(dr2.lt.2.0d0) filter =1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
              
              !dec$ else                                                                                                                                                                                      
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i =1.0d0*dri*dri
                 dr6i =dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy= energy + 4.0d0*(dr12i-dr6i)
              endif
              !dec$ endif   
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 4 with filter:',real(T2-T1)/real(clock_rate)
  print*,'energy case 4:',energy

 
  stop
end program MIC


The results:

time for transfer MIC 2: 0.515500
timing for MIC case 2: 1.02030
total time case 2: 1.551300

time for transfer case 3: 1.000e-3
time case 3 with vector lanes: 0.716900
energy case 3: 47.80193
time for MIC case: 0.715400

time case 4 with filter: 0.7999
energy case 4: 47.80193

I will email the people at Stampede and ask where their MICs are inserted.

 

One more change to your program....

Linux uses a first-touch policy: a virtual memory page of a process is not mapped until the data in it is first "touched" (accessed). The first use of memory allocated from the heap therefore costs more than subsequent accesses. Note that this does not apply every time a node is allocated from the heap; it applies to the first access to any node that lies (partially or totally) within a virtual memory page the process has not yet touched.

With this in mind, the "standard" procedure for gathering timing statistics is:

do rep = 1,3

{ the one-shot test you have now for all three cases}

end do

Then you usually discard the first pass.

This said, your requirements may be to use only the first iteration timings (e.g. your process only performs this once).
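
A minimal, host-only sketch of that repeat-and-discard procedure is shown here for illustration. It is not the code used for any timings in this thread: the program name timing_reps, the array size n, and the fill-and-sum workload are assumptions chosen only to make the first-touch cost visible on the first pass.

```fortran
program timing_reps
  use iso_fortran_env, only: int64
  implicit none

  ! Hypothetical sizes, chosen only for illustration
  integer, parameter :: n = 2000000, nrep = 3

  double precision, allocatable :: a(:)
  double precision :: total
  integer(int64) :: t1, t2, clock_rate
  real :: elapsed(nrep)
  integer :: rep, i

  ! Allocation alone does not touch the pages; the first pass
  ! through the loop below pays the page-mapping cost.
  allocate(a(n))
  call system_clock(count_rate=clock_rate)

  do rep = 1, nrep
     call system_clock(t1)
     do i = 1, n
        a(i) = dble(i)
     end do
     total = sum(a)
     call system_clock(t2)

     elapsed(rep) = real(t2 - t1) / real(clock_rate)
     print *, 'pass', rep, ' time:', elapsed(rep), ' sum:', total
  end do

  ! Discard pass 1 and report the mean of the remaining passes
  print *, 'mean time, passes 2..nrep:', sum(elapsed(2:nrep)) / real(nrep - 1)

  deallocate(a)
end program timing_reps
```

Printing the sum keeps the compiler from optimizing the timed loop away. Whether the first pass is visibly slower depends on the system and the array size.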

Jim Dempsey

Results:

 Pass:            1
 time case 2:  0.8733000      energy:   47.8019349349047     
 time case 3:  0.2317000      energy:   47.8019349349047     
 time case 4:   1.808500      energy:   47.8019349349047     
 Pass:            2
 time case 2:  0.2335000      energy:   47.8019349349047     
 time case 3:  0.2316000      energy:   47.8019349349047     
 time case 4:   1.795100      energy:   47.8019349349047     
 Pass:            3
 time case 2:  0.2338000      energy:   47.8019349349047     
 time case 3:  0.2315000      energy:   47.8019349349047     
 time case 4:   1.793900      energy:   47.8019349349047     

See how case 2 on the first pass took about 0.64 seconds longer than on the later passes (0.8733 vs. 0.2335).

Jim Dempsey

Ok, so I believe I have made the changes you suggested. I am still not obtaining your times, unfortunately.

program MIC

  use ifport
  use global
  use omp_lib
  implicit none

  double precision :: energy
  double precision :: minx,miny,minz
  double precision :: dx,dy,dz
  double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
  double precision :: x1,y1,z1,x2,y2,z2
  integer :: i,j,k,l,np
  integer :: c1s,c1e,c2s,c2e
  integer :: T1,T2,T3,T4,clock_rate,clock_max
  integer :: count
  integer :: num,tid
  integer :: rep

  !dec$ define use_filter
  !dec$ if defined(use_filter)
  double precision :: filter
  !dec$ endif
  
  call seed(10)

  !--- in my real code, this variable is read in from a file
  !--- represents the total number of atoms in a simulation
  np = 60000
  allocate(x(np),y(np),z(np))

 
  call srand(10)
  !--- don't care about this. this is just initializing random positions
  do i = 1,np
     x(i) = 1000*rand(); y(i) = 1000*rand(); z(i) = 1000*rand()
     rSOA%x(i) = x(i) ; rSOA%y(i) = y(i) ; rSOA%z(i) = z(i)
  end do

  allocate(start(1000),end(1000),min_x(1000),min_y(1000),min_z(1000))
  !--- don't care about this either. just assigning pointers to rnew and part
  count = 0
  do i = 1,1000
     start(i) = count*60+1
     end(i)   = (count+1)*60

     min_x(i) = rand() ; min_y(i) = rand(); min_z(i) = rand()

     count = count + 1
  enddo
  
  
  !dec$ if defined(USeWhenUserDefinedTypesSupportedInOffload)
  !-----------------------------------------------------------!
  ! case one: structure of arrays                             !
  !-----------------------------------------------------------!
  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  
  !dir$ offload begin target(mic:0) in(start,end,rSOA,min_x,min_y,min_z)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(rSOA,end,start,min_x,min_y,min_z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = rSOA%x(k); y1 = rSOA%y(k); z1 = rSOA%z(k)
           
           do l = c2s,c2e
              x2 = rSOA%x(l); y2 = rSOA%y(l); z2 = rSOA%z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 1:',real(T2-T1)/real(clock_rate)
  print*,'energy case 1:',energy
  !dec$ endif
  
  !-----------------------------------------------------------!
  ! case two: arrays with data transfer and allocation        !
  !-----------------------------------------------------------!

  do rep = 1,3
     print*,'beginning rep:',rep
     
     energy = 0.0d0
     call system_clock(T1,clock_rate,clock_max)
     !dir$ offload_transfer target(mic:0),&
     !dir$& in(x: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(y: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(z: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(min_x: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(min_y: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(min_z: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(start: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(end: alloc_if(.true.) free_if(.false.))
     call system_clock(T2,clock_rate,clock_max)
     print*,'time for transfer to MIC case 2:',real(T2-T1)/real(clock_rate)
     
     
     !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z) inout(energy)
     call system_clock(T3,clock_rate,clock_max)
     !$omp parallel do schedule(dynamic) reduction(+:energy),&
     !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
     do i = 1,1000
        c1s = start(i); c1e = end(i)
        do j = i+1, 1000
           
           c2s = start(j); c2e = end(j)
           minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
           
           do k= c1s,c1e
              x1 = x(k); y1 = y(k); z1 = z(k)
              
              do l = c2s,c2e
                 x2 = x(l); y2 = y(l); z2 = z(l)
                 
                 dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
                 
                 dr2 = dx*dx + dy*dy + dz*dz
                 
                 !dec$ if defined(use_filter)
                 filter = 0.0d0
                 if(dr2.lt.2.0d0) filter = 1.0d0
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)*filter
                 
                 !dec$  else
                 if(dr2.lt.2.0d0)then
                    dr = sqrt(dr2)
                    dri = 1.0d0/dr
                    dr2i = 1.0d0*dri*dri
                    dr6i = dr2i*dr2i*dr2i
                    dr12i = dr6i*dr6i
                    
                    energy = energy + 4.0d0*(dr12i-dr6i)
                 endif
                 !dec$ endif
                 
              enddo
           enddo
        enddo
     enddo
     !$omp end parallel do
     call system_clock(T4,clock_rate,clock_max)
     print*,'timing for MIC computation case 2:',real(T4-T3)/real(clock_rate)
     !dir$ end offload
     call system_clock(T2,clock_rate,clock_max)
     
     print*,'total time case 2:',real(T2-T1)/real(clock_rate)
     print*,'energy case 2:',energy
     print*
     print*


  !-----------------------------------------------------------!
  ! case three: arrays with data transfer and no memory alloc !
  !-----------------------------------------------------------!

  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload_transfer target(mic:0),&
  !dir$& in(x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_x: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_y: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(min_z: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(start: alloc_if(.false.) free_if(.false.)),&
  !dir$& in(end: alloc_if(.false.) free_if(.false.))
  call system_clock(T2,clock_rate,clock_max)
  print*,'time for transfer to MIC case 3;',real(T2-T1)/real(clock_rate)
  
  !dir$ offload begin target(mic:0) nocopy(min_x,min_y,min_z,start,end,x,y,z) inout(energy)
  call system_clock(T3,clock_rate,clock_max)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              !dec$ if defined(use_filter)                                                                                                                                                                    
              filter = 0.0d0
              if(dr2.lt.2.0d0) filter =1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
              
              !dec$ else                                                                                                                                                                                      
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i =1.0d0*dri*dri
                 dr6i =dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy= energy + 4.0d0*(dr12i-dr6i)
              endif
              !dec$ endif   
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T4,clock_rate,clock_max)
  print*,'time for MIC computation case 3 :',real(T4-T3)/real(clock_rate)
  !dir$ end offload
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 3:',real(T2-T1)/real(clock_rate)
  print*,'energy case 3:',energy
  print*
  print*

  !-----------------------------------------------------------!
  ! case four: openmp                                         !
  !-----------------------------------------------------------!
  !$omp parallel private(tid)

  tid = omp_get_thread_num()

  if(tid.eq.0)then
     num = omp_get_num_threads()
     print*,'using this many threads',num
  endif

  !$omp end parallel

  energy = 0.0d0

  call system_clock(T1,clock_rate,clock_max)
 
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(min_x,min_y,min_z,start,end,x,y,z)
  do i = 1,1000
     c1s = start(i); c1e = end(i)
     do j = i+1, 1000
        
        c2s = start(j); c2e = end(j)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
           
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
              
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              !dec$ if defined(use_filter)                                                                                                                                                                    
              filter = 0.0d0
              if(dr2.lt.2.0d0) filter =1.0d0
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
              
              !dec$ else                                                                                                                                                                                      
              if(dr2.lt.2.0d0)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i =1.0d0*dri*dri
                 dr6i =dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy= energy + 4.0d0*(dr12i-dr6i)
              endif
              !dec$ endif   
           enddo
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T2,clock_rate,clock_max)

  print*,'time case 4 with filter:',real(T2-T1)/real(clock_rate)
  print*,'energy case 4:',energy
  print*
  print*

end do

 
  stop
end program MIC

 

The timing results are as follows:

 beginning rep:           1
 time for transfer to MIC case 2:  0.4809000    
 timing for MIC computation case 2:  0.9684000    
 total time case 2:   1.457200    
 energy case 2:   47.8019349349047     
 
 
 time for transfer to MIC case 3;  5.0000002E-04
 time for MIC computation case 3 :  0.7010000    
 time case 3:  0.7019000    
 energy case 3:   47.8019349349047     
 
 
 using this many threads          16
 time case 4 with filter:  0.7991000    
 energy case 4:   47.8019349349047     
 
 
 beginning rep:           2
 time for transfer to MIC case 2:  5.0000002E-04
 timing for MIC computation case 2:  0.7118000    
 total time case 2:  0.7126000    
 energy case 2:   47.8019349349047     
 
 
 time for transfer to MIC case 3;  3.9999999E-04
 time for MIC computation case 3 :  0.7028000    
 time case 3:  0.7036000    
 energy case 3:   47.8019349349047     
 
 
 using this many threads          16
 time case 4 with filter:  0.7991000    
 energy case 4:   47.8019349349047     
 
 
 beginning rep:           3
 time for transfer to MIC case 2:  3.9999999E-04
 timing for MIC computation case 2:  0.7092000    
 total time case 2:  0.7099000    
 energy case 2:   47.8019349349047     
 
 
 time for transfer to MIC case 3;  3.9999999E-04
 time for MIC computation case 3 :  0.6998000    
 time case 3:  0.7006000    
 energy case 3:   47.8019349349047     
 
 
 using this many threads          16
 time case 4 with filter:  0.7992000    
 energy case 4:   47.8019349349047    

 

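Well, you can see that the first pass of case 2 carried about 0.48 seconds of one-time overhead in the transfer step, which is gone in the later passes. After that, your transfer times are inconsequential with respect to the run times.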

I cannot explain why my MIC runs are so much faster than yours. Do you have exclusive use of the MIC?

Have you tried both compact and scatter on the MIC?

I am assuming you are using all 240 threads on the 60 cores of the 5110P?

If you are using 1 thread per core on the MIC, then I would expect the times you are observing.

Jim Dempsey

If the code below is the correct way to determine the number of threads the MIC is using, then it is using 240 threads.

!dir$ offload begin target(mic:0)
!$omp parallel private(tid)
tid = omp_get_thread_num()
if(tid.eq.0)then
    num = omp_get_num_threads()
    print*,'mic is using this many threads',num
endif
!$omp end parallel
!dir$ end offload

 

What is the temperature of the MIC? (run micsmc)

My idle temperature is 60C-70C

Also look at core usage.

Oh, in offload mode, you must not use all cores. Reserve one core for offload management.

Jim Dempsey

The default for offload mode is to make all threads active except for the core which is occupied by MPSS and offload data management.  That should be 236 threads on 60 cores.

... but you can expressly set the mode to use all cores. In that case, in offload mode, an application that partitions its work evenly 240 ways can lose roughly 240 times the time stolen by the offload data management thread(s), because every thread ends up waiting for it. Total latency would differ. In other words, the OpenMP barrier wait for all threads in the thread pool is extended to the completion time of the slowest thread(s). These would be the threads getting pre-empted by the offload data management thread(s).
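
The thread counts and the barrier argument above can be sketched with a little arithmetic. The core and thread counts (60 cores, 4 hardware threads per core) come from this thread; total_work and stolen are hypothetical numbers invented purely for illustration, and the model (region time equals the slowest thread's time) is a simplification, not a measurement.

```fortran
program offload_thread_budget
  implicit none

  ! From the thread: 60 cores with 4 hardware threads each
  integer, parameter :: ncores = 60, threads_per_core = 4

  ! Hypothetical values, for illustration only
  double precision, parameter :: total_work = 240.0d0  ! seconds of single-thread work
  double precision, parameter :: stolen = 0.05d0       ! seconds a pre-empted thread loses

  integer :: n_all, n_reserved
  double precision :: t_all, t_reserved

  n_all      = ncores * threads_per_core        ! 240: every core used
  n_reserved = (ncores - 1) * threads_per_core  ! 236: one core left for offload management

  ! A parallel region ends when its slowest thread reaches the barrier.
  ! With all cores in use, the pre-empted threads finish late by 'stolen',
  ! and every other thread waits for them.
  t_all      = total_work / dble(n_all) + stolen
  t_reserved = total_work / dble(n_reserved)

  print *, 'threads, all cores:       ', n_all
  print *, 'threads, one core reserved:', n_reserved
  print *, 'modelled time, all cores:       ', t_all
  print *, 'modelled time, one core reserved:', t_reserved
end program offload_thread_budget
```

With these made-up values the 236-thread configuration finishes sooner, but the outcome depends entirely on how large the stolen time is relative to the per-thread work.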

Jim Dempsey
