How to Achieve Peak Transfer Rate - Fortran

Compiler Methodology for Intel® MIC Architecture

Overview

This short example measures data transfer rates between the host and the Intel® Xeon Phi™ coprocessor. It shows how to allocate and free data aligned on 4KB boundaries, which is optimal for DMA transfers to the coprocessor. No measured data rates are reported here; the sample only demonstrates techniques for efficient data transfer.

Topics

For optimal DMA performance, data must be allocated with 4KB alignment. DMA is used for efficient data movement over PCIe to the Intel® Xeon Phi™ coprocessor.

Allocate the host buffer with allocate(), then create the matching buffer on the MIC side with an initial offload that keeps it alive (free_if(.FALSE.)). Use the “attributes align:4096” directive in the source code to get 4KB-aligned data. When the data transfer happens inside a loop, specify alloc_if(.FALSE.) and free_if(.FALSE.) so that the coprocessor buffer is reused instead of being reallocated on every iteration. A simple version of the code is given in the Source Code section. Try compiling it with different alignments (for example, 16 bytes and 4096 bytes) by modifying the ALIGNMENT parameter, and observe the performance difference for small data transfers.
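Before running the full benchmark, it can help to confirm that the alignment request actually took effect on the host buffer. The following is a minimal sketch, separate from the bwtest program, that prints how far an allocated buffer starts from the requested boundary. The module name align_util and the function boundary_offset are hypothetical names introduced for illustration. The sketch assumes the compiler honors the ATTRIBUTES ALIGN directive in the form used by the benchmark; compilers that do not recognize the directive treat it as a comment, in which case a non-zero offset is possible.

```fortran
module align_util
  use, intrinsic :: iso_c_binding, only: c_intptr_t
  implicit none
contains
  ! Distance in bytes from addr back to the previous multiple of boundary.
  ! A result of 0 means addr sits exactly on the boundary.
  pure function boundary_offset(addr, boundary) result(offset)
    integer(c_intptr_t), intent(in) :: addr
    integer, intent(in) :: boundary
    integer(c_intptr_t) :: offset
    offset = modulo(addr, int(boundary, c_intptr_t))
  end function boundary_offset
end module align_util

program check_alignment
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t, c_char
  use align_util
  implicit none

  integer, parameter :: ALIGNMENT = 4096   ! same value as in the benchmark
  character(kind=c_char), allocatable, target :: buf(:)
  !..assumption: directive form copied from the benchmark source
  !DIR$ ATTRIBUTES ALIGN:ALIGNMENT :: buf
  integer(c_intptr_t) :: addr

  allocate(buf(65536))

  ! Reinterpret the C address of the first element as an integer.
  addr = transfer(c_loc(buf(1)), addr)

  write(*,*) 'Offset from boundary (bytes): ', boundary_offset(addr, ALIGNMENT)

  deallocate(buf)
end program check_alignment
```

An offset of 0 indicates that the buffer starts on the requested boundary. Changing ALIGNMENT to 16 and rerunning gives a quick way to see what the allocator returns when page alignment is not requested.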

How to run the code:

-bash-4.1$ ifort bwtest.f90

-bash-4.1$ ./a.out

          <your results will be shown here>

-bash-4.1$

Source Code

program bwtest
  implicit none

  integer :: i, j
  integer(8) :: send, receive  !accumulated clock counts, same kind as start/end
  character, allocatable, dimension(:) :: buf
  integer(8) :: start, end
  integer(8) :: counts_per_second 
  integer, parameter :: NITERS=200  !number of iterations in benchmarking loop
  integer, parameter :: ALIGNMENT=4096
  integer, dimension(18) :: bufsizes = (/ &
    4096,  &
    8192, &
    16384, &
    32768, &
    65536, &
    131072, &
    262144, &
    524288, &
    1048576, &
    2097152, &
    4194304, &
    8388608, &
    16777216, &
    33554432, &
    67108864, &
    134217728, &
    268435456, &
    536870912 &
    /)

  write(*,*) 'Bandwidth test.'
  write(*,*) 'NITERS: ', NITERS
  write(*,*) 'Alignment = ', ALIGNMENT
  write(*,*) '          Size(B)      Send(usec) Receive(usec) Send(B/sec)   Receive(B/sec)'

  do i=1,size(bufsizes)
    !..alloc CPU buffer
    !..force the buffer on 4K boundary for best DMA xfer performance
    !DIR$ ATTRIBUTES ALIGN:alignment :: buf
    allocate(buf(bufsizes(i)))

    !alloc MIC buffer
    !DIR$ OFFLOAD begin target(mic:0) in(buf:length(bufsizes(i)) free_if(.FALSE.))
      !empty
    !DIR$ end OFFLOAD

    !The main benchmarking loop
    send = 0
    receive = 0

    do j=1,NITERS
      !send
      call system_clock(start,counts_per_second)

      !DIR$ OFFLOAD begin target(mic:0) in(buf:length(bufsizes(i)) alloc_if(.FALSE.) free_if(.FALSE.))
        !empty
      !DIR$ end OFFLOAD
      call system_clock(end,counts_per_second)
      send =  send + (end - start)

      !receive
      call system_clock(start,counts_per_second)

      !DIR$ OFFLOAD begin target(mic:0) out(buf:length(bufsizes(i)) alloc_if(.FALSE.) free_if(.FALSE.))
        !empty
      !DIR$ end OFFLOAD

      call system_clock(end,counts_per_second)
      receive = receive + (end - start)
    enddo

    send = send / NITERS
    receive = receive / NITERS
    write(*,*) 'Results: ', bufsizes(i), send, receive, (1e6*bufsizes(i))/send, (1e6*bufsizes(i))/receive

    !free MIC buffer
    !DIR$ OFFLOAD begin target(mic:0) out(buf:length(bufsizes(i)) alloc_if(.FALSE.))
      !empty
    !DIR$ end OFFLOAD

    !free CPU buffer
    deallocate(buf)
  end do

end program bwtest
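
The program above labels its timing columns as microseconds and multiplies by 1e6 to get bytes per second, which is only correct when system_clock reports 1,000,000 counts per second. The following is a minimal sketch of the same arithmetic that uses the reported count rate explicitly. The module name timing_util and the functions mean_usec and bytes_per_sec are hypothetical names introduced for illustration, and the tick values in the demo program are made-up inputs, not measurements.

```fortran
module timing_util
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)  ! 64-bit integer kind
  integer, parameter :: dp = kind(1.0d0)            ! double precision
contains
  ! Mean time per transfer in microseconds, given the total clock counts
  ! accumulated over niters transfers and the clock's count rate.
  pure function mean_usec(total_ticks, niters, counts_per_second) result(usec)
    integer(i8), intent(in) :: total_ticks, counts_per_second
    integer, intent(in) :: niters
    real(dp) :: usec
    usec = 1.0e6_dp * real(total_ticks, dp) / &
           (real(counts_per_second, dp) * real(niters, dp))
  end function mean_usec

  ! Transfer rate in bytes per second: total bytes moved / total seconds.
  ! Returns 0 when no ticks were recorded, to avoid dividing by zero.
  pure function bytes_per_sec(nbytes, total_ticks, niters, counts_per_second) &
      result(rate)
    integer, intent(in) :: nbytes, niters
    integer(i8), intent(in) :: total_ticks, counts_per_second
    real(dp) :: rate
    if (total_ticks <= 0) then
      rate = 0.0_dp
    else
      rate = real(nbytes, dp) * real(niters, dp) * &
             real(counts_per_second, dp) / real(total_ticks, dp)
    end if
  end function bytes_per_sec
end module timing_util

program timing_demo
  use timing_util
  implicit none
  ! Hypothetical inputs: 200 transfers of 1 MiB took 2,000,000 counts
  ! on a clock that ticks 1,000,000 times per second.
  integer(i8), parameter :: ticks = 2000000_i8
  integer(i8), parameter :: rate  = 1000000_i8

  write(*,*) 'Mean usec per transfer: ', mean_usec(ticks, 200, rate)
  write(*,*) 'Bytes per second:       ', bytes_per_sec(1048576, ticks, 200, rate)
end program timing_demo
```

With these illustrative inputs the arithmetic works out to 10000 microseconds per transfer and 104857600 bytes per second. In the benchmark, the same functions could be fed the accumulated send and receive counts along with the counts_per_second value returned by system_clock.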

Take Aways

This article shows how to align data buffers on 4KB boundaries, which are optimal for DMA transfers. It also provides code to measure transfer rates for various buffer sizes, which can help you determine the optimal buffer sizes for your data.

NEXT STEPS

It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

Back to Native and Offload Programming Models

For more complete information about compiler optimizations, see our Optimization Notice.