How to Achieve Peak Transfer Rate - Fortran

Compiler Methodology for Intel® MIC Architecture

 

Overview

This short example shows how to measure data transfer rates between the host and the coprocessor. It demonstrates how to allocate and free data aligned on 4 KB boundaries, which is optimal for DMA transfers to the Intel® Xeon Phi™ coprocessor. No measured data rates are reported here; the sample only illustrates techniques for efficient data transfer.

Topics

For optimal DMA performance, allocate data with 4 KB alignment. DMA is used to move data efficiently over PCIe to the Intel® Xeon Phi™ coprocessor.

Allocate the buffer on the host with allocate(), and use the “!DIR$ ATTRIBUTES ALIGN:4096” directive in the source code to get 4 KB aligned data. The first offload then allocates the matching buffer on the MIC side. For the transfers inside the benchmarking loop, specify alloc_if(.FALSE.) free_if(.FALSE.) so that the existing coprocessor buffer is reused instead of being allocated and freed on every transfer. A simple version of the code is given below. Try compiling it with different alignments (e.g., 16 bytes and 4096 bytes) by changing the ALIGNMENT parameter, and observe the performance difference for small transfers.
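
One way to confirm that a buffer really landed on the requested boundary is to print its address modulo the alignment. The following is a minimal sketch, not part of the original sample: the program name align_check and the 8192-byte buffer size are illustrative assumptions, and the ATTRIBUTES directive is only honored by the Intel compiler (other compilers treat it as a comment, so the offset they print may be nonzero).

```fortran
program align_check
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t, c_char
  implicit none

  integer, parameter :: ALIGNMENT = 4096
  character(kind=c_char), allocatable, target :: buf(:)
  ! Intel-specific directive; other compilers see an ordinary comment.
  !DIR$ ATTRIBUTES ALIGN: 4096 :: buf
  integer(c_intptr_t) :: addr

  ! Hypothetical buffer size, chosen only for illustration.
  allocate(buf(8192))

  ! Reinterpret the C address of the buffer as an integer.
  addr = transfer(c_loc(buf), addr)

  ! An offset of 0 means the buffer starts on an ALIGNMENT-byte boundary.
  write(*,*) 'Offset from boundary (bytes): ', mod(addr, int(ALIGNMENT, c_intptr_t))

  deallocate(buf)
end program align_check
```

Changing the value in the directive and in ALIGNMENT (for example to 16) and rerunning shows which alignment the allocation actually received.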

How to compile and run the code:

**************

-bash-4.1$ ifort bwtest.f90

-bash-4.1$ ./a.out

          <your results will be shown here>

-bash-4.1$

****************

-bash-4.1$ cat bwtest.f90

program bwtest
implicit none

integer :: i, j
integer(8) :: send, receive !accumulated clock counts, 64-bit to match start and end
character, allocatable, dimension(:) :: buf
integer(8) :: start, end
integer(8) :: counts_per_second
integer, parameter :: NITERS=200 !number of iterations in benchmarking loop
integer, parameter :: ALIGNMENT=4096
integer, dimension(18) :: bufsizes = (/ &
4096, &
8192, &
16384, &
32768, &
65536, &
131072, &
262144, &
524288, &
1048576, &
2097152, &
4194304, &
8388608, &
16777216, &
33554432, &
67108864, &
134217728, &
268435456, &
536870912 &
/)

write(*,*) 'Bandwidth test.'
write(*,*) 'NITERS: ', NITERS
write(*,*) 'Alignment = ', ALIGNMENT
write(*,*) ' Size(B) Send(usec) Receive(usec) Send(B/sec) Receive(B/sec)'

do i=1,size(bufsizes)
!..alloc CPU buffer
!..force the buffer on 4K boundary for best DMA xfer performance
!DIR$ ATTRIBUTES ALIGN:alignment :: buf
allocate(buf(bufsizes(i)))

!alloc MIC buffer
!DIR$ OFFLOAD begin target(mic:0) in(buf:length(bufsizes(i)) free_if(.FALSE.))
!empty
!DIR$ end OFFLOAD

!The main benchmarking loop
send = 0
receive = 0

do j=1,NITERS
!send
call system_clock(start,counts_per_second)

!DIR$ OFFLOAD begin target(mic:0) in(buf:length(bufsizes(i)) alloc_if(.FALSE.) free_if(.FALSE.))
!empty
!DIR$ end OFFLOAD
call system_clock(end,counts_per_second)
send = send + (end - start)

!receive
call system_clock(start,counts_per_second)

!DIR$ OFFLOAD begin target(mic:0) out(buf:length(bufsizes(i)) alloc_if(.FALSE.) free_if(.FALSE.))
!empty
!DIR$ end OFFLOAD

call system_clock(end,counts_per_second)
receive = receive + (end - start)
enddo

!average clock counts per iteration
!the usec and B/sec columns assume counts_per_second is 1000000
send = send / NITERS
receive = receive / NITERS
write(*,*) 'Results: ', bufsizes(i), send, receive, (1e6*bufsizes(i))/send, (1e6*bufsizes(i))/receive

!free MIC buffer
!DIR$ OFFLOAD begin target(mic:0) out(buf:length(bufsizes(i)) alloc_if(.FALSE.))
!empty
!DIR$ end OFFLOAD

!free CPU buffer
deallocate(buf)
end do

end program bwtest
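
The listing above labels its timings as microseconds, which holds only when system_clock ticks one million times per second. As a hedged sketch of a more portable conversion, the following program divides the elapsed counts by the reported clock rate. It is not part of the original sample: the name clock_units, the 1 MiB buffer, and the host-side array copy that stands in for the offload transfer are all illustrative assumptions.

```fortran
program clock_units
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none

  integer, parameter :: NITERS = 200
  integer, parameter :: NBYTES = 1048576   ! hypothetical 1 MiB test buffer
  character, allocatable :: src(:), dst(:)
  integer(int64) :: start, finish, counts_per_second
  real(real64) :: seconds_per_iter
  integer :: j

  allocate(src(NBYTES), dst(NBYTES))
  src = 'a'

  call system_clock(start, counts_per_second)
  do j = 1, NITERS
     ! Vary the source so the copy cannot be hoisted out of the loop.
     src(1) = achar(65 + mod(j, 26))
     dst = src   ! stand-in for the offload transfer being timed
  end do
  call system_clock(finish)

  ! Convert counts to seconds with the clock rate the runtime reports.
  seconds_per_iter = real(finish - start, real64) / real(counts_per_second, real64) / NITERS

  write(*,*) 'Clock rate (counts/sec): ', counts_per_second
  write(*,*) 'Last byte copied: ', dst(1)
  if (seconds_per_iter > 0.0_real64) then
     write(*,*) 'Avg time (usec): ', 1.0e6_real64 * seconds_per_iter
     write(*,*) 'Rate (B/sec):    ', real(NBYTES, real64) / seconds_per_iter
  else
     write(*,*) 'Interval too short for this clock; increase NBYTES or NITERS'
  end if

  deallocate(src, dst)
end program clock_units
```

Computing the average in floating point also avoids the truncation of integer division and guards against dividing by a zero interval for very small buffers.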

Take Aways

This article shows how to align data buffers on 4 KB boundaries, which are optimal for DMA transfers. It also provides code to measure transfer rates for a range of buffer sizes, which can help you determine the optimal buffer sizes for your data.

NEXT STEPS

It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get the best possible application performance.

Back to Native and Offload Programming Models
