This article focuses on aspects of porting Fortran codes to the Intel® Xeon Phi™ coprocessor. Most of the documentation for the coprocessor is C/C++ centric. Here the focus is on Fortran, which is still the dominant language for scientific programming and for which a large amount of legacy code exists.
Native versus offload programming models
The decision as to whether code should be run natively on the coprocessor or using the offload programming model is much the same for Fortran as it is for C/C++. Legacy Fortran code can generally be run natively on the coprocessor without modification. However, for the code to obtain peak performance, it must be both vectorized and multithreaded.
If the code has large segments of single-threaded code interspersed with multithreaded code, the best performance might be obtained by using the offload programming model. An exception is when the offload model would require large arrays to be transferred repeatedly between the host and coprocessor. There is no equivalent to the Intel® Cilk™ Plus offload for Fortran. In Intel Cilk Plus, only those data elements whose values have changed are transferred at the start and end of an offload section; in Fortran, all data listed on the offload directive is transferred at the start and end of an offload section unless the NOCOPY, IN or OUT keywords are used to restrict the transfer. Because of this, it is important to use the NOCOPY, IN and OUT options when possible rather than relying solely on the default behavior.
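As a minimal sketch of restricting transfers with IN and OUT, the program below copies one array to the coprocessor and copies a different array back. The program name, array names and sizes are illustrative assumptions, and the sketch has not been tuned or timed; without offload support enabled the directives are treated as comments and the loop simply runs on the host.

```fortran
program offload_in_out_sketch
  implicit none
  integer, parameter :: n = 1000   ! illustrative size
  real :: a(n), b(n)
  integer :: i

  a = 1.0

  ! a is only read in the offloaded loop, so it is listed as IN
  ! (copied to the coprocessor, not copied back).
  ! b is only written, so it is listed as OUT
  ! (not copied to the coprocessor, copied back at the end).
  !DIR$ OFFLOAD BEGIN TARGET(mic) IN(a) OUT(b)
  do i = 1, n
     b(i) = 2.0 * a(i)
  end do
  !DIR$ END OFFLOAD

  print *, 'b(1) =', b(1), ' b(n) =', b(n)
end program offload_in_out_sketch
```

Listing each array in only the direction it is needed avoids the default behavior of copying it both ways.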
Choices for multithreading include OpenMP*, Pthreads* and MPI*, all of which are also available in C/C++. OpenMP is the most popular threading model in legacy Fortran codes, followed by MPI then Pthreads.
Starting with version 14.0 of the Intel Fortran compiler, coarrays and the DO CONCURRENT construct are also available. Prior to 14.0, coarrays were available on the Intel Xeon processor, but not on the Intel Xeon Phi coprocessor. There are, however, two restrictions at this time: 1) the host operating system must be Linux and 2) only the distributed memory model is available.
A coarray program can be run natively on the coprocessor, run with one or more coarray images on the host and others distributed across one or more coprocessors, or run on the host with individual images offloading work to one or more coprocessors. The decision as to which model to use is similar to the decisions made when running an MPI code. Placing multiple coarray images on a single coprocessor increases memory requirements but cuts down on communication costs. As with MPI, OpenMP can be used within individual images of a coarray program to increase the number of threads running on the coprocessor in order to keep all of the cores busy.
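For readers who have not used coarrays, the following is a small sketch of a coarray program in standard Fortran 2008. It says nothing about image placement on the host or coprocessor, which is decided at launch time; the program and variable names are illustrative assumptions, and the compiler must be invoked with coarray support enabled.

```fortran
program coarray_sketch
  implicit none
  integer :: partial[*]   ! one copy of partial exists on every image
  integer :: total, img

  ! Each image stores its own image number in its local copy.
  partial = this_image()

  ! Wait until every image has written its value.
  sync all

  ! Image 1 reads the value held by each image and sums them.
  if (this_image() == 1) then
     total = 0
     do img = 1, num_images()
        total = total + partial[img]
     end do
     print *, 'images:', num_images(), ' total:', total
  end if
end program coarray_sketch
```

The remote reads `partial[img]` are where communication happens, which is why the number of images placed on each device affects communication cost.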
To mark a function or subroutine that is contained in a module for offload, add the line:
!DIR$ ATTRIBUTES OFFLOAD:mic :: MY_PROCEDURE_NAME
immediately before the FUNCTION or SUBROUTINE statement.
In the procedure where the module is used, it is not necessary to add another attributes directive. The directive within the module is visible to the procedure that uses the module. While it is not necessary to do so, it can be to the programmer's advantage to place related functions that will be offloaded into a single module. Similarly, offload attributes for variables within the module should be placed at the top of the data section of the module, and it is not necessary to repeat the offload attribute statements for those variables in the procedure where that module is used.
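The layout described above might look like the following sketch. The module name offload_kernels, the variable scale_factor and the subroutine scale_array are hypothetical names chosen for illustration, and the sketch has not been run on coprocessor hardware; without offload support the directives are ignored and the code runs on the host.

```fortran
module offload_kernels
  implicit none
  ! Offload attribute for a module variable, placed at the top of
  ! the module's data section.
  !DIR$ ATTRIBUTES OFFLOAD:mic :: scale_factor
  real :: scale_factor = 2.0
contains
  ! Offload attribute placed immediately before the SUBROUTINE statement.
  !DIR$ ATTRIBUTES OFFLOAD:mic :: scale_array
  subroutine scale_array(x, n)
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    x = scale_factor * x
  end subroutine scale_array
end module offload_kernels

program use_offload_module
  use offload_kernels   ! no additional ATTRIBUTES directive is added here
  implicit none
  real :: v(4)

  v = [1.0, 2.0, 3.0, 4.0]

  ! scale_factor is sent to the coprocessor; v is sent and returned.
  !DIR$ OFFLOAD BEGIN TARGET(mic) IN(scale_factor) INOUT(v)
  call scale_array(v, size(v))
  !DIR$ END OFFLOAD

  print *, v
end program use_offload_module
```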
It is also possible to mark a complete module as having the offload attribute. To do this, place the attribute declaration before the use statement in the procedure where the module is used.
When using common blocks in offloaded code, only those variables required by the offloaded sections must be marked with the offload attribute. If the common block is used in a number of offloaded procedures, the attributes directive along with the common block declaration can be placed in an include file used in each procedure. However, if different variables are used in each routine, individual attributes directives can be placed directly in each routine and tailored to the needs of that procedure. For a detailed example of using common blocks, see "leoF02_global_common.inc" and "leoF02_global.F90".
Common blocks on the coprocessor follow the Fortran standard. Therefore, even if not all variables in a common block are marked as having the offload attribute, space for the complete common block is allocated on the coprocessor. There is, however, no mapping from the variable address on the host to the variable address on the coprocessor for variables that do not have the offload attribute. If you try to use such a variable name on the coprocessor, the compiler will warn you that the variable is not marked for offload. Although the compiler will allow you to use an EQUIVALENCE statement to access that memory, this is a very dangerous thing to do. Values stored in this area will not be transferred back to the host at the end of the offload section. If you have been meaning to go through your code and remove the EQUIVALENCE statements, this would be a good time to do so.
In C/C++, pointers cannot be passed into and out of offload sections. Fortran pointers, however, are different. Because Fortran pointers are automatically dereferenced each time they are used, they can be specified in the data transfer section of an offload statement. The pointer will be dereferenced and conformable space will be allocated on the coprocessor.
But what happens if your pointer points to a structure or an element of a derived type that contains a pointer? The behavior in this case is changing with the 14.0 release of the compiler.
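For the simple case, a pointer to an array of intrinsic type, a sketch of naming a Fortran pointer in a data transfer clause is shown below. The names storage and p are illustrative assumptions, and the sketch deliberately avoids pointers nested inside derived types, where the behavior is version dependent as noted above. It has not been run on coprocessor hardware.

```fortran
program pointer_offload_sketch
  implicit none
  real, target  :: storage(100)
  real, pointer :: p(:)
  integer :: i

  storage = 1.0
  p => storage(1:50)   ! p refers to the first half of storage

  ! The pointer is named directly in the transfer clause; what is
  ! transferred is the data it points to, not an address.
  !DIR$ OFFLOAD BEGIN TARGET(mic) INOUT(p)
  do i = 1, size(p)
     p(i) = p(i) + 1.0
  end do
  !DIR$ END OFFLOAD

  ! Only the elements reachable through p were updated.
  print *, storage(1), storage(51)
end program pointer_offload_sketch
```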
Asynchronous data transfers
When doing an asynchronous data transfer, you will need to add SIGNAL and WAIT options to the offload directives where the data is referenced. The value of the variable you will wait on can be any unique integer value. In order to avoid needing to pass around the value for any specific tag, it is recommended that the value be set to the host's address for the first element of data that was passed. In Fortran the easiest way to do this is to use the LOC intrinsic. This is not part of the Fortran standard but is commonly available, including in the Intel® Fortran Compiler.
The following code does asynchronous data transfers using the host addresses of my_data and my_result as signal tags. It allocates space on the coprocessor for my_data and my_result and begins transferring the contents of my_data to the coprocessor. While that transfer is happening, you can continue to do some work on the host. When the code reaches the next offload directive, which waits on that tag, it checks that the initial transfer has completed. As soon as the transfer is complete, it offloads some work to the coprocessor. If you add a SIGNAL option to that offload directive, the host is free to do more work while the offload section runs and its output is transferred back from the coprocessor. When the host is done with its work, it waits until the output has been transferred back to the host and then frees the space on the coprocessor.
SUBROUTINE async_example(my_data, my_result, cnt)
INTEGER cnt
INTEGER my_data(cnt), my_result(cnt)
INTEGER(INT_PTR_KIND()) my_data_loc, my_result_loc
my_data_loc = LOC(my_data)
my_result_loc = LOC(my_result)
!DIR$ OFFLOAD_TRANSFER TARGET(mic:0) &
IN(my_data : LENGTH(cnt) ALLOC_IF(.true.) FREE_IF(.false.)) &
NOCOPY(my_result : LENGTH(cnt) ALLOC_IF(.true.) FREE_IF(.false.)) &
SIGNAL(my_data_loc)
! ...do some work on the host here...
!DIR$ OFFLOAD BEGIN TARGET(mic:0) WAIT(my_data_loc) SIGNAL(my_result_loc) &
NOCOPY(my_data : ALLOC_IF(.false.) FREE_IF(.false.)) &
OUT(my_result : ALLOC_IF(.false.) FREE_IF(.false.))
! ...do your offloaded work here...
!DIR$ END OFFLOAD
! ...do some work on the host here...
!DIR$ OFFLOAD_TRANSFER TARGET(mic:0) WAIT(my_result_loc) &
NOCOPY(my_data, my_result : ALLOC_IF(.false.) FREE_IF(.true.))
END SUBROUTINE async_example
Vectorizable Functions and Subroutines
The Intel Fortran Compiler provides a method for writing vectorizable functions and subroutines. Known as SIMD-enabled functions and subroutines, they are written using the vector attribute:
SUBROUTINE MY_SIMD_ROUTINE(dummy1, dummy2, etc)
!DIR$ ATTRIBUTES VECTOR :: MY_SIMD_ROUTINE
END SUBROUTINE MY_SIMD_ROUTINE
These types of procedures can be used on the Intel Xeon Phi coprocessor as well as the Intel Xeon processor.
For code which is run in offload mode, a vectorizable function or subroutine must have the offload attribute, the same as for any other procedure.
When using the vector attribute, it is not necessary to add the vector attribute directive to both the procedure definition and to the procedure call if both are in the same file. However, it is not an error to do so. When the procedure definition and procedure call exist in separate files, the vector attribute directive must be used in both locations. For the offload attribute, the offload attribute directive must always be specified both in the procedure definition and before the procedure call. Therefore the following is recommended when using the vector attribute:
1) Add the offload attribute to the attributes directive:
!DIR$ ATTRIBUTES VECTOR, OFFLOAD:mic :: MY_SIMD_ROUTINE
If desired, you can also add the architecture type to the vector attribute. If you do so, be sure to specify the architecture for both the processor and the coprocessor:
!DIR$ ATTRIBUTES VECTOR:(PROCESSOR(core_2nd_gen_avx),PROCESSOR(mic)), OFFLOAD:mic :: MY_SIMD_ROUTINE
2) Place the directive immediately after the SUBROUTINE or FUNCTION statement and inside each procedure that calls this SUBROUTINE or FUNCTION.
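A sketch of this placement is shown below, with the directive immediately after the FUNCTION statement and again inside the calling procedure, here within an interface block. The function scaled_sum and the array names are hypothetical. For brevity the sketch uses only the VECTOR attribute; for offloaded code the combined VECTOR, OFFLOAD:mic form from step 1 would be used in both places. Compilers that do not recognize the directive treat it as a comment and the code still runs correctly as scalar calls.

```fortran
function scaled_sum(a, b) result(c)
  ! Directive immediately after the FUNCTION statement.
  !DIR$ ATTRIBUTES VECTOR :: scaled_sum
  implicit none
  real, intent(in) :: a, b
  real :: c
  c = 2.0 * a + b
end function scaled_sum

program simd_call_sketch
  implicit none
  interface
     function scaled_sum(a, b) result(c)
       ! Same directive repeated in the calling procedure.
       !DIR$ ATTRIBUTES VECTOR :: scaled_sum
       real, intent(in) :: a, b
       real :: c
     end function scaled_sum
  end interface
  integer, parameter :: n = 16
  real :: x(n), y(n), z(n)
  integer :: i

  x = 1.0
  y = 3.0

  ! The scalar function is called once per element; with the vector
  ! attribute the compiler may call a vector version of it instead.
  do i = 1, n
     z(i) = scaled_sum(x(i), y(i))
  end do

  print *, z(1), z(n)
end program simd_call_sketch
```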
Much of the documentation for array notation on the coprocessor uses Intel Cilk Plus array notation. Fortran programmers should take heart: array notation in Fortran offload statements is the Fortran standard array notation, not the Intel Cilk Plus array notation. Therefore, in the offload statement
!dir$ offload begin in(foo(1:6:2))
the meaning of foo(1:6:2) is: from element 1 to element 6 with a stride of 2, as a Fortran programmer would expect.
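The section semantics can be checked on the host with a few lines of standard Fortran. In this sketch the contents of foo are an illustrative assumption; the point is only which elements the triplet selects. By contrast, Intel Cilk Plus array notation reads its three fields as start, length and stride.

```fortran
program array_section_sketch
  implicit none
  integer :: foo(6), i

  ! foo holds 10, 20, 30, 40, 50, 60
  foo = [(10 * i, i = 1, 6)]

  ! The triplet 1:6:2 means first element 1, last element 6, stride 2,
  ! so it selects foo(1), foo(3) and foo(5): the values 10, 30 and 50.
  print *, foo(1:6:2)
end program array_section_sketch
```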