Distributing Data among Processes
Cluster FFT functions store all input and output multi-dimensional arrays (matrices) in one-dimensional arrays (vectors). The arrays are stored in column-major order. For example, a two-dimensional matrix A of size (m,n) is stored in a vector B of size m*n so that B((j-1)*m+i)=A(i,j) (i=1, ..., m, j=1, ..., n).
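The column-major storage rule can be illustrated with a small index helper. This is only a sketch: the function name `col_major_offset` is hypothetical and is not part of the cluster FFT interface. It takes the one-based indices used in the text and returns a zero-based offset suitable for indexing a C array.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based offset into the storage vector for the one-based matrix
 * element A(i,j) of an m-by-n matrix kept in column-major order.
 * The one-based position (j-1)*m + i becomes (j-1)*m + (i-1) here. */
size_t col_major_offset(size_t i, size_t j, size_t m, size_t n)
{
    assert(i >= 1 && i <= m);
    assert(j >= 1 && j <= n);
    return (j - 1) * m + (i - 1);
}
```

For a 3-by-2 matrix, elements of the first column occupy offsets 0 to 2 and elements of the second column occupy offsets 3 to 5.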
The order of FFT dimensions is the same as the order of array dimensions in the programming language. For example, a 3-dimensional FFT with Lengths=(m,n,l) can be computed over an array with dimensions (m,n,l).
All MPI processes involved in a cluster FFT computation operate on their own portions of data. These local arrays make up the virtual global array that the fast Fourier transform is applied to. It is your responsibility to properly allocate local arrays (if needed), fill them with initial data, and either gather the resulting data into an actual global array or process the resulting data differently. To be able to do this, see the sections below on how the virtual global array is composed of the local ones.
If the dimension of the transform is greater than one, the cluster FFT function library splits data in the dimension whose index changes most slowly, so that each part contains all elements with several consecutive values of this index. It is the first dimension in C and the last dimension in Fortran. If the global array is two-dimensional, in C, the library gives each process several consecutive rows.
The term "rows" will be used regardless of the array dimension and programming language. Local arrays are placed in memory allocated for the virtual global array consecutively, in the order determined by process ranks. For example, in the case of two processes, during the computation of a three-dimensional transform whose matrix has size (11,15,12), the processes may store local arrays of sizes (6,15,12) and (5,15,12), respectively.
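Because local arrays follow one another in rank order, the position of each local block inside the virtual global array is a running sum of the preceding block sizes. The sketch below shows that arithmetic only. The helper name `block_offsets` is hypothetical, and the row counts are inputs here: in a real computation the library decides how many rows each process receives.

```c
#include <assert.h>
#include <stddef.h>

/* For p local arrays placed one after another in rank order, compute the
 * element offset at which each rank's block starts in the virtual global
 * array. rows[q] is the number of rows held by rank q, and row_elems is
 * the number of elements in one row. Returns the total element count. */
size_t block_offsets(const size_t *rows, size_t p, size_t row_elems,
                     size_t *offsets)
{
    size_t next = 0;
    assert(rows != NULL && offsets != NULL);
    for (size_t q = 0; q < p; ++q) {
        offsets[q] = next;
        next += rows[q] * row_elems;
    }
    return next;
}
```

With the sizes from the example, a row holds 15*12 = 180 elements, so the block of 6 rows starts at offset 0, the block of 5 rows starts at offset 1080, and the total is 1980 elements, which equals 11*15*12.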
If p is the number of MPI processes and the matrix of a transform to be computed has size (m,n,l), in C, each MPI process works with a local data array of size (m_q, n, l), where Σ m_q = m, q=0, ..., p-1. Local input arrays must contain appropriate parts of the actual global input array, and then local output arrays will contain appropriate parts of the actual global output array. You can figure out which particular rows of the global array the local array must contain from the following configuration parameters of the cluster FFT interface: CDFT_LOCAL_NX, CDFT_LOCAL_START_X, and CDFT_LOCAL_SIZE. To retrieve values of the parameters, use the DftiGetValueDM function:
- CDFT_LOCAL_NX specifies how many rows of the global array the current process receives.
- CDFT_LOCAL_START_X specifies which row of the global input or output array corresponds to the first row of the local input or output array. If A is a global array and L is the appropriate local array, then L(i,j,k)=A(i,j,k+cdft_local_start_x-1), where i=1, ..., m, j=1, ..., n, k=1, ..., l_q (Fortran notation, with the last dimension split among processes).
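The relation L(i,j,k)=A(i,j,k+cdft_local_start_x-1) means that, with column-major storage, the local array is one contiguous slab of the global array. The following sketch copies that slab out of a global array held in memory. It makes several assumptions: the function name `copy_local_part` is hypothetical, `start_x` is one-based as in the formula, and `double` elements are used purely for illustration (actual transforms may use complex data).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the local part L out of a global m-by-n-by-l array A, following
 * L(i,j,k) = A(i,j,k + start_x - 1) with column-major storage.
 * start_x is one-based, as in the formula; l_q is the number of
 * last-dimension slices that the process holds. */
void copy_local_part(const double *global, double *local,
                     size_t m, size_t n, size_t l,
                     size_t l_q, size_t start_x)
{
    size_t slice = m * n;   /* elements per value of the last index */
    assert(start_x >= 1 && start_x + l_q - 1 <= l);
    memcpy(local, global + (start_x - 1) * slice,
           l_q * slice * sizeof *local);
}
```

For a 2-by-2-by-3 global array, a process with `l_q = 2` and `start_x = 2` receives the last eight of the twelve elements.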
"2D Out-of-place Cluster FFT Computation"shows how the data is distributed among processes for a two-dimensional cluster FFT computation.
In the case of a one-dimensional transform, input and output data are distributed among processes differently, and even the number of elements stored in a particular process before and after the transform may differ. Each local array stores a segment of consecutive elements of the appropriate global array. Such a segment is determined by the number of elements and a shift with respect to the first array element. So, to specify the segments of the global input and output arrays that a particular process receives, four configuration parameters are needed: CDFT_LOCAL_NX, CDFT_LOCAL_START_X, CDFT_LOCAL_OUT_NX, and CDFT_LOCAL_OUT_START_X. Use the DftiGetValueDM function to retrieve their values. The meaning of the four configuration parameters depends upon the type of the transform, as shown in Table "Data Distribution Configuration Parameters for 1D Transforms":
Meaning of the Parameter
Number of elements in input array
Elements shift in input array
Number of elements in output array
Elements shift in output array
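A segment described by a number of elements and a shift can be extracted from a global one-dimensional array as sketched below. The function name `copy_segment` is hypothetical, `double` elements are used only for illustration, and the shift is treated as zero-based (a shift of 0 selects the first element). Whether a value obtained from the library needs adjusting before use this way is an assumption to check against the reference for your language.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a segment of `count` consecutive elements that begins `shift`
 * elements after the first element of a global array of length `len`.
 * The shift is treated as zero-based here (an assumption). */
void copy_segment(const double *global, size_t len,
                  double *local, size_t count, size_t shift)
{
    assert(shift + count <= len);
    memcpy(local, global + shift, count * sizeof *local);
}
```

Since the input and output segments of a process may differ in both length and shift, such a helper would be called with one pair of values for the input array and another pair for the output array.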
Memory size for local data
The memory size needed for local arrays cannot be calculated from CDFT_LOCAL_NX (CDFT_LOCAL_OUT_NX) alone, because the cluster FFT functions sometimes require slightly more memory for local data than the size of the appropriate sub-array. The configuration parameter CDFT_LOCAL_SIZE specifies the size of the local input and output arrays in data elements. Each local input and output array must have a size of at least CDFT_LOCAL_SIZE*size_of_element. Note that in the current implementation of the cluster FFT interface, data elements can be real or complex values, each complex value consisting of the real and imaginary parts. If you employ a user-defined workspace for in-place transforms (for more information, refer to Table "Settable configuration Parameters"), it must have the same size as the local arrays. Example "1D In-place Cluster FFT Computations" illustrates how the cluster FFT functions distribute data among processes in the case of a one-dimensional FFT computation performed with a user-defined workspace.
Available Auxiliary Functions
If a global input array is located on one MPI process and you want to obtain its local parts, or you want to gather the global output array on one MPI process, you can use the functions MKL_CDFT_ScatterData and MKL_CDFT_GatherData to distribute or gather data among processes,