Developer Reference

  • 0.9
  • 09/09/2020
  • Public Content

Distributing Data among Processes

The Intel® oneAPI Math Kernel Library cluster FFT functions store all input and output multi-dimensional arrays (matrices) in one-dimensional arrays (vectors). The arrays are stored in row-major order. For example, a two-dimensional matrix A of size (m,n) is stored in a vector B of size m*n so that B[i*n+j]=A[i][j] (i=0, ..., m-1, j=0, ..., n-1).
The order of FFT dimensions is the same as the order of array dimensions in the programming language. For example, a 3-dimensional FFT with Lengths=(m,n,l) can be computed over an array Ar[m][n][l].
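The row-major mapping described above can be sketched in C as follows. The helper names flat_index_2d and flat_index_3d are hypothetical and introduced only for illustration; they are not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of A[i][j] in the vector B for an (m,n)
   matrix stored in row-major order, that is, B[i*n+j] = A[i][j]. */
size_t flat_index_2d(size_t i, size_t j, size_t n)
{
    assert(j < n);
    return i * n + j;
}

/* Hypothetical helper: offset of Ar[i][j][k] for an array Ar[m][n][l].
   The last index changes fastest, the first index changes slowest. */
size_t flat_index_3d(size_t i, size_t j, size_t k, size_t n, size_t l)
{
    assert(j < n && k < l);
    return (i * n + j) * l + k;
}
```

For instance, with Lengths=(11,15,12), flat_index_3d(1, 2, 3, 15, 12) evaluates to (1*15+2)*12+3 = 207.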
All MPI processes involved in a cluster FFT computation operate on their own portions of data. These local arrays make up the virtual global array that the fast Fourier transform is applied to. It is your responsibility to properly allocate local arrays (if needed), fill them with initial data, and gather the resulting data into an actual global array or process the resulting data differently. To do this, see the sections below on how the virtual global array is composed of the local ones.

Multi-dimensional transforms

If the dimension of the transform is greater than one, the cluster FFT function library splits data along the dimension whose index changes most slowly, so that each part contains all elements with several consecutive values of this index. In C, this is the first dimension. If the global array is two-dimensional, each process receives several consecutive rows. Local arrays are placed consecutively in the memory allocated for the virtual global array, in the order determined by process ranks. For example, with two processes computing a three-dimensional transform whose matrix has size (11,15,12), the processes may store local arrays of sizes (6,15,12) and (5,15,12), respectively.
If p is the number of MPI processes and the matrix of a transform to be computed has size (m,n,l), each MPI process works with a local data array of size (m_q, n, l), where Σ m_q = m, q=0, ..., p-1. Local input arrays must contain appropriate parts of the actual global input array, and then local output arrays will contain appropriate parts of the actual global output array. You can figure out which particular rows of the global array the local array must contain from the following configuration parameters of the cluster FFT interface: CDFT_LOCAL_NX, CDFT_LOCAL_START_X, and CDFT_LOCAL_SIZE. To retrieve values of the parameters, use the DftiGetValueDM function:
  • CDFT_LOCAL_NX specifies how many rows of the global array the current process receives.
  • CDFT_LOCAL_START_X specifies which row of the global input or output array corresponds to the first row of the local input or output array. If A is a global array and L is the appropriate local array, then L[i][j][k]=A[i+cdft_local_start_x][j][k], where i=0, ..., m_q-1, j=0, ..., n-1, k=0, ..., l-1.
Example "2D Out-of-place Cluster FFT Computation" shows how the data is distributed among processes for a two-dimensional cluster FFT computation.
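The relation L[i][j][k]=A[i+cdft_local_start_x][j][k] can be illustrated with a minimal C sketch that copies one process's rows out of a global array. The function name copy_local_slab is hypothetical, the element type double is chosen only for brevity, and local_nx and local_start_x are passed as plain arguments here; in a real program they would be the values of CDFT_LOCAL_NX and CDFT_LOCAL_START_X retrieved with DftiGetValueDM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the rows of the global (m,n,l) array that belong to one process
   into its local array, following L[i][j][k] = A[i+local_start_x][j][k].
   local_nx and local_start_x stand for the values of CDFT_LOCAL_NX and
   CDFT_LOCAL_START_X; this sketch takes them as arguments (an assumption)
   instead of querying the library. */
void copy_local_slab(const double *global, double *local,
                     size_t m, size_t n, size_t l,
                     size_t local_nx, size_t local_start_x)
{
    assert(global != NULL && local != NULL);
    assert(local_start_x + local_nx <= m);

    /* One "row" of the first dimension holds n*l contiguous elements. */
    size_t slab = n * l;
    memcpy(local, global + local_start_x * slab,
           local_nx * slab * sizeof(double));
}
```

With the sizes from the example above, a process holding the (5,15,12) part would call the function with local_nx=5 and local_start_x=6, assuming the first process holds rows 0 to 5. The destination must hold at least local_nx*n*l elements; the library may require a larger buffer, as described in the section on memory size for local data.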

One-dimensional transforms

In this case, input and output data are distributed among processes differently, and even the number of elements stored in a particular process may differ before and after the transform. Each local array stores a segment of consecutive elements of the appropriate global array. Such a segment is determined by its number of elements and a shift with respect to the first array element. So, to specify the segments of the global input and output arrays that a particular process receives, four configuration parameters are needed: CDFT_LOCAL_NX, CDFT_LOCAL_START_X, CDFT_LOCAL_OUT_NX, and CDFT_LOCAL_OUT_START_X. Use the DftiGetValueDM function to retrieve their values. The meaning of the four configuration parameters depends upon the type of the transform, as shown in Table "Data Distribution Configuration Parameters for 1D Transforms":
Data Distribution Configuration Parameters for 1D Transforms

Meaning of the Parameter             | Forward Transform       | Backward Transform
-------------------------------------|-------------------------|------------------------
Number of elements in input array    | CDFT_LOCAL_NX           | CDFT_LOCAL_OUT_NX
Elements shift in input array        | CDFT_LOCAL_START_X      | CDFT_LOCAL_OUT_START_X
Number of elements in output array   | CDFT_LOCAL_OUT_NX       | CDFT_LOCAL_NX
Elements shift in output array       | CDFT_LOCAL_OUT_START_X  | CDFT_LOCAL_START_X
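
The table can be read as a simple selection rule, sketched below in C. The types dist_params_1d and segment_1d and the function select_segments_1d are hypothetical names used only for this illustration; the four values are assumed to have been retrieved beforehand with DftiGetValueDM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the four values that a program would
   retrieve with DftiGetValueDM for a one-dimensional transform. */
typedef struct {
    long local_nx;          /* CDFT_LOCAL_NX */
    long local_start_x;     /* CDFT_LOCAL_START_X */
    long local_out_nx;      /* CDFT_LOCAL_OUT_NX */
    long local_out_start_x; /* CDFT_LOCAL_OUT_START_X */
} dist_params_1d;

/* A segment of a global array: element count and shift from element 0. */
typedef struct {
    long count;
    long shift;
} segment_1d;

/* Select the input and output segments according to the table:
   a forward transform reads the NX/START_X segment and writes the
   OUT_NX/OUT_START_X segment; a backward transform does the opposite. */
void select_segments_1d(const dist_params_1d *p, int is_forward,
                        segment_1d *in, segment_1d *out)
{
    assert(p != NULL && in != NULL && out != NULL);

    segment_1d x     = { p->local_nx,     p->local_start_x };
    segment_1d x_out = { p->local_out_nx, p->local_out_start_x };

    *in  = is_forward ? x : x_out;
    *out = is_forward ? x_out : x;
}
```

Keeping the direction in one place like this avoids mixing up the parameter pairs when the same descriptor is used for both forward and backward transforms.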

Memory size for local data

The memory size needed for local arrays cannot be calculated from CDFT_LOCAL_NX (CDFT_LOCAL_OUT_NX) alone, because the cluster FFT functions sometimes require slightly more memory for local data than the size of the corresponding sub-array. The configuration parameter CDFT_LOCAL_SIZE specifies the size of the local input and output array in data elements. Each local input and output array must have a size of at least CDFT_LOCAL_SIZE*size_of_element. Note that in the current implementation of the cluster FFT interface, data elements can be real or complex values, each complex value consisting of the real and imaginary parts. If you employ a user-defined workspace for in-place transforms (for more information, refer to Table "Settable configuration Parameters"), it must have the same size as the local arrays. Example "1D In-place Cluster FFT Computations" illustrates how the cluster FFT functions distribute data among processes in case of a one-dimensional FFT computation performed with a user-defined workspace.
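
The sizing rule CDFT_LOCAL_SIZE*size_of_element can be sketched in C as follows. The function names local_buffer_bytes and alloc_local_complex are hypothetical, local_size is assumed to hold the value of CDFT_LOCAL_SIZE retrieved with DftiGetValueDM, and the C99 type double complex stands in for a double-precision complex element (an assumption of this sketch, not the library's own type).

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed for one local array: CDFT_LOCAL_SIZE * size_of_element.
   local_size stands for the value of CDFT_LOCAL_SIZE (an argument here,
   not queried from the library). Returns 0 if the product would overflow. */
size_t local_buffer_bytes(size_t local_size, size_t elem_size)
{
    assert(elem_size > 0);
    if (local_size > SIZE_MAX / elem_size)
        return 0;
    return local_size * elem_size;
}

/* Allocate a local array of double-precision complex elements.
   The caller releases it with free(). Returns NULL on failure. */
double complex *alloc_local_complex(size_t local_size)
{
    size_t bytes = local_buffer_bytes(local_size, sizeof(double complex));
    return bytes ? malloc(bytes) : NULL;
}
```

A user-defined workspace for an in-place transform would be allocated with the same size, since it must match the local arrays.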

Available Auxiliary Functions

If a global input array is located on one MPI process and you want to obtain its local parts or you want to gather the global output array on one MPI process, you can use functions MKL_CDFT_ScatterData and MKL_CDFT_GatherData to distribute or gather data among processes, respectively. These functions are defined in a file that is delivered with Intel® oneAPI Math Kernel Library and located in the following subdirectory of the Intel® oneAPI Math Kernel Library installation directory: examples/cdftc/source/cdft_example_support.c.

Restriction on Lengths of Transforms

The algorithm that the Intel® oneAPI Math Kernel Library cluster FFT functions use to distribute data among processes imposes a restriction on lengths of transforms with respect to the number of MPI processes used for the FFT computation:
  • For a multi-dimensional transform, lengths of the first two dimensions must be not less than the number of MPI processes.
  • Length of a one-dimensional transform must be the product of two integers each of which is not less than the number of MPI processes.
Non-compliance with the restriction causes the CDFT_SPREAD_ERROR error (refer to Error Codes for details). To comply, you can change the transform lengths and/or the number of MPI processes, which is specified at the start of an MPI program. MPI-2 enables changing the number of processes during execution of an MPI program.
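
The two restrictions above can be checked before a descriptor is created. The following C sketch only mirrors the rules as stated in this section; the function names multi_dim_lengths_ok and one_dim_length_ok are hypothetical and are not part of the library, which reports CDFT_SPREAD_ERROR on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Multi-dimensional case: the lengths of the first two dimensions must
   each be not less than the number of MPI processes. */
int multi_dim_lengths_ok(const long *lengths, size_t ndims, long nproc)
{
    assert(lengths != NULL && ndims >= 2 && nproc >= 1);
    return lengths[0] >= nproc && lengths[1] >= nproc;
}

/* One-dimensional case: the length must be a product a*b of two integers
   with a >= nproc and b >= nproc. Searching a from nproc up to
   sqrt(length) is enough, because the smaller factor of any such pair
   lies in that range. */
int one_dim_length_ok(long length, long nproc)
{
    assert(length >= 1 && nproc >= 1);
    for (long a = nproc; a <= length / a; ++a) {
        if (length % a == 0)
            return 1; /* b = length / a >= a >= nproc */
    }
    return 0;
}
```

For example, with 4 processes a one-dimensional length of 20 passes the check (20 = 4*5), while a length of 15 does not, because its only factorizations are 1*15 and 3*5.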
