Managing Performance of the Cluster Fourier Transform Functions
Performance of Cluster FFT (CFFT) in an application depends mainly on the cluster configuration, the performance of message-passing interface (MPI) communications, and the configuration of the run. Note that MPI communications usually take approximately 70% of the overall CFFT compute time.
For more flexible control over the time-consuming aspects of CFFT algorithms, Intel® oneAPI Math Kernel Library provides the MKL_CDFT environment variable, which sets special values that affect CFFT performance. To improve the performance of an application that calls CFFT intensively, use this environment variable to set optimal values for your cluster, application, MPI, and so on.
The MKL_CDFT environment variable has the following syntax, explained in the table below:
MKL_CDFT=option1[=value1],option2[=value2],…,optionN[=valueN]
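
As an illustration of this syntax, the following minimal sketch assembles an MKL_CDFT value from separate option[=value] entries in a POSIX-compatible shell. The helper name join_opts is hypothetical (it is not part of oneMKL), and the option values are arbitrary picks from the table below, not recommendations.

```shell
# join_opts (hypothetical helper): join all arguments with commas.
# Inside the subshell, IFS="," makes "$*" expand as a comma-separated list,
# and the IFS change does not leak into the calling shell.
join_opts() {
  ( IFS=,; printf '%s' "$*" )
}

# Options with values use option=value; a flag such as enable_soi has no value.
MKL_CDFT=$(join_opts "alltoallv=1" "wo_omatcopy=-1" "enable_soi")
export MKL_CDFT

printf 'MKL_CDFT=%s\n' "$MKL_CDFT"
```

Keeping the options as separate arguments makes it easier to add or drop a single setting without retyping the whole comma-separated string.
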
While this table explains the settings that usually improve performance under certain conditions, the actual performance depends heavily on the configuration of your cluster. Therefore, experiment with the listed values to speed up your computations.
Option | Possible Values | Description |
---|---|---|
alltoallv | 0 (default) | Configures CFFT to use the standard MPI_Alltoallv function to perform global transpositions. |
alltoallv | 1 | Configures CFFT to use a series of calls to MPI_Isend and MPI_Irecv instead of the MPI_Alltoallv function. |
alltoallv | 4 | Configures CFFT to merge global transposition with data movements in the local memory. CFFT performs global transpositions by calling MPI_Isend and MPI_Irecv in this case. Use this value in a hybrid case (MPI + OpenMP), especially when the number of processes per node equals one. |
wo_omatcopy | 0 | Configures CFFT to perform local FFT and local transpositions separately. CFFT usually performs faster with this value than with wo_omatcopy = 1 if the configuration parameter DFTI_TRANSPOSE has the value of DFTI_ALLOW. See the Intel® oneAPI Math Kernel Library Developer Reference for details. |
wo_omatcopy | 1 | Configures CFFT to merge local FFT calls with local transpositions. CFFT usually performs faster with this value than with wo_omatcopy = 0 if DFTI_TRANSPOSE has the value of DFTI_NONE. |
wo_omatcopy | -1 (default) | Enables CFFT to decide which of the two above values to use depending on the value of DFTI_TRANSPOSE. |
enable_soi | Not applicable | A flag that enables the low-communication Segment Of Interest FFT (SOI FFT) algorithm for one-dimensional complex-to-complex CFFT, which requires fewer MPI communications than the standard nine-step (or six-step) algorithm. While using fewer MPI communications, the SOI FFT algorithm incurs a minor loss of precision (about one decimal digit). |
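
The guidance for alltoallv=4 in the table can be turned into a starting guess. The sketch below is only an assumption-laden heuristic: the function name pick_alltoallv and the PPN variable are hypothetical, "hybrid" is approximated as more than one OpenMP thread per process, and the table's "especially" wording means that hybrid runs with more than one process per node may also benefit from the value 4.

```shell
# pick_alltoallv PPN THREADS (hypothetical helper)
# Prints 4 for a hybrid run (more than one OpenMP thread per process) with one
# MPI process per node, and the default 0 otherwise. This is a first guess to
# measure against the other values, not a rule.
pick_alltoallv() {
  ppn=$1
  threads=$2
  if [ "$threads" -gt 1 ] && [ "$ppn" -eq 1 ]; then
    echo 4
  else
    echo 0
  fi
}

# PPN is an assumed variable holding the processes-per-node count of the run.
# wo_omatcopy=-1 leaves the choice to CFFT, based on DFTI_TRANSPOSE.
MKL_CDFT="alltoallv=$(pick_alltoallv "${PPN:-1}" "${OMP_NUM_THREADS:-1}"),wo_omatcopy=-1"
export MKL_CDFT

printf 'MKL_CDFT=%s\n' "$MKL_CDFT"
```
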
The following example illustrates usage of the environment variable, assuming the bash shell:
export MKL_CDFT=wo_omatcopy=1,alltoallv=4,enable_soi
mpirun -ppn 2 -n 16 ./mkl_cdft_app
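
Because the best settings depend on the cluster, one way to experiment is to loop over the alltoallv and wo_omatcopy values from the table and compare the timings that the application reports. The following is a minimal sketch, not a tuned benchmark harness: the function sweep_mkl_cdft and the RUN variable are hypothetical, and the launch line simply reuses the example command above. By default it performs a dry run that only prints each command.

```shell
# RUN (hypothetical knob): defaults to "echo", so commands are printed, not run.
# Set RUN to an empty string in the environment to launch for real.
RUN=${RUN-echo}

# sweep_mkl_cdft CMD [ARGS...]: run CMD once per alltoallv/wo_omatcopy pair.
# enable_soi is left out on purpose because it trades precision for speed.
sweep_mkl_cdft() {
  for a in 0 1 4; do
    for w in -1 0 1; do
      MKL_CDFT="alltoallv=$a,wo_omatcopy=$w"
      export MKL_CDFT
      printf '== MKL_CDFT=%s\n' "$MKL_CDFT"
      # With RUN empty, "$@" itself becomes the command that is executed.
      $RUN "$@"
    done
  done
}

sweep_mkl_cdft mpirun -ppn 2 -n 16 ./mkl_cdft_app
```

Each of the nine combinations is announced on a line starting with "==", so timing output from the application can be matched to the setting that produced it.
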
Optimization Notice |
---|
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 |
This notice covers optimizations that include the SSE2, SSE3, and SSSE3 instruction sets.