Developer Guide

Managing Performance of the Cluster Fourier Transform Functions

Performance of Intel® MKL Cluster FFT (CFFT) in different applications mainly depends on the cluster configuration, the performance of message-passing interface (MPI) communications, and the configuration of the run. Note that MPI communications usually take approximately 70% of the overall CFFT compute time.
For more flexible control over the time-consuming aspects of CFFT algorithms, Intel® MKL provides the MKL_CDFT environment variable, which sets special values that affect CFFT performance. To improve the performance of an application that calls CFFT intensively, use this environment variable to set values that are optimal for your cluster, application, MPI implementation, and so on.
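
To get a feel for what the roughly 70% communication share implies, the following back-of-the-envelope sketch applies Amdahl's law. The function name `estimated_cfft_speedup` and the default `comm_fraction` of 0.7 are illustrative assumptions based on the figure quoted above, not measurements of any particular cluster.

```python
def estimated_cfft_speedup(comm_speedup, comm_fraction=0.7):
    """Estimate overall CFFT speedup when only MPI communication gets faster.

    comm_speedup  -- factor by which communication time shrinks (2.0 = twice as fast)
    comm_fraction -- share of total CFFT time spent in MPI (assumed 0.7 here)
    """
    if comm_speedup <= 0:
        raise ValueError("comm_speedup must be positive")
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    compute_fraction = 1.0 - comm_fraction
    # Amdahl's law: the non-communication part of the run is unchanged.
    return 1.0 / (compute_fraction + comm_fraction / comm_speedup)


if __name__ == "__main__":
    for factor in (1.25, 2.0, 4.0):
        print(f"communication {factor}x faster -> overall "
              f"{estimated_cfft_speedup(factor):.2f}x")
```

Under the 70% assumption, the overall gain can never exceed 1 / 0.3, or roughly 3.3x, however fast communication becomes. Treat the result as a rough bound rather than a prediction.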
The MKL_CDFT environment variable has the following syntax, explained in the table below:
MKL_CDFT=option1[=value1],option2[=value2],…,optionN[=valueN]
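
The syntax is a comma-separated list in which each item is either a bare option name or a `name=value` pair. As a minimal sketch, the following Python helpers assemble and split such a string. The helper names `format_mkl_cdft` and `parse_mkl_cdft` are hypothetical and not part of Intel® MKL, and the helpers do not check whether an option name is valid.

```python
def format_mkl_cdft(options):
    """Build an MKL_CDFT value from a mapping of option name -> value.

    A value of None marks a flag that takes no value (written as a bare name).
    """
    parts = []
    for name, value in options.items():
        if value is None:
            parts.append(name)
        else:
            parts.append(f"{name}={value}")
    return ",".join(parts)


def parse_mkl_cdft(text):
    """Split an MKL_CDFT value back into a dict; flags map to None.

    Values are returned as strings, exactly as they appear in the text.
    """
    result = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        name, sep, value = item.partition("=")
        result[name] = value if sep else None
    return result
```

For instance, `format_mkl_cdft({"alltoallv": 1, "enable_soi": None})` produces a string with one `name=value` item and one bare flag, which is the shape the syntax line describes.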
While the table below explains the settings that usually improve performance under certain conditions, the actual performance depends heavily on the configuration of your cluster. Therefore, experiment with the listed values to speed up your computations.
| Option | Possible Values | Description |
|---|---|---|
| alltoallv | 0 (default) | Configures CFFT to use the standard MPI_Alltoallv function to perform global transpositions. |
| alltoallv | 1 | Configures CFFT to use a series of calls to MPI_Isend and MPI_Irecv instead of the MPI_Alltoallv function. |
| alltoallv | 4 | Configures CFFT to merge global transposition with data movements in the local memory. CFFT performs global transpositions by calling MPI_Isend and MPI_Irecv in this case. Use this value in a hybrid case (MPI + OpenMP), especially when the number of processes per node equals one. |
| wo_omatcopy | 0 | Configures CFFT to perform local FFT and local transpositions separately. CFFT usually performs faster with this value than with wo_omatcopy = 1 if the configuration parameter DFTI_TRANSPOSE has the value of DFTI_ALLOW. See the Intel® MKL Developer Reference for details. |
| wo_omatcopy | 1 | Configures CFFT to merge local FFT calls with local transpositions. CFFT usually performs faster with this value than with wo_omatcopy = 0 if DFTI_TRANSPOSE has the value of DFTI_NONE. |
| wo_omatcopy | -1 (default) | Enables CFFT to decide which of the two above values to use depending on the value of DFTI_TRANSPOSE. |
| enable_soi | Not applicable | A flag that enables the low-communication Segment Of Interest FFT (SOI FFT) algorithm for one-dimensional complex-to-complex CFFT, which requires fewer MPI communications than the standard nine-step (or six-step) algorithm. While using fewer MPI communications, the SOI FFT algorithm incurs a minor loss of precision (about one decimal digit). |
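
Because the best settings depend on the cluster, one way to organize the experimentation is to enumerate every combination of the documented values and time the application with each. The sketch below only generates the candidate MKL_CDFT strings. The function name `candidate_settings` is hypothetical, and how each run is launched and timed is left to the reader.

```python
from itertools import product

# Values taken from the option table; enable_soi is a flag without a value.
ALLTOALLV_VALUES = (0, 1, 4)
WO_OMATCOPY_VALUES = (-1, 0, 1)


def candidate_settings(include_soi=False):
    """Yield every MKL_CDFT string combining the documented option values.

    include_soi -- also yield variants with the enable_soi flag. Per the
                   table, the flag applies to one-dimensional
                   complex-to-complex CFFT and costs about one decimal
                   digit of precision, so it is off by default here.
    """
    soi_choices = (False, True) if include_soi else (False,)
    for a2a, omat, soi in product(ALLTOALLV_VALUES, WO_OMATCOPY_VALUES, soi_choices):
        parts = [f"alltoallv={a2a}", f"wo_omatcopy={omat}"]
        if soi:
            parts.append("enable_soi")
        yield ",".join(parts)


if __name__ == "__main__":
    for setting in candidate_settings(include_soi=True):
        print(setting)
```

Keeping `enable_soi` opt-in reflects the precision trade-off: it is worth sweeping only when the application can tolerate the reduced accuracy.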
The following example illustrates usage of the environment variable:
set MKL_CDFT=wo_omatcopy=1,alltoallv=4,enable_soi
mpirun -ppn 2 -n 16 mkl_cdft_app.exe
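
The `set` form above is Windows command-prompt syntax. As a shell-independent alternative, the following hedged Python sketch prepares the same launch by passing the variable through the child process environment. The function name `build_launch` and its parameters are hypothetical. The `mpirun` flags and the application name are copied from the example above, and nothing is executed by this code.

```python
import os


def build_launch(mkl_cdft, ranks=16, ranks_per_node=2, app="mkl_cdft_app.exe"):
    """Return (command, environment) for launching a CFFT application.

    Nothing is executed here; pass the result to
    subprocess.run(command, env=environment) on a machine where mpirun
    and the application are available.
    """
    env = dict(os.environ)      # copy so the caller's environment is untouched
    env["MKL_CDFT"] = mkl_cdft  # the only variable this sketch changes
    cmd = ["mpirun", "-ppn", str(ranks_per_node), "-n", str(ranks), app]
    return cmd, env


if __name__ == "__main__":
    command, environment = build_launch("wo_omatcopy=1,alltoallv=4,enable_soi")
    print(" ".join(command))
    print("MKL_CDFT=" + environment["MKL_CDFT"])
```

Whether the variable reaches ranks started on remote nodes depends on the MPI launcher, so check its documentation before relying on this approach.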
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
This notice covers the following instruction sets: SSE2, SSE3, and SSSE3.
