MKL FFT performance – comparison of local and distributed-memory implementations

The Fast Fourier Transform (FFT) is widely used in many applications (e.g. digital signal processing and PDE solvers). Its performance depends on a number of factors: hardware (the number of cores, their clock rate, and the speed of the interconnect) and software (the data layout/distribution and the level of available parallelism, which may be limited by the application logic). In this article we compare the performance of the FFT implementations provided by Intel MKL in different usage models to help developers make a choice and get the maximum efficiency out of the available hardware.

MKL provides two flavors of the FFT functions:

•   Local FFT (also called DFTI) - this implementation is highly parallelized using OpenMP and is generally recommended for both single-processor and shared-memory machines when the input/output data fit into the local memory of one machine. This implementation demonstrates good multi-core scalability and cache utilization.

•    Distributed-memory FFT (also called Cluster FFT) - this implementation is highly parallelized using MPI and is generally recommended for distributed-memory machines (clusters) when the input/output data do not fit into a single machine's local memory and/or are distributed across the cluster nodes. This implementation supports very large transforms while maintaining good scalability through MPI parallelism. However, since the data are distributed across nodes, the total compute time is significantly influenced by the global transposes present in the algorithm: three for one-dimensional transforms and two for multi-dimensional ones. A user should also keep in mind that a cluster application can be run with a different number of MPI processes per node (ppn), which noticeably impacts the overall computation time.
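The origin of those global transposes can be seen in the structure of the algorithm itself: a large 1D transform of length N = N1·N2 is computed as N2 short FFTs, a twiddle-factor multiplication, and N1 more short FFTs, with the data reoriented between the two FFT phases. A minimal single-process sketch of this four-step decomposition (using NumPy in place of MKL, purely for illustration) looks like this:

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """1D FFT of length n1*n2 via the four-step (Cooley-Tukey) decomposition.
    In a distributed implementation, each change of working axis below
    corresponds to a global transpose across MPI ranks."""
    n = n1 * n2
    a = x.reshape(n1, n2)                      # a[j1, j2] = x[j1*n2 + j2]
    b = np.fft.fft(a, axis=0)                  # n2 FFTs of length n1
    k1 = np.arange(n1)[:, None]
    j2 = np.arange(n2)[None, :]
    b = b * np.exp(-2j * np.pi * k1 * j2 / n)  # twiddle factors
    c = np.fft.fft(b, axis=1)                  # n1 FFTs of length n2
    return c.flatten(order='F')                # output index k = k1 + n1*k2

x = np.random.rand(1024) + 1j * np.random.rand(1024)
assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))
```

When the rows of the N1×N2 view are block-distributed across nodes, each FFT phase needs a different axis to be node-local, which is why the 1D case requires the global transposes mentioned above.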

The following charts show the performance in Gflop/s of these two implementations for 1D in-place complex-to-complex FFT (double precision) versus the vector length for different cases:
•   DFTI in sequential case (1 thread)
•   DFTI with OpenMP parallelism (8 threads, KMP_AFFINITY=compact)
•   DFTI with OpenMP parallelism (8 threads, KMP_AFFINITY=scatter)
•   DFTI with OpenMP parallelism (16 threads, KMP_AFFINITY=scatter)
•   Cluster FFT on 1 node with ppn=8
•   Cluster FFT on 8 nodes with ppn=1
•   Cluster FFT on 8 nodes with ppn=8
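The Gflop/s figures in such charts are conventionally derived from the 5·N·log2(N) operation count for a complex transform of length N. A sketch of how a single data point could be measured (timing NumPy's FFT here as a stand-in for DFTI):

```python
import time
import numpy as np

def fft_gflops(n, repeats=20):
    """Measure 1D complex-to-complex FFT throughput in Gflop/s,
    using the conventional 5*N*log2(N) flop count."""
    x = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex128)
    np.fft.fft(x)                              # warm-up run
    start = time.perf_counter()
    for _ in range(repeats):
        np.fft.fft(x)
    elapsed = (time.perf_counter() - start) / repeats
    return 5.0 * n * np.log2(n) / elapsed / 1e9

print(f"N = 2**20: {fft_gflops(2**20):.2f} Gflop/s")
```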



Local FFT (DFTI) is unmatched for small-to-medium size transforms when threaded using OpenMP on all available cores. Note that in this case KMP_AFFINITY=scatter may provide up to 50% speed-up over KMP_AFFINITY=compact, because compact places all 8 threads on the same CPU socket, while scatter spreads them across both sockets.

As the charts show, for an 8-node system the best overall performance is achieved when all available physical cores are used (64 MPI processes, ppn=8). It is also easy to see that the case of 8 MPI processes with ppn=1 gives the best performance per MPI process, but such a configuration leads to extremely low utilization of the compute resources on each node and, hence, to very poor overall performance.

As mentioned above, Cluster FFT is very sensitive to the speed of the global transposes (i.e. to the inter-node bandwidth). Our tests demonstrated that Intel MPI delivers about a 1.7x increase in achieved bandwidth as ppn grows from 1 to 8.

Larger cache sizes and a larger number of cores per node improve the performance of both the local and the cluster implementation, while a slower interconnect would significantly decrease the performance of Cluster FFT (making the larger total memory its only advantage over local FFT).

It is also important to remember that for a problem of fixed size there is a cross-over point after which adding nodes does not pay off: the local resources are utilized inefficiently and the communication cost becomes too high.
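This cross-over can be illustrated with a toy cost model (the constants below are purely illustrative, not measurements): compute time shrinks as 1/p with the number of MPI processes p, while the global-transpose cost contains a latency term that grows with p, so the total time has a minimum at some finite p.

```python
def total_time(p, work=1.0, transpose=0.1, latency=1e-4):
    """Toy model of a distributed FFT: compute scales as 1/p, the
    transposed data volume does not shrink with p, and the per-peer
    latency cost grows with p. All constants are illustrative."""
    return work / p + transpose + latency * (p - 1)

best = min(range(1, 1025), key=total_time)
print(best)  # prints 100: beyond this p, adding processes slows things down
```

For these constants the optimum is p = sqrt(work/latency) = 100; with a faster interconnect (smaller latency) the cross-over point moves to a larger process count.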

Below is the hardware and software configuration of the cluster that was used for measurements in the article:
Mainboard: Intel® Server Board S5520UR
CPUs: Xeon X5560 C1 step (Nehalem EP) 2.8GHz / 6.4 QPI 1333 95 W 1MB L2 cache, 8M L3 cache (two CPUs per node with SMT enabled)
HCA: MHQH29-XTC PCI-Express x8 dual QDR InfiniBand 4x (firmware 2.6.0)
InfiniBand Software Stack: 1.3.1
OS: RedHat EL5 update 2, Kernel 5.3
MPI: Intel MPI (in rdssm configuration)



Starting with Intel MKL 10.3.0, Cluster FFT supports a hybrid (MPI + OpenMP) mode for 1D transforms. Note that efficient support for this mode is provided only with recent versions of Intel MPI 4.0 (released in 2010).
