A Framework for Low-Communication 1-D FFT


Authors:  Ping Tak Peter Tang, Jongsoo Park, Daehyun Kim, Vladimir Petrov - Intel Corporation

This paper was selected as a Best Paper finalist at the Supercomputing 2012 conference. It discusses algorithmic modifications that reduce the need for internode data exchange and thereby greatly improve the performance of 1-D FFT (Fast Fourier Transform) algorithms. Although the proposed approach and the data presented focus on Intel(R) Xeon(R) processors, the same concepts and approaches apply to the Intel(R) Xeon Phi(tm) coprocessor. The original paper can be downloaded below.

2013 Update: A follow-up whitepaper and conference presentation, entitled "Tera-Scale 1D FFT with Low-Communication Algorithm and Intel(R) Xeon Phi(tm) Coprocessors," were presented at the Supercomputing 2013 conference. They are attached to this article (second download below, named "sc13_fft").

Authors:  Jongsoo Park, Ganesh Bikshandi, Karthikeyan Vaidyanathan, Ping Tak Peter Tang, Pradeep Dubey, and Daehyun Kim.

Abstract: This paper demonstrates the first tera-scale performance of Intel(R) Xeon Phi(tm) coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 coprocessor nodes, and reach 6.7 TFLOPS with 512 nodes, which is 1.5x that achievable on the same number of Intel(R) Xeon(R) nodes. It is a challenge to fully utilize the compute capability presented by many-core wide vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low-communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Intel(R) Xeon Phi(tm) coprocessors: it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.
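
As background for the inter-node communication cost mentioned in the abstract, the sketch below shows the conventional decomposition of a length-N DFT with N = N1*N2 (often called the four-step method). It is not the Segment-of-Interest algorithm from the papers, only a minimal single-process illustration of the baseline structure. The function names `dft` and `four_step_fft` are illustrative and do not come from the papers. In a distributed-memory implementation of this conventional scheme, the strided gathers and the final reordering are typically realized as global transposes (all-to-all exchanges), which is the kind of internode traffic a low-communication algorithm aims to reduce.

```python
import cmath


def dft(v):
    """Direct O(n^2) DFT, used both as a building block and as a reference."""
    n = len(v)
    return [
        sum(v[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT via the conventional four-step decomposition.

    The input is viewed as an n1-by-n2 row-major matrix. Illustrative only:
    the small DFTs here are direct sums, not optimized FFTs.
    """
    n = n1 * n2
    assert len(x) == n

    # Step 1: length-n1 DFTs over strided columns (stride n2 in memory).
    # In a distributed setting this strided access implies a global transpose.
    cols = [dft([x[n2 * r + c] for r in range(n1)]) for c in range(n2)]

    # Step 2: multiply by twiddle factors exp(-2*pi*i * c * k1 / n).
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)

    # Step 3: length-n2 DFTs across the column index for each fixed k1.
    rows = [dft([cols[c][k1] for c in range(n2)]) for k1 in range(n1)]

    # Step 4: reorder so that output index k = k1 + n1 * k2
    # (another transpose-like data movement).
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = rows[k1][k2]
    return out
```

For a 12-point input `x`, `four_step_fft(x, 3, 4)` agrees with the direct `dft(x)` up to floating-point rounding.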



For more information about compiler optimizations, see our Optimization Notice.