This article discusses techniques to optimize the Fast Fourier Transform (FFT) for Intel® Processor Graphics without using Shared Local Memory (SLM) or vendor extensions. The implementation leverages a run‑time code generator to support a wide variety of FFT dimensions. Performance depends on a deep understanding of how the OpenCL™ API maps to the underlying architecture.
The sample is provided as a Microsoft Visual Studio* 2013 solution and contains two projects: genFFT and sample.
genFFT is the FFT code generator, which produces 1D FFT kernels for various power-of-two FFT lengths, data types (cl_float and cl_half), and GPU architectural details. The sample project shows one way of using genFFT to generate and enqueue FFT kernels in your application.
The implementation has already been discussed in detail in previous articles. You can find them at the following links:
International Workshop on OpenCL – IWOCL 2016
OpenCL™ Fast Fourier transform optimizations for Intel® Processor Graphics
IWOCL '16 Proceedings of the 4th International Workshop on OpenCL, Article No. 12, ACM
OpenCL™ FFT Optimizations for Intel® Processor Graphics
The sample uses Intel® MKL for functional validation. Intel® tools are freely available through a variety of options. To find the option that best matches your needs, go to Intel® Developer Zone Free Software Tools.
The sample project uses single precision by default but genFFT also supports half precision. There are several options for converting between single precision and half precision data. One is to use Intel® C++ Compiler Intrinsics for Converting Half Floats. Another is to use DirectXMath Library Conversion Functions.
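For readers who want to see what such a conversion involves, the following is a minimal portable sketch. It is a stand-in written for illustration, not one of the two library routes named above and not code from the sample. The function names `floatToHalf` and `halfToFloat` are hypothetical, and the sketch assumes IEEE 754 binary32 and binary16 layouts with round-to-nearest-even.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: convert a binary32 value to binary16 bits,
// rounding to nearest even. Not part of genFFT or any Intel library.
std::uint16_t floatToHalf(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const std::uint32_t sign = (bits >> 16) & 0x8000u;
    const std::uint32_t mag = bits & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) return static_cast<std::uint16_t>(sign | 0x7E00u);   // NaN
    if (mag >= 0x477FF000u) return static_cast<std::uint16_t>(sign | 0x7C00u);  // overflow or infinity
    if (mag >= 0x38800000u) {
        // Normal half: rebias the exponent (127 -> 15), then round the 13 dropped bits.
        const std::uint32_t rebased = mag - 0x38000000u;
        const std::uint32_t rounded = rebased + 0x0FFFu + ((rebased >> 13) & 1u);
        return static_cast<std::uint16_t>(sign | (rounded >> 13));
    }
    const std::uint32_t exponent = mag >> 23;
    if (exponent < 102u) return static_cast<std::uint16_t>(sign);  // too small: signed zero
    // Subnormal half: shift the 24-bit significand into units of 2^-24.
    const std::uint32_t mantissa = (mag & 0x007FFFFFu) | 0x00800000u;
    const std::uint32_t shift = 126u - exponent;  // in the range 14..24
    std::uint32_t half = mantissa >> shift;
    const std::uint32_t remainder = mantissa & ((1u << shift) - 1u);
    const std::uint32_t halfway = 1u << (shift - 1u);
    if (remainder > halfway || (remainder == halfway && (half & 1u))) ++half;
    return static_cast<std::uint16_t>(sign | half);
}

// Hypothetical helper: convert binary16 bits back to a binary32 value (always exact).
float halfToFloat(std::uint16_t half) {
    const std::uint32_t sign = (half & 0x8000u) << 16;
    std::uint32_t exponent = (half >> 10) & 0x1Fu;
    std::uint32_t mantissa = half & 0x03FFu;
    std::uint32_t bits;
    if (exponent == 0x1Fu) {
        bits = sign | 0x7F800000u | (mantissa << 13);                // infinity or NaN
    } else if (exponent != 0u) {
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);  // normal
    } else if (mantissa == 0u) {
        bits = sign;                                                 // signed zero
    } else {
        // Subnormal half: normalize so the leading one becomes implicit.
        exponent = 113u;
        while ((mantissa & 0x0400u) == 0u) { mantissa <<= 1; --exponent; }
        bits = sign | (exponent << 23) | ((mantissa & 0x03FFu) << 13);
    }
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

A host buffer of cl_half values could then be filled element by element with `floatToHalf` before upload and read back with `halfToFloat`. A hardware or library conversion is likely to be faster where it is available.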
The code has been tested with Microsoft Visual Studio 2013 and 5th and 6th Generation Intel® Processor Graphics.
By default the sample performs an 8-point FFT on a 1024x768 cl_float2 buffer. The data is read and written in column order for maximum memory bandwidth. The following command line arguments can be used to try other FFT configurations:
|-cols c||Signal width.||Default 1024.|
|-rows r||Signal height.||Default 768. Must be a multiple of FFT length.|
|-fft l||FFT length.||Default 8.|
|-base b||FFT base length.||Default 32.|
|-simd s||Logical SIMD width.||Default 8.|
|-lx l||localSize.||Default 256. Must divide the signal width.|
|-h||Display this help.|
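There are only a few constraints to consider: the FFT length must be a power of two, the signal height must be a multiple of the FFT length, and localSize must divide the signal width.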
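The constraints stated in the argument table can be checked before any kernel is generated. The sketch below is illustrative only: `FftConfig`, `validate`, and `fftCount` are hypothetical names, not part of genFFT or the sample. It also assumes that each column of the signal holds rows / FFT length independent transforms, which is inferred from the column-order access and the height constraint rather than stated by the sample.

```cpp
#include <cstddef>
#include <string>

// Hypothetical container mirroring the command line options; not genFFT's own type.
struct FftConfig {
    std::size_t cols = 1024;      // -cols: signal width
    std::size_t rows = 768;       // -rows: signal height
    std::size_t fftLength = 8;    // -fft: FFT length
    std::size_t localSize = 256;  // -lx: localSize
};

bool isPowerOfTwo(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Returns an empty string when the configuration is usable,
// otherwise a description of the first violated constraint.
std::string validate(const FftConfig& c) {
    if (!isPowerOfTwo(c.fftLength)) {
        return "FFT length must be a power of two";
    }
    if (c.rows == 0 || c.rows % c.fftLength != 0) {
        return "signal height must be a non-zero multiple of the FFT length";
    }
    if (c.localSize == 0 || c.cols == 0 || c.cols % c.localSize != 0) {
        return "localSize must divide the signal width";
    }
    return std::string();
}

// Assumption: each column holds rows / fftLength independent FFTs.
std::size_t fftCount(const FftConfig& c) {
    return c.cols * (c.rows / c.fftLength);
}
```

Under that assumption the default configuration (1024x768, FFT length 8) corresponds to 98304 transforms per enqueue.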
The sample configures genFFT based on the above arguments, initializes the buffer with a random signal between -1.0 and 1.0, computes the reference spectrum on the CPU using MKL (single precision) and computes the spectrum on the GPU using genFFT (single precision).
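The host-side part of this flow can be sketched without MKL. The code below is a rough illustration, not the sample's code: `randomSignal` and `naiveDft` are hypothetical names, the naive O(n²) DFT stands in for the MKL reference, and the forward transform is assumed to use the exp(-2πi·jk/n) sign convention.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using Complex = std::complex<float>;

// Fill a buffer with complex samples whose real and imaginary parts
// are drawn uniformly between -1.0 and 1.0. The seed makes runs repeatable.
std::vector<Complex> randomSignal(std::size_t count, unsigned seed) {
    std::mt19937 engine(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<Complex> signal(count);
    for (Complex& sample : signal) {
        const float re = dist(engine);
        const float im = dist(engine);
        sample = Complex(re, im);
    }
    return signal;
}

// Naive forward DFT used here as a stand-in reference (the sample uses MKL).
// Sums are accumulated in double precision, then narrowed to float.
std::vector<Complex> naiveDft(const std::vector<Complex>& input) {
    const std::size_t n = input.size();
    const double twoPi = 6.283185307179586;
    std::vector<Complex> output(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce j*k modulo n to keep the angle small and accurate.
            const double angle = -twoPi * static_cast<double>((j * k) % n) / static_cast<double>(n);
            sum += std::complex<double>(input[j]) * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        output[k] = Complex(static_cast<float>(sum.real()), static_cast<float>(sum.imag()));
    }
    return output;
}
```

A quadratic reference like this is only practical for short transforms such as the default 8-point case. An optimized library remains the better choice for validating long FFTs.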
|Execution time per FFT:||0.0328994us|
|Max abs error:||9.53674e-007|
When finished, the sample reports the buffer size as the number of complex elements, the FFT length, the execution time per FFT, and the error, calculated as the maximum absolute difference between the spectrum computed on the CPU using MKL and the spectrum computed on the GPU using genFFT.
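The error metric described above could be computed along the following lines. This is a sketch under assumptions, not the sample's code: `maxAbsError` is a hypothetical name, and the difference is taken per real and imaginary component, since the text does not say whether the sample compares components or complex magnitudes.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Largest absolute difference between two spectra.
// Assumption: real and imaginary parts are compared separately.
float maxAbsError(const std::vector<std::complex<float>>& reference,
                  const std::vector<std::complex<float>>& result) {
    if (reference.size() != result.size()) {
        throw std::invalid_argument("spectra must have the same length");
    }
    float worst = 0.0f;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const std::complex<float> diff = reference[i] - result[i];
        worst = std::max({worst, std::abs(diff.real()), std::abs(diff.imag())});
    }
    return worst;
}
```

Because the metric is absolute rather than relative, its magnitude is best read against the size of the spectrum values, which grow with the FFT length.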
Dan Petre is a Graphics Software Engineer with expertise in GPGPU programming and optimization for computer vision (OpenCV), deep learning (FFT based convolution for CNN) and digital signal processing (FFT).