A Runtime Generated FFT for Intel® Processor Graphics

Download the code

Introduction

This article discusses techniques to optimize the Fast Fourier Transform (FFT) for Intel® Processor Graphics without using Shared Local Memory (SLM) or vendor extensions. The implementation leverages a run‑time code generator to support a wide variety of FFT dimensions. Performance depends on a deep understanding of how the OpenCL™ API maps to the underlying architecture.

OpenCL™ Implementation

The sample is provided as a Microsoft Visual Studio* 2013 solution and contains two projects: genFFT and sample.

genFFT is the FFT code generator which produces 1D FFT kernels for various FFT lengths power of two, data types (cl_float and cl_half) and GPU architectural details. The sample project shows one way of using genFFT to generate and enqueue FFT kernels in your application.

The implementation has already been discussed in detail in previous articles. You can find them at the following links:

Parallel Universe Magazine – issue #25, pg. 53
A Runtime-Generated Fast Fourier Transform for Intel® Processor Graphics

International Workshop on OpenCL – IWOCL 2016
OpenCL™ Fast Fourier transform optimizations for Intel® Processor Graphics

IWOCL '16 Proceedings of the 4th International Workshop on OpenCL, Article No. 12, ACM
OpenCL™ FFT Optimizations for Intel® Processor Graphics

Dependencies

The sample uses Intel® MKL for functional validation. Fortunately Intel® tools are freely available through a variety of options. In order to find the option that best matches you go to Intel® Developer Zone Free Software Tools.

The sample project uses single precision by default but genFFT also supports half precision. There are several options for converting between single precision and half precision data. One is to use Intel® C++ Compiler Intrinsics for Converting Half Floats. Another is to use DirectXMath Library Conversion Functions.

The code has been tested with Microsoft Visual Studio 2013 and 5th and 6th Generation Intel® Processor Graphics.

Running the sample

By default the sample will perform 8-point FFT on a 1024x768 cl_flloat2 buffer. The data is read and written in column order for maximum memory bandwidth. The following command line arguments can be used in order to try other FFT configurations:

-cols cSignal widthDefault 1024.
-rows rSignal height.Default 768. Must be a multiple of FFT length.
-fft  lFFT length.Default 8.
-base bFFT base length.Default 32.
-simd sLogical SIMD width.Default 8.
-lx   llocalSize[0].Default 256. Must divide the signal width.
-hDisplay this help. 

There are only a few constraints to consider:

  1. The number of rows has to be a multiple of the fftSize.
  2. The number of columns has to be a multiple of the localSize[0].
  3. The logical SIMD width can be 8, 16 or 32.
  4. The base FFT can take any value power of two. For best performance this value should be less than or equal to the number of complex elements that can fit in the GPU registers alongside the other program variables.

The sample configures genFFT based on the above arguments, initializes the buffer with a random signal between -1.0 and 1.0, computes the reference spectrum on the CPU using MKL (single precision) and computes the spectrum on the GPU using genFFT (single precision).

Buffer size:786432
FFT length:8
Execution time per FFT:0.0328994us
Max abs error:9.53674e-007

When finished the sample reports the buffer size as the number of complex elements, the FFT length, the execution time per FFT and the error calculated as the maximum of the absolute difference between the spectrum computed on the CPU using MKL and the spectrum computed on the GPU using genFFT.

About the Authors

Dan Petre is a Graphics Software Engineer with expertise in GPGPU programming and optimization for computer vision (OpenCV), deep learning (FFT based convolution for CNN) and digital signal processing (FFT).

Download the code

AttachmentSize
Package icon genfftSample.zip26.31 KB
For more complete information about compiler optimizations, see our Optimization Notice.

1 comment

Top

Why is the provided code not working for number of transforms = cols and stride = cols * input_distance ?

I'm always getting a segmentation fault within the DftiCompute.

 

Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.