Submitted by Dmitry Baksheev (Intel) on

## Downloads

The Intel® Math Kernel Library (Intel® MKL) provides functions to compute discrete Fourier transforms of arbitrary lengths. Real and complex types are supported in both single and double precision. In all cases, to compute Discrete Fourier Transforms (DFTs), Intel MKL uses Fast Fourier Transform (FFT) algorithms of computational complexity O(N log N), even when the transform size is a prime number.

In applications, one may select the size of Fast Fourier Transform (FFT) from a certain range. Then a natural question arises: "**Which size in the range will make FFT perform fastest?"** This article attempts to answer this question and provide a tool to support the answer.

While FFT has O(N log N) complexity, actual performance as a function of size is not smooth and may vary by a factor of magnitude for adjacent sizes. FFT is implemented by a recursive breakdown of FFT(N) into transforms of smaller size until there is a kernel for the small size that would not recurse further. If there is a kernel for size K, the transforms of size multiple of K would perform faster. The set of the kernel sizes is often referred to as supported ‘radixes'. The Intel® MKL FFT implementation employs radix 2, 3, 4, 5, 7, 8, 11, 13, 16 kernels, and several other larger size kernels.

Generally, a size that is multiple of supported radixes provides maximum performance, and 2-power sizes perform fastest. Obviously, 2-power sizes may be missing in the range in question.

Firstly, we should note that our objective in selecting the ‘best' size is **minimizing time of transform**, not maximizing the performance in Gflop/s [performance of FFT(N) is often measured in pseudo Gflop/s based on the number of floating point operations given by 5*N*log2(N)]. Therefore, the choice between a 2-power size and a smaller non-2-power size may be not so obvious.

Secondly, the question we consider is particularly important for multidimensional transforms. For such transforms, selecting 2-power sizes for each dimension may significantly degrade performance because of a peculiar property of the computer memory organization. If a working data set is located in the memory at a large 2-power stride (which is the case, for example, in a 2048x2048 transform), then the processor's cache gets drastically underutilized, and, as a result, the performance is significantly decreased. This issue can be remedied by padding the ‘bad' dimensions so as to avoid 2-power strides. Thus, layout of the data for multidimensional transforms is crucial for performance of FFT.

To describe the layout of multidimensional transforms, we follow the notation used in FFTW. In this notation, a multidimensional transform is specified by a sequence of i/o dimensions with each dimension represented by a triplet (N, is, os), where ‘is' and ‘os' denote the input/output strides along that dimension [ref]. For in-place complex-to-complex transforms, we often have ‘is==os'. The layout of the multidimensional data in one domain (e.g. input) is therefore defined by sequence of pairs (N,s) that we print following the pattern *N _{1}*:

*s*x

_{1}*N*:

_{2}*s*x.... For a quick illustration, consider a 100x200 matrix:

_{2}The following excerpt of a Fortran program describes 2-dimensional column-major layout **100:1x200:1000 **(sizes are 100 and 200, strides are 1 and 1000, respectively), where the first dimension is padded with 900 elements:

complex :: X(1000,1000) ! matrix that holds the data X(1:100, 1:200) = data

The following piece of C/C++ code describes 2-dimensional row-major layout **100:1000x200:1 **(sizes are 100 and 200, strides are 1000 and 1, respectively), where the second dimension is padded with 800 elements:

complex<double> X[1000][1000]; for (n1 = 0; n1 < 100; ++n1) for (n2 = 0; n2 < 200; ++n2) X[n1][n2] = data();

We provide an empirical tool that selects ‘the best' multidimensional layout by measuring the most promising candidate layouts. The tool does not provide FFT itself: it calls an external library (Intel® MKL in this case). The tool sorts the candidates on the basis of a heuristic that small primes generally give better performance. To rank a candidate, we use the "average prime factor" trick. For any number N, we denote an arithmetic or geometric, possibly weighed, average of its prime factors as an average prime factor (apf). The smaller this factor is for a candidate layout, the more likely FFT of size N is fast. This trick has a subtle drawback that it does not favor small sizes for the powers of the same prime, e.g. all 2-powers belong to the same equivalence class.

The tool ranks the candidates in the specified range, sorting out the candidates with unacceptably high ‘apf', then extends the list of accepted candidates with potentially better data layouts, and then measures time it takes each candidate to compute the FFT. The layout with minimum time is reported as the best layout.

Be aware that empirical measurements are generally not reproducible. If this length-advisor software runs on a loaded machine, it may misguide you into identifying a suboptimal candidate as the best one. It may happen on a dedicated machine too, because performance also depends on memory location and multiple factors other than sizes and padding.

This length-advisor supports multidimensional complex-to-complex in-place transforms. If needed, it may be extended to out-of-place transforms, r2c/c2r transforms, etc.

In order to use this length advisor, you should modify the main() function to specify the range to search in, then compile and run the program.

Below is an example output of the program advising about a 2-dimensional transform of a size near 4000-by-4000 on a certain machine. One can see that a 3840-by-3888 transform (line C) may take less time to compute than a 2-power transform 4096-by-4096 (line B). It also shows that padding the latter may increase performance by ~40% (line A).

./length-advisor.exe Looking for the best FFT size/layout in range 3800x3800 ... 4100x4100 4096:4097x4096:1 apf=2 84.785 ms 23.746 gflop/s (line A) 4096:4096x4096:1 apf=2 120.18 ms 16.752 gflop/s (line B) 3840:4096x4096:1 apf=2.18 114.34 ms 16.442 gflop/s 3840:4097x4096:1 apf=2.18 82.793 ms 22.709 gflop/s 4096:3840x3840:1 apf=2.18 108.7 ms 17.296 gflop/s 4096:3841x3840:1 apf=2.18 92.722 ms 20.277 gflop/s 4096:3888x3888:1 apf=2.24 87.093 ms 21.874 gflop/s 3888:4097x4096:1 apf=2.24 86.349 ms 22.062 gflop/s 3888:4096x4096:1 apf=2.24 118.17 ms 16.121 gflop/s 4096:4032x4032:1 apf=2.33 96.314 ms 20.557 gflop/s 4032:4097x4096:1 apf=2.33 87.718 ms 22.572 gflop/s 4032:4096x4096:1 apf=2.33 120.97 ms 16.367 gflop/s 3840:3841x3840:1 apf=2.4 88.194 ms 19.908 gflop/s 3840:3840x3840:1 apf=2.4 103.53 ms 16.959 gflop/s 4096:4000x4000:1 apf=2.45 84.416 ms 23.257 gflop/s 4000:4097x4096:1 apf=2.45 95.286 ms 20.604 gflop/s 4000:4096x4096:1 apf=2.45 119.44 ms 16.438 gflop/s 3840:3888x3888:1 apf=2.47 81.041 ms 21.952 gflop/s (line C) 3888:3840x3840:1 apf=2.47 106.12 ms 16.764 gflop/s 3888:3841x3840:1 apf=2.47 91.255 ms 19.495 gflop/s The best candidate is 3840:3888x3888:1 apf=2.47 81.041 ms 21.952 gflop/s The second best one is 3840:4097x4096:1 apf=2.18 82.793 ms 22.709 gflop/s The best layout is 3840:3888x3888:1

The result of this example suggests that transforms 3840x3888 (tightly packed) and 3840x4096 (padded with 1 element) will take least time to compute, from the set of problems in the specified range. The data for these transforms should be arranged as follows:

Fortran:

complex X(3888,3840) ! data occupies whole array complex X(4097,3840) ! data occupies subarray X(1:4096,:)

C/C++:

complex<double> *x; // Traverse the array with x(n1,n2), where 0<=n1<N1 and 0<=n2<N2 #define x(n1,n2) x[3888*n1 + n2] /* N1=3840, N2=3888 */ #define x(n1,n2) x[4097*n1 + n2] /* N1=3840, N2=4096 */

The above result was obtained by measuring 20 candidate FFT problems, where 20 is the value of the MAX_CANDIDATES parameter in the program, and none of them employed radix-11 kernels, because maximum ‘apf’ tried was 2.47. Increasing the number of candidates may produce the following result that uses radix-11 kernels:

The best candidate is 3840:3872x3872:1 apf=3.29 75.383 ms 23.497 gflop/s The second best one is 3872:3872x3872:1 apf=4.57 76.832 ms 23.258 gflop/s

What size fits your needs best?

Further work on this tool may include support of FFTW3 interface, limiting the effort of the tool by search time instead of the number of candidates tried, improving the padding variations in the maybeAddBetterPadding() function, using a higher resolution timer for small transforms, providing a command-line interface to the tool, etc.

Please feel free to use the software, and we hope you will find it useful. Your feedback would be appreciated.

## Comments (2)

TopYing H. (Intel) said on

Note: If the measurement is too complex for you, here is one simple rule to pad the leading dimension for better performance:

When computing NxM transform on the data located in N1xM1 array, align the array on the cacheline boundary and pick the leading dimension a non-2-power multiple of cachelines.

For example:

single precision complex 500x250 transform has 8 elements per cacheline (64 bytes), and therefore leading dimension for size 500 should be chosen to be 504, because div_round_up(500,8)*8 is 504, while leading dimension for size 250 should be chosen to be 264=256+8 because div_round_up(250,8)*8 is 256 which is 2-power.

or add the code is like

#define CACHELINE_SIZE 64

ld_length = good_cfft_padLen(N, sizeof(FFTdatatype));

int good_fft_padLen(int N, int size) {

return (((N+(CACHELINE_SIZE/size)-1)/(CACHELINE_SIZE/size))|1)*(CACHELINE_SIZE/size);

}

Best Regards,

Ying

Andrew M. said on

Is there code for this somewhere? Or can I find a similar utility somewhere else? Thanks!

## Add a Comment

Top(For technical discussions visit our developer forums. For site or software product issues contact support.)

Please sign in to add a comment. Not a member? Join today