1D FFT perf

Hi,

I am trying to replicate the 1D FFT perf numbers listed here (http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm). For example, what seems like 17+ Gflops for n=1024. On my 2.8 GHz Harpertown I am only able to get about 7 Gflops for a single call to DftiComputeForward (single-precision, complex, in-place). If I average the time for 1000 repeated calls, then I get ~14 Gflops (I assume that's because the entire data set fits easily in L2).

So, does anyone know whether those numbers are averages over multiple plan executions, or maybe for batched 1D FFTs? I tried batched code with 16384 transforms and I'm still getting only about 7 Gflops. I've tried setting MKL to 1 and 4 threads from the command line (Linux), but didn't see a difference in results.
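The gap between the single-call and the averaged figure above can be measured explicitly. The following is only a minimal timing sketch: `placeholder_transform` is a hypothetical stand-in for one DftiComputeForward call, not MKL code, and the helper name is made up for illustration.

```python
import time

def time_first_and_average(fn, reps=1000):
    """Return (seconds for the first call, mean seconds over `reps` later calls)."""
    t0 = time.perf_counter()
    fn()                                  # first call: data and code are cold
    first = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(reps):                 # repeated calls: data stays in cache
        fn()
    mean = (time.perf_counter() - t0) / reps
    return first, mean

def placeholder_transform():
    # Hypothetical stand-in for a single in-place transform call.
    sum(range(1024))

first, mean = time_first_and_average(placeholder_transform, reps=1000)
print(f"first call: {first:.3e} s, mean of repeats: {mean:.3e} s")
```

Reporting both numbers side by side makes it clear which one a published benchmark figure should be compared against.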

Hello,

The numbers show an average over a sequence of computations of the same transform on a single thread (you might have noticed that changing the number of threads doesn't affect the performance). So, for size 1024 the data set is contained in L2. Changing the number of threads should be visible on batched code (that is, when DFTI_NUMBER_OF_TRANSFORMS>1). Could you tell me how you link MKL into your application and how you set the number of threads?

Thanks,
Dima

Thanks for the reply. I'm using MKL under Linux, so I link with the -lmkl flag to icc. My full command was "icc -Kc++ -no-gcc -O3 -xT -lmkl -openmp fft.cpp". The other options I grabbed from a page describing fftw's benchfft experiments.

I tried setting the number of threads from command line in two ways, since I found different suggestions in documentation and googling. So, I tried both "set MKL_NUM_THREADS=4" and "set OMP_NUM_THREADS=4". In both cases I did not change the DFTI_NUMBER_OF_USER_THREADS from its default, so I assume it was set to 1.

Curiously, I wasn't able to see a benefit from multithreading even when working on batches. For example, batching 16K transforms, each of 1024 complex elements (both input and output distances were set to 1024), gave me the same results (~7.6 Gflops) with 1 or 4 threads, enabled as described above. Perhaps I'm not using the correct procedure to enable multithreading? All of this is on a single-socket E5462 Harpertown.

I may be wrong, but it looks like you are using csh, in which 'set VAR=val' sets a shell-internal variable, not an environment variable. Could you try 'setenv MKL_NUM_THREADS 4'?

Thanks
Dima
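A quick way to confirm the point about 'set' versus 'setenv' is to look at the environment from inside a child process, since only exported variables are inherited. A minimal sketch follows; the helper name `exported_thread_settings` is hypothetical, and the two variable names are the ones mentioned in this thread.

```python
import os

THREAD_VARS = ("MKL_NUM_THREADS", "OMP_NUM_THREADS")

def exported_thread_settings(env=None):
    """Return the thread-count variables actually visible in the given environment."""
    env = os.environ if env is None else env
    return {name: env[name] for name in THREAD_VARS if name in env}

# Run from the same shell that launches the benchmark.
print(exported_thread_settings() or "neither variable is exported to this process")
```

If the script reports that neither variable is exported after a csh 'set', the benchmark binary will not see the setting either.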

Hi Dima,

I experimented a bit more and now I can get scaling with multiple threads when batching. However, I get the scaling only if all batches fit in L2 and I repeat the experiment many times (presumably to amortize the cost of loading L2). For example, with 1024 batches of 1024-element transforms and 100 reps, I get 23, 47, and 55 Gflops for 2, 4, and 8 threads, respectively. The entire problem is 8MB, so it fits well within the 12MB L2. However, if I increase the number of batches to 2048, requiring 16MB, perf drops to below 10 Gflops.

So, it seems that I'm getting FSB limited, does that make sense for these sizes?
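The 8MB and 16MB figures above follow from 8 bytes per single-precision complex element. A small sketch of that arithmetic, taking the 12MB L2 size from the post and using a hypothetical helper name:

```python
BYTES_PER_ELEMENT = 8            # single-precision complex: two 4-byte floats
L2_BYTES = 12 * 1024**2          # total L2 size as quoted in the post

def batch_bytes(num_transforms, n):
    """Bytes occupied by `num_transforms` in-place transforms of length `n`."""
    return num_transforms * n * BYTES_PER_ELEMENT

for batches in (1024, 2048):
    size = batch_bytes(batches, 1024)
    verdict = "fits in L2" if size <= L2_BYTES else "exceeds L2"
    print(f"{batches} transforms: {size // 1024**2} MB, {verdict}")
```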

Hmm. 55 Gflop/s for 8 threads is not what I would call nice scaling. I'd rather expect something in the range of 94 Gflop/s, which is 47*2. That you see much less performance on 8 threads may indicate that your application shared processor cores with some other application, not necessarily that the bus limits it. Other applications may also cause your application's threads to migrate between the cores, which would degrade the performance. So I'd also suggest taking care of the affinity of your threads. For instance, a proper setting of the KMP_AFFINITY environment variable might help.

Thanks,
Dima

I should have mentioned that this is a single-socket system, so any improvement from 4 to 8 threads is just a nice bonus. I think the reason it's going above the theoretically possible 44.8 Gflops (4 cores x 2.8 GHz x 4-wide instructions, if I'm counting this correctly) is that I'm using 5 n log2 n to count flops, which isn't accurate (it depends on the radix etc.).
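For reference, the two quantities in the paragraph above can be written out as follows. This is only a sketch of the conventions described in the post: 5 n log2 n is a nominal flop count, not the number of operations a particular FFT implementation performs, and the function names are made up for illustration.

```python
import math

def fft_flops(n):
    """Nominal flop count for one complex FFT of length n (5 n log2 n convention)."""
    return 5 * n * math.log2(n)

def gflops(n, num_transforms, seconds):
    """Nominal Gflops for `num_transforms` transforms of length n done in `seconds`."""
    return fft_flops(n) * num_transforms / seconds / 1e9

def peak_gflops(cores, ghz, flops_per_cycle):
    """Peak as counted in the post: cores x clock (GHz) x flops per cycle per core."""
    return cores * ghz * flops_per_cycle

print("nominal flops for n=1024:", fft_flops(1024))
print("peak Gflops as counted in the post:", round(peak_gflops(4, 2.8, 4), 1))
```

Because the numerator is nominal, a reported rate above the peak computed this way does not by itself indicate a measurement error.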

What I would like to see, though, is scaling with threads when the data set is not in L2 (either when it's not yet loaded into L2, or when all batched transforms together exceed the L2 size). I would expect some scaling, since it seems there's enough FSB bandwidth for that. For example, 2048 batched 1024-element transforms don't scale with threads, and I'm wondering if there's any way I can coax them into scaling.
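One rough way to check the bandwidth argument is to convert a Gflops figure into the memory traffic it implies. The sketch below assumes each element is read once and written once per transform, which is an idealization: write-allocate traffic or extra passes over the data would raise the real number. The helper name is hypothetical.

```python
import math

BYTES_PER_ELEMENT = 8   # single-precision complex

def implied_traffic_gb_per_s(gflops, n):
    """Memory traffic (GB/s) implied by a nominal Gflops rate for out-of-cache FFTs.

    Assumes one read and one write of every element per transform and the
    5 n log2 n flop-count convention used elsewhere in this thread.
    """
    flops_per_transform = 5 * n * math.log2(n)
    bytes_per_transform = 2 * n * BYTES_PER_ELEMENT
    return gflops * bytes_per_transform / flops_per_transform

# The ~7.6 Gflops reported earlier for n=1024 corresponds to roughly 2.4 GB/s.
print(round(implied_traffic_gb_per_s(7.6, 1024), 2))
```

Comparing this estimate with a measured memory bandwidth for the machine, rather than a datasheet figure, gives a better sense of how much headroom is left.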
