I am trying to replicate the 1D FFT performance numbers listed here (http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm), for example what looks like 17+ Gflops for n=1024. On my 2.8 GHz Harpertown I only get about 7 Gflops for a single call to DftiComputeForward (single-precision, complex, in-place). If I average the time over 1000 repeated calls, I get ~14 Gflops (I assume that's because the entire data set fits easily in L2).
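For reference, FFT Gflops figures are commonly derived from a nominal operation count of 5·n·log2(n) for a complex transform of length n, not from counted instructions. The sketch below assumes that convention applies to the numbers above (the linked page is not quoted here as confirming it); `fft_gflops` is a hypothetical helper name. At n=1024 the nominal count is 51,200 flops, so 7 Gflops corresponds to roughly 7.3 microseconds per transform and 17 Gflops to roughly 3.0 microseconds.

```python
import math

def fft_gflops(n, seconds, batch=1):
    """Gflops for `batch` complex FFTs of length n that took `seconds` in total.

    Assumes the conventional nominal count of 5 * n * log2(n) flops per transform.
    """
    flops = 5.0 * n * math.log2(n) * batch
    return flops / seconds / 1e9

# Nominal flops for one n = 1024 transform: 5 * 1024 * 10 = 51200.
nominal = 5 * 1024 * 10

# Convert the rates mentioned above back into time per transform.
for rate in (7.0, 14.0, 17.0):
    seconds = nominal / (rate * 1e9)
    print(f"{rate:4.1f} Gflops  <->  {seconds * 1e6:.2f} us per transform")
```

With times this short, the resolution and overhead of the timer used around a single call can be a noticeable fraction of the measurement.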
So, does anyone know whether those numbers are averages over multiple plan executions, or maybe for batched 1D FFTs? I tried batched code with 16384 transforms and I'm still getting only about 7 Gflops. I've also tried running with 1 and 4 MKL threads set from the command line (Linux), but didn't see a difference in results.
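The difference between the single-call and the averaged measurement can be made explicit with a small harness that reports both. This is a minimal sketch in Python, not the MKL C code from the question: `time_cold_and_warm` and `placeholder_transform` are hypothetical names, and the placeholder workload merely stands in for a wrapper around the real transform call.

```python
import time

def time_cold_and_warm(run, repeats=1000):
    """Return (first_call_seconds, mean_seconds_over_repeats) for a zero-argument callable."""
    # First call: may include cold caches and any one-time setup inside `run`.
    t0 = time.perf_counter()
    run()
    cold = time.perf_counter() - t0

    # Repeated calls: time the whole loop once, then divide by the call count.
    t0 = time.perf_counter()
    for _ in range(repeats):
        run()
    warm = (time.perf_counter() - t0) / repeats
    return cold, warm

# Placeholder workload (assumption); substitute a call to the actual transform.
def placeholder_transform():
    sum(i * i for i in range(1024))

cold, warm = time_cold_and_warm(placeholder_transform)
print(f"first call: {cold * 1e6:.2f} us, mean of repeats: {warm * 1e6:.2f} us")
```

Reporting the two numbers side by side makes it easier to say which one a published figure is being compared against.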