I'm trying to speed up FFT processing on a dual 2 GHz Pentium 4 Xeon. With FFTW, the fast C library available on the web, it takes ~1 second to compute 20,000 1k FFTs. It takes ~0.75 seconds using MKL 5.2 or 6.0, and ~0.5 seconds using IPP. Why isn't Intel's code faster? Is it simply that it has not been optimized for Xeon processors yet? It is disappointing to get at best a factor of 2 from parallelized, vendor-tuned code. Also, IPP doesn't seem to support either Fortran calls or parallelization, both of which I need. I asked Intel's Premier Support about this a couple of weeks ago but haven't received an explanation. In general, how many floating-point operations per clock cycle should you get on a Xeon? Isn't it 4 per cycle, i.e. 8 GFLOPS per processor at 2 GHz? I'm certainly not seeing that with Intel's libraries.
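To put numbers on this, here is a rough sketch that converts the timings above into approximate flop rates. It rests on several assumptions that are not stated in the measurements themselves: that the transforms are complex, that "1k" means 1024 points, that each FFT costs the nominal 5 N log2(N) floating-point operations (the convention FFTW's benchmarks use, not an exact operation count), and that the times are total wall-clock times for all 20,000 transforms. The function names are made up for illustration.

```python
import math

def fft_flops(n):
    """Nominal flop count for one complex FFT of length n.

    Uses the 5 * n * log2(n) convention (an assumption, not a measured count).
    """
    return 5.0 * n * math.log2(n)

def achieved_gflops(num_ffts, n, seconds):
    """Approximate sustained rate in GFLOPS for num_ffts transforms of length n."""
    return num_ffts * fft_flops(n) / seconds / 1e9

CLOCK_GHZ = 2.0  # clock rate of one processor, from the post

# Approximate timings quoted above, in seconds for 20,000 transforms.
timings = {"FFTW": 1.0, "MKL": 0.75, "IPP": 0.5}

for name, secs in timings.items():
    g = achieved_gflops(20000, 1024, secs)
    print(f"{name}: {g:.2f} GFLOPS, {g / CLOCK_GHZ:.2f} flops per cycle")
```

Under those assumptions this works out to roughly 1.0, 1.4, and 2.0 GFLOPS, or about 0.5, 0.7, and 1.0 flops per clock cycle of a single 2 GHz processor. If the transforms are real rather than complex, the nominal count and the resulting rates would be about half of that.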