While reading the MKL description, I could not tell whether the MKL FFT is based on SSE2 or not. This can affect performance dramatically.
Most likely, the double-precision FFT functions in the p4 library use SSE2. You could test the p4 version against the generic-architecture version on your application.
The Intel Math Kernel Library Reference Manual, which is in the doc/mklman61.pdf file in the Intel MKL 6.1 installation, does state on page 1-4 that the DFTs are optimized to take advantage of processor-specific SIMD extensions. To see the speedup for your application, the suggestion to run it with the generic processor code and compare it to the code for your processor is a good one. Note that the DFT functions are the ones that continue to be optimized for new processors (as opposed to the FFTs).
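For anyone who wants to try that comparison, here is a minimal timing sketch in Python. It assumes you have already built your application twice, once against the generic library and once against the p4 library; the executable names in the usage comment are hypothetical placeholders, and the helper names best_time and speedup are just for illustration. It only measures wall-clock time of the whole program, so it says nothing about which instructions the FFT actually uses.

```python
import subprocess
import sys
import time


def best_time(cmd, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` runs of cmd."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the program exits with a non-zero status
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(generic_cmd, tuned_cmd, repeats=5):
    """Ratio of generic time to tuned time; values above 1 mean the tuned build is faster."""
    return best_time(generic_cmd, repeats) / best_time(tuned_cmd, repeats)


# Usage (hypothetical executable names, replace with your own builds):
#   print(speedup(["./myapp_generic"], ["./myapp_p4"]))
```

Taking the minimum over several runs reduces noise from other processes. If the FFT is only a small part of the run time, timing just the transform calls inside the application will give a clearer picture.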
The power-of-two FFTs do use the SIMD hardware, both for single-precision and double-precision transforms. Up until now the mixed-radix transforms (DFT) have not used the SIMD hardware for non-power-of-two transforms, leading to a significant difference in performance between, say, a 1024-point transform and a 900-point or 1152-point transform. With the 7.0 beta software, there are dramatic improvements in those non-power-of-two transform lengths.
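To check which category a given transform length falls into, a small Python sketch like the following can help. It is not part of MKL; the function names are made up for illustration. It simply reports whether a length is a power of two and, if not, which prime factors a mixed-radix transform would have to handle.

```python
def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and n & (n - 1) == 0


def prime_factors(n):
    """Return the prime factorization of n as {prime: exponent} by trial division."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        # whatever remains is a prime larger than the square root of the original n
        factors[n] = factors.get(n, 0) + 1
    return factors


for length in (1024, 900, 1152):
    print(length, is_power_of_two(length), prime_factors(length))
# 1024 True {2: 10}
# 900 False {2: 2, 3: 2, 5: 2}
# 1152 False {2: 7, 3: 2}
```

So 1024 is a pure radix-2 length, while 900 and 1152 include factors of 3 (and 5 for 900) and therefore go through the mixed-radix path.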