I wrote a simple program to test the performance of the cluster FFT functions built into MKL. It simply performs a 2D forward FFT followed by a 2D backward FFT. The result of my test was okay.
However, running with 32 CPUs is slower than running with 16 CPUs, and running with 1 CPU is always the fastest. (ifort v9.0 + MKL 9.1.021 + MPICH 1.2.7) What's the problem? Does it waste too much time distributing and gathering data? If so, why should I use the cluster FFT function?
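
For reference, here is a minimal single-process sketch of the kind of round-trip test described above: a 2D forward transform, then a 2D backward transform, with a check that the input is recovered and a wall-clock timing. It is not the original Fortran/MKL code and does not call the cluster FFT interface. It uses a naive DFT from the Python standard library, and the names `dft_1d`, `dft_2d`, `round_trip_error`, and `make_grid`, as well as the grid sizes, are made up for illustration.

```python
import cmath
import time


def dft_1d(values, sign):
    """Naive O(n^2) DFT. sign=-1 is forward, sign=+1 is the unnormalized backward transform."""
    n = len(values)
    return [
        sum(values[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def dft_2d(grid, sign):
    """2D DFT computed as 1D transforms over the rows, then over the columns."""
    row_pass = [dft_1d(row, sign) for row in grid]
    # zip(*row_pass) yields the columns; transform each, then transpose back.
    col_pass = [dft_1d(list(col), sign) for col in zip(*row_pass)]
    return [list(row) for row in zip(*col_pass)]


def round_trip_error(grid):
    """Forward then backward 2D transform; return the largest deviation from the input."""
    n_rows, n_cols = len(grid), len(grid[0])
    forward = dft_2d(grid, -1)
    backward = dft_2d(forward, +1)
    scale = n_rows * n_cols  # the unnormalized backward transform needs 1/(rows*cols)
    return max(
        abs(backward[r][c] / scale - grid[r][c])
        for r in range(n_rows)
        for c in range(n_cols)
    )


def make_grid(n_rows, n_cols):
    """Deterministic complex test data (arbitrary values, for illustration only)."""
    return [[complex(r + 1, c - r) for c in range(n_cols)] for r in range(n_rows)]


if __name__ == "__main__":
    grid = make_grid(8, 16)
    start = time.perf_counter()
    err = round_trip_error(grid)
    elapsed = time.perf_counter() - start
    print(f"max round-trip error: {err:.3e}, elapsed: {elapsed:.6f} s")
```

In a distributed version of this test, it may help to time the data distribution and gathering separately from the transform itself, so the two costs can be compared directly.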