cluster fft function slow, why?

Hello everyone,

I wrote a simple program to test the efficiency of the Cluster FFT functions built into MKL. It simply performs a 2D forward FFT followed by a 2D backward FFT. The result of my test was okay.

However, running on 32 CPUs is slower than running on 16 CPUs, and running on 1 CPU is always faster still (ifort v9.0 + MKL 9.1.021 + MPICH 1.2.7). What's the problem? Does it waste too much time distributing and gathering data? If so, why should I use the Cluster FFT functions?

Hi Raemon,

I'm moving this over to the Intel MKL forum. They should be able to help you out.

Regards,
~Gergana

Hi Raemon,

First of all please look at the following article:

http://software.intel.com/en-us/articles/mkl-fft-performance-using-local-and-distributed-implementation

You may find answers to your questions there.

Otherwise please tell me:

- which interconnect is used?

- what size are the data?

- are you using pure MPI parallelization (i.e. link with sequential library or set OMP_NUM_THREADS=1)?

Anyway, you should consider upgrading, because the 2D CFFT performance has been noticeably improved since MKL 9.1.

Best regards,

-Vladimir

Hi, Vladimir:

Thanks for your reply. Here's my configuration:

Input data:

2D complex, double precision, 4096 X 4096 grid points

Link line file:

/opt/mpich/intel/bin/mpif77 -O3 -132 -o $1 $1.f mkl_cdft.f90 cdft_example_support.f90 -L/opt/intel/cmkl/9.1.021/lib/em64t/ -lmkl_cdft -lmkl_blacs -lmkl_em64t -lguide -lpthread -lm

As for the interconnect and further details, please see the description of the SIRAYA cluster.

Here are some of my tests:

Data size      nodes:ppn   elapsed time (s)
4096 x 4096    4:4         17
4096 x 4096    8:4         28
4096 x 4096    16:4        59

Furthermore, I have a two-node cluster of my own (4 cores per node) with the latest Intel compiler and MKL installed, and it shows similar behavior.

Now I am wondering whether there is something wrong with my code (I am new to MKL and MPI), or whether distributing and gathering the data really takes a significant amount of time. Is there an issue I have not noticed? (I am not an expert. Any advice is welcome!)

Best Regards,

Raemon

Hello,

I think there are some bugs in my code. I'll fix them and then post my newest results here.

Hi,

I've rechecked my code and could not find anything wrong with it.

Cluster FFT is still slow. Any idea?

Raemon,

As Vladimir mentioned above, 2D CFFT performance has improved noticeably since MKL 9.1, so we recommend that you check this problem with the latest version of MKL (10.2). You can get the evaluation version (valid for 30 days) and check whether the problem is still there. Please let us know the results.

--Gennady

Raemon,

This SIRAYA cluster has one major drawback (apart from the fact that it is not using the latest Intel CPUs :)): the interconnect has only 1 Gigabit/s of bandwidth per node. You should not expect too much speed from such networks.

Another thing that is very important to understand is that increasing the number of compute nodes while keeping the problem size the same will (after some break-even point) naturally degrade performance, because the data must be sent in smaller chunks and the per-message latency starts to dominate. This is what you observe in your experiments.

Finally, the problem size (4096x4096) is too small for Cluster FFT to show its advantage. Such "small" transforms are best done on one node using MKL's DFTI set of functions. So, if your application is going to do FFTs on matrices which fit well into one node's local memory, then your choice is DFTI (without the DM suffix). However, if your real data is going to far exceed the local memory of one node, then Cluster FFT is the way to go. In the latter case, please increase the data sizes to something more realistic. In the SIRAYA case, Cluster FFT would demonstrate the best results if the total number of data points were somewhere near the number of nodes x 2^26 (note that your application may need memory for its own purposes, so the actual number may have to be lowered).
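As a rough illustration of that sizing rule, the sketch below compares the 4096x4096 test case with the suggested 2^26 points per node. It assumes the grid is split evenly across nodes and that each point is a double-precision complex value (16 bytes); the function names are made up for this example and are not part of MKL.

```python
# Sketch of the sizing rule quoted above: roughly 2**26 points per node.
BYTES_PER_POINT = 16             # double-precision complex = 2 x 8 bytes
TARGET_POINTS_PER_NODE = 2**26   # figure quoted in the post for SIRAYA


def points_per_node(nx: int, ny: int, nodes: int) -> float:
    """Grid points each node holds when an nx-by-ny grid is split evenly."""
    return nx * ny / nodes


def gib(points: float) -> float:
    """Memory in GiB needed to store this many complex double points."""
    return points * BYTES_PER_POINT / 2**30


for nodes in (4, 8, 16):
    local = points_per_node(4096, 4096, nodes)
    fraction = local / TARGET_POINTS_PER_NODE
    print(f"{nodes:2d} nodes: {local:.0f} points/node, "
          f"{gib(local):.4f} GiB, {fraction:.4f} of the suggested size")
```

Under these assumptions, 2^26 points correspond to 1 GiB of data per node, while the 4096x4096 grid gives each of 4 nodes only 1/16 of that, and each of 16 nodes only 1/64.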

I hope the above info will help you squeeze the most out of this cluster.

Best regards,

-Vladimir

Hi, Gennady

I do not have permission to install the newest version of MKL on the SIRAYA cluster. I used to use the FFT "subroutines" inside MKL; the newer FFT "functions" really are faster.

I think the problem is mainly due to the interconnect bandwidth. I am thinking of ways to run more tests, and when I have more results I'll post them here. :)

Best Regards,

Raemon

Hi, Vladimir

Here's my situation. I am solving a PDE problem and I have to perform the FFT about a million times.

Originally I performed these tasks on PCs using the DFTI functions (quad-core, threaded by MKL). My input data range from about 1024 x 1024 to 4096 x 4096. I have been wondering whether moving to a cluster would be faster, but things are more complicated than I expected, and I am still evaluating whether it is worthwhile.

By the way, I've read the article "MKL FFT performance comparison of local and distributed-memory implementations" and I did not know about the parameter KMP_AFFINITY before. I'll turn the setting KMP_AFFINITY=scatter on and see if it gets faster on my PCs. Thanks.

Best Regards,

Raemon

Hi,

I've run some further tests on some other machines.

In the end I decided to use the DFTI functions for my problem. I have another question: some parts of my current program are threaded with OpenMP, whereas the FFT part is threaded by MKL's internal threading.

I've read this and it indicates that using MKL internal threading and OpenMP together will slow down performance. Is that true in my case? Should I parallelize the FFT part using OpenMP instead of MKL internal threading (which I think also uses OpenMP)?

By the way, I found that with an input of 8192 x 8192 (double precision, complex), the memory usage is 2 GB.

However, by my own calculation it should need only 1 GB ((8192*8192)*16/(1024*1024*1024) = 1).
So what is the other 1 GB for? Is it reserved as work space or something?

Regards,

Raemon

Hi Raemon,
1) Yes, that's true for your case. But I have to repeat what Vladimir mentioned above: the problem size you are trying to solve with the cluster version of FFT is too small to take advantage of it.

2) Memory consumption for CFFT: yes, this is a known problem and we are going to fix it in one of the nearest releases.

We will inform you as soon as the release is available.

Regards, Gennady

Thanks for your reply, Gennady

I have one last question. The following results were obtained on a cluster that does not have an interconnect card.

Data size        nodes:ppn   elapsed time (s)
16384 x 16384    1:1         58.29
16384 x 16384    1:8         21.01
16384 x 16384    4:8         34.34
16384 x 16384    8:8         29.35

(Data type: double precision, complex)
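To make the scaling easier to see, here is a small sketch that turns the timings above into speedup and parallel efficiency relative to the 1:1 run. It only re-uses the numbers from the table; the helper name `scaling` is made up for this example.

```python
# Timings copied from the table above: (nodes, ppn) -> elapsed seconds.
timings = {(1, 1): 58.29, (1, 8): 21.01, (4, 8): 34.34, (8, 8): 29.35}


def scaling(timings, baseline=(1, 1)):
    """Return {(nodes, ppn): (processes, speedup, efficiency)}."""
    t_base = timings[baseline]
    rows = {}
    for (nodes, ppn), elapsed in timings.items():
        procs = nodes * ppn
        speedup = t_base / elapsed        # how much faster than the baseline
        rows[(nodes, ppn)] = (procs, speedup, speedup / procs)
    return rows


rows = scaling(timings)
for (nodes, ppn), (procs, speedup, eff) in rows.items():
    print(f"{nodes}:{ppn}  procs={procs:2d}  "
          f"speedup={speedup:.2f}  efficiency={eff:.3f}")
```

The single-node 1:8 run gives the best speedup of the four; both multi-node runs fall below it even though they use 4 and 8 times as many processes.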

The cluster's configuration (Each node):

CPU: Intel X5450 Processor 3.0GHz Quad core X2
RAM: 16GB PC2-5300 667MHz FBD 240-pin ECC DDR2-SDRAM
NETWORK: 2 Gigabit Ethernet Network, without interconnect card
HD: 146GB 15k rpm SAS X2

The elapsed time is determined by the process time of performing "DftiComputeForwardDM" (I did not count the time distributing and gathering data)

My question is: why is the nodes:ppn=4:8 case slower than the nodes:ppn=1:8 case?

Is that due to the bandwidth of the interconnect? Why?

Or is a calculation of this size simply too small to be efficient with the Cluster FFT functions? (I think it is large enough.)

Best regards,

Raemon

I think I have to adjust my previous question.

What I'd like to ask is whether the function DftiComputeForwardDM needs to send some data from one node to another? (If so, that will slow down the performance I think) Or is it just asking each node to perform FFT locally without communicating with other nodes?

Regards,

Raemon

Raemon,

The answer to your "previous question" is: DftiComputeForwardDM does need to send data between the MPI processes (and hence between the nodes).

Of course, for small FFTs this gives worse performance than simply broadcasting the data to each node and performing identical computations locally on each process.
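To get a feeling for how much this communication can cost, here is a back-of-the-envelope sketch. It is only a model under stated assumptions, not a description of MKL internals: it assumes the distributed 2D FFT needs one all-to-all exchange (a global transpose) in which every node sends all of its local data except the part it keeps, that each node has a 1 Gbit/s link, and that latency and intra-node traffic are ignored. The function name `transpose_seconds` is made up for this example.

```python
BYTES_PER_POINT = 16  # double-precision complex = 2 x 8 bytes


def transpose_seconds(nx, ny, nodes, link_bits_per_s=1e9):
    """Estimated time for one all-to-all exchange (assumed model, see text).

    Each node holds 1/nodes of the grid and sends (nodes - 1)/nodes of
    that local slab over its own link; latency is ignored.
    """
    total_bytes = nx * ny * BYTES_PER_POINT
    local_bytes = total_bytes / nodes
    sent_bytes = local_bytes * (nodes - 1) / nodes
    return sent_bytes / (link_bits_per_s / 8)  # bits/s -> bytes/s


for nodes in (1, 4, 8):
    t = transpose_seconds(16384, 16384, nodes)
    print(f"{nodes} node(s): about {t:.2f} s per exchange")
```

Under these assumptions a single exchange of the 16384 x 16384 grid on 4 nodes already takes several seconds, which is not negligible next to the measured compute times, while a single node pays nothing at all for network traffic.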

Best regards,

-Vladimir
