1D MKL FFT Multithread Use

Hi, March 28, 2011

I want to use the 1D MKL FFT (w_mkl_10.3.2.154, w_ccompxe_2011.2.154) in a multi-threaded application. I noticed that the FFT does not run multithreaded.

For example, I am running timing tests with a 2^20-point FFT, and I found that it takes about 28 milliseconds for a forward or backward transform.

I get this timing value whether I use 1 CPU or 8 CPUs.

Does anyone have experience with 1D FFTs, and could you share your FFT code with me? Perhaps I am not calling the primitives correctly.

My calls are shown below, where n = 2^20 and Exy is the double-precision complex array.

type(DFTI_descriptor), pointer :: desc_handle

integer :: status

complex*16 Exy(N_Bitpnt),Exy2(N_Bitpnt)

status = DftiFreeDescriptor(Desc_Handle)
status = dfticreatedescriptor(desc_handle, 36, 32, 1, n)
status = dfticommitdescriptor(desc_handle)
status = DftiComputeForward(Desc_Handle,Exy)

Thanks,


Hi,

Did you link your test with the Intel threading layer together with the OpenMP library?
Please check your link line with http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

For out-of-place double precision 2^20, I see the following performance
on my machine using MKL 10.3.3 (intel64):

- with env MKL_NUM_THREADS=1 (or OMP_NUM_THREADS=1)
Problem: 1048576, setup: 13.15 ms, time: 21.97 ms, ``gflops'': 4.7736

- with env MKL_NUM_THREADS=8 (or OMP_NUM_THREADS=8)
Problem: 1048576, setup: 46.12 ms, time: 11.44 ms, ``gflops'': 9.1675

For in-place double precision 2^20 I can see the following performance:

- with env MKL_NUM_THREADS=1 (or OMP_NUM_THREADS=1)
Problem: i1048576, setup: 10.60 ms, time: 20.71 ms, ``gflops'': 5.0619

- with env MKL_NUM_THREADS=8 (or OMP_NUM_THREADS=8)
Problem: i1048576, setup: 347.00 us, time: 6.64 ms, ``gflops'': 15.788
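
The ``gflops'' values above appear consistent with the common FFT convention of counting 5 N log2(N) floating-point operations per transform. That convention is an inference from the numbers, not something stated in the benchmark output. A minimal sketch in plain Fortran (no MKL needed) that recomputes the rates and the 1-thread to 8-thread speedups from the reported times:

```fortran
program fft_rate_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2**20      ! transform length used in the thread
  real(dp) :: flops

  ! Assumed operation count: 5 * N * log2(N) (a convention, not from the post)
  flops = 5.0_dp * real(n, dp) * log(real(n, dp)) / log(2.0_dp)

  ! Times in milliseconds are copied from the benchmark lines above
  print '(a,f8.4)', 'out-of-place, 1 thread : ', rate(21.97_dp)
  print '(a,f8.4)', 'out-of-place, 8 threads: ', rate(11.44_dp)
  print '(a,f8.4)', 'in-place,     1 thread : ', rate(20.71_dp)
  print '(a,f8.4)', 'in-place,     8 threads: ', rate(6.64_dp)

  ! Speedup = single-thread time / multi-thread time
  print '(a,f6.2)', 'out-of-place speedup   : ', 21.97_dp / 11.44_dp
  print '(a,f6.2)', 'in-place speedup       : ', 20.71_dp / 6.64_dp

contains

  ! Convert a time in milliseconds to a rate in gflops
  function rate(t_ms) result(g)
    real(dp), intent(in) :: t_ms
    real(dp) :: g
    g = flops / (t_ms * 1.0e-3_dp) / 1.0e9_dp
  end function rate

end program fft_rate_check
```

With these times the speedup is roughly 1.9x for the out-of-place case and roughly 3.1x for the in-place case. Setup times are excluded.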

Thanks,
-- Victor

Hi Victor,

The timing that I cited excludes set-up times. So I guess your PC is a bit faster than mine.

I did link with the libraries as suggested. My compile and link lines are shown below.

(Is mkl_dfti.f90 for multi-threaded use?)

ifort -c modules.f mkl_dfti.f90
ifort -extend_source -nowarn -align -Qzero -QxSSE2 -Qsave -Qopenmp -MT -Qmkl -c *.f
ifort -MT *.obj mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib /Qopenmp

Would it be possible to show me how you linked your code?

My code is written in Fortran.

I wonder if there is an issue in the way I am calling the MKL FFTs.

Is there any specific way of setting up the MKL calls?

I call and time the forward and backward FFT with:

etime_in1 = etime(rtm)

call zzfft(-1,n,1.d0,mctime,mctime,Table,Wsave,ISYS)

call zzfft(1,n,1.d0,mctime,mctime,Table,Wsave,ISYS)

etime_out1 = etime(rtm)

I set up the 1D FFT with the following:

if(ndir.eq.0)then
status = DftiFreeDescriptor(Desc_Handle)
status = dfticreatedescriptor(desc_handle, 36, 32, 1, n)
status = dfticommitdescriptor(desc_handle)
end if

c the forward FFT is called with Exy, which is complex*16

if(ndir.eq.-1)then
status = DftiComputeForward(Desc_Handle,Exy)
end if

if(ndir.eq.1)then
status = DftiComputeBackward(Desc_Handle,Exy)
end if

Thanks

Hi Victor,

Please note that I am running ia32 on a Windows XP x64 OS.

Hi Victor,

I think I found the problem. I wasn't initializing the FFT with MKL_NUM_THREADS set to the number of CPUs; i.e., I was always setting MKL_NUM_THREADS=1 for the initialization step,

even though I was setting MKL_NUM_THREADS > 1 for the actual forward or backward FFT operation.
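
The fix described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code actually used in this thread. It assumes the thread count is set through the MKL service routine `mkl_set_num_threads` rather than the MKL_NUM_THREADS environment variable, and it uses an assumed thread count of 8. It also uses the named constants DFTI_DOUBLE and DFTI_COMPLEX from the MKL_DFTI module, which I believe correspond to the literal values 36 and 32 used in the earlier snippets. It requires MKL and its threaded layer at link time.

```fortran
program fft_threads_sketch
  use MKL_DFTI                           ! module built from mkl_dfti.f90
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2**20        ! transform length from the thread
  integer, parameter :: nthreads = 8     ! assumed thread count (hypothetical)
  type(DFTI_DESCRIPTOR), pointer :: desc_handle
  complex(dp), allocatable :: exy(:)
  integer :: status

  allocate(exy(n))
  exy = (1.0_dp, 0.0_dp)                 ! placeholder input data

  ! Set the thread count BEFORE the descriptor is created and committed,
  ! so that the initialization step sees the same count as the compute step.
  call mkl_set_num_threads(nthreads)

  status = DftiCreateDescriptor(desc_handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'

  status = DftiCommitDescriptor(desc_handle)
  if (status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  ! In-place forward and backward transforms on the same array
  status = DftiComputeForward(desc_handle, exy)
  if (status /= DFTI_NO_ERROR) stop 'DftiComputeForward failed'

  status = DftiComputeBackward(desc_handle, exy)
  if (status /= DFTI_NO_ERROR) stop 'DftiComputeBackward failed'

  ! Free the descriptor only after it has been created and used
  status = DftiFreeDescriptor(desc_handle)
  deallocate(exy)
end program fft_threads_sketch
```

With the default scale factors the backward transform is unnormalized, so after the forward and backward pair the array holds the input multiplied by n.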

However, the FFT speedup is only 2x for a 2^18 FFT and only 33% for 2^20, while you are showing a 300% improvement for 2^20.

Do you know why this might happen? Is it CPU or cache dependent?

I am using an ia32 machine with 2 quad-core 5590 3.3 GHz CPUs. The L2 cache in my machine is 12 MB.

Thanks.
