IPP FFT performance not improved with multiple threads

I have a problem with the FFT function ippsFFTFwd_CToC_32fc in IPP 7.0. The FFT length is 2^19. According to ThreadedFunctionsList.txt, ippsFFTFwd_CToC_32fc is threaded.

I run it on a 12-core machine (L5640, 2x6) through Parallel Studio with Visual Studio 2010, under 64-bit Windows Server 2008.

I see that only one core is working, even though I did everything described in the documentation.

By comparison, the direct FIR function parallelizes very well.

Can you help me with the FFT?


Hello,

This looks like a problem we discussed in the forum before. Here are some comments from the function expert on the performance:

1) For rather small FFT orders (below roughly 19, depending on the platform's cache size), the FFT function uses a memory buffer approximately equal to the vector length. For such orders there is therefore no performance difference between the in-place and out-of-place cases: the FFT is calculated in the buffer and the result is then copied to the destination, so for in-cache cases it does not matter whether the copy goes to the src or the dst vector. For rather large orders (>19) the in-place version is faster, because internally the FFT uses a smaller buffer (less than the input vector length). I think the HDD case should not be discussed here.

2) The FFT is threaded only for cases that fit into a shared L2 cache, only on Core2 CPUs, and only on 2 threads. For small orders the OpenMP overhead is greater than the benefit; for large (out-of-cache) orders, memory effects hurt performance. So the customer's investigation is right: there is no threading for order 19 and above.
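
To put numbers on the in-cache versus out-of-cache distinction above, here is a minimal sketch in plain C (no IPP calls) that estimates how much data one transform touches. The helper names are hypothetical, and the work buffer is assumed to be about one vector long, which per point 1 is only approximate and is an overestimate for large orders. The cache size is a parameter you have to supply for your own CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one complex single-precision vector of length 2^order.
   An Ipp32fc element is a pair of 32-bit floats, i.e. 8 bytes. */
size_t fft_vector_bytes(int order)
{
    assert(order >= 0 && order < 28);  /* keep the shift well within size_t */
    return ((size_t)1 << order) * 2 * sizeof(float);
}

/* Rough estimate of the data touched by one transform:
   the source vector, the destination vector when out-of-place,
   and a work buffer ASSUMED to be about one vector long. */
size_t fft_working_set_bytes(int order, int in_place)
{
    size_t v = fft_vector_bytes(order);
    size_t vectors = in_place ? 1 : 2;
    return (vectors + 1) * v;
}

/* Returns 1 if the estimated working set fits in cache_bytes, else 0. */
int fits_in_cache(int order, size_t cache_bytes, int in_place)
{
    return fft_working_set_bytes(order, in_place) <= cache_bytes;
}
```

For order 19, fft_vector_bytes(19) is 4194304 bytes (4 MiB) per vector, so with a hypothetical 6 MiB shared cache fits_in_cache(19, 6 * 1024 * 1024, 1) returns 0 even in-place, while order 16 (512 KiB per vector) fits comfortably. This is only a back-of-the-envelope check; the actual threshold inside IPP depends on the library's internal buffer sizes.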

Thanks,
Chao
