IPP FFT shows no improvement with multiple threads

According to ThreadedFunctionsList.txt, "ippsFFTFwd_CToC_32fc_I" is threaded. However, a simple timing loop shows no difference in execution time whether I leave the number of IPP threads at 4 (the default value per "ippGetNumThreads") or reduce it to 1 (via "ippSetNumThreads"). I have tried FFT lengths from 2^3 to 2^20, and Parallel Amplifier shows a CPU Usage of 1 regardless of the number of threads. What is going on, and what should I check?

I am running Intel IPP 6.1 dynamic libraries obtained through Parallel Studio (Composer Update 4) with Visual Studio 2008 under Windows Vista SP2 on an Intel Core2 Quad CPU Q6700 processor. I have successfully written and run other multithreaded programs using OpenMP and the Intel compiler, utilizing all four cores. Here is a fragment of the timing program, which runs the FFT repeatedly on the same data:

// Set up the FFT spec and work buffer for a length-2^powerOf2 transform
is = ippsFFTInitAlloc_C_32fc(&pSpec, powerOf2, flag, hint);
is = ippsFFTGetBufSize_C_32fc(pSpec, &bufSize);
pBuffer = bufSize ? (Ipp8u*) ippMalloc(bufSize) : NULL;

is = ippSetNumThreads(1);  // or 4, the default per ippGetNumThreads()

// Time the in-place forward FFT over many iterations on the same data
startMsec = timeGetTime();
for (long iter = 0; iter < numIter; iter++) {
    is = ippsFFTFwd_CToC_32fc_I((Ipp32fc*) x, pSpec, pBuffer);
}
finishMsec = timeGetTime();
deltaTime  = 0.001 * (finishMsec - startMsec);  // msec -> sec
perTime_mt = (deltaTime / numIter) * 1e6;       // usec per FFT call
cout << perTime_mt << endl;
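For reference, the same per-call timing pattern can be made self-contained and portable. This is a sketch only: `usec_per_call` and `dummy_work` are illustrative names, the standard `clock()` replaces `timeGetTime()`, and the callback stands in for the repeated `ippsFFTFwd_CToC_32fc_I` call.

```c
/* Minimal portable timing harness; the callback stands in for the
 * IPP call being measured (hypothetical names, not an IPP API). */
#include <time.h>

/* Time `fn` over `iters` calls; return microseconds per call. */
static double usec_per_call(void (*fn)(void *), void *arg, long iters)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        fn(arg);
    clock_t finish = clock();
    return ((double)(finish - start) / CLOCKS_PER_SEC) * 1e6 / (double)iters;
}

/* Trivial stand-in workload so the harness runs without IPP installed. */
static void dummy_work(void *arg)
{
    volatile int *count = (volatile int *)arg;
    (*count)++;
}
```

Per-call cost is just total loop time divided by the iteration count, which is exactly the computation in the fragment above.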

Paul Fischer (Intel):

That function is, indeed, marked as a threaded function. What sort of time measurements are you getting for completion of the loop and completion of each iteration of the loop? (in other words, total loop time divided by number of iterations). Are the timings identical when you change ippSetNumThreads() from 1 to 4, or are they just similar?

It's possible that the overhead associated with the loop and the OpenMP setup/teardown is overwhelming the amount of time spent doing work in the function, in which case you may not see a significant difference in execution time (possibly even a longer time with 4 threads). That might also explain why CPU usage never rises above 1, since I believe that number is a sampled result.

Thanks for your reply. My initial investigation was incomplete. With further study using Parallel Amplifier, I found that the IPP FFT I am calling does use multiple threads, but only for lengths from 2^13 through 2^17, inclusive. Even then it uses only two threads, and the improvement over single-threaded performance is marginal or inconsistent. Here is a summary of my per-FFT timing results for even powers of 2 from 12 to 20, running each length 10 times:

Length  *** MaxThreads=4 ***    *** MaxThreads=1 ***    Threads Created
        min usec    max usec    min usec    max usec    For MaxThreads=4
------  --------    --------    --------    --------    ----------------
2^12       21.80       23.40       21.80       23.40        1
2^14       93.50      156.00      124.50      125.00        2
2^16      374.00      468.00      499.00      515.00        2
2^18     2800.00     3120.00     2800.00     3120.00        1
2^20    19960.00    20280.00    19960.00    20600.00        1


So, the IPP FFT is indeed multithreaded, but only over a narrow range of lengths, and with limited benefit.

Paul Fischer (Intel):

Interesting study. Thank you. I will request further clarification from engineering.

Paul Fischer (Intel) (Best Reply):

Regarding multi-threading and the FFT functions, see the excerpt below from the IPP user's manual. The same information is available at this link:

http://software.intel.com/sites/products/documentation/hpc/ipp/ia32/index.htm

Go to the "Supporting Multithreaded Applications" chapter in the manual.

Also, please review this article from the knowledge base, which I have updated with additional information from engineering regarding your FFT observations:

OpenMP and the Intel IPP Library

Paul

- - - - Supporting Multithreaded Applications - - - -

Intel IPP Threading and OpenMP* Support

All Intel IPP functions are thread-safe, in both the dynamic and static libraries, and can be used in multithreaded applications.

Some Intel IPP functions contain OpenMP* code that provides significant performance gains on multi-processor and multi-core systems. These include functions for color conversion, filtering, convolution, cryptography, cross-correlation, matrix computation, square distance, bit reduction, and others.

Refer to the ThreadedFunctionsList.txt document, in the doc directory of the Intel IPP installation, for the list of all threaded functions.

Setting Number of Threads

The default number of threads for Intel IPP threaded libraries is equal to the number of processors in the system and does not depend on the value of the OMP_NUM_THREADS environment variable.

To set another number of threads used by Intel IPP internally, call the function ippSetNumThreads(n) at the very beginning of an application, where n is the desired number of threads (1, ...). If internal parallelization is not desired, call ippSetNumThreads(1).
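[This startup pattern can be sketched without IPP itself; `choose_num_threads` is a hypothetical helper, not part of the library, mirroring the documented behavior that the default is one thread per processor:]

```c
/* Hypothetical helper (not an IPP API): pick the thread count to pass
 * to ippSetNumThreads() once at program start. requested <= 0 means
 * "use the default", which per the manual is the processor count. */
static int choose_num_threads(int requested, int num_procs)
{
    if (requested <= 0)
        return num_procs;   /* documented default: one thread per processor */
    if (requested > num_procs)
        return num_procs;   /* avoid oversubscribing the machine */
    return requested;
}
```

An application would then call ippSetNumThreads(choose_num_threads(n, procs)) once, before its first threaded IPP call.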

Using Shared L2 Cache

Some functions in the signal processing domain are threaded on two threads for the Intel Core2 processor family, to take advantage of its shared L2 cache. These functions (single- and double-precision FFT, Div, Sqrt, etc.) achieve maximum performance when both threads execute on the same die, so that they work on the same shared L2 cache. For processors with two cores on the die, this condition is satisfied automatically. For processors with more than two cores, a special OpenMP environment variable must be set:

KMP_AFFINITY=compact

Otherwise the performance may degrade significantly.

Nested Parallelization

If a multithreaded application created with OpenMP uses a threaded Intel IPP function, that function will operate on a single thread, because nested parallelization is disabled by default in OpenMP.

If a multithreaded application created with other tools uses a threaded Intel IPP function, it is recommended to disable multithreading in Intel IPP to avoid nested parallelization and possible performance degradation.
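[One way to follow this recommendation, sketched here with pthreads: the application owns all parallelism and each worker makes only serial library calls, with ippSetNumThreads(1) called once at startup. `ipp_process_block` is a hypothetical stand-in for a per-block IPP call such as ippsFFTFwd_CToC_32fc_I.]

```c
/* Sketch: app-level threading with serial library calls per worker.
 * ipp_process_block() is a stand-in, not a real IPP function. */
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4

typedef struct {
    const float *data;
    size_t       len;
    double       result;
} Block;

/* Stand-in for a single-threaded IPP call on one block of data. */
static double ipp_process_block(const float *data, size_t len)
{
    double acc = 0.0;
    for (size_t i = 0; i < len; i++)
        acc += data[i];
    return acc;
}

static void *worker(void *arg)
{
    Block *b = (Block *)arg;
    b->result = ipp_process_block(b->data, b->len);
    return NULL;
}

/* Partition the input across NUM_WORKERS threads; each thread makes
 * only serial calls, so all parallelism belongs to the application. */
double process_in_parallel(const float *data, size_t len)
{
    pthread_t tid[NUM_WORKERS];
    Block blocks[NUM_WORKERS];
    size_t chunk = len / NUM_WORKERS;
    double total = 0.0;

    for (int i = 0; i < NUM_WORKERS; i++) {
        blocks[i].data = data + (size_t)i * chunk;
        blocks[i].len  = (i == NUM_WORKERS - 1) ? len - (size_t)i * chunk
                                                : chunk;
        pthread_create(&tid[i], NULL, worker, &blocks[i]);
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(tid[i], NULL);
        total += blocks[i].result;
    }
    return total;
}
```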

Disabling Multithreading

To disable multithreading, link your application with the IPP non-threaded static libraries, or build a custom SO using the non-threaded static libraries.

Hello, Paul.
I have the same problem with the FFT (IPP version 7.0), at FFT length 2^19.
I run it on a 12-core machine and see that only one core is working, and I did everything you wrote.
For comparison, the direct FIR function parallelizes very well.

Can you help me with FFT ?

Arkady
