High CPU usage and Intel® IPP threaded functions

About 15-20% of the IPP functions are threaded internally with OpenMP. A list of the threaded primitives in the IPP library is provided in the ThreadedFunctionsList.txt file located in the library's doc directory. The quickest way to multi-thread an Intel IPP application is to use this built-in OpenMP threading directly: when you call any of the threaded functions on a multi-core machine, your program runs in parallel automatically. But some users have noticed that CPU usage is much higher than expected. This article lists several scenarios where this happens and provides solutions.
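You can query (and later override) how many internal threads the threaded IPP libraries will use. Below is a minimal sketch, assuming the classic threaded IPP libraries are linked and ipp.h is on the include path; the number reported depends on your machine and IPP version:

#include <stdio.h>
#include <ipp.h>

int main(void)
{
    int nThreads = 0;

    ippInit();                    // select the optimized code path for this CPU
    ippGetNumThreads(&nThreads);  // how many OpenMP threads the threaded IPP primitives may use
    printf("IPP internal threads: %d\n", nThreads);

    // ippSetNumThreads(1);       // would force the threaded primitives to run single-threaded
    return 0;
}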

Problem :
Scenario 1: from the comments under the article Deprecated API list since Intel IPP v6.0
I2R D&T Team: I have noticed that calling the functions ippResize or ippResizeCenter recursively and sequentially with a small delay (< 100 ms) in between uses less CPU as compared to ippResizeSqrPixel. ......, I have tried using the single algorithm or parallel algorithm introduced in the documentation, but it still uses a lot of CPU in both cases. I cannot figure out what's wrong with the ippResizeSqrPixel function. Has anyone faced the same problem?

Scenario 2: mdl61: We have noticed a similar problem and traced it to the internal threading mechanism. We have implemented our own threading external to ippResizeSqrPixel, and this conflicts with the internal threading, causing an increase of 40% CPU utilization, mostly in the kernel. We avoid this problem by calling ippSetNumThreads(1) before using ippResizeSqrPixel.

Scenario 3: Upgrading from IPP 5.3 to IPP 6.0 caused my CPU usage to go from 3% to 88% on a quad-core machine. The ippSetNumThreads(1) function resolves the problem. I'm calling ippiCFAToRGB_8u_C1C3R at an interval of 200 ms in a separate thread. The pseudocode is like:

// ippSetNumThreads(1);   // uncommenting this line disables IPP internal threading and fixes the problem
for (i = 0; i < 100; i++)
{
    ippiCFAToRGB_8u_C1C3R(pRAWData, srcRoi, srcSize, img_orig.step, pRGBData1, img_debayer.step, grid, 0);
    Sleep(200);   // idle for about 200 ms between calls
}

Root Cause : 

Looking at the scenarios above, they share the same characteristics:
1. They use IPP built-in threaded functions in a multi-threaded environment (a multi-core machine and/or the application's own threads). For example, ippiResize is not threaded, but ippResizeSqrPixel is threaded.
2. They call the threaded function intermittently, for example at an interval of 200 ms.

In general, it is true that higher CPU usage usually corresponds to better performance.
For example, consider calling an IPP threaded function repeatedly on a 4-core machine:
for (i = 0; i < 100; i++)
{
    ippiCFAToRGB_8u_C1C3R(pRAWData, srcRoi, srcSize, img_orig.step, pRGBData1, img_debayer.step, grid, 0);
}
Without OpenMP threading, CPU usage is 25% (1 of 4 cores) and each call takes 10 ms; with OpenMP threading on, CPU usage is 100% (all 4 cores) and each call takes 2.5 ms.
So for highly optimized functions, 25% CPU usage when one thread is used, and 100% CPU usage when all cores are used, is the expected behavior.
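To reproduce this comparison on your own machine, a small timing harness like the sketch below can help. It assumes a large single-channel image, uses ippiCopy_8u_C1R purely as a stand-in for the threaded primitive you care about (substitute your own call and data), and uses omp_get_wtime() for wall-clock timing:

#include <stdio.h>
#include <ipp.h>
#include <omp.h>

int main(void)
{
    IppiSize roi = { 4096, 4096 };
    Ipp8u* pSrc = ippsMalloc_8u(roi.width * roi.height);   // buffer contents do not matter for timing
    Ipp8u* pDst = ippsMalloc_8u(roi.width * roi.height);
    double t0, t1;
    int i;

    ippInit();
    // ippSetNumThreads(1);   // uncomment to compare against the single-threaded case

    t0 = omp_get_wtime();
    for (i = 0; i < 100; i++)
        ippiCopy_8u_C1R(pSrc, roi.width, pDst, roi.width, roi);   // stand-in for your threaded IPP call
    t1 = omp_get_wtime();

    printf("average per call: %.3f ms\n", (t1 - t0) * 1000.0 / 100);

    ippsFree(pSrc);
    ippsFree(pDst);
    return 0;
}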

But if the threaded functions are called intermittently, e.g. 30 times a second, CPU usage can rise from 3% to 88%, and the overall performance drops so much that it becomes impractical to use IPP threaded functions in a real application.

The problem is actually caused by a feature of the OpenMP thread execution mode. As noted above, IPP threaded functions use OpenMP threading internally (libiomp5*.lib/dll). When an IPP threaded function runs in OpenMP threads, the worker threads (libiomp5* library => kmp_launch_worker) stay active after finishing their work, spin-waiting for the next piece of work, in order to get good performance. This mechanism is efficient if the threads are reserved solely for OpenMP execution. But many applications call the threaded function intermittently and run non-threaded code (e.g. Sleep()) between the parallel regions. After each IPP call completes its parallel region, the OpenMP worker threads spin-wait (the default block time is 200 milliseconds) before going to sleep. If the function is called, say, 30 times a second, only about 33 ms pass between calls, so the workers never reach the end of the 200 ms block time; this spin-wait keeps the cores busy, hence the abnormally high CPU usage.

Here is more information about the OpenMP thread execution mode:
/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/optaps/common/optaps_par_libs.htm

In addition, calling an IPP threaded function from an external thread may cause high CPU usage because of the overhead of nested parallelism. See the article OpenMP and the Intel® IPP Library for more details.
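As a sketch of that pattern, the example below does the threading externally with OpenMP and disables IPP internal threading first so the two do not nest. ippiCopy_8u_C1R is only a stand-in for whatever threaded primitive you actually call, and the strip arithmetic assumes the image height divides evenly by the number of strips:

#include <ipp.h>
#include <omp.h>

void process_strips(const Ipp8u* pSrc, int srcStep, Ipp8u* pDst, int dstStep,
                    int width, int height, int nStrips)
{
    int stripH = height / nStrips;   // assumes height % nStrips == 0 for brevity
    int i;

    ippSetNumThreads(1);             // IPP runs single-threaded inside each of our threads

    #pragma omp parallel for
    for (i = 0; i < nStrips; i++) {
        IppiSize roi = { width, stripH };
        ippiCopy_8u_C1R(pSrc + i * stripH * srcStep, srcStep,   // stand-in for the threaded IPP primitive
                        pDst + i * stripH * dstStep, dstStep, roi);
    }
}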

Resolution : 

1. Avoid the problems (intermittent IPP function calls and nested threads) completely by calling ippSetNumThreads(1), which disables IPP internal threading.

2. If you still want to keep the IPP built-in threading functionality, you can change the IPP OpenMP thread execution mode by
setting the environment variable KMP_BLOCKTIME=0 system-wide,
or setting it just before running your application:
>set KMP_BLOCKTIME=0
>run.bat
or by calling the OpenMP run-time function kmp_set_blocktime(0) in your code (see the sketch after the note below).

Note: The function kmp_set_blocktime() is provided by the Intel OpenMP* run-time library libiomp*.lib/dll.
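A minimal sketch showing both resolutions in code, assuming the Intel® compiler and OpenMP run time (Intel's omp.h declares kmp_set_blocktime(); with other compilers you would declare it yourself and link against the libiomp5 run time):

#include <ipp.h>
#include <omp.h>   // Intel's omp.h declares the kmp_* extension functions

int main(void)
{
    ippInit();

    // Resolution 1: give up IPP internal threading entirely
    // ippSetNumThreads(1);

    // Resolution 2: keep internal threading, but let idle OpenMP workers sleep immediately
    kmp_set_blocktime(0);   // same effect as setting KMP_BLOCKTIME=0 in the environment

    // ... intermittent calls to threaded IPP functions no longer spin for 200 ms afterwards ...
    return 0;
}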