IPP with multithreaded applications

Dear all,

we use IPP (5.3.4) within a data acquisition application. Besides some other threads it consists of two 'main' threads:
- a data acquisition thread in which images are converted to object features
- a GUI thread in which post processing takes place (e.g. file writing, image display)

Both threads make use of IPP, but neither thread uses the CPU at 100%. It seems that IPP uses local parallelism in most functions. This indeed makes the call faster (about twice) on dual / multicore machines. However it also gives an extra 20-30% processor load (on a DELL T3400 dual-core machine), compared to disabling the threading in IPP (through 'ippSetNumThreads(1);'). Spying with the Windows performance monitor one can notice that the thread context switches increase from an average of 1,000 per second to 200,000 per second.

This effect seems to limit the performance of IPP (severely). Is there a recommended strategy to minimize this effect? Can it be circumvented completely? In a test program we made 3 tests (see below):
- single threaded
- IPP used from 2 threads
- IPP used from 1 thread, with a second thread flooding the CPU completely but not using IPP

In the last 2 options one can notice that IPP is actually slower without the 'ippSetNumThreads(1);' call.

Thanks in advance.

P.s. 1: we checked that IPP 5.3.4 is correctly loaded, e.g. it is using 'ippip8-5.3.dll' on my dual core machine.
P.s. 2: code:

#include <ipp.h>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <crtdbg.h>

#pragma comment(lib, "ipps.lib")
#pragma comment(lib, "ippcore.lib")
#pragma comment(lib, "ippi53.lib")

void TestIntlIppiImpl(size_t nMax);
void TestIntlIppiImplFlood(long* pContinue);


int main()
{
   //performance will be half on dual core
   //ippSetNumThreads(1);

   //Ippi:   7.733000              single threaded     
   //Ippi:  14.389000              single threaded, with ippSetNumThreads(1)
   //Ippi:   8.640000 + 8.812000   multi threaded
   //Ippi:   7.296000 + 7.312000   multi threaded, with ippSetNumThreads(1)
   //Ippi:  16.450000              flood threaded
   //Ippi:  14.482000              flood threaded, with ippSetNumThreads(1)
   
   enum IppiThread
   {
       eItSingle,
       eItMulti,
       eItMultiFlood,
   };

   //const size_t nMax = 1000000;
   const size_t nMax = 5000000;
   
   //const IppiThread eIt = eItSingle;
   const IppiThread eIt = eItMulti;
   //const IppiThread eIt = eItMultiFlood;

   switch (eIt)
   {
   case eItSingle:
      {
         TestIntlIppiImpl(nMax);
      }
      break;

   case eItMulti:
      {
         boost::thread_group threads;
         for (int i = 0; i != 2; ++i)
         {
            threads.create_thread(boost::bind(&TestIntlIppiImpl, nMax / 2));
         }
         
         threads.join_all();
      }
      break;

   case eItMultiFlood:
      {   
         long lContinue = 1; 

         boost::thread thread1(&TestIntlIppiImplFlood, &lContinue);
         boost::thread thread2(&TestIntlIppiImpl, nMax);
        
         thread2.join();

         BOOST_INTERLOCKED_EXCHANGE(&lContinue, 0);

         thread1.join();
      }
      break;

   default:
       _ASSERT(false);
       break;
   }

   return 0;
}


//----------------------------------------------------------------------------
// Function TestIntlIppiImpl
//----------------------------------------------------------------------------
// Description  : test ippi impl.
//----------------------------------------------------------------------------
void TestIntlIppiImpl(size_t nMax)
{
    const int nWidth            = 320;
    const int nHeight           = 200;
    int       nStepSizeSource   = 0;
    int       nStepSizeTarget   = 0;
    int       nStepSizeSubtract = 0;

    IppiSize roiSize = {nWidth, nHeight};
    
    Ipp8u* pImageBufferSource   = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeSource);
    Ipp8u* pImageBufferTarget   = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeTarget);
    Ipp8u* pImageBufferSubtract = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeSubtract);
    
    ippiImageJaehne_8u_C1R(pImageBufferSource,   nStepSizeSource,   roiSize); 
    ippiImageJaehne_8u_C1R(pImageBufferTarget,   nStepSizeTarget,   roiSize); 
    ippiImageJaehne_8u_C1R(pImageBufferSubtract, nStepSizeSubtract, roiSize); 

    for (size_t n = 0; n != nMax; ++n)
    {
        ippiSub_8u_C1RSfs(pImageBufferSubtract, nStepSizeSubtract, pImageBufferSource, nStepSizeSource, pImageBufferTarget, nStepSizeTarget, roiSize, 1);
    }
    
    ippiFree(pImageBufferSubtract);
    ippiFree(pImageBufferTarget);
    ippiFree(pImageBufferSource);
}


//----------------------------------------------------------------------------
// Function TestIntlIppiImplFlood
//----------------------------------------------------------------------------
// Description  : flood cpu
//----------------------------------------------------------------------------
void TestIntlIppiImplFlood(long* pContinue)
{
    //do not use synchronisation like condition variables,
    //because they relinquish the processor

    for (;;)
    {
        if (!(*pContinue))
        {
            break;
        }
    }
}
vladimir-dudnik (Intel):

There is no magic. If your system has two cores then it is able to run only two threads simultaneously. If the number of active threads (which load the CPU) in your application is larger than the number of physically available cores, then some threads will wait for their time slice. And this will lower overall application performance. In this case we do recommend to use the single-threaded IPP libraries or to disable threading in the multithreaded IPP libraries.

On systems with a bigger number of cores there is an opportunity to balance: on a 4 or 8 core system, for example, you may enable 2 threads for IPP and use the rest of the threads for your application needs.

You just need to avoid thread oversubscription situations.

Regards,
Vladimir

Hello,

It looks like IPP is using a quite simple threading algorithm, basically dividing the workload over a number of threads. I may be wrong, but the evidence above points to this.

There are many techniques today that divide workloads in such a way that it doesn't matter if one or more cores are already saturated. They also handle the problem of unequal work loads per "work unit" (e.g. work stealing).

Does Intel have any plans to leverage any of these more modern algorithms, or will you stick with the simplistic forking used now for the near future?

IMHO this is key to good performance. I have myself turned off threading in IPP since it is too easy to disrupt unless you lock out everything except the IPP workloads.

Here's a good read for one of these algorithms (even though this one is for Java it is very informative): http://gee.cs.oswego.edu/dl/papers/fj.pdf

Cheers,
Mikael Grev

pvonkaenel:

Quoting - mikaelgrev

Internally, IPP is using OpenMP. I would recommend turning off the internal threading in favor of devising your own threading - I use TBB to thread my IPP code, but manually specifying OpenMP threading works well also. I've found that this approach, while requiring more coding, leaves me in total control over the parallel aspects of my application.

Peter

pvonkaenel:

Quoting - pvonkaenel

I forgot to mention that only about 20%-30% of IPP is threaded, instead of most of it as you mentioned. The library includes a ThreadedFunctionList.txt which outlines exactly which functions are threaded.

Quoting - Vladimir Dudnik (Intel)

Thx.

The problem is of course in the word 'oversubscription': sometimes the other thread is busy and then the cores get oversubscribed, but sometimes the thread is waiting. Disabling the use of more threads in IPP would then give a performance penalty.

Well, OpenMP is AFAIK more an easy way to fork and barrier threads than an efficient high-level algorithm to divide tasks. How to divide tasks properly is quite complex. Please read the paper I linked to for more info.

IMO, the simple divide algorithm that is used by IPP (again, as I understand it from reading other threads) is only appropriate for very simple tasks where IPP gets all the attention. It is not suitable in a larger system where threads are used extensively and unevenly, like on a desktop system where the user can run many applications, which have threads you can't control. With an algorithm like fork/join or Cilk, IPP threading would be usable for a lot more use cases.

Cheers,
Mikael Grev

pvonkaenel:

Quoting - mikaelgrev

This is why I mentioned TBB - from briefly skimming the paper you reference, it sounds like several of the Java fork/join framework facilities are implemented in TBB. This is what I use for threading my IPP based application, and I've gotten fairly good results with it. As for applications that have non-TBB threads, you do need to manage oversubscription prevention yourself, and I have no idea what to do when other applications in their own address space are also using system resources.

Let me know if you have suggestions.

Peter

Peter,

Yes, TBB seems like a similar thing.

Though I am wondering if the IPP libraries will use TBB themselves to get better performance under different circumstances. If I use TBB I guess I have to manage the threading myself. I could do that, but then it would be better if IPP was TBB-ized.

Cheers,
Mikael

Btw, I found some online video if anyone is interested. It's a good talk by Brian Goetz: http://www.infoq.com/presentations/brian-goetz-concurrent-parallel

pvonkaenel:

Quoting - mikaelgrev

I agree that TBB IPP would be nice, but I doubt it would happen since OpenMP is a compiler technology built into the Intel compiler, while TBB is a 3rd party library (Intel is the 3rd party, but still a 3rd party). I guess it doesn't hurt to ask though. In theory, I guess they could just add a second threading layer that uses TBB instead of OpenMP and allow the users to select the threading model that fits the rest of their system. I would be very interested in that, and it should not interfere with others who are already locked into the OpenMP threading layer.

Peter

Then we concur Peter!

Vladimir, what do you say, any chance?

Cheers,
Mikael

vladimir-dudnik (Intel):

Mikael,

I did not get your point. What sense do you see in all those smart and self-balancing threading algorithms you mention when we talk about IPP functions? Let's consider the ippiAdd_8u_C1R function, just for example. I do not see a more efficient way to parallelize such a simple workload than OpenMP. And some people find it useful.
But when we consider more complicated things, something like what was mentioned at the beginning of this thread, data acquisition and analysis in different parts of a parallel application, then I would completely agree that smarter threading approaches should be used to better balance system performance. But this job is not the charter of IPP functions, is it?

Regards,
Vladimir

Rob Ottenhoff:

Quoting - Vladimir Dudnik (Intel)

Hi Vladimir,
Well, that is an easy way out! First you advertise the parallelism of IPP, but when you use it in a real application like the above you say: 'Ah, that's not smart, just turn it off and do it yourself!' I think the suggestion of Peter and Mikael to enable the use of TBB (which is an Intel product, so why not?) makes sense; that way your customers have more choices than just on or off.

Regards,
Rob (btw I am a colleague of gast128)

Quoting - Vladimir Dudnik (Intel)

Vladimir,

The fork/join algorithm is the same as the generic and very old and simple divide and conquer algorithm (and related to map/reduce). It is not in any way advanced and it has extremely small overhead. Yet it is, at least in the fork/join implementation, self-balancing in a couple of ways. It is almost as simple as the OpenMP way of just dividing the workload between threads. The specific smartness of Doug Lea's fork/join is that it works almost without thread locking and barriers, which increases performance a lot. Btw, that framework will be included in Java 7.

If you haven't already, I really think you should look at the Goetz screencast linked above. It is very informative and not that advanced. Further discussion on the subject almost demands this knowledge.

IPP threading is simply not usable in a typical desktop scenario where you aren't in control of all the cores. It is however very optimized towards benchmarks. When I read the threads on this forum I also get the feeling that people don't use IPP's OpenMP and instead use their own threading algorithms.

I suggest that you add an internal benchmarking set where say 30% of the cores are busy when you do the benchmark. That way you can see the benefit that Fork/Join would give.

Even simple operations like memcpy, memset and ippiAdd_8u_C1R benefit from Fork/Join compared to just dividing the array into equal parts.

I would suggest that you at least do a study on how much you would gain.

Performance is really tricky, especially so when you mix in parallelism. But trust me, fork/join is the way to go even for simple tasks, at least when you need to play nice with other processes (which you always have to outside the lab :)

Also, as a testament that fork/join is not heavyweight, the code is only some 800 lines in Java.

Cheers,
Mikael Grev
MiG InfoCom AB

vladimir-dudnik (Intel):

Hi Rob and Mikael,

yes, IPP uses simple threading, which has been proven to work for some applications. And yes, you may want to use another threading model in your application (fortunately that is possible).

I did read the article you pointed out and generally would agree with what it states. It is just not directly applicable to IPP, because there is still no single 'ideal' threading approach which works perfectly for everyone's needs. BTW, Intel TBB uses a similar job-stealing approach, I believe. We also provide the Deferred Mode Image Processing (DMIP) layer as a part of the IPP samples package. DMIP is a pipelining and threading layer built on top of IPP that combines the performance of the IPP kernels with threading above the IPP level, chaining a computational task into a sequence of calls to IPP and subdividing the task into smaller pieces in such a way that the processed data is kept hot in the CPU's L2 cache (especially when it is reused between IPP calls).

I am not arguing against the point that current multi-core and future many-core architectures create a demand for parallel frameworks or even languages which would simplify programming of such complex systems.

Regards,
Vladimir

pvonkaenel:

Hi Vladimir,

If I understand the IPP architecture correctly, threading is a layer on top of the processing, correct? How difficult would it be to provide alternate threading layer DLLs? Currently there are _t.dll files. Could TBB threading layer DLLs be introduced as _tbb.dll in addition to the existing _t.dll files so that users can choose? I have already discovered several advantages of TBB over OpenMP, and have moved away from the _t.dll layer because of it - I don't want two threading systems to oversubscribe the system. I would like to suggest the same thing for the UMC audio-video-codecs samples.

Thanks for your consideration,
Peter

vladimir-dudnik (Intel):

Hi Peter,

the problem with TBB would be the same as for OpenMP. Not every application uses the TBB threading API. It is just not possible to provide a threaded IPP for every threading API. Instead we do recommend to use IPP threading when it makes sense (I've heard a lot of positive feedback on it) and to use the non-threaded IPP libraries when you want to have full control over threading in your application (and in this case you can choose whatever API you like).

Regards,
Vladimir

The core of the argumentation is:
With OpenMP in IPP, the processing will take longer than the version without threading if not all cores are free during the call. Fork/join (and possibly TBB) will never take longer than a single-threaded version if at least one core is free. That is the difference. Thus, fork/join is much more compatible with the surrounding environment and you are much less likely to have to turn it off.

Btw, I have just now confirmed this with a simple test case where I saturated one of two cores and ran the JPEG codec with and without OpenMP. (I also tested without the saturation, and then I see both cores fill to 90% and times improve by about 70%, so OpenMP is working.)

Cheers,
Mikael

vladimir-dudnik (Intel):

Right, OpenMP is working. And I think the more interesting case for consideration is a 4 or 8 (or 24) core system, where you direct IPP to use for example 2 threads only and then do whatever you want on the application level (also keeping the oversubscription issue in mind). Although that brings the need for thread affinity functionality, which is not currently available in IPP but is considered to be added in future releases.

Vladimir

Quoting - Vladimir Dudnik (Intel)

Vladimir,

I think we are locked into positions. You misread my last post and I don't think I can explain the problem well enough. Threading and the problems around it are a hard topic. Sorry. I will instead post something in premium support, or try to get hold of someone who is a parallelism expert within Intel.

Thanks for trying.

Thanks,
Mikael

pvonkaenel:

Quoting - Vladimir Dudnik (Intel)

Fair enough, but it seemed worth asking for :). Along these lines, could you outline how ippiYCbCr420ToCbYCr422_Interlace_8u_P3C2R() is internally OpenMP threaded? I have not been able to figure out the top and bottom cases, and would like to get the threading gain back. Any chance of making the threaded code layer source available, to make it easier to port certain routines to other threading architectures? Another long shot, but again worth asking.

Thanks,
Peter

Rob Ottenhoff:

Hi All,

I understand that IPP cannot support all kinds of threading APIs. But the whole 'raison d'être' of IPP is its speed. If a more sophisticated threading strategy can make it faster, if only for a subset of users, that would be nice.

To see what TBB could do, I wrote a little test. I parallelized the inner loop of gast128's program above with TBB, and let the various scenarios run. The results are below.

As you see, boost::thread and TBB are about equivalent when the CPU is not flooded. But TBB certainly eases the pain when it is (19.7s vs 25.5s). The last case, where boost, TBB and IPP all run in multiple threads, is dramatic, so care is needed. I don't have a quad-core available at the moment, but when I have I will see what effect it has.

Conclusion: a different threading strategy, like the one of TBB, can make IPP faster when the CPU is loaded by other threads.

NT = 1 single: 25.4549
NT = 2 single: 14.2245
NT = 1 multi: 12.8907
NT = 2 multi: 14.4793
NT = 1 TBB: 12.903
NT = 2 TBB: 14.5953
NT = 1 multi flood: 25.4987
NT = 2 multi flood: 28.7594
NT = 1 TBB flood: 19.6604
NT = 2 TBB flood: 164.819

Where:
NT = number of threads for IPP.
single = single threaded.
multi = 2 boost threads.
multi flood = 1 boost thread + 1 boost thread flooding the CPU.
TBB = parallelized with TBB.
TBB flood = 1 boost thread calling parallel TBB + 1 boost thread flooding the CPU.

Regards,

Rob

vladimir-dudnik (Intel):

My feeling is that it is all about an additional layer built on top of IPP, which may add more benefits than just the use of a better threading technique inside the IPP functions.

Consider any real-life task, which will usually consist of several calls to IPP (for example, a Sobel filter, where you need to calculate vertical and horizontal derivatives, take their absolute values and add them together to form the output image with edges). Do you think it is better to parallelize each of these primitive operations independently (as it would be done with threaded IPP functions), or is it better to build threading on top of the IPP functions, where you can balance not only each core's workload (by knowing, for example, the computational complexity of each operation) but also process the data in small enough chunks to keep all processed data in the L2 cache?

That is what we try to implement with DMIP layer.

Regards,
Vladimir

Quoting - Vladimir Dudnik (Intel)

IMO every parallelizable function should use TBB (if it is as good as fork/join) to run at full speed under all core loads. They should also be runnable in one thread (via a naming convention or an extra argument) to facilitate custom usage of TBB by the user in the way you mention.

There's no need for simple core-spanning OpenMP. Function-local TBB or fork/join covers all the advantages of OpenMP and brings a lot more.

Cheers,
Mikael

vladimir-dudnik (Intel):

I would not argue with this, but people who use OpenMP in their applications will ...

Vladimir

Quoting - Vladimir Dudnik (Intel)

Since OpenMP is a compiler directive, can't the implementation be switched to whatever you choose? The only demand is that it is supposed to take advantage of multiple processors. A better implementation should not break backwards compatibility. Then again, I don't have the source code for IPP, so I can't really tell all the details.

Cheers,
Mikael

David Mackay (Intel):

Thanks for all of your input on this thread. We are certainly open to exploring a different threading model within IPP (other than OpenMP). Understanding the usage model helps define the parameters for the selection. The usage model that began this thread describes a case where IPP functions are called concurrently from two threads; we will certainly consider this as we evaluate threading models within IPP. In the meantime, what other usage models do you want or use? What types of controls do you want over the threading within IPP? Please add your feedback here.

Additionally, here is some more background on a couple of threading models and then finally on IPP and the Intel libraries. First, let me compare two popular threading abstractions: OpenMP and Intel Threading Building Blocks.

OpenMP originated in the HPC (high performance computing) community. It supports both Fortran and C. It is principally pragma/directive based. Coming from the HPC community, OpenMP takes a greedy approach: when a process enters an OpenMP parallel region, it assumes all the system resources belong to it, and OpenMP will schedule the work across all of the cores/processors on the system. This is the default behavior for Intel's OpenMP runtime library.

Intel Threading Building Blocks builds on top of the generic programming model. It targets C++ developers and is template based. The Threading Building Blocks library contains many common parallel algorithms as well as a number of parallel containers. Threading Building Blocks uses a Cilk-style work-stealing algorithm. For advanced users there are some highly tuned synchronization functions.

Threading Building Blocks is entirely based on C++ object-oriented programming; it has a rich set of constructs and is more flexible than OpenMP. When the OpenMP model is applicable, it can be added to your code with very few code changes. See http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ for a more complete comparison of Intel Threading Building Blocks and OpenMP. Both Threading Building Blocks and OpenMP are part of the Intel C++ Compiler Pro product. Threading Building Blocks is also available separately and works with other compilers as well.

Second, the Intel IPP library is built using the Intel OpenMP runtime library for threading. When two threads each call an IPP function that is threaded, each invocation of a threaded IPP function will fork off a number of threads to match the number of cores, and you can end up with oversubscription (more threads than cores). If you are doing this, we recommend that you override the default greedy behavior of the OpenMP runtime library in IPP (ippSetNumThreads(1);). See http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-threading-openmp-faq/ for more details on Intel IPP and OpenMP.

Ying Song,
Consulting and support for Intel Performance Libraries
and
David Mackay, Ph.D.
Consulting and support for Performance Analysis and Threading

pvonkaenel:

Quoting - Vladimir Dudnik (Intel)

I think that is where the Intel IPP layer approach could really shine - you should be able to have multiple versions of a layer, and then let the users decide what best fits their needs. I need TBB, so I do not use a threading layer at all - it's sad, but true.

Peter

Quoting - David Mackay (Intel)
Thanks for all of your inputs on this thread. We are certainly open to explore a different threading model within IPP (other than OpenMP). Understanding the usage model helps define the parameters for the selection. The usage model that began this thread describes a case where IPP functions are called concurrently from two threads; we will certainly consider this as we evaluate threading models within IPP. In the meantime, what other usage models do you want or use? What types of controls do you want over the threading within IPP? Please add your feedbacks here.

Additionally, here are some more background on a couple of threading models and then finally on IPP and Intel libraries. First, let me compare two popular threading abstractions: OpenMP and Intel Threading Building Blocks.

OpenMP originated from the HPC (high performance computing) community. It supports both Fortran and C. It is principally pragma/directive based. Coming from the HPC community, OpenMP takes a greedy approach when a process enters an OpenMP parallel region, it assumes all the system resources belong to it OpenMP will schedule the work across all the number of cores/processors on the system. This is the default behavior for Intels OpenMP runtime library.

Intel Threading Building Blocks builds on top of the generic programming model. It targets C++ developers and is template based. The Threading Building Blocks library contains many common parallel algorithms as well as a number of parallel containers. Threading Building Blocks uses a Cilk-style work-stealing algorithm. For advanced users there are some highly tuned synchronization functions.

Threading Building Blocks is entirely based on C++ object-oriented programming; it has a rich set of constructs and is more flexible than OpenMP. When the OpenMP model is applicable, it can be added to your code with very few code changes. See http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ for a more complete comparison of Intel Threading Building Blocks and OpenMP. Both Threading Building Blocks and OpenMP are part of the Intel Compiler C++ Pro product. Threading Building Blocks is also available separately and works with other compilers as well.

Second, the Intel IPP library is built using the Intel OpenMP runtime library for threading. When two threads each call an IPP function that is threaded, each invocation of a threaded IPP function will fork off a number of threads to match the number of cores and you can end up with oversubscription (more threads than cores). If you are doing this, we recommend that you override the default greedy behavior of the OpenMP runtime library in IPP (ippSetNumThreads(1);). See http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-threading-openmp-faq/ for more details on Intel IPP and OpenMP.

Ying Song,
Consulting and support for Intel Performance Libraries
and
David Mackay, Ph.D.
Consulting and support for Performance Analysis and Threading

Thank you for your evaluation David,

I appreciate that someone with good knowledge about threading and parallel computing is looking into this.

As I know you know, the current IPP OpenMP implementation is really only suitable for a server whose sole purpose is IPP processing. In every other scenario one or more cores will be loaded, occasionally or constantly, to the point where the algorithm becomes suboptimal. Just by saturating 50% of the cores you get half the speed of the non-OpenMP code, which should say everything (I know you understand this).

For instance, any client computer is out of bounds for a developer who knows this, since you don't control the cores on a client machine. Everything from the OS to Photoshop may be running in the background, all outside the application's control, since the user selects the apps he runs.

Therefore I suggest you implement TBB as a layer that can be swapped in instead of the current OpenMP one.

If TBB is as good as Fork/Join this should be trivial. If not, then Fork/Join should be implemented in C++ and used.

Parallel computing is the future (sounds tacky, but is true) and you will have to do this going forward anyway. It's better to do it now since that will earn the respect that IPP needs to make it in a multi-core environment.

I could prove this by benchmarking IPP with 1 to X cores saturated and comparing the results to a Fork/Join equivalent and the non-threaded version. The result wouldn't be pretty, as I know you know.

If IPP is intended for the future, please make it easy for developers to get great performance under any load without coding with TBB themselves (which I know few developers can; parallel computing is hard unless you do it frequently).

Cheers,
Mikael


I have to point out that ippSetNumThreads is not just a switch to turn IPP internal threading on or off. The function actually sets the number of threads to be launched by IPP. That basically allows you to leave the desired number of cores free for any other background work you have in the system.

Regards,
Vladimir

Quoting - Vladimir Dudnik (Intel)
I have to point out that ippSetNumThreads is not just a switch to turn IPP internal threading on or off. The function actually sets the number of threads to be launched by IPP. That basically allows you to leave the desired number of cores free for any other background work you have in the system.

Regards,
Vladimir

Hello Vladimir,

I don't think you fully understand the problem. This is not about thread numbers or their allocation but about performance in a system that is non-deterministic with regard to core load. This includes all desktop computers that have an actual user who can do what he wants with the computer. You cannot monitor the user, see what apps he runs, and adjust the threads accordingly. Fork/Join and TBB are self-adjusting under these circumstances, where IPP's OpenMP is not.

I would suggest you also view the screencast linked above. It is very clear on the difference.

Cheers,
Mikael


Mikael,

I do understand that TBB's task-stealing mechanism will try to keep cores equally loaded with TBB tasks. There is a potential benefit when all applications the user may launch on the computer share the same TBB scheduler. The question is: what if not all applications are based on TBB?

Regards,
Vladimir

Quoting - Vladimir Dudnik (Intel)

Mikael,

I do understand that TBB's task-stealing mechanism will try to keep cores equally loaded with TBB tasks. There is a potential benefit when all applications the user may launch on the computer share the same TBB scheduler. The question is: what if not all applications are based on TBB?

Regards,
Vladimir

Vladimir,

Yes, that is a very good question indeed. Something that probably needs investigating before making a decision on this.

Then again, it is hard to interpret the answer. Does a high number of TBB users mean that many feel the IPP OpenMP model is inadequate and roll their own layer, or do most use their own threading model to really tune the best out of their application by managing threads manually? Similarly, does low TBB usage mean that developers are ignorant of the TBB model (don't know about it or don't have the time to learn it), don't like its implementation, or are satisfied with the IPP OpenMP model?

These are complicated questions indeed, but very interesting.

My take on this is that a few really smart people should do the work so that the masses can benefit the most. That is the way to do great business, since it gives a real incentive to buy IPP. You get the most back.

Cheers,
Mikael

Just purchased the Intel book 'Multi-Core Programming'. A paragraph is even devoted to this problem: chapter 11, 'Parallel Program Issues When Using Parallel Libraries'. If I read it correctly, the author recommends disabling the library's threading altogether when using threads of your own. This might not even be just a performance issue; results may be incorrect.

Of course, the most flexible solution would be (the author also mentions this) for parallel libraries to share the same task-dividing framework. In our application we use TBB and sometimes raw Windows threads. The IPP libraries seem to use Intel's private OpenMP runtime. So there is little chance that these libraries communicate with each other about the optimal task division (to prevent oversubscription).


My comment on this is that it is the developer's responsibility to design software to avoid thread oversubscription. Even mixing Windows system threading and TBB may cause oversubscription. On the other hand, everything is under your control. You choose what techniques, libraries, and tools to use, and in what manner, to solve your task. And so it is possible to avoid oversubscription problems in cases like those mentioned above (system threading + TBB, or OpenMP threading + TBB). Learn the tools, know their potential and limitations, and apply them correctly; that is all you need.

Regards,
Vladimir

Yes, but ideally the libraries would solve this themselves. If there were just one task library on your system, it could do the management. Of course you could always create additional threads and flood the CPU outside this task library, but it still seems a problem that might be solvable (perhaps in the OS? Still, you don't want kernel transitions; they tend to be heavy too). That way IPP could spawn as many parallel calculations as it wanted, and the shared task library would prevent the oversubscription.


In theory, there is no difference between theory and practice. But, in practice, there is...

In practice, fortunately, there are many operating systems and each has many implementations of task systems. And because of that, you can feel the difference. I do not think there is a chance for a single universal and unified solution which is equally efficient for everything.

Of course, if a library is flexible enough (like Intel IPP), it is possible to adapt it to several task systems. That is what we demonstrate with the IPP sample applications.

Regards,
Vladimir
