Limited Multi-threaded performance gain

CPU: Intel(R) Xeon(R) X5660, 2.8 GHz (Westmere, 2 hex-core CPUs with Hyper-Threading)

OS: Linux CentOS 5

IPP: Version 6.0

I have a small multi-threaded C++ application that uses IPP in single-threaded mode, i.e. ippSetNumThreads(1). The application has N workers, each of which executes the same series of instructions, including some IPP calls (p8_ownsMul_32fc, p8_ownsAdd_32f_I, p8_ippsTone_Direct_32fc). A thread pool of configurable size is available to the N workers to complete the work in parallel. I have timed how long the N workers take to complete the work with various numbers of threads. Interestingly, the best performance (shortest completion time) was with a thread pool of 6 threads: a speed-up of about 5 times. Adding more threads beyond that did not improve performance; with 12 threads, for example, I got a speed-up of only 2.45 times. A CPU profiler shows that the additional time is spent in the IPP calls.
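To put the two measurements on the same scale, here is a minimal sketch that converts a speed-up into parallel efficiency (speed-up divided by thread count). The function name parallelEfficiency is only illustrative and is not part of the application or of IPP.

```cpp
#include <cassert>

// Parallel efficiency: measured speed-up divided by the number of threads.
// A value of 1.0 would mean perfectly linear scaling.
// (Illustrative helper, not part of the application or of IPP.)
double parallelEfficiency(double speedup, int numThreads) {
    assert(numThreads > 0);
    return speedup / numThreads;
}
```

With the figures above, parallelEfficiency(5.0, 6) is about 0.83 and parallelEfficiency(2.45, 12) is about 0.20, so the 12-thread run is not just flat but considerably slower than the 6-thread run.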

I have also run this application without IPP, and performance scaled almost linearly as I increased the number of threads. Am I configuring IPP properly for the hardware I'm using? Are there any limitations in IPP (such as internal mutexes or locks) that would prevent performance from improving with more than 6 application threads, each making the same IPP calls?


How is Intel IPP linked into the application? If you link with the non-threaded version of Intel IPP (rather than just calling ippSetNumThreads(), which may still rely on the OpenMP libraries), does the problem still occur?


I am linking with the non-threaded static merged libraries.


Hi Patrick,

The IPP functions don't contain any multi-threading synchronization objects. This is especially true of the single-threaded version of the IPP library.

The only constraint I can see in your case is the amount of data each worker thread works with. The X5660 CPU has 12 MB of L3 cache, which is shared between the hardware cores.

If the total amount of working data (source, destination and temporary arrays, local data, etc.) across the worker threads fits within this limit, performance should scale well. If it is larger, the application will spend more and more time waiting for data to arrive from memory.
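A rough way to check this is to add up the bytes each worker touches and compare the total with the cache size. The sketch below is only an estimate under assumed numbers: the helper names and the buffer sizes in the usage note are hypothetical, not taken from the application.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: total bytes touched by all concurrently running workers,
// assuming every worker uses the same amount of data.
std::size_t totalWorkingSetBytes(std::size_t bytesPerWorker, int numThreads) {
    assert(numThreads > 0);
    return bytesPerWorker * static_cast<std::size_t>(numThreads);
}

// True if the combined working set fits in a shared cache of cacheBytes.
bool fitsInCache(std::size_t bytesPerWorker, int numThreads, std::size_t cacheBytes) {
    return totalWorkingSetBytes(bytesPerWorker, numThreads) <= cacheBytes;
}
```

For example, if each worker used three arrays of 65536 complex floats (8 bytes each), that would be 1.5 MB per worker: 6 threads would need 9 MB and fit in a 12 MB cache, while 12 threads would need 18 MB and would not.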

Another point of concern could be dynamic memory operations in the threads. The standard allocators (including the IPP ippMalloc functions) are serialized, i.e. every call to malloc/free is another point of inter-thread synchronization, which in some cases makes threads wait for each other.
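One common way to avoid this is to allocate each thread's buffers once, before the processing loop, and reuse them on every iteration. The following is a minimal sketch of that pattern in standard C++ only. Worker, process and runWorkers are hypothetical names, and the plain loop is a stand-in for the real IPP call sequence.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

typedef std::complex<float> Cplx;

// Hypothetical worker that owns its scratch buffer. The buffer is allocated
// once in the constructor, so the hot loop performs no malloc/free.
struct Worker {
    std::vector<Cplx> scratch;  // reused on every iteration

    explicit Worker(std::size_t len) : scratch(len) {}

    // Stand-in for the IPP calls: element-wise multiply into scratch,
    // then sum the real parts.
    float process(const std::vector<Cplx>& a, const std::vector<Cplx>& b) {
        assert(a.size() == scratch.size() && b.size() == scratch.size());
        float sum = 0.0f;
        for (std::size_t i = 0; i < scratch.size(); ++i) {
            scratch[i] = a[i] * b[i];
            sum += scratch[i].real();
        }
        return sum;
    }
};

// Runs numThreads workers; each repeats the computation `iterations` times
// and stores its last result in its own slot of the returned vector.
std::vector<float> runWorkers(int numThreads, std::size_t len, int iterations) {
    std::vector<float> results(static_cast<std::size_t>(numThreads), 0.0f);
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([t, len, iterations, &results] {
            Worker w(len);  // the only allocations happen here, before the loop
            std::vector<Cplx> a(len, Cplx(1.0f, 0.0f));
            std::vector<Cplx> b(len, Cplx(2.0f, 0.0f));
            for (int i = 0; i < iterations; ++i) {
                results[static_cast<std::size_t>(t)] = w.process(a, b);
            }
        });
    }
    for (std::size_t i = 0; i < threads.size(); ++i) threads[i].join();
    return results;
}
```

If the profile shows time in malloc/free from inside the worker loop, moving those allocations out in this way removes that particular source of contention. It does not help if the bottleneck is cache or memory bandwidth.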


