CPU: Intel(R) Xeon(R) X5660 2.8 GHz (Westmere, 2 hex-core with Hyper-Threading)
OS: Linux CentOS 5
IPP: Version 6.0
I have a small multi-threaded C++ application that uses IPP in single-threaded mode, i.e. it calls ippSetNumThreads(1). The application has N workers, each of which executes the same series of instructions, including some IPP calls (p8_ownsMul_32fc, p8_ownsAdd_32f_I, p8_ippsTone_Direct_32fc). The N workers share a thread pool of configurable size to complete the work in parallel. I have timed how long the N workers take to complete the work with various thread pool sizes. Interestingly, the best performance (shortest completion time) was with a pool of 6 threads: a speed-up of about 5 times. Adding more threads beyond that did not improve performance; in fact, with 12 threads the speed-up was only 2.45 times. A CPU profiler shows that the additional time is spent in the IPP calls.
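A minimal sketch of this kind of measurement, assuming C++11 std::thread rather than the real thread pool. The names run_worker, RunResult and time_workers are hypothetical, and the loop in run_worker is a plain C++ stand-in for the complex multiply and in-place add, not the IPP API:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one worker's job: a complex multiply followed by
// an in-place add. The real application would call the IPP routines here.
float run_worker(std::size_t len) {
    std::vector<std::complex<float>> a(len, std::complex<float>(1.0f, 0.5f));
    std::vector<std::complex<float>> b(len, std::complex<float>(0.5f, -1.0f));
    std::vector<float> acc(len, 0.0f);
    for (std::size_t i = 0; i < len; ++i) {
        const std::complex<float> p = a[i] * b[i];  // complex multiply
        acc[i] += p.real();                         // in-place add
    }
    float sum = 0.0f;
    for (float v : acc) sum += v;
    return sum;
}

struct RunResult {
    double seconds;           // wall-clock time for all workers
    std::vector<float> sums;  // one result per worker, to verify the work ran
};

// Runs num_workers jobs on pool_size threads and times the whole batch.
// Threads pull worker indices from a shared atomic counter.
RunResult time_workers(unsigned pool_size, unsigned num_workers, std::size_t len) {
    assert(pool_size > 0);
    RunResult out{0.0, std::vector<float>(num_workers, 0.0f)};
    std::atomic<unsigned> next{0};

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < pool_size; ++t) {
        pool.emplace_back([&] {
            for (unsigned w = next.fetch_add(1); w < num_workers; w = next.fetch_add(1)) {
                out.sums[w] = run_worker(len);
            }
        });
    }
    for (auto& th : pool) th.join();
    const auto stop = std::chrono::steady_clock::now();

    out.seconds = std::chrono::duration<double>(stop - start).count();
    return out;
}
```

The speed-up for a pool of k threads would then be time_workers(1, N, len).seconds divided by time_workers(k, N, len).seconds. Swapping the loop body for the actual IPP calls should show whether the drop beyond 6 threads reproduces outside the full application.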
I have also run this application without IPP and seen performance scale almost linearly as I increase the number of threads. Am I configuring IPP properly for the hardware I'm using? Are there any limitations in IPP (such as internal mutexes or locks) that would prevent performance from increasing with more than 6 application threads each making the same IPP calls?