I have an application with a three-level nested loop whose innermost loop computes output[j][k] += in[i][k] * scalar[i][j][k], where i ≈ 1000, j ≈ 200, k ≈ 4000, and the data type is complex float (8 bytes). I'm running on Linux on an Intel CPU with 40 logical cores (20 physical, with Hyper-Threading). The application can be configured to distribute the work along the i dimension across multiple threads.
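To make the loop structure concrete, here is a minimal single-threaded sketch of the computation in plain C. The function name `accumulate_products`, the flat row-major layout, and the use of C99 `float complex` are assumptions for illustration; the real application may store the arrays differently.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical reference kernel (name and memory layout are assumptions).
 * out:    nj * nk elements, accumulated in place
 * in:     ni * nk elements
 * scalar: ni * nj * nk elements
 * All arrays are flat and row-major. */
void accumulate_products(float complex *out,
                         const float complex *in,
                         const float complex *scalar,
                         size_t ni, size_t nj, size_t nk)
{
    assert(out != NULL && in != NULL && scalar != NULL);

    for (size_t i = 0; i < ni; ++i) {
        const float complex *in_row = in + i * nk;
        for (size_t j = 0; j < nj; ++j) {
            const float complex *s_row = scalar + (i * nj + j) * nk;
            float complex *out_row = out + j * nk;
            /* Innermost loop: output[j][k] += in[i][k] * scalar[i][j][k] */
            for (size_t k = 0; k < nk; ++k) {
                out_row[k] += in_row[k] * s_row[k];
            }
        }
    }
}
```

In the IPP variant, the innermost k loop is the part replaced by a single ippsAddProduct_32fc() call of length k.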
Without IPP, the total calculation completes in ~11.6 s with 1 thread and ~0.48 s with 40 threads, a speedup of ~24x. The scaling is acceptable, but the total time is too long.
With IPP, calling ippsAddProduct_32fc() with a length of k per call, the total calculation completes in ~1.16 s with 1 thread and ~0.27 s with 40 threads, a speedup of only ~4x. The total time is much lower, but the scaling is nowhere near what I'd like.
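For context, here is a rough back-of-envelope sketch of the data rate these timings imply. It assumes the approximate sizes given above and counts only one full read of scalar[i][j][k] (about 6.4 GB at 8 bytes per element); traffic for in and output and any cache reuse are ignored, so treat the result as an estimate. The helper names are hypothetical.

```c
#include <assert.h>

/* Bytes read from scalar[i][j][k] in one full pass,
 * assuming 8-byte complex floats. */
double scalar_bytes(double ni, double nj, double nk)
{
    return ni * nj * nk * 8.0;
}

/* Sustained read rate in GB/s implied by moving `bytes`
 * in `seconds` (1 GB taken as 1e9 bytes). */
double implied_gb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

Calling `implied_gb_per_s(scalar_bytes(1000, 200, 4000), t)` with each of the four measured times gives the effective read rate for that configuration, which can then be compared against the machine's memory bandwidth.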
Can anyone please explain the lack of speedup and/or suggest ways to improve multithreaded scaling with IPP?