Performance problem when applying cilk_spawn with the ippiFFT function

Cucumber:

We have a method that uses ippiFFT to perform forward and inverse FFTs on an image. This method calls ippiFFT 3 times. We tried applying cilk_spawn to speed up the FFT processing, but the method actually runs about 2 times slower than the version without cilk_spawn. We don't know what the problem is; please help us!

I will describe it briefly here:
void performFFT(int start, int end, Ipp32fc* Image[])
{
    for (int i = start; i < end; i++)
    {
        ippiFFTInv(Image[i])....
        ippiFFTInv(Image[i])....
        ippiFFTFwd(Image[i])....
    }
}

Calling performFFT without cilk_spawn:

Ipp32fc* Image[10];
performFFT(0, 10, Image);

Calling performFFT with cilk_spawn (about 2 times slower):

Ipp32fc* Image[10];
cilk_spawn performFFT(0, 5, Image);
performFFT(5, 10, Image);
cilk_sync;

Tim Prince:

Are you using the OpenMP threaded IPP? If you do that without precautions together with cilk_spawn, you have 2 instances of OpenMP each using the same cores. You would need to cut back on the number of OpenMP threads so as not to oversubscribe, and you might well see a benefit from affinitizing them to distinct groups of cores, particularly on a HyperThreaded or multi-CPU platform.
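
A minimal sketch of that precaution, assuming the threaded IPP exposes ippSetNumThreads (declared in ippcore.h, which ipp.h includes), that capping each IPP instance at 2 OpenMP threads fits your core count, and reusing performFFT and Image from the original post:

  #include <ipp.h>
  #include <cilk/cilk.h>

  void performFFT(int start, int end, Ipp32fc* Image[]);  // from the original post

  void processAll(Ipp32fc* Image[])
  {
      // Cap the OpenMP threads inside each IPP call so that the two spawned
      // branches together stay within the logical core count.
      ippSetNumThreads(2);

      cilk_spawn performFFT(0, 5, Image);
      performFFT(5, 10, Image);
      cilk_sync;
  }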

Jim Sukha (Intel):

The following article discusses some common performance problems that Cilk Plus users sometimes run into; perhaps one of them may apply to your program?

http://software.intel.com/en-us/articles/why-is-cilk-plus-not-speeding-up-my-program-part-1

How much work is each FFT / iFFT doing? As a sanity check, I would suggest comparing the serialization of your program to an execution with CILK_NWORKERS=1, and making sure those times are about the same. (See issues #4 and #5 on the list.)
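
A minimal sketch of that sanity check, reusing performFFT and Image from the original post; setting CILK_NWORKERS=1 in the environment before launching has the same effect as the __cilkrts_set_param call used here, which must run before the Cilk runtime starts:

  #include <ipp.h>
  #include <cilk/cilk.h>
  #include <cilk/cilk_api.h>
  #include <cstdio>
  #include <ctime>

  void performFFT(int start, int end, Ipp32fc* Image[]);  // from the original post

  int main()
  {
      __cilkrts_set_param("nworkers", "1");  // one Cilk worker ~= serialized run

      Ipp32fc* Image[10];                    // assume the 10 images are allocated and filled here

      clock_t t0 = clock();                  // coarse timing around the spawned region
      cilk_spawn performFFT(0, 5, Image);
      performFFT(5, 10, Image);
      cilk_sync;
      clock_t t1 = clock();

      printf("elapsed: %.2f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
      return 0;
  }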

Part 2 has not been posted yet, but it sounds like the issue that Tim mentioned (issue #10) could be your problem?
Cheers,

Jim

Cucumber:

Hi TimP and Jim Sukha,

As I described above, I use cilk_spawn on the method performFFT(), which calls an IPP function, ippiFFTFwd. After some research, I found that ippiFFTFwd is listed in the text file "C:\Program Files (x86)\Intel\Parallel Studio 2011\Composer\Documentation\en_US\ipp\ThreadedFunctionsList.txt". So ippiFFTFwd is a threaded function; is that the "OpenMP threaded IPP" that TimP was talking about?

I have a question here: if I use cilk_spawn to create 2 flows and my CPU has 4 logical cores (2 cores, 4 threads), will it assign 2 threads to flow 1 and 2 threads to flow 2? For example:

  cilk_spawn method1();   // 2 cores used for this
  method2();              // and 2 cores used for this
  cilk_sync;

If that's right, how can I adjust the resources given to each flow?

Thank you,

cucumber.

Tim Prince:

The easy first steps would be to experiment with environment variables:

set OMP_NUM_THREADS=1

set OMP_NUM_THREADS=2

By not oversubscribing, you give Windows a chance to assign your IPP instances to different hardware contexts. You would need to be running Windows 7 SP1, Windows 8, or Server 2008 R2 to get a scheduler that works correctly with hyperthreads.

I don't know whether your IPP function uses hyperthreads (if it runs long enough, you could tell by watching a performance monitor). If it does, then when running 2 IPP instances of 2 threads each, you are likely to see performance variations depending on whether the IPP instances are each using one hyperthread on each core or each using both threads of a single core.

set KMP_BLOCKTIME=20

This will cut the time OpenMP hangs on to its threads, which otherwise prevents Cilk(tm) Plus from resuming use of them. I'm assuming your IPP cases take a second or more; otherwise you wouldn't have the incentive to try cilk_spawn.
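
If environment variables are awkward to arrange, the block time also has an API form. A small sketch, assuming the Intel OpenMP runtime's kmp_set_blocktime extension (declared in omp.h with the Intel compiler); it should run before the first IPP call so it takes effect before the OpenMP threads start waiting:

  #include <omp.h>

  void shortenBlocktime()
  {
      // Intel OpenMP extension (assumption about your installed runtime):
      // idle worker threads stop spin-waiting and go to sleep after ~20 ms,
      // freeing the cores for the Cilk Plus workers.
      kmp_set_blocktime(20);
  }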

Jim Sukha (Intel):

It does sound like you are calling OpenMP code from within Cilk Plus. If you would like to use cilk_spawn and mix Cilk Plus and OpenMP threading, you might want to try setting the environment variable CILK_NWORKERS=2, so that Cilk Plus only creates two worker threads. Then, if each worker starts an instance of OpenMP, you can avoid oversubscribing the machine.

In this case, you aren't really using the features of Cilk Plus in any interesting way, so perhaps you might also just try using an OpenMP parallel for over two iterations, and avoid mixing runtimes?

Personally, I think it would be nice to have Cilk Plus versions of the IPP functions, which would allow users to compose parallel functions as in your example without having to do as much explicit tuning.   But I am unfamiliar with IPP in general, so I don't know how feasible that would be in practice.

The other possibility might be: if you can find an explicitly serial version of the IPP function, you might try using a cilk_for loop over the 10 images and calling the serial version to process each of them.
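
A rough sketch of that idea, assuming the IPP calls can be forced serial with ippSetNumThreads(1) (linking the non-threaded IPP libraries would be the other route) and reusing performFFT from the original post, so that Cilk Plus supplies all of the parallelism across the 10 images:

  #include <ipp.h>
  #include <cilk/cilk.h>

  void performFFT(int start, int end, Ipp32fc* Image[]);  // from the original post

  void performFFTAll(Ipp32fc* Image[], int count)
  {
      ippSetNumThreads(1);   // keep each IPP call serial (assumption about your IPP build)

      cilk_for (int i = 0; i < count; i++)
          performFFT(i, i + 1, Image);   // one image per iteration
  }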

Cheers,

Jim

Tim Prince:

As Jim pointed out, you need to watch the total number of threads (apparently, CILK_NWORKERS * OMP_NUM_THREADS).
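
For example, on the 2-core / 4-thread CPU described above, CILK_NWORKERS=2 with OMP_NUM_THREADS=2 gives 2 * 2 = 4 threads, which matches the hardware; if OpenMP is left at its default (typically one thread per logical core), the count becomes 2 * 4 = 8, a two-fold oversubscription.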

Cucumber:

Hi TimP and Jim Sukha,

I understand the problem now. Thanks for your support.

Cheers,

Cucumber.
