simple 'for' runs much faster than 'cilk_for'

Hello,

I compiled the following code with Intel compiler 11 with optimization enabled. Then I ran the code under 32-bit Windows XP.

For some reason, the code without CILK runs much faster than the same code with CILK defined.

Also, the no-cilk code runs with almost the same performance as code written with C intrinsics.

The PC on which the program runs is a Core i5 with 4 cores. Cilk can help divide the processing among the cores.

Can anyone explain this?

Thanks,

Zvika  

---------------------------------------------------------------------------------------------------------------------------------------

#include "stdafx.h"
#include <cstdio>   // printf, fflush
#include <cmath>

// CILK is left undefined here (plain 'for' build); define it to build the cilk_for version.
#undef CILK

#ifdef CILK
      #include <cilk\cilk.h>
      #include <cilk\cilk_api.h>
#endif

#include <windows.h>

#define N_ELEMENTS 100000
#define N_ITERATIONS 10000
#define N_RG      300
#define N_PRI     25600
#define TYPE float

__declspec(align(32)) float fInput_I[N_RG][N_PRI];
__declspec(align(32)) float fInput_Q[N_RG][N_PRI];
__declspec(align(32)) float fFilter_I[N_RG];
__declspec(align(32)) float fFilter_Q[N_RG];
__declspec(align(32)) float fout_I[N_RG][N_PRI];
__declspec(align(32)) float fout_Q[N_RG][N_PRI];

 

 

int _tmain(int argc, _TCHAR* argv[])
{
      LARGE_INTEGER startTick, stopTick, ticksPerSecond;
      double totalTime = 0, elapsed;
      float sum = 0;

      // Fill the per-row filter coefficients.
      for (int i = 0; i < N_RG; i++)
      {
            fFilter_I[i] = i + 0.2;
            fFilter_Q[i] = i - 0.2;
      }

      // Fill the input arrays.
      for (int j = 0; j < N_RG; j++)
      {
            for (int i = 0; i < N_PRI; i++)
            {
                  fInput_I[j][i] = i + 0.5;
                  fInput_Q[j][i] = i - 0.3;
            }
      }

           

 

      QueryPerformanceFrequency(&ticksPerSecond);

      #ifdef CILK
      __cilkrts_init();
      #endif

      for (int k = 0; k < N_ITERATIONS; k++)
      {
            QueryPerformanceCounter(&startTick);

            #ifdef CILK
            cilk_for (int i = 0; i < N_RG; i++)
            #else
            for (int i = 0; i < N_RG; i++)
            #endif
            {
                  #ifdef CILK
                  cilk_for (int j = 0; j < N_PRI; j++)
                  #else
                  for (int j = 0; j < N_PRI; j++)
                  #endif
                  {
                        fout_I[i][j] = (fInput_I[i][j] * fFilter_I[i] - fInput_Q[i][j] * fFilter_Q[i]);
                        fout_Q[i][j] = (fInput_Q[i][j] * fFilter_I[i] + fInput_I[i][j] * fFilter_Q[i]);
                  }
            }

            QueryPerformanceCounter(&stopTick);

            elapsed = (double)(stopTick.QuadPart - startTick.QuadPart) / (double)ticksPerSecond.QuadPart;
            printf("cilk time=%f\n", elapsed);
            totalTime += elapsed;
      }

     

     

      // Sum one output array so the compiler cannot discard the computation.
      sum = 0;
      for (int i = 0; i < N_RG; i++)
      {
            for (int j = 0; j < N_PRI; j++)
            {
                  sum += fout_I[i][j];
            }
      }

      printf("cilk average time=%f\n", totalTime / N_ITERATIONS);
      printf("sum =%f\n", sum);
      fflush(stdout);

      return 0;
}

 

 


How about using cilk_for on the outer loop and allowing the inner loop to vectorize? Did you compare the opt-report output for those cases?

Tim's suggestion is spot on.  A cilk_for is reasonably cheap, but it's not free.  There's overhead associated with setting up for potential parallelism that you need to amortize over the work to be done.  If you've got a lot of work, then the overhead is small relative to the work.

Jim Sukha discussed this in his article "Why is My Cilk Plus Program Not Showing Speedup Part 1". Parallel regions with insufficient work are the first thing he covered.
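To put rough numbers on the amortization argument, the sketch below counts how many parallel loops are started over the whole run of the posted program, using its loop bounds (N_ITERATIONS, N_RG, N_PRI). The constant and function names are hypothetical, chosen for illustration, and the count is only a proxy: it ignores the per-chunk spawn cost inside each cilk_for and says nothing about the actual cost of one loop start.

```cpp
#include <cstdint>

// Loop bounds copied from the posted program.
constexpr std::int64_t kIterations = 10000;  // N_ITERATIONS (timed passes)
constexpr std::int64_t kRows       = 300;    // N_RG (outer loop)
constexpr std::int64_t kCols       = 25600;  // N_PRI (inner loop)

// Total number of inner-loop body executions over the whole run.
constexpr std::int64_t totalBodies() { return kIterations * kRows * kCols; }

// Both loops are cilk_for: each timed pass starts one outer loop
// plus one inner loop per outer iteration.
constexpr std::int64_t loopStartsNested() { return kIterations * (1 + kRows); }

// Only the outer loop is cilk_for: one parallel loop per timed pass.
constexpr std::int64_t loopStartsOuterOnly() { return kIterations; }

// Average number of loop bodies available to amortize one loop start.
constexpr std::int64_t bodiesPerStart(std::int64_t starts)
{
    return totalBodies() / starts;
}
```

With these bounds the nested version starts 3,010,000 parallel loops against 10,000 for the outer-only version, so each loop start in the outer-only version has about 300 times more work to amortize its setup cost.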

Another reason to leave the inner loop as a simple for loop is that there's a known issue with vectorizing cilk_for loops.  We're working on it, but the easiest workaround for now is to keep things simple for the vectorizer.  Sometimes moving the inner loop to a separate (inlined) function helps.

   - Barry
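The restructuring suggested in the replies above might look like the following minimal sketch: cilk_for on the outer loop only, with the inner loop left as a plain for in a separate inline function so the vectorizer sees a simple loop. The names filterRow, filterAll, USE_CILK and OUTER_FOR are hypothetical and do not come from the original post. USE_CILK is assumed to be defined only when building with a compiler that supports Cilk Plus; without it the outer loop is an ordinary serial for. Whether the inner loop actually vectorizes depends on the compiler and its options, so the optimization report is still worth checking.

```cpp
#include <cstddef>

// Hypothetical switch: define USE_CILK only for a Cilk Plus capable compiler.
#ifdef USE_CILK
  #include <cilk/cilk.h>
  #define OUTER_FOR cilk_for
#else
  #define OUTER_FOR for
#endif

// One row of the complex multiply from the posted program, written as a
// plain for loop in its own inline function (hypothetical helper).
static inline void filterRow(const float* inI, const float* inQ,
                             float fI, float fQ,
                             float* outI, float* outQ, int n)
{
    for (int j = 0; j < n; j++) {
        outI[j] = inI[j] * fI - inQ[j] * fQ;
        outQ[j] = inQ[j] * fI + inI[j] * fQ;
    }
}

// Parallelism across rows only: one parallel loop per call.
// Arrays are row-major with 'cols' elements per row.
void filterAll(const float* inI, const float* inQ,
               const float* fI, const float* fQ,
               float* outI, float* outQ, int rows, int cols)
{
    OUTER_FOR (int i = 0; i < rows; i++) {
        const std::size_t off = static_cast<std::size_t>(i) * cols;
        filterRow(inI + off, inQ + off, fI[i], fQ[i],
                  outI + off, outQ + off, cols);
    }
}
```

In the posted program this would be called once per timed pass, for example as filterAll(&fInput_I[0][0], &fInput_Q[0][0], fFilter_I, fFilter_Q, &fout_I[0][0], &fout_Q[0][0], N_RG, N_PRI).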

Dear Members,

Are you saying that the attached code (in NewCode.txt) will run faster?

When I ran the code with 'for' (no cilk used), I noticed that all 4 cores were occupied (according to Task Manager).

Does that make sense?

On Intel's Ivy Bridge (i7), can I get the graphics processor to help with the computations?

Should I use any special compiler switches?

Thanks,

Zvika

 

Attachment: newcode.txt (text/plain, 240 bytes)

Yes, the outer cilk_for keeps all workers busy while the vectorized inner for keeps all SIMD lanes productive. Cilk has no facility to program the graphics GPU, although it could contribute for specialized low-precision tasks.

Hi Tim,

When I ran the code without cilk, according to Task Manager, all 4 cores were running with high CPU usage.

Does that make sense? I thought that only cilk could make all cores do their part.

Thanks,

Zvika

Dear Members,

When I ran the outer loop with 'cilk_for' and the inner loop with 'for', the performance was much worse compared to running both loops with 'for'.

So in my case, I do not understand how cilk helps.

Can you help?

Thanks,

Zvika
