CilkView underestimates upper bound of speedup

CilkView underestimates upper bound of speedup

Bild des Benutzers mecej4

Cilk Plus does quite well speeding up this slightly modified version of quick-sort for integers.

#include 
#include 
#include 
#include 
#ifdef __cilk
#include 
#else
#define cilk_spawn
#define cilk_sync
#endif

#define THRESH 16

void inssort(int *a, int l, int r){
   int i,j,v,t;
   for(i=r; i>l; i--)if(a[i-1] > a[i]){t=a[i-1]; a[i-1]=a[i]; a[i]=t;}
   for(i=l+2; i<=r; i++){
      j=i; v=a[i];
      while(v < a[j-1]){
         a[j]=a[j-1]; j--;
         }
      a[j]=v;
   } 
}

void qsort3(int *A,int p,int r){
/* Input A(p:r), rnk (p <= rnk <= r). Returns the value of the element of rank rnk.
   The array is partitioned about A[rnk], as well */
int k,t,q,rnd,f,x,j,rnk,p0,r0;  // median is all that we want

lq: p0=p, r0=r; rnk=(p+r)/2;

if (r <= p) {
    cilk_sync;
	return;
	}
if(r+1-p <= THRESH){
       inssort(A,p,r);
	   cilk_sync;
       return;
       }

while(1){
   if(p==r){ q=p; break;}
   rnd=rand();
   k=p+(r-p)*(double)rnd/(double)RAND_MAX+0.5;
   t=A[k]; A[k]=A[r]; A[r]=t;  /* randomize */

   q=p-1; x=A[r];           //Inline version of q=rpart(A,p,r);
   for(j=p; j r0-r){
    cilk_spawn qsort3(A,p0,p); 
	p=r; r=r0; goto lq; // Avoid tail recursion qsort3(A,r,r0);
	}
else {
    cilk_spawn qsort3(A,r,r0);
	r=p; p=p0; goto lq;  // Avoid tail recursion qsort3(A,p0,p);
	}
cilk_sync;	// superfluous, but keep for safety
}

int main(int argc, char* argv[]) {
    int size,*a,i,med; int c1,c2; int nl,nr;

    printf("Threshhold = %dn",THRESH);

    for(size=0x4000; size <= 0x1000000; size+=size){
        a   = malloc(sizeof(int)*size);
       if(!a){
	      fprintf(stderr,"Could not allocate space for %08X integersn",size);
	      exit(1);
	      }

        for(i=0; iHowever, CilkView gives a rather pessimistic estimate of 2.2 for the upper bound on parallel speed-up. On Windows 7 Home Premium X64 running on a Dell XPS 17 laptop with i7-2720QM, I get speed-ups that are never less than 3.0.
[bash]s:sortsav>icl /O3 /DNOCHECK /MD ssort.c
Intel C++ Compiler XE for applications running on IA-32, Version 12.0.4.196 Build 20110427

s:sortsav>cilkview -trials one ssort.exe
cilkview: running with 8 workers
Threshhold = 16
M 00016384  0.001
M 00032768  0.001
M 00065536  0.002
M 00131072  0.003
M 00262144  0.006
M 00524288  0.012
M 01048576  0.023
M 02097152  0.038
M 04194304  0.078
M 08388608  0.153
M 16777216  0.293
cilkview: generating scalability data
Cilkview Scalability Analyzer V2.0.0, Build 1113
Threshhold = 16
M 00016384  0.120
M 00032768  0.052
M 00065536  0.060
M 00131072  0.079
M 00262144  0.111
M 00524288  0.178
M 01048576  0.279
M 02097152  0.448
M 04194304  0.834
M 08388608  1.528
M 16777216  2.923

Whole Program Statistics
1) Parallelism Profile
   Work :                                        10,511,027,859 instructions
   Span :                                        4,781,368,295 instructions
   Burdened span :                               4,790,901,465 instructions
   Parallelism :                                 2.20
   Burdened parallelism :                        2.19
   Number of spawns/syncs:                       433,563
   Average instructions / strand :               8,081
   Strands along span :                          767
   Average instructions / strand on span :       6,233,856
   Total number of atomic instructions :         44
   Frame count :                                 867,137
2) Speedup Estimate
   2 processors:         1.13 - 2.00
   4 processors:         1.20 - 2.20
   8 processors:         1.25 - 2.20
   16 processors:        1.27 - 2.20
   32 processors:        1.28 - 2.20
4 Beiträge / 0 neu
Letzter Beitrag
Nähere Informationen zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.
Bild des Benutzers Brandon Hewitt (Intel)

I can see the problem as well, although I don't get quite a 3x speedup on my system. I still get better than Cilkview's estimate though, so I've submitted a problem report for the runtime team to investigate. I'll update the thread when we find anything out.

Brandon Hewitt Technical Consulting Engineer Tools Knowledge Base: "http://software.intel.com/en-us/articles/tools" Software Product Support info: "http://www.intel.com/software/support"
Bild des Benutzers William Leiserson (Intel)
Best Reply

Hi Mecej4.

Bear in mind that you're timing a subsection of the code with respect to what Cilkview is measuring. Your timing interval covers only the sorting section of the code, whereas Cilkview is estimating the speedup of the whole program. Cilkview's measurement includes your creation of the array and the (serial) assignment of values to it, which your timing mechanism ignores.

My guess is that if you expanded your timing measurements to start at the beginning of main() and end just before the return (and, obviously, a printf), you would be finding speedup numbers that were much closer to the values that Cilkview gives you.

You can make Cilkview instrument specific subsections of code and outputstatistics for themin addition to the whole-program statistics you are seeing, now. Use cilktools/cilkview.h:

cilkview_data_t start, end;
__cilkview_query(start);
// do work.
__cilkview_query(end);
__cilkview_do_report(&start, &end, "Sort", CV_REPORT_WRITE_TO_LOG | CV_REPORT_WRITE_TO_RESULTS);

Bild des Benutzers mecej4

That makes perfect sense; I had the mistaken notion that CilkView timing covered only those sections of code that were cilk_spawned; sorry for being so short-sighted!

Melden Sie sich an, um einen Kommentar zu hinterlassen.