CilkView underestimates upper bound of speedup

CilkView underestimates upper bound of speedup

Cilk Plus does quite well speeding up this slightly modified version of quick-sort for integers.

#ifdef __cilk
#define cilk_spawn
#define cilk_sync

#define THRESH 16

void inssort(int *a, int l, int r){
   int i,j,v,t;
   for(i=r; i>l; i--)if(a[i-1] > a[i]){t=a[i-1]; a[i-1]=a[i]; a[i]=t;}
   for(i=l+2; i<=r; i++){
      j=i; v=a[i];
      while(v < a[j-1]){
         a[j]=a[j-1]; j--;

void qsort3(int *A,int p,int r){
/* Input A(p:r), rnk (p <= rnk <= r). Returns the value of the element of rank rnk.
   The array is partitioned about A[rnk], as well */
int k,t,q,rnd,f,x,j,rnk,p0,r0;  // median is all that we want

lq: p0=p, r0=r; rnk=(p+r)/2;

if (r <= p) {
if(r+1-p <= THRESH){

   if(p==r){ q=p; break;}
   t=A[k]; A[k]=A[r]; A[r]=t;  /* randomize */

   q=p-1; x=A[r];           //Inline version of q=rpart(A,p,r);
   for(j=p; j r0-r){
    cilk_spawn qsort3(A,p0,p); 
	p=r; r=r0; goto lq; // Avoid tail recursion qsort3(A,r,r0);
else {
    cilk_spawn qsort3(A,r,r0);
	r=p; p=p0; goto lq;  // Avoid tail recursion qsort3(A,p0,p);
cilk_sync;	// superfluous, but keep for safety

int main(int argc, char* argv[]) {
    int size,*a,i,med; int c1,c2; int nl,nr;

    printf("Threshhold = %dn",THRESH);

    for(size=0x4000; size <= 0x1000000; size+=size){
        a   = malloc(sizeof(int)*size);
	      fprintf(stderr,"Could not allocate space for %08X integersn",size);

        for(i=0; i 
However, CilkView gives a rather pessimistic estimate of 2.2 for the upper bound on parallel speed-up. On Windows 7 Home Premium X64 running on a Dell XPS 17 laptop with i7-2720QM, I get speed-ups that are never less than 3.0.

[bash]s:sortsav>icl /O3 /DNOCHECK /MD ssort.c
Intel C++ Compiler XE for applications running on IA-32, Version Build 20110427

s:sortsav>cilkview -trials one ssort.exe
cilkview: running with 8 workers
Threshhold = 16
M 00016384  0.001
M 00032768  0.001
M 00065536  0.002
M 00131072  0.003
M 00262144  0.006
M 00524288  0.012
M 01048576  0.023
M 02097152  0.038
M 04194304  0.078
M 08388608  0.153
M 16777216  0.293
cilkview: generating scalability data
Cilkview Scalability Analyzer V2.0.0, Build 1113
Threshhold = 16
M 00016384  0.120
M 00032768  0.052
M 00065536  0.060
M 00131072  0.079
M 00262144  0.111
M 00524288  0.178
M 01048576  0.279
M 02097152  0.448
M 04194304  0.834
M 08388608  1.528
M 16777216  2.923

Whole Program Statistics
1) Parallelism Profile
   Work :                                        10,511,027,859 instructions
   Span :                                        4,781,368,295 instructions
   Burdened span :                               4,790,901,465 instructions
   Parallelism :                                 2.20
   Burdened parallelism :                        2.19
   Number of spawns/syncs:                       433,563
   Average instructions / strand :               8,081
   Strands along span :                          767
   Average instructions / strand on span :       6,233,856
   Total number of atomic instructions :         44
   Frame count :                                 867,137
2) Speedup Estimate
   2 processors:         1.13 - 2.00
   4 processors:         1.20 - 2.20
   8 processors:         1.25 - 2.20
   16 processors:        1.27 - 2.20
   32 processors:        1.28 - 2.20
4 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

I can see the problem as well, although I don't get quite a 3x speedup on my system. I still get better than Cilkview's estimate though, so I've submitted a problem report for the runtime team to investigate. I'll update the thread when we find anything out.

Brandon Hewitt
Technical Consulting Engineer

For 1:1 technical support:

Software Product Support info:
Best Reply

Hi Mecej4.

Bear in mind that you're timing a subsection of the code with respect to what Cilkview is measuring. Your timing interval covers only the sorting section of the code, whereas Cilkview is estimating the speedup of the whole program. Cilkview's measurement includes your creation of the array and the (serial) assignment of values to it, which your timing mechanism ignores.

My guess is that if you expanded your timing measurements to start at the beginning of main() and end just before the return (and, obviously, a printf), you would be finding speedup numbers that were much closer to the values that Cilkview gives you.

You can make Cilkview instrument specific subsections of code and outputstatistics for themin addition to the whole-program statistics you are seeing, now. Use cilktools/cilkview.h:

cilkview_data_t start, end;
// do work.
__cilkview_do_report(&start, &end, "Sort", CV_REPORT_WRITE_TO_LOG | CV_REPORT_WRITE_TO_RESULTS);

That makes perfect sense; I had the mistaken notion that CilkView timing covered only those sections of code that were cilk_spawned; sorry for being so short-sighted!

Leave a Comment

Please sign in to add a comment. Not a member? Join today