Cilk Plus Performance - Overhead

knikas

Hi all,

I downloaded and built the gcc47-cilkplus branch. I managed to build the library using binutils version 2.20.1, as I had some problems with versions 2.22 and 2.18. I then tried to experiment with the following code:

#include <stdio.h>
#include <syst/time.h>

int fib (int n) { 
   if (n<2) return n; 
   else { 
          int x, y; 
#ifdef cilk 
          x = cilk_spawn fib (n-1); 
          y = cilk_spawn fib (n-2);
          cilk_sync;
#else 
          x = fib(n-1);
          y = fib(n-2);
#endif 
          return (x+y); 
 } } 

int main (int argc, char *argv[]) { 
   int n, result; struct timeval t1, t2; 
   double time; n = atoi(argv[1]); 
   gettimeofday(&t1, 0); 
#ifdef cilk 
   result = cilk_spawn fib(n); 
   cilk_sync; 
#else 
   result = fib(n); 
#endif 
   gettimeofday(&t2, 0); 
   time=(double)((t2.tv_sec-t1.tv_sec)*1000000+t2.tv_usec-t1.tv_usec)/1000000; 
   printf ("Result: %d\n", result); 
   printf("Time : %.4f\n", time); 
   return 0; 
}
For the compilation I export LD_LIBRARY_PATH and LIBRARY_PATH and I run:
gcc -Wall -Wno-unused-variable -O3 -o fib-serial-gcc47-O3 fib.c -ldl -lcilkrts
gcc -Wall -Wno-unused-variable -O3 -Dcilk -o fib-cilk fib.c -ldl -lcilkrts

for the serial and the parallel version. I limit CILK_NWORKERS to 1 and I get the following results for fib(40): serial Time : 1.0540, cilk Time : 27.3902. Even with 8 workers (I have an 8-core machine) cilk reports 3.4457, still slower than the serial version! Something seems to be wrong. Maybe something to do with code optimizations? Or do you think something went wrong while building gcc? Kind regards, Kostis

Balaji Iyer (Intel)

Hello Kostis,
I cut and pasted your code, compiled it using the flags you used, and ran it for fib(40). My parallel fib was way faster than my serial fib.

One thing I noticed is that you are using #include <syst/time.h>, and my compiler flagged it as an error. I changed it to <sys/time.h>.

Do you have multiple or proprietary gettimeofday implementations installed on your machine?

Thanks,

Balaji V. Iyer.

Georg Zitzlsberger (Intel)

Hello,

using this

x = cilk_spawn fib (n-1);  // -> executed by X
// created thread Y that continues here
y = cilk_spawn fib (n-2);  // -> executed by Y
// created thread Z that continues here
cilk_sync;  // synchronize: wait for X, Y & Z to finish

...is very inefficient. What you're doing here is spawning two new threads (Y & Z) - thread X is the main thread entering the fib(...) function initially. Whilst threads X & Y are doing work, thread Z is just there to wait.

Better solution:

x = cilk_spawn fib (n-1);
y = fib (n-2);
cilk_sync;

This should reduce your overhead.

Best regards,

Georg Zitzlsberger

knikas

Hi guys, yes I agree this is a better solution. Still, this is not the problem. I am constantly getting this strange overhead comparing the serial version to the parallel one (with 1 thread or more) for Fibonacci, for Floyd-Warshall and other algorithms. There seems to be another problem, something to do with the library and perhaps the way I have built it. The syst/time was just a typo while copying the code into the forum; I use the usual sys/time.h. As Balaji mentions, I would expect the parallel version with 1 worker to be maybe slightly slower than the serial version, and then get faster as the number of cores increases. Kind regards, Kostis

knikas

Following my previous comments, the compilation seems to ignore optimization flags. I compiled the code with the -O0, -O1, -O2 and -O3 flags for both the serial and parallel version. Here are the results:

-O0 : serial Time : 2.6085 , parallel Time : 26.2700
-O1 : serial Time : 2.2422 , parallel Time : 27.1657
-O2 : serial Time : 1.9993 , parallel Time : 28.6697
-O3 : serial Time : 1.0544 , parallel Time : 27.4960

I remind you this is the time for the parallel version with 1 cilk worker. So, while the optimization flags have an effect on the serial code, there seems to be none on the parallel code? Is this normal? Could this be a sign that something has gone wrong in the building process? Kind regards, Kostis

Balaji Iyer (Intel)

Hello Kostis,
When Cilk was implemented, we did not modify the existing GCC scheduler. Also, I have seen that the GCC instruction scheduler does not do an astonishing job of scheduling functions inside another function (a spawn helper is implemented that way). This is probably why you are not seeing much difference in time.

Thanks,

Balaji V. Iyer.

knikas

Hi Balaji,

just to clarify...

When you run this code with CILK_NWORKERS=1, does it execute slower than the serial code on your machine? And if yes, how much slower? Because if you are not seeing huge differences, then the problem must be with my gcc build. My build is based on the code committed to the SVN on 08/12/2011 by hjl.

Kind regards,

Kostis

Balaji Iyer (Intel)

Hi Kostis,
We are currently looking into this. Although, please take note that a 10X overhead for a program that is almost pure overhead is not unexpected.

Thanks,

Balaji V. Iyer.

knikas

Hi guys, to offer some more info...I downloaded the latest non-commercial version of the composer_xe compiler.

icc --version

icc (ICC) 12.1.2 20111128

Copyright (C) 1985-2011 Intel Corporation. All rights reserved.

I used it to compile the same code (with -O3 flag) and compare again serial execution vs parallel execution with 1 thread. Here are the results for fib(40):

serial --> Time : 2.6701

parallel (with CILK_NWORKERS=1) --> Time : 29.7030

so similar behavior to using the gcc build. Even when I up the CILK_NWORKERS to 8, it runs in 3.62 secs, which means it is slower than the serial version!

Another reason this is strange is that when I used cilk++ in the past for the same experiment, I didn't observe that much difference between the serial version and the parallel version with one worker. I will try to reproduce this experiment as well and report the numbers.

Kind regards,

Kostis

Georg Zitzlsberger (Intel)

Hello Kostis,

we've observed the same on some Linux distributions, too. It only seems to affect a few systems. We're currently investigating...

What Linux distribution are you using and is it 32 or 64 bit?

Thank you for your feedback!

Georg Zitzlsberger

knikas

Hello Georg, all the previous results come from Debian Wheezy. I have also observed the same behaviour on Fedora 14, Fedora 16 and Ubuntu 11.04. All of these are 64-bit distributions. Which distribution are you using that does not exhibit this behaviour? Kind regards,

Kostis

Georg Zitzlsberger (Intel)

Hello Kostis,

I've quickly tested some of our 64 bit systems. Those worked:

SLES10
RHEL5
Fedora 12

Best regards,

Georg Zitzlsberger

William Leiserson (Intel)

Hi Knikas,

I'm running Ubuntu 11.10 (64-bit). I took your code and removed the extra spawns, and found ~10x overhead on one worker. That seemed pretty high to me, too. On the other hand, recursive fib is a diabolical example, since there's basically nothing except overhead in the algorithm. In that sense, fib is the proper function to use to measure the overhead of calling a Cilk function (as opposed to a C++ function). The overhead, itself, is in every spawning function: a larger frame to accommodate Cilk data structures, accessing TLS to get the worker, no optimizing away the frame pointer, verifying that the function is sync'd before returning, etc.

There is, of course, a lot of optimization opportunity in all that -- especially in the case of fib -- and knocking down the overhead would be a big win. Naturally, one anticipates it going down over time, but 10x is not altogether surprising on fib.

Regarding Cilk++, I'd definitely like to see the comparison if you run the experiment. My recollection was something on the order of 3-4x on fib.

For completeness, here is the code I ran to get my timing:

#include <stdio.h>
#include <sys/time.h>
#include <cilk/cilk.h>

typedef int (*fib_t) (int);

int sfib (int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = sfib(n - 1);
        y = sfib(n - 2);
        return x + y;
    }
}

int pfib (int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = cilk_spawn pfib(n - 1);
        y = pfib(n - 2);
        cilk_sync;
        return x + y;
    }
}

void time_it (fib_t f, int n)
{
    int result;
    struct timeval t1, t2;
    double time; 
    gettimeofday(&t1, 0);
    result = f(n);
    gettimeofday(&t2, 0);
    time=(double)((t2.tv_sec-t1.tv_sec)*1000000+t2.tv_usec-t1.tv_usec)/1000000;
    printf ("Result: %d\n", result);
    printf("Time : %.4f\n", time);
}

int main (int argc, char *argv[]) {
    int n;
    n = atoi(argv[1]);
    time_it(sfib, n);
    time_it(pfib, n);
    return 0;
}
knikas
Hi William,

I was wondering the same thing, so I performed the same experiment with Cilk++ (build 8503). For fib(40) the serial version took 1.8860 sec while the parallel version with 1 worker took 16.69 sec. So a lot faster than the 27 sec needed with GCC or ICC, but still a huge overhead. I also tried your code (to remove 1 spawn and reduce the overhead), which took 11.2355 sec. So in this case I have a ~10x overhead.

I also tried the cilkc MIT compiler. In this case the parallel execution with 1 worker takes 7.433 sec, so a ~7x overhead. Kind regards,

Kostis

jimdempseyatthecove

The parallel Fibonacci is an excellent demonstration of the tasking overhead of an inefficient recursive algorithm. For the Fibonacci series, a fast iterative method on one thread will, IMHO, always be faster than a parallel method.

[cpp]// Serial non-recursive method to calculate Fibonacci series
// *** Note, several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n )
{
	if( n<2 )
		return n;
	long fib0 = 1;	// most recent
	long fib1 = 1;	// 1 prior to most recent
	long fib2 = 0;	// 2 prior to most recent
	long i;
	for(i=2; i<n; ++i)
	{
		fib2 = fib1;
		fib1 = fib0;
		fib0 = fib1 + fib2;
	}
	return fib0;
}[/cpp]

per iteration of n, the above loop performs:

3 register to register moves
1 register,register add
1 add 1
1 compare
1 branch

(before unrolling by the compiler and/or other optimizations like register swapping)

The above is 7 instructions, possibly 4 clock ticks per iteration.

A near call/ret pair, that is to say a call to an empty void function, already exceeds this.
Which in turn means that your tasking library (Cilk Plus) would require negative clock ticks (in parallel) to surpass the performance of the single-threaded iterative process.

The point of the parallel fib is to teach recursive programming, not to teach how to program efficiently. What the student running the test program should observe is the cost of tasking as compared to the work per iteration, and whether this cost can be effectively amortized.

Jim Dempsey

www.quickthreadprogramming.com
Georg Zitzlsberger (Intel)

Hello Jim,

yes, there are more efficient algorithms to calculate Fibonacci.
The recursive one above is O(2^n); an iterative approach like yours is O(n).
There's even a closed-form version that needs only O(1) arithmetic operations.

Anyways, the idea of using Fibonacci with Intel Cilk Plus is to easily demonstrate how it can be applied and how it works. The simplicity helps in understanding the involved tasking model. In general you can use other, more complex problems as well (I'd like to see some from the community; not here, but on a dedicated thread!).

You said that the runtime using the recursive algorithm with Intel Cilk Plus should be close to the iterative one (for all n). Well, that's like comparing apples with pears (O(2^n) to O(n) respectively).

What we want to show here is that the same (recursive!) algorithm runs faster using multiple cores than sequentially.

If I had to implement a mathematical library I'd prefer the closed-form version (for big n) ignoring the recursive & iterative approach altogether. However, we're exercising here...

The problem that's described above is that on some systems we see a reproducible slowdown. The task scheduling usually should be consistent on all systems. Maybe it's just a small problem in the scheduler or runtime library.

Best regards,

Georg Zitzlsberger

jimdempseyatthecove

>>The problem that's described above is that on some systems we see a reproducible slowdown

Yes, ... Time for VTune

>>The task scheduling usually should be consistent on all systems.

But not necessarily between systems, nor relative efficiency between systems.

Jim

www.quickthreadprogramming.com
jimdempseyatthecove

>>What we want to show here is that the same (recursive!) algorithm is running faster with using multiple cores than sequentially

Isn't this "same (recursive)" (in parallel) only when both legs are spawned?

As opposed to:

Half same recursive in parallel (one leg using task spawn)
Other half same recursive via call stack (other leg using function call)

Don't get me wrong, this example is a good learning tool. It illustrates a case where you split work between the current thread and a worker thread (and recurse down doing the same thing).

Jim Dempsey

www.quickthreadprogramming.com
Jim Sukha (Intel)

As I believe was mentioned earlier in the post, spawning the second recursive leg is redundant for this example. In this case, spawning one task and calling the other task is "equivalent" to spawning both halves, because the continuation of the second spawned task would do nothing but immediately sync.

To illustrate the difference more concretely, consider the following two program fragments.

// Example 1:
int x, y= 0;
cilk_spawn f();
x++;
g();
y++;
cilk_sync;

// Example 2:
int x, y = 0;
cilk_spawn f();
x++;
cilk_spawn g();
y++;
cilk_sync;

Assume f() and g() are serial functions.
In the first example, there are two strands that can potentially execute in parallel:
Strand 1 for the initialization of x and y and then f()
Strand 2 for x++, g() and then y++.

In the second example, there are three strands that can potentially execute in parallel:
Strand 1 for the initialization of x and y and then f()
Strand 2 for x++ and then g()
Strand 3 for y++.
Strand 3 can not begin executing until strand 2 has executed x++ and reached the cilk_spawn of g().

For fib(), one could spawn the second recursive call, but since its "strand 3" does nothing, it is only adding overhead.

Cheers,

Jim

jimdempseyatthecove

>>As I believe was mentioned earlier in the post, spawning the second recursive leg is redundant for this example. In this case, spawning one task and calling the other task is "equivalent" to spawning both halves

Except for the overhead. You cannot claim the implementation is fully parallel when half the implementation is recursive serial. Apples and oranges.

Using one leg in parallel _is_ faster than using both legs in parallel, and an actual implementation should favor this over both legs in parallel. However, this does not discount the "requirement" of the sample program running fully recursively parallel. To do this you need to run both legs in parallel.

A similar situation is where you want to illustrate a parallel for loop with, unknown to the compiler, only 1 iteration. This is "equivalent" to issuing the body statement(s) without the loop control and without task setup for the cilk_for. However, making the substitution, then claiming it is equivalent to the parallel implementation, is "cooking" the results data.

Making these otherwise "stupid" programming choices is valuable in that it provides you with the means to measure the tasking overhead, and then have that information available to make your programming decisions, such as parallelizing one or both legs and what size to set your cut-off threshold.

Jim Dempsey

www.quickthreadprogramming.com
Jim Sukha (Intel)

I agree, there is a lot to learn from studying even seemingly "simple" examples such as fib, and it is definitely important to understand what you are measuring when trying to make performance comparisons.

Just wanted to clarify for any future readers who might stumble upon this post, because it is possible that programmers new to Cilk and parallel programming might get confused by my original description or the statements "half the implementation is recursive serial" and "you need to run both legs in parallel".
In the following code:

cilk_spawn f();
g();
cilk_sync;

f() and g() can both contain parallelism nested inside, and Cilk will not only allow f() and g() to execute in parallel, but also exploit nested parallelism in both sides, f() and g(). If g() itself has spawns nested inside, the function g() is NOT guaranteed to execute completely serially. In the case of fib(), f and g happen to be the same function, which does recursively spawn.
Thus, in a divide-and-conquer algorithm which divides into two subproblems, the preferred approach is to avoid spawning the second subproblem (i.e., g()). Cilk can still exploit parallelism due to spawns nested inside the second subproblem g().

Anyway, I'm probably repeating a lot of information that everyone has summarized in earlier posts. But I just wanted to clarify, because recursive divide-and-conquer algorithms are a common and important usage pattern for Cilk, and we want to encourage programmers to write efficient Cilk code.
Cheers,

Jim

jimdempseyatthecove

>> but also exploit nested parallelism in both sides

This is correct, but the call to g() is not via cilk_spawn, and thus reduces the number of spawns by ~1/2.
While this is good programming practice, it is not following the problem premise of a fully parallel recursive algorithm.

In the original TBB example, they incorporated a CutOff threshold. If the (remaining) n was below this threshold, the serial Fib function is called. This too, while good programming practice, circumvents the full parallelism of the implementation.

The point being made here is that "fully parallel" isn't always "fully best"... and this is part of the learning experience of this problem.

As an extended example of avoiding unnecessary parallelism in a recursive parallel implementation (I'll wait for your "that's cheating"), consider the QuickThread implementation. QuickThread has a function call that will return the number of idle threads. The following code will use this function to determine whether it is advantageous to spawn the recursion or call the recursion.

// support task for qtParallelFib( int64_t n );
// returns result indirect pointer to argument as opposed to return value
// (tasks have void return)
void qtParallelFibLambda( int64_t n, int64_t* result)
{
   int64_t x, y;
   // see if below cutoff, or no waiting threads:
   // if so, run both legs as plain recursive calls
   // (otherwise spawn one leg as a task and call the other)
   if( n < CutOff /* || no idle threads available */ )
   {
      qtParallelFibLambda(n-2,&y);
      qtParallelFibLambda(n-1,&x);
   }
   // join legs
   *result = x + y;
}

// Shell function to return result as return value
// as opposed to indirect pointer to result location.
int64_t qtParallelFibLambda( int64_t n )
{
   int64_t	result;
   qtParallelFibLambda(n, &result);
   return result;
}

Is this better programming - yes, is it "fully parallel" - no.

Jim Dempsey

www.quickthreadprogramming.com
BradleyKuszmaul

Consider these two code fragments:

fun1() { spawn f(); spawn g(); sync; }
fun2() { spawn f(); g(); sync; }

In this case fun1 and fun2 offer the same amount of parallelism. There is no sense in which fun2 is less parallel than fun1. It doesn't make sense to say "half the calls are serial", because serialization is a property that relates two different parts of the code. In both of these examples, f() and g() execute in parallel. To say that g() is serial, you have to say what it is serial with. In the second example, g() is serial with the sync, but the sync serializes on the previously spawned functions, so g() is also serial with the sync in the first example. Read Jim Sukha's post carefully. He knows what he's talking about.

BradleyKuszmaul

I have also seen performance anomalies in recent versions of Cilk. I have a big code, and when I enable Cilk it slows everything down. This is preventing me from shipping product that uses Cilk.
I also have a "PR" problem with the Cilk scheduler. When there is nothing to steal, the worker threads happily chew up all the otherwise unused CPU cycles. My beta customers don't like that, since they think it will impact other processes running on the same server. It would be really helpful if some way could be found to make it so that when users run "top" they don't see my software chewing up all available CPU cycles.

jimdempseyatthecove

Bradley,

>>I have a big code, and when I enable Cilk it slows everything down.

I suspect that you are a new user to parallel programming and thus using Cilk improperly.

There is an adage: "vector inner, parallel outer".

It looks like you may have profiled the serial version of your big code, found the hot spot (inner most) routines, and parallelized those using Cilk Plus.

The better practice is to see if you can vectorize the innermost routines, then parallelize the code at the uppermost appropriate level that calls those routines. In following this practice you reduce the frequency of spawn-join. Spawn-join has a cost. It is your job as the programmer to assure the total cost is appropriate for the total benefit.

>>"PR" problem with the Cilk scheduler

Most tasking schedulers (Cilk Plus, TBB, OpenMP, QuickThread) employ a technique where an assumption is made that when you terminate a parallel region, you will very soon enter the next parallel region. This is preponderantly true for nearly all applications. With this assumption in place, the better strategy is, to take an analogy, to place the race car at the starting line with the engine revving, clutch depressed, and in gear. When the start flag comes down, pop the clutch and immediately take off. The alternative, when momentarily done, is to park the car in the garage and turn the engine off; in that case the time to start the engine and drive back to the starting line is additional latency. Additionally, by parking the thread you may be yielding time left on your rental agreement (with the O/S). When it is time to start the impending parallel region, one or more of your threads may have to wait while some other task/application expends its time with the thread. (Some systems are preemptively time-sliced, others are not.)
The trade-off here is: CPU cycles (fuel/engine wear) vs latency (time to go).

If you read the programmer's guide for the parallel programming tool, you may find that most of these provide a tuning parameter for you to specify how long the thread (race car) waits (revving its engine) before deciding to suspend (park in the garage). In OpenMP this is KMP_BLOCKTIME. I will leave it as an exercise for yourself to discover whether this tuning parameter is available in Cilk Plus (you never know what additional useful information you may discover while digging around).

Before you go and set the block time to 0, examine your current parallel programming attempt to see if you can push the parallelization to outer levels, as I suspect that this is the root of your problem.

Jim Dempsey

www.quickthreadprogramming.com
BradleyKuszmaul

>> I suspect that you are a new user to parallel programming and thus using Cilk improperly.

I was one of the original authors of Cilk in 1992. I've written many high-performance codes using both MIT Cilk and Intel Cilk, and most of the codes are available on the web. For example, in addition to my academic papers on the web, you can find a bunch of my Cilk code on the Intel forums (by looking through my postings).

I might find your Cilk-related comments to be more helpful if they related to some specific experience you have with Cilk, rather than extrapolating from some other parallel programming environment such as OpenMP. I'm sure you know more than I do about OpenMP, but it's not Cilk. For example, Cilk does not require a parameter corresponding to KMP_BLOCKTIME.

I will leave it as an exercise for yourself to learn something about Cilk (and please share your experiences).

I could be using Cilk improperly. I did check to make sure that I have released the parking brake, however.

-Bradley
jimdempseyatthecove

You are correct in that I am not a big-time Cilk Plus programmer, I use other paradigms.

The OpenMP KMP_BLOCKTIME is not a task-call parameter; rather, it is a global setting that specifies a "spin wait" time before a thread suspends itself. There is also a runtime function call you can use on some implementations to dynamically vary the spin-wait time.

When your code is:

serial
parallel
short serial
parallel
short serial
...

You may find it beneficial to have a block time that is longer than the runtime of the short serial. "blocktime" is a misnomer, it really means "maximum wait time".

When your code is

serial
parallel
long serial (seconds)
parallel
long serial
...

Then you may find zero wait time beneficial, as this conserves power, may speed up the serial code on "Turbo Boost" type processors, and will release processing capability to other threads/applications.

I would be very surprised to find that Cilk Plus does not have a tuning means for the wait time (block time).

-----------

At the top of this forum thread, the recursive Fibonacci series was used as an example program for recursive parallel programming.

While this is an excellent simple example, it is also one of the worst uses of parallel programming and should never be used as a measure of expectation of what parallel programming will do for your application. Any reasonable parallel programming guide will state somewhere something to the effect: "The cost of invoking a parallel region is not zero. This cost must be factored into the decision of the use and placement of the parallel regions."

In the Fib series, the computation is "x = y + z" (plus the inordinate overhead of redundantly calling itself).
In the cases where you can get the parallel version to run faster, you still suffer from the "redundantly calling itself" syndrome.

The following is a faster serial recursive method that removes redundancy:

__int64 SerialFibMedium( int n, __int64& Prior )
{
	if( n==2 )
	{
		Prior = 1;
		return 1;
	}
	__int64 PriorPrior;
	Prior = SerialFibMedium( n-1, PriorPrior );
	return Prior + PriorPrior;
}

__int64 SerialFibMedium( int n )
{
	if( n<2 ) return n;
	__int64 Prior;
	return SerialFibMedium( n, Prior );
}

Initial call is made to the second routine.

While you can still "parallelize" this, the cost of one spawn will exceed the runtime of the serial code for any "n" small enough not to overflow the __int64 output.

The point I am making is: do not parallelize code when it is not appropriate.

Jim Dempsey

www.quickthreadprogramming.com
BradleyKuszmaul

> I would be very surprised to find that Cilk Plus does not have a tuning means for the wait-time (block time).

MIT Cilk did not have such a tuning knob. I've never seen one in Intel Cilk. Surprise! I'm not really interested in reading about OpenMP on a Cilk forum.

BradleyKuszmaul
I think that one can learn a lot about a parallel programming environment by examining how it does on this kind of simple code.

When I use icc to take this measurement on my laptop (which has two real cores, with HT enabled, so Cilk thinks there are 4 cores), here is what I see:

$ icc fib.c -o fib -O2 -cilk-serialize
$ ./fib 40
Result: 102334155
Time : 0.5314
$ icc fib.c -o fib -O2
$ ./fib 40
Result: 102334155
Time : 4.2463
$ CILK_NWORKERS=1 ./fib 40
Result: 102334155
Time : 10.9888

So there's about a 20x overhead for using Cilk for this example on ICC. And on my laptop I see a 2.58-fold speedup, which is pretty good considering that it's really only two cores.
By using Jim's optimization (don't spawn the second call to fib), the 1-worker speed goes to 6.44s and the 4-worker speed to 2.26s.
One can argue that this isn't the right way to write a fib program, but that's not the point. The point is that we can see what the overhead of the cilk mechanism is. A spawn costs about the same as 20 function calls.
If you do care about making fib run faster, then even the linear-time fib is too slow. There is a log-time solution based on matrix multiplication that runs in only O(log n) multiplies. Consider this 2x2 matrix, call it M.
(1 1)
(1 0)
Take it to the power of n, and you'll see that fib(n) appears in the upper right corner. M^n can be calculated in log time by repeated squaring. For example M^40 can be calculated by M^32 * M^8 where M^8 can be calculated as (((M^2)^2)^2) and M^32 can be calculated as (((M^8)^2)^2)
The number of bits in the output grows quickly (roughly speaking, fib(n) requires about 2n/3 bits), and the last multiply may dominate the cost. It probably takes something like O(n log n log log n) bit operations using a fast multiplication algorithm. (I haven't worked this out carefully, so it might be off by a little.) The fast multiply can probably run in parallel, so you can make it faster on a multiprocessor.

Incidentally, the linear-time algorithm performs n additions, and the length of the number grows by a constant number of bits on each step, so it takes O(n^2) bit operations.

So if I have a machine with 64 gigabits of storage (like my laptop), then I can hope to compute fib(10,000,000,000) without too much trouble. (I don't think I want to wait around for the answer from the linear-number-of-adds algorithm.) This would be a great project for a budding Cilk programmer to do, and to write up. -Bradley


jimdempseyatthecove

Bradley,

Good summary.

>>A spawn costs about the same as 20 function calls

In this example it would be 20x redundant recursions.

To measure cost ratio to function call

// assure this function is not inlined
void fooInc(int& foo)
{
++foo;
}

int foo = 0;
// time this
for(int i=0; i!=yourLimit; ++i)
fooInc(foo);

// time this
// ignore the fact that foo is not atomically incremented
// we are only interested in the spawn overhead
for(int i=0; i < yourLimit; ++i)
{
cilk_spawn fooInc(foo);
cilk_sync;
}

// time this
// ignore the fact that foo is not atomically incremented
// we are only interested in the spawn overhead
for(int i=0; i < yourLimit; i += 2)
{
cilk_spawn fooInc(foo);
cilk_spawn fooInc(foo);
cilk_sync;
}

... increasing number of spawns and += n for the number of spawns of interest

This will give you a representative idea of the overhead.

-----------------------------

Additional test that may be of interest

replace fooInc with

// compile such that not inlined
void stall(__int64 n)
{
__int64 end = _rdtsc() + n;
while(_rdtsc() < end)
continue;
}
//...

__int64 n;
for(n=0; n < 10000; ++n)
{
__int64 begin = _rdtsc();
cilk_for( int i=0; i < 10000; ++i)
stall(n);
__int64 elapse = _rdtsc() - begin;
if(elapse < n)
break;
}
cout << "break even = " << n << endl;

The break even will vary depending on number of worker threads.

Jim Dempsey

www.quickthreadprogramming.com
BradleyKuszmaul

Please go ahead and measure the things you find interesting. And if you want to report it for some other tool too, that might be interesting. I measured what I thought was interesting.

The reason we program in parallel is for performance, and the exponential-time fib program shows us something about how to get high performance on parallel codes. Knowing that the spawn overhead is 20 function calls tells me that I should coarsen the base case of the recursion so that something on the order of 20 function calls are done without Cilk. Then the spawn cost will match the non-spawn cost, and I'll be within a factor of two of optimal. I might want to make 200 function calls in the base case to get the spawn cost to be less than 5% of the runtime. The idea is to understand how you might transform your code to get high performance. If your recursive program already performs at least ten microseconds' worth of work at the leaves of the computation, then you don't have to do anything. If it's a lot less, like 1ns worth of work, you will be paying a lot for the overhead of spawn.

So let's see what happens. Code is shown far below. I cleaned up the code so that it can be compiled with -W -Wall and not emit any errors. This code takes two arguments: N, to compute fib(N), and M, the size of the base case (that is, compute any fib(M) using the serial code). It's still exponential-time code (that's not the optimization we're trying to understand). The game here is to show how, by coarsening the base case, one can avoid spawn overhead. So here's the result for computing fib(40) [the original wallpaper is shown far below]:

Serial code: 0.7s
 one worker:
   base=2:  6.65s
   base=3:  4.22s
   base=4:  2.85s
   base=5:  2.07s
   base=6:  1.49s
   base=7:  1.19s
   base=8:  0.96s
   base=9:  0.85s
   base=10: 0.77s
   base=11: 0.72s
   base=12: 0.69s
   base=13: 0.66s
   base=14: 0.66s

 two workers:
   base=14: 0.35s
 four workers:
   base=14: 0.25s

Speedup two workers: 2.00
Speedup four workers: 2.79

Since fib(8) is 21, that's where the coarsened base case is close to the 20x overhead. So the function call for fib1(8) costs about the same as a spawn. At that point we might expect the spawns to take half the time and the work to take the other half. In fact, at that point the spawn overhead is about 1/3 of the runtime. That's not too bad for an estimate. By the time we get to fib(14) for the base case (fib(14) = 377), we've essentially removed all the cilk overhead. If we go from that point to running on two cores, we get a speedup of 2.00 over the serial code, and with 4 workers (again on my laptop with 2 cores running HT), the speedup is 2.79. Here's the code:
[cpp]#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cilk/cilk.h>

static int basecasesize=2;

static int fib1 (int n) __attribute__((const));
static int fib1 (int n) {
    if (n<2) return n;
    else {
#pragma warning (disable:981)
	return fib1(n-1)+fib1(n-2);
#pragma warning (default:981)
    }
}

static int fib (int n) {
    if (n < basecasesize) return fib1(n);
    else {
        int x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return x+y;
    }
}[/cpp]

And here's the wallpaper. Note that with icc, you can compile the cilk code with all the cilk features nullified by using -cilk-serialize.

$ icc fib.c -o fib-s -O2 -Wall -W -cilk-serialize
$ ./fib-s 40 2
Result: 102334155
Time : 0.6965
$ icc fib.c -o fib -O2 -Wall -W
$ CILK_NWORKERS=1 ./fib 40 2
Result: 102334155
Time : 6.6448
$ CILK_NWORKERS=1 ./fib 40 3
Result: 102334155
Time : 4.2223
$ CILK_NWORKERS=1 ./fib 40 4
Result: 102334155
Time : 2.8528
$ CILK_NWORKERS=1 ./fib 40 5
Result: 102334155
Time : 2.0746
$ CILK_NWORKERS=1 ./fib 40 6
Result: 102334155
Time : 1.4851
$ CILK_NWORKERS=1 ./fib 40 7
Result: 102334155
Time : 1.1948
$ CILK_NWORKERS=1 ./fib 40 8
Result: 102334155
Time : 0.9635
$ CILK_NWORKERS=1 ./fib 40 9
Result: 102334155
Time : 0.8447
$ CILK_NWORKERS=1 ./fib 40 10
Result: 102334155
Time : 0.7685
$ CILK_NWORKERS=1 ./fib 40 11
Result: 102334155
Time : 0.7251
$ CILK_NWORKERS=1 ./fib 40 12
Result: 102334155
Time : 0.6851
$ CILK_NWORKERS=1 ./fib 40 13
Result: 102334155
Time : 0.6585
$ CILK_NWORKERS=1 ./fib 40 14
Result: 102334155
Time : 0.6569
$ CILK_NWORKERS=2 ./fib 40 14
Result: 102334155
Time : 0.3479
$ CILK_NWORKERS=4 ./fib 40 14
Result: 102334155
Time : 0.2489


