number of cilk_spawn

number of cilk_spawn

I have a question on the number of cilk_spawn keywords in a code block. Originally I used only one cilk_spawn as in code 1, but in myspecificexperiment, code 2 appears to be faster than code 1.

// code 1
cilk_spawn f1();

// code 2
cilk_spawn f1();
cilk_spawn f2();
cilk_spawn f3();

Is code 2 even the right way to use cilk_spawn? Could anyone please comment on the run time difference between one and multille usage of cilk_spawn?

Thank you very much in advance.



3 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

It all depends upon how much work your functions do. Consider what you've specified in your two code snippets:

In the first snippet, you're saying that f1() can be run in parallel with f2() and f3(), and then the two parallel strands must sync.

In the second snippet, you're saying that f1(), f2() and f3() can all be run in parallel, allowing idle workers to steal each of them and run them in parallel. Assuming that all three functions do an appreciable amount of work, then that's not an unreasonable way to do things. However in general, I'd skip the final cilk_spawn keyword. That is saying that there is a strand between the call to f3() and the cilk_sync that can be stolen. Since it's empty it's a waste of time. Empty strands like this should be optimized out by the compiler, but I'm not sure if it is in the current versions.

So I might write this sequence as:

    cilk_spawn f1();
    cilk_spawn f2();

As always, you need to maintain a balance. cilk_spawn is reasonably cheap so the temptation is to use lots of them. But even small amounts of additional overhead add up if you do it often enough. If f2() and f3() are small, then putting f2() into it's own strand may not gain you much.

In general, you want to have enough parallelism to keep the workers busy. Cilkview will give you a value for the parallelism of your application. This is based on two values:

  1. "work" - How long it would take to run your application if it was run on a single processor.
  2. "span" -How long it would take to run your application if it was run on an infinite number of processors. This assumes that all code that could possibly run in parallel does. Essentially this is the critical path through your application.

Cilkview measures these by counting instructions. Which is imprecise, but good enough. If the span is a large fraction of the work, then there's only so much that running on multiple processors can do to speed up your program. If the span is small, then you have a lot of scope to run things in parallel, so you can (potentially) get a lot of speedup.

A rough rule to follow is that you want the potential parallelism to be approximately 10x the number of workers you've got so if the tasks are unbalanced and there's a long task, the idle workers can steal enough to keep themselves busy.

Sometimes the results are not intuitive. Qsort is a classic divide-and-conquer algorithm which we expected to parallelize well. When we first ran the qsort examples through Cilkscreen, we got a lower than expected parallelism value. We were confused until we examined the algorithm closely and realized that the qsort function partitions the data, and then recursively calls itself for the two halves of the data, running the two halves in parallel. The partition is a sequential operation and a major portion of the span. So despite having a lot of cilk_spawns, the partition of the data prevents us from getting a good parallelism score.

Of course there are other factors that may prevent your application from speeding up when you parallelize it, like memory bandwidth or locks. But those are topics for another post.

- Barry

Thanks Barry as always for the kind help and detailed explanations. It makes a lot of sense now. I will try removing the last cilk_spawn and cilkview as well. Have a good day!

Leave a Comment

Please sign in to add a comment. Not a member? Join today