Is there some "intelligence" built in the CilkPlus runtime?

Dear all,

We are running some experiments with Cilk, and one thing that strikes us as odd is that our code shows a "warm-up" effect. When we run the code for the first time, it takes, say, 8 seconds, while further repetitions of the function call (even with different arguments) go down to about 0.5 seconds and consistently stay there. It may be caches, it may be prefetching, etc., but I'd like to first rule out Cilk Plus as the culprit. For that, I would like to know: during task creation and scheduling, does the Cilk runtime "realize" at any point that it can optimize some tasks for any reason? Are there any other under-the-hood optimizations that may be causing this behavior?

Thanks very much for your time.


I can think of a couple of things that might be affecting you, but they shouldn't be causing a 16x performance difference:

  • There is initialization of the runtime which you'll pay for the first time you spawn a function or use a cilk_for.  You can initialize the runtime manually using __cilkrts_init() if you choose.
  • There are cache effects.  When a continuation is stolen, the thief starts running with a cold cache.
  • There's the marshalling of the execution of the "top" function onto the original user worker thread.

There is no intelligence in the runtime that attempts to optimize which task is scheduled to a given thread.  Stealing is purely random.  When we discussed it with Professor Leiserson, he said that experiments with MIT Cilk had shown that there was little to no benefit to it.  Attempting to do "smart" scheduling might help in some edge cases, but it hurt in others, and it just complicated the code.  Instead, the onus is on the programmer to make smart choices about the decomposition of the application so that "large" chunks of work are stolen and stealing is infrequent.  Thus cilk_for is implemented as a divide-and-conquer algorithm which breaks the work roughly in half at each step, instead of as a for loop spawning each iteration as a separate task.

    - Barry

Parallel programming with either Cilk(tm) Plus or OpenMP on MIC has been notorious for taking a few extra seconds the first time (to allocate implicit data structures?).  I guess the amount of time would depend on complexity.  Anyway, it might be interesting to know something about the platform and application.

Barry's description of divide and conquer appears to me to resemble OpenMP schedule(guided).  I'm sure he won't appreciate the analogy. schedule(guided) can be advantageous on a dual CPU host but disadvantageous on MIC. 

We are trying to get clarification on the cases where the exact form of the trademark is required, and this post of mine was one of them, so I hope I have it fixed.

An additional factor is that, depending on the operating system, virtual memory mapping to physical RAM or to the page file is deferred until first access (as opposed to happening at the time of malloc or thread stack creation). First access then performs the necessary mapping (not necessarily a large overhead), but it may also include zero-filling the page, with the potential for large numbers of page faults.

Jim Dempsey

www.quickthreadprogramming.com
