Implementation of the parallel runtime systems

Hi all,

My question is about the implementation details of different parallel runtime systems, e.g. the OpenMP, Cilk Plus, and TBB runtimes.

Where can I find detailed information? For example:

* How do they do runtime task scheduling?

* What are the differences between GCC implementation and ICC implementation of the OpenMP runtime system?

* How is work stealing done in these approaches?

* Do OpenMP and Cilk Plus use busy waiting?

Is there any source where I can find answers to such questions?

Many thanks in advance!

 


Source code is available in the GNU project and at https://www.openmprtl.org/.

"Work stealing" is usually taken as Cilk(tm) Plus terminology, although OpenMP schedule(dynamic) may serve a similar purpose.

OMP_WAIT_POLICY is the standard way to control OpenMP spin waiting. The implementations you mention have their own additional environment variables (KMP_BLOCKTIME, GOMP_SPINCOUNT). It is normal to default to a busy wait for some interval, after which threads suspend.
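As a concrete illustration of those knobs (the values below are arbitrary examples, not recommendations, and the program name is hypothetical):

```shell
# Portable OpenMP control: "active" spins, "passive" suspends waiting threads.
export OMP_WAIT_POLICY=passive

# Intel runtime: milliseconds a thread busy-waits before sleeping.
export KMP_BLOCKTIME=0

# GNU libgomp: number of spin iterations before a thread suspends.
export GOMP_SPINCOUNT=10000

# ./my_openmp_app   (hypothetical program, launched with these settings)
```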

Using OpenMP and Cilk(tm) Plus together raises the issue of possible over-subscription, as well as the problem of the OpenMP spin wait delaying an OpenMP thread context from yielding to Cilk.

All of the runtimes you mentioned are open source, so you can look at the sources and check their implementations.

--Vladimir

Thanks!

@Tim: Thanks, very helpful information!

@Vladimir: Yes, but first I want to make sure that the open-source implementation and the commercial one I use behave fairly similarly.

Second, I am wondering if there is detailed documentation that discusses the key points.

Best Reply

Yes, as Tim and Vladimir mentioned, you'll need to look at the sources and the docs associated with those methodologies. You can find various articles on Cilk Plus, OpenMP, etc. in the Intel Developer Zone (https://software.intel.com/en-us) and in the compiler user manuals as well. In addition, http://www.cilkplus.org offers interesting details too, including an FAQ. Also see the article at https://software.intel.com/en-us/articles/getting-code-ready-for-paralle... which could be helpful in general (especially its listing of benefits vs. caveats, although it needs to be updated to include OpenMP 4.0).

With that said, here's some info of interest:

  • There's no queue of tasks in Cilk (unlike OpenMP), so the models are different per se, and this affects how load balancing is handled.
  • With OpenMP, performance really depends on how much the user can fine-tune (which requires understanding the underlying system) to get the code to execute efficiently. Not so in Cilk, which supports a very high level of abstraction (separating the specification from the scheduler) and uses a randomized work-stealing scheduler that does dynamic load balancing without oversubscribing threads on the system.
  • With Cilk, the implementation per se is also different: a cilk_spawn has much less overhead than creating an OS thread, since it only needs to push the task onto a data structure without making a call into the OS and getting its scheduler involved (unlike OpenMP). Again, the models are different.
  • If you are just interested in the simple case of parallelizing a for loop where iterations are independent, I don't think there will be much performance difference between Cilk Plus and OpenMP, although you may need to use dynamic scheduling with OpenMP to address thread imbalance, whereas Cilk Plus takes care of that automatically. In this simple case no code restructuring is needed, but for either paradigm you have to figure out where to insert the right pragmas for OpenMP, or cilk_for/cilk_spawn for Cilk. I'd say the development effort is about the same for this simple case. For small, very balanced loops, OpenMP performs better. Also, for non-loop parallelism, Cilk Plus is far easier to work with.
  • Note that the Cilk runtime uses work-stealing scheduling, unlike the OpenMP runtime. Generally, work stealing helps in scenarios where there is an imbalance in the distribution of workload across the threads in action.
  • In addition, even though OpenMP supports task parallelism via #pragma omp task and #pragma omp sections, it's harder to envision the task-parallel architecture versus the abstraction offered by Cilk. And now with OpenMP 4.0 you can exploit instruction-level data parallelism via #pragma omp simd, which was copied almost verbatim from Cilk Plus's #pragma simd; they are almost entirely equivalent at this point. However, Cilk Plus also has array notation, which supports vectorization with a more concise syntax that some people really like.
  • OpenMP has better support for cache affinity. This comes into play when a parallel loop is executed multiple times, each time working on effectively the same memory locations. Using static scheduling, OpenMP will typically execute the same iterations on the same cores, giving much better performance by reusing the cache.
  • Cilk Plus composes better than OpenMP does, so it is easier to write a parallel function without worrying that you will conflict with parallelism elsewhere in the program and thus oversubscribe the machine.
  • Cilk is easier to program in, but OpenMP has more knobs for fine-tuning. It comes down to how much tuning a developer is willing to do; you can get more bang for the buck with Cilk, at least initially.
  • It also comes down to how static your hardware and software environment is. An OpenMP program can be fine-tuned for a specific machine with a specific workload, but that same binary might not perform well on a larger or smaller machine, or on a machine that has other programs running on it. A Cilk program, by contrast, will automatically scale both up and down, with no further tuning or recompilation needed.

Hope the above helps,

_Kittur

>>However, Cilk Plus also has array notation, which supports vectorization with a more concise syntax that some people really like.

Array notation can be used directly in C++ without Cilk Plus, including alongside OpenMP (assuming you apply the appropriate compiler options).

Jim Dempsey

www.quickthreadprogramming.com

Thanks @Jim! @Kittur Ganesh: Thanks very much for the answer! There are some considerations:

1- "There's no queue of tasks in Cilk (unlike OpenMP)": As far as I know, tasks in Cilk Plus are functions, and each worker's stack operates like a deque. Besides that, what are the main differences between Cilk Plus, OpenMP, and TBB in terms of task scheduling? For example, I know that TBB uses a similar stealing mechanism to Cilk Plus: they both steal from the top of the deques (or stacks).

2- I have read this: "A task in OpenMP has an assigned thread that executes the code and the data." When are tasks assigned to threads? Are they placed in a single queue or multiple queues? How is load balancing/task stealing performed?

Many Thanks!

Quote:

Kittur Ganesh (Intel) wrote:

  • With OpenMP, performance really depends on how much the user can fine-tune (which requires understanding the underlying system) to get the code to execute efficiently. Not so in Cilk, which supports a very high level of abstraction (separating the specification from the scheduler) and uses a randomized work-stealing scheduler that does dynamic load balancing without oversubscribing threads on the system.
  • With Cilk, the implementation per se is also different: a cilk_spawn has much less overhead than creating an OS thread, since it only needs to push the task onto a data structure without making a call into the OS and getting its scheduler involved (unlike OpenMP). Again, the models are different.
  • If you are just interested in the simple case of parallelizing a for loop where iterations are independent, I don't think there will be much performance difference between Cilk Plus and OpenMP, although you may need to use dynamic scheduling with OpenMP to address thread imbalance, whereas Cilk Plus takes care of that automatically. In this simple case no code restructuring is needed, but for either paradigm you have to figure out where to insert the right pragmas for OpenMP, or cilk_for/cilk_spawn for Cilk. I'd say the development effort is about the same for this simple case. For small, very balanced loops, OpenMP performs better. Also, for non-loop parallelism, Cilk Plus is far easier to work with.
  • Note that the Cilk runtime uses work-stealing scheduling, unlike the OpenMP runtime. Generally, work stealing helps in scenarios where there is an imbalance in the distribution of workload across the threads in action.
  • OpenMP has better support for cache affinity. This comes into play when a parallel loop is executed multiple times, each time working on effectively the same memory locations. Using static scheduling, OpenMP will typically execute the same iterations on the same cores, giving much better performance by reusing the cache.
  • Cilk Plus composes better than OpenMP does, so it is easier to write a parallel function without worrying that you will conflict with parallelism elsewhere in the program and thus oversubscribe the machine.
  • It also comes down to how static your hardware and software environment is. An OpenMP program can be fine-tuned for a specific machine with a specific workload, but that same binary might not perform well on a larger or smaller machine, or on a machine that has other programs running on it. A Cilk program, by contrast, will automatically scale both up and down, with no further tuning or recompilation needed.

Hope the above helps,

_Kittur

I agree entirely with the Best Reply which Kittur earned on this post.   It's the first time I've seen the limitations of cilk_for addressed.

I disagree in part with the first comment which I quoted above. The Cilk work-stealing model seems to be predicated on the unified cache model of Intel single-CPU hosts and doesn't scale as well to NUMA platforms, in those frequent cases where it's feasible to achieve memory locality with OpenMP. Kittur mentions the related cache-locality issue in a subsequent point.

On the next point, OpenMP uses schemes such as Intel's KMP_BLOCKTIME to overcome the overhead of re-building thread teams between parallel regions.  Among other things, this delays availability of cores recently used by OpenMP when starting up Cilk workers, particularly on Windows, where it may be necessary to increase rather than decrease the default setting.

Many of us have noticed the comparison Kittur makes between OpenMP dynamic scheduling, where Cilk(tm) Plus is more often than not competitive, and the simple cases of static scheduling, where OpenMP typically performs twice as well as Cilk(tm) Plus.

To me, the "composability" jargon hides several interesting issues. In order to get the advantage of immunity from serious performance degradation due to over-subscription, it's frequently necessary to disable HyperThreading (or, on Linux, to take advantage of taskset). Intel(r) Xeon Phi(tm) doesn't offer such facilities to help out Cilk(tm) Plus. In my examples, there are some where cilk_for can't be combined effectively in any straightforward way with vectorization, as OpenMP 4 can, so the advantage of immunity to over-subscription comes at the cost of giving up on parallelism. The degradation of OpenMP performance with over-subscription is largely brought about by conflict with cache locality and the need to shut off pinning options, which Cilk(tm) Plus doesn't have.

OpenMP (like threading APIs) does offer opportunities to build in dependencies on specific numbers of cores and BIOS settings, but I don't see this done often. For example, the option to set a specific KMP_AFFINITY which can't be overridden at run time is seldom used.

When Kittur says #pragma omp simd was copied from Intel's #pragma simd, it makes me wonder why fewer of the Intel legacy simd directives were implemented in Fortran omp simd (such as the ability to suppress temporary arrays, which omp simd accomplishes in the GNU compilers, but Intel touts as a Cilk feature).

My tests using the OpenMP and Cilk(tm) Plus models are posted at https://github.com/tprince/lcd

Thanks Tim for further elaboration and I'll pass on your comments to the Cilk Plus team.  I'll update if I get any more info accordingly.

Again, I'd like to emphasize that Cilk's work-stealing scheduler automatically load-balances if there is sufficient parallelism. The divide-and-conquer approach results in fewer steals and less overhead (e.g., cilk_for). Also, most spawns do not result in steals, and overhead can be reduced by letting each spawn do more work by setting a proper grain size for loops. The default grain size should yield good performance in most cases.

_Kittur
