Are threads persistent in CilkPlus?

Suppose I have a cilk_for that runs its course, and then another one right after it. Do the worker threads die after the first and are pthread_created() for the second one again, or do pthreads get created at the beginning of the process and persist for its whole duration?

In the current implementation, worker threads are not destroyed unless an explicit call to __cilkrts_end_cilk() is made.  Thus the same overall set of worker threads should be executing the consecutive cilk_for loops.  

Of course, because of dynamic load balancing,  even if the consecutive cilk_for loops happen to have the same number of iterations, there is no guarantee that the same worker thread will execute the same loop iterations.



This is a generic point, and applies equally to OpenMP and similar interfaces.  Good programs will work in either case, and this is at most an implementation-dependent performance issue.  But I did report it as a significant performance issue to one (Fortran, OpenMP) compiler, and I assume that the current CilkPlus implementation keeps its worker threads alive for performance reasons.

It's extremely unclear whether a design like CilkPlus can or should specify and/or allow the control of resource affinity - I have never seen it done well (it's horrible in OpenMP), but it is a fairly major tuning tool.  The same applies to process placement in MPI.  Whatever, please don't rush in!

OpenMP and Cilk(tm) Plus differ significantly in the approach to affinity.

In OpenMP, it's partly covered by the standard and partly by accepted practice.   Most OpenMP implementations have an adjustable time-out on thread persistence (KMP_BLOCKTIME for Intel).  A side effect of this is that Cilk(tm) Plus can't use a hardware thread which was recently in use under OpenMP until this time interval expires.

OpenMP 4.0 offers additional facilities for affinity; previously existing facilities such as Intel KMP_AFFINITY or gnu GOMP_CPU_AFFINITY deal with only a single level of threaded parallelism.

Cilk(tm) Plus appears to emphasize quicker re-assignment of hardware resources ("composability") at the expense of cache data locality.  Where the usual usage of OpenMP and MPI is unfriendly to sharing CPUs among multiple tasks, Cilk(tm) Plus favors this sharing over the efficiency of dedicating hardware to a single task.
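For reference, the controls mentioned above are ordinary environment variables. The values below are illustrative only; check your compiler's documentation for the defaults and accepted syntax on your system:

```shell
# Intel OpenMP: how long an idle OpenMP thread spins before sleeping and
# releasing its hardware thread (the time-out discussed above).
export KMP_BLOCKTIME=0            # milliseconds; 0 = sleep immediately

# Single-level affinity controls that predate OpenMP 4.0:
export KMP_AFFINITY="granularity=fine,compact"   # Intel compilers
export GOMP_CPU_AFFINITY="0-7"                   # GNU compilers
```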

The other consideration is whether you're on a timeshared system or dedicated hardware.  If you've specified your affinity to pin a thread to all of the cores on the system and some other task is running on one of those cores, then you're going to have to wait for that core.

Even if we supported thread affinity, a well-written Cilk application should have enough "parallel slackness" to allow other workers to steal a portion of the work.  As a general rule-of-thumb, your work should be broken into about 10P chunks.  cilk_for does this for you automatically.  This also allows your application to deal well with a varied workload where the chunks aren't all of uniform size. 

    - Barry

I agree that OpenMP and CilkPlus differ, though I could debate affinity's actual state in OpenMP (let's not!).  My general point was that relying on thread affinity is not a good idea, as it introduces a lot of extremely nasty factors (like the one mentioned above, and the fact that different compilers, versions or systems may do different things for the same code).  It's also extremely hard to get right, either in a language specification or an actual implementation, and I have bitter experience of implementations that got it almost right but not quite!

Barry >> If you've specified your affinity to pin a thread to all of the cores on the system and some other task is running on one of those cores, then you're going to have to wait for that core.

The above is the same as no affinity, i.e. the "pinned" thread is free to migrate amongst all the cores (hardware threads) it is pinned to; in the above case, all cores. One can pin a software thread to a single hardware thread (in which case your wait example is valid), or, if you wish, one can pin a software thread to any of the hardware threads within a core (this may be 2 or 4 threads depending on the CPU), or one could pin a software thread to all the threads sharing an L3 cache, or.... And in all the preceding cases, a single interloper thread from another process would not block progress .AND. would not interfere with cache locality (other than competition for cache resources).

Jim Dempsey

Oops.  You're correct. I meant to say, "If you've pinned a thread to *each* of the cores".
