cilk_spawn not using idle threads

I have a large, dynamically-resizable hash table whose access is controlled with a tbb::spin_rw_mutex. Multiple Cilk threads request access to this table, and occasionally one thread's attempt to add a new item triggers a resize operation. The thread requesting the resize operation already holds a writer lock, so the other threads are quickly blocked from proceeding with their read or write requests, and become idle.

The resize operation, which needs to recompute the hash functions for all entries, is easily parallelizable. I expected the runtime system to recognize that the other threads are idle and use them to help with the resize operation. This is not what happens. Using __cilkrts_get_worker_number() I can show that no matter how many times my resize operation spawns child threads to help, all child function calls are being performed by the same thread ID that called the resize operation.

Is Cilk not smart enough to recognize that it can use the idle threads (I've seen that they are idle in VTune Amplifier), or is there some other reason that I cannot resize my hash table in parallel?


I'm not entirely sure what you're expecting to happen here. A thread that is waiting on a synchronization object is stuck until access is granted. TBB spin locks are no exception.

In the past we've discussed creating a new type of lock which would allow the runtime to recruit other workers that are waiting for access to that lock, but we haven't gone any further than discussing the idea. We'd need a way to restrict the work available for stealing to the work of the worker that owns the lock.

But even if this new type of lock were available, workers waiting on TBB spin locks or OS synchronization primitives would not be able to participate.

- Barry

Can your application use a concurrent hash map from TBB? From the standpoint of eliminating the bottleneck, you will likely get better performance using a hash table implementation which is designed for concurrent access.

You have run into a limitation of the current runtime system. There are several obstacles that must be dealt with before Cilk can effectively exploit parallelism inside a critical section protected by a lock, which is what your code is attempting to do.

First, the runtime needs to be informed that a worker thread blocked at a lock should be considered idle. A typical lock implementation might have a thread yield to the OS if it fails to acquire the lock quickly enough. The runtime would instead need to direct the worker thread blocked at the lock to begin stealing other work. Currently, the Cilk runtime does nothing special in this case.

Moreover, even if the runtime did have worker threads steal when they block at a lock, it is tricky to choose the correct work to steal. If you steal another task which is trying to read or write to the table again, then the worker might block again. If the worker thread ends up repeatedly blocking at the same lock, it might use up a lot of space, since each task that blocks on the lock must save some state somewhere to be resumed later. In this case, we really want the worker thread to steal work inside the resize operation.
Unfortunately, it turns out that stealing the work inside the critical section seems to require some nontrivial extensions to the runtime, because the Cilk runtime scheduler is, for performance reasons, optimized to allow only stealing of certain tasks, not others.

At the risk of shameless self-promotion, I will say that I know of at least one way to eliminate this limitation in Cilk. I'll refer you to the following recent research paper on this topic if you are interested in more technical details. :)

This kind of solution is still at the stage of a research prototype, however...


I come from the world of OpenMP, so maybe I misunderstand how Cilk allocates threads. I had hoped that Cilk would create new threads inside the resize subroutine and pass them work. The OS could then context-switch away from the idle, blocked threads to these new, active threads. I do not expect a blocked thread to drop what it is doing; rather, I expect new threads to begin performing the resize operation on the now-idle CPU cores.

I was disappointed to see that I cannot dynamically increase the number of threads in Cilk after the first cilk_spawn. Maybe I can ask for, say, 8 threads on my 4-core machine at the start of execution, only allow 4 threads to operate normally, but begin using the other 4 threads when the resize operation is called. The limited control that I have over the number of threads seems to make that impossible.

The concurrent hash map from TBB does not support multiple items with the same key, which my implementation seems to require.

The Intel Cilk Plus runtime uses a work-stealing scheduler to dynamically load-balance the tasks that are created by a Cilk Plus program.

At a high level, the runtime scheduler executes a Cilk Plus program using worker threads. More specifically, by default, the runtime scheduler queries the OS to determine how many cores are present on the system, and creates a "worker thread" for each core, minus one for the user thread that starts the runtime. You can override the value returned by the OS using the CILK_NWORKERS environment variable or an API call. Each worker maintains a deque (short for double-ended queue) for storing tasks that have yet to execute.

Intuitively, in the runtime, each worker thread maintains its deque using the following simple algorithm:

1. When a worker thread executes a cilk_spawn or a cilk_for statement, it may push new tasks onto the tail (bottom) of its deque.

2. When a worker thread reaches a cilk_sync that needs to wait for a spawned function to complete, it tries to keep busy by popping a task from the tail of its deque.

3. If the worker thread discovers that its deque is empty, it tries to steal work from the head (top) of the deque of another worker, where the worker is chosen at random.
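The three rules above can be sketched as follows. This is a deliberately toy, single-threaded illustration of the deque discipline, not a real lock-free implementation (the actual Cilk runtime uses a more sophisticated protocol and steals continuations rather than task ids); "tasks" here are just ints.

```cpp
#include <deque>

// Toy illustration of the work-stealing deque rules described above.
struct Worker {
    std::deque<int> deque;

    // Rule 1: cilk_spawn / cilk_for push new tasks onto the tail.
    void spawn(int task) { deque.push_back(task); }

    // Rule 2: at a cilk_sync, the worker pops from its own tail (LIFO).
    bool pop_local(int& task) {
        if (deque.empty()) return false;
        task = deque.back();
        deque.pop_back();
        return true;
    }

    // Rule 3: a worker with an empty deque steals from a victim's head (FIFO).
    bool steal_from(Worker& victim, int& task) {
        if (victim.deque.empty()) return false;
        task = victim.deque.front();
        victim.deque.pop_front();
        return true;
    }
};
```

The tail/head asymmetry is the key design choice: the owner works depth-first on the newest tasks for locality, while thieves take the oldest (typically largest) tasks from the head, minimizing the number of steals needed.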

Let's assume for the moment that there's a way to tell the Cilk runtime to create extra worker threads and start them in a suspended state. The Cilk runtime has no visibility into synchronization primitives, so when your application waits on a synchronization object (or I/O, or any blocking call, and there are potentially lots of them) that worker is blocked. You'd have to add calls to your application before and after each blocking call to add and subtract from the number of available workers. Don't miss one, or you'll start oversubscribing your system. And don't forget that decrementing the number of workers won't take effect immediately. It will only suspend a worker that is looking for work to steal.

Now assume that we provided the ability to modify the number of available workers. You'd need a mechanism to restrict the work those additional workers could steal from to the work you're trying to advance. How would you specify that?

A basic design philosophy of Cilk is that it should be easy to use. Cilk developers should not have to worry about which task is going to be stolen. You should be able to expose the parallelism in your algorithm and let the Cilk runtime worry about scheduling the tasks.

- Barry

So am I not exposing the parallelism correctly in this resize function, or is there just no way to alleviate this serialization with Cilk?

The short answer is "no, there is no easy way to eliminate this serialization in Cilk at the moment."
As Barry was explaining, the current implementation and design of the runtime makes it difficult to efficiently support parallelism inside a critical section at this time. The code you are writing seems (to me) to be perfectly reasonable, but there is no (simple) way that I can think of for the existing runtime to exploit the parallelism inside the resize function without us making some runtime modifications.

In general, conventional thinking in parallel programming has been "make your critical sections small to avoid contention", and thus, to my knowledge, we have not received requests for this particular feature before. If there is demand for it, then it might be worth looking into for the future.

We may be able to work with you to come up with an alternative solution.
Can you solve your problem with a level of indirection? Perhaps you can wrap a TBB hash table, so that each element in the hash table is not a (key, value) pair, but a pair which is (key, pointer to a linked list of values)?
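The indirection suggested above might look like the following sketch. std::unordered_map stands in for tbb::concurrent_hash_map to keep the example self-contained; with TBB you would do the push_back while holding a concurrent_hash_map accessor on the element. The class and method names are illustrative.

```cpp
#include <list>
#include <unordered_map>

// Sketch of the suggested indirection: instead of storing duplicate keys
// directly (which the underlying hash map may not allow), each key maps to
// a list of all values ever inserted under it.
template <typename K, typename V>
class MultiValueMap {
public:
    void insert(const K& key, const V& value) {
        map_[key].push_back(value);  // creates the list on first insert
    }

    // Returns a copy of the list of values for key (empty if absent).
    std::list<V> find(const K& key) const {
        auto it = map_.find(key);
        return it == map_.end() ? std::list<V>{} : it->second;
    }

private:
    std::unordered_map<K, std::list<V>> map_;
};
```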

Thank you. I am considering your last suggestion: to use tbb::concurrent_hash_map with values that represent a pointer to the head of a list.

Alternatively, I could take the resize operation out of the locked region and wait for all threads to synchronize just before calling resize. The lost effort may be made up for by the much-faster resize.

Oh, another idea came up in some of our offline discussions which might work.

You might also be able to start up a parallel region for the resize, parallelized using a different threading platform (e.g., TBB or OpenMP). This resize would then create new threads (which are not managed by the Cilk runtime), thereby getting around the problem. [You'd also have to parallelize the resize using the other platform.] As long as the Cilk threads are yielding their scheduling quanta when they are blocked on the lock, the OS should in principle be doing the right thing in terms of scheduling. The downside is that you have now created extra threads for your resize operation, which might be slower than if you were to restructure the program as you mentioned above.

It does not seem like a particularly clean solution, though... I've never personally tried coding this approach myself, so I'm not sure what kind of performance you would get. But I wanted to throw it out there, because it might be an option. The Cilk and TBB runtimes are intended to be able to coexist and cooperate in the same program.
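To make the idea concrete, here is a sketch of a resize parallelized with threads the Cilk runtime knows nothing about. It uses std::thread rather than TBB or OpenMP purely to stay self-contained; each helper rehashes a disjoint slice of the old buckets, and a per-destination-bucket mutex serializes the occasional collisions. All names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Bucket = std::list<std::pair<int, int>>;

// Rehash `old` into `nbuckets` new buckets using `nthreads` helper threads
// that are independent of the (blocked) Cilk workers.
std::vector<Bucket> parallel_rehash(const std::vector<Bucket>& old,
                                    std::size_t nbuckets, int nthreads) {
    std::vector<Bucket> fresh(nbuckets);
    std::vector<std::mutex> locks(nbuckets);  // one lock per new bucket

    auto rehash_slice = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            for (const auto& kv : old[i]) {
                std::size_t b = std::hash<int>{}(kv.first) % nbuckets;
                std::lock_guard<std::mutex> g(locks[b]);
                fresh[b].push_back(kv);
            }
    };

    // Partition the old buckets into contiguous slices, one per helper.
    std::vector<std::thread> helpers;
    std::size_t chunk = (old.size() + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(old.size(), begin + chunk);
        if (begin < end) helpers.emplace_back(rehash_slice, begin, end);
    }
    for (auto& h : helpers) h.join();
    return fresh;
}
```

Whether this wins depends on the startup cost of the helper threads versus the size of the table, which is exactly the experiment Barry suggests below.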

To expand a bit on the cooperation between Cilk and TBB: the cooperation is intended to limit the oversubscription to 2P by default, where P is the number of cores; Cilk will create P-1 threads, and TBB will create its own team of P (P-1?) threads. Of course, if you're only using the alternate threading package for your resize operation, that's not a problem with any threading package.

A couple of caveats, off the top of my head:

  • You'll potentially be oversubscribed to the extent that your other threads aren't waiting on access to the collection.
  • You'll need to determine whether the overhead of starting up another set of threads is greater than the performance boost you get for resizing in parallel. Perhaps you'll only want to do the resize in parallel if the collection is "big enough."
  • You may run into memory bandwidth and false sharing issues.

But that's what experiments are for. If you try this, please let us know how it turns out. :o)

- Barry
