Is the task stealer NUMA aware?

Is TBB's task stealing mechanism NUMA aware? That is, for example, assume that there are four sockets (NUMA nodes) in the system, each with a four-core chip, and each socket has its own low-latency local memory. When the task queue of a particular core runs out of tasks, will it first try to steal from the cores on the same socket and only then from remote ones? Are there any other NUMA-related performance issues?

Quoting - Gera Prasun Dineshkumar (Intel)
Is TBB's task stealing mechanism NUMA aware? That is, for example, assume that there are four sockets (NUMA nodes) in the system, each with a four-core chip, and each socket has its own low-latency local memory. When the task queue of a particular core runs out of tasks, will it first try to steal from the cores on the same socket and only then from remote ones? Are there any other NUMA-related performance issues?

No, the TBB scheduler is not NUMA aware; it uses Cilk-style randomized work-stealing. Randomized work-stealing does not care about NUMA, but it does provide some theoretical guarantees on the time and space complexity of a parallel computation.
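For concreteness, a minimal sketch of what randomized victim selection amounts to (an illustration only, not TBB's actual code; pick_victim and its parameters are hypothetical names): a worker whose own deque is empty picks a victim uniformly at random among all other workers, with no notion of sockets or NUMA distance.

#include <cstddef>
#include <random>

// Pick a steal victim uniformly at random from all workers other than 'self'.
std::size_t pick_victim(std::size_t self, std::size_t num_workers,
                        std::mt19937& rng) {
    // Draw from the num_workers-1 candidates, then skip over our own index.
    std::uniform_int_distribution<std::size_t> dist(0, num_workers - 2);
    std::size_t v = dist(rng);
    return v >= self ? v + 1 : v;
}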

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Quoting - Dmitriy Vyukov

No, the TBB scheduler is not NUMA aware; it uses Cilk-style randomized work-stealing. Randomized work-stealing does not care about NUMA, but it does provide some theoretical guarantees on the time and space complexity of a parallel computation.

Hi, thanks for the tip about the Quick Thread Library. I'm reading the white paper and the manual, and will get back to you. However, I still find it unsettling that TBB doesn't take NUMA architectures into account. I suppose that trying to steal from the same NUMA node first will reduce the set of victim threads and might incur a penalty in the form of more unsuccessful steals, but that is largely problem dependent: the occupancy of the queues and the skew between nodes at any point in time vary from problem to problem, and the choice could be left to the user. Also, what kind of theoretical bounds does random stealing provide?

Quoting - Gera Prasun Dineshkumar (Intel)
Hi, thanks for the tip about the Quick Thread Library. I'm reading the white paper and the manual, and will get back to you. However, I still find it unsettling that TBB doesn't take NUMA architectures into account. I suppose that trying to steal from the same NUMA node first will reduce the set of victim threads and might incur a penalty in the form of more unsuccessful steals, but that is largely problem dependent: the occupancy of the queues and the skew between nodes at any point in time vary from problem to problem, and the choice could be left to the user. Also, what kind of theoretical bounds does random stealing provide?

Regarding QuickThread, you can contact Jim Dempsey directly; you can find his email at http://www.quickthreadprogramming.com

Regarding NUMA and random stealing, I think I will not be able to retell all that stuff, so please refer to "Space-Efficient Scheduling of Multithreaded Computations":
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.6193
and other Cilk papers:
http://supertech.csail.mit.edu/papers.html

I agree with you that NUMA awareness is indeed of help for efficient scheduling, and not only NUMA awareness but also HyperThreading awareness, shared-cache awareness, etc. Actually I incline towards the opinion that the only good thing about randomized stealing is that it allows one to make statements like "Each thread is located independently at random, and hence, the random variable Zi has a binomial distribution with P lg P trials and success probability 1/P". IMVHO a more practical scheduler will tend to be locality/NUMA/cache/HT aware, and apply some supervision on top of that to prevent the possibility of very bad executions, so to say. I.e. a randomized scheduler penalizes all cases in the name of theoretical properties, whereas a practical scheduler must preserve locality in the common case and preserve the theoretical properties by some episodic low-overhead activity.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Dmitriy,

it is unfair to say that a randomized scheduler penalizes all cases in the name of theoretical properties. It turns out that when you actually try policies like the one suggested by the original poster, they just don't work. For your information, we (the Cilk research group at MIT) tried locality-aware work stealing in 1997 and it did not make any difference. We tried on what you might consider the mother of all NUMA machines---a cluster of SMPs with shared memory over a network (I forgot which one---could even have been ethernet). Keith Randall has some discussion about this topic in his Ph.D. thesis. So the lack of NUMA-awareness in Cilk and TBB is not due to theoretical considerations, but to the fact that there seems to be no practical way to produce a scheduler that will take advantage of NUMA.

This problem is much more complicated than it appears at first sight. For example, every steal causes a disruption of the memory access pattern, in the worst case requiring a transfer of the whole cache of the ``victim'' processor to the ``thief'' processor. The steal operation itself costs nothing (a few hundred cycles), but the cache disruption costs you tens of thousands of cycles. This is a real phenomenon that you can observe in common algorithms, such as LU decomposition or stencils. From this perspective, you want to minimize the number of steals. Randomized work stealing does that; any attempt to bias the stealing activity will in general increase the number of steals, and thus of cache misses.

To make things even more complicated, there is the issue of what to steal. The Cilk/TBB policy is a pretty good one for a parallel machine with private (i.e., non-shared) caches, because in practice it tends to steal big chunks of work that are independent of each other. (This is not a theoretically necessary property, but in practice this is a common case.) However, the Cilk policy is bad if caches are shared between the thief and the victim, because it causes two big chunks of work to reside in the same cache, which tends to overflow the cache. Nobody knows what a good stealing policy would be on machines with both shared and private caches, i.e., any mainstream multicore processor sold today :-( There are some partial results on this topic: read my paper ``The cache complexity of multithreaded cache oblivious algorithms'' for a partial understanding of the private-cache case, and the referenced work by Blelloch and Gibbons for the case of shared caches. (Hyperthreading falls into the shared-cache category, sharing everything including L1.)

More complicated still, keep in mind that there is no such thing as ``NUMA'' or ``affinity'' once you take into account that your algorithm may be called in parallel multiple times---that is, any practical software must be usable as a subroutine in a larger context. E.g., you may try to place data and schedule things carefully so that the computation happens where the data is, but your careful work goes out of the window once the user of your program spawns something else in parallel with it. You must be careful in not evaluating an algorithm in isolation, without reference to the larger context.

In conclusion, the issue is complicated, but you can be sure that existing designs such as Cilk and TBB are more motivated by what works in practice than by any kind of theoretical consideration. It just happens that the Cilk theory is one of the few theories that work :-)

Interesting... If you don't mind a naive question (without first having read the available materials), what would you say to a heuristic that would first examine possible work from "close" threads in the hardware sense (cache layout), but limit how "far" the work can be in the software sense (task depth difference)? If no work is available from a random victim in the first candidate group within the eligible-depth window, the algorithm would examine another random victim but from both an enlarged candidate group and depth window. This may very well require experimentation and tuning (made more difficult because it would likely depend on what code is being executed), but some applications may warrant such an effort.

(Added) If it has been addressed in a reference, just stating that is OK for me.

Quoting - Raf Schietekat

Interesting... If you don't mind a naive question (without first having read the available materials), what would you say to a heuristic that would first examine possible work from "close" threads in the hardware sense (cache layout), but limit how "far" the work can be in the software sense (task depth difference)? If no work is available from a random victim in the first candidate group within the eligible-depth window, the algorithm would examine another random victim but from both an enlarged candidate group and depth window. This may very well require experimentation and tuning (made more difficult because it would likely depend on what code is being executed), but some applications may warrant such an effort.

(Added) If it has been addressed in a reference, just stating that is OK for me.

Raf,

this is one of the things that Keith Randall tried. He also tried things of the form: steal from ``close'' workers with probability X, and from ``far'' workers with probability (1-X). Tweak X. None of these attempts made much difference. He has a few comments at the end of his thesis, but academia being what it is, you cannot publish things that don't work, so everybody has to repeat the same mistakes all over again :-)

There are two fundamental problems, I think, that any scheme of this kind will incur. The first is that whatever work is in a queue close to you is not necessarily using data that is close to you, so there is no reason why stealing close work would be a good idea (except for reducing the cost of stealing itself). The second is that, even assuming that there is a correlation between the location of the work and the location of the data, this correlation is broken as soon as you perform one remote steal. Assume that worker W manages to steal a remote chunk of work. From that point on, every worker closer to W will prefer stealing from W, and will steal exactly the wrong thing (because W's work was using remote data to begin with).

I suspect that there is a thermodynamic analogy at play here, although I have no proof. If you drop ink in water, the ink diffuses as a brownian motion, which is in the limit mathematically equivalent to a randomized work-stealing from the nearest neighbors in a 3D grid. Over time, the ink diffuses into a maximum-entropy distribution that is indistinguishable from the distribution that you would get by allowing Cilk-style stealing of ``work molecules'' (this would happen with ink molecules of infinite velocity). So the system always converges with the same equilibrium distribution; the only difference is that randomized work stealing converges to the equilibrium faster.

"academia being what it is, you cannot publish things that don't work"
Ah?

"whatever work is in a queue close to you is not necessarily using data that is close to you"
Hmm...

Quoting - Matteo Frigo (Intel)
it is unfair to say that a randomized scheduler penalizes all cases in the name of theoretical properties. It turns out that when you actually try policies like the one suggested by the original poster, they just don't work. For your information, we (the Cilk research group at MIT) tried locality-aware work stealing in 1997 and it did not make any difference. We tried on what you might consider the mother of all NUMA machines---a cluster of SMPs with shared memory over a network (I forgot which one---could even have been ethernet). Keith Randall has some discussion about this topic in his Ph.D. thesis. So the lack of NUMA-awareness in Cilk and TBB is not due to theoretical considerations, but to the fact that there seems to be no practical way to produce a scheduler that will take advantage of NUMA.

Hi Matteo,

Well, I can't say whether there is a practical way to produce a scheduler that will take advantage of NUMA, because I did not create one... ah, and have not even tried to create one.

Quoting - Matteo Frigo (Intel)
This problem is much more complicated than it appears at first sight. For example, every steal causes a disruption of the memory access pattern, in the worst case requiring a transfer of the whole cache of the ``victim'' processor to the ``thief'' processor. The steal operation itself costs nothing (a few hundred cycles), but the cache disruption costs you tens of thousands of cycles. This is a real phenomenon that you can observe in common algorithms, such as LU decomposition or stencils. From this perspective, you want to minimize the number of steals. Randomized work stealing does that; any attempt to bias the stealing activity will in general increase the number of steals, and thus of cache misses.

My key point is: how can a *randomized* scheduler ever ensure that, or anything? It can avoid doing too badly, but it can't do especially well either.

Quoting - Matteo Frigo (Intel)
To make things even more complicated, there is the issue of what to steal. The Cilk/TBB policy is a pretty good one for a parallel machine with private (i.e., non-shared) caches, because in practice it tends to steal big chunks of work that are independent of each other.

In what way? A random scheduler is equally happy to pick small chunks. I would say that it actually tends to steal small chunks of work (I guess there are more of them), but sometimes, "with probability 1/P", it steals big chunks too. If one wants to steal big chunks, then steal big chunks; do not steal at random. What am I missing here?

Quoting - Matteo Frigo (Intel)
More complicated still, keep in mind that there is no such thing as ``NUMA'' or ``affinity'' once you take into account that your algorithm may be called in parallel multiple times---that is, any practical software must be usable as a subroutine in a larger context. E.g., you may try to place data and schedule things carefully so that the computation happens where the data is, but your careful work goes out of the window once the user of your program spawns something else in parallel with it. You must be careful in not evaluating an algorithm in isolation, without reference to the larger context.

Indeed. If I have 2 NUMA nodes and need to sort 2 independent arrays, I would prefer to dedicate the first NUMA node to the first array and the second NUMA node to the second array, and not to blend everything.

Regarding practical ways, what do you think of the following?
For HyperThreading (somewhat applies to shared caches too): always try to steal from the HT sibling in FIFO order first. HT siblings do not have enough resources to work on separate tasks, so it's better for them to help each other with a single task. Steals between HT siblings are naturally cheap, so more frequent steals won't hurt.
Locality-aware stealing: steal in the following order: (1) prefer to steal from the nearest victim with a lot of work, (2) then a victim with a lot of work, (3) then the nearest victim, (4) then everything else. I.e. stealing is based on a distance/amount_of_work metric (one possible ordering is sketched after this list). Locality is good, stealing big chunks is good, right? So why not prefer that?
Work-first vs Help-first. Try to balance between Work-first and Help-first scheduling for the following code (your example):
while (something)
    cilk_spawn foo();
I.e. try to steal even more work in specific situations.
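(Sketch, as promised above: one possible concretisation of a distance/amount_of_work victim ordering. This is not an existing TBB or Cilk policy; the structure, the field names, and the choice to rank by distance first and queued work second are all illustrative, and a real implementation might combine the two into a single score instead.)

#include <algorithm>
#include <cstddef>
#include <vector>

struct VictimInfo {
    std::size_t id;            // worker index
    int         distance;      // e.g. 0 = HT sibling, 1 = same LLC, 2 = same socket, 3 = remote
    std::size_t queued_work;   // approximate number of tasks in the victim's deque
};

// Return candidate victims ordered by preference: nearer victims first,
// and among equally near victims the ones with more queued work first.
std::vector<std::size_t> steal_order(std::vector<VictimInfo> victims) {
    std::sort(victims.begin(), victims.end(),
              [](const VictimInfo& a, const VictimInfo& b) {
                  if (a.distance != b.distance) return a.distance < b.distance;
                  return a.queued_work > b.queued_work;
              });
    std::vector<std::size_t> order;
    order.reserve(victims.size());
    for (const VictimInfo& v : victims) order.push_back(v.id);
    return order;
}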

Actually "smart" scheduler may introduce some randomness too to not behave too bad in unexpected situations. For example, it can do "non-random" scheduling on even steals and do "random" scheduling on odd steals; that way probability of something good/bad is just 1/2P instead of 1/P for pure random scheduler; but who cares, it's just as we have 2 times more processors.

p.s. I still remember about the following conversation:
http://software.intel.com/en-us/forums/showthread.php?t=69681
I just do not have much time right now... and it's quite difficult to reverse-engineer assembly (I've intercepted the Cilk++ "preprocessor" intermediate output file, which helps somewhat).

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Quoting - Matteo Frigo (Intel)

this is one of the things that Keith Randall tried. He also tried things of the form: steal from ``close'' workers with probability X, and from ``far'' workers with probability (1-X). Tweak X. None of these attempts made much difference. He has a few comments at the end of his thesis, but academia being what it is, you cannot publish things that don't work, so everybody has to repeat the same mistakes all over again :-)

It's not surprising that it did not make much difference if he tested on a Fibonacci program :)

Quoting - Matteo Frigo (Intel)

The first is that whatever work is in a queue close to you is not necessarily using data that is close to you, so there is no reason why stealing close work would be a good idea (except for reducing the cost of stealing itself).

Well, there are a thousand and one ways a user may compromise performance; the runtime scheduler will never win that game.
However, the runtime scheduler may provide good performance for good programs. Don't you require busy leaves, etc.? For matrix multiplication, merge sort and all that divide-and-conquer stuff, a close queue means close data. I think it's worth providing better performance for at least those programs.

Quoting - Matteo Frigo (Intel)

The second is that, even assuming that there is a correlation between the location of the work and the location of the data, this correlation is broken as soon as you perform one remote steal. Assume that worker W manages to steal a remote chunk of work. From that point on, every worker closer to W will prefer stealing from W, and will steal exactly the wrong thing (because W's work was using remote data to begin with).

After a remote steal, the stolen remote data becomes local data. So every worker close to W will do the right thing rather than the wrong one. What else can it do? It can't steal anything local (there is nothing local, since W did a remote steal). It can only steal even more remote work and cause even more misses; is that the right thing?
Since W has already stolen the remote piece of work X, W will have to fetch all of X sooner or later. So if workers close to W help with that, they do not increase the number of misses in any way (they do increase them if they do something else).

Quoting - Matteo Frigo (Intel)

I suspect that there is a thermodynamic analogy at play here, although I have no proof. If you drop ink in water, the ink diffuses as a brownian motion, which is in the limit mathematically equivalent to a randomized work-stealing from the nearest neighbors in a 3D grid. Over time, the ink diffuses into a maximum-entropy distribution that is indistinguishable from the distribution that you would get by allowing Cilk-style stealing of ``work molecules'' (this would happen with ink molecules of infinite velocity). So the system always converges with the same equilibrium distribution; the only difference is that randomized work stealing converges to the equilibrium faster.

Well, ok, if we have the following distribution of work:

02111111111120

What is better:
1. Fill the first '0' by splitting the '2' in the second position, and fill the last '0' by splitting the '2' in the last-but-one position.
2. Or do it vice versa?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Quoting - Matteo Frigo (Intel)

Raf,

this is one of the things that Keith Randall tried. He also tried things of the form: steal from ``close'' workers with probability X, and from ``far'' workers with probability (1-X). Tweak X. None of these attempts made much difference. He has a few comments at the end of his thesis, but academia being what it is, you cannot publish things that don't work, so everybody has to repeat the same mistakes all over again :-)

Hi,
I haven't read the thesis in detail, but yes, I read the last part where he talks about taking the ratio of the latencies of local and remote access into account for biasing the stealing, which is exactly what I meant. So when you say that the actual stealing takes only a few cycles, if the latency is included in the stealing cost, doesn't the disparity between local stealing and remote stealing manifest quite strongly in the case of NUMA?

There are two fundamental problems, I think, that any scheme of this kind will incur. The first is that whatever work is in a queue close to you is not necessarily using data that is close to you, so there is no reason why stealing close work would be a good idea (except for reducing the cost of stealing itself).

If we consider comparing the probability of successfully stealing related ('data that is close to you') work within a node to the probability of successfully stealing related work from any core at random, this was the crude, oversimplified calculation that I could come up with:

Assumptions:
p cores per node
n such nodes
pn cores in all
For a given thief at any point in time, pn-1 victims.
m out of them have related data. (Assume m>p for this case)

For a random steal, probability of success = m/(pn-1)

For a local steal, probability of success = (1/(p-1))·(prob. that one core in the node has related data) + (2/(p-1))·(prob. that 2 cores in the node have related data) + ... + 1·(prob. that p cores in the node have related data)

and prob. that k cores out of p have related data = C(p,k) · (m/(pn-1))^k · (1 - m/(pn-1))^(p-k)

So we can calculate the probability that a steal gets related data for both the randomized and the biased case.
Similarly, we can calculate the probability of a steal being successful for both cases.

So the total time for a steal = (t_successful_steal)·(prob. of success) + (t_unsuccessful)·(prob. of failure),
where t_successful = t_related·(prob. related) + t_unrelated·(prob. unrelated),

where we capture all sorts of benefits and penalties in the t terms.

So my question is: wouldn't there be cases where, for certain values of the variables in the equations above, t_biased would outperform t_random? As I was trying to say earlier, isn't it significantly dependent on the problem and/or the hardware/topology?

The second is that, even assuming that there is a correlation between the location of the work and the location of the data, this correlation is broken as soon as you perform one remote steal. Assume that worker W manages to steal a remote chunk of work. From that point on, every worker closer to W will prefer stealing from W, and will steal exactly the wrong thing (because W's work was using remote data to begin with).

Can't the same argument about probabilities be applied to this too? I mean, if we look at the probabilities, I feel it's again about trying out different things; for some of them one approach would work, and for some the other would.

Perhaps a bit more articulate this time:

"If you drop ink in water, the ink diffuses as a brownian motion, which is in the limit mathematically equivalent to a randomized work-stealing from the nearest neighbors in a 3D grid."
It's not really ink so much as grains. And if you process grains faster in a localised setting (hyperthreaded core, cores sharing a cache) perhaps they locally disappear more quickly, counteracting the spread.

"So the system always converges with the same equilibrium distribution; the only difference is that randomized work stealing converges to the equilibrium faster."
But how do you know that that is not a negative thing...

I really should read some of that study material, though. Meanwhile, perhaps somebody who has actually tried to implement and exploit a NUMA-aware multithreading toolkit might decide to chime in?

Gera,

A distinction has to be made here between a numerical probability derived from the assumption of random selection within a subset and the deterministic selection within a scheduler. Depending on the circumstances (tasks in queue), the numerical probability may be significantly less than 1, whereas the deterministic selection within a scheduler may always yield 1.

When a programmer is given the means to provide an indication of locality preference for task(s) (don't care, preferred set, required set) as well as control over task teams, they can then pursue a coordinated effort at producing a solution.

The trick in doing this is in providing a low overhead means of performing the core and NUMA node selections. When I designed this capability into the QuickThread scheduler special attention was given to making the overhead as low as possible. Most of the overhead consists of a table lookup and write of a bitmask. Some table lookups take more time than others.

In the case where no preference is specified (e.g. parallel_for), the team selection is similar to TBB/Cilk: all threads; however, QT has a bias towards the current thread (the thread issuing the parallel_for). When you schedule under "hot in cache" situations, you would specify a team based upon the threads sharing your thread's specified cache level (L1, L2, L3); this is a table lookup for a bitmask. For "not in cache" you might want to direct the scheduler to choose the L3 with the most idle threads. This incurs a little more overhead, but with the hope of payback in higher cache hits. Or you may have occasion for each object to contain a preferred or required thread selection. Or certain tasks may be restricted to particular cores (e.g. a task driving a GPU).

If you or Matteo have an example program that _should_ exhibit improvement with cache and NUMA aware scheduling (but has not due to lack of threading tool) I would appreciate you contacting me. A good sample program would be nice. But good sample programs are hard to come by. The sample program has to do real work.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove
When a programmer is given the means to provide an indication of locality preference for task(s) (don't care, preferred set, required set) as well as control over task teams, they can then pursue a coordinated effort at producing a solution.

Jim, what do you think about the possibility of locality/NUMA-aware scheduling *without* any hints from the programmer? It will definitely be less efficient than programmer-driven scheduling (as in QuickThread), but is there something a runtime scheduler can take advantage of automatically? Or is blind random scheduling the best practical solution?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Quoting - Dmitriy Vyukov

Jim, what do you think about the possibility of locality/NUMA-aware scheduling *without* any hints from the programmer? It will definitely be less efficient than programmer-driven scheduling (as in QuickThread), but is there something a runtime scheduler can take advantage of automatically? Or is blind random scheduling the best practical solution?

When operating *without* any hints from the programmer, such as with TBB, Cilk, or OpenMP, the designers of those schedulers can either take a "don't care" approach .OR. take an approach that has some potential benefit on a NUMA platform. I haven't examined closely what TBB does, but I think the effect will be something along the lines of: when affinity pinning is in effect, parallel constructs tend to favor closer threads first.

While this technique works well for the innermost loops, the outermost loops (when you have multiple levels of nesting and multiple sockets/nodes) may tend to interfere with your cache utilization. In some cases you will want to scatter the outer loops amongst sockets/nodes but schedule the innermost loops within the socket/node. When you have no means as a programmer of expressing your preference (TBB/Cilk), the task scheduler has only one selection criterion (don't care, or nearest/farthest). The programmer, though, will have a priori knowledge of the data access patterns and should be able to supply a reasonable hint as to the most effective task scheduling strategy, provided they have a means of specifying this hint (e.g. QuickThread).

Although the following is not definitive:

http://www.quickthreadprogramming.com/Comparative%20analysis%20between%2...

page 25, "Throughput of pipeline"

You can see an example of the QuickThread parallel_pipeline run on a Dell R710 dual-socket Xeon 5570 NUMA-class system with a RAID10 disk controller. (Unfortunately I cannot paste the chart in here.)

The QuickThread parallel_pipeline is NUMA aware. The default settings are to allocate 4 buffers per hardware thread, distributed across the available NUMA nodes. 2 additional buffers are allocated as a provision for 2 additional I/O threads. This system has 16 HW threads (2 x 4-core with HT). This QT pipeline therefore runs with 18 threads. The 2 I/O threads (one for the input end and one for the output end of the pipeline) are not affinity pinned; the 16 compute threads are affinity pinned. When the input end (pipe) of the pipeline receives a buffer, it may come from any NUMA node (2 on this system). When the read is done, the buffer is passed from the I/O-class thread to a compute-class thread with a preference to be run on the NUMA node from which it was allocated.

The test program is relatively simple in that it performs a sequential read of a file, up-casing words, while writing results out to an output file. The file was about 659MB in size. The particular setup scheduled adjacent HW threads (e.g. core, core's HT sibling, next core,...) although it would have been relatively easy to schedule core, core, core then back fill with HT siblings.

The scaling was linear up to the point where the compute threads and I/O threads were competing for HW thread resources (a little dip from linear scaling). The slope was ~linear but not 1:1, as there appears to be a fixed overhead per I/O request to the operating system (Windows Server 2008 x64).

When all 16 compute class threads were running the I/O bandwidth was about 1.65GB/s meaning the computation throughput was also ~1.65GB/s (fewer writes than reads but each write within same 8-byte word). A total application memory throughput of about 3.3GB/s.

Unfortunately I did not have access to a 4 socket NUMA system to push the test further.

This sample program, although using NUMA capabilities, is not what I would consider a suitable application to demonstrate the benefit of the NUMA capabilities of QuickThread. Sample programs would be welcome, as well as time on larger NUMA-class systems.

Jim Dempsey

www.quickthreadprogramming.com

The standard program that benefits from NUMA awareness/affinity is the naive three point stencil:

for (t = 0; t < T; ++t)
  cilk_for (i = 1; i < N-1; ++i)
     x[(t+1)%2][i] = .5 * (x[t%2][i+1] + x[t%2][i-1]);

(This is a pretty stupid solver of the 1D Poisson equation. The particular form of the right-hand side does not matter much for this discussion, as long as it depends on indices i-1, i, and i+1.)

A NUMA-aware scheduler that always does iterations 0..N/P on core 0, N/P..2N/P on core 1, etc., minimizes the shuffling of data. With Cilk and TBB, data moves around all the time.
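For comparison, here is a sketch of that kind of static schedule written with plain C++ threads (an illustration only, not a real NUMA-aware scheduler; stencil_static and its parameters are made-up names). Thread k always processes the same contiguous chunk at every time step, so that under a first-touch allocation policy its slice of x stays on its node; the platform-specific thread pinning and the initial first-touch of x by the owning threads are omitted.

#include <barrier>
#include <thread>
#include <vector>

void stencil_static(double* x[2], int N, int T, int P) {
    std::barrier sync(P);                          // one rendezvous per time step
    std::vector<std::thread> workers;
    for (int k = 0; k < P; ++k) {
        workers.emplace_back([&, k] {
            // Fixed chunk of the interior [1, N-1), owned by thread k for all time steps.
            int lo = 1 + (int)((long long)(N - 2) * k / P);
            int hi = 1 + (int)((long long)(N - 2) * (k + 1) / P);
            for (int t = 0; t < T; ++t) {
                for (int i = lo; i < hi; ++i)
                    x[(t + 1) % 2][i] = .5 * (x[t % 2][i + 1] + x[t % 2][i - 1]);
                sync.arrive_and_wait();            // finish step t everywhere before step t+1
            }
        });
    }
    for (std::thread& w : workers) w.join();
}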

You could argue that you need NUMA-awareness in this case, but the problem is that the naive nested loop that I showed above is a bad algorithm even in the sequential case, because it incurs a zillion cache misses. My paper describes a much better algorithm that works better sequentially, scales nicely with Cilk, and does not benefit at all from NUMA awareness. It is also more complicated than the simple nested loop shown above. So the tradeoff is between the complexity of a better algorithm, and the complexity of NUMA awareness (which still does not beat the better algorithm). I am not sure what the right answer is; presumably different people will prefer different design points.

Quoting - Dmitriy Vyukov

It's not surprising that it did not make much difference if he tested on a Fibonacci program :)

Dmitry,

don't be silly. We use Fibonacci as an example because it fits on a slide.

Of course we also did the usual matrix multiplication, LU, stencil, FFT, the usual boring stuff. We also tested Cilk with Cilkchess, at the time one of the best chess programs in the world (3rd place in 1994 International Computer Chess championship on 512 cores, 2nd place in 1995 on 1824 cores, 1st Place in the 1996 Dutch Open Computer Chess Championship on 12 cores, 3rd place in 1999 World Computer Chess Championships on 256 cores). We like having fun :-)

Quoting - Dmitriy Vyukov

After a remote steal, the stolen remote data becomes local data.

No. We are talking about NUMA here, not caching (although the two problems are related, of course).

After you steal, data that is in remote DRAM is still in *remote* DRAM, and *cached* in local cache. Once evicted from the cache, the data must be fetched again from *remote* DRAM.

And this is precisely my point. If your algorithm uses the cache well, it does not matter whether the first cache miss is local or remote. If your algorithm does not use the cache well, you are hosed anyway.

Quoting - Matteo Frigo (Intel)

Dmitry,

don't be silly. We use Fibonacci as an example because it fits on a slide.

Of course we also did the usual matrix multiplication, LU, stencil, FFT, the usual boring stuff. We also tested Cilk with Cilkchess, at the time one of the best chess programs in the world (3rd place in 1994 International Computer Chess championship on 512 cores, 2nd place in 1995 on 1824 cores, 1st Place in the 1996 Dutch Open Computer Chess Championship on 12 cores, 3rd place in 1999 World Computer Chess Championships on 256 cores). We like having fun :-)

Fibonacci is really a bad parallelization example. Good serial code is 15,000,000 times faster than that stupid recursive example (on a 4-core Q6600).

// Serial non-recursive method to calculate Fibonacci series
// *** Note, several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n )
{
	if( n < 2 )
		return n;
	long fib0 = 1;	// most recent
	long fib1 = 1;	// 1 prior to most recent
	long fib2 = 0;	// 2 prior to most recent
	long i;
	for(i=2; i < n; ++i)
	{
		fib2 = fib1;
		fib1 = fib0;
		fib0 = fib1 + fib2;
	}
	return fib0;
}

Jim Dempsey

www.quickthreadprogramming.com

The claim

// *** Note, several orders of magnitude faster than all   
// *** recursive techniques.

is not true. Recursion is part of a good algorithm for Fib, and would be a good study of parallelism. It's just that a good algorithm for Fib will not fit on a slide, because it involves parallel FFTs.

Here's what a good algorithm for Fib might look like. Use repeated squaring of a 2x2 matrix (as noted in a footnote in the TBB tutorial). The matrix is [[1, 1], [1, 0]]. It only requires O(lg(n)) squarings and some matrix additions to compute Fib(n). The catch is that Fib(n) grows exponentially, so large multiprecision arithmetic is required to represent the result for large n. Fast algorithms for large multiprecision multiplication use FFTs. FFTs are best tackled by cache-oblivious algorithms, which are recursive and can be parallelized efficiently.

Asymptotic analysis: The precision [number of digits] of the values grows as O(lg(phi^n)), which is O(n) digits per numeral. Each FFT on an O(n)-digit numeral takes time O(n lg n). For O(lg n) squarings, that's time O(n lg^2 n). The non-recursive algorithm using summation requires O(n) additions, each requiring up to time O(n). That's time O(n^2). Hence the simple iterative algorithm is not faster than all recursive techniques.
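For reference, here is a minimal sketch of the repeated-squaring idea in plain C++ (an illustration, not code from the TBB tutorial). It uses fixed-precision 64-bit integers, so it is only valid up to fib(93); the large-n case discussed above additionally needs multiprecision arithmetic and FFT-based multiplication, which is where the parallelism lives.

#include <cstdint>
#include <cstdio>

struct M2 { uint64_t a, b, c, d; };              // 2x2 matrix [[a, b], [c, d]]

static M2 mul(M2 x, M2 y) {
    return { x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
             x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d };
}

// fib(n) via O(lg n) squarings of q = [[1,1],[1,0]]; q^n = [[F(n+1),F(n)],[F(n),F(n-1)]].
uint64_t fib(uint64_t n) {
    M2 result = { 1, 0, 0, 1 };                  // identity
    M2 q      = { 1, 1, 1, 0 };
    for (; n != 0; n >>= 1) {
        if (n & 1) result = mul(result, q);
        q = mul(q, q);
    }
    return result.b;                             // F(n)
}

int main() { std::printf("%llu\n", (unsigned long long)fib(90)); }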

It would be fun to have a contest to see who could write the fastest program for computing fib(10^9) :-)

So then you change the definition of the problem? The ParallelFib example returns a long, which implicitly restricts the precision and therefore places an upper limit on n. Changing the type to __int64 would be within the scope of an equivalent exercise, since you are still returning an integral result within the precision of the processor. Changing the rules (requirements) to compute using multiprecision brings an entirely different set of programming requirements. Why stop at 10^9?

The point of the ParallelFib example was to have a simple example of recursion. This it does. However, my comments are directed at teaching the reader that simple recursion techniques, while easy to program, can at times introduce excessive and unnecessary overhead. The reader should explore alternate means. That said, when computing multiprecision fib on the order of 10^9, given the problems encountered, my comments direct the reader to explore alternate means, in this case an FFT, which may or may not be the most efficient method (I am not a mathematician).

Jim Dempsey

www.quickthreadprogramming.com

Quoting - Arch Robison (Intel)

It would be fun to have a contest to see who could write the fastest program for computing fib(10^9) :-)

Computing fib(10^9) is not that hard. For example, PARI/GP (my favorite multiprecision calculator) has a builtin fibonacci function that computes the answer in a few seconds. (Pari uses essentially the algorithm that you described for computing Lucas numbers, from which Fibonacci numbers can easily be derived.)

The hard part is printing the answer---in fact, I am not even sure how to do it in parallel. PARI/GP has not printed the full answer after one hour of churning. A partial answer is

7.9523178745546834678293851961971481892555421852346 E208987639

with an error of approximately 1.516 E208987590.

Hmm, I find this hard to believe. In a few seconds, this program finds a result that takes roughly 60 MB to represent correctly, and then it takes more than an hour to convert it to decimal? More likely the program sheds precision during the calculation; but that's just my first impression, as I don't know much about the math involved. Anyway, does this still relate to the topic of discussion?

Quoting - Matteo Frigo (Intel)

No. We are talking about NUMA here, not caching (although the two problems are related, of course).

After you steal, data that is in remote DRAM is still in *remote* DRAM, and *cached* in local cache. Once evicted from the cache, the data must be fetched again from *remote* DRAM.

And this is precisely my point. If your algorithm uses the cache well, it does not matter whether the first cache miss is local or remote. If your algorithm does not use the cache well, you are hosed anyway.

This post cleared up a lot of confusion. I was under the impression that once a thread steals a task from another thread's queue, the end result would be that the task would be moved from the victim's queue (and hence its DRAM) and added to its own queue (which I wrongly assumed to mean its own DRAM). At an intermediate stage, when the local (thief's) cache needs to be evicted, if the task is still physically written back to remote memory, whereas logically it belongs to the thief's task pool, then what you say makes sense. So, on a related note, for NUMA-based systems, does it make sense to steal a task physically as well, and not just logically?

Quoting - Gera Prasun Dineshkumar (Intel)

This post cleared up a lot of confusion. I was under the impression that once a thread steals a task from another thread's queue, the end result would be that the task would be moved from the victim's queue (and hence its DRAM) and added to its own queue (which I wrongly assumed to mean its own DRAM). At an intermediate stage, when the local (thief's) cache needs to be evicted, if the task is still physically written back to remote memory, whereas logically it belongs to the thief's task pool, then what you say makes sense. So, on a related note, for NUMA-based systems, does it make sense to steal a task physically as well, and not just logically?

Gera,

your suggestion is likely to be impractical. In practice, the task is a small data structure that contains pointers to where the real data is. For example, a task that copies an array into another does not contain the source array, but only a pointer to the relevant section of the source array. Systems like Cilk and TBB transfer ownership of the task to the thief worker. In TBB, you could in principle copy the task itself, but you wouldn't gain anything from doing so.

Many years ago we actually had an implementation of Cilk for distributed memory machines that was implementing your suggestion: tasks had to be self-contained structures that were copied over a network. You can do it and it works, but the system is very hard to program once you lose the ability of passing pointers.

>>Many years ago we actually had an implementation of Cilk for distributed memory machines that was implementing your suggestion: tasks had to be self-contained structures that were copied over a network. You can do it and it works, but the system is very hard to program once you lose the ability of passing pointers.

Many years ago (~1985) I wrote a distributed operating system where an entire application and its data could "slurp" from node to node for load balancing. It had some advantage when the virtual memory system on a node entered a high paging state, i.e. when paging got high, an application and its data could migrate to an underutilized node.

I wasn't aware that Cilk used to do this. Nowadays MPI is a bit of a hybrid of this technique: multiple instances of the code are loaded on each node, then just (portions of) the data is migrated. The ratio of code to data would suggest which method is better.

I think there may be some merit in exploring passing a GUID of the code, followed by (portions of) the data, thus giving time for the recipient to indicate whether the code transfer can be omitted because it already has a copy in its cache, i.e. treat it like a cache hit. This could be extended to portions of the data. Some data tends to remain unchanged for long periods. If the GUID also contains a timestamp, you could pass the code GUID, low-flux data GUID(s) and timestamp(s), and the high-flux data, then test for the requirement to transmit the code and low-flux data. This technique should work well where the application makes extensive use of JIT compilation (i.e. code is not known until run time).

Jim Dempsey

www.quickthreadprogramming.com

Quoting - Matteo Frigo (Intel)

Hi Matteo,

First of all thank you and all other participants, the discussion is really interesting.

I see your point regarding cache-oblivious/conscious algorithms. I have to agree that a good algorithm tries to exploit not only NUMA but also the caches/register file, and if it does so then the importance of NUMA-awareness is significantly reduced.
I really need to drum cache obliviousness into my head. I guess that's the reason why you guys kick a$$es in Threading Contests. I remember how Bradley Kuszmaul killed my silly low-level optimizations dead with his recursive 3D decomposition in the Line Segment Intersection problem.
Recursive trapezoid decomposition is awesome! I had actually read about cache obliviousness before, watched MIT OpenCourseWare 6.046J, and implemented some search trees in a cache-oblivious way. But when it comes to practice it's so non-obvious why/when/how I should use some tricky recursive decomposition instead of simple nested loops. OK, I learn from my mistakes; let's return to scheduling :)

So, do you think that locality-awareness is harmful, or that it's just not worth doing (because of the negligible returns)?
I suggest considering here not just NUMA but locality in a more general sense (including caches). Or even a random scheduler vs a deliberate-decision-driven scheduler. By deliberate-decision driven I mean that the scheduler may deliberately choose, for example, the biggest piece of work to steal (under some assumptions of course, for example 'busy leaves'). Or it may prefer LIFO steals except for steals between HT-sibling threads, where it will do FIFO steals, etc.
Do you think that all that will not have any visible returns?
Note that nowadays people try to use parallel processing for all the various problems out there, for example... agent-based simulation, server-side processing, etc. Such problems frequently have low task granularity and/or are not structured as well-balanced trees and/or do not have busy leaves and/or do not expose enough parallel slack. For such problems, if you can cut down overheads at least somewhat (make the steal itself a bit cheaper by doing it from a closer thread, or eliminate at least some cache misses), well, I still think it can have visible performance implications.

> No. We are talking about NUMA here, not caching (although the two problems are related, of course).
> After you steal, data that is in remote DRAM is still in *remote* DRAM.

Not always; data may have 'indeterminate' placement. If a thread steals from a remote NUMA node, the stolen data may indeed become local. Suppose we have a 'first-touch' NUMA allocation policy (which is used in the latest Linux and Windows) (btw, what NUMA allocation policy did you use when you did all that research - allocation/striped/first-touch?). Consider an output array which is not initialized with zeroes initially: if a thread steals a part of that output array from whomever, it will "first touch" it, and so it will become local to the thread. Consider an input array which is just memory-mapped from an input file: if a thread steals a part of that input array from wherever, it will "first-touch" it, and so it will become local to the thread.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Can I throw in a question on why it's called stealing when in reality a thread neatly subdivides the work for another thread to come in and assist? The real question behind that is what the findings are on such fixed subdivision overhead vs. an approach where a separate task is created only as a handle, and the division occurs during execution based on direct interaction. So in a parallel_for(1000) the first thread would create a handle task and then go to work on, say, 10 units at a time, each time obtained by locking its range and taking off 10 units to leave 990, 980 etc. units in the range, or doing the same with atomics, on the assumption that there is some but not that much overhead if locks and atomics are mainly used locally. When another thread comes to assist, it finds the handle (and creates another one), and uses it to divide the remaining work in the range, e.g., 640, so that both threads can do their thing independently on 320 units and without creating any more tasks until the end game when one task still has 40 units and the other has completed its work and can take on 20 from the first task. Parallel slack would be created on demand, not proactively. That stealing doesn't work like this was a big surprise for me, and maybe somebody has already answered this question so I don't have to investigate for myself.

(Added) Obviously it's tougher against auto_partitioner, but that one is vulnerable to inequalities in processing time across the range.
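To make the question concrete, here is a rough sketch of the claiming half of that scheme (an illustration only, not how TBB's partitioners actually work; claim_chunks, the fixed grain size, and the shared counter are made-up names). Every participating thread runs the same loop against the same atomic counter, so a late-arriving "thief" simply starts claiming chunks from wherever the counter currently stands, without any tasks being created up front; the part where a second thread halves the remaining range of a running worker is not shown.

#include <algorithm>
#include <atomic>
#include <cstddef>

template <typename Body>
void claim_chunks(std::atomic<std::size_t>& next, std::size_t n,
                  std::size_t grain, const Body& body) {
    for (;;) {
        std::size_t begin = next.fetch_add(grain);   // claim 'grain' iterations
        if (begin >= n) break;                       // nothing left to claim
        std::size_t end = std::min(begin + grain, n);
        for (std::size_t i = begin; i != end; ++i)
            body(i);
    }
}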

Quoting - Dmitriy Vyukov

> But when it comes to practice it's so non-obvious why/when/how I should use some tricky recursive
> decomposition instead of simple nested loops.

You are correct in saying that this is a tricky issue. I think it helps to separate the problem into two parts. First, you need to ask how well a given code is using the cache. For this you need to have some idea of the optimal number of cache misses. Second, you may want to ask whether the same result can be obtained by cache-oblivious methods. For the first part, it helps to read Jeff Vitter's papers, in particular his book
"Algorithms and Data Structures for External Memory", which is available online. He talks about DRAM vs disk instead of cache vs DRAM, but the problems are the same. This should at least give you some idea of what is possible and of which tricks have been invented already.

> So, do you think that locality-awareness is harmful, or that it's just not worth doing (because of the negligible
> returns)?

No. I do however think that using the cache properly gives you the most bang for the buck on current technology, and that thinking about NUMA issues comes only after having exhausted the opportunities for caching. Another way to say it is that misusing the cache loses a factor of 100 in performance (try the triply nested loop matrix multiply on large matrices), whereas misusing NUMA loses much less than that (2x? 10x?).

> [..] Do you think that all that will not have any visible returns?

I think that at a large enough scale, locality at all levels becomes fundamentally important, whereas at a small scale you can ignore NUMA if you can use the cache(s). The kind of things that you are suggesting make software design look like hardware design, because you have to worry about how to place things in a 2D (or 3D) physical space. I believe that ultimately you are correct, and there will be visible returns in worrying about NUMA etc. However, there is also a high cost to pay in terms of rigidity of your software: once you have scheduled things manually, your software becomes very hard to change---you really have a piece of hardware that happens to be written in C++. My stencil paper, which you liked so much, was an only partially successful attempt at understanding for how long can we get away with ignoring the physics of computation.

By the way, I found the paper ``Horizons of Parallel Computation'' by Bilardi and Preparata inspirational in thinking about locality. This paper is underappreciated, but I think that they are fundamentally right, and you already see some of their predictions becoming true.

Quoting - Dmitriy Vyukov

Not always; data may have 'indeterminate' placement. If a thread steals from a remote NUMA node, the stolen data may indeed become local. Suppose we have a 'first-touch' NUMA allocation policy (which is used in the latest Linux and Windows) (btw, what NUMA allocation policy did you use when you did all that research - allocation/striped/first-touch?). Consider an output array which is not initialized with zeroes initially: if a thread steals a part of that output array from whomever, it will "first touch" it, and so it will become local to the thread. Consider an input array which is just memory-mapped from an input file: if a thread steals a part of that input array from wherever, it will "first-touch" it, and so it will become local to the thread.

Yes, but first-touch works only once. Once you have allocated the data you are back to the same problem once you try to steal again.

The SGI Origin 2000 had many processors (I used the 256-processor version) in a room-size cabinet, with hardware shared memory and first-touch allocation. In our Cilk chess program we wanted to initialize a huge (16GB, huge for the time) hash table which was meant to be distributed across the machine, but the first-touch policy was doing exactly the wrong thing, guaranteeing that the hash table was concentrated in a few nodes rather than spread across the whole machine. To fix this, we had to parallelize the initialization. Moral: be careful what you ask for.
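A sketch of that fix (an illustration under the stated first-touch assumption; first_touch_init and its parameters are made-up names): each worker zeroes the slice of the table it will later use, so the pages are first touched, and therefore placed, on that worker's node instead of wherever a single initializing thread happened to run. Pinning the workers to nodes is platform-specific and omitted.

#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

void first_touch_init(char* table, std::size_t bytes, unsigned num_workers) {
    std::vector<std::thread> workers;
    for (unsigned k = 0; k < num_workers; ++k) {
        workers.emplace_back([=] {
            std::size_t lo = bytes * k / num_workers;
            std::size_t hi = bytes * (k + 1) / num_workers;
            std::memset(table + lo, 0, hi - lo);   // this worker "first touches" its slice
        });
    }
    for (std::thread& w : workers) w.join();
}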

The SGI folks of course realized this problem, and at some point they had some automatic migration policy that was moving pages closer to the place of use. AFAIK, this never really worked well (and it never really worked with my programs, so I always disabled it).

Raf,

Atomic operations are expensive (e.g. InterlockedAdd). If you were to peel off individual elements of an iteration space, say n items, then you would incur n atomic operations. The number of atomic operations can be reduced by, say, dividing the iteration space up by the number of threads (4, 8, 16, ...). However, by doing so you assume all threads are available, which is generally true for an unloaded system and a non-tasking/non-nested-level program environment. The actual runtime environment is often quite different from the test environment.

In a non-trivial programming situation you are not assured of the availability of hardware threads at the start of the loop, nor of additional threads becoming available during the execution of the loop; so, while using atomic operations on each iteration of the loop would lower the skew latency between threads on completion of the loop, this comes at the expense of using interlocked operations for each iteration of the loop (talk about a run-on sentence). This means the overall run time for the loop becomes larger as attempts are made to reduce the thread loop-completion skew time. Taking larger chunks out of an iteration space will reduce the number of interlocked operations, meaning as chunk size increases interlocked overhead reduces, but at the potential expense of increased skew latency between threads on the completion of the loop. As to whether the loop-completion skew latency is important or not, this depends on the other pending work in the application, i.e. when threads have other work to do, the skew latency of loop completion is not material to the overall throughput of the application (and the reduction of overhead induced by excessive interlocked operations IS important to the overall throughput of the application).

A balance needs to be struck between the overhead of the interlocked operations for partitioning the iteration space using chunking or some sort of smart iterator (smart iterator == higher overhead), and the probability of HW threads being available at the start of the loop, at some time during the loop, or not at all. In the first case (availability of threads at the start of the loop) an auto-partitioner may be of use (divide the range up based on the availability of threads or the ability of MRU/LIFO scheduling). In the second case you may want provisioning to repartition portions of the iteration space, or to use a chunking(-like) means to have the initial partitioning exceed the number of HW threads. In the last case (high probability of additional HW threads not being available at all) a serial for loop would be best. Also, you can have partitioners that work ex post facto (repartition, after the initial partitioning, during execution of the loop).

Each partitioning technique comes with different overhead(s) and each application is different; there is no single rule to apply to all situations. Note that testing components of an application separately, i.e. maximizing the performance of a component in isolation from other components, can at times yield a lower-performing application when all the components are used together under actual run-time conditions.

When I wrote the QuickThread task scheduler I made provisions for the programmer to address these issues in a straightforward manner.

// divide up by number of HW threads
parallel_for(fnFoo, 0, nObjects, vObjects);

// divide up by current thread + number of waiting HW threads
parallel_for(AllWaitingThreads$, fnFoo, 0, nObjects, vObjects);

// divide up by number of chunks
// (chunks taken with interlocked operations)
parallel_for(chunk, fnFoo, 0, nObjects, vObjects);

// divide up by current thread + number of waiting HW threads
// then as execution progresses, threads monitor for additional
// waiting threads. When thread discovers waiting thread it
// repartitions its remaining iteration space (ex-post-facto).
parallel_for_each(AllWaitingThreads$, 0, fnFoo, nObjects, vObjects);

And several other specialized partitioning techniques.

As to which is best... this depends on the circumstances.
In many circumstances, or on low core-count systems, the default behavior of TBB's parallel_for partitioner or OpenMP partitioning/scheduling may be all that is required to attain acceptable performance, i.e. the additional capability of QuickThread might not be warranted. However, as the parallelization complexity of the application increases, and/or larger platforms are used, the additional task scheduling capabilities might be warranted.

Jim Dempsey

www.quickthreadprogramming.com

"Atomic operations are expensive (e.g. InterlockedAdd)"
But is that also true if the previous user was typically the same core, I wonder?

"If you were to peel off individual elements of an iteration space, say n items, then you would encounter n atomic operations."
Grainsize would still apply, I used 10 as an example.

"The number of atomic operations can be reduced by say dividing the interation space up by the number of threads (4, 8, 16, ...). However, by doing so you assume all threads are available, which is generally true for an unloaded system and a non-tasking/non-nested level program environment."
I make no such assumption. The only thing that's different is that the "victim" does not carefully set up something to do for the "thief" (which I thought was rather curious when I found out about it), but that the thief actually comes in and grabs part of a range from a running task, say, half the remaining range.

"In many circumstances, or on low core count systems, the default behavior of TTB's parallel_for partitioner or OpenMP partitioning/schedule-ing maybe all that is required to attain acceptible performance. i.e the additional capability of QuickThread might not be warranted."
I'm mainly interested in easy gains, with a low learning curve.

>>I make no such assumption. The only thing that's different is that the "victim" does not carefully set up something to do for the "thief" (which I thought was rather curious when I found out about it), but that the thief actually comes in and grabs part of a range from a running task, say, half the remaining range.

The victim has to perform its iterations in a thread-safe manner. For the thief to steal half the remaining range, it must be able to see the range (or at least the remaining range) and then be able to modify the remaining range, and do so in a manner such that the victim and thief do not execute the same iteration(s). This, in turn, requires that the victim's current index (or a copy thereof) be visible to the thief. IOW, the index (or a copy thereof) is volatile (or another type that forces reads/writes to physical memory). The upper limit must also be volatile. When using a critical section to advance the iterator, the coding is relatively straightforward. However, you can also code this without using a critical section. This is harder to program (the first time you do it) and requires the thief to make a tentative theft and wait, observing the victim, until the next iteration. When the iteration computation time is large, a critical section might be advised. When the iteration computation time is short, tentative theft with observation of the victim's next iteration interval is warranted.
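A hedged sketch of the critical-section variant described above (illustrative names, not QuickThread or TBB code): the victim advances its index under a lock, and a thief shrinks the victim's visible upper limit to claim the upper half of whatever remains, so the two never execute the same iteration.

#include <cstddef>
#include <mutex>

struct SharedRange {
    std::mutex  m;
    std::size_t next  = 0;      // victim's current index
    std::size_t limit = 0;      // one past the last index the victim may run
};

// Victim: claim the next iteration, or report that the range is exhausted.
bool take_one(SharedRange& r, std::size_t& i) {
    std::lock_guard<std::mutex> lock(r.m);
    if (r.next >= r.limit) return false;
    i = r.next++;
    return true;
}

// Thief: steal the upper half of whatever the victim has not yet executed.
bool steal_half(SharedRange& r, std::size_t& begin, std::size_t& end) {
    std::lock_guard<std::mutex> lock(r.m);
    std::size_t remaining = r.limit - r.next;
    if (remaining < 2) return false;             // too little left to be worth splitting
    std::size_t mid = r.next + remaining / 2;
    begin = mid;
    end = r.limit;
    r.limit = mid;                               // victim keeps only the lower half
    return true;
}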

Jim Dempsey

www.quickthreadprogramming.com

Apparently atomics aren't noticeably faster than spin_mutex within a single thread (my laptop's Core Duo), and locking also works with nonlinear ranges. Perhaps tentative theft might be useful sometimes, but I'm not sure that configurable grainsize wouldn't provide similar benefits. So does this seem to you like a plausible approach, or have you perhaps already tried it?

Best wishes and signing off for this year,

Raf

Quoting - Matteo Frigo (Intel)

You are correct in saying that this is a tricky issue. I think it helps to separate the problem into two parts. First, you need to ask how well a given code is using the cache. For this you need to have some idea of the optimal number of cache misses. Second, you may want to ask whether the same result can be obtained by cache-oblivious methods. For the first part, it helps to read Jeff Vitter's papers, in particular his book
"Algorithms and Data Structures for External Memory", which is available online. He talks about DRAM vs disk instead of cache vs DRAM, but the problems are the same. This should at least give you some idea of what is possible and of which tricks have been invented already.

The table of contents looks quite interesting, thank you. I've added it to my list of literature I think is worth reading (but most likely will never get to :) )

Quoting - Matteo Frigo (Intel)


> So, do you think that locality-awareness is harmful, or that it's just not worth doing (because of the negligible
> returns)?

No. I do however think that using the cache properly gives you the most bang for the buck on current technology, and that thinking about NUMA issues comes only after having exhausted the opportunities for caching. Another way to say it is that misusing the cache loses a factor of 100 in performance (try the triply nested loop matrix multiply on large matrices), whereas misusing NUMA loses much less than that (2x? 10x?).

I think I see the root cause of the disagreement - we are just talking about different contexts. You are talking about how a highly skilled professional with a suitable problem, a lot of resources, and the willingness to rewrite the application can achieve maximum performance. I agree that in such a context a random LIFO scheduler is mostly enough.
I am talking about a mediocre programmer plus whatever problem plus limited resources plus ... etc. That's what we have with TBB in a lot of cases. Can the scheduler improve the situation in such a context at least somewhat? For example, for some problems (simulation) one will have a very badly structured work DAG if the problem is coded in a natural way, so constant stealing will take place. For some problems there are no cache-oblivious algorithms. And in any case cache-obliviousness is a complication, so it's not a win-win solution (not just performance-wise, but considering various other aspects). If you can do some optimization on the side of a widespread library, it's usually worth doing.
My initial question was about such a context.

Quoting - Matteo Frigo (Intel)

> [..] Do you think that all that will not have any visible returns?

I think that at a large enough scale, locality at all levels becomes fundamentally important, whereas at a small scale you can ignore NUMA if you can use the cache(s). The kind of things that you are suggesting make software design look like hardware design, because you have to worry about how to place things in a 2D (or 3D) physical space. I believe that ultimately you are correct, and there will be visible returns in worrying about NUMA etc. However, there is also a high cost to pay in terms of rigidity of your software: once you have scheduled things manually, your software becomes very hard to change---you really have a piece of hardware that happens to be written in C++. My stencil paper, which you liked so much, was an only partially successful attempt at understanding for how long can we get away with ignoring the physics of computation.

I am not sure I get you here. I am not talking about manual scheduling. I am talking about automatic scheduling which takes into account locality, NUMA, shared caches, HT, etc. Ideally, it never works worse than a random scheduler, and sometimes works somewhat better.
In the limit it may use not only work-stealing, but a combination of work-stealing/distribution/requesting.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
