TBB does not provide a parallel_find(), which would act similar to std::find(): find the first element in a range meeting a given condition/predicate. I think those who know the tbb library well should be able to implement this as follows: a range is split into a first and second half with the first half searched at higher priority (put on the stack last). If the first child task succeeds (finds a match), the second task is cancelled. This is done recursively, of course.

Could you consider implementing this? I'm not sure how to implement the cancellation of the second child task if the first succeeds.

28 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Hi Walter,

I wrote a quick implementation using an atomic<bool> to indicate a hit in any child.

Let me know what you think about it.

You can find it here.

Consider writing a recusive routine using parallel_invoke. Pass in a (vector, lower bound, upper bound, fnChriteria). Keep splitting bounds until threshold based on either separation of bounds or depth of recursion. Return condition for not found in either branch or the lower index of the branch with find. QED

Jim Dempsey

RE: Cancellation

You can change the function from returning the index of the found, to passing in the address of the volatile index. Initialize the found index to not found (e.g. 0xFFFFFFFF). Termination then is indicated by the found index being .lt. your lower bound. Should you find a match, then use a CAS provided that the found entry is .lt. what is now in the found index.

Jim Dempsey

I know how I can indicate termination. My problem is that with TBB I don't know how precisely to peform it. How can the task running the lower-half of the range cancel its sibling task (which runs the upper half)? And, how can the cancellation be guaranteed to propagate down the recursive task tree, should the sibling task already have executed and spawned children?

Another option, which is perhaps what you were thinking about and which does not use cancellation, is to use an atomic variable to store the current best guess (initialised to initial_range::end()) for the iterator of the first found element. If that is less than current_range::begin(), the current_task doesn't need to run. But this requires atomic operations on the iterator -- can we do that? It would restrict the possible template arguments that can be used with parallel_find(). For iterators not allowing atomic operations, we could use a mutex.

Hi Stephan,

I just only now saw your reply. Yes, that worked. However, even on 16 cores, it's only really superior to serial if N>~10^8 and if the searched for element is at index I > ~10^7. For small N or I, it's actually substantially slower than serial code ...

I think my answer didn't get unlocked immediately.

Yes, but I think that you probably can't get better than that. Searching is not really a computational intensive task and if all you are doing is iterating through cache and doing a single compare, most of the time is probably wasted by threading overhead. When the dataset is large though, I think you can get quite a performance boost in the best case. If you are doing a lot of searches, a concurrent datastructure like a concurrent hash-map might be a better fit.

A plain implementation without workstealing which simply splits itself recursively(e.g.: parallel_invoke as Jim proposed) might even outperform the task-approach here.

This looks like a job for a pipeline with a serial input stage that does a serial search through the first few pages and then as its gets bored starts delegating reasonably-sized subranges to later stages (again a few pages each, perhaps growing later on), a parallel stage that finds the earliest occurrence in its own subrange, an ordered serial output stage that looks for the first successful outcome, and a task_group_context for cancellation by the output stage (interpretation by the middle stage might work too). Slightly more advanced would be to interpret default_num_threads() and use interleaving (each data item represents a regular set of staggered subranges, either aligned at cache lines or sufficiently long to just amortise cache misses), thus increasing the chance of finding an answer sooner without using more tasks, and to be sure to interpret cancellation frequently enough in the middle stage. Well, it's what I would try, anyway...

(Added 2013-04-22) Maybe without that interleaving idea, though.

>> How can the task running the lower-half of the range cancel its sibling task (which runs the upper half)?

The found index (initialized to 0xFFFFFFFF or 0x7FFFFFFF), which is set with atomic operation (CAS) is used as the termination indicator.

The following is pseudo code
(won't work as-is)

void parallel_find_helper(SomeVector, fnSomeTest, iBegin, iEnd, pFound)
  if(*pFound .lt. iBegin) return;
  if(iEnd - iBegin .lt. cutoff)
     for(int i=iBegin; i .lt. iEnd; ++i)
        int OtherFound = *pFound; // freshen
        if(i .lt. OtherFound) return;
          // match
          while(!CAS(pFound, i, &OtherFound))
             if(OtherFound .lt. i) return;
          return; // success, result in *pFound (ours or some other .lt. i)
      return; // fail
  } // if(iEnd - iBegin .lt. cutoff)
  // recurse
  int iSplit = iBegin + ((iEnd-iBegin)/2);
    []() { parallel_find_helper(SomeVector, fnSomeTest, iBegin, iSplit, pFound); },
    []() { parallel_find_helper(SomeVector, fnSomeTest, iSplit, iEnd, pFound); });
int parallel_find(SomeVector, fnSomeTest)
  int Found = 0x7FFFFFFF; // not found
  parallel_find_helper(SomeVector, fnSomeTest, SomeVecto.begin(), SomeVector.end(), &Found);
  return Found;

Jim Dempsey

I disagree: the result is more likely to be found in the first half than in the second half, and real cancellation would entirely prevent execution for most tasks.


your code does not cancel any scheduled/running tasks. Rather, it uses an atomic variable as indicator, which each task has to constantly read in order to effectuate something like cancellation. With many threads, this results in a heavy atomic load and hence will not scale well.


re your earlier post. Surely, the problem of parallel finding is well researched. The strategy I was advocating is essentially what Anthony Williams uses in his "C++ concurrency in action" (chapter 8.5.2), using std::async(). He also commented on the problem of using an atomic indicator, as the "atomic loads can slow operations".

I also had a remark about that atomic before I surreptitiously removed it because I don't see why reading it should necessarily be such a burden... provided of course that you do read it before attempting a CAS (don't rely on any rumours about optimisation in hardware)! If the strategy is otherwise sound (searching primarily near the beginning), there shouldn't be too many updates of that atomic variable before you settle on the final result.

Using reseaurch results is more professional but also less fun than reinventing the wheel on the spot. :-)

The strategy with halving the input range won't work as intended with TBB where stealing is breadth-first. Maybe Anthony Williams is using another toolkit that works differently and therefore more closely resembles my proposal, even if it wastes creating a task for the work currently set aside? But somehow you have to know for certain when the result is in, and with TBB a pipeline seems a natural fit. That's not to say that you couldn't possibly improve the outcome marginally by also communicating through a shared (atomic) currently best result. I'll leave that up to you, if you wish to find out for yourself (but do tell us the outcome!).

Of course, all attempts are futile if memory bandwidth is saturated first, depending on the search predicate.

(BTW, you don't need atomic iterators if they're random access, because then atomic<ptrdiff_t> will help you navigate anywhere in an instant.)

Anthony Williams uses the same approach with an atomic<boo> flag. He plainly splits recursively and checks after every comparison whether the done flag was set.

I like raf's idea about doing a preliminary scan on the first elements. Nevertheless I think that searching is just not the best task for parallelism.



>>your code does not cancel any scheduled/running tasks. Rather, it uses an atomic variable as indicator, which each task has to constantly read in order to effectuate something like cancellation. With many threads, this results in a heavy atomic load and hence will not scale well.

I would argue that the cancellation of scheduled/running tasks has higher overhead than reading a shared variable. cancellation requires traversal of list of stacked threads who's status/state are effectively atomic variables. I suspect that explicit thread cancellation has higher overhead than letting thread run to point of observing earlier index find detected. The length of the run of the lowest level task is the (remainder of) run time of the non-forking portion of the search.

With non-(task cancellation), i.e. with atomic found variable, the shared variable gets evicted from L1 cache of other threads only on change, change only occurs when new found has lower index. So let's estimate the number of CAS operations:

Example scenario:

Assumed vector to be searched is partitioned into 1024 parts (tasks)

Absolute worst case could be 1024 CAS operations.
Average is NOT 512

Assuming the 1024 partitions are randomly run with the available thread pool (non-random can be illustrated at a different time).

First update of atomic variable holding found index:
First thread to reach find would occur at average partition 512.
All threads running in partitions above 512 would abort task as soon as found index was observed (immediately after fnMatch).
There is a very small window (2-3 statements) where a CAS might attempt to be performed when unnecessary. This may be negligible.
All subsequent stacked tasks in partitions above 512 would run no further than first test of found index.

Second update of atomic variable holding found index:
First thread to reach find would occur at average partition 256.
... (see first update, adjust for smaller range)

Third... First thread to reach find would occur at average partition 128.
Forth... First thread to reach find would occur at average partition 64
Fifth... First thread to reach find would occur at average partition 32
IOW, number of CAS's average to find ~log2(number of partitions)-1
In this case, 9 CAS

A more optimal search strategy would be to run the thread pool in manageable chunks from the low end of the vector onwards. While using similar atomic found index. The parallel_pipeline that Raf suggested could be useful in this regard, however a slightly different forking function may function better. Tests would have to be done to work out the better technique.

Jim Dempsey


Stephan Dollberg wrote:

Anthony Williams uses the same approach with an atomic<boo> flag. He plainly splits recursively and checks after every comparison whether the done flag was set.

Are these parts safely set aside on a stack with serial access so that a lot of threads won't plunder it and defeat the purpose (thread gets top item that's way too big because other threads have taken the smaller ones already, splits it, puts back the other half in the wrong order because it's bigger than the current top and also succeeds it, next thread gets the wrong top, claims success, and the real winner is still on the stack)? What's the cost of the locking to prevent all these mishaps (don't forget convoying)? Once you add the necessary mutual exclusion, how could this be any better than just taking a fixed-size piece each time? It just doesn't make sense, except perhaps as a poor man's version of TBB to find a hit anywhere in the input range, which is quite different from finding the first hit.


Stephan Dollberg wrote:

I like raf's idea about doing a preliminary scan on the first elements. Nevertheless I think that searching is just not the best task for parallelism.

Don't use it to beat Boyer-Moore and expect to win. But maybe you're doing something more challenging that doesn't incur a memory bottleneck first and where the pickings are slim so that the parallel overhead is a price you want to pay? There is no absolute answer, it all depends. That's why they (also) call it software engineering. :-)


Jim Dempsey wrote:

I would argue that the cancellation of scheduled/running tasks has higher overhead than reading a shared variable. cancellation requires traversal of list of stacked threads who's status/state are effectively atomic variables.

I agree (see higher) that an atomic might well be an appropriate communication mechanism, because it will soon settle on the right answer iff the strategy is otherwise sound. But cancellation is actually quite cheap if the current task group context does not have descendants, because then traversal of the scheduler is cut short before it really gets started. In a crowded environment, cancellation should not be overlooked.

I do maintain that no recursive splitting of the input range is a sound strategy to find the first hit.


#1: I don't know if I understand you 100%. He uses recursive binary splitting till a certain threshold is reached, on joining both halfs, he prefers the left element so that the first element is gonna be selected. For every split he spawns a new task for the right half using std::async. So yeah, he has some implicit locking in the std::future. I am not a big fan of his approach either because he also checks after every comparison. This again might of course be ok, if the checks aren't a simple value comparison as you mentioned in your second part. Though, only measurements can prove any of this.

#2: :) You are right of course, my over-generalization was bad. In such a case, my approach from above should work, too, the more difficult thing which probably comes up here is to select the right threshold for more complex and more simple comparisons.

Well, that's not very smart of "him", is it? Why would you possibly want to touch the right half before you have exhausted the left half, unless you know there's at most one item in the whole collection, in which case the challenge of finding the first qualifying element and that of finding any qualifying element become one and the same (you can very easily do the latter with parallel_for and cancellation)? Without such knowledge, you also have to keep track of where you have looked already (that's the exhaustive bit mentioned above), implicitly by waiting or explicitly by... well, let's not even go there! In any case, you're wasting performance this way, and with an atomic<bool> it's even worse because it will delay the moment of communication until you've found the final result (it would be just like cancellation) instead of being able to let others know they're on the wrong track (searching to the right of the current best result).

I think you may have misinterpreted this (first vs. any), or else it's just a demonstration of technique (like TBB and Fibonacci), or else you should stay well clear of this person's software and stop reading his books! ;-)

(Added) I've had a look at your program, and if the challenge is big enough for any stealing to occur, a task searching a later part of the collection may set "done" to true, causing a task about to search an earlier part to exit early and return NULL, and then, unless perhaps if there's also a hit further to the left, NULL will be the outcome of the algorithm, instead of the real first hit, or any hit at all really. Instead, you should return "end" to exit early, and then the program will give you any hit in the collection, with a mostly useless bias toward any elements near the start (those are visited before stealing starts) but otherwise just random (in the traditional sense).

Yeah, I was thinking of "first" more like "give me any as fast as possible and if you have currently more than one hit give me the one most to the left".

Most peculiar... what relevance could that possibly have?

Also, I think Walter's original question was rather unambiguous, and it was he who first referred to that book.

I just noticed that I have overlooked Walter's question about cancellation. In TBB, there is no preemption of tasks, neither for scheduling nor for cancellation. Instead, a task must regularly consult the cancellation status of its own task group context, so in that regard it's the same as communicating by atomic<bool>. Where it differs is that those contexts can be nested hierarchically, with probably many tasks being associated with one context, and cancelling one context means cancelling all descendants as well, so effectively all tasks associated with all those contexts. A selling point for using this mechanism is how it is built in to the prepackaged algorithms, so mostly you only have to declare a context, hook it up to an algorithm, and possibly cancel it.

To me it seems more natural. To make it explicitly clear and to avoid confusion, Anthony Williams also explicitly states that he is looking for "any" element and not the very first one.

Using a task_group_context seems like another good idea for both cases.

>>Yeah, I was thinking of "first" more like "give me any as fast as possible and if you have currently more than one hit give me the one most to the left".

Then in my posted algorithm you would abort the search when index found != "not found" (0xFFFFFFFF or 0x7FFFFFFF)
With additional CAS should you observe an "index found" while you have yourself a found and your found is .lt. current index found.

In this situation, best case is one CAS (or 0 CAS in event of no match), worst case is thread pool size number of CAS's


Variation on binary split

The ..._helper function:
1) split iteration space into non-splitable size + remainder
2) searches non-splitible size, if found follow CAS rules stated earlier, return (done)
3) (must handle remainder) if no remainder, return
4) if remainder .le. non-splitable size, search remainder, if found follow CAS rules stated earlier, return (done)
5) parallel_invoke(first half of remainder, second half of remainder)

What the above does is defer additional task enqueues for the higher portion of each branch. This will permit each thread in thread pool to drill down to lowest partition size without enqueuing additional tasks.

Jim Dempsey

amazing what you all work out on the weekend ...

okay. So, I now regret to have quoted Anthony Williams, because, obviously, his approach to this particular problem is just stupid, as it searches many elements potentially unnecessarily: testing elements to the right of the one we want. A clever algorithm must minimise unnecessary tests.

If there was no parallel overhead and computers were perfectly symmetric machines, then the best stragegy with p synchronuous parallel threads is simple: the ith thread tests elements i+k*p k=0,1,... So, the p first elements are tested synchronuously first (for k=0), then the next p (k=1) etc. Thus, the number of elements tested unnecessarily it never larger than p-1 and p/2 on average.

In practice, there is parallel overhead and you presumably want to avoid reading the same cache line at different threads, i.e. each "element" in my above algorithm should be a super element no smaller than a chache line. It also seems that this algorithm is better implemented directly on the thread level rather than using  tasks. An important issue is that the performance depends critically on the ability of the threads to act near-synchronously. If one thread is much slower than the others and the element to be found is in its part of elements, performance is reduced a lot. In such a case, one may consider using an atomic index to the next (super-)element to be searched.

Perhaps I have time later to implement a simple algorithm directly in C++11 threads with these ideas.

>>Perhaps I have time later to implement a simple algorithm directly in C++11 threads with these ideas.

You can do much of what you want in tbb. This may only require the addition of a few features. I suggest looking at the problem and determining the probability of number of nodes matching the search chriteria. Divide the search set size by the (probability of number of matches) * (thread pool size) and use this as a grain size to parallel_for. This might be a good starting point. See my next post relating to hypotheticals.

Jim Dempsey

This is in response to private message with Raf. We though it be pertinent to add to this forum thread. Raf’s questions begin with >>

>> I'll leave you the opportunity to correct yourself about that comparison (or not).

Raf is referring to an error in my pseudo code where the if test for early termination was backwards. Thank you Raf.

>> I must say I'm "surprised" about your suggestion to use tasks as if they were threads. How could that possibly be beneficial for any purpose?

Threads perform tasks. Due to lack of knowledge of number of threads immediately available one tends to find it beneficial to partition the workspace up into more tasks than threads.


Should partitioning == nThreads with say 8 hw threads this yields 8 tasks (and assume equal work per task/partition). With all 8 hw threads available the work gets done in 1/8 the time (plus 8x thread scheduling time). However, should one of the hw threads not be available, the work gets done in 1/8 + 1/8 of the time (2x that of expected).

Alternately should the partitioning be in finer partitions, say 2 per thread (16 tasks), when all 8 threads are available 1/16(8) + 1/16(8) + 16x thread scheduling time, or same time as 8 task +8x more thread scheduling time.

Now look at 7 hw threads with 16 tasks: 1/16(7) + 1/16(7) + 1/16(2) + 16x thread scheduling time.

8 threads, 8 partitions = 1/8 run time (.125rt) + 8x thread scheduling
7 threads, 8 partitions = 1/4 run time (.25rt) + 8x thread scheduling (~2x ideal)
8 threads, 16 partitions = 1/8 run time (.125rt) + 16x thread scheduling (~== ideal)
7 threads, 16 partitions = 3/16 run time (.1875)  + 16x thread scheduling (~1.5x ideal)
If on average 1 thread is not available then larger number of tasks yields ~33% faster code.

The smaller partitioning (more tasks than potential number of threads) yields faster code when not all threads are available, but at the expense of higher thread scheduling overhead. At some point you reach a diminishing return on creating additional tasks.

What is the optimal number of partitions (tasks)?

This will vary under the dynamics of the system and the ratio of thread scheduling overhead verses iteration through partition. Let’s consider the circumstance where whatever the number of threads are immediately available at the start of the parallel_foobar will be available for the duration of the activity and that no additional threads will become available during that run time. Also, let’s consider 0 overhead for task enqueue/dequeue. The optimal number of partitions (tasks) is the number that is evenly divisible by each of the potential numbers of threads.

8 hw threads, nThreads {1,2,3,4,5,6,7,8}, nTasks 8*7*6*5 = 1680 (note, 4 fits into 8, 3 fits into 6, 2 fits into 8)
16 hw threads, nThreads {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, nTasks = 16*15*14*13*12*11*10*9 = 518,918,400

Task scheduling overhead is not insignificant (computationally nor memory requirements). Using the above requirements, partitioning to optimal number of tasks even under the assumption of 0 overhead for task enqueue/deque is untenable.

Some foreknowledge of iteration overhead verses task scheduling overhead is required to keep task scheduling overhead portion of compute time to an acceptable level.

What’s acceptable is subjective?

Assume a high level of code is required for match test results in 1,000 iterations = task scheduling overhead, then 100,000 iterations per task yields 1% task overhead, 10,000 iterations yields 10% overhead. On an 8 hw thread system, at the 1%, the vector size would have to be .gt. 1,680 x 100,000. 

Now then ask yourself: Is the default partitioning of say parallel_for knowledgeable of state of the application at the time of the partitioning? The answer is no, so they use a rule of thumb for partitioning. As to what this is, I cannot say. The programmer can tune the partitioning to minimize both functional runtime and thread scheduling overhead.

>> Any time I see a grainsize that depends on problem size, I see that as a red flag.

tbb::parallel_for uses grain size whether you specify it or not. In many cases one finds it beneficial to specify a grain size: programmer’s knowledge about how many threads will be available at this point in the program (e.g. is this parallel_for run at top level where all threads are available or some nested level where few threads are available), as well as overhead of task scheduling verses iteration.

>> But it seems more important that the algorithm shouldn't start searching toward the right before the left half has been exhausted,

With random distribution of nodes to find, and with unique entries (one match in vector), it does not matter which side you start processing first. The better strategy is to have each thread take an equi-partition and search left to right (iow nTasks == nThreads).

With an average of 2 possible matches, then it may be beneficial for all treads to examine the left half first, then all threads to examine the right half next. IOW with average of n possible matches, divide the range into n primary partitions, and process each of n primary partitions in sequence using all preferred number of threads (task set size) (see next question).

>> And maybe something to automatically avoid using too many CPUs in the presence of memory saturation (how would one go about that?).

This requires the foreknowledge of the ratio of L1 cache hits (instruction and data) per memory (or L3) hits. When the search is light weight (int key == value) or ((double key >= MinVal) && (double key <= MaxVal)), then depending on memory system 2-4 threads per memory channel might be optimal. For higher weight tests, possibly more threads per memory channel due to more code and match criteria fitting in L1. The default settings for grain size won’t know this, the programmer may know this. Heuristics might help to automate this assuming overhead for heuristics is insignificant.

>> Things certainly depend on numerical details not provided, but I'm assuming that they are such that parallel execution could be beneficial, and I wouldn't go as far as wondering about element removal.

Agreed. Knowledge of the application requirements is paramount to writing efficient code. Selecting a generic parallel_find may be counter productive.

Jim Dempsey


It seems to have been the royal "we" who decided it was "pertinent" to bring this back online...

There is overhead to task scheduling, but I'll go with the analysis in the User Guide, section "Parallelizing Simple Loops/parallel_for/Controlling Chunking", that shows a graph indicating that execution time significantly decreases with increasing grainsize and then stays flat "because the parallel overhead becomes insignificant for a sufficiently large grainsize", and a "rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute". I also don't want to run into the right end of the graph because of undersubscription either right away or at the end (because of insufficient parallel slack). The takeaway here is that chunks should not become too big. It is true that auto_partitioner starts with chunks about 1/16th of the fair share of the work, i.e., they do depend on the problem size, but that is not an absolute, because stolen work will be subdivided further (even smaller than 1/16th). A specified grainsize based on the total amount of work cannot do that, so it certainly should not approach a fair share and thereby make tasks behave like threads. (Note that it can still be useful to use a tuned but fixed grainsize together with auto_partitioner.)

The assumption here is that finding the first is not the same as finding any (despite what one Anthony Williams may or may not have written), i.e., there are more than one qualifying elements, so searching should indeed be asymmetrical. If you have an average of 2, finding 2 does not give you the knowledge that you have found all of them and that therefore the leftmost of those 2 is the first in the collection (there might be another one further toward the beginning), and you will still have to exhaustively search from the beginning to the presumptive first hit before you can stop and go home. That goes for any number n. And your argument also carries in itself the reason that most of the work will have been wasted, even before my argument that an exhaustive search is still required: if all n partitions have on average 1 hit, on average n-1 partition searches will have been wasted, and these are exactly the ones not starting from the beginning.

Avoiding memory bandwidth saturation can obviously be done by tuning, but it does seem to depend on the system in a way that TBB or we won't be able to abstract away just yet.

(Corrected 2013-05-02) I was aiming for "imperial we" when mixing it up with another word (that I'm not repeating now), but that isn't it, either, according to Wikipedia at least. Serves me right for trying to translate to English when "pluralis maiestatis" will do just fine... :-)

>>if all n partitions have on average 1 hit, on average n-1 partition searches will have been wasted

Raf, you need to re-read my lengthy post.

When you have ~n matches per criteria, and when these nodes are randomly placed in the vector, then you make a primary partitioning of the vector into n partitions. Each of these partitions on average will have 1 node contained at a random position within the partition. You do NOT assign a task to each of these partitions. Rather you sub-partition each partition into tasks, # tasks == # available threads (an unknown in TBB, therefore you choose thread pool size). The first major partition is searched by all threads, most of the time a hit is made in the first primary partition. Search time ~(size() / averageNumberOfNodes / threadPoolSize / 2) iterations. Note, this further requires 1 CAS. In the next smaller fraction of circumstances, the first major partition will be void of matching criteria, and 2 nodes present in the second partition matching criteria. This yields a search time of:

~(size() / averageNumberOfNodesMatchingCriteria / threadPoolSize) iterations for first primary partition
+~(size() / averageNumberOfNodesMatchingCriteria / threadPoolSize / 4) iterations for second primary partition
+ 1 CAS most of the time, 2 CAS a smaller fraction of time.

Jim Demspey

Ah, missed the redesign...

Well, in general you most likely won't know what to expect, and even if you do it would still be suboptimal to skip ahead by O(input size).

Leave a Comment

Please sign in to add a comment. Not a member? Join today