help, how to stop or terminate parallel_for and parallel_reduce

Is there a way to terminate an iteration when doing parallel_for or parallel_reduce? Like, for example, the "break" in a normal for loop?

Thanks in advance for reading this, and much more if there is feedback. God bless.

Quoting - mblizz

Is there a way to terminate an iteration when doing parallel_for or parallel_reduce? Like, for example, the "break" in a normal for loop?

Thanks in advance for reading this, and much more if there is feedback. God bless.

Hello, there is functionality in TBB called "task cancellation". From what you ask, this is exactly what you need. It's described in section 8.3.8 of the Reference manual, and here is a short example of how to use it:

class FindElemAndCancel {
    const vector<int>& intVec;
    const int valToFind;
    int * const idx;
public:
    FindElemAndCancel(const vector<int>& intVec_, const int valToFind_, int *idx_)
        : intVec(intVec_), valToFind(valToFind_), idx(idx_) {}

    void operator() (const tbb::blocked_range<int>& r) const {
        int r_end = r.end();
        for ( int i = r.begin(); i < r_end; ++i ) {
            if ( intVec[i] == valToFind ) {
                tbb::task::self().cancel_group_execution();
            }
        }
    }
};

tbb::parallel_for( tbb::blocked_range<int>(0, problemSize), FindElemAndCancel(intVec, valToFind, &valIdx), tbb::auto_partitioner() );

Quoting - mblizz

Is there a way to terminate an iteration when doing parallel_for or parallel_reduce? Like, for example, the "break" in a normal for loop?

Thanks in advance for reading this, and much more if there is feedback. God bless.

Read the sections about task cancellation and task_group_context in the TBB Reference manual (Reference.pdf). A series of blogs on exception handling and cancellation by Andrey Marochko could also provide additional information.
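
To make the task_group_context route concrete, here is a minimal sketch, assuming the parallel_for overload that accepts an explicit context (the names StopOnMatch and findAndStop are illustrative, not from the manual):

#include <vector>
#include "tbb/task.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Body that cancels every task sharing the explicit context on a match.
class StopOnMatch {
    const std::vector<int>& vec;
    const int target;
    tbb::task_group_context& ctx;
public:
    StopOnMatch(const std::vector<int>& v, int t, tbb::task_group_context& c)
        : vec(v), target(t), ctx(c) {}
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            if (vec[i] == target) {
                ctx.cancel_group_execution(); // cancel the whole loop
                return;
            }
        }
    }
};

void findAndStop(const std::vector<int>& vec, int target) {
    tbb::task_group_context ctx;
    tbb::parallel_for(tbb::blocked_range<int>(0, (int)vec.size()),
                      StopOnMatch(vec, target, ctx),
                      tbb::auto_partitioner(), ctx);
}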

Quoting - Anton Pegushin (Intel)

Hello, there is functionality in TBB called "task cancellation". From what you ask, this is exactly what you need.

Anton,

Since all instances of the parallel_for are using the same referenced iterator, why can't the thread discovering the termination condition simply set the end to the begin, or add a member function to do this (e.g. terminate())? In this manner you do not need to add additional code other than a call to r.terminate().

Jim Dempsey

www.quickthreadprogramming.com

"all instances of the parallel_for are using the same referenced iterator" That would not appear to be so.

Quoting - jimdempseyatthecove
Anton,

Since all instances of the parallel_for are using the same referenced iterator, why can't the thread discovering the termination condition simply set the end to the begin, or add a member function to do this (e.g. terminate())? In this manner you do not need to add additional code other than a call to r.terminate().

Jim Dempsey

Such a solution was described in Arch Robison's blog post: "Have a fish - how to break from a parallel loop in TBB".

Quoting - Raf Schietekat

"all instances of the parallel_for are using the same referenced iterator" That would not appear to be so.

Hi, I'm guessing what Jim meant was that all instances of parallel_for use the same blocked_range object by reference. That is true, and that is what Arch Robison used for his sample code that Alexey mentions. Arch wrote that code and that blog entry before task cancellation functionality was added to TBB, but we already had at least two customers requesting it.

Sample code that I posted in my original comment was aimed only to help understand how cancellation can be used. Justifying cancellation and making it the only possible solution would require coming up with a much more complex problem, which might have defeated the purpose of that example.

"all instances of parallel_for use the same blocked_range object by reference" Who are you and what have you done to Anton Pegushin? :-)

I wrote "Who are you and what have you done to Anton Pegushin?" I should probably have reserved that for another situation...

More straightforwardly, the described technique had to rely on explicit inheritance of a reference/pointer to a common external switch (which should probably be a tbb::atomic instead), because each Body::operator() invocation uses a distinct Range instance (a corollary of the splitting-constructor API).
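
To illustrate, a minimal sketch of such a Range, assuming a tbb::atomic<bool> as the common switch (breakable_range and its members are made-up names; the Body should poll the same flag if subranges are long):

#include "tbb/blocked_range.h"
#include "tbb/atomic.h"

// Range that reports itself empty once the shared switch is thrown.
class breakable_range {
    tbb::blocked_range<int> r;
    tbb::atomic<bool>* stop; // common external switch, shared by all splits
public:
    breakable_range(int b, int e, size_t grain, tbb::atomic<bool>* s)
        : r(b, e, grain), stop(s) {}
    breakable_range(breakable_range& other, tbb::split s)
        : r(other.r, s), stop(other.stop) {} // every split shares the flag
    bool empty() const { return *stop || r.empty(); }
    bool is_divisible() const { return !*stop && r.is_divisible(); }
    int begin() const { return r.begin(); }
    int end() const { return r.end(); }
    void cancel() const { *stop = true; } // the "break"
};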

Quoting - Raf Schietekat

"all instances of parallel_for use the same blocked_range object by reference" Who are you and what have you done to Anton Pegushin? :-)

:) you're right. What I said was incorrect in so many ways... what was I thinking.

What I actually meant was that the parallel_for algorithm iterates over a range, which can be implemented to reference a global "cancel this operation" flag; when that flag is signalled, the range sets its end to its beginning, or otherwise reports itself as empty.

Quoting - Anton Pegushin (Intel)

:) you're right. What I said was incorrect in so many ways... what was I thinking.

What I actually meant was that the parallel_for algorithm iterates over a range, which can be implemented to reference a global "cancel this operation" flag; when that flag is signalled, the range sets its end to its beginning, or otherwise reports itself as empty.

The parallel_for, which creates the blocked-range iterators for each thread (with different ranges), could construct a terminatable class of iterator which points back (instead of to a global termination flag) to a common variable constructed by the parallel_for (call it a parent iterator control structure if you wish). The next() for this type of iterator could then test the termination flag in addition to the limit on the range. You would also need a terminate(), or have the dtor of the (component) thread's copy of the iterator perform an implicit terminate().

Conceptually you could view this as a shared reduction variable (as opposed to a private copy to be reduced at the end of the task). When modified, it indicates a solution was found (or some other termination condition).

Using a shared variable would have less of an impact on the system than terminating a thread (or task), at least when each pass of the iteration has relatively low computation requirements.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - Anton Pegushin (Intel)

Hello, there is functionality in TBB called "task cancellation". From what you ask, this is exactly what you need.

I have been looking just for this example. I do wonder if it is correct.
void operator() (const tbb::blocked_range<int>& r) const {
    int r_end = r.end();
    for ( int i = r.begin(); i < r_end; ++i ) {
        if ( intVec[i] == valToFind ) {
            tbb::task::self().cancel_group_execution();
            *idx = i; // <--- This was missing
        }
    }
}

It was missing the line *idx = i;
Correct me if I am wrong. Would that be the way to return information from the for loop?

Some "details" were indeed left out. It might be clearer to put that missing assignment before the cancel_group_instruction(), and then you might as well also break/return. If you're sure that only one index will be found, you can probably get away with a pointer/reference to a plain int, if you only read its value after returning from parallel_for; otherwise make that int atomic. Also don't forget to occasionally poll is_cancelled() when not relying on a suitable grainsize.

Quoting - Raf Schietekat
Some "details" were indeed left out. It might be clearer to put that missing assignment before the cancel_group_instruction(), and then you might as well also break/return. If you're sure that only one index will be found, you can probably get away with a pointer/reference to a plain int, if you only read its value after returning from parallel_for; otherwise make that int atomic. Also don't forget to occasionally poll is_cancelled() when not relying on a suitable grainsize.

Hi,

Raf is right. The index assignment should be before the call to "cancel", and guarded too. Also keep in mind that when searching in parallel like in the example above, you're not guaranteed to obtain the index of the first occurrence. The index will point to _some_ element in the array, wherever one of the worker threads found it first.

Quoting - Raf Schietekat
Some "details" were indeed left out. It might be clearer to put that missing assignment before the cancel_group_instruction(), and then you might as well also break/return. If you're sure that only one index will be found, you can probably get away with a pointer/reference to a plain int, if you only read its value after returning from parallel_for; otherwise make that int atomic. Also don't forget to occasionally poll is_cancelled() when not relying on a suitable grainsize.

What do you mean by occasionally? Do you mean is_cancelled() is expensive? How slow is it?

So this is how it should look:

void operator() (const tbb::blocked_range<int>& r) const {
    int r_end = r.end();
    for ( int i = r.begin(); i < r_end; ++i ) {
        if ( tbb::task::self().is_cancelled() )
            break;
        if ( intVec[i] == valToFind ) {
            *idx = i;
            tbb::task::self().cancel_group_execution();
            break;
        }
    }
}

I don't know how expensive is_cancelled() really is, but it's not inline, and some processors may incur an extra penalty for the associated memory semantics, so you may perhaps do well to call it only every 100 iterations or so (whether there is a benefit, and the exact number of iterations, is to be determined empirically).

I didn't think it was necessary to put the result assignment before the cancellation, though, just clearer.

Quoting - Raf Schietekat

I don't know how expensive is_cancelled() really is, but it's not inline, and some processors may incur an extra penalty for the associated memory semantics, so you may perhaps do well to call it only every 100 iterations or so (whether there is a benefit, and the exact number of iterations, is to be determined empirically).

I didn't think it was necessary to put the result assignment before the cancellation, though, just clearer.

Actually, task::is_cancelled() is inline in task.h; task::self() is out-of-line, however, and it contains an access to thread-local storage, which is more expensive than reading e.g. stack variables. So Raf is right; run a few experiments with representative workloads to find a good balance between the cost of excessive computations and the overhead of the cancellation check.
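
Putting those two observations together, a sketch of the body from the example above (reusing its intVec, valToFind, and idx members) might look like this; the 100-iteration polling period is arbitrary and should be tuned:

void operator() (const tbb::blocked_range<int>& r) const {
    tbb::task& self = tbb::task::self(); // hoist the TLS lookup out of the loop
    int r_end = r.end();
    for ( int i = r.begin(); i < r_end; ++i ) {
        if ( (i % 100) == 0 && self.is_cancelled() ) // amortized cancellation check
            break;
        if ( intVec[i] == valToFind ) {
            *idx = i;
            self.cancel_group_execution();
            break;
        }
    }
}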

>>you're not guaranteed to obtain the index of the first occurrence. The index will point to _some_ element in the array, wherever one of the worker threads found it first.

This is a reason why you would favor a termination state variable in the iterator (or pointed to by the iterator). Should you issue a cancel_group() call, then the first thread to find a match (not necessarily the one with the earliest index) would terminate the search (note, you may also have multiple threads concurrently issuing the cancel_group).

With the termination state variable, a thread can test not only whether another thread found a fish, but also whether the found fish was further up/downstream of where it is now looking. Depending on the placement, the thread can decide NOT to terminate just yet, and continue searching until it passes the position in the stream where the other fish was found.

For this type of search (lowest/highest index) you would want to chunk the range into smaller pieces, such that all threads work in smaller sections, all progressing front to back (or back to front) through the data set.

Jim Dempsey

www.quickthreadprogramming.com

"Actually, task::is_cancelled() is inline in task.h;"
Yes, but that's only the tip of the ice cube, :-) which in an inner loop could perhaps make a bit of a difference, although the acquire is probably more important, even on Intel's own hardware (Itanium).

"task::self() is out-of-line however, and it contains access to thread local storage which is more expensive than reading e.g. stack variables."
Of course... sorry, I just replied to the question where I really should have looked at the code itself. Unfortunately Body and Range are just duck-typed, so there's nothing to easily redefine underneath (nod to Jim Dempsey).

"So Raf is right; run a few experiments with representative workloads to find good balance between the cost of excessive computations and overhead of the cancellation check."
Another idea, instead of, or rather in addition to, amortisation (sic?), is testing a shared atomic, perhaps the result itself, instead of calling is_cancelled(). You should find that the optimum number of iterations for amortisation will be lower, but it should still be beneficial. cancel_group_execution() would still serve to more efficiently tear down the parallel_for itself (I presume?).
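
A sketch of that shared-atomic variant, using the result itself as the signal; this assumes found is a tbb::atomic<int> initialized to INT_MAX before the loop, and keeps the smallest matching index via compare_and_swap (FindLowest is an illustrative name):

#include <climits>
#include <vector>
#include "tbb/atomic.h"
#include "tbb/blocked_range.h"

struct FindLowest {
    const std::vector<int>& vec;
    int target;
    tbb::atomic<int>* found; // shared result, initialized to INT_MAX
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i < r.end(); ++i) {
            if (*found < i) return; // an earlier match already exists: bail out
            if (vec[i] == target) {
                int old = *found;
                while (i < old) { // keep the minimum index via CAS
                    int prev = found->compare_and_swap(i, old);
                    if (prev == old) break; // our index was stored
                    old = prev;             // someone else wrote; re-check
                }
                return;
            }
        }
    }
};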

Testing is important. Through testing you can derive a generalized function for determining the partitioning of the problem. This function can then be used in your production code. Note, the partitioning function may be reduced to a series of if tests just prior to the parallel_for.

Data (array) xxx, thread partitioning tn

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
|t0|t1|t2|t3|t0|t1|t2|t3|t0|t1|t2|t3|t0|t1|t2|t3|...

or

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
| t0 | t1 | t2 | t3 | t0 | t1 | t2 | t3 | t0 | t1 |...

The shorter the run time (|tn|), the lower the latency (from found to all threads terminating).
However, the shorter the run time per (|tn|), the higher the number of session start/stop times.

The recommended technique (my recommendation) would be to provide a 6-valued iterator:

iBegin, iEnd, iRunLength, iSkipLength, iThreadNum, foundIndex*

Where

iBegin, iEnd are the full half-open range
iRunLength is a consecutive index count
iSkipLength is the number of elements to skip between consecutive index runs
iThreadNum is the 0-based team member number working on the parallel_for
foundIndex* is where you store the result (atomic and viewable by all team members)

The iRunLength should also be tuned for cache sensitivity.

With the above type of iterator, the parallel_for instantiates the thread team once, then the iterator takes care of parceling out the array slices.

When operating in sections of the array prior to any thread finding a match, *foundIndex will remain unchanged and thus remain in cache (likely L1), adding very little overhead to the test. When a thread makes a find, it performs a CAS only when its found index is .lt. (or .gt.) the currently stored found/not-found indicator. When a thread has not found a match, it tests *foundIndex: if nothing is found yet, or if the found position follows its own position, it continues; if the found position precedes its own position, it quits.

The above iterator should have relatively little overhead as a two-level loop:

for (int i = iBegin + (iRunLength * iThreadNum); i < iEnd; i += iRunLength + iSkipLength)
{
    int jEnd = min(i + iRunLength, iEnd);
    for (int j = i; j < jEnd; ++j)
    {
        if (*foundIndex < i) return; // lowest match; foundIndex initialized to int max
        if (match(...))              // the user's match test
        {
            // store our index or keep the other found index, whichever is less
            for (;;)
            {
                int t = *foundIndex;
                if (j > t) return;
                if (CAS(foundIndex, t, j)) return;
                // foundIndex changed, try again
            }
        }
    }
}

And a slightly different version for searching from high to low.

Stick the above into a common iterator type once you standardize the functionality.

Jim Dempsey

www.quickthreadprogramming.com

If you want to solve the smallest/largest-index problem, should you really use CAS, I wonder? I'd rather try parallel_reduce() and a simple "item found" signal. Just be careful about how to split a Range (each time getting a new discontiguous piece of a shared state, or something cleverer, see below), and never use task cancellation (which would only be correct with nested contexts in a bespoke algorithm, and would then probably not bring any benefit anyway, or worse). I think that should be strictly faster than the CAS solution, although not necessarily by a lot. Would you agree with that, Jim?

Just giving out discontiguous pieces may not yet be the most efficient way to distribute the work, because the winner may be near the end of a piece, and just making the pieces smaller implies more task-scheduling overhead. Another solution (I'm just brainstorming now), assuming good cache-line alignment and N worker threads, would be to give out pieces of the P items at indices (i div N)*N*P + (i mod N) + N*[0,P[, with i growing from 0 (maybe some items could be packed together for better alignment). The next question is how to determine good values for N and P at run time, and perhaps vary them during the computation by deviating from the regularity of that formula.
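
For comparison, a sketch of the parallel_reduce() variant: each body keeps a local best index that join() reduces with min, while a shared hint (a tbb::atomic<int> initialized to INT_MAX) only lets pieces starting beyond a known hit bail out early. Correctness comes from the reduction, not from the hint; MinIndexBody is an illustrative name.

#include <climits>
#include <vector>
#include "tbb/atomic.h"
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

struct MinIndexBody {
    const std::vector<int>& vec;
    int target;
    tbb::atomic<int>* hint; // advisory "item found" signal
    int best;               // thread-local result, merged in join()
    MinIndexBody(const std::vector<int>& v, int t, tbb::atomic<int>* h)
        : vec(v), target(t), hint(h), best(INT_MAX) {}
    MinIndexBody(MinIndexBody& o, tbb::split)
        : vec(o.vec), target(o.target), hint(o.hint), best(INT_MAX) {}
    void operator()(const tbb::blocked_range<int>& r) {
        if (*hint < r.begin()) return; // an earlier hit exists; skip this piece
        for (int i = r.begin(); i < r.end(); ++i)
            if (vec[i] == target) {
                if (i < best) best = i;
                *hint = i; // advisory only
                return;
            }
    }
    void join(MinIndexBody& rhs) { if (rhs.best < best) best = rhs.best; }
};

After tbb::parallel_reduce(tbb::blocked_range<int>(0, n), body) returns, body.best holds the smallest matching index, or INT_MAX if there was none.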

The CAS only occurs for a new best fit. Depending on the frequency of matches, the CAS may only occur once.

Using parallel REDUCE as Raf outlined would cause threads not making the first find (not necessarily the one with the earliest index) to bail out without regard to whether their search precedes the first found index. Parallel REDUCE could work with the following modification. Any thread making a find overwrites whatever is in the shared found variable, without regard to whether its found index precedes or follows the currently stored found index. Other threads, detecting the last-written found index (not necessarily the earliest index), would compare it against their search index. Should their search position follow this found index - bail out, else continue searching. The lowest index can be chosen in the reduction operator.

The benefit is eliminating the CAS.
The cost is that, should an earlier found index be overwritten by a later found index (prior to other threads seeing the earlier found index), the bailout is delayed. The delay in the bailout time could exceed the cost of the CAS. You also have the cost of performing the reduction.

An alternate method would be (for a small number of threads, <= 8) to have a table of found indexes, one per thread. A thread bails out on a find (poking its mailbox), or if any found index in any mailbox precedes its search point.

Note, as a thread observes a change in the found table at lower than its current search position, the overhead will be equivalent to the CAS. Therefore, elaborating the technique may be of little value. Without running a test it would be hard to prove or disprove this assumption.

I think this is a case where CAS is more effective than the alternatives.

Jim Dempsey

www.quickthreadprogramming.com

I think you slightly misinterpreted my proposal (there would be no bailout that might miss the ultimate result), but you've also convinced me to withdraw it, thanks.

Raf,

>>(there would be no bailout that might miss the ultimate result)

This would be true when you remove the inner loop bail-out test and perform the test after the completion of the run of the inner loops (at leap intervals) and requires synchronization of threads at leap intervals.

e.g. 4-threads, iteration space divided into 40 zones, 4 threads work on 1st 4 zones, test flag, 4 threads advance to next 4 zones, ...

Your technique required the threads not finding an index to advance through the remainder of their zone (since they would not know whether their zone preceded/followed the other threads' zones, excepting the separate mailbox/found-flag technique outlined earlier). This would cause an exit latency after a find. This latency can be reduced by increasing the number of zones (reducing the size of each zone); however, this also increases the number of synchronization points.

The method I proposed has no synchronization points, i.e. each thread can pick its zone from the group of 4 zones, then leap to its next zone, etc., with no synchronization between leaps. Should any thread in the team of 4 get preempted (to run some other application), the three other threads won't get held up at the group-of-4-zones boundary.

Don't count out your technique too soon. When the match test is very short, continuing through to the end of the zone might be faster than having a test for found after each match test. Combinations of running in bursts within your zone without the found test might work well too.

There is no one best way for all situations.

Keep up your good contributions.

Jim Dempsey

www.quickthreadprogramming.com

"This would be true when you remove the inner loop bail-out test and perform the test after the completion of the run of the inner loops (at leap intervals) and requires synchronization of threads at leap intervals."
Well, I wasn't thinking of a uniform work distribution with gates along the way, which would only work on a dedicated system: each thread would process pieces at its own pace, and start reducing as soon as it was refused a new one. Your idea to distribute information about where a solution has been found would probably already improve overall performance and decrease latency because some tasks could bail out earlier. Still, to win against CAS (with implicit reduction), pieces might need to be so small that their execution (tasks) and distribution (fetch-and-add) would carry too much overhead, plus the increased cost of the reduction phase, so it doesn't seem a likely winner anymore. Well, maybe I should unwithdraw it anyway, and then somebody could try to put some numbers on it, to confirm this or surprise us. :-)

Case 1: item searched for not in list - # CAS = 0

Case 2: 1st thread to find a match finds the lowest index - 1 thread performs a CAS, the others "suffer" a cache miss on reading the found variable and bail out. Note, this suffering is equivalent to the cache miss on reading a bool bailout flag.

Case 3: 1st thread to find a match finds the 2nd lowest index - 1 thread performs a CAS, the others "suffer" a cache miss on reading the found variable and all but one of these threads bail out; the remaining thread performs a CAS. 2 CAS

...

So there is a potential of having 0 to nThreads CAS operations.

Your urge to avoid CAS, IMHO, is unfounded.

Assume you replace the CAS with mailboxes, 1 per thread, where threads examine each other's mailboxes.

Your "training" tells you to place the mailboxes on separate cache lines. Whereas the most efficient method (in this case) would be to place them in the same cache line (or in as few cache lines as possible). Because any thread making a non-first-occurrence find will invalidate the other threads' cache line when writing to its mailbox, the other threads thus experience a cache miss. Should two finds occur ~concurrently, the still-searching threads experience 2 cache misses (with mailboxes in separate cache lines) or 1 cache miss (with mailboxes in the same cache line).

Whatever is used for signaling, whether CAS or write-through, will cause a cache eviction and a cache miss on the other threads. The CAS will delay the thread issuing the CAS, but with the benefit of reducing the latency of the other threads completing the task (either bailing out or determining to continue). The optimization should be focused on mt-task completion, not on individual runtimes (i.e. on the impact on the other team members, as opposed to the perspective of the single thread).

Tests do need to be made. Rather, tweaks need to be made to optimize the code.

Jim

www.quickthreadprogramming.com

"Your urge to avoid CAS, IMHO, is unfounded."
I feel I'd better play dead now to survive this. :-)

Quoting - mblizz

Is there a way to terminate an iteration when doing parallel_for or parallel_reduce? Like, for example, the "break" in a normal for loop?

Thanks in advance for reading this, and much more if there is feedback. God bless.

Although what I'm going to propose is kind of an odd solution, and isn't really as effective as canceling by hand, maybe blocked_range could have a cancel() method or something like that...

Honestly, though, I can't think of a good reason that isn't smelly for blocked_range to have a cancel method. Still, it would make the library a bit simpler and safer to use, as it removes from the user the burden of breaking loops in a thread-safe way.

Another option would be to provide a cancelable_for_loop that stops once the operator() returns false (using a bool operator()(range r) functor), as sketched below.

But either of these solutions would still need to wait until the currently longest-running range finishes processing. So they wouldn't be as good as a hand-break.
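
Such a cancelable_for_loop could be sketched as a thin wrapper over parallel_for with an atomic stop flag (the name and API follow the suggestion above and are hypothetical; the user's functor must have a const bool operator()):

#include "tbb/atomic.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Adapter: runs the user's bool-returning functor, raises the flag on false.
template<typename BoolBody>
struct CancelableAdapter {
    const BoolBody* body;
    tbb::atomic<bool>* stop;
    void operator()(const tbb::blocked_range<int>& r) const {
        if (*stop) return;              // a previous chunk asked to stop
        if (!(*body)(r)) *stop = true;  // signal the remaining chunks
    }
};

template<typename BoolBody>
void cancelable_for_loop(int begin, int end, const BoolBody& body) {
    tbb::atomic<bool> stop;
    stop = false;
    CancelableAdapter<BoolBody> adapter = { &body, &stop };
    tbb::parallel_for(tbb::blocked_range<int>(begin, end), adapter);
}

Note that the functor is still only consulted at chunk granularity, which is exactly the limitation described above.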

Correct, Robert.

Conceptually the parallel_for "needs" an optional terminator in addition to the standard iterator. The "hand-break" code is placed into the terminator. You could also have a new class of iterator that contains the standard iterator plus the terminator, and member functions that coordinate when to call the terminator test.

Either you hide the hand-break inside the classes or you expose the hand-break. Go the hide-in-class route when you need reusability (and after you standardize just what you want to do).

Jim

www.quickthreadprogramming.com

Raf,

Below is a sample of break-out code, followed by runtimes.

Note, the search is on an array of longs, so this is a memory-bandwidth-bound problem. Typically a bail-out technique will be used when the parallel_for step cost is relatively large (it is extremely small in this example).

Also, the sample is coded in QuickThread as opposed to TBB or OpenMP. It should be easy enough to adapt.

const long nCells = 1000000;
long	Array[nCells];

// parallel_for task to find value in array
//
void doFind(long iBegin, long iEnd, long A[], long find, volatile long* foundIndex)
{
  if(*foundIndex < iBegin) return;	// other thread found first occurrence
  for(long i=iBegin; i < iEnd; ++i)
  {
    if((i%100)==0)	// arbitrarily look at memory every 100 iterations
    {
      if(*foundIndex < i) return;	// other thread found first occurrence
    }
    if(A[i] == find)
    {
      for(;;)
      {
        long oldIndex = *foundIndex;
        if(oldIndex < i) return;	// other thread found first occurrence
        if(_InterlockedCompareExchange(foundIndex, i, oldIndex) == oldIndex)
          return;
      }
    }
  }
}

// parallel_distribute task to find value in array
void doFindDistribute(
  int threadNum, int nThreads, long iBegin, long iEnd,
  long A[], long find, volatile long* foundIndex)
{
  int	L2cacheLineSize = (int)CacheLevelLineSize(2);
  int	runCount = L2cacheLineSize / sizeof(A[0]);
  for(long iOuter = iBegin + (runCount * threadNum); iOuter < iEnd; iOuter += runCount * nThreads)
  {
    if(*foundIndex < iOuter) return;	// other thread found first occurrence
    long iEndInner = iOuter + runCount;
    if(iEndInner > iEnd) iEndInner = iEnd;
    for(long i = iOuter; i < iEndInner; ++i)
    {
      if(A[i] == find)
      {
        for(;;)
        {
          long oldIndex = *foundIndex;
          if(oldIndex < i) return;	// other thread found first occurrence
          if(_InterlockedCompareExchange(foundIndex, i, oldIndex) == oldIndex)
            return;
        }
      }
    }
  }
}

void test()
{
  // serially initialize data
  for(long i = 0; i < nCells; ++i)
    Array[i] = rand() * rand();

  for(long i=0; i < nCells; i+=100000)
  {
    __int64 BeginTick, EndTick, ElapseTicks;

    long find = Array[i];
    volatile long foundIndex;
    long Chunk = 50000;
    foundIndex = nCells; // initialize beyond end of array
    BeginTick = _rdtsc();
    for(long j = 0; j < nCells; ++j)
    {
      if(Array[j] == find)
      {
        foundIndex = j;
        break;
      }
    }
    EndTick = _rdtsc();
    ElapseTicks = EndTick - BeginTick;
    if(foundIndex < nCells)
    {
      std::cout << "Ser " << find << " at "
                << foundIndex << " Ticks "
                << ElapseTicks << std::endl;
    }
    else
    {
      std::cout << "Ser " << find
                << " not found Ticks "
                << ElapseTicks << std::endl;
    }

    foundIndex = nCells; // initialize beyond end of array
    BeginTick = _rdtsc();
    parallel_for_v1(Chunk, doFind, 0, nCells, Array, find, &foundIndex);
    EndTick = _rdtsc();
    ElapseTicks = EndTick - BeginTick;
    if(foundIndex < nCells)
    {
      std::cout << "Par " << find << " at " 
                << foundIndex << " Ticks "
                << ElapseTicks << std::endl;
    }
    else
    {
      std::cout << "Par " << find
                << " not found Ticks " 
                << ElapseTicks << std::endl;
    }

    foundIndex = nCells; // initialize beyond end of array
    BeginTick = _rdtsc();
    parallel_distribute_v1(AllThreads$, doFindDistribute, 0, nCells, Array, find, &foundIndex);
    EndTick = _rdtsc();
    ElapseTicks = EndTick - BeginTick;
    if(foundIndex < nCells)
    {
      std::cout << "Dis " << find << " at " 
                << foundIndex << " Ticks "
                << ElapseTicks << std::endl;
    }
    else
    {
      std::cout << "Dis " << find << " not found Ticks " 
                << ElapseTicks << std::endl;
    }

  }
}


Ser    757147 at      0 Ticks   12501
Par    757147 at      0 Ticks  106767
Dis    757147 at      0 Ticks   96138
Ser 388526383 at 100000 Ticks  295803
Par 388526383 at 100000 Ticks  434907
Dis 388526383 at 100000 Ticks  324747
Ser 115913288 at 200000 Ticks  422262
Par 115913288 at 200000 Ticks  469719
Dis 115913288 at 200000 Ticks  223155
Ser 206495010 at 300000 Ticks  654147
Par 206495010 at 300000 Ticks  892719
Dis 206495010 at 300000 Ticks  400932
Ser  76467996 at 400000 Ticks  934578
Par  76467996 at 400000 Ticks 1156104
Dis  76467996 at 400000 Ticks  723879
Ser 230797512 at 500000 Ticks 1056978
Par 230797512 at 500000 Ticks 1310562
Dis 230797512 at 500000 Ticks  705096
Ser  71493905 at 600000 Ticks 1454517
Par  71493905 at 600000 Ticks 1575054
Dis  71493905 at 600000 Ticks  783054
Ser 692857960 at 700000 Ticks 1678023
Par 692857960 at 700000 Ticks 1674198
Dis 692857960 at 700000 Ticks 1170072
Ser  72972742 at 800000 Ticks 1838934
Par  72972742 at 800000 Ticks 1980621
Dis  72972742 at 800000 Ticks 1075095
Ser 194646209 at 900000 Ticks 1904148
Par 194646209 at 900000 Ticks 1983087
Dis 194646209 at 900000 Ticks 1168587

Jim

www.quickthreadprogramming.com
