Setting and scanning an array in one go

Hi, I'd like to parallelise the following simple serial code:

tmp = Idx;
for (i = 0; i < n; ++i) {
  y[i] = tmp;
  f[i] = func(i);    // func(i) is expensive, and touching f[i] may be expensive, too
  tmp = tmp x f[i];  // x is some associative operation
}

and I want to call func(i) exactly once for each i. I could use parallel_for() to set f[i] and afterwards parallel_scan() to set y[i]. But that touches most f[i] three times (once in parallel_for(), and again in the pre-scan and final scan of parallel_scan()), whereas algorithmically two touches suffice. So I wonder: is it possible to implement this code in parallel using just parallel_scan()?

I tried, and it's not working. I presume it's impossible, because some i are scanned twice (pre-scan and final scan), but not all. The implementation of the scans (body::operator()) would need to know whether a given final scan was preceded by a pre-scan or not. While parallel_scan() must have that information internally, it appears to be impossible to obtain it inside body::operator().

Is my analysis correct, i.e. is there no way to implement my simple code without touching most f[i] thrice (as opposed to twice)?

NOTE: using an array of flags, or pre-setting f[i] to a value guaranteed not to occur as a result of func(i), is not an acceptable solution (both require touching each element of an array).


I just figured that it may work after all. The required information seems available via body::reverse_join(). A final scan whose body was not previously reverse_join()ed (by another body) has not been preceded by a pre-scan. Is this correct?

I implemented it (using a boolean member 'joined' of body, which defaults to false and is set to true when the body gets reverse_join()ed), and it does work. If verified, perhaps this example could be added to the examples for parallel_scan() in the documentation (it would have saved me a lot of trouble).

jimdempseyatthecove:

Try something like this:

tmp = Idx;
atomic<int> i_y;
atomic<int> i_f;
i_y = 0; // initialize before parallel_invoke
i_f = 0;
parallel_invoke(
  [&](){
    for(; i_y < n; ++i_y)
    {
      while(i_f < i_y)       // wait until tmp includes f[0..i_y-1]
        _mm_pause();
      y[i_y] = tmp;
    }},
  [&](){
    for(; i_f < n; ++i_f)
    {
      f[i_f] = func(i_f);    // func(i_f) is expensive and touching f[i_f] may be expensive, too
      while(i_y <= i_f)      // wait until y[i_f] has been written
        _mm_pause();
      tmp = tmp  x  f[i_f];  // x is some associative operation
    }});

 Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove:

Alternate method:

tmp = Idx;
atomic<int> i_y;
atomic<int> i_f;
i_y = 0; // initialize before parallel_invoke
i_f = 0;
parallel_invoke(
  [&](){
    y[i_y++] = tmp; // y[0]
    for(; i_y < n; ++i_y)
    {
      while(i_f < i_y)          // wait until f[i_y-1] has been computed
        _mm_pause();
      tmp = tmp  x  f[i_y-1];   // x is some associative operation
      y[i_y] = tmp;
    }},
  [&](){
    for(; i_f < n; ++i_f)
    {
      f[i_f] = func(i_f);  // func(i_f) is expensive and touching f[i_f] may be expensive, too
    }});

 Jim Dempsey

Raf Schietekat:

Quote:

Walter D. wrote:

I just figured that it may work after all. The required information seems available via body::reverse_join(). A final scan whose body was not previously reverse_join()ed (by another body) has not been preceded by a pre-scan. Is this correct?

I implemented it (using a boolean member 'joined' of body, which defaults to false and is set to true when the body gets reverse_join()ed), and it does work. If verified, perhaps this example could be added to the examples for parallel_scan() in the documentation (it would have saved me a lot of trouble).

The scan operator has a parameter that is either pre_scan_tag or final_scan_tag, so there's no need to guess. If func is expensive, you should consider caching its result between the pre-scan and the final scan.

jimdempseyatthecove:

On the two suggestions I made: they both require the availability of two threads. To be safe you should add the equivalent of a spin-wait, such that the active thread can fall back to single-threaded mode.

Jim Dempsey


Jim: I want to avoid an atomic index and run efficiently on more than 2 threads. tbb::parallel_scan() achieves a scaling efficiency of 50% for large n_proc, i.e. it takes 2/n_proc as long as the serial code.

Raf: you completely missed the whole point. Perhaps it would be useful if you read posts more carefully.

Of course I'm caching f[i]=func(i), as in the serial code. But this is more complicated than you thought: it cannot simply be done in the pre-scan. As explained in my original post, the reason is that for some i no pre-scan is made; these are the ones at the very beginning and very end of the total range. For those, the call f[i]=func(i) must be done in the final (and only) scan. So, obviously, the pre-scan vs. final-scan information is not sufficient. What is required in addition is to know whether a final scan was preceded by a pre-scan or not. I found a way to obtain that information. See also the attached code.

Attachments:

Attachment / Size
Download test.cpp (1.08 KB)
Raf Schietekat:

I absolutely did miss the point. To avoid offending again, I promise never to answer your questions anymore.

jimdempseyatthecove:

Actually, you do not need to make these variables atomic; just declare them volatile.
atomic was used because people on this forum are averse to using volatile.

Your original post had

f[i] = func(i);  // func(i) is expensive and touching f[i] may be expensive, too

The (2) atomic++ operations should be irrelevant by comparison. The volatile might take a couple of clock ticks extra per iteration.

Jim Dempsey

Raf Schietekat:

After a private message, I've had another look. Both workarounds I would think of have been addressed. The one with the flags could be enhanced for less overhead by grouping a block of elements under one flag, and enforcing alignment by iterating over blocks instead of over individual elements; flags can be initialised in bulk using memset() (even atomics, if needed, preferably only before initiating concurrent access).

I'm not confident that the proposed rule would be generally applicable: what if a Body does a reverse join, then a final scan of a range that had a pre-scan, and then proceeds off to the right? One would have to prove that this is fundamentally impossible, or ask Intel never to change the current algorithm if it happens to behave like that, because I don't remember seeing any guarantee about it; and if it were that reliable, why wouldn't there be an only_scan_tag already? This may have been discussed before, perhaps in a thread with the participation of TBB's original architect, but I don't remember all the details, and I can only hope that I might have remembered if a solid solution had been suggested. Still, you might want to try searching for that thread, which at least resulted in the addition of some nice graphics for better understanding of the algorithm (both its mechanism and its range of applicability).

I have to disagree about using volatile where an atomic should be used: there's no gain in performance (if you don't need an atomic ++ you can split the operation into a load and a store yourself), and volatile relies on implementation details that are not technically guaranteed (even if the increment does not have to be atomic, at least the load and store operations should be). But I tend to be pedantic about such matters.

Raf Schietekat:

For clarity, the flags should be kept separate, probably as an array of bytes, one for every 100 or 1000 elements or so.

About volatile not being more efficient than atomic: that could also be read as atomics not being more expensive than volatile (with the right precautions to avoid unnecessary read-modify-write operations). I think there should be an additional semantics, "notrans", to avoid the cost of read-modify-write atomicity where it is not needed, without having to spell it all out.

(Correction) Well, that would be orthogonal to memory semantics, you might still want to release.

I'm not confident that the proposed rule would be generally applicable: what if a Body does a reverse join, a final scan of a range that had a prescan, and then proceeds off to the right?

No problem, since the body has not been reverse_join()ed by another.

why wouldn't there be an only_scan_tag already?

Because the need for such a tag was not recognised. For a typical scan operation, there is no need to distinguish final scans preceded or not by a pre-scan. In fact, it is not possible, with the current implementation of the algorithm (using compile-time information passing), to provide the only_scan information in such a way that code which doesn't use that information would continue to work.


Raf Schietekat:

I don't know the facts well enough to do more than argue about plausibility. If you want to use this, you'll have to prove it, and know whether it depends on essential properties or just on current implementation. If you can communicate that proof, others may be able to benefit, and perhaps it can be incorporated in the documentation. I'm not going to compete to be the first to provide such proof, but maybe if you get stuck I might be tempted to have a go at it myself.

jimdempseyatthecove:

Walter,

Have you put together a working benchmark with a dummy type (T) and functor (Func) representative of your array sizes and computational loads? The code should have the serial version and the parallel version using the same input data but generating separate output data, which can then be checked in an untimed verification pass. The type you supply should be a mock-up of, but of similar complexity and size to, the objects you intend to use. This can be dummy code, but functional enough to support the operators used in your struct body (=, +=, [], ...).

If so would you mind uploading the program?

Raf Schietekat:

While looking at the pretty picture for how parallel_scan operates, I was wondering whether it would also be possible to provide better affinity between the two phases, because clearly they are mostly if not always executed by separate Body instances, and presumably a Body instance now stays with the same thread during its lifetime. The parallel-scan implementation reminds me of how complexity can be added to separately predict carry information out of a group of binary digits during an addition, which improves latency for arriving at the result. In this case the circuitry is predetermined and an algorithm has to be devised to make the best use of what already exists. Maybe affinity is just mutually exclusive with that goal, maybe it isn't, but this should be explored.

It would also be interesting to have a survey of all the possible use cases. Here we have one where a guarantee of either only_scan_tag or pre_scan_tag+final_scan_tag is rather crucial, and even one of just always pre_scan_tag+final_scan_tag would be preferable to final_scan_tag or pre_scan_tag+final_scan_tag (does parallel_scan at least provide that guarantee, i.e., no element to be visited more often than during one possible pre-scan and one mandatory final scan?). Perhaps this information can be extracted out of the current implementation, as Walter surmised, or otherwise another trick can be used behind the scenes to implement a new operation with just that guarantee, so that there is no need to mix those concerns in an application.

Any ideas about this?

jimdempseyatthecove:

I have rewritten your struct body; I had to make some edits to get it to work. The test program shows a discrepancy in the results.

Walter, Raf, can you please check the code for errors.

Changes:

T is provided as type double.
Arrays f and y are now fTBB and yTBB, and fSerial and ySerial.
The argument to the supplied function changed from int to double*.
The arguments of the two functor calls changed from int to &f[i].
The "expensive and touching f[i]" function is a delay loop returning a simple modification of f[i]
(you can alter the contents for delay time or computational overhead).

The test program allocates the arrays and initializes the TBB and serial arrays to the same values,
then performs the (assumed) same process.
Timing is measured, the output arrays are verified, and the output sum is shown.

Tests indicate errors along the lines of the concerns Raf expressed.

I believe the issue here is that the partitions TBB starts with all begin with sum=0.0, rather than the sum from the prior partition.
The output arrays generated for each TBB partition will start with this sum=0.0, as opposed to the resulting sum from the equivalent prior partition in the serial run. Note: although the TBB reduction operation accumulates each partition's sum, the sums are incorrect.

Jim Dempsey

Attachments:

jimdempseyatthecove:

While we are waiting for Walter (or Raf) to comment on or fix the failing TBB implementation of the derivative of Walter's program, I thought this would be an interesting problem to address with the QuickThread Parallel Manifold. The program is attached. You will be able to see the coding; it is a little less intrusive than tbb::pipeline and qt::parallel_pipeline (neither is used in the test program). You won't be able to run it without downloading the QuickThread code from my website.

Command window output:

N=5000000
Serial test
TBB test
QuickThread test
TBB Scale: 4.909084
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[i] != yTBB[i]
fSerial[i] != fTBB[i]
QuickThread Scale: 6.786223
resultSerial = 653402.120624, resultQuickThread = 653402.120624

The test was run on a Core i7 2600K (4 cores with HT).

Note, the TBB test failed to produce the same results as the serial code (waiting for a fix from Walter or Raf).
The runtime of the TBB test should be representative of what the fixed code may achieve using this program construct in TBB. A TBB pipeline should be written using a technique similar to my two-stage Parallel Manifold; the TBB pipeline may then produce correct output.

The QuickThread Parallel Manifold produces the same output as the serial run and is 6.786x faster than serial and ~38% faster than TBB.
This was one run with the same synthetic load function for all tests. The actual performance differences will vary with this function (the synthetic load function stalls for 10,000 __rdtsc() clock ticks).

Jim Dempsey

Attachments:

Raf Schietekat:

I would have preferred that the program not be so needlessly Windows-specific, because I had to make it portable before being able to try it out for myself (I didn't see anything wrong with the TBB side of it just from looking at the code; well, no problem that would make it fail to compute the running sum, anyway). I had no interest in timing, so I just took those statements out.

From the Reference Manual: "Operations such as floating-point addition that are somewhat associative can be used, with the understanding that the results may be rounded differently depending upon the association used by parallel_scan. The reassociation may differ between runs even on the same machine. However, if there are no worker threads available, execution associates identically to the serial form shown at the beginning of this section."

I didn't analyse how unstable the calculation really is, but "resultSerial = 430605.730309, resultTBB = 426995.560837" still seems to be in the same ballpark, so TBB appears to be working as it should. Still, those excessively demanding comparisons were failing even when I ran the program with one thread and got "resultSerial = 430605.730309, resultTBB = 430605.730309", but I didn't investigate why.

Okay, I transcribed Jim's hack into a C++ program, see attached. It works correctly immediately (apart from round-off in the 14th significant digit; Jim: you confused the arguments y and f to the body of the parallel_scan()).

When issuing a wait in the function call (2nd argument to main), the scale factor is close to the number of processors, i.e. near-ideal (using gcc 4.8.1 or clang 3.2) on an i7 (4 cores) Intel chip.

Raf, I thought a little about proving that my implementation works, but with no result yet (though I haven't found an error either).

Attachments:

Attachment / Size
Download setandscan.cpp (3.18 KB)

jimdempseyatthecove:

Raf,

I added a TBB pipeline.

*** I tried to construct it with two stages, but had issues trying to get an initialized void* coming into the input stage (not part of TBB's design). Maybe you will have better luck. Two stages should operate faster (as indicated by my two-stage Manifold runs).

N=5000000
Serial test
TBB test
TBB pipeline test
QuickThread test
TBB Scale: 4.991204
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[i] != yTBB[i]
fSerial[i] != fTBB[i]
TBB pipeline Scale: 2.923964
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624
QuickThread Scale: 6.653138
resultSerial = 653402.120624, resultQuickThread = 653402.120624

Jim Dempsey

Attachments:

Raf Schietekat:

How do you get a pipeline... that must be one seriously expensive map!

jimdempseyatthecove:

Walter,

I incorporated your changes and made one small change relating your func(int i) to my func(double).

Your code runs; however, it fails at ySerial[321500] != yTBB[321500] (N=5000000).

I used the timer code I had in my app, and my init code too. Maybe you can see if I goofed up.

I conditionally compiled out the QuickThread code.

Your runtime improved significantly:

N=5000000
Serial test
TBB test
TBB pipeline test
QuickThread test
TBB Scale: 4.969774
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[i] != yTBB[i]
fSerial[i] != fTBB[i]
TBB pipeline Scale: 2.895157
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624
QuickThread Scale: 6.313559
resultSerial = 653402.120624, resultQuickThread = 653402.120624

Jim Dempsey

 

Attachments:

I slightly modified my earlier file to use 64-bit integer arithmetic instead of double (and also added a comparison with the serial results). This avoids floating-point comparisons and round-off errors and hence enables a more rigorous test of correctness. I haven't found any error yet.

Jim, you keep using non-standard (MS-specific) code, a strange function call f=func(f), and floating-point arithmetic. Moreover, you seem not really interested in the question at hand, which was whether my implementation using tbb::parallel_scan() is correct. I will ignore any comments not aimed at answering that question. If you think there is some value in discussing various ways to solve this problem then, I presume, you're welcome to open your own thread for that. However, I do think that promoting your own commercial software here is inappropriate.

Attachments:

Attachment / Size
Download setandscan.cpp (4.07 KB)

jimdempseyatthecove:

Walter,

Your original description had:

tmp = Idx
for(i=0; i<n; ++i) {
  y[i] = tmp;
  f[i] = func(i);  // func(i) is expensive and touching f[i] may be expensive, too
  tmp = tmp  x  f[i];   // x is some associative operation
}

The above is pseudo code. Your proffered solution used a template to generate the code. For simplification, I chose to use a specific type (double), then incorporated that type into a non-template "func" function containing the loop. This makes the example simple and clear, as there is no potential for hidden operator functions that may have side effects.

The original func function is your sample load function. You can change the argument of the function back to (int i) if you wish; it really doesn't matter, but your comment leads to the assumption that the load function, receiving i (an index), also knows where array f[] is located. For the example load function in the sample code I provided, I did not give it access to f, principally because the same load function was to run all variations of the test, and it would not know the location of the differing f's (fSerial, fTBB, ...). Therefore, the load function func should be supplied either a) the value f[i], b) a reference to f[i], or c) a pointer &f[i]; or it should d) be a member function of some object in which the loop is enclosed (f is thus known), or e) be a member function whereby the pseudo-code statement becomes "f[i] = obj.func(i);"

T func(int)
became
double func(double)

This function is arbitrary in any event. If you want it to receive an int, you can code it that way.

Now then, the choice of type. I specifically chose type double, and chose to generate values with fractional parts that use the full extent of the floating-point precision. The purpose of this is to ensure that any sequencing error (difference) between the serial implementation and the parallel implementation would tend to be exposed. Isn't the goal to produce a parallel implementation of a serial algorithm that produces the same results?

RE: However, I do think that promoting your own commercial software here is inappropriate.

You are misinformed.

The QuickThread toolkit is available for free (subject to the GNU General Public License) in source form from www.quickthreadprogramming.com

I am in the process of re-working the site, so bear with me through any inconveniences you may have with downloads, content, and documentation. The prior layout had the QuickThread libraries in binary format. Now developers (who know what they are doing) can download the source code of the library.

FWIW, this library code represents over 5 man-years of my programming effort. Most of the effort was spent identifying, locating, and eliminating complicated thread-interaction issues (adverse behaviour and performance) that happen inside thread-scheduling code.

RE:  you keep on using non-standard (MS specific) code

Grow up and quit whining. Your complaint involves you adding:

#if defined(__linux)
... linux headers
#else
... Windows headers
#endif

I wrote the test program on a Windows platform and submitted it to this forum. It is unnecessary for me to configure this for your platform, and unwarranted for you to complain about it. You can add the 6 or so lines to the top of the program.

You have a problem with your code, as it gets different results from the serial code. You cannot brush this aside ("we won't care about floating-point round-off errors..."). I have spent several hours of my time trying to help you resolve your problems.

I found your simple problem rather interesting for parallelization. Basically you have two statements within a loop, where the first statement depends on the result of the second statement from the prior iteration of the loop (or on the initial value in the first iteration). This particular problem, as simple as it seems in serial form, is spectacularly difficult to implement properly in parallel (for someone who has never done this before). I've shown a) that your technique is flawed (subject to your correction of my code), b) a fix using TBB pipeline, c) a suggestion to investigate for improving the TBB method, and d) a completely different and unique way of solving the problem using the QuickThread Parallel Manifold.

The principal problems with your implementation (results and performance) (IMHO) (subject to code revision by you) are:

a) While the f[i] = func(i) values are computed properly, the y[i] sum carried from the prior iteration is not. The sum carry must propagate through the f[i] results in the same sequence (else you may generate round-off accumulation differences/errors).

b) As you pop out of your partitioning recursion levels, you appear (to me) to be splicing the sum carry only at the junctures of the partitions, whereas the proper sum carry would need to propagate through all the elements of the partition's y[i] sub-range, thus requiring each y[i] to undergo as many += operations as there are recursion levels (which also contributes to a result difference from the serial form). While your current implementation may produce a correct total sum (by chance/approximation), the y[i] will not be correct.

The TBB pipeline method looked promising; however, it requires 3 stages. Two stages would be better, but there is a design shortcoming in TBB pipeline in that the initial tokens to the input stage cannot be initialized/specified. It appears that some undocumented, arbitrary non-NULL void* is passed into the pipeline input stage at startup, and it is not clear whether the same non-NULL void* passed out of the last stage is passed back to the input stage. One work-around would be to eliminate the input stage and use an atomic<int> variable as "i". But you and Raf seem to be averse to doing so. Using this would eliminate a pipeline stage, free up a thread, and use fewer computational resources.

Jim Dempsey

 

jimdempseyatthecove:

P.S. The build of the QuickThread library (currently on the website) fails on Ubuntu Linux. I fixed the code here today, and will configure the sample program I posted on the forum for running on Linux (assuming the library still runs properly after the handful of fixes I had to make). The principal issues related to Ubuntu requiring an update to 12.04.2, and this disturbed my Linux development environment. Once I have it all working, I will send you the code, at least the library and headers for QuickThread, so you can compile and link your program on your computer. I can attach those to this forum with instructions. If (later) you want the full source of the library, you can download that for free (subject to the GPL license).

Jim Dempsey

jimdempseyatthecove:

I got the code running on Linux. Getting the app to compile on Linux was problematic with the Windows functions, __int64, and intrinsics. I used GCC; I may have had fewer problems with Intel C++ (I haven't installed it on Ubuntu yet).

I made 3 test runs to check runtimes and consistency of results. The parallel_scan failed at different points, most likely due to the issues discussed earlier. There is some variation in the runtimes; I'm not sure what that is about. I have the QuickThread code starting up after the TBB runs. It creates its own set of threads, so there may be an issue of TBB block time at the end of available tasks. Can't be sure.

------------------ run data --------------

Run 1
N=5000000
Serial test 51694802116 ticks
TBB test 9759266303 ticks
TBB pipeline test 17300539246 ticks
QuickThread test 8305274729 ticks
TBB Scale: 5.296997
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[468750] != yTBB[468750]
fSerial[4687500] != fTBB[4687500]
TBB pipeline Scale: 2.988046
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624
QuickThread Scale: 6.224334
resultSerial = 653402.120624, resultQuickThread = 653402.120624
Run 2
N=5000000
Serial test 51729713120 ticks
TBB test 9784810838 ticks
TBB pipeline test 17331391107 ticks
QuickThread test 9223363363 ticks
TBB Scale: 5.286736
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[937500] != yTBB[937500]
fSerial[4687500] != fTBB[4687500]
TBB pipeline Scale: 2.984741
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624
QuickThread Scale: 5.608552
resultSerial = 653402.120624, resultQuickThread = 653402.120624
Run 3
N=5000000
Serial test 51706789862 ticks
TBB test 9762459783 ticks
TBB pipeline test 17341465614 ticks
QuickThread test 8269594338 ticks
TBB Scale: 5.296492
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[937500] != yTBB[937500]
fSerial[4687500] != fTBB[4687500]
TBB pipeline Scale: 2.981685
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624
QuickThread Scale: 6.252639
resultSerial = 653402.120624, resultQuickThread = 653402.120624

---------- end run data --------
Interesting: when I place the QuickThread code in front of TBB
and leave the QuickThread thread pool active, TBB appears to run single-threaded??

If I scope the qt::qtInit init(-1,0); thread-pool initiator such that it disbands the QuickThread thread pool when it exits scope, then TBB again builds a complete thread pool (8 threads).

IOW, if building a mixed application (not necessarily the best thing to do), then establish the TBB thread pool first.

If anyone is interested I can post the code for your review. I will get updated archives onto the website soon; I can post here if you are anxious to run the test.

Jim Dempsey

Raf Schietekat:

Did you change the tests to allow for differently rounded results, though? Because I wouldn't consider that "failing". Look for "Reference Manual" above to find a quote about that.

jimdempseyatthecove:

Raf,

I spent last night (till 1:45 this morning) reworking my version of Walter's upload such that it closely exemplifies Walter's code, with a few exceptions. These exceptions are reasonable and align with Walter's original problem statement. And my code is implemented as templates, so Walter cannot gripe about that. Differences:

1) Walter's synthetic "load" function used std::this_thread::sleep_for(std::chrono::nanoseconds(W)); which may suspend the thread, thus taking load off the processor (and permitting oversubscription to show additional false scaling, though oversubscription was not used). My load function is a non-floating-point compute loop (using __rdtsc() and _mm_pause(), with those functions included with the main). This function could be changed to be floating-point intensive, or some mix of floating point, integer, and memory R/W, if so desired.

2) Walter's original problem statement said "func(i) is expensive and touching f[i] may be expensive". Therefore I changed the load function to take arguments i and f[], and then reference f[i] and incorporate it into the result.

3) Walter's original load function returned results that did not tax the precision (it generated very small differences per i). This would lead one to expect only minor round-off errors. Therefore I changed this computation too.

Load function:

    // Jim's function, generates a higher degree of error in TBBscan
    auto func = [=] (std::size_t i, double* f) -> double
    {
        // stall for W clock ticks
        __uint64 t0 = __rdtsc();
        while(__rdtsc() - t0 < W)
            _mm_pause();
        return sin(f[i]);
    };

The following is the output report of one test run. I changed the report format to be in line with Walter's coding style, but I expanded the error testing to print out the first (up to) 10 elements that differ between the serial and other runs for the f and y arrays. Results are:

N=5000000 W=10000
Serial               15174965 micro seconds result = 653402.1206241812
TBB scan             2263908 micro seconds result = 656961.0825077833
TBB pipeline         5104435 micro seconds result = 653402.1206241812
QuickThread Manifold 2453756 micro seconds result = 653402.1206241812
TBB scan Scale: 6.702995439744018
resultSerial =  653402.1206241812
resultTBBscan = 656961.0825077833
ySerial[468750] =   106009.4110755896
yTBBscan[468750] =  106009.4110755854
ySerial[468751] =   106009.8476355567
yTBBscan[468751] =  106009.8476355525
ySerial[468752] =   106010.2841963265
yTBBscan[468752] =  106010.2841963223
ySerial[468753] =   106010.720757899
yTBBscan[468753] =  106010.7207578947
ySerial[468754] =   106011.157320274
yTBBscan[468754] =  106011.1573202698
ySerial[468755] =   106011.5938834517
yTBBscan[468755] =  106011.5938834475
ySerial[468756] =   106012.030447432
yTBBscan[468756] =  106012.0304474278
ySerial[468757] =   106012.4670122149
yTBBscan[468757] =  106012.4670122107
ySerial[468758] =   106012.9035778005
yTBBscan[468758] =  106012.9035777963
ySerial[468759] =   106013.3401441887
yTBBscan[468759] =  106013.3401441845
fSerial[4921875] =   -0.8294587162310608
fTBBscan[4921875] =  -0.7375659637302644
fSerial[4921876] =   -0.8294586000725364
fTBBscan[4921876] =  -0.7375658852913012
fSerial[4921877] =   -0.8294584839134298
fTBBscan[4921877] =  -0.7375658068519348
fSerial[4921878] =   -0.8294583677537409
fTBBscan[4921878] =  -0.7375657284121655
fSerial[4921879] =   -0.82945825159347
fTBBscan[4921879] =  -0.737565649971993
fSerial[4921880] =   -0.8294581354326167
fTBBscan[4921880] =  -0.7375655715314174
fSerial[4921881] =   -0.8294580192711812
fTBBscan[4921881] =  -0.7375654930904387
fSerial[4921882] =   -0.8294579031091633
fTBBscan[4921882] =  -0.7375654146490568
fSerial[4921883] =   -0.8294577869465634
fTBBscan[4921883] =  -0.7375653362072718
fSerial[4921884] =   -0.8294576707833813
fTBBscan[4921884] =  -0.7375652577650839
TBB pipeline Scale: 2.972898
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624
QuickThread Scale: 6.184382
resultSerial = 653402.120624, resultQuickThread = 653402.120624

Using the original load function, the observed error in the result sum values differed only after the 12th or 13th digit, leading Walter to a false confidence in the TBB scan code.

The above report shows an error in the 3rd digit, ~10 orders of magnitude greater. This does not give me much confidence in the offered TBB scan implementation.

Jim Dempsey

Attachments:

Download: walter.cpp (22.15 KB)
www.quickthreadprogramming.com
Raf Schietekat

I'm still not sure which implementation is to be considered in error. I even have more confidence in the outcome from TBB, because it is accumulating smaller values during the prescan (less rounding error than when adding small values to a larger one), before adding the results at coarser intervals, so I think parallel_scan() would have less variation when varying types between float, double and long double. For the full effect, but only if one didn't care about the additional cost of doing one more addition per element, one might even mimic that during the final scan and delay adding the accumulated result from the current chunk's prefix until just before writing each scan value.

jimdempseyatthecove

Raf,

I forgot to attach the matching QuickThread archive, so here it is if you are interested in trying to run this on your Mac.

QuickThread_v3/*.*   .cpp and other files for maintaining the library
QuickThread_v3/include   include headers for use by applications and building the QuickThread library
QuickThread_v3/Release  x64 libQuickThred.a (and other .o files)

For building Walter.cpp

Add the QuickThread_v3/include path to your include path
Add the QuickThread_v3/Release path to your lib path
Add the libraries in the order: QuickThread, rt, dl, numa, tbb

C++ options Add: -std=c++0x

Note, you will need to have Boost installed too (I use one template function from its metaprogramming facilities), and you may need to install libnuma as well to resolve link issues (though NUMA functionality won't be used on your Mac)

Let me know if you can build Walter and if it runs on your Mac.

Jim Dempsey

www.quickthreadprogramming.com

I have only little time for this enterprise, but here are some observations:

My implementation of the set and scan is definitely not 100% correct, as uncovered by Jim's test. What happens is that for some i func() is called twice (the attached code explicitly tests for this). This only causes an error in the result for Jim's test, but not for my original test. The reason is that while a final scan done by a body that has been reverse_join()ed has definitely been preceded by a pre-scan, the reverse is not true: a final scan by a body that has not been reverse_join()ed may or may not have been preceded by a pre-scan. Typically, a pre-scan in this situation only occurs for the second-to-last range.

For my particular purpose, this is not critical, because the function call is not really that expensive and because it will give the same result with every call.

Now about Jim's code. AFAICS, it scans serially, but in parallel with the set operation. If the set operation is time-consuming (as specified in my original question), then this is still efficient for sufficiently few threads. However, for n_proc > cost_set / cost_scan the code will cease to scale well (Jim: you should be able to see this). So this solution is not suitable for my situation.

Jim, at some time I will have a look at QT. How easy is it to use (compared with TBB)? TBB has extensive documentation; does QT too? I'm really interested in efficient task-based parallelism. Do you know MassiveThreads? If so, any opinion?

Attachments:

Download: setandscan.cpp (5.13 KB)
jimdempseyatthecove

Yes Walter,

"for n_proc > cost_set / cost_scan the code will cease to scale well"

This is true, but your problem statement had a high cost_set relative to cost_scan (your terms), and a requirement that the scan take the value of a subsequent iteration using an associative accumulation function (into tmp) and insert it into the stream. This is by its nature sequential (analogous to a ripple carry in an adder; however, in this case your "adder" is equivalent to an analog adder as opposed to a binary adder, and this precludes the carry look-ahead-like operation that you are trying to do with your parallel scan)

Now if your actual requirements are quite different from this why didn't you say so from the beginning?
Different problems require different solutions.

QuickThread is easier to integrate into existing applications since it can call standard functions and standard member functions (in addition to lambda functions). You do not need a special operator() stuck into new classes. For the most part you can use your old code somewhat as-is (as-was). You will have to add a few things here and there, but not much; you can keep the edits simple or go wild. With TBB you (almost) must create a lambda function to call your old functions, then use that in your parallel_... With QuickThread you can make the call either way (with some limitations and many extensions).

void Foo(int i, int n, double A[], double B[], double C[])
{
... original function code
}
...
// in your code elsewhere
qt::parallel_for(Foo, 0, n, a, b, c);

Now here is where it gets interesting. Assume function Foo formats the data and writes it to a file. In TBB tasks that perform I/O can at times be problematic as they may be part of a nested team. QuickThread has two classes of threads: Compute, and I/O. To "fix" this, issue:

qt::parallel_task(qt::IO$, Foo, 0, n, a, b, c); // I/O class thread to perform task

Compute class threads are affinity-pinned to a logical processor; I/O class threads are not pinned and run with a boost in priority (in other words, the O/S is free to resume the thread on any idle logical processor)

QuickThread has a large degree of flexibility with regard to control of tasks. If you are on a large multi-socket NUMA system, you can do some really interesting things. For example, you can tile rows across sockets, and columns across HW threads within a socket (assume qt is namespaced):

parallel_for(OneEach_L3$,              // outer level: one thread per L3 cache (socket)
    [](int iBegin, int iEnd) {
        parallel_for(L3$,              // inner level: begun by that one thread per socket, teams with all other threads sharing its L3 cache
            [](int jBegin, int jEnd) {
                // Here each thread sees the objects in the scope of the outer parallel_for:
                // thread-specific jBegin, jEnd and socket (L3) specific iBegin, iEnd.
                ... do work here in a two-way tile arrangement
            },                         // end inner lambda
            0, nColumns);              // slice inner loop by columns (to each thread in the same L3)
    },                                 // end outer lambda
    0, nRows);                         // slice outer loop by rows (into each socket)

Or, if you have a floating-point loop that works best using one thread per core on a system with HT:

parallel_for(OneEach_L1$, ...  // one thread per core, all cores (no HT siblings)

TBB is a great product for many/most application requirements.
QuickThread has different capabilities.
The documents and ticklers are on my website.

Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove

Walter,

Could you point out your edits between your last two uploads? I did not see anything in particular.

Because of your problem restatement in your last posting, stating that you will permit rounding differences due to order of summation, I took the liberty of adding an additional QuickThread solution that does not have the requirement of preserving the summation order. This technique makes use of qt::parallel_distribute with an algorithm similar to yours: partition, produce partial sums, use the partial sums to produce the final sum. It assumes you will accept a small degree of difference between the serial and parallel implementations. The qt::parallel_distribute method addresses your "for n_proc > cost_set / cost_scan the code will cease to scale well" issue stated earlier.

template<typename Func, typename T>
T QuickThread_distribute_scan_and_set(std::size_t i, const std::size_t n, Func const&func,
        T*f, T*y)
{
    T    ret;
    std::vector<T> StitchTable(qt::qt_get_num_threads());
    // first pass, partition and compute y[i] for each partition
    qt::parallel_distribute(
            qt::AllThreads$,
            [&](intptr_t iThread, intptr_t nThreads)
            {
                qt::qtSlice<intptr_t> Slice( iThread, nThreads, i, n );
                StitchTable[iThread] = Serial_scan_and_set(Slice.iBegin+i, Slice.iEnd+i, func, f, y); // +i in event i != 0
            });
    // second pass, propagate "carry" (sum) from end of last partition to all elements of our partition
    qt::parallel_distribute(
            qt::AllThreads$,
            [&](intptr_t iThread, intptr_t nThreads)
            {
                T sum(0);
                if(iThread == 0)
                {
                    for(int i=0; i < nThreads; ++i)
                        sum += StitchTable[i];
                    ret = sum;
                }
                else
                {
                    for(int i=0; i < iThread; ++i)
                        sum += StitchTable[i];
                    qt::qtSlice<intptr_t>    Slice(iThread, nThreads, i, n);
                    for(int i=Slice.iBegin; i < Slice.iEnd; ++ i)
                        y[i] += sum;
                }
            });
    return ret;
}
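For readers without QuickThread, the same partition / partial-sum / stitch structure can be sketched with nothing but std::thread (a sketch under my own naming and slicing assumptions, not QuickThread's implementation; '+' stands in for the associative operation 'x', and the initial value Idx is left for the caller to fold into the first slice):

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Pass 1: each thread scans its own slice from a zero carry, calling func
// exactly once per element, and records its slice total. Pass 2: every
// thread but the first adds the combined totals of the slices to its left
// into its y values.
template<typename Func, typename T>
T two_pass_scan_and_set(std::size_t n, Func func, T* f, T* y, unsigned nThreads) {
    std::vector<T> totals(nThreads, T(0));
    auto slice = [=](unsigned t) {             // [lo, hi) of thread t's slice
        std::size_t per = n / nThreads;
        std::size_t lo = t * per;
        return std::make_pair(lo, (t + 1 == nThreads) ? n : lo + per);
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nThreads; ++t)    // pass 1: independent local scans
        pool.emplace_back([&, t] {
            T tmp(0);
            for (std::size_t i = slice(t).first; i < slice(t).second; ++i) {
                y[i] = tmp;                    // exclusive prefix within the slice
                f[i] = func(i);                // func called exactly once per i
                tmp = tmp + f[i];
            }
            totals[t] = tmp;                   // each thread writes its own slot
        });
    for (auto& th : pool) th.join();
    pool.clear();
    for (unsigned t = 1; t < nThreads; ++t)    // pass 2: propagate the carries
        pool.emplace_back([&, t] {
            T carry = std::accumulate(totals.begin(), totals.begin() + t, T(0));
            for (std::size_t i = slice(t).first; i < slice(t).second; ++i)
                y[i] += carry;
        });
    for (auto& th : pool) th.join();
    return std::accumulate(totals.begin(), totals.end(), T(0));
}
```

Like the distribute version above, this touches y twice but f only once, at the cost of rounding differences from the changed summation order.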

To aid in determining the degree of differences I changed the report to produce an error estimation.

Also, relating to "for n_proc > cost_set / cost_scan the code will cease to scale well"

The demo program can force the issue, even with the low n_proc I have (8), by reducing the time for cost_set.

I ran tests using loads of 100000, 10000, 1000, 100, 10, 0 clock ticks:

*** Caution: W=100000 (computational weight of func); long time to run, 147 s for Serial
N=5000000 W=100000
Serial               147127129 micro seconds result = 653402.1206241812
TBB scan             36794242 micro seconds result = 682202.2004301561
TBB pipeline         20816859 micro seconds result = 653402.1206241812
QuickThread Manifold 18602736 micro seconds result = 653402.1206241812
QuickThread Distribute 18476807 micro seconds result = 653402.1206240761
TBB scan Scale: 3.998645467407645 resultSerial =  653402.1206241812 resultTBBscan = 682202.2004301561
 f error =0.01119778852088225 y error =-0.001588202190593666
TBB pipeline Scale: 7.067691095952564 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0
QuickThread Manifold Scale: 7.908897325640702 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0
QuickThread Distribute Scale: 7.962800553147521 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14
------------------------------------------------------------
*** lighter computational weight for func, Serial runtime 15 seconds
N=5000000 W=10000
Serial               15187231 micro seconds result = 653402.1206241812
TBB scan             2402293 micro seconds result = 660682.0901275538
TBB pipeline         5120833 micro seconds result = 653402.1206241812
QuickThread Manifold 2115407 micro seconds result = 653402.1206241812
QuickThread Distribute 1919942 micro seconds result = 653402.1206240761
TBB scan Scale: 6.321972798488777 resultSerial =  653402.1206241812 resultTBBscan = 660682.0901275538
 f error =0.002827559893142322 y error =-8.245543248852384e-05
TBB pipeline Scale: 2.965773537235055 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0
QuickThread Manifold Scale: 7.179342320413991 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0
QuickThread Distribute Scale: 7.910255101456189 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14
--------------------------------------------------------------------
N=5000000 W=1000
Serial               1858912 micro seconds result = 653402.1206241812
TBB scan              318373 micro seconds result = 656961.0825078025
TBB pipeline         2856357 micro seconds result = 653402.1206241812
QuickThread Manifold 2724758 micro seconds result = 653402.1206241812
QuickThread Distribute  259899 micro seconds result = 653402.1206240761
TBB scan Scale: 5.838786580520333 resultSerial =  653402.1206241812 resultTBBscan = 656961.0825078025
 f error =0.00137192822210194 y error =-1.938172739993035e-05
TBB pipeline Scale: 0.6507982020454726 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0
QuickThread Manifold Scale: 0.6822301283269927 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0
QuickThread Distribute Scale: 7.152439986302372 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14
-------------------------------------------------------------
N=5000000 W=100
Serial                510895 micro seconds result = 653402.1206241812
TBB scan              117841 micro seconds result = 668309.6256249581
TBB pipeline         1836910 micro seconds result = 653402.1206241812
QuickThread Manifold 2310255 micro seconds result = 653402.1206241812
QuickThread Distribute   81975 micro seconds result = 653402.1206240761
TBB scan Scale: 4.335460493376669 resultSerial =  653402.1206241812 resultTBBscan = 668309.6256249581
 f error =0.005840287488286404 y error =-0.0003632632077216882
TBB pipeline Scale: 0.278127398729388 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0
QuickThread Manifold Scale: 0.2211422548593121 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0
QuickThread Distribute Scale: 6.232326928941751 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14
-------------------------------------
N=5000000 W=10
Serial                500001 micro seconds result = 653402.1206241812
TBB scan              109771 micro seconds result = 660682.0901275421
TBB pipeline         1850477 micro seconds result = 653402.1206241812
QuickThread Manifold 2604695 micro seconds result = 653402.1206241812
QuickThread Distribute   81658 micro seconds result = 653402.1206240761
TBB scan Scale: 4.554946206192892 resultSerial =  653402.1206241812 resultTBBscan = 660682.0901275421
 f error =0.002827559893142322 y error =-8.245543248006592e-05
TBB pipeline Scale: 0.2702011427323874 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0
QuickThread Manifold Scale: 0.1919614388632834 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0
QuickThread Distribute Scale: 6.12311102402704 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14
-------------------------------------------------
N=5000000 W=0
Serial                498863 micro seconds result = 653402.1206241812
TBB scan              113384 micro seconds result = 668309.6256249611
TBB pipeline         1841656 micro seconds result = 653402.1206241812
QuickThread Manifold 2666495 micro seconds result = 653402.1206241812
QuickThread Distribute   81478 micro seconds result = 653402.1206240761
TBB scan Scale: 4.399765398998095 resultSerial =  653402.1206241812 resultTBBscan = 668309.6256249611
 f error =0.005840287488286404 y error =-0.0003632632077174592
TBB pipeline Scale: 0.2708774059867858 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0
QuickThread Manifold Scale: 0.1870856686399187 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0
QuickThread Distribute Scale: 6.122671150494612 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14

The qt::parallel_distribute holds up well for scaling, and has an inconsequential degree of error (when you permit such errors).

Jim Dempsey

Attachments:

Download: walter2.cpp (18.7 KB)
www.quickthreadprogramming.com
jimdempseyatthecove

Oops, take the +i's off the Serial_scan_and_set call (and off the comment). The correct line is:

StitchTable[iThread] = Serial_scan_and_set(Slice.iBegin, Slice.iEnd, func, f, y);

Jim Dempsey

www.quickthreadprogramming.com
