https://software.intel.com/en-us/forums/topic/475501
<p>Hello,</p>
<p>Have a look at the following link;<br />it's about parallel partition:</p>
<p><a href="http://www.redgenes.com/Lecture-Sorting.pdf" rel="nofollow">http://www.redgenes.com/Lecture-Sorting.pdf</a></p>
<p>I have tried to simulate this parallel partition method,<br />but I don't think it will scale, because we have to do a merging step,<br />which is essentially an array-copy operation. This array-copy<br />operation is expensive compared to the integer compare<br />operation inside the partition function, and it is still<br />expensive compared to a string compare operation inside the<br />partition function. Since it does not scale, I have abandoned<br />the idea of implementing this parallel partition method in my parallel<br />quicksort.</p>
<p>I have also just read the following paper about Parallel Merging:</p>
<p><a href="http://www.economyinformatics.ase.ro/content/EN4/alecu.pdf" rel="nofollow">http://www.economyinformatics.ase.ro/content/EN4/alecu.pdf</a></p>
<p>I have implemented this algorithm just to see what its performance is,<br />and I have noticed that the serial algorithm is 8 times slower than the merge<br />function found in the serial mergesort algorithm. 8 times slower<br />is too slow.</p>
<p>So the only way to get a somewhat better parallel sorting algorithm<br />is to use the following algorithm:</p>
<p><a href="http://www.drdobbs.com/parallel/parallel-merge/229204454?queryText=parallel%2Bsort" rel="nofollow">http://www.drdobbs.com/parallel/parallel-merge/229204454?queryText=parallel%2Bsort</a></p>
<p>The idea is simple:</p>
<p>Let's assume we want to merge sorted arrays X and Y. Select the median<br />element X[m] of X. Elements in X[ .. m-1] are less than or equal to<br />X[m]. Using binary search, find the index k of the first element in Y greater<br />than X[m]. Thus Y[ .. k-1] are less than or equal to X[m] as well.<br />Elements in X[m+1 .. ] are greater than or equal to X[m], and Y[k .. ]<br />are greater. So merge(X, Y) can be defined as<br />concat(merge(X[ .. m-1], Y[ .. k-1]), X[m], merge(X[m+1 .. ], Y[k .. ]));<br />now we can recursively, in parallel, do merge(X[ .. m-1], Y[ .. k-1]) and<br />merge(X[m+1 .. ], Y[k .. ]) and then concat the results.</p>
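<p>A minimal C++ sketch of this recursive scheme (my own illustration, not the Dr. Dobb's implementation; the 4096 serial cutoff and the std::async scheduling are arbitrary choices):</p>

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Merge sorted ranges X[x0, x1) and Y[y0, y1) into dst starting at index d,
// by the median-split scheme described above: place X's median, use binary
// search to split Y at that value, then merge the two halves in parallel.
void parallel_merge(const std::vector<int>& X, int x0, int x1,
                    const std::vector<int>& Y, int y0, int y1,
                    std::vector<int>& dst, int d) {
    if (x1 - x0 < y1 - y0)                    // keep X as the larger range
        return parallel_merge(Y, y0, y1, X, x0, x1, dst, d);
    if ((x1 - x0) + (y1 - y0) < 4096) {       // small ranges: merge serially
        std::merge(X.begin() + x0, X.begin() + x1,
                   Y.begin() + y0, Y.begin() + y1, dst.begin() + d);
        return;
    }
    int m = (x0 + x1) / 2;                    // median index of X
    int k = (int)(std::upper_bound(Y.begin() + y0, Y.begin() + y1, X[m])
                  - Y.begin());               // first element of Y > X[m]
    int mid = d + (m - x0) + (k - y0);        // final position of X[m]
    dst[mid] = X[m];
    auto left = std::async(std::launch::async, [&] {
        parallel_merge(X, x0, m, Y, y0, k, dst, d);
    });
    parallel_merge(X, m + 1, x1, Y, k, y1, dst, mid + 1);
    left.get();                               // join the left half
}
```

Note how the position of X[m] in the output, d + (m - x0) + (k - y0), follows directly from the counts of smaller-or-equal elements on each side, so the two recursive merges write to disjoint parts of dst and need no locking.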
<p>But read this:</p>
<p>"Parallel hybrid merge algorithm was developed that outperformed<br />sequential simple merge as well as STL merge by 0.9-5.8 times overall<br />and by over 5 times for larger arrays"</p>
<p><a href="http://www.drdobbs.com/parallel/parallel-merge/229204454?pgno=3" rel="nofollow">http://www.drdobbs.com/parallel/parallel-merge/229204454?pgno=3</a></p>
<p>This article, as you may have noticed, forgot to mention that this method is<br />dependent on the distribution of the data.</p>
<p>Read for example this:</p>
<p>"Select X[m] median element in X. Elements in X[ .. m-1] are less than<br />or equal to X[m]. Using binary search find index k of the first element in<br />Y greater than X[m]."</p>
<p>So if the "median element" of X is not near or equal to the "median<br />element" of Y, this method can have bad parallel performance,<br />and it may not scale as you think. For example, if every element of Y<br />is smaller than X[m], the binary search returns k at the end of Y, so the<br />first recursive merge receives all of Y while the second receives none of<br />it, and the split is maximally unbalanced.</p>
<p>There is another method for parallel partition; here it is:</p>
<p><a href="http://www.cs.sunysb.edu/~rezaul/Spring-2012/CSE613/CSE613-lecture-9.pdf" rel="nofollow">http://www.cs.sunysb.edu/~rezaul/Spring-2012/CSE613/CSE613-lecture-9.pdf</a></p>
<p>but as you will notice, it's still too expensive, because you have to create<br />three arrays and copy into them:</p>
<p>3. array B[ 0: n − 1 ], lt[ 0: n − 1 ], gt[ 0: n − 1 ]</p>
<p>You can use SIMD instructions in the parallel-prefix-sum function.</p>
<p>8. lt[ 0: n − 1 ] ← Par-Prefix-Sum ( lt[ 0: n − 1 ], + )</p>
<p>But the algorithm is still expensive, I think, on a quad-core, eight-core, or<br />even 16-core machine; I think you need many more than 16 cores to be able<br />to benefit from this method.</p>
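<p>A serial C++ sketch of that prefix-sum partition may make the cost concrete (my own illustration of the lecture's scheme, not its code; the array names B, lt, gt follow the slides, and the loops marked as parallelizable are the ones the lecture runs in parallel):</p>

```cpp
#include <vector>

// Prefix-sum partition: every loop below is either element-wise independent
// or a prefix sum, which is what makes the scheme parallelizable -- but it
// allocates three auxiliary arrays of size n and copies the input, which is
// the overhead discussed above. Returns the final index of the pivot value.
int prefix_sum_partition(std::vector<int>& A, int pivot) {
    int n = (int)A.size();
    std::vector<int> B(A), lt(n), gt(n);      // step 3: the three arrays
    for (int i = 0; i < n; i++) {             // flag phase (a parallel for)
        lt[i] = B[i] < pivot ? 1 : 0;
        gt[i] = B[i] > pivot ? 1 : 0;
    }
    for (int i = 1; i < n; i++) {             // step 8: Par-Prefix-Sum(lt, +),
        lt[i] += lt[i - 1];                   // shown serially here
        gt[i] += gt[i - 1];
    }
    int less    = n ? lt[n - 1] : 0;          // count strictly below pivot
    int greater = n ? gt[n - 1] : 0;          // count strictly above pivot
    for (int i = 0; i < n; i++) {             // scatter phase (a parallel for)
        if (B[i] < pivot)      A[lt[i] - 1] = B[i];
        else if (B[i] > pivot) A[n - greater + gt[i] - 1] = B[i];
    }
    for (int i = less; i < n - greater; i++)  // middle slots hold the pivot
        A[i] = pivot;
    return less;
}
```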
<p>Bucket sort is not a sorting algorithm itself; rather, it is a<br />procedure for partitioning the data to be sorted using some given<br />sorting algorithm, a "meta-algorithm" so to speak.</p>
<p>Bucket sort will do better when the elements are uniformly distributed<br />over an interval [a, b], so that the buckets do not have significantly<br />different numbers of elements.</p>
<p>Bucket sort's sequential computational complexity using quicksort, with p buckets<br />of about n/p elements each, is p × (n/p) log(n/p) = n log(n/p).</p>
<p>Bucket sort's parallel computational complexity using quicksort, with one bucket<br />per core, is (n/p) log(n/p).</p>
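<p>As a concrete sketch of where those costs come from (my own minimal illustration, not the library discussed below; it distributes strings by first character into 26 letter buckets, an assumption that works best for lowercase ASCII keys):</p>

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Bucket sort for strings: distribute by first character into buckets,
// sort each bucket independently, then concatenate. The "sort each bucket"
// loop is the part a parallel bucket sort runs on separate cores, giving
// the (n/p) log(n/p) parallel cost noted above when buckets are evenly filled.
std::vector<std::string> bucket_sort(const std::vector<std::string>& a) {
    const int p = 26;                                  // one bucket per letter
    std::vector<std::vector<std::string>> buckets(p + 1);
    for (const auto& s : a) {                          // distribution pass, O(n)
        int b = s.empty() ? 0 : 1 + std::clamp((int)s[0] - 'a', 0, p - 1);
        buckets[b].push_back(s);
    }
    std::vector<std::string> out;
    out.reserve(a.size());
    for (auto& b : buckets) {                          // parallelizable loop:
        std::sort(b.begin(), b.end());                 // O((n/p) log(n/p)) each
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```

Because the buckets cover disjoint, ordered key ranges, no merge step is needed afterwards; concatenating the sorted buckets in order is enough, which is exactly why the partitioning parallelizes so cheaply compared to quicksort's partition.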
<p>Parallel bucket sort also gave me 3x scaling when sorting strings on a<br />quad-core; it scales better than my parallel quicksort, and it can be<br />much faster than my parallel quicksort.</p>
<p>I have thought about my parallel quicksort, and I have found<br />that parallel quicksort is fine, but its problem is that the<br />partition function is not parallelizable. So I have thought about this<br />and decided to write a parallel bucket sort, and this parallel<br />bucket sort can give you much better performance than parallel quicksort<br />when sorting 100000 strings or more; just test it yourself and see.<br />Parallel bucket sort can sort only strings at the moment, and it can use<br />quicksort or mergesort; mergesort is faster.</p>
<p>I have updated parallel bucketsort to version 1.02; I have<br />changed the interface a little bit. Now you have to pass<br />four parameters to the bucketsort method: the array,<br />a TSortCompare function, a TSort1 function, and a constant,</p>
<p>ctAscending or ctDescending, to sort in ascending or descending order.</p>
<p>Here is a small example in Object Pascal:</p>
<p>===<br />program test;</p>
<p>uses parallelbucketsort,sysutils,timer;</p>
<p>type</p>
<p>TStudent = Class<br />public<br />mystring:string;<br />end;</p>
<p>var tab:Ttabpointer;<br />myobj:TParallelSort;<br />student:TStudent;<br />i,J:integer;<br />a:integer;</p>
<p>function comp1(Item1:Pointer): string;<br />begin<br />result:=TStudent(Item1).mystring ;<br />end;</p>
<p>function comp(Item1, Item2: Pointer): integer;<br />begin<br />if TStudent(Item1).mystring > TStudent(Item2).mystring<br />then<br />begin<br />result:=1;<br />exit;<br />end;<br />if TStudent(Item1).mystring < TStudent(Item2).mystring<br />then<br />begin<br />result:=-1;<br />exit;<br />end;</p>
<p>if TStudent(Item1).mystring = TStudent(Item2).mystring<br />then<br />begin<br />result:=0;<br />exit;<br />end;<br />end;</p>
<p>begin</p>
<p>myobj:=TParallelSort.create(1,ctQuicksort); // set to the number of cores...</p>
<p>setlength(tab,1000000);</p>
<p>for i:=low(tab) to high(tab)<br />do<br />begin<br />student:=TStudent.create;<br />student.mystring:= inttostr(i);<br />tab[high(tab)-i]:= student;<br />end;</p>
<p>HPT.Timestart;</p>
<p>myobj.bucketsort(tab,comp,comp1,ctAscending); // use ctAscending or CtDescending.<br />//myobj.qsort(tab,comp);<br />writeln('Time in microseconds: ',hpt.TimePeriod);</p>
<p>writeln;<br />writeln('Please press a key to continu...');<br />readln;</p>
<p>for i := LOW(tab) to HIGH(tab)-1<br />do<br />begin<br />if TStudent(tab[i]).mystring > TStudent(tab[i+1]).mystring<br />then<br />begin<br />writeln('sort has failed...');<br />halt;<br />end;<br />end;</p>
<p>for i := LOW(tab) to HIGH(tab)<br />do<br />begin<br />writeln(TStudent(tab[i]).mystring,' ');<br />end;</p>
<p>for i := 0 to HIGH(tab) do TStudent(tab[i]).free;</p>
<p>setlength(tab,0);<br />myobj.free;</p>
<p>end.</p>
<p>===</p>
<p>You can download parallel bucketsort from:</p>
<p><a href="http://pages.videotron.com/aminer/" rel="nofollow">http://pages.videotron.com/aminer/</a></p>
<p>Amine Moulay Ramdane.</p>
Wed, 02 Oct 13 03:50:12 -0700, lara h., topic 475501
parallel for overhead in OpenMP
https://software.intel.com/en-us/forums/topic/392946
<p>I have written a function that incurs a tremendous amount of overhead in [OpenMP dispatcher] called by [OpenMP fork] called on behalf of a particular parallel region of mine, according to VTune. That fork accounts for roughly a third of all CPU time in my program. My code is as follows. My intention is to have two parfor loops running concurrently.</p>
<p>#pragma omp parallel<br />{<br /> #pragma omp for<br /> for( int y = 0; y < h; y++ )<br /> {<br /> // something fairly time consuming<br /> }<br /> #pragma omp for<br /> for( int x = 0; x < w; x++ )<br /> {<br /> // something else time consuming<br /> }<br />}</p>
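<p>For reference, here is a compilable sketch of the same two-loop structure with placeholder bodies (the arrays, `h`, `w`, and the work are stand-ins of mine, not the real code). One detail worth noting: a `nowait` clause on the first loop removes the implicit barrier between the two worksharing loops, so threads that finish the first loop early can start the second without waiting, which matters if the intent is for the two loops to overlap:</p>

```cpp
#include <vector>

// Two worksharing loops in one parallel region, as in the post above,
// with placeholder work. The nowait clause removes the implicit barrier
// after the first loop; this is only safe because no iteration of the
// second loop reads anything written by the first loop here.
std::vector<int> row, col;

void process(int h, int w) {
    row.assign(h, 0);
    col.assign(w, 0);
    #pragma omp parallel
    {
        #pragma omp for nowait          // no barrier after this loop
        for (int y = 0; y < h; y++)
            row[y] = y * y;             // placeholder "time consuming" work

        #pragma omp for
        for (int x = 0; x < w; x++)
            col[x] = x + 1;             // placeholder work
    }                                   // single barrier at region end
}
```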
<p>I already know that the function is worth parallelizing in this way because my whole program performs much more poorly if I comment out the pragmas or insert a num_threads(1) clause. Also, I have tried changing the code to use two consecutive parallel regions each containing one parfor. That version performs significantly worse than the version shown, as well.</p>
<p>The two time-consuming sections of code take approximately the same amounts of time, within about +/-20%. There are other threads executing at the same time. VTune says that there is no oversubscription and in fact CPU usage is well below the target of 12 for my 6-core hyperthreaded i7. Windows Task Manager reports CPU usage around 85%.</p>
<p>I would appreciate any suggestions about what this fork/dispatcher overhead is and how it can be reduced.</p>
Fri, 24 May 13 11:42:23 -0700, mjc, topic 392946