## algorithms

Hello,

Have a look at the following link; it's about parallel partition:

http://www.redgenes.com/Lecture-Sorting.pdf

I have tried to simulate this parallel partition method, but I don't
think it will scale, because we have to do a merging step, which is
essentially an array-copy operation. This array copy will be expensive
compared to the integer compare operation that you find inside the
partition function, and it will still be expensive compared to a string
compare operation that you find inside the partition function. So since
it's not scaling, I have abandoned the idea of implementing this
parallel partition method in my parallel quicksort.

I have also just read the following paper about Parallel Merging:

http://www.economyinformatics.ase.ro/content/EN4/alecu.pdf

And I have implemented this algorithm just to see what its performance
is, and I have noticed that its serial version is 8 times slower than
the merge function that you find in the serial mergesort algorithm.
8 times slower is too slow.

So the only way to get a somewhat better parallel sorting algorithm is
to use the following algorithm:

http://www.drdobbs.com/parallel/parallel-merge/229204454?queryText=parallel%2Bsort

The idea is simple:

Let's assume we want to merge sorted arrays X and Y. Select X[m], the
median element of X. Elements in X[..m-1] are less than or equal to
X[m]. Using binary search, find the index k of the first element in Y
greater than X[m]. Thus Y[..k-1] are less than or equal to X[m] as
well. Elements in X[m+1..] are greater than or equal to X[m], and
Y[k..] are greater. So merge(X, Y) can be defined as
concat(merge(X[..m-1], Y[..k-1]), X[m], merge(X[m+1..], Y[k..])).
Now we can recursively, in parallel, do merge(X[..m-1], Y[..k-1]) and
merge(X[m+1..], Y[k..]), and then concat the results.
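The recursive scheme can be sketched like this (a minimal sequential
Python sketch for illustration only, since my library is in Object
Pascal; the two recursive calls marked below are the ones a parallel
implementation would run as independent tasks):

```python
import bisect

def dc_merge(x, y):
    """Divide-and-conquer merge of two sorted lists x and y."""
    # Keep the larger array in x so the median split is taken on it.
    if len(x) < len(y):
        x, y = y, x
    if not x:
        return list(y)
    m = len(x) // 2                    # median position in x
    k = bisect.bisect_right(y, x[m])   # first element of y greater than x[m]
    # The two recursive calls are independent: a parallel version
    # would run them as two tasks and then concatenate the results.
    left = dc_merge(x[:m], y[:k])
    right = dc_merge(x[m + 1:], y[k:])
    return left + [x[m]] + right

print(dc_merge([1, 3, 5, 7], [2, 4, 6, 8]))  # -> [1, 2, 3, 4, 5, 6, 7, 8]
```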

But read this:

"Parallel hybrid merge algorithm was developed that outperformed

sequential simple merge as well as STL merge by 0.9-5.8 times overall

and by over 5 times for larger arrays"

http://www.drdobbs.com/parallel/parallel-merge/229204454?pgno=3

As you may have noticed, this paper forgot to mention that this method
is dependent on the distribution of the data.

Read for example this:

"Select X[m] median element in X. Elements in X[ .. m-1] are less than

or equal to X[m]. Using binary search find index k of the first element in

Y greater than X[m]."

So if the median element of X is not near or equal to the median
element of Y, this method can have bad parallel performance, and it may
not scale as you think.
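To make this concrete, here is a small illustrative Python sketch with
hypothetical data, showing how skewed inputs give an unbalanced split:

```python
import bisect

# All of x's elements are smaller than all of y's elements, so the
# binary search puts the whole of y on one side of the split: the two
# "parallel" subproblems are badly unbalanced.
x = [1, 2, 3, 4, 5, 6, 7, 8]   # all small keys
y = [100, 101, 102, 103]       # all large keys
m = len(x) // 2
k = bisect.bisect_right(y, x[m])   # first element of y greater than x[m]
print(k)  # -> 0: the left subproblem gets none of y
print((m, k), (len(x) - m - 1, len(y) - k))  # subproblem sizes (4, 0) vs (3, 4)
```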

There is another parallel method for parallel partition; here it is:

http://www.cs.sunysb.edu/~rezaul/Spring-2012/CSE613/CSE613-lecture-9.pdf

but as you will notice, it's still too expensive, because you have to
create 3 auxiliary arrays and copy into them:

3. array B[0: n − 1], lt[0: n − 1], gt[0: n − 1]

You can use SIMD instructions in the parallel-prefix-sum function:

8. lt[0: n − 1] ← Par-Prefix-Sum(lt[0: n − 1], +)

But I think the algorithm is still expensive on a quad-core or
eight-core or even a 16-core machine; you have to have many more than
16 cores to be able to benefit from this method.
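For reference, that prefix-sum partition scheme can be sketched
sequentially like this (an illustrative Python sketch, extended to
handle duplicates of the pivot; in the parallel version the flag
loops, the prefix sums, and the scatter all run in parallel, but the
auxiliary arrays below are exactly the copies that make it expensive):

```python
from itertools import accumulate

def prefix_sum_partition(a, pivot):
    """Partition a around pivot using flag arrays and prefix sums,
    in the style of the B / lt / gt arrays of the lecture notes."""
    n = len(a)
    lt = [1 if v < pivot else 0 for v in a]   # flags: less than pivot
    gt = [1 if v > pivot else 0 for v in a]   # flags: greater than pivot
    lt_ps = list(accumulate(lt))              # lt <- Par-Prefix-Sum(lt, +)
    gt_ps = list(accumulate(gt))              # gt <- Par-Prefix-Sum(gt, +)
    n_gt = gt_ps[-1] if n else 0              # number of elements > pivot
    b = [pivot] * n                           # auxiliary output array B
    for i, v in enumerate(a):                 # scatter: parallelizable loop
        if lt[i]:
            b[lt_ps[i] - 1] = v               # pack the < pivot elements left
        elif gt[i]:
            b[n - n_gt + gt_ps[i] - 1] = v    # pack the > pivot elements right
    return b

print(prefix_sum_partition([5, 2, 8, 5, 1, 9, 3], 5))  # -> [2, 1, 3, 5, 5, 8, 9]
```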

Bucket sort is not a sorting algorithm itself; rather, it is a
procedure for partitioning data to be sorted using some given sorting
algorithm, a "meta-algorithm" so to speak.

Bucket sort will do better when elements are uniformly distributed over
an interval [a, b], so that the buckets do not have significantly
different numbers of elements.

bucket sort sequential computational complexity using quicksort:
p × (n/p) log(n/p) = n log(n/p)

bucket sort parallel computational complexity using quicksort:
(n/p) log(n/p)
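As a quick sanity check on these formulas (with hypothetical numbers),
the implied speedup of the parallel version over the sequential one is
exactly p:

```python
import math

def seq_bucket_cost(n, p):
    # p buckets of n/p elements, sorted one after another with an
    # n log n comparison sort: p * (n/p) * log(n/p) = n log(n/p)
    return p * (n / p) * math.log2(n / p)

def par_bucket_cost(n, p):
    # the same p buckets sorted simultaneously, one per core
    return (n / p) * math.log2(n / p)

n, p = 1_000_000, 4
print(seq_bucket_cost(n, p) / par_bucket_cost(n, p))  # -> 4.0, i.e. speedup = p
```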

Parallel BucketSort also gave me 3x scaling when sorting strings on a
quad-core machine; it scales better than my parallel quicksort, and it
can be much faster than my parallel quicksort.

I have thought about my parallel quicksort, and I have found that
parallel quicksort is fine, but its problem is that the partition
function is not parallelizable. So I have thought about this and
decided to write a parallel BucketSort, and this parallel bucketsort
can give you much better performance than parallel quicksort when
sorting 100000 strings or more; just test it yourself and see. Parallel
bucketsort can sort only strings at the moment, and it can use
quicksort or mergesort; mergesort is faster.
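The overall scheme can be sketched like this (an illustrative Python
sketch, not the library's actual Object Pascal code; it buckets strings
on their first character, assumed here to be a lowercase letter, and
the per-bucket sorts are the step that runs one task per core in the
parallel version):

```python
def bucket_sort_strings(strings, num_buckets=4):
    """Bucket sort for strings: distribute on the first character,
    then sort each bucket independently and concatenate."""
    span = 26  # assumption: first characters are lowercase 'a'..'z'
    buckets = [[] for _ in range(num_buckets)]
    for s in strings:
        key = ord(s[0]) - ord('a') if s else 0
        idx = min(max(key, 0) * num_buckets // span, num_buckets - 1)
        buckets[idx].append(s)
    out = []
    for b in buckets:   # in the parallel version: one sort task per bucket
        b.sort()        # stand-in for the quicksort/mergesort on a bucket
        out.extend(b)
    return out

words = ["pear", "apple", "fig", "zoo", "banana"]
print(bucket_sort_strings(words))  # -> ['apple', 'banana', 'fig', 'pear', 'zoo']
```

Note that when the first characters are not uniformly distributed, most
strings land in one bucket and the parallel speedup disappears, which
is exactly the caveat about uniform distribution stated above.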

I have updated parallel bucketsort to version 1.02; I have changed the
interface a little bit. Now you have to pass four parameters to the
bucketsort method: the array, a TSortCompare function, a TSort1
function, and a constant, ctAscending or ctDescending, to sort in
ascending or descending order.

Here is a small example in Object Pascal:

==

program test;

uses parallelbucketsort, sysutils, timer;

type
  TStudent = class
  public
    mystring: string;
  end;

var
  tab: TTabPointer;
  myobj: TParallelSort;
  student: TStudent;
  i: integer;

function comp1(Item1: Pointer): string;
begin
  result := TStudent(Item1).mystring;
end;

function comp(Item1, Item2: Pointer): integer;
begin
  if TStudent(Item1).mystring > TStudent(Item2).mystring then
    result := 1
  else if TStudent(Item1).mystring < TStudent(Item2).mystring then
    result := -1
  else
    result := 0;
end;

begin
  myobj := TParallelSort.create(1, ctQuicksort); // set to the number of cores...
  setlength(tab, 1000000);
  for i := low(tab) to high(tab) do
  begin
    student := TStudent.create;
    student.mystring := inttostr(i);
    tab[high(tab) - i] := student;
  end;
  HPT.Timestart;
  myobj.bucketsort(tab, comp, comp1, ctAscending); // use ctAscending or ctDescending.
  // myobj.qsort(tab, comp);
  writeln('Time in microseconds: ', hpt.TimePeriod);
  writeln;
  writeln('Please press a key to continue...');
  readln;
  // verify the sort: each element must not exceed its successor
  for i := low(tab) to high(tab) - 1 do
  begin
    if TStudent(tab[i]).mystring > TStudent(tab[i + 1]).mystring then
    begin
      writeln('sort has failed...');
      halt;
    end;
  end;
  for i := low(tab) to high(tab) do
    writeln(TStudent(tab[i]).mystring, ' ');
  for i := 0 to high(tab) do
    TStudent(tab[i]).free;
  setlength(tab, 0);
  myobj.free;
end.

===

You can download parallel bucketsort from:

http://pages.videotron.com/aminer/

Amine Moulay Ramdane.