Radix sort with both keys and values for double/f64 data?

I wrote a program for numerical analysis in which one of the performance-critical parts is sorting large vectors or matrices.

Once I was done with the coding, I tested the performance, and it was a bit slower than I expected. After timing my code, I found that the problem lies in the sorting function. For large arrays, the only sorting method with acceptable performance is radix sort, but unfortunately the current implementation of radix sort in IPP for double/f64 vectors cannot return the indices of the sorted elements, which is problematic.

Without radix sort, for large arrays the performance of IPP's multi-threaded sorting is vastly slower than calling the standard Matlab sort function, which is quite disappointing.

I just want to know whether a radix sort for double data that also returns the indices could be included in a future IPP release. I would expect it to be fairly straightforward to return both the indices and the values from a radix sort, but I could be wrong, since I don't know the method used in the IPP radix sort. Thanks.


>>... I test the performance, and it looks a bit slower than I expected...

Why do you think so? Did you compare the performance of the test case that uses radix sort against other sorting algorithms?

>>...after timing my code, I find the problem lies in the sorting function, for large array, performance-wise...

How big is the array?

>>...the only acceptable sorting method is radix-sort...

Again, why do you think so? Did you try the Heap or Merge sort algorithms?

Note: I'm not trying to convince you to walk away from radix sort. I'd just like to be sure that you made your decision based on real performance analysis of different sorting algorithms.

>>...unfortunately the current implementation of radix sort in IPP for double/f64 vectors cannot return the indices of the sorted elements,
>>which is problematic...

Could you post your test-case? I didn't understand '...cannot return the index of sorted vector...'.

Thanks in advance.

Well, it's pretty straightforward to guess which part is likely to be the performance bottleneck when one piece of code with O(NlogN) computational complexity sits among pieces of code that are O(N), O(logN) or O(1).

As for the timing: I first used Intel VTune, then pulled the sort functions out into a sample C program to test them.

Basically, what I found is that for the problem I am facing, Intel IPP's non-radix sort is nowhere near the performance of Matlab's sort. You can do your own testing to verify it.

For instance, time the following call:

ippsSortIndexAscend_64f_I(x, ind, N);

// using whatever timing function you prefer

Then compile it with the Intel compiler (mine is 13.0, with O3 optimization and the IPP threaded static library) and run it,

and compare it to Matlab's sort by typing this in Matlab:

tic;[x,ind]=sort(x);toc

where x is a large N-element 1-D double array and ind is the index array for x, an N-element int32 array.
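For readers who want to reproduce a keys-plus-indices sort without IPP at all, here is a minimal portable baseline in C. It is a hypothetical sketch (not the IPP call or the Matlab implementation benchmarked here): it pairs each value with its original position, sorts the pairs with the standard library's qsort, and unpacks both outputs.

```c
#include <stdlib.h>

typedef struct { double key; int idx; } pair_t;

static int cmp_pair(const void *a, const void *b) {
    double ka = ((const pair_t *)a)->key;
    double kb = ((const pair_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort x ascending and record each element's original position in ind.
   Portable O(N log N) baseline using qsort (note: qsort is not stable). */
void sort_with_index(double *x, int *ind, int n) {
    pair_t *p = malloc((size_t)n * sizeof *p);
    for (int i = 0; i < n; ++i) { p[i].key = x[i]; p[i].idx = i; }
    qsort(p, n, sizeof *p, cmp_pair);
    for (int i = 0; i < n; ++i) { x[i] = p[i].key; ind[i] = p[i].idx; }
    free(p);
}
```

This mirrors what `[x,ind] = sort(x)` returns in Matlab, and gives a comparator-based reference point against which a radix sort can be timed.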

For N = 1 million:
Matlab sort: 0.056 sec
IPP sort: 0.136 sec

For N = 8 million:
Matlab sort: 0.4 sec
IPP sort: 1.16 sec

For N = 64 million:
Matlab sort: 3.7 sec
IPP sort: 10.31 sec

Matlab version is 8.0, 64-bit;
OS is Windows 7, 64-bit;
system configuration is a Core i7-2670QM + 8 GB RAM;
compiler is Intel ICC 13.0;
build settings are: 64-bit, O3, AVX, IPP threaded static lib.

Thanks.

>>... Intel IPP's non-radix sort is nowhere near the performance of matlab's sort...

I saw your performance numbers, and the difference is ~2.6x. But what can we do here? Also, when you talk about the '...standard Matlab sort function...', do you mean one based on the Radix Sort algorithm?

I could test a couple of IPP sorting functions from the DSP domain, compare them with a set of sorting algorithms implemented in our project, and provide some feedback.

Best regards,
Sergey

PS1: Just for reference: some time ago we had a very good discussion on the Cilk Plus forum regarding the performance of the Merge Sort algorithm and a couple of others. Please take a look if interested:

Web-link: http://software.intel.com/en-us/forums/topic/265736

PS2: Merge Sort is my favorite algorithm for sorting very large data sets (larger than 64 MB), and it significantly outperforms the Heap and Quick sort algorithms. All my results are based on many real tests (!), and the tests for all algorithms were deterministic (the data sets were identical, not generated by the 'rand' function). Sorry, I don't use the Radix Sort algorithm.
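For readers unfamiliar with the algorithm being discussed, a minimal top-down merge sort in C looks like the following. This is only a textbook sketch of the O(N log N) scheme, not Sergey's tuned implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Recursively sort a[lo..hi) using tmp as scratch, then merge the
   two sorted halves back into a. */
static void merge_rec(double *a, double *tmp, int lo, int hi) {
    if (hi - lo < 2) return;                  /* 0 or 1 element: done */
    int mid = lo + (hi - lo) / 2;
    merge_rec(a, tmp, lo, mid);
    merge_rec(a, tmp, mid, hi);
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof *a);
}

/* Top-down merge sort of a[0..n-1], ascending. */
void merge_sort(double *a, int n) {
    double *tmp = malloc((size_t)n * sizeof *tmp);
    merge_rec(a, tmp, 0, n);
    free(tmp);
}
```

The `<=` in the merge step makes the sort stable, which matters when indices are carried alongside the keys.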

In theory, radix sort has O(N) complexity, unlike the O(NlogN) or O(N^2) sorts (merge sort, quicksort, heapsort, bubble sort, insertion sort, etc.). So, given sufficient memory and a sufficiently large input array, radix sort should be vastly faster than merge sort's O(NlogN), and this is indeed the case in IPP: the IPP radix sort function is significantly faster than IPP's other sort functions when the array to be sorted is large.

Actually, I suspect Matlab's sort function employs some radix-sort method for large input arrays; that's why it is so much faster.

Unfortunately, the current IPP radix sort cannot sort keys together with values for double data; that's why I ask whether there is a plan to include such a feature in a future IPP release.

>>...given sufficient memory capacity and sufficiently large input array, radix sort should be vastly faster than merge sort...

You need to prove this with absolutely deterministic tests in which the same data set is used for both algorithms. The expression '...should be vastly faster...' cannot be taken for granted.

I'm talking about practical applications of sorting algorithms, and that is what you have done: compared the performance of MATLAB and IPP functions.

Also, very often on forums, developers who like to talk about the complexities of different algorithms have never tried those algorithms in real tests. In most cases their exposure could be described as '...a couple of minutes on Wiki...' or a couple of pages read in some book. Just look at what different people say, for example, about the Strassen algorithm for matrix multiplication (sorry for the non-sorting example).

Please try the Bubble, Insertion, Selection, Heap, Merge, Quick, Pigeonhole, Radix, Flash, etc. algorithms with data sets of different sizes and you will be very surprised by the performance results.

Best regards,
Sergey

Dear Sergey,

With all respect, what I said was "in theory", and if you are familiar with computational complexity theory then, yes, it is well known and has been proved that radix sort, given sufficient memory, is very likely to be faster than merge sort or any O(NlogN) sort method for large arrays, for the simple reason that it has only O(N) (or, more strictly, O(kN)) complexity.

I was not aware that this forum only accepts peer-reviewed research articles, so I have not provided "deterministic tests" to prove my point. I prefer to trust existing research on these topics instead of reinventing every single wheel, and yes, that research is consistent with my own practical experience on this subject.

As for why I suspect Matlab's sort employs some O(N)-style sorting method for large arrays: again, I am not 100% sure, but Intel's engineers should not be so far behind Matlab's developers on a simple task such as sorting (as opposed to highly theoretical coding tasks that require very deep theoretical knowledge). So I suspect the reason Intel's non-radix sort trails Matlab's sort so badly on large arrays is that the latter employs a method with lower computational complexity.

And yes, it's nothing beyond reasonable doubt, but I am not in a courtroom, nor am I writing a research article here.

As an academic reference, here is one:
http://dl.acm.org/citation.cfm?id=1074233

>>>You need to prove this in absolutely deterministic tests when the same data set is used for both algorithms. Expression '...should be vastly faster...' can not be taken for granted.>>>

I agree with Sergey. Theory cannot be applied to every system or to every real-world scenario where sorting algorithms are used. Even in an absolutely deterministic test case, or in a real-world application, the whole system's behaviour is not deterministic but stochastic in nature, so the exact complexity can have varying higher-order terms.
By "deterministic test case" I mean values that are not randomly generated.

>>...and has been proved that radix sort, given sufficient memory storage, is highly-likely its faster than merge sort, or any O(NlogN) sort
>>methods, for large array, for the simple fact it is of only O(N) (or more strictly speaking O(kN)) complexity...

You don't take implementation complexity into account. Also, theory never accounts for the extra processing that is always needed for loops, variable exchanges, assignments, recursive calls, memory allocation on the stack, etc. (should I continue?). I've spent hundreds of hours optimizing, testing, verifying and tuning many well-known sorting algorithms (and several of my own) before selecting the three fastest for different applications with different types of data sets. These are the Merge, Heap and Pigeonhole sort algorithms.

You should know that in image processing, for images with an integer value range of 0 to 32768, a Pigeonhole sort sorts an array in two passes in a single-threaded application.
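The two-pass idea reads roughly like the following C sketch (my own illustration of the technique, not Sergey's code): the first pass counts how often each value occurs, and the second pass rewrites the array from the counts.

```c
#include <stdlib.h>

/* Pigeonhole (counting) sort for integer values in [0, 32768],
   e.g. 15-bit image pixels: pass 1 counts occurrences, pass 2
   rewrites the array in ascending order. */
void pigeonhole_sort_u15(int *a, int n) {
    enum { RANGE = 32769 };                       /* values 0..32768 */
    size_t *count = calloc(RANGE, sizeof *count);
    for (int i = 0; i < n; ++i) count[a[i]]++;    /* pass 1: count */
    int k = 0;
    for (int v = 0; v < RANGE; ++v)               /* pass 2: rewrite */
        for (size_t c = count[v]; c; --c) a[k++] = v;
    free(count);
}
```

This works in O(N + range) time, which explains why it is so effective when the value range is small and fixed, as with image data.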

>>...I have not provided "deteminstic tests" to prove my point...

If you decide to provide them (test cases, etc.), I will find time to verify your results; that is not a problem.

>>>You don't take into account implementation complexity. Also, theory never takes into account that some extra processing is always needed for loops, variable exchanges, assignments, recursion calls, allocation memory on the stack, etc ( should I continue here? ). I've spent hundreds of hours on optimizing testing, verifying and tuning many well known sorting algorithms ( and several my own ) before selecting three fastest for different applications with different types of data sets. These are as follows: Merge, Heap and Pigeonhole sorting algorithms>>>

I was talking about the non-deterministic factors in my previous post. Also, theory never takes into account that the sorting algorithm under test, running inside some thread, can be preempted and scheduled to run later; even a performance-measurement instruction like RDTSC has its own latency, which can add some uncertainty to fine-grained time measurements.

@Sergey
Did you test multithreaded versions of your sorting algorithms?

Well, from my experience, Intel's radix sort is far faster than its non-radix counterparts at sorting LARGE arrays, and Intel's own research also confirms this; google the article "Radix Sort on CPUs, GPUs and Intel® MIC Architectures".

Shouldn't we stop there? After all, I have just reported an observation, based on my personal and practical experience, that Matlab's sort is faster than IPP's non-radix sort at sorting large arrays.

If you have a counterexample, good, show me (but please, no extreme cases); otherwise, with all respect, let me wait for an answer from the Intel IPP team about my question. Thanks.

>>...Did you test a multithreaded version of your sorting algorithms?

No. All of them are single-threaded because they are tuned for applications in embedded environments.

>>>No. All of them are single-threaded because they are tuned for applications in embedded environments.>>>

I ran into an issue with the bubble sort algorithm I posted a few weeks ago in my sine-function thread, but so far I have not been able to solve the problem.

>>>are tuned for applications in embedded environments.>>>

Do you mean ARM processors?

Hi Iliya,

>>...I ran into some issue with the bubble sort algorithm...

We're about to start mixing different subjects. So why don't you create a new thread in http://software.intel.com/en-us/forums/watercooler-catchall with the subject:

"Optimization techniques for sorting algorithms ( Practical applications / No theory, please )"

>>...Do you mean ARM processors?

Yes, and a couple of more:
...
// ARM _M_ARM
// SH _M_SH
...
// MOTOROLA _M_M68K
...
Best regards,
Sergey

>>>We're about to start mixing different subjects. So, why wouldn't you create a new thread in http://software.intel.com/en-us/forums/watercooler-catchall with a subject:

"Optimization techniques for sorting algorithms ( Practical applications / No theory, please )">>>

OK, I will give it a try. Two threads will be created, on the 'catchall' forum and on the Android forum, but I suppose no one will be able to help me.

>>>SH _M_SH>>>

Is this the Analog Devices SHARC processor?
