Performance conundrum

Performance conundrum

SixSort is a faster and higher quality in-place sorter than
Quicksort. A complexity analysis suggests that the SixSort/Quicksort
time use ratio is 0.58 on large arrays.

A test on a 16M array yields the ratio 0.59 (against a best in class
Quicksort). On an AMD box.

SixSort can be parallelized easily, like Quicksort. Parallel SixSort
against Quicksort with two threads yields an excellent time ratio of
0.33. On an AMD box.

Tests on multiple Intel boxes give disappointing results. Here what
was observed on MTL:

Comparison: Sixsort versus 3-layered Quicksort/ single thread
Intel's MTL machine:
array size (Sixsort time)/(Quicksort time)
1M 0.92
2M 0.91
4M 0.87
8M 0.86
16M 0.84

Comparison: Sixsort versus 3-layered Quicksort/ two threads
Machine: Intel's multi-core lab/ Linux
array size (Sixsort time)/(Quicksort time)
1M 0.74
2M 0.71
4M 0.68
8M 0.59
16M 0.57

I did many more tests on other Intel boxes and on MTL with threads
upto 10. Similarly disappointing results.

Clay Breshears has suggested to get into the guts of the (serial/
parallel) execution to pinpoint a/the bottleneck. That scares me
because I have only anecdotal knowledge/ opinions about cpu
architecture.

At this point my best conjecture is that there are persistent cache
failures on Intel boxes because:
-- SixSort has way more code than tiny Quicksort
-- SixSort nibbles at 4 locations in the input array in contrast with
Quicksort at 2.
How do we go from here?

publicaciones de 10 / 0 nuevos
Último envío
Para obtener más información sobre las optimizaciones del compilador, consulte el aviso sobre la optimización.

Please note that the MTL is not set up for any kind of comparison of architectures or algorithms.
We don't have the resources to configure the system(s) to support a specific BIOS configuration or toprovide guidance onwhere one algorithm is performing better than another.

The MTL is primarily a tool toshowcase the thread scaling of a particular workload or algorithm.

Mike Pearce's response puzzle me. I have been contrasting >relative<performance of two algorithms on one platform versus what is observedon MTL; with one thread and with two threads. In both cases withdisappointing results on MTL. No idea why BIOS configuration oralgorithm comparison is raised in the response. That has nothing todo with what I reported in my post.You emphasize that MTL is aimed at thread scaling. Here you go,everything on MTL:Comparison: Paralellized Quicksort against paralellized SixSortThe comparison is based on their time ratios against our sequentialQuicksortMachine: Intel's MTLArray size 16M# threads PQuicksort time ratio SixSort time ratio2 0.65 0.573 0.47 0.484 0.39 0.385 0.35 0.346 0.33 0.317 0.32 0.308 0.30 0.289 0.28 0.2710 0.27 0.27Remember that SixSort has a time ratio of 0.59 on AMD with a singlethread and 0.33 with two threads against sequential Quicksort.SixSort's superiority against Quicksort is virtually wiped out on MTLin multi threaded settings.How do we go from here?Mike Pearce's response puzzle me. I have been contrasting >relative
You emphasize that MTL is aimed at thread scaling. Here you go,everything on MTL:
Comparison: Paralellized Quicksort against paralellized SixSortThe comparison is based on their time ratios against our sequentialQuicksortMachine: Intel's MTLArray size 16M# threads PQuicksort time ratio SixSort time ratio2 0.65 0.573 0.47 0.484 0.39 0.385 0.35 0.346 0.33 0.317 0.32 0.308 0.30 0.289 0.28 0.2710 0.27 0.27
Remember that SixSort has a time ratio of 0.59 on AMD with a singlethread and 0.33 with two threads against sequential Quicksort.SixSort's superiority against Quicksort is virtually wiped out on MTLin multi threaded settings.
How do we go from here?

[A previous attempt to reply to Mike's reponse messed up/ sorry]

Mike Pearce's response puzzle me. I have been contrasting >relative<
performance of two algorithms on one platform versus what is observed
on MTL; with one thread and with two threads. In both cases with
disappointing results on MTL. No idea why BIOS configuration or
algorithm comparison is raised in the response. That has nothing to
do with what I reported in my post.

You emphasize that MTL is aimed at thread scaling. Here you go,
everything on MTL:

Comparison: Paralellized Quicksort against paralellized SixSort
The comparison is based on their time ratios against our sequential
Quicksort
Machine: Intel's MTL
Array size 16M
# threads PQuicksort time ratio SixSort time ratio
2 0.65 0.57
3 0.47 0.48
4 0.39 0.38
5 0.35 0.34
6 0.33 0.31
7 0.32 0.30
8 0.30 0.28
9 0.28 0.27
10 0.27 0.27

Remember that SixSort has a time ratio of 0.59 on AMD with a single
thread and 0.33 with two threads against sequential Quicksort.
SixSort's superiority against Quicksort is virtually wiped out on MTL
in multi threaded settings.

How do we go from here?

We fully understand you are contrasting the >relative< performance of 2 algorithms, now providing relative scaling performance based on the number of threads.

It now appears that the scaling thread performance of the SixSort time ratio does scale slightly better than thePQuickSort in the numbers you have just provided.

But again, you're asking why is the SixSort ratio better on AMD, and worst on Intel platforms. This we cannot answer - there are many ways the resultscan be different, e.g.. BIOS configs, Processor speed & Turbo Boost,compiler andoptions, cache configoptions, even down to the SixSort code, etc, etc. And was the workload run as a benchmark on the MTL, you may recall that the login node is not set up as a benchmarking platform, and willl automatically degrade performance.

Finally we could suggest thatyou avail yourself of the extensiveIntel performancetool set on the MTL, to determine why the SixSort algorithm is deficient on Intel platforms, as you report.

I can provide also more test data for five Intel platforms, three of themwithmulti threaded results. They are ALL disappointing (I could have used"terrible") against the theoretical analysis and what I found on the AMDmachine.I had hoped that you would accept the challenge to deal with what Ireport instead of putting the ball in my court to debug your hardwarearchitecture about which I have only anecdotal knowledge -- as Iadmitted already in my first posting.If you don't address the problem I have identified (with my conjecture):sorry, so be it.I have too much on my plate to solve Intel'sproblems ...

You still have quite a bit more work to do before you can completely pin the blame on Intel. For instance:
- You are using quicksort as a baseline. Intel's current-generation
memory interconnects have both significantly higher bandwidth and
throughput than AMDs (see my tech report for more details
http://www.cs.uchicago.edu/research/publications/techreports/TR-2011-02

). This means that you should expect your ratios to be better on
the AMD machine because changing to an algorithm with better locality
will be more of a win on the AMD machine than the Intel machine.
You can see this information by looking at the difference between the 1 processor output for quicksort/sixsort.
- Which compiler are you using and with which flags? For example, for bulk-data synchronous parallel programs, switching from gcc to icc and turning on the streaming-stores optimization can increase your bandwidth by up to 40% on AMD machines (less effect on Intel, in my experience).
- Your dataset size may be a little small. Make sure to try it out across larger sizes to see how the ratio changes. At only a few million integers, you can end up starving for more parallelism. Our experience with our language (which has only 1/4-ish the speed of C) is that we needed 10M integers, so I suspect that with an optimized C program you might want at least a factor of four more. Try plotting your speedup across ranges of processors; elbows in that graph commonly indicate either bus saturation, limitations due to parallel overheads, or just running up against Amdal's Law.
- Along those lines, what are the raw numbers? I've frequently found that raw,
single-processor performance on Intel machines is so much larger than on
similar-generation AMD machines that datasets that are large enough to see good speedups on an AMD bottom out on Intel due to parallel overhead beginning to dominate the available work.

I hope this helps! And don't be discouraged -- doing high-quality systems research requires far more measurement and analysis than it might appear by looking at final papers. Even seemingly simple statements such as X is faster than Y on machine M because of Z can require a stunning amount of time. Doing that across two machines will take even more time and effort.
- Lars

Mr Bergstrom,Phew ...> ... before you can completely pin the blame on Intel.I don't blame anything. I report unexpected, bad performance ratios ona bunch of Intel platforms - ranging from 2001 to 2011, including MTL.Since I am NOT a hardware expert it does not make sense to asks me todo hardware tracing; plus why should >>I<< do that?> ... changing to an algorithm with better locality ...SixSort does not have better locality than Quicksort. Thus yourargument fails - to the extend I grasp your statement.The balance of your reply I can't really follow -- it appears toconsist of throwing doubts around - as done in earlier responses ...> I hope this helps!Thanks, but no not really. > And don't be discouragedYou miss the point. I am VERY satisfied that two years of hard work on SixSort (which is acooperative of several sorting algorithms being faster and havinghigher quality than qsort in the C-library - in terms of not degradingin quadratic explosions) has been validated to MY satisfaction on theAMD machine and especially because it confirms my complexity analysis.The results of the equivalent tests on the Intel platforms areuniformly different and, I believe, poor. There can be good reasons why Intel is simply not interested in myfindings. I assume that there is a sizeable test bench of algorithmsused by Intel to assess its (cpu) architectures. SixSort may notsatisfy the characteristics of that test bench. If Intel isinterested to figure out what is going: fine with me ... I am NOTqualified and way too busy with other challenges. Feel free to takethe ball and run with it, as you see fit

Let me give you folks a little background.** All testing is done by comparing time ratios of competing algorithms.** Test runs are done repeatedly to ascertain that load fluctations donot bias the observed ratios - I am not stupid ....** Algorithms are coded, by me uniformely, in C (including Quicksortbecause qsort in the C-library has a wrong design)** Coding is done "close to the metal"/ assembly language style toavoid interferences by the compiler** Questions about used compile parameters are red herrings becausethey are not used, and because of measuring the time ratios ofalgorithms that are virtually identical (except what I alreadydescribed in my first posting as part of my conjecture what is messingthings up on all Intel platforms)Please read my first & second posting carefully before offering youradvice what I should do, which I, quite likely, will not do anyway.

Hey ddcc,

VERY interesting discussion here! I find myself admiring both your general approach and your professional restraint.

While its true that -- for a very, very long laundry list of reasons --us vs. them threads are discouragedin this forum, you raise a challenge that I can only describe as quite interesting.

Using the Intel VTune Amplifier XE on the SixSort code in question would allow one to drill down to (or pinpoint) its potential bottlenecks, to the object or line of code level (as suggested by Dr. Breshears to you some time ago). Finding the bottleneck(s) in this fashion does not require CPU expertise of course; changing the code to improve its performance on a specific system based on what is learned MIGHT

And of course, if the bottlenecks are precisely where expected, an entirely new course of action might emerge.

All this said, I get the distinct impression that your challenge here may pique the interest of more than a few of our technicians, who are scattered throughout the world. In that eventuality, can you provide your source and makefile in this thread so that any interested person can download them and dig in, as needed?

I know that youve appeared to dismiss the role of the compiler in this experiment, but it would be pretty standard to examine what works well in situation A with what might work better in the same situation (as well as in situation B).

If you agree this might be a good next step, I note that you can add files to responses here in this forum. Otherwise, feel free to email the files to intel_mtl@intel.com and Ill add them to this thread myself.

Your continued interest in the causes of this behavior is appreciated.

Cheers,

Jdg

Deje un comentario

Por favor inicie sesión para agregar un comentario. ¿No es socio? Únase ya