Performance conundrum

Performance conundrum

Portrait de ddcc

SixSort is a faster and higher quality in-place sorter than
Quicksort. A complexity analysis suggests that the SixSort/Quicksort
time use ratio is 0.58 on large arrays.

A test on a 16M array yields the ratio 0.59 (against a best in class
Quicksort). On an AMD box.

SixSort can be parallelized easily, like Quicksort. Parallel SixSort
against Quicksort with two threads yields an excellent time ratio of
0.33. On an AMD box.

Tests on multiple Intel boxes give disappointing results. Here what
was observed on MTL:

Comparison: Sixsort versus 3-layered Quicksort/ single thread
Intel's MTL machine:
array size (Sixsort time)/(Quicksort time)
1M 0.92
2M 0.91
4M 0.87
8M 0.86
16M 0.84

Comparison: Sixsort versus 3-layered Quicksort/ two threads
Machine: Intel's multi-core lab/ Linux
array size (Sixsort time)/(Quicksort time)
1M 0.74
2M 0.71
4M 0.68
8M 0.59
16M 0.57

I did many more tests on other Intel boxes and on MTL with threads
upto 10. Similarly disappointing results.

Clay Breshears has suggested to get into the guts of the (serial/
parallel) execution to pinpoint a/the bottleneck. That scares me
because I have only anecdotal knowledge/ opinions about cpu
architecture.

At this point my best conjecture is that there are persistent cache
failures on Intel boxes because:
-- SixSort has way more code than tiny Quicksort
-- SixSort nibbles at 4 locations in the input array in contrast with
Quicksort at 2.
How do we go from here?

10 posts / 0 nouveau(x)
Dernière contribution
Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.
Portrait de Mike Pearce (Intel)

Please note that the MTL is not set up for any kind of comparison of architectures or algorithms.
We don't have the resources to configure the system(s) to support a specific BIOS configuration or toprovide guidance onwhere one algorithm is performing better than another.

The MTL is primarily a tool toshowcase the thread scaling of a particular workload or algorithm.

Portrait de ddcc
Mike Pearce's response puzzle me. I have been contrasting >relative< performance of two algorithms on one platform versus what is observed on MTL; with one thread and with two threads. In both cases with disappointing results on MTL. No idea why BIOS configuration or algorithm comparison is raised in the response. That has nothing to do with what I reported in my post. You emphasize that MTL is aimed at thread scaling. Here you go, everything on MTL: Comparison: Paralellized Quicksort against paralellized SixSort The comparison is based on their time ratios against our sequential Quicksort Machine: Intel's MTL Array size 16M # threads PQuicksort time ratio SixSort time ratio 2 0.65 0.57 3 0.47 0.48 4 0.39 0.38 5 0.35 0.34 6 0.33 0.31 7 0.32 0.30 8 0.30 0.28 9 0.28 0.27 10 0.27 0.27 Remember that SixSort has a time ratio of 0.59 on AMD with a single thread and 0.33 with two threads against sequential Quicksort. SixSort's superiority against Quicksort is virtually wiped out on MTL in multi threaded settings. How do we go from here?

Mike Pearce's response puzzle me. I have been contrasting >relative

You emphasize that MTL is aimed at thread scaling. Here you go,everything on MTL:
Comparison: Paralellized Quicksort against paralellized SixSortThe comparison is based on their time ratios against our sequentialQuicksortMachine: Intel's MTLArray size 16M# threads PQuicksort time ratio SixSort time ratio2 0.65 0.573 0.47 0.484 0.39 0.385 0.35 0.346 0.33 0.317 0.32 0.308 0.30 0.289 0.28 0.2710 0.27 0.27
Remember that SixSort has a time ratio of 0.59 on AMD with a singlethread and 0.33 with two threads against sequential Quicksort.SixSort's superiority against Quicksort is virtually wiped out on MTLin multi threaded settings.
How do we go from here?

Portrait de ddcc

[A previous attempt to reply to Mike's reponse messed up/ sorry]

Mike Pearce's response puzzle me. I have been contrasting >relative<
performance of two algorithms on one platform versus what is observed
on MTL; with one thread and with two threads. In both cases with
disappointing results on MTL. No idea why BIOS configuration or
algorithm comparison is raised in the response. That has nothing to
do with what I reported in my post.

You emphasize that MTL is aimed at thread scaling. Here you go,
everything on MTL:

Comparison: Paralellized Quicksort against paralellized SixSort
The comparison is based on their time ratios against our sequential
Quicksort
Machine: Intel's MTL
Array size 16M
# threads PQuicksort time ratio SixSort time ratio
2 0.65 0.57
3 0.47 0.48
4 0.39 0.38
5 0.35 0.34
6 0.33 0.31
7 0.32 0.30
8 0.30 0.28
9 0.28 0.27
10 0.27 0.27

Remember that SixSort has a time ratio of 0.59 on AMD with a single
thread and 0.33 with two threads against sequential Quicksort.
SixSort's superiority against Quicksort is virtually wiped out on MTL
in multi threaded settings.

How do we go from here?

Portrait de Mike Pearce (Intel)

We fully understand you are contrasting the >relative< performance of 2 algorithms, now providing relative scaling performance based on the number of threads.

It now appears that the scaling thread performance of the SixSort time ratio does scale slightly better than thePQuickSort in the numbers you have just provided.

But again, you're asking why is the SixSort ratio better on AMD, and worst on Intel platforms. This we cannot answer - there are many ways the resultscan be different, e.g.. BIOS configs, Processor speed & Turbo Boost,compiler andoptions, cache configoptions, even down to the SixSort code, etc, etc. And was the workload run as a benchmark on the MTL, you may recall that the login node is not set up as a benchmarking platform, and willl automatically degrade performance.

Finally we could suggest thatyou avail yourself of the extensiveIntel performancetool set on the MTL, to determine why the SixSort algorithm is deficient on Intel platforms, as you report.

Portrait de ddcc
I can provide also more test data for five Intel platforms, three of them withmulti threaded results. They are ALL disappointing (I could have used "terrible") against the theoretical analysis and what I found on the AMD machine. I had hoped that you would accept the challenge to deal with what I report instead of putting the ball in my court to debug your hardware architecture about which I have only anecdotal knowledge -- as I admitted already in my first posting. If you don't address the problem I have identified (with my conjecture): sorry, so be it.I have too much on my plate to solve Intel'sproblems ...
Portrait de Lars Bergstrom

You still have quite a bit more work to do before you can completely pin the blame on Intel. For instance:
- You are using quicksort as a baseline. Intel's current-generation
memory interconnects have both significantly higher bandwidth and
throughput than AMDs (see my tech report for more details
http://www.cs.uchicago.edu/research/publications/techreports/TR-2011-02

). This means that you should expect your ratios to be better on
the AMD machine because changing to an algorithm with better locality
will be more of a win on the AMD machine than the Intel machine.
You can see this information by looking at the difference between the 1 processor output for quicksort/sixsort.
- Which compiler are you using and with which flags? For example, for bulk-data synchronous parallel programs, switching from gcc to icc and turning on the streaming-stores optimization can increase your bandwidth by up to 40% on AMD machines (less effect on Intel, in my experience).
- Your dataset size may be a little small. Make sure to try it out across larger sizes to see how the ratio changes. At only a few million integers, you can end up starving for more parallelism. Our experience with our language (which has only 1/4-ish the speed of C) is that we needed 10M integers, so I suspect that with an optimized C program you might want at least a factor of four more. Try plotting your speedup across ranges of processors; elbows in that graph commonly indicate either bus saturation, limitations due to parallel overheads, or just running up against Amdal's Law.
- Along those lines, what are the raw numbers? I've frequently found that raw,
single-processor performance on Intel machines is so much larger than on
similar-generation AMD machines that datasets that are large enough to see good speedups on an AMD bottom out on Intel due to parallel overhead beginning to dominate the available work.

I hope this helps! And don't be discouraged -- doing high-quality systems research requires far more measurement and analysis than it might appear by looking at final papers. Even seemingly simple statements such as X is faster than Y on machine M because of Z can require a stunning amount of time. Doing that across two machines will take even more time and effort.
- Lars

Portrait de ddcc
Mr Bergstrom, Phew ... > ... before you can completely pin the blame on Intel. I don't blame anything. I report unexpected, bad performance ratios on a bunch of Intel platforms - ranging from 2001 to 2011, including MTL. Since I am NOT a hardware expert it does not make sense to asks me to do hardware tracing; plus why should >>I<< do that? > ... changing to an algorithm with better locality ... SixSort does not have better locality than Quicksort. Thus your argument fails - to the extend I grasp your statement. The balance of your reply I can't really follow -- it appears to consist of throwing doubts around - as done in earlier responses ... > I hope this helps! Thanks, but no not really. > And don't be discouraged You miss the point. I am VERY satisfied that two years of hard work on SixSort (which is a cooperative of several sorting algorithms being faster and having higher quality than qsort in the C-library - in terms of not degrading in quadratic explosions) has been validated to MY satisfaction on the AMD machine and especially because it confirms my complexity analysis. The results of the equivalent tests on the Intel platforms are uniformly different and, I believe, poor. There can be good reasons why Intel is simply not interested in my findings. I assume that there is a sizeable test bench of algorithms used by Intel to assess its (cpu) architectures. SixSort may not satisfy the characteristics of that test bench. If Intel is interested to figure out what is going: fine with me ... I am NOT qualified and way too busy with other challenges. Feel free to take the ball and run with it, as you see fit
Portrait de ddcc
Let me give you folks a little background. ** All testing is done by comparing time ratios of competing algorithms. ** Test runs are done repeatedly to ascertain that load fluctations do not bias the observed ratios - I am not stupid .... ** Algorithms are coded, by me uniformely, in C (including Quicksort because qsort in the C-library has a wrong design) ** Coding is done "close to the metal"/ assembly language style to avoid interferences by the compiler ** Questions about used compile parameters are red herrings because they are not used, and because of measuring the time ratios of algorithms that are virtually identical (except what I already described in my first posting as part of my conjecture what is messing things up on all Intel platforms) Please read my first & second posting carefully before offering your advice what I should do, which I, quite likely, will not do anyway.
Portrait de jeffrey-gallagher (Intel)

Hey ddcc,

VERY interesting discussion here! I find myself admiring both your general approach and your professional restraint.

While its true that -- for a very, very long laundry list of reasons --us vs. them threads are discouragedin this forum, you raise a challenge that I can only describe as quite interesting.

Using the Intel VTune Amplifier XE on the SixSort code in question would allow one to drill down to (or pinpoint) its potential bottlenecks, to the object or line of code level (as suggested by Dr. Breshears to you some time ago). Finding the bottleneck(s) in this fashion does not require CPU expertise of course; changing the code to improve its performance on a specific system based on what is learned MIGHT

And of course, if the bottlenecks are precisely where expected, an entirely new course of action might emerge.

All this said, I get the distinct impression that your challenge here may pique the interest of more than a few of our technicians, who are scattered throughout the world. In that eventuality, can you provide your source and makefile in this thread so that any interested person can download them and dig in, as needed?

I know that youve appeared to dismiss the role of the compiler in this experiment, but it would be pretty standard to examine what works well in situation A with what might work better in the same situation (as well as in situation B).

If you agree this might be a good next step, I note that you can add files to responses here in this forum. Otherwise, feel free to email the files to intel_mtl@intel.com and Ill add them to this thread myself.

Your continued interest in the causes of this behavior is appreciated.

Cheers,

Jdg

Connectez-vous pour laisser un commentaire.