Let me give more data, organize them and provide more detail to
support my conjecture about "Intel's problem".
-
Note: I report time ratio of 2 algorithms on different platforms and
with varying number of threads. The two algorithms are (parallel)
SixSort and (parallel) Quicksort (my version of Quicksort because what
I found out there is not industrial strength). No, I do not do the
Mickey Mouse thing of sorting integers. Instead, I sort object;
i.e. the arrays contain references to structs.
-
Single thread
=========
Comparison: SixSort versus Quicksort
Single thread / array size 16M
Machine (Sixsort time)/(Quicksort time)
AMD 64 X2 4000+ 2.1 GHz/ Linux 0.59
Intel Celeron 1.07Ghz/ XP-Cygwin 0.79
Intel quad core xeon 2.5Ghz/ Linux 0.81
Intel Pentium M 1.73Ghz/ XP-Cygwin 0.83
Intel MultiCore Testing Lab (MTL) 0.84
Intel i3-2310M 2.1Ghz/ Win7-Cygwin 0.86
-
Two threads
========
Comparison: parallel SixSort versus single threaded Quicksort
two threads / array size 16M
AMD 64 X2 4000+ 2.1 GHz/ Linux 0.33
quad core xeon 2.5Ghz/ Linux 0.50
Intel i3-2310M 2.1Ghz/ Win7-Cygwin 0.51
MTL 0.57
-
Many threads
=========
Comparison: Paralellized Quicksort against paralellized SixSort
The comparison is based on their time ratios against our sequential
Quicksort
Machine: Intel's MTL
Array size 16M
# threads PQuicksort time ratio SixSort time ratio
2 0.65 0.57
3 0.47 0.48
4 0.39 0.38
5 0.35 0.34
6 0.33 0.31
7 0.32 0.30
8 0.30 0.28
9 0.28 0.27
10 0.27 0.27
-
This last test set is really bad because the superiority of (parallel)
SixSort against (parallel) Quicksort is virtually wiped out.
-
Remember: theory predicts a SixSort/Quicksort single thread time ratio
of 0.58.
-
My conjecture
=========
The only difference I see between (parallel) SixSort and (parallel)
Quicksort is the size of their executables. Quicksort is tiny.
SixSort (32 bit) has an executable of around 50Kb. Hence my guess
that the Intel boxes suffer from cache failures of the executable - to
the extend that this conjecture makes sense given my lack of knowledge
about hardware.
-
If you folks want to test this conjecture, give me access to a (single
threaded) box with a large enough cache for execute memory.
-
PS If you reply to me, please don't use the royal "we" because that
rubs us in the wrong way.
Performance conundrum - more data and summary
Nähere Informationen zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.



