Regarding the speed of the program dsptrd Intel MKL

I guess my dear colleagues will be able to explain such big differences in the speed of calculations (>96%). Hardware configuration: i7 860 processor (2.80 GHz), DP55KG motherboard, DDR3 1333 MHz (8 GB), OS Windows XP Professional x64 Edition SP2, Intel MKL 10.3 Beta, EM64T, HT off.
(updated 04/09/2010)

//icl /O2 comparision_dsptrd.c /link mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib sigal.lib

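#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <mkl.h> /* declares dsptrd_ (assumed; the original header names were lost) */
/* dsptrd_sig comes from the author's sigal.lib; its header name was not preserved. */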

int main(void) {
    int n;
    double *ap;
    double *d;
    double *e;
    double *tau;
    int info;

    clock_t t_begin;
    int j, k, i__;

    for (n = 43000; n >= 1000; n -= 1000) {
        /* Packed storage holds n*(n+1)/2 elements; size_t keeps the byte count from overflowing int. */
        ap = (double*) malloc((size_t)n * (n + 1) / 2 * sizeof(double));
        d = (double*) malloc(n * sizeof(double));
        e = (double*) malloc((n - 1) * sizeof(double));
        tau = (double*) malloc((n - 1) * sizeof(double));
        if (!ap || !d || !e || !tau) {
            printf("Not enough memory to allocate buffer\n");
            exit(1);
        }
        /* Fill the lower triangle column by column (packed "L" layout). */
        i__ = 0;
        for (j = 0; j < n; j++) {
            for (k = j; k < n; k++) {
                ap[i__++] = (double)((k + 1) * 100 + (j + 1));
            }
        }
        t_begin = clock();
        dsptrd_("L", &n, ap, d, e, tau, &info);
        /* Convert clock ticks to milliseconds in floating point to avoid overflow. */
        printf("n=%5d The time was dsptrd Intel MKL: %8ld ms. info=%d\n", n,
               (long)((double)(clock() - t_begin) * 1000.0 / CLOCKS_PER_SEC), info);
        /* dsptrd overwrites ap, so refill it with the same matrix before the second run. */
        i__ = 0;
        for (j = 0; j < n; j++) {
            for (k = j; k < n; k++) {
                ap[i__++] = (double)((k + 1) * 100 + (j + 1));
            }
        }
        t_begin = clock();
        dsptrd_sig("L", &n, ap, d, e, tau, &info);
        printf("n=%5d The time was my dsptrd: %8ld ms. info=%d\n\n", n,
               (long)((double)(clock() - t_begin) * 1000.0 / CLOCKS_PER_SEC), info);
        free(tau);
        free(e);
        free(d);
        free(ap);
    }
    return 0;
}

n=43000 The time was dsytrd Intel MKL: 14413547 ms. info=0
n=43000 The time was my dsptrd : 7329828 ms. info=0
..........................................................
..........................................................
n=35000 The time was dsytrd Intel MKL: 7741563 ms. info=0
n=35000 The time was my dsptrd: 3962328 ms. info=0

n=34000 The time was dsytrd Intel MKL: 7118641 ms. info=0
n=34000 The time was my dsptrd: 3633078 ms. info=0

n=33000 The time was dsytrd Intel MKL: 6480797 ms. info=0
n=33000 The time was my dsptrd: 3323547 ms. info=0

n=32000 The time was dsytrd Intel MKL: 5939719 ms. info=0
n=32000 The time was my dsptrd: 3030782 ms. info=0

n=31000 The time was dsytrd Intel MKL: 5357406 ms. info=0
n=31000 The time was my dsptrd: 2755828 ms. info=0

n=30000 The time was dsytrd Intel MKL: 4877547 ms. info=0
n=30000 The time was my dsptrd: 2498797 ms. info=0

n=29000 The time was dsytrd Intel MKL: 4373578 ms. info=0
n=29000 The time was my dsptrd: 2257656 ms. info=0

n=28000 The time was dsytrd Intel MKL: 3954922 ms. info=0
n=28000 The time was my dsptrd: 2032938 ms. info=0

n=27000 The time was dsytrd Intel MKL: 3531782 ms. info=0
n=27000 The time was my dsptrd: 1823312 ms. info=0

n=26000 The time was dsytrd Intel MKL: 3171781 ms. info=0
n=26000 The time was my dsptrd: 1628719 ms. info=0

n=25000 The time was dsytrd Intel MKL: 2799281 ms. info=0
n=25000 The time was my dsptrd: 1448672 ms. info=0

n=24000 The time was dsytrd Intel MKL: 2482109 ms. info=0
n=24000 The time was my dsptrd: 1282578 ms. info=0

n=23000 The time was dsytrd Intel MKL: 2160562 ms. info=0
n=23000 The time was my dsptrd: 1129500 ms. info=0

n=22000 The time was dsytrd Intel MKL: 1899421 ms. info=0
n=22000 The time was my dsptrd: 989203 ms. info=0

n=21000 The time was dsytrd Intel MKL: 1645437 ms. info=0
n=21000 The time was my dsptrd: 861172 ms. info=0

n=20000 The time was dsytrd Intel MKL: 1421594 ms. info=0
n=20000 The time was my dsptrd: 746547 ms. info=0

n=19000 The time was dsytrd Intel MKL: 1209344 ms. info=0
n=19000 The time was my dsptrd: 638938 ms. info=0

n=18000 The time was dsytrd Intel MKL: 1025391 ms. info=0
n=18000 The time was my dsptrd: 543937 ms. info=0

n=17000 The time was dsytrd Intel MKL: 855171 ms. info=0
n=17000 The time was my dsptrd: 460234 ms. info=0

n=16000 The time was dsytrd Intel MKL: 714203 ms. info=0
n=16000 The time was my dsptrd: 383219 ms. info=0

n=15000 The time was dsytrd Intel MKL: 585125 ms. info=0
n=15000 The time was my dsptrd: 316203 ms. info=0

n=14000 The time was dsytrd Intel MKL: 474891 ms. info=0
n=14000 The time was my dsptrd: 257609 ms. info=0

n=13000 The time was dsytrd Intel MKL: 377844 ms. info=0
n=13000 The time was my dsptrd: 206703 ms. info=0

n=12000 The time was dsytrd Intel MKL: 295094 ms. info=0
n=12000 The time was my dsptrd: 163015 ms. info=0

n=11000 The time was dsytrd Intel MKL: 224157 ms. info=0
n=11000 The time was my dsptrd: 125969 ms. info=0

n=10000 The time was dsytrd Intel MKL: 168735 ms. info=0
n=10000 The time was my dsptrd: 94969 ms. info=0

n= 9000 The time was dsytrd Intel MKL: 122218 ms. info=0
n= 9000 The time was my dsptrd: 69562 ms. info=0

n= 8000 The time was dsytrd Intel MKL: 86718 ms. info=0
n= 8000 The time was my dsptrd: 49156 ms. info=0

n= 7000 The time was dsytrd Intel MKL: 58265 ms. info=0
n= 7000 The time was my dsptrd: 33125 ms. info=0

n= 6000 The time was dsytrd Intel MKL: 36968 ms. info=0
n= 6000 The time was my dsptrd: 21015 ms. info=0

n= 5000 The time was dsytrd Intel MKL: 22000 ms. info=0
n= 5000 The time was my dsptrd: 12265 ms. info=0

n= 4000 The time was dsytrd Intel MKL: 11671 ms. info=0
n= 4000 The time was my dsptrd: 6343 ms. info=0

n= 3000 The time was dsytrd Intel MKL: 5078 ms. info=0
n= 3000 The time was my dsptrd: 2672 ms. info=0

n= 2000 The time was dsytrd Intel MKL: 1453 ms. info=0
n= 2000 The time was my dsptrd: 719 ms. info=0
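
For readers who want to relate the timings above to a percentage, here is a minimal sketch in C. It assumes that "faster by X%" means 100 * (t_reference / t_candidate - 1); the post does not state the formula, and the function name percent_faster is hypothetical.

```c
#include <assert.h>

/* Percentage by which the candidate is faster than the reference,
   assuming the definition 100 * (t_reference / t_candidate - 1).
   Both times must be in the same unit (for example milliseconds). */
static double percent_faster(double t_reference, double t_candidate) {
    assert(t_reference > 0.0 && t_candidate > 0.0);
    return 100.0 * (t_reference / t_candidate - 1.0);
}
```

Under that assumption, percent_faster(14413547.0, 7329828.0) for the n=43000 rows evaluates to roughly 96.6. The ratio differs from row to row, so one figure does not describe every problem size.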

My web page (currently unavailable) and the publications that used my diagonalization can be downloaded here: http://depositfiles.com/files/fmy2ueaad

Unless you have provided the source code or linkable object code of your version to the Intel developers (before which you would need to protect your intellectual property), I don't think that such challenges are of any interest.

From the point of view of a user, the issues are:

(i) Does the replacement candidate meet the specifications of the current library routine? In other words, does it do all the tasks that the current routine can do and support all the options presently available?

(ii) How does the candidate stand in regard to stability and accuracy? Is the algorithm, if different, known and has it been peer-reviewed?

(iii) Speed.

Your post addresses only issue (iii).

As to the question about packed versus full storage: full storage is programmer-friendly. The additional programming needed to use packed storage is not justified for one-off calls to routines such as DSPTRD when the time consumed is not considered important.
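
To illustrate the extra programming that packed storage involves, here is a minimal sketch in C of the index arithmetic for the lower ("L") packed column-major layout, the same order in which the test program at the top of the thread fills ap. The helper names packed_len, packed_index_lower and packed_get are hypothetical and belong to neither MKL nor LAPACK.

```c
#include <assert.h>
#include <stddef.h>

/* Number of doubles needed for packed storage of an n-by-n symmetric matrix. */
static size_t packed_len(size_t n) {
    return n * (n + 1) / 2;
}

/* 0-based offset of A(i,j), with i >= j, in lower packed column-major
   storage: column j starts after the j shorter-by-one columns before it. */
static size_t packed_index_lower(size_t n, size_t i, size_t j) {
    assert(j <= i && i < n);
    return i + j * (2 * n - j - 1) / 2;
}

/* Read A(i,j) for any i, j by exploiting symmetry (hypothetical helper). */
static double packed_get(const double *ap, size_t n, size_t i, size_t j) {
    return (i >= j) ? ap[packed_index_lower(n, i, j)]
                    : ap[packed_index_lower(n, j, i)];
}
```

With full storage the same element is simply a[i + j * n], which is the programmer-friendliness referred to above; packed storage trades that for roughly half the memory.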

The guys from Intel are well aware of my designs and my publications on this topic. They believe that they can own up to this task, but they can not do it. That they can be forgiven, because problem is very complicated. In fact, my program is faster, because dsptrd Intel MKL part-time view on two core, and this time the processor speeds up to frequencies of 3.47 GHz.

>The guys from Intel are well aware of my designs and my publications on this topic.

I see.

>They believe that they can own up to this task, but they can not do it.

You probably meant "They believe that they can do the task on their own, but ...". What you wrote means something quite different from this.

>That they can be forgiven, because problem is very complicated. In fact, my program is faster, because dsptrd Intel MKL part-time view on two core, and this time the processor speeds up to frequencies of 3.47 GHz.

I find that last sentence undecipherable, involving as it does a non sequitur. Nor do I understand the phrase "part-time view on two core". Perhaps an online translation tool would help.

>They believe ...
This task is very difficult for them

>part-time view on two core
Part of the problem is only on the two processor cores

Quoting yuriisig: part of the problem is only on the two processor cores

If you take a large matrix (e.g., 40000 x 40000), Task Manager shows that for more than 3/4 of the time Intel MKL's dsptrd computes on only one processor core. It turns out that my dsptrd is 115% faster.

An algorithm for super-fast matrix tridiagonalization has been developed using the technology of the future: http://software.intel.com/en-us/forums/showthread.php?t=76595. This algorithm is considerably faster than dsytrd from Intel MKL, the fastest routine for square (full-storage) matrices (>24%; i7 860 processor, XP x64, EM64T, Intel MKL 10.3 Beta, HT off):

n=12000 The time was dsytrd Intel MKL: 203860 ms. info=0
n=12000 The time was my dsptrd : 164391 ms. info=0

n=13000 The time was dsytrd Intel MKL: 259016 ms. info=0
n=13000 The time was my dsptrd: 208703 ms. info=0

n=14000 The time was dsytrd Intel MKL: 322344 ms. info=0
n=14000 The time was my dsptrd: 259828 ms. info=0

n=15000 The time was dsytrd Intel MKL: 396094 ms. info=0
n=15000 The time was my dsptrd: 318688 ms. info=0

n=16000 The time was dsytrd Intel MKL: 480797 ms. info=0
n=16000 The time was my dsptrd: 386532 ms. info=0

n=17000 The time was dsytrd Intel MKL: 574360 ms. info=0
n=17000 The time was my dsptrd: 462375 ms. info=0

n=18000 The time was dsytrd Intel MKL: 681281 ms. info=0
n=18000 The time was my dsptrd: 548657 ms. info=0

n=19000 The time was dsytrd Intel MKL: 801937 ms. info=0
n=19000 The time was my dsptrd: 644031 ms. info=0

n=20000 The time was dsytrd Intel MKL: 933172 ms. info=0
n=20000 The time was my dsptrd: 750235 ms. info=0

n=26000 The time was dsytrd Intel MKL: 2041297 ms. info=0
n=26000 The time was my dsptrd: 1640625 ms. info=0

If these results are combined with http://software.intel.com/en-us/forums/showthread.php?t=73653&o=d&s=lr, the gap in computation speed will be very large.

Here is an example on a newer Intel processor (Intel(R) Core(TM) i7-5820K CPU @ 3.30 GHz, six cores) using parallel_studio_xe_2017_update4:

n=10000 The time was the Intel MKL dsptrd: 100.5 s.
n=10000 The time was the Intel MKL dsytrd: 41.5 s.

But my dsptrd is faster than the Intel MKL dsytrd! That is, the Rectangular Full Packed (RFP) storage scheme proposed by Intel, which lets you work with the matrix using dgemm and dtrmm, is not optimal!

Other issues also remain unresolved, for example: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/290238

Yuri, according to the performance results you provided, your implementation is significantly faster than MKL for some problem sizes. The problem is how we can check these results. Could we get an evaluation version of your library?

Genadiy, I already published my algorithms: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/287728

Checking my algorithms is very simple: send me a representative.

Yurii, the two Depositfile links (in the forum post that you cited in #10) are dead: if either is chosen, after a 30-second wait we see the note:

This file does not exist, the access to the following file is limits or it has been removed due to infringement of copyright.

So, where does one go (in year 2017) to learn the main points of your algorithm?

According to the lead post in this thread, you are judging by the total time of all threads, while MKL should be optimized for minimum elapsed time, possibly with a varying number of active cores. You would need to experiment with the number of threads and affinity to reduce total CPU time.
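
The distinction between elapsed time and total CPU time can be checked directly. The following is a minimal sketch in C, not the benchmark used in this thread: it records both wall-clock time (C11 timespec_get) and clock() around one call. The names wall_seconds, cpu_seconds, time_call and busy_work are hypothetical. What clock() reports depends on the platform: on POSIX systems it is processor time summed over the threads of the process, while Microsoft's C runtime documents it as elapsed wall-clock time.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock seconds from the C11 calendar clock. */
static double wall_seconds(void) {
    struct timespec ts;
    int ok = timespec_get(&ts, TIME_UTC);
    assert(ok == TIME_UTC);
    (void)ok; /* silence the unused-variable warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Seconds as reported by clock(); the meaning is platform-dependent. */
static double cpu_seconds(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Stand-in workload (hypothetical); a real test would call the routine under study. */
static void busy_work(void) {
    volatile double s = 0.0;
    long i;
    for (i = 0; i < 2000000L; i++) {
        s += (double)i * 1e-9;
    }
}

/* Time one call of fn and return both measurements through the out-parameters. */
static void time_call(void (*fn)(void), double *wall, double *cpu) {
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    fn();
    *wall = wall_seconds() - w0;
    *cpu = cpu_seconds() - c0;
}
```

If a threaded routine shows a clock() difference well above the wall-clock difference, the CPU figure is being summed over several cores; if the two agree, the routine either ran on one core or clock() is itself measuring elapsed time on that platform.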

Tim,

First of all, I'm talking about the difference between my algorithms and Intel MKL algorithms: my algorithms are much faster.

--Yurii

mecej4,

I published my algorithm only for the dormtr function: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/287728

--Yurii

 
