Newbie performance question

Deleted user

Hi,

I have just gotten an evaluation copy of the MKL and have successfully linked it with MS Visual Studio .NET.

As a quick test of the Level 1 BLAS, I replaced some functions that I wrote myself (such as my_ddot and my_daxpy) with the corresponding CBLAS functions (such as cblas_ddot and cblas_daxpy).

To my surprise, the code became noticeably slower with the MKL functions.

I realize that I have not given many specifics, but is this kind of performance reasonable? Or have I made a mistake?

Thanks,
Sam

Tim Prince

Up to a certain size, probably several hundred array elements, it's nearly certain that you could write an in-lined method which would be faster than calling MKL Level 1 BLAS functions. If you attach your test code, you might get more useful feedback.
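
For illustration, here is a minimal sketch of what such in-lined routines might look like in C. The names my_ddot and my_daxpy come from the original question, but the signatures are assumptions (the poster's code was not shown), and the sketch handles unit-stride arrays only.

```c
#include <stddef.h>

/* Hypothetical hand-written dot product: returns sum of x[i] * y[i].
   Assumes contiguous (unit-stride) arrays of length n. */
static inline double my_ddot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Hypothetical hand-written axpy: y[i] += alpha * x[i].
   Assumes x and y are contiguous and do not overlap. */
static inline void my_daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

Because these are small static inline functions, the compiler can expand them at the call site and avoid the function-call and argument-checking cost entirely, which is where the advantage on short vectors would come from.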


The greater strength of MKL comes in Level 2 BLAS, in problems large enough to take advantage of cache blocking. That's getting to be pretty big.


You might be running on an architecture for which MKL doesn't choose the best suited code path at run time, assuming you are using the dynamic choice scheme. On my Centrino laptop, P-III optimizations sometimes outperform P4.


Community Admin

I will add some comments here.


First of all, the BLAS routines have a fairly heavy interface: there are a lot of checks to catch user errors that can be detected, there are checks on the stride with jumps to the appropriate code, and so on. Calling these functions through the CBLAS interface adds further cost. Furthermore, Intel MKL does some things in the interface to make certain that the functions will work well, including checking that the floating-point stack is empty (some compilers in the past have left values on the FP stack, and if we try to use all the entries in the stack, the stack overflows and NaNs appear in the results). In some cases the library also sets the FP control register to do 80-bit arithmetic. All of these actions affect performance on shorter vectors.
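
To make the argument checks and stride dispatch concrete, here is a rough sketch of a general BLAS-style dot product entry point in C. It is not MKL's code: the name sketch_ddot is hypothetical, and the structure simply follows the usual BLAS convention for the n, incx and incy arguments. The FP stack and control register handling mentioned above is not shown.

```c
/* Hypothetical sketch of a general-interface dot product.
   Not MKL's implementation; it only illustrates the per-call work
   that happens before any arithmetic is done. */
static double sketch_ddot(int n, const double *x, int incx,
                          const double *y, int incy)
{
    double sum = 0.0;

    /* Argument check: nothing to do for an empty vector. */
    if (n <= 0)
        return 0.0;

    /* Stride check: jump to the fast contiguous path when possible. */
    if (incx == 1 && incy == 1) {
        for (int i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }

    /* General path: a negative increment walks the vector backwards,
       so the starting index is the far end of the array. */
    int ix = (incx < 0) ? (1 - n) * incx : 0;
    int iy = (incy < 0) ? (1 - n) * incy : 0;
    for (int i = 0; i < n; ++i) {
        sum += x[ix] * y[iy];
        ix += incx;
        iy += incy;
    }
    return sum;
}
```

For a vector of a few elements, the branches and the call itself can cost as much as the multiply-adds, whereas for long vectors they are amortized away.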


Finally, compilers can generate quite good code (and SHOULD be able to generate quite good code) for simple vector operations.


Bruce
