I've compared the performance of two versions of an f77 test program:
a) one that calls BLAS1 (daxpy) and BLAS2 routines, and
b) one that does (theoretically) the same work, but with
explicitly coded f77 loops instead of library calls.
(I compiled with ifc -O3 -tpp6 on my old
Celeron 433 running RH 6.2.)
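
For reference, the explicit loops meant in (b) might look roughly like the sketch below. It is written in C rather than f77, the names loop_daxpy and loop_dgemv are made up for illustration, and it is not the actual test program. It only shows the operations that daxpy (y := alpha*x + y) and non-transposed dgemv (y := alpha*A*x + beta*y) perform, assuming unit strides and Fortran-style column-major storage.

```c
#include <stddef.h>

/* Hypothetical hand-coded equivalent of BLAS1 daxpy with unit strides:
   y := alpha*x + y for vectors of length n. */
void loop_daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Hypothetical hand-coded equivalent of BLAS2 dgemv (no transpose,
   unit strides): y := alpha*A*x + beta*y.
   A is m-by-n, stored column-major with leading dimension lda,
   i.e. element (i,j) lives at a[i + j*lda], as in Fortran. */
void loop_dgemv(size_t m, size_t n, double alpha, const double *a,
                size_t lda, const double *x, double beta, double *y)
{
    /* Scale y by beta first. */
    for (size_t i = 0; i < m; i++)
        y[i] *= beta;

    /* Accumulate column by column so the inner loop walks
       contiguous memory in the column-major layout. */
    for (size_t j = 0; j < n; j++) {
        double t = alpha * x[j];
        for (size_t i = 0; i < m; i++)
            y[i] += t * a[i + j * lda];
    }
}
```

Both routines do only one multiply and one add for each element of x or A they load, so a good optimizing compiler has a fair chance of matching a tuned library on loops this simple.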
I didn't find any significant performance
gain from the MKL (or Atlas) library calls
compared with directly compiled f77 loops :-(
(this was measured on some PIII-based systems).
I understand that BLAS3 calls work much
better, and that using DGEMM (for example)
will give a speed-up, but that's BLAS3... Is there any
available data on the speed-up (compared
with hand-coded loops and compiler optimization)
from using BLAS1 or dgemv calls on modern Intel
P4 (with SSE2) CPUs?
Or maybe similar data for IA-64?
Zelinsky Inst. of Organic Chemistry