For parallel architecture with model of the shared
memory the size of a cache and its competent use is important.
For BLAS3 it is achieved by competent programming.
For example, I BLAS3 is much faster BLAS3 from Inek MKL for IA32.
For BLAS2 and other settlement methods where effectively to use a cache it is
impossible, are important both competent programming (I BLAS2 is much faster
BLAS2 from Inek MKL for IA32 and EM64T), and frequency of operative
memory. For example, Intel MKL at use BLAS2 manages only one core.
(see my page: http://www.thesa-store.com/products)