Slower on a P4 than on an Athlon?

I'm using Intel's MKL for Linux on a 1.2 GHz AMD Athlon with 256 KB of L2 cache running Fedora Core 3, and on a Mobile Pentium 4 at 2.8 GHz with 1 MB of L2 cache and a much faster FSB. Both systems have 512 MB of RAM, and I believe the RAM in the P4 machine is on a faster bus and is dual channel.

I've written some code that exploits all aspects of the MKL: the CBLAS routines, VSL, VML, and even one or two LAPACK functions.
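
For context, here's a stripped-down sketch of the kind of calls I'm making (this is not my actual code; the sizes, array names, and the particular routines shown are just made up for illustration):

/* sketch.c - illustrative only, single precision like my real code */
#include <stdio.h>
#include <stdlib.h>
#include <mkl_cblas.h>
#include <mkl_vml.h>

int main(void)
{
    int n = 1000, i;
    float *a = malloc(n * n * sizeof(float));
    float *b = malloc(n * n * sizeof(float));
    float *c = malloc(n * n * sizeof(float));
    float *d = malloc(n * n * sizeof(float));

    for (i = 0; i < n * n; i++) { a[i] = 0.001f; b[i] = 0.001f; c[i] = 0.0f; }

    /* level 3 CBLAS: C = A * B in single precision */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);

    /* VML: elementwise exp over the result */
    vsExp(n * n, c, d);

    printf("d[0] = %f\n", d[0]);
    free(a); free(b); free(c); free(d);
    return 0;
}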

I've now compiled and debugged the code on both computers using gcc with the -pg option:

gcc turboSPD.c -L/opt/intel/mkl721/lib/32 -lmkl_lapack32 -lmkl_ia32 -lguide -lpthread -lg2c -pg

and I've used gprof to look at the call tree. I am absolutely astounded to find that the Athlon machine runs the resulting executable four times faster than the P4: the Athlon takes approximately 500 seconds to run code that takes the P4 about 2,000 seconds. According to the call graph, almost all of the time in my code is spent in routines from MKL's CBLAS.
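
(For completeness: after the compile line above I just ran the resulting a.out and then ran gprof on the output, roughly like this:)

./a.out
gprof a.out gmon.out > profile.txt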

I must be configuring something improperly, since these libraries are optimized for the P4 architecture, and the P4 system has a much better (and much more expensive) processor, motherboard, RAM, etc. than the Athlon.

Any ideas? Is gcc causing this to happen? How might I go about figuring out why this is happening?


Well, this is an interesting situation, but I need a lot more information to help me understand what is going on. All those LAPACK routines which use level 3 BLAS, for instance, should at worst run at the ratio of the clock frequencies.

Some of the important information includes:
1. Frequency, model, and configuration (such as memory size) of each computer
2. The actual LAPACK routines being called
3. The problem sizes
4. Any other information that could bear on the performance


It appears that the problem was caused by using single precision arithmetic rather than double precision arithmetic. Once I switched to double precision values for everything, the library was *VERY* fast on the P4. It practically has wings!

I guess the slowdown from converting to and from the FPU's natural double precision actually outweighed the speedup from the lower memory bandwidth requirements of floats?
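
In case it helps anyone else who hits the same thing, the change really was just switching every float array and single precision routine to its double precision counterpart. Roughly (again a simplified sketch, not my actual code):

/* after the fix: everything in double precision
   (float -> double, cblas_sgemm -> cblas_dgemm, vsExp -> vdExp, etc.) */
#include <stdio.h>
#include <stdlib.h>
#include <mkl_cblas.h>

int main(void)
{
    int n = 1000, i;
    double *a = malloc(n * n * sizeof(double));
    double *b = malloc(n * n * sizeof(double));
    double *c = malloc(n * n * sizeof(double));

    for (i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 0.5; c[i] = 0.0; }

    /* was cblas_sgemm with float arrays and 1.0f / 0.0f constants */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);
    free(a); free(b); free(c);
    return 0;
}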

I am glad to hear about this favorable result. This suggests that you may have been using an older version of MKL. At one time we had not optimized the single precision BLAS for the Pentium 4 processor - we just hadn't had the time to do it.

I would venture a guess that if you were to use MKL 7.2.1, the latest release, you would also get excellent performance on the single precision problem.

