I'm using Intel's MKL for Linux on two machines: a 1.2 GHz AMD Athlon with 256 KB of L2 cache running Fedora Core 3, and a 2.8 GHz mobile Pentium 4 with 1 MB of L2 cache and a much faster FSB. Both systems have 512 MB of RAM; I believe the RAM in the P4 machine is dual channel and sits on a faster bus.
I've written some code that draws on several parts of the MKL: the CBLAS routines, the VSL, the VML, and even one or two functions from LAPACK.
I've now compiled and debugged the code on both computers using gcc with the -pg option:
gcc turboSPD.c -L/opt/intel/mkl721/lib/32 -lmkl_lapack32 -lmkl_ia32 -lguide -lpthread -lg2c -pg
and I've used gprof to look at the call graph. I am absolutely astounded to find that the Athlon runs the resulting executable 4 times faster than the P4: the Athlon takes approximately 500 seconds to run the same code that takes the P4 about 2,000 seconds. According to the call graph, almost all of the time in my code is spent in routines from MKL's CBLAS.
I must be misconfiguring something, since these libraries are optimized for the P4 architecture, and the P4 machine has a much better (and much more expensive) processor, motherboard, RAM, etc. than the Athlon.
Any ideas? Is gcc causing this to happen? How might I go about figuring out why this is happening?
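One check I could run on both machines is a small timing test that leaves the MKL out entirely, to see whether the P4 is also slow on plain compiled code. The following is only a minimal sketch of that idea: naive_dgemm and time_matmul are hypothetical names I made up for illustration, the naive triple loop is a stand-in for the real CBLAS call and not what my program does, and clock() measures CPU time rather than wall-clock time.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Naive row-major C = A * B for n x n matrices.
   Hypothetical stand-in for the library call being investigated. */
void naive_dgemm(int n, const double *a, const double *b, double *c)
{
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            double sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            c[(size_t)i * n + j] = sum;
        }
    }
}

/* Runs `reps` multiplies of n x n matrices and returns the CPU seconds used,
   or -1.0 if allocation fails. *check receives c[0] so the work cannot be
   optimized away and the result can be sanity-checked (it should equal 2n). */
double time_matmul(int n, int reps, double *check)
{
    size_t count = (size_t)n * (size_t)n;
    size_t idx;
    int r;
    clock_t start, end;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c = malloc(count * sizeof *c);

    if (a == NULL || b == NULL || c == NULL) {
        free(a);
        free(b);
        free(c);
        return -1.0;
    }

    /* Fill the inputs with simple, well-behaved values. */
    for (idx = 0; idx < count; idx++) {
        a[idx] = 1.0;
        b[idx] = 2.0;
        c[idx] = 0.0;
    }

    start = clock();
    for (r = 0; r < reps; r++)
        naive_dgemm(n, a, b, c);
    end = clock();

    *check = c[0];
    free(a);
    free(b);
    free(c);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_matmul(500, 10, &check) from a small main on each machine and printing the returned seconds would give a baseline. If the P4 is also several times slower here, the problem is probably in the machine or its configuration and not in the MKL or my link line; if the P4 wins comfortably, the MKL setup looks like the more likely suspect.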