Is this the highest possible performance?

Hi
I'm evaluating MKL 8.1 under Linux.
As a first step, we wanted to see whether it is possible to reach the theoretical CPU performance using MKL. We developed a very simple example using the SGEMM routine. This example multiplies two 4000x4000 matrices and adds the result to another matrix.

This example needs 64*10^9 multiply-add computations (4000^3). Its execution time is about 35 seconds on a 2.0 GHz Pentium M CPU with 2 MB of cache.

We concluded that the highest possible performance of this CPU is near 2GFLOPS.
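The arithmetic behind that estimate can be checked with a short sketch. It assumes the 64*10^9 figure counts one multiply-add per inner-product term (n^3 for n = 4000), which gives roughly 1.83*10^9 operations per second over 35 seconds. GEMM benchmarks usually count the multiply and the add separately (2*n^3 floating point operations), and under that convention the same run would be reported as about twice that rate. The function names below are made up for illustration and are not part of MKL.

```c
#include <stdio.h>

/* Hypothetical helper: number of multiply-add pairs in C += A*B
 * for square n x n matrices (one pair per inner-product term). */
double gemm_multiply_adds(double n)
{
    return n * n * n;
}

/* Hypothetical helper: rate in units of 10^9 operations per second. */
double giga_ops_per_second(double ops, double seconds)
{
    return ops / seconds / 1e9;
}

/* Print the rate under both counting conventions. */
void print_sgemm_rates(double n, double seconds)
{
    double pairs = gemm_multiply_adds(n);
    printf("multiply-add pairs:          %.3g\n", pairs);
    printf("rate, 1 op per pair:         %.2f G/s\n",
           giga_ops_per_second(pairs, seconds));
    printf("rate, 2 flops per pair:      %.2f GFLOPS\n",
           giga_ops_per_second(2.0 * pairs, seconds));
}
```

Calling print_sgemm_rates(4000.0, 35.0) from a main function reproduces the numbers in this post. Which convention is used matters when comparing the result against a quoted peak figure for the CPU.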

Is this correct?
Do we need any special additional optimizations? We used GCC to compile the application under Debian. (We did not use the Intel compiler; could it improve the results?)

thanks.

Pentium-M isn't designed to support a particularly high performance for floating point operations. On Pentium-M, parallel floating point operations are split into 64-bit chunks (2 single/float operations per instruction). So, 2Gflops looks possible, if all operations are parallel SSE. Apparently, MKL has done a good job even on Pentium-M, which I doubt was considered as a specific target for optimization.
Apparently, only an insignificant fraction of your operations are performed outside MKL, so the compiler and options chosen would not make a measurable difference. If you made the problem much smaller, with vector lengths on the order of 100, you might see an advantage for a vectorizing compiler. I suppose you aren't using gcc-4.2, which might be capable of vectorizing your additions. So far, the Intel compilers have generally managed to vectorize better than the others, but you seem to have such a simple case that any vectorizing compiler should do.
You probably are aware that you should be doing your array operations with purely sequential access in the inner loop. I'm not sure how you can ask about additional optimizations, when you don't say what you consider to be in the category of "special additional."
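To illustrate what sequential access in the inner loop means for code written outside MKL, here is a minimal sketch of a hand-written multiply-add, assuming row-major storage. It is not how SGEMM is implemented, and the function name is hypothetical. With the i-k-j loop order, the innermost loop walks both b and c at unit stride, which is the pattern a vectorizing compiler handles best.

```c
#include <stddef.h>

/* Hypothetical example: C += A * B for row-major n x n float matrices.
 * Loop order i-k-j keeps the inner loop at unit stride over b and c. */
void matmul_add_ikj(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            /* a[i][k] is constant for the whole inner loop. */
            const float aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j) {
                /* Consecutive j touch consecutive addresses in b and c. */
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

The more obvious i-j-k order would read b with a stride of n floats in the inner loop, which defeats both the cache and the vectorizer. For matrices as large as 4000x4000, a library routine such as SGEMM should still be much faster than a loop like this, since it also blocks for cache.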
gcc -O3 -funroll-loops -march=pentium-m would be a typical choice, adding -ftree-vectorize if you are using gcc 4.2.