strange performance difference

Recently I tested a new Dell server (PowerEdge 2850) and observed a strange performance degradation compared with an older system we have (a Dell PE 2650).

The newer system has 2*3.0 GHz Intel Xeon Foster processors, each with 16KB of L1 cache and 1MB of unified L2 cache, as well as 2GB of DDR2 RAM running at 800MHz. x86info identifies the CPU as
"Family: 15 Model: 4 Stepping: 1 Type: 0 Brand: 0"

The older system has 2*2.80 GHz Pentium 4 (Northwood) [C1] processors with 8KB of L1 and 512KB of L2 cache, and 2GB of DDR RAM running at 400MHz. x86info reports
Family: 15 Model: 2 Stepping: 7 Type: 0 Brand: 11
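
As a rough aid for reading these two signatures, here is a minimal Python sketch that parses an x86info-style line and looks up the family 15 model number. The lookup table is an assumption put together from commonly cited core names, not data taken from x86info itself, and the function names are hypothetical.

```python
import re

# Family 15 (NetBurst) model numbers and the core names commonly associated
# with them. ASSUMPTION: this table is illustrative and incomplete; it is not
# x86info's own database.
NETBURST_MODELS = {
    0: "Willamette / Foster",
    1: "Willamette / Foster",
    2: "Northwood / Prestonia",
    3: "Prescott / Nocona",
    4: "Prescott / Nocona",
}

def parse_signature(line):
    """Return the 'Name: number' fields of a signature line as a dict."""
    pairs = re.findall(r"(\w+):\s*(\d+)", line)
    return {key.lower(): int(value) for key, value in pairs}

def describe(line):
    """Map a signature line to a likely core name (hypothetical helper)."""
    sig = parse_signature(line)
    if sig.get("family") != 15:
        return "not a family 15 part"
    return NETBURST_MODELS.get(sig.get("model"), "unknown family 15 model")

new_cpu = "Family: 15 Model: 4 Stepping: 1 Type: 0 Brand: 0"
old_cpu = "Family: 15 Model: 2 Stepping: 7 Type: 0 Brand: 11"
print("new:", describe(new_cpu))
print("old:", describe(old_cpu))
```

If the table is right, a model 4 part would not be a Foster, so the core name given above for the newer system may be worth double-checking.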

Both systems have the latest BIOS and run Linux with hyperthreading enabled. The older system runs "Debian/unstable" with kernel 2.6.10-686-smp and the libc-i686 library; the newer system was tried with several distributions, mainly Red Hat based.

The puzzle is that our software performs _noticeably_ better on the older system. The software (sequence alignment using dynamic programming) is quite memory intensive and mostly uses integer computations. The difference is large enough that we cannot attribute it to measurement error. In some cases the newer system was twice as slow.
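
To give a feel for this kind of workload, here is a minimal Python sketch of a local alignment score computed by dynamic programming (Smith-Waterman style, linear gap penalty). It is only an illustration: the scoring values and the function name are assumptions, and it is neither our software nor the FASTA implementation.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    ASSUMPTION: match/mismatch/gap values are arbitrary illustrative choices.
    Only two rows of the DP matrix are kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best

print(local_alignment_score("ACGT", "ACGT"))
```

Every cell needs a few integer additions and comparisons plus reads of neighbouring cells, so the run time is dominated by integer work and memory traffic rather than floating point.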

The problem can also be reproduced with a public tool and compiler. For example, the FASTA package, compiled with gcc 3.3.5 and the flags "-O3 -march=pentium4", shows the following results:

[old] /usr/bin/time fasta/ssearch34 -b 50 -d 0 -H -Q -O 1 HSBA150A6 HSBA150A6

301.96user 0.12system 5:02.08elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+11309minor)pagefaults 0swaps

[new] /usr/bin/time fasta/ssearch34 -b 50 -d 0 -H -Q -O 1 HSBA150A6 HSBA150A6

374.65user 0.07system 6:14.73elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (179major+11121minor)pagefaults 0swaps

Here HSBA150A6 is a nucleotide sequence (97392 bases) from GenBank144.

As you can see, the newer system is about 24% slower in user time (374.65 s versus 301.96 s), even though it has a faster CPU, more cache and faster memory.
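
The comparison can be made explicit with a small Python sketch that pulls the user time out of the two /usr/bin/time lines above and computes the ratio. The second figure weights each time by the nominal clock speed given earlier (2.8 and 3.0 GHz) as a rough estimate of cycles spent; that assumes both CPUs actually ran at their nominal clocks. The function names are hypothetical.

```python
import re

# First line of each /usr/bin/time report quoted above.
OLD = "301.96user 0.12system 5:02.08elapsed 100%CPU"
NEW = "374.65user 0.07system 6:14.73elapsed 99%CPU"

def user_seconds(time_line):
    """Extract the user CPU time in seconds from a GNU time summary line."""
    m = re.search(r"([\d.]+)user", time_line)
    if m is None:
        raise ValueError("no user time found in: " + time_line)
    return float(m.group(1))

def slowdown(old_line, new_line):
    """Ratio of new to old user time (above 1.0 means the new box is slower)."""
    return user_seconds(new_line) / user_seconds(old_line)

def cycle_ratio(old_line, new_line, old_ghz, new_ghz):
    """Ratio of estimated cycles (time * clock); ASSUMES nominal clock speeds."""
    new_cycles = user_seconds(new_line) * new_ghz
    old_cycles = user_seconds(old_line) * old_ghz
    return new_cycles / old_cycles

print(f"user-time ratio: {slowdown(OLD, NEW):.3f}")
print(f"estimated cycle ratio: {cycle_ratio(OLD, NEW, 2.8, 3.0):.3f}")
```

On a per-clock basis the gap is larger than the raw time ratio suggests, since the newer CPU also runs at a higher clock.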

It must be said that we couldn't run the test on identical OSes, but we tried to run the older system with kernel 2.4.27 and it was still noticeably faster.

We noticed that the PE2850 has a BIOS setting for tuning the system for applications that access memory sequentially or randomly. It has no effect on our tests.

I would appreciate any ideas as to why the seemingly better processor and memory in fact perform worse in practice.



Foster was a very early Xeon model. From your description, I'm guessing you're running a Nocona, maybe with the 32-bit OS. You don't say whether you are running a threaded application, or whether you have HT enabled on one or both machines. RH EL3 doesn't do a good job of scheduling on a Xeon with HT enabled. I haven't had a chance to see whether EL4 may be better.
If you do have a Nocona, your increased clock speed ought to offset the longer pipelines. If you are using gcc, you should use a version that supports the -march=nocona switch. You may need oprofile or VTune to get useful information about performance events.
