An evaluation of the impact of memory configuration on the performance of applications running on Intel® Xeon® processor 5500-series based servers

Last Modified On :   October 28, 2009 12:59 PM PDT

Introduction

Servers using the 5500 series of Intel® Xeon® processors have different memory configuration properties from previous Intel® Xeon® processors. A commonly used rule of thumb for configuring memory on systems running HPC applications has been “2 GB per physical core”; this should be reevaluated when configuring servers built with Intel® Xeon® processors in the 5500 series. This processor’s three-channel memory controller means that optimal memory configurations in a dual-socket server will use multiples of six DIMMs, whereas earlier dual-socket servers generally used multiples of four or eight.
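The arithmetic behind the multiples-of-six guideline can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and its parameters are assumptions introduced here, with defaults chosen to match a dual-socket server that has three memory channels per processor and three DIMM slots per channel.

```python
def balanced_dimm_counts(sockets=2, channels_per_socket=3, slots_per_channel=3):
    """Return the DIMM counts that populate every memory channel equally.

    All parameter names are hypothetical; the defaults describe a dual-socket
    server with three channels per processor and three slots per channel.
    """
    total_channels = sockets * channels_per_socket
    # One, two, ... DIMMs in every channel of every processor.
    return [total_channels * n for n in range(1, slots_per_channel + 1)]


print(balanced_dimm_counts())
```

With the defaults, the balanced counts are the multiples of six that fit in 18 slots.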

It is not the purpose of this paper to recommend a specific memory configuration. Rather, this paper illustrates the performance impact of different memory configurations on a set of High Performance Computing (HPC) applications. This paper compares the performance of 16 HPC application benchmarks on 12 different memory configurations.
The applications chosen for this study are representative of three performance characterization groups: (1) low memory bandwidth, (2) moderate to high I/O bandwidth and (3) moderate to high memory bandwidth. The memory configurations in this study use various combinations of 1, 2 and 4 GB DIMMs, which give a total memory size ranging from 12 to 36 GB.



Test Configuration

The base platform for this memory configuration experiment is a two-socket server:
Platform      2-Socket 2-U server
Baseboard     Supermicro X8DTN+ -IN001 Rev 1.02
Processors    Intel® Xeon® processor X5570, 2.93 GHz
Chipset       Intel® 5520
OS            Red Hat* Enterprise Linux 5, Update 3

Twelve different memory configurations were used in this study, with the total memory size on the test system ranging from 12 to 36 GB. The baseboard on this Supermicro platform has three DIMM slots on each of the processor’s three memory channels. This gives nine DIMM slots connected to each processor’s memory controller, for a total of 18 DIMM slots on the platform.

The various memory configurations are more fully described in the Appendix. For the twelve memory configurations used in this study, the DIMM sizes and placement were identical on each processor’s nine DIMM slots.

The DIMM placement can be either uniform or non-uniform. Uniform DIMM placement means that each memory channel has the same number of DIMMs, and that the DIMM sizes and placement are identical across all six memory channels on the test system.

Non-uniform DIMM placement resulted in slower performance than uniform placement. Therefore, the main body of this paper will concentrate on five of the uniform configurations. The results for all twelve configurations, including all of the non-uniform configurations, are listed in the appendix.

The five uniform configurations considered in this part of the paper are described in the following table. The nine digits in the DIMM Placement columns show the DIMM size in GB used in the three slots in each of the three memory channels; a 0 marks an empty slot.

Total Memory   Memory Speed   DIMM Placement, Processor 1   DIMM Placement, Processor 2
18 GB          800 MHz        111-111-111                   111-111-111
18 GB          1067 MHz       210-210-210                   210-210-210
24 GB          1067 MHz       220-220-220                   220-220-220
24 GB          1067 MHz       400-400-400                   400-400-400
24 GB          1333 MHz       400-400-400                   400-400-400
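The placement notation can be turned into a total memory size with a short calculation. The following Python sketch is illustrative only; the function name is an assumption, and it takes each digit to be a DIMM size in GB (0 for an empty slot) with the same placement on both processors, as in this study.

```python
def total_memory_gb(placement, processors=2):
    """Total memory in GB for a placement string such as '210-210-210'.

    Each digit is the size in GB of the DIMM in one slot; 0 is an empty slot.
    The same placement is assumed on every processor.
    """
    per_processor = sum(int(ch) for ch in placement if ch.isdigit())
    return per_processor * processors


for placement in ("111-111-111", "210-210-210", "220-220-220", "400-400-400"):
    print(placement, total_memory_gb(placement), "GB")
```

For example, 210-210-210 gives 3 GB per channel, 9 GB per processor and 18 GB for the system.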

Benchmark workloads from 16 applications were selected to illustrate the performance impact of the various memory configurations. All but one of these applications run with 12 GB of memory without paging. These benchmarks were selected as representative applications characterized by

  1. Low memory bandwidth
  2. Moderate to high I/O bandwidth
  3. Moderate to high memory bandwidth

The benchmark workloads selected for each of these groups are summarized in the following table:

Characterization                    Application and Version    Workload
Low memory bandwidth                ABAQUS-std* v6.8-2         s2a
                                    Amber* v9                  nine standard workloads
                                    BlackScholes* v3.0         one standard workload
                                    BLAST* v2.2.18             one standard workload
                                    MonteCarlo* v0.1           one standard workload
Low to high I/O bandwidth           ABAQUS-std* v6.8-2         s4b
                                    Gaussian* g03-E.01         apinefreq
                                    MD.NASTRAN* R3             xl0xdy0
                                    MD.NASTRAN* R3             xx0cmd2
Moderate to high memory bandwidth   E3D* vFinal                SEG_Subsalt
                                    Eclipse* v2008.1           ONEM1
                                    Fluent* v12.0.9 Beta       sedan_4m
                                    LS-DYNA* mpp971_R3.2.1     car2car
                                    MILC* v7.6.2b              Medium-NSFt2
                                    POP* v3.0                  x1
                                    WRF* v2.2.1                conus12



Summary of Performance Results

In the summary graphs that follow, results for all applications are shown relative to the results for the 24GB-1333 400-400-400 configuration. This configuration is expected to deliver the best memory performance because it is able to run the memory at the highest speed, 1333 MHz. The Appendix contains a detailed description of the results on the individual applications in these three groups.
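One plausible way to express a result relative to the baseline is sketched below in Python. The paper does not state the exact metric, so this sketch assumes that each benchmark reports an elapsed time and that higher relative values mean faster runs; the function name and the timings are hypothetical.

```python
def relative_performance(baseline_seconds, config_seconds):
    """Performance of a configuration relative to the baseline.

    Assumption: both arguments are elapsed times, so a value above 1.0 means
    the configuration ran faster than the baseline and a value below 1.0
    means it ran slower.
    """
    if baseline_seconds <= 0 or config_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return baseline_seconds / config_seconds


# Hypothetical timings, not measurements from this study.
print(relative_performance(baseline_seconds=100.0, config_seconds=125.0))
```

Under this convention the baseline configuration always scores 1.000, which matches the way the results are tabulated in the Appendix.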

Applications with low memory bandwidth requirements
As shown on the following graph, the five applications characterized by low memory bandwidth requirements show no significant differences in performance on the five uniform memory configurations. The slight variation in results is attributed to experimental noise.

These results were expected as these applications fit within memory and their performance does not depend on memory bandwidth. There is essentially no performance difference between these memory configurations.

Applications with moderate to high I/O bandwidth requirements
Four of the benchmarks used in this study perform significant write and read operations during the execution of the benchmark workload. These applications were selected to illustrate the performance impact of memory configuration on applications whose I/O bandwidth ranges from moderate to high. I/O performance is partially a function of the size and speed of the memory subsystem, as the operating system maintains a file buffer cache to help improve I/O latency and performance.

The Gaussian apinefreq workload has a relatively low I/O bandwidth. The results of this benchmark on the five memory configurations are similar to those of the low memory bandwidth applications described above: there is little performance difference between the configurations.

The MD.NASTRAN xl0xdy0 workload can be characterized as having a moderate I/O bandwidth.  The two 18 GB configurations run about 8% slower than the baseline.

The remaining two applications in this group show much different performance responses to the different memory configurations. To a large degree they track the amount of memory available to the operating system for the file buffer cache.

The s4b workload causes ABAQUS-std to execute as a direct sparse linear equation solver. Its static analysis indicates that the amount of memory needed to minimize actual disk I/O is 31 GB. Since the largest memory size for these five memory configurations is 24 GB, a significant amount of I/O is performed.

The MD.NASTRAN xx0cmd2 workload was the one application that did not run in 12 GB of memory. When run with eight processes, this workload performs almost 200,000 write operations and over 800,000 read operations, most with a buffer size of 256 KB. The high-water mark for the size of the scratch files is over 40 GB.
The results for both of these workloads show that the performance of the two 18 GB memory configurations is similar and significantly slower than the three 24 GB configurations. This is attributed to the increased memory available to the operating system for the file buffer cache.
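A rough upper bound on the data moved by the xx0cmd2 workload follows from the operation counts quoted above. The Python sketch below is a back-of-the-envelope estimate, not a measurement: it assumes every operation uses the full 256 KB buffer (the text says only that most do), treats 256 KB as 256 × 1024 bytes, and rounds the counts to 200,000 writes and 800,000 reads. The function name is hypothetical.

```python
BYTES_PER_KIB = 1024
BYTES_PER_GIB = 1024 ** 3


def io_volume_gib(operations, buffer_kib=256):
    """Approximate data volume in GiB if every operation moves a full buffer."""
    return operations * buffer_kib * BYTES_PER_KIB / BYTES_PER_GIB


writes = 200_000  # "almost 200,000 write operations"
reads = 800_000   # "over 800,000 read operations"

print(f"written: {io_volume_gib(writes):.1f} GiB")
print(f"read:    {io_volume_gib(reads):.1f} GiB")
```

Under these assumptions the workload moves on the order of 240 GiB in total, several times the 40 GB high-water mark of its scratch files, which suggests that the same data is reread many times and helps explain why a larger file buffer cache pays off.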

Applications with moderate to high memory bandwidth requirements
The third group of application workloads is characterized by moderate to high memory bandwidth. The expectation is that applications in this group will show significant performance differences based on the memory configuration. The graph below confirms this expectation.

The 18 GB configuration running at 800 MHz is clearly the slowest of the five configurations shown. In all but two cases it is significantly slower (by more than 5%) than the other 18 GB configuration, which runs at 1067 MHz.

The two 24 GB configurations running at 1067 MHz show some of the difference between dual- and quad-ranked DIMMs. The 2 GB DIMMs used in this study are dual-ranked and the 1067 MHz 4 GB DIMMs are quad-ranked. On average, these applications run about 2½% faster on the quad-ranked 4 GB DIMMs than on the dual-ranked 2 GB DIMMs.
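The size of the dual- versus quad-ranked difference can be checked against the relative results listed in Appendix B for the two 24 GB, 1067 MHz configurations. The Python sketch below averages the per-application ratios; using the arithmetic mean of the ratios is an assumption made here, since the paper does not say how its figure was derived.

```python
# Relative results from Appendix B, 24GB-1067 configurations.
DUAL_RANKED = {  # 220-220-220, 2 GB dual-ranked DIMMs
    "E3D": 0.928, "Eclipse": 0.981, "Fluent": 0.965, "LS-DYNA": 0.916,
    "MILC": 0.949, "POP": 0.961, "WRF": 0.914,
}
QUAD_RANKED = {  # 400-400-400, 4 GB quad-ranked DIMMs
    "E3D": 0.958, "Eclipse": 0.987, "Fluent": 0.980, "LS-DYNA": 0.925,
    "MILC": 0.967, "POP": 0.980, "WRF": 0.979,
}

# Ratio above 1.0 means the quad-ranked configuration was faster.
ratios = {app: QUAD_RANKED[app] / DUAL_RANKED[app] for app in DUAL_RANKED}
mean_ratio = sum(ratios.values()) / len(ratios)

for app, ratio in ratios.items():
    print(f"{app:8s} {100 * (ratio - 1):+.1f}%")
print(f"mean     {100 * (mean_ratio - 1):+.1f}%")
```

The mean comes out near 2.5%, although the individual applications range from well under 1% to about 7%.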


Summary

Evaluation of various memory configurations in dual-socket servers based on the 5500 series Intel® Xeon® processors indicates that the performance of applications that make high demands on memory bandwidth may benefit from uniform memory configurations and the fastest available memory. Applications that have more modest memory bandwidth requirements may achieve satisfactory performance with more flexibility in their configurations. The following observations and recommendations may also be useful:

  • A system should be configured with sufficient memory to prevent swapping.
  • For best performance, DIMM sizes and placement should be uniform across all memory channels.
  • Applications that have high memory bandwidth requirements will likely perform fastest on systems configured with the fastest memory system.  This is achieved with one 1333 MHz DIMM in each memory channel.
  • Applications that have high I/O bandwidth requirements often perform faster on systems configured with additional memory, which can increase the efficacy of the operating system’s file buffering.
  • On systems running a heterogeneous mixture of applications, no single memory configuration may be ideal. The best compromise is often six of the fastest and largest DIMMs configured with one DIMM per channel.


Appendix A: Memory Configuration Details

Twelve different memory configurations were used in this experiment, using various combinations of 1, 2 and 4 GB DIMMs and giving a total memory size ranging from 12 to 36 GB. The DIMMs used in this study are described in the following table:

DIMM Size   Manufacturer   Description       Model Number
1 GB        Qimonda*       1Rx8 PC3-8500R    IMSH1GP03A1F1C-10F T2 B3L85111004
2 GB        Qimonda*       2Rx8 PC3-10600R   IMSH2GP13A1F1C-13H T2 B3S82336006
4 GB        Micron*        4Rx8 PC3-8500R    MT36JSZF51272PDY-1G1DYESDD BZAECGB001
4 GB        Micron*        2Rx4 PC3-10600P   MT36JSZF51272PY-1G4DZES BZAE0GSA04


The server has a NUMA memory architecture, with the memory controller in each processor supporting three channels with up to three DIMMs per channel. For this experiment the DIMM sizes and positions on each processor are identical for each memory combination tested.

The speed of the memory is a function of the manufacturer’s rating as well as the number of DIMMs used in the memory channels. If a system is configured with two DIMMs per channel, then the BIOS will enforce a maximum speed of 1067 MHz even if the individual DIMMs are rated at 1333 MHz. If all three DIMM slots in a single channel are used, the memory speed will be set to 800 MHz.
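The speed rule described above can be written down as a small lookup. The Python sketch below is an illustration of that rule, not BIOS logic: the function and constant names are assumptions, the 1333 MHz ceiling for one DIMM per channel is inferred from the configurations in this study, and PC3-8500 and PC3-10600 parts are taken to be rated at 1067 MHz and 1333 MHz respectively.

```python
# Maximum memory speed in MHz by the largest number of DIMMs in any channel.
SPEED_CAP_BY_DIMMS_PER_CHANNEL = {1: 1333, 2: 1067, 3: 800}


def memory_speed_mhz(max_dimms_per_channel, slowest_dimm_rating_mhz):
    """Expected memory speed given channel population and DIMM rating.

    Hypothetical helper: the result is the lower of the population-based cap
    and the rating of the slowest DIMM installed.
    """
    cap = SPEED_CAP_BY_DIMMS_PER_CHANNEL[max_dimms_per_channel]
    return min(cap, slowest_dimm_rating_mhz)


print(memory_speed_mhz(1, 1333))  # one PC3-10600 DIMM per channel
print(memory_speed_mhz(1, 1067))  # one PC3-8500 DIMM per channel
print(memory_speed_mhz(2, 1333))  # two DIMMs per channel
print(memory_speed_mhz(3, 1333))  # three DIMMs per channel
```

This reproduces the speeds of the configurations in this study, including the two 400-400-400 cases, which differ only in the rating of the 4 GB DIMMs used.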

The specific memory configurations used in this study are listed in the following table:

Total Memory   Processor 1 DIMM Configuration   Processor 2 DIMM Configuration   Memory Speed
12 GB          200-200-200                      200-200-200                      1333 MHz
16 GB          220-220-000                      220-220-000                      1067 MHz
16 GB          210-210-200                      210-210-200                      1067 MHz
16 GB          220-200-200                      220-200-200                      1067 MHz
18 GB          111-111-111                      111-111-111                      800 MHz
18 GB          210-210-210                      210-210-210                      1067 MHz
20 GB          220-220-110                      220-220-110                      1067 MHz
20 GB          220-220-200                      220-220-200                      1067 MHz
24 GB          220-220-220                      220-220-220                      1067 MHz
24 GB          400-400-400                      400-400-400                      1067 MHz
24 GB          400-400-400                      400-400-400                      1333 MHz
36 GB          222-222-222                      222-222-222                      800 MHz


While there are other possible configurations, it is believed that the twelve used in this study are representative of other possible DIMM placements.
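The uniform and non-uniform configurations can be separated mechanically from the placement strings. The Python sketch below is illustrative; the function name is an assumption, and it treats a placement as uniform when all three channel groups are identical, which covers both the number and the sizes of the DIMMs in each channel.

```python
# Processor DIMM configurations of the twelve cases in this study.
CONFIGURATIONS = [
    "200-200-200", "220-220-000", "210-210-200", "220-200-200",
    "111-111-111", "210-210-210", "220-220-110", "220-220-200",
    "220-220-220", "400-400-400", "400-400-400", "222-222-222",
]


def is_uniform(placement):
    """True when every memory channel holds the same DIMMs in the same slots."""
    channels = placement.split("-")
    return all(channel == channels[0] for channel in channels)


non_uniform = [p for p in CONFIGURATIONS if not is_uniform(p)]
print(len(non_uniform), "non-uniform:", ", ".join(non_uniform))
```

Applied to the twelve configurations, this marks seven as uniform and five as non-uniform.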


Appendix B: Benchmark Results

Benchmarks from 16 HPC applications were used in this experiment. The applications were selected, with just one exception, to run in 12 GB of memory without swapping, since one of the test cases for this study was a configuration with 12 GB. The one exception was one of the I/O intensive workloads.
The results shown in the following tables are relative to the 400-400-400 1333 MHz case.

Applications with low memory bandwidth
Workloads from five applications that are characterized by low memory bandwidth were selected for this study. They are:

  • ABAQUS-std v6.8-2, s2a workload

Abaqus/Standard is a general-purpose solver using a traditional implicit integration scheme to solve finite element analyses. The s2a workload is a mildly nonlinear static analysis of a flywheel with centrifugal loading. It is a 474,744 DOF model with a moderate iteration count.

  • Amber v9, nine standard workloads

A molecular dynamics program used to calculate properties of macromolecular systems. The value reported is the geometric mean of this application’s nine standard workloads.

  • BlackScholes v3.0

BlackScholes models the market for an equity using the Black-Scholes formula.

  • BLAST v2.2.18

Bioinformatics code used to perform similarity searches against databases of genome or protein sequences.

  • MonteCarlo v0.1

A financial simulation engine using the Monte Carlo technique.
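The Amber result is reported as a geometric mean of nine workloads. A minimal Python sketch of that calculation follows; the function name is an assumption and the nine values are hypothetical placeholders, not measurements from this study.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed through logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one positive value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical relative results for nine workloads.
workload_results = [0.97, 0.99, 1.00, 0.98, 1.01, 0.96, 0.99, 1.00, 0.98]
print(round(geometric_mean(workload_results), 3))
```

The geometric mean is a common choice for combining ratios because no single workload dominates the combined score.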

The expectation is that the benchmark results for applications with this characterization should show little difference on the twelve memory configurations tested. Their results confirmed this expectation as shown by the following table:

Memory Configuration     ABAQUS-std s2a   Amber   BlackScholes   BLAST   MonteCarlo
12GB-1333 200-200-200    0.994            0.997   1.001          0.997   0.996
16GB-1067 220-220-000    0.982            0.967   1.001          0.995   0.997
16GB-1067 210-210-200    0.985            0.977   1.001          0.996   1.001
16GB-1067 220-200-200    0.982            0.970   1.001          0.996   1.003
18GB-800 111-111-111     0.988            0.974   1.001          0.997   0.993
18GB-1067 210-210-210    0.988            0.995   1.001          0.998   0.998
20GB-1067 220-220-110    0.982            0.979   1.000          0.999   0.991
20GB-1067 220-220-200    0.994            0.995   1.001          1.000   1.001
24GB-1067 220-220-220    0.997            0.993   0.997          1.005   1.001
24GB-1067 400-400-400    0.997            1.014   1.001          1.002   1.001
24GB-1333 400-400-400    1.000            1.000   1.000          1.000   1.000
36GB-800 222-222-222     0.980            1.000   1.001          1.003   1.001


The results for the BlackScholes, BLAST and MonteCarlo applications show less than 1% difference across all twelve memory configurations. This is within the observed run-to-run variability for benchmarking these applications. ABAQUS-std shows about a 2% performance degradation on some of the smaller non-uniform memory configurations; Amber shows about a 3% performance degradation on these configurations. Although these degradations exceed normal run-to-run variation, they are not considered significant, where significant means more than 5%.
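The two thresholds used in this discussion, about 1% for run-to-run noise and 5% for significance, can be applied to any relative result. The Python sketch below is one way to do so; the function name and the wording of the three labels are assumptions made for illustration.

```python
NOISE_THRESHOLD = 0.01        # differences below 1% are treated as noise
SIGNIFICANT_THRESHOLD = 0.05  # differences above 5% are treated as significant


def classify(relative_result):
    """Label a result that is expressed relative to a baseline of 1.000."""
    difference = abs(relative_result - 1.0)
    if difference < NOISE_THRESHOLD:
        return "within run-to-run variation"
    if difference <= SIGNIFICANT_THRESHOLD:
        return "measurable but not significant"
    return "significant"


for value in (0.997, 0.982, 0.967):
    print(value, classify(value))
```

Values that sit exactly on a threshold are sensitive to floating-point rounding, so the sketch is best read as a guide rather than a strict test.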

Applications with moderate to high I/O bandwidth
Four of the benchmarks used in this study perform a significant amount of disk I/O and can be characterized as having moderate to high I/O bandwidth. I/O performance is partially a function of memory size and speed, as the operating system maintains a file buffer cache to help improve I/O latency and performance. With more memory in the system, the OS can maintain a larger buffer cache.

The applications in this group are:

  • ABAQUS-std v6.8-2, s4b workload

This is the same ABAQUS-std described in the previous group. The s4b workload is a mildly nonlinear static analysis that simulates bolting a cylinder head onto an engine block. It is a 5,000,000 DOF model with a low iteration count.

  • Gaussian g03-E.01, apinefreq workload

Gaussian is a quantum chemistry code.

  • MD.NASTRAN R3, xl0xdy0 workload

NASTRAN is a general purpose finite element analysis solution for small to complex assemblies. The xl0xdy0 workload is a model of a truck crash, has 286,216 DOF and uses the explicit nonlinear solution sequence.

  • MD.NASTRAN R3, xx0cmd2

This workload is a car body model with 1,315,340 DOF using the Normal Modes Analysis solution sequence with ACMS.

The Gaussian apinefreq workload has a relatively low I/O bandwidth. The results of this benchmark on the twelve memory configurations look similar to those of the low memory bandwidth applications shown above: there is some performance degradation, up to 3%, on the smaller non-uniform configurations.

Memory Configuration     ABAQUS-std s4b   Gaussian.E apinefreq   MD.NASTRAN xl0xdy0   MD.NASTRAN xx0cmd2
12GB-1333 200-200-200    0.798            1.005                  0.952                n/a
16GB-1067 220-220-000    0.733            0.970                  0.821                0.417
16GB-1067 210-210-200    0.727            0.993                  0.935                0.440
16GB-1067 220-200-200    0.678            0.979                  0.931                0.407
18GB-800 111-111-111     0.804            0.984                  0.919                0.695
18GB-1067 210-210-210    0.840            0.998                  0.932                0.749
20GB-1067 220-220-110    0.711            0.989                  0.884                0.845
20GB-1067 220-220-200    0.717            0.992                  0.931                0.750
24GB-1067 220-220-220    1.006            0.999                  1.023                0.962
24GB-1067 400-400-400    1.027            1.004                  0.982                0.953
24GB-1333 400-400-400    1.000            1.000                  1.000                1.000
36GB-800 222-222-222     1.949            0.989                  0.874                1.159


The MD.NASTRAN xl0xdy0 workload has a moderate I/O bandwidth, and this benchmark shows a performance degradation of about 18% on some of the non-uniform memory configurations. The two memory configurations running at 800 MHz also show significant performance degradations.

The remaining two applications in this group, ABAQUS-std v6.8-2 s4b and MD.NASTRAN R3 xx0cmd2, show much different performance responses to the different memory configurations.

The s4b workload actually fits in memory for the 36 GB configuration. Consequently it runs almost twice as fast as the 24 GB baseline configuration. The performance of this benchmark also suffered on the non-uniform DIMM configurations.

The best configuration for the MD.NASTRAN R3 xx0cmd2 benchmark was on the 36 GB configuration, running about 16% faster than the 24 GB baseline. This shows the benefit of the operating system’s larger file buffer cache. The effect of the 800 MHz and 1333 MHz memory speed is also evident in these results. The results on the two 24 GB configurations running at 1067 MHz are about 4% slower than the 24 GB baseline running at 1333 MHz. Likewise the 18 GB configuration running at 800 MHz is about 5% slower than the other 18 GB configuration running at 1067 MHz.

Applications with moderate to high memory bandwidth
The third group of application benchmarks is characterized by moderate to high memory bandwidth. The expectation is that applications in this group will show significant performance differences based on the memory configuration. The applications and workloads in this group are:

  • E3D vFinal, SEG_Subsalt workload

E3D is a seismic code used to “see” the underground geographical formations of oil and gas reservoirs.

  • Eclipse v2008.1, ONEM1 workload

Eclipse is an oil reservoir simulation code.

  • Fluent v12.0.9 Beta, aircraft_2m workload

Fluent is a computational fluid dynamics code.

  • Fluent v12.0.9 Beta, sedan_4m workload

Fluent is a computational fluid dynamics code.

  • LS-DYNA mpp971_R3.2.1, car2car workload

LS-DYNA is a general purpose transient dynamic finite element program.

The car2car workload is a simulation of head-on collision of two vehicles. It is similar to car crash analysis models used by automotive companies.

  • MILC v7.6.2b, Medium-NSFt2 workload

MILC is a quantum chromodynamics code.

  • POP v3.0, x1 workload

POP is an ocean circulation model derived from earlier models of Bryan, Cox, Semtner and Chervin in which depth is used as the vertical coordinate.

  • WRF v2.2.1,  conus12 workload

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs.

The conus12 workload is a 48-hour, 12 km resolution case over the Continental U.S. (CONUS) domain for October 24, 2001.

The applications selected for this group are examples of applications in the CAE, Energy, QCD and Numerical Weather Simulation classes of HPC applications. Their performance results are shown in the following table:

Memory Configuration     E3D           Eclipse   Fluent     LS-DYNA   MILC           POP     WRF
                         SEG_Subsalt   ONEM1     sedan_4m   car2car   Medium-NSFt2   x1      conus12
12GB-1333 200-200-200    0.996         0.998     1.004      0.899     0.998          0.999   0.912
16GB-1067 220-220-000    0.649         0.817     0.799      0.774     0.698          0.776   0.731
16GB-1067 210-210-200    0.791         0.922     0.911      0.943     0.861          0.902   0.836
16GB-1067 220-200-200    0.658         0.817     0.806      0.785     0.703          0.781   0.712
18GB-800 111-111-111     0.741         0.850     0.845      0.889     0.795          0.864   0.797
18GB-1067 210-210-210    0.820         0.945     0.944      0.912     0.932          0.955   0.850
20GB-1067 220-220-110    0.759         0.904     0.873      0.821     0.813          0.871   0.783
20GB-1067 220-220-200    0.690         0.843     0.867      0.893     0.763          0.806   0.775
24GB-1067 220-220-220    0.928         0.981     0.965      0.916     0.949          0.961   0.914
24GB-1067 400-400-400    0.958         0.987     0.980      0.925     0.967          0.980   0.979
24GB-1333 400-400-400    1.000         1.000     1.000      1.000     1.000          1.000   1.000
36GB-800 222-222-222     0.738         0.862     0.843      0.913     0.798          0.868   0.822


The applications selected for this group clearly show the effect of the memory speed on performance. Two of the configurations were able to run the DIMMs at 1333 MHz: the 24 GB baseline with 6 x 4 GB DDR3-1333 DIMMs and the 12 GB configuration with 6 x 2 GB DDR3-1333 DIMMs. Five of the seven applications in this set had almost identical results on these two 1333 MHz configurations. Both 1333 MHz configurations ran these benchmarks significantly faster than the two configurations with three DIMMs per channel, which ran the memory at 800 MHz.

Five of the memory configurations used in this study were non-uniform in that the processor’s three memory channels did not have the same number or size of DIMMs. On three of these configurations (16 GB 220-220-000, 16 GB 220-200-200 and 20 GB 220-220-200) the benchmarks for these eight applications performed the slowest of the twelve configurations tested. On average they were about 20% to 25% slower than the baseline.
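The average slowdown quoted for the three slowest non-uniform configurations can be recomputed from the results table in this section. The Python sketch below uses the seven tabulated results per configuration and a plain arithmetic mean; both choices are assumptions, since the paper does not describe how its average was formed.

```python
# Relative results, in table order: E3D, Eclipse, Fluent, LS-DYNA, MILC, POP, WRF.
SLOWEST_NON_UNIFORM = {
    "16GB 220-220-000": [0.649, 0.817, 0.799, 0.774, 0.698, 0.776, 0.731],
    "16GB 220-200-200": [0.658, 0.817, 0.806, 0.785, 0.703, 0.781, 0.712],
    "20GB 220-220-200": [0.690, 0.843, 0.867, 0.893, 0.763, 0.806, 0.775],
}


def mean_slowdown(relative_results):
    """Average shortfall against the baseline, as a fraction of 1.000."""
    return 1.0 - sum(relative_results) / len(relative_results)


slowdowns = {name: mean_slowdown(r) for name, r in SLOWEST_NON_UNIFORM.items()}
for name, value in slowdowns.items():
    print(f"{name}: {100 * value:.0f}% slower than the baseline")
```

The three averages fall between roughly 19% and 25%, consistent with the range stated above.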