<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Sat, 26 May 2012 06:00:41 -0700 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/articles/xeon/type/technical-article/feed/" rel="self" type="application/rss+xml" />
    <title>Intel Software Network articles Feed</title>
    <link>http://software.intel.com/en-us/articles/xeon/type/technical-article/</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Superscalar programming 101 (Matrix Multiply) Part 5 of 5</title>
      <description><![CDATA[ In <a href="http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-4/">part 4</a> we saw the effects of the QuickThread Parallel Tag Team Transpose method of Matrix Multiplication performed on a Dual Xeon 5570 system with 2 sockets and two L3 caches, each shared by four cores (8 threads), and with each processor having four L2 and four L1 caches, each shared by one core (2 threads):<br /><br />
<p ><img height="366" width="567" src="http://software.intel.com/file/29844" /></p>
<br />
<div >Fig 18 (17 on part-4)<br /></div>
<br />The Intel Many Core Testing Laboratory was kind enough to provide me with some time on their systems. <a href="http://software.intel.com/en-us/articles/intel-many-core-testing-lab/">http://software.intel.com/en-us/articles/intel-many-core-testing-lab/</a><br /><br />Running the same method (sans Cilk++) on a 4 processor Intel Xeon 7560 system, each processor with 8 cores plus Hyper Threading (total of 32 cores, 64 threads), we observe:<br /><br />
<p ><img height="389" width="567" src="http://software.intel.com/file/29848" /></p>
<br />
<div >Fig. 19<br /></div>
<br />In this chart we do not see a plateau in the scaling. This is due to the problem size at N=2048 being fully contained within the system cache. Caution: keep in mind that the above chart represents scaling relative to the cache-insensitive Serial method.<br /><br />When comparing this to the cache-sensitive Serial Transpose method we find a different set of results:<br /><br />
<p ><img height="389" width="567" src="http://software.intel.com/file/29849" /></p>
<br />
<div >Fig 20<br /></div>
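<br />As a reminder of how the two serial baselines differ, here is a minimal Python sketch. It is not the code used for these charts, and the function names <code>matmul_serial</code> and <code>matmul_transpose</code> are hypothetical. The first version strides down a column of the second matrix in its inner loop; the second transposes that matrix once so that both operands are read along rows. With row-major arrays in a compiled language, the second access pattern is the cache-friendlier one. Python lists are used here only to show the loop order, not to reproduce the timings.<br /><br />

```python
def matmul_serial(a, b):
    """Classic triple loop: the inner loop strides down a column of b."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_transpose(a, b):
    """Transpose b once, then form each output cell as a row-by-row dot product."""
    bt = [list(col) for col in zip(*b)]  # bt[j] holds column j of b as a row
    result = []
    for row in a:
        result.append([sum(x * y for x, y in zip(row, col)) for col in bt])
    return result
```

Both functions return the same product; only the order in which memory is touched differs.<br />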
<br />The sharp change in slope at N=1500-1700 is mainly due to the drop in performance of the reference data of the Serial Transpose method, rather than due to improvement in PTTx.<br /><br />The scaling factor (parallel performance / number of hardware threads) is often used as a decision factor in making a purchase. Let’s look at the scaling factor charts:<br /><br />
<p ><img height="377" width="567" src="http://software.intel.com/file/29850" /></p>
<br />
<div >Fig 21<br /></div>
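<br />The scaling factor plotted here can be written down in a few lines. The sketch below assumes that "parallel performance" means speed-up over a chosen serial baseline; the function names are hypothetical and the run times in the usage note are invented for illustration, not measurements from these systems.<br /><br />

```python
def speedup(baseline_seconds, parallel_seconds):
    """How many times faster the parallel run is than the chosen baseline."""
    return baseline_seconds / parallel_seconds


def scaling_factor(baseline_seconds, parallel_seconds, hardware_threads):
    """Speed-up per hardware thread; values above 1 indicate super scaling."""
    return speedup(baseline_seconds, parallel_seconds) / hardware_threads
```

For example, with an invented parallel time of 0.5 seconds on 64 threads, a slow baseline of 64 seconds gives <code>scaling_factor(64.0, 0.5, 64)</code> = 2.0, while a faster baseline of 16 seconds gives <code>scaling_factor(16.0, 0.5, 64)</code> = 0.5. The same parallel run looks very different depending on which serial method it is measured against.<br />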
<br />We find that as the problem size increases we observe a nice positive slope on the scaling factor. This looks exceptionally good. Too good to be believed. It is important to remember that the Serial method is not cache sensitive and is not a valid baseline for comparison. <br /><br />When we produce the scaling factor as compared to the cache-friendly Serial Transpose method we find a completely different picture:<br /><br />
<p ><img height="379" width="567" src="http://software.intel.com/file/29851" /></p>
<br />
<div >Fig 22<br /></div>
<br />This chart will really deflate your programmer’s ego. After all this hard work, we find that the scaling factor relative to a cache-sensitive Serial Transpose method does not pay off (factor crosses 1) until N = 1824.<br /><br />Comparing with the factors of the 2x Intel Xeon 5570, as factored against the Serial Transpose, we find:<br /><br />
<p ><img src="http://software.intel.com/file/29852" /></p>
<br />
<div >Fig 23<br /></div>
<br />As expected, both systems can attain super scaling at different problem sizes. This is due to the different amount of cache memory on each system.<br /><br />Although scaling factor provides a good perspective as to return on investment with respect to purchase of more processors, the scaling factor of one processor architecture is not meaningful when compared to a different processor architecture. A company ought to be interested in total return on investment, and this includes a time element. <br /><br />When looking at the time element, we get a completely different picture. When comparing the fastest method (QuickThread Parallel Tag Team Transpose with SSE intrinsic functions) we find:<br /><br />
<p ><img src="http://software.intel.com/file/29853" /></p>
<br />
<div >Fig 24<br /></div>
<br />When time is included as a determinant of cost benefit, we find that there is a rather drastic transition in the cost benefit ratio as you cross a particular threshold in the problem size (N=1400). The point being made here is to use appropriately sized test cases when making evaluations for purchase decisions. The cost/benefit and performance curves will not always be suitable for extrapolation.<br /><br />When we run the fastest method (QuickThread Parallel Tag Team Transpose with SSE intrinsic functions) on larger matrix sizes we find:<br /><br />
<p ><img src="http://software.intel.com/file/29854" /></p>
<br />
<div >Fig 25<br /></div>
<br />Matrixes up to N = 3000 to 4096 can be handled with 4 processors (32 cores / 64 threads); larger matrixes may require additional processors and/or a revised method. <br /><br />Conclusions up to this point:<br /><br />The fastest method, Parallel Tag Team Transpose with SSE intrinsic functions, relies on the QuickThread ability to schedule affinity-pinned threads by cache level proximity. The ability to coordinate work using cache sensitivity can pay off big in your optimization strategies.<br /><br />Larger matrix sizes could be handled in an improved manner with the same number of processors (4x 8 core with HT) when combined with an additional tiling strategy, which will include additional overhead. This is typically called the divide and conquer method, often used by parallel programmers.<br /><br />Taking the matrix at N = 5200 and splitting it in two along both axes yields four tiles of N = 2600. Each of the four output tiles is the sum of two tile products, so this requires 4 x 2 = 8 multiplications of the smaller tiles. The matrix at N = 2600 took approximately 0.33 seconds to compute, so the estimated computation time would be 0.33 x 8 = 2.64 seconds. This is an estimated 10x improvement over the un-tiled method, but it may not be fast enough for your purposes.<br /><br />Would divide and conquer be the best strategy to use?<br /><br />This depends on the interpretation of best.<br /><br />In terms of relative performance return for effort in programming, this may be so. However, looking at Fig 18 (17 in part 4) and comparing the Cilk++ method to the QuickThread Parallel Tag Team XMM method, we have demonstrated that by paying particular attention to cache locality (specifically, what’s in the L1, L2 and L3 caches, and when it is in those caches) you can attain an additional 1.4x to 2.5x boost in performance.<br /><br />I will attempt to lay out the strategy which I believe will make effective use of the system caches. 
While the sketch below won’t show the specific method, it will demonstrate the general plan of attack. <br /><br />The current Parallel Tag Team (transpose) method divides the work by L3, then subsequently L2 regions and then takes an L1 friendly path in producing the results. This strategy works exceptionally well up until the size of the matrix reaches a point where the execution path begins to evict data from the L3 cache. It is my postulation that by employing a method where you follow the same path of the Parallel Tag Team (transpose) method, but impose a clipping technique on the distance from the diagonal, that you can minimize L3 cache evictions. The chart below illustrates a clipped L2 path through the current L3 workspace.<br /><br /><br />
<p ><img src="http://software.intel.com/file/29855" /></p>
<br />
<div >Fig 26<br /></div>
<br />In the above chart, the general execution path follows the arrow. The colored (red) cells indicate the output cells whose results have been computed. The white cells indicate those output cells that have yet to be computed.<br /><br />In the current Parallel Tag Team (transpose) method all of the above cells would have been colored. In the proposed method for large matrixes, a clipping technique limits the distance from the diagonal of the output cells to be computed while processing the diagonal. N.B. The above is a simplification of a 1P system.<br /><br />In processing a large matrix, had the computations included the empty cells, the computation would first suffer L2 cache evictions to L3, then, at some size, evictions from L3. This postulation is possibly confirmed by Fig 25.<br /><br />
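The clipping rule itself (only compute output cells within a limited distance of the diagonal on the first pass) can be sketched as below. This is an illustration of the rule only, under stated assumptions: the function name and the <code>max_distance</code> parameter are hypothetical, and the cells are visited in plain row-major order rather than along the Parallel Tag Team path, which this article does not spell out.<br /><br />

```python
def split_by_diagonal_distance(n, max_distance):
    """Partition the n x n output cells into a band near the diagonal and the rest.

    near     : cells with |row - col| <= max_distance (computed on the first pass)
    deferred : remaining cells, left for a later pass
    """
    near, deferred = [], []
    for row in range(n):
        for col in range(n):
            if abs(row - col) <= max_distance:
                near.append((row, col))
            else:
                deferred.append((row, col))
    return near, deferred
```

For a 4 x 4 output with <code>max_distance = 1</code>, ten cells fall inside the band and six corner cells are deferred.<br /><br />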
<p ><img src="http://software.intel.com/file/29856" /></p>
<br />
<div >Fig 27<br /></div>
<br />In Fig 27 above (Fig 25 with arrows added), the red arrow depicts L2 evictions and the blue arrow depicts L3 evictions.<br /><br /><br />Back to Fig 26. Upon completion of the output cells in Fig 26 we will find:<br /><br />
<p ><img src="http://software.intel.com/file/29857" /></p>
<br />
<div >Fig 28<br /></div>
<br />Here the X’s mark the cells in the output L2 zone whose results have been completed. The blue cells represent columns (stored as rows) in the m2t array that are still residing in L2 cache, and the red cells represent row cells in the m1 array that are in the L2 cache. Additionally (not depicted by colorization), some portion of the bottom row(s) and right-most column(s) are still residing in the L1 cache. The remaining un-X’d white cells may, or may not, be residing in the L3 cache.<br /><br />The next computation sequence (subject to verification) ought to follow the sequence depicted by the arrows in Fig 29:<br /><br />
<p ><img src="http://software.intel.com/file/29858" /></p>
<br />
<div >Fig 29<br /></div>
<br />The red and blue ends of the output matrix should be processed in an alternating sequence as you progress along the arrows towards the first diagonal.<br /><br />In the divide and conquer method (tiling) mentioned earlier, you would process 4 smaller tiles twice each, or 8x the time of a smaller tile, presumably of a size found optimal for the L2 cache. The tiling method might benefit from residual data in L2, resulting in a run time of 6x to 8x that of the smaller matrix rather than a full 8x.<br /><br />In the proposed method (call it cross diagonal), for the size range depicted above in Figs 28 and 29, and based upon my prior experience with the Parallel Tag Team Transpose technique, it is estimated that it may be possible to produce the result in 1.5x to 2x the time of the smaller matrix, potentially besting the divide and conquer method (tiling) by a factor of 4x. It should be stressed that the actual differences may vary from this estimate. Extrapolation, as mentioned earlier, often does not follow the curve established by present data.<br /><br />I hope you have found my series of articles insightful. This article cannot convey the detail of the QuickThread Parallel Tag Team Transpose XMM method, whereas the code can. Those interested in obtaining a copy of the code and a demo license for QuickThread should feel free to contact me at my email address below. QuickThread runs on Windows and Linux systems: both x32 and x64 for Windows, but only x64 for Linux (Ubuntu and Red Hat).<br /><br />Jim Dempsey<br /><a href="mailto:jim@quickthreadprogramming.com">jim@quickthreadprogramming.com</a><br /><a href="http://www.quickthreadprogramming.com">www.quickthreadprogramming.com</a> ]]></description>
      <link>http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-5/</link>
      <pubDate>Tue, 24 Aug 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-5/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-5/</guid>
      <category>Parallel Programming</category>
      <category>Xeon</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>Superscalar programming 101 (Matrix Multiply) Part 4 of 5</title>
      <description><![CDATA[ In the <a href="http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-3/">last installment (Part 3)</a> we saw the effects of the QuickThread Parallel Tag Team method of Matrix Multiplication performed on two single processor systems:<br /><br />
<p ><img src="http://software.intel.com/file/29755" /></p>
<br />Here the Intel Q6600 (4 core – no HT), with two cores (two threads) sharing L1 and L2 caches, attained a 40x to 50x improvement over the serial method, and the Intel Core i7 920 (4 core – with HT), with four cores (eight threads) sharing one L3 cache and one core (two threads) sharing L1 and L2 caches, attained a 70x to 80x improvement. Let’s see how this performs using two processors, each similar to the Core i7 920.<br /><br />When run on a Dual Xeon 5570 system with 2 sockets and two L3 caches, each shared by four cores (8 threads), and with each processor having four L2 and four L1 caches, each shared by one core (2 threads), we find:<br /><br />
<p ><img height="370" width="572" src="http://software.intel.com/file/29756" /></p>
<br />
<div >Fig 17<br />
<div ><br />for a scale to serial of 140x to 150x in the N = 700 to 1344 range. The performance is almost twice that of the Core i7 920. This was somewhat expected.<br /><br />There are some interesting observations to be made about this performance profile. While the 2x speed-up was expected, the Parallel Transpose method performed as well as the Parallel Tag Team method for N = 700 to 1024, then dropped off precipitously. This is about half of the peak performance range of the Parallel Tag Team method (700 to 1344). <br /><br />Why are the plateaus the same height?<br />What explains the difference in where the drop-off occurs?<br /><br />The plateaus are the same height for the same reason we saw in Fig 5 and Fig 6, where the Serial Transpose and the Parallel Transpose performance were essentially the same (yellow and red lines in Fig 17 above). The reason is a resource bandwidth limitation. In Fig 5 and Fig 6 the limiting resource appeared to be memory bandwidth (because the Parallel Tag Team method had ample room to outperform Parallel Transpose). Here, because the plateaus are nearly equal (for N = 700 to 1024), some resource other than memory bandwidth appears to be the limiting factor. This leaves cache access overhead or an SSE floating point bottleneck.<br /><br />Both of these bottlenecks will tend to clip the height of the performance curve but not the width. You can observe in the chart above that the two Parallel Tag Team methods managed to double the breadth of the peak performance curve, thus permitting larger matrices to be handled effectively by the program. 
The increase in breadth (larger matrixes handled) is principally due to more effective reuse of cached data, which results from the solution path through the problem (the sequence in which computations are made).<br /><br />The insight learned from Fig 17 is: when your problem's working data set exceeds the capacity of the cache system, you may find some paths to the solution more efficient than a simple nested loop.<br /><br />In the 5th article we will explore how we can extend the performance curve to handle larger matrixes. Will this involve more cores/CPUs and/or a different solution path? You will have to wait for the <a href="http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-5/">next installment (Part 5)</a> to find out.<br /><br />Jim Dempsey<br /><a href="mailto:jim@quickthreadprogramming.com">jim@quickthreadprogramming.com</a><br /><a href="http://www.quickthreadprogramming.com">www.quickthreadprogramming.com</a><br /></div>
</div> ]]></description>
      <link>http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-4/</link>
      <pubDate>Tue, 17 Aug 2010 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-4/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-4/</guid>
      <category>Parallel Programming</category>
      <category>Xeon</category>
      <category>Server Developer Community</category>
    </item>
    <item>
      <title>The Cost Benefit Case for Database Migration to Intel Servers</title>
      <description><![CDATA[ <p ><i>Value Proposition For Migration:<br />Cost/Benefit Case For IBM DB2 9.7 And Intel Xeon Processor 5500 And 7400 Series-Based Servers</i></p>
<p ><img src="http://software.intel.com/file/23676" /></p>
<p><b>Consolidation Opportunities</b></p>
<p>At yearend 2004, the typical U.S. Fortune 500 corporation contained fewer than 300 server database instances. By the end of 2009, the number will have increased to more than 2,000. Similar trends have occurred in midsize business, in the public sector and in other types of organization worldwide. The fastest rates of growth have been among databases deployed on small x86 servers.</p>
<p>Multiplication of server databases has contributed to “server sprawl,” resulting in low levels of utilization, unnecessary duplication of resources, and inflation of system administration and facilities costs.</p>
<p>Although server consolidation has become pervasive, to date it has been more commonly applied to application and infrastructure servers, rather than database servers. Database consolidation has often raised complex performance issues, making it more difficult to plan for and prepare for initiatives.</p>
<p>One implication is that, in many organizations, the potential for database server consolidation has been little exploited. At a time of economic pressures, it is an obvious area of potential cost savings. Key technology shifts have made consolidation increasingly viable. More powerful multicore processors, along with the growing sophistication of server and database platforms are creating new opportunities.</p>
<p>This report examines the cost savings that may be realized by upgrading and consolidating IBM DB2 databases. Three-year costs are compared for the following:</p>
<ul>
<li>2005 technologies: DB2 Version 8.2 is deployed on xSeries 335 two-socket servers with single-core Intel Xeon processors and the Windows Server 2003 operating system.</li>
<li>Current technologies: DB2 Version 9.7 is deployed on (1) IBM System x3550 M2 two-socket servers with quad-core Intel Xeon 5500 processors, and (2) IBM System x3850 M2 four-socket servers with six-core Intel Xeon 7400 processors. The Windows 2008 operating system is employed on both System x platforms. </li>
</ul>
<p>Savings are realized in a number of areas, including hardware maintenance, support for DB2 databases and Windows operating systems, and system administration and energy costs.</p>
<p>Calculations for DB2 9.7 deployed on System x3550 M2 and x3850 M2 servers allow for transition costs. These include acquisition and installation of new servers, along with database consolidation, staff retraining and related costs.</p>
<p><b>Cost Comparisons</b></p>
<p>Comparisons are based on six installations with between 25 and 231 DB2 instances employed for a variety of applications in manufacturing, aerospace, government, IT services, insurance and financial services organizations.</p>
<p>Numbers of instances, servers and full time equivalent (FTE) system administration (sysadmin) personnel for use of 2005 technologies are based on user-supplied data. Although organizations employed a variety of two-socket x86 servers, installed bases were normalized to use of DB2 8.2 and IBM xSeries 335 server models for calculation purposes.</p>
<p>Scenarios were then developed for migration of DB2 instances to the latest DB2 Version 9.7 and consolidation of these to System x3550 M2 and x3850 M2 servers. Scenarios draw upon the experiences of more than 30 organizations that have conducted DB2 consolidation initiatives. They are consistent with “best practice” norms for the numbers of instances and workloads that may run on these platforms.</p>
<p>DB2 instances include mixes of DB2 Enterprise Edition and Workgroup Edition, while servers are configured with Enterprise and Standard Editions of Windows Server 2003 and (for Current Technologies scenarios) Windows Server 2008.</p>
<p>Software support costs include IBM Software Maintenance (SWMA) and Microsoft Software Assurance for DB2 and Windows Server licenses respectively. Hardware, maintenance and software support costs are calculated based on “street” prices; i.e., discounted prices paid by the organizations upon which installations are based.</p>
<p>Current Technologies scenarios do not include use of virtualization tools such as VMware and Microsoft Hyper-V. Although these may be employed to support multiple database instances, organizations that contributed to this report were able to achieve high levels of database consolidation without them.</p>
<p><b>Download the rest of the PDF <a href="http://software.intel.com/file/23677">here</a>.</b></p> ]]></description>
      <link>http://software.intel.com/en-us/articles/the-cost-benefit-case-for-database-migration-to-intel-servers/</link>
      <pubDate>Thu, 12 Nov 2009 00:00:00 -0800</pubDate>
      <comments>http://software.intel.com/en-us/articles/the-cost-benefit-case-for-database-migration-to-intel-servers/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/the-cost-benefit-case-for-database-migration-to-intel-servers/</guid>
      <category>Xeon</category>
    </item>
    <item>
      <title>An evaluation of the impact of memory configuration on the performance of applications running on Intel® Xeon® processor 5500-series based servers</title>
      <description><![CDATA[ <p class="sectionHeading">Introduction</p>
<p>Servers using the 5500 series of Intel® Xeon® processors have different memory configuration properties from previous Intel® Xeon® processors.  A commonly used rule of thumb for configuring memory on systems running HPC applications has been “2 GB per physical core”; this should be reevaluated when configuring servers built with Intel® Xeon® processors in the 5500 series. This processor’s three-channel memory controller means that optimal memory configurations in a dual-socket server will use multiples of six memory DIMMs, whereas earlier dual-socket servers generally used multiples of four or eight.</p>
<p>It is not the purpose of this paper to recommend a specific memory configuration. Rather, this paper illustrates the performance impact of different memory configurations on a set of High Performance Computing (HPC) applications. This paper compares the performance of 16 HPC application benchmarks on 12 different memory configurations. <br />The applications chosen for this study are representative of three performance characterization groups: (1) low memory bandwidth, (2) moderate to high I/O bandwidth and (3) moderate to high memory bandwidth. The memory configurations in this study use various combinations of 1, 2 and 4 GB DIMMs, which give a total memory size ranging between 12 and 36 GB.</p>
<p class="sectionHeading"><br /><br />Test Configuration</p>
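<p>The base platform for this memory configuration experiment is a two-socket server:</p>
<table class="tableFormat1" border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td valign="top">
<p>Platform</p>
</td>
<td valign="top">
<p>2-Socket 2-U server</p>
</td>
</tr>
<tr>
<td valign="top">
<p>Baseboard</p>
</td>
<td valign="top">
<p>Supermicro X8DTN+ -IN001 Rev 1.02</p>
</td>
</tr>
<tr>
<td valign="top">
<p>Processors</p>
</td>
<td valign="top">
<p>Intel® Xeon® processor X5570, 2.93 GHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>Chipset</p>
</td>
<td valign="top">
<p>Intel® 5520</p>
</td>
</tr>
<tr>
<td valign="top">
<p>OS</p>
</td>
<td valign="top">
<p>Red Hat* Enterprise Linux 5, Update 3</p>
</td>
</tr>
</tbody>
</table>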
<p>Twelve different memory configurations were used in this study, with the total memory size on the test system ranging from 12 to 36 GB. The baseboard on this Supermicro platform has three DIMM slots on each of the processor’s three memory channels. This gives nine DIMM slots connected to each processor’s memory controller for a total of 18 DIMM slots on the platform.</p>
<p>The various memory configurations are more fully described in the Appendix. For the twelve memory configurations used in this study, the DIMM sizes and placement were identical on each processor’s nine DIMM slots.</p>
<p>The DIMM placement can be either uniform or non-uniform. Uniform DIMM placement is defined as each memory channel having the same number of DIMMS, and the DIMM sizes and placement are identical across all six memory channels on the test system.</p>
<p>Non-uniform DIMM placement resulted in slower performance than uniform placement. Therefore, the main body of this paper will concentrate on five of the uniform configurations. The results for all twelve configurations, including all of the non-uniform configurations, are listed in the appendix.</p>
<p>The five uniform configurations considered in this part of the paper are described in the following table. The nine digits in the DIMM Placement columns show the DIMM size used in the three slots in each of three memory channels.</p>
<table class="tableFormat1" border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td>
<p align="center"><b>Total<br />Memory</b></p>
</td>
<td>
<p align="center"><b>Memory<br />Speed</b></p>
</td>
<td>
<p align="center"><b>DIMM Placement,<br />Processor 1</b></p>
</td>
<td>
<p align="center"><b>DIMM Placement,<br />Processor 2</b></p>
</td>
</tr>
<tr>
<td valign="top">
<p>18 GB</p>
</td>
<td valign="top">
<p>800 MHz</p>
</td>
<td valign="top">
<p>111-111-111</p>
</td>
<td valign="top">
<p>111-111-111</p>
</td>
</tr>
<tr>
<td valign="top">
<p>18 GB</p>
</td>
<td valign="top">
<p>1067 MHz</p>
</td>
<td valign="top">
<p>210-210-210</p>
</td>
<td valign="top">
<p>210-210-210</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24 GB</p>
</td>
<td valign="top">
<p>1067 MHz</p>
</td>
<td valign="top">
<p>220-220-220</p>
</td>
<td valign="top">
<p>220-220-220</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24 GB</p>
</td>
<td valign="top">
<p>1067 MHz</p>
</td>
<td valign="top">
<p>400-400-400</p>
</td>
<td valign="top">
<p>400-400-400</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24 GB</p>
</td>
<td valign="top">
<p>1333 MHz</p>
</td>
<td valign="top">
<p>400-400-400</p>
</td>
<td valign="top">
<p>400-400-400</p>
</td>
</tr>
</tbody>
</table>
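<p>The placement notation in the table can be checked mechanically. The following is a minimal sketch, assuming each digit is the size in GB of the DIMM in one slot (0 for an empty slot) and each dash-separated group is one memory channel; the function names are hypothetical. It recomputes the total memory from the two per-processor placement strings and tests the uniformity definition given above (identical DIMM sizes and placement across all six channels).</p>

```python
def parse_placement(placement):
    """Turn '210-210-210' into one list of slot sizes (GB) per memory channel."""
    return [[int(slot) for slot in channel] for channel in placement.split("-")]


def total_gb(placements):
    """Total memory across all processors, given one placement string per processor."""
    return sum(
        sum(sum(channel) for channel in parse_placement(placement))
        for placement in placements
    )


def is_uniform(placements):
    """True when every memory channel on every processor is populated identically."""
    channels = [
        channel
        for placement in placements
        for channel in parse_placement(placement)
    ]
    return all(channel == channels[0] for channel in channels)
```

<p>For instance, <code>total_gb(["210-210-210", "210-210-210"])</code> evaluates to 18 and <code>total_gb(["400-400-400", "400-400-400"])</code> to 24, matching the Total Memory column.</p>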
<p>Benchmark workloads from 16 applications were selected to illustrate the performance impact of the various memory configurations. All but one of these applications run with 12 GB of memory without paging. These benchmarks were selected as representative applications characterized by</p>
<ol>
<li>Low memory bandwidth</li>
<li>Moderate to high I/O bandwidth</li>
<li>Moderate to high memory bandwidth</li>
</ol>
<p>The benchmark workloads selected for each of these groups are summarized in the following table:</p>
<table class="tableFormat1" border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td>
<p align="center"><b>Characterization</b></p>
</td>
<td>
<p align="center"><b>Application and Version</b></p>
</td>
<td>
<p align="center"><b>Workload</b></p>
</td>
</tr>
<tr>
<td rowspan="5">
<p>Low memory<br />Bandwidth</p>
</td>
<td valign="top">
<p>ABAQUS-std* v6.8-2</p>
</td>
<td valign="top">
<p>s2a</p>
</td>
</tr>
<tr>
<td valign="top">
<p>Amber* v9</p>
</td>
<td valign="top">
<p>nine standard workloads</p>
</td>
</tr>
<tr>
<td valign="top">
<p>BlackScholes* v3.0</p>
</td>
<td valign="top">
<p>one standard workload</p>
</td>
</tr>
<tr>
<td valign="top">
<p>BLAST* v2.2.18</p>
</td>
<td valign="top">
<p>one standard workload</p>
</td>
</tr>
<tr>
<td valign="top">
<p>MonteCarlo* v0.1</p>
</td>
<td valign="top">
<p>one standard workload</p>
</td>
</tr>
<tr>
<td rowspan="4">
<p>Low to high<br />I/O bandwidth</p>
</td>
<td valign="top">
<p>ABAQUS-std* v6.8-2</p>
</td>
<td valign="top">
<p>s4b</p>
</td>
</tr>
<tr>
<td valign="top">
<p>Gaussian* g03-E.01</p>
</td>
<td valign="top">
<p>apinefreq</p>
</td>
</tr>
<tr>
<td valign="top">
<p>MD.NASTRAN* R3</p>
</td>
<td valign="top">
<p>xl0xdy0</p>
</td>
</tr>
<tr>
<td valign="top">
<p>MD.NASTRAN* R3</p>
</td>
<td valign="top">
<p>xx0cmd2</p>
</td>
</tr>
<tr>
<td rowspan="7">
<p>Moderate to high<br />memory bandwidth</p>
</td>
<td valign="top">
<p>E3D* vFinal</p>
</td>
<td valign="top">
<p>SEG_Subsalt</p>
</td>
</tr>
<tr>
<td valign="top">
<p>Eclipse* v2008.1</p>
</td>
<td valign="top">
<p>ONEM1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>Fluent* v12.0.9 Beta</p>
</td>
<td valign="top">
<p>sedan_4m</p>
</td>
</tr>
<tr>
<td valign="top">
<p>LS-DYNA* mpp971_R3.2.1</p>
</td>
<td valign="top">
<p>car2car</p>
</td>
</tr>
<tr>
<td valign="top">
<p>MILC* v7.6.2b</p>
</td>
<td valign="top">
<p>Medium-NSFt2</p>
</td>
</tr>
<tr>
<td valign="top">
<p>POP* v3.0</p>
</td>
<td valign="top">
<p>x1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>WRF* v2.2.1</p>
</td>
<td valign="top">
<p>conus12</p>
</td>
</tr>
</tbody>
</table>
<p class="sectionHeading"><br /><br />Summary of Performance Results</p>
<p>In the summary graphs that follow, results for all applications are shown relative to the results for the 24GB-1333 400-400-400 configuration. This configuration is expected to give the best memory performance because it is able to run the memory at the highest speed, 1333 MHz. The Appendix contains a detailed description of the results on the individual applications in these three groups.</p>
<p align="left"><b><i>Applications with low memory bandwidth requirements</i></b><br />As shown on the following graph, the five applications characterized by low memory bandwidth requirements show no significant differences in performance on the five uniform memory configurations. The slight variation in results is attributed to experimental noise.</p>
<p>These results were expected as these applications fit within memory and their performance does not depend on memory bandwidth. There is essentially no performance difference between these memory configurations.</p>
<p><img src="http://software.intel.com/file/23018" /></p>
<p align="left"><b><i>Applications with moderate to high I/O bandwidth requirements</i></b><br />Four of the benchmarks used in this study perform significant write and read operations during the execution of the benchmark workload. These applications were selected to illustrate performance impact of applications whose I/O bandwidth ranges from moderate to high. I/O performance is partially a function of the size and speed of the memory subsystem as the operating system maintains a file buffer cache to help improve I/O latency and performance.</p>
<p align="left">The Gaussian apinefreq workload has a relatively low I/O bandwidth. The results of this benchmark on the five memory configurations are similar to those of the low memory bandwidth applications described above: there is little difference in performance.</p>
<p align="left">The MD.NASTRAN xl0xdy0 workload can be characterized as having a moderate I/O bandwidth.  The two 18 GB configurations run about 8% slower than the baseline.</p>
<p><img src="http://software.intel.com/file/23019" /></p>
<p>The remaining two applications in this group show much different performance responses to the different memory configurations. To a large degree they track the amount of memory available to the operating system for the file buffer cache.</p>
<p>The s4b workload causes ABAQUS-std to execute as a direct sparse linear equation solver. Its static analysis indicates that the amount of memory needed to minimize actual disk I/O is 31 GB. Since the largest memory size for these five memory configurations is 24 GB, a significant amount of I/O is performed.</p>
<p>The MD.NASTRAN xx0cmd2 workload was the one application that did not run in 12 GB of memory. When run with eight processes, this workload performs almost 200,000 write operations and over 800,000 read operations, most with a buffer size of 256 KB. The high-water mark for the size of the scratch files is over 40 GB. <br />The results for both of these workloads show that the performance of the two 18 GB memory configurations is similar and significantly slower than the three 24 GB configurations. This is attributed to the increased memory available to the operating system for the file buffer cache.</p>
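<p>To give a sense of scale for the I/O counts quoted above, the following sketch converts them into approximate data volumes. It is only an estimate: the operation counts are the rounded figures from the text, and treating every operation as a full 256 KB transfer is an assumption, since the text says only that most operations use that buffer size. The function name volume_gib is illustrative.</p>

```python
# Rough estimate of the I/O volume for the xx0cmd2 workload.
# The counts are the rounded figures quoted in the text; treating every
# operation as a full 256 KB transfer is an assumption of this sketch.
WRITE_OPS = 200_000
READ_OPS = 800_000
BUFFER_BYTES = 256 * 1024
GIB = 1024 ** 3


def volume_gib(operations, buffer_bytes=BUFFER_BYTES):
    """Return the data volume in GiB for a number of fixed-size transfers."""
    return operations * buffer_bytes / GIB


written = volume_gib(WRITE_OPS)
read = volume_gib(READ_OPS)
print(f"written: ~{written:.1f} GiB, read: ~{read:.1f} GiB")
```

<p>Under these assumptions the workload writes on the order of 50 GiB and reads on the order of 200 GiB. Both figures exceed the largest memory size among these five configurations, which is consistent with performance tracking the memory available for the file buffer cache.</p>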
<p align="left"><b><i>Applications with moderate to high memory bandwidth requirements</i></b><br />The third group of application workloads is characterized with a moderate to high memory bandwidth. The expectation is that applications in this group will show significant performance differences based on the memory configuration. The graph below confirms this expectation.</p>
<p align="left"><img src="http://software.intel.com/file/23020" /></p>
<p>The 800 MHz 18 GB memory configuration is clearly the slowest among the five configurations shown. In all but two cases it is significantly slower (more than 5%) than the other 18 GB configuration, which runs at 1067 MHz.</p>
<p>The two 1067MHz 24 GB configurations show some of the difference between dual- and quad-ranked DIMMS. The 2 GB DIMMS used in this study are dual-ranked and the 1067 MHz 4 GB DIMMS are quad-ranked. The performance of these applications on the quad-ranked DIMMS is about 2½% faster than the dual-ranked 2 GB DIMMS.</p>
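<p>As a rough cross-check of the figure quoted above, the sketch below averages the per-application ratios of the two 24 GB, 1067 MHz rows from the memory-bandwidth table in Appendix B. The use of an arithmetic mean and the name mean_speedup are assumptions of this illustration, not necessarily the method used in the study.</p>

```python
from statistics import mean

# Relative performance of the two 24 GB, 1067 MHz configurations, copied
# from the memory-bandwidth table in Appendix B (baseline = 1.000).
# Column order: E3D, Eclipse, Fluent sedan_4m, LS-DYNA, MILC, POP, WRF.
DUAL_RANKED_2GB = [0.928, 0.981, 0.965, 0.916, 0.949, 0.961, 0.914]  # 220-220-220
QUAD_RANKED_4GB = [0.958, 0.987, 0.980, 0.925, 0.967, 0.980, 0.979]  # 400-400-400


def mean_speedup(faster, slower):
    """Average the per-application ratios faster/slower (arithmetic mean)."""
    return mean(f / s for f, s in zip(faster, slower))


speedup = mean_speedup(QUAD_RANKED_4GB, DUAL_RANKED_2GB)
print(f"quad-ranked vs dual-ranked: {100 * (speedup - 1):.1f}% faster")
```

<p>With the Appendix B values the average ratio comes to roughly 2.5%, in line with the figure above. The spread between applications is wide, so the average should be read as indicative only.</p>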
<p class="sectionHeading"><br />Summary</p>
<p>Evaluation of various memory configurations in dual-socket servers based on the 5500 series Intel® Xeon® processors indicates that the performance of applications that make high demands on memory bandwidth may benefit from uniform memory configurations and the fastest available memory.  Applications that have more modest memory bandwidth requirements may achieve satisfactory performance with more flexibility in their configurations. The following observations and recommendations may also be useful:</p>
<ul>
<li>A system should be configured with sufficient memory to prevent swapping. </li>
<li>For best performance, DIMM sizes and placement should be uniform across all memory channels.</li>
<li>Applications that have high memory bandwidth requirements will likely perform fastest on systems configured with the fastest memory system.  This is achieved with one 1333 MHz DIMM in each memory channel.</li>
<li>Applications that have high I/O bandwidth requirements often perform faster on systems configured with additional memory, which can increase the efficacy of the operating system’s file buffering.</li>
<li>On systems running a heterogeneous mixture of applications, no single memory configuration may be ideal. The best compromise is often six of the fastest and largest DIMMs configured with one DIMM per channel.</li>
</ul>
<p class="sectionHeading"><br />Appendix A: Memory Configuration Details</p>
<p>Twelve different memory configurations were used in this experiment using various combinations of 1, 2 and 4GB DIMMS, giving a total memory size ranging from 12 to 36 GB. The DIMMS used in this study are described in the following table:</p>
<table class="tableFormat1" border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td>
<p align="center">DIMM<br />Size</p>
</td>
<td>
<p align="center">Manufacturer</p>
</td>
<td>
<p align="center">Description</p>
</td>
<td>
<p align="center">Model Number</p>
</td>
</tr>
<tr>
<td valign="top">
<p>1 GB</p>
</td>
<td valign="top">
<p>Qimonda*</p>
</td>
<td valign="top">
<p>1Rx8 PC3-8500R</p>
</td>
<td valign="top">
<p>IMSH1GP03A1F1C-10F T2<br />B3L85111004</p>
</td>
</tr>
<tr>
<td valign="top">
<p>2 GB</p>
</td>
<td valign="top">
<p>Qimonda*</p>
</td>
<td valign="top">
<p>2Rx8 PC3-10600R</p>
</td>
<td valign="top">
<p>IMSH2GP13A1F1C-13H T2<br />B3S82336006</p>
</td>
</tr>
<tr>
<td valign="top">
<p>4 GB</p>
</td>
<td valign="top">
<p>Micron*</p>
</td>
<td valign="top">
<p>4Rx8 PC3-8500R</p>
</td>
<td valign="top">
<p>MT36JSZF51272PDY-1G1DYESDD<br />BZAECGB001</p>
</td>
</tr>
<tr>
<td valign="top">
<p>4 GB</p>
</td>
<td valign="top">
<p>Micron*</p>
</td>
<td valign="top">
<p>2Rx4 PC3-10600P</p>
</td>
<td valign="top">
<p>MT36JSZF51272PY-1G4DZES<br />BZAE0GSA04</p>
</td>
</tr>
</tbody>
</table>
<p><br />The server has a NUMA memory architecture, with the memory controller in each processor supporting three channels with up to three DIMMS per channel. For this experiment the DIMM sizes and positions on each processor are identical for each memory combination tested.</p>
<p>The speed of the memory is a function of the manufacturer’s rating as well as the number of DIMMS used in memory channels. If a system is configured with two DIMMS per channel, then the BIOS will enforce a maximum speed of 1067 MHz even if the individual DIMMS are rated at 1333MHz. If all three DIMM slots in a single channel are used, the memory speed will be set to 800 MHz.<br /><br />The specific memory configurations used in this study are listed in the following table:</p>
<table class="tableFormat1" border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td valign="top">
<p align="center"><b>Total</b><br /><b>Memory</b></p>
</td>
<td valign="top">
<p align="center"><b>Processor 1</b><br /><b>DIMM Configuration</b></p>
</td>
<td valign="top">
<p align="center"><b>Processor 2</b><br /><b>DIMM Configuration</b></p>
</td>
<td valign="top">
<p align="center"><b>Memory</b><br /><b>Speed</b></p>
</td>
</tr>
<tr>
<td valign="top">
<p>12 GB</p>
</td>
<td valign="top">
<p>200-200-200</p>
</td>
<td valign="top">
<p>200-200-200</p>
</td>
<td valign="top">
<p align="right">1333 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>16 GB</p>
</td>
<td valign="top">
<p>220-220-000</p>
</td>
<td valign="top">
<p>220-220-000</p>
</td>
<td valign="top">
<p align="right">1067 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>16 GB</p>
</td>
<td valign="top">
<p>210-210-200</p>
</td>
<td valign="top">
<p>210-210-200</p>
</td>
<td valign="top">
<p align="right">1067 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>16 GB</p>
</td>
<td valign="top">
<p>220-200-200</p>
</td>
<td valign="top">
<p>220-200-200</p>
</td>
<td valign="top">
<p align="right">1067 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>18 GB</p>
</td>
<td valign="top">
<p>111-111-111</p>
</td>
<td valign="top">
<p>111-111-111</p>
</td>
<td valign="top">
<p align="right">800 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>18 GB</p>
</td>
<td valign="top">
<p>210-210-210</p>
</td>
<td valign="top">
<p>210-210-210</p>
</td>
<td valign="top">
<p align="right">1067 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>20 GB</p>
</td>
<td valign="top">
<p>220-220-110</p>
</td>
<td valign="top">
<p>220-220-110</p>
</td>
<td valign="top">
<p align="right">1067 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>20 GB</p>
</td>
<td valign="top">
<p>220-220-200</p>
</td>
<td valign="top">
<p>220-220-200</p>
</td>
<td valign="top">
<p align="right">1067 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24 GB</p>
</td>
<td valign="top">
<p>220-220-220</p>
</td>
<td valign="top">
<p>220-220-220</p>
</td>
<td valign="top">
<p align="right">1067 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24 GB</p>
</td>
<td valign="top">
<p>400-400-400</p>
</td>
<td valign="top">
<p>400-400-400</p>
</td>
<td valign="top">
<p align="right">1067 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24 GB</p>
</td>
<td valign="top">
<p>400-400-400</p>
</td>
<td valign="top">
<p>400-400-400</p>
</td>
<td valign="top">
<p align="right">1333 MHz</p>
</td>
</tr>
<tr>
<td valign="top">
<p>36 GB</p>
</td>
<td valign="top">
<p>222-222-222</p>
</td>
<td valign="top">
<p>222-222-222</p>
</td>
<td valign="top">
<p align="right">800 MHz</p>
</td>
</tr>
</tbody>
</table>
<p><br />While there are other possible configurations, it is believed that the twelve used in this study are representative of other possible DIMM placements.</p>
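<p>The configuration notation and the speed rule described in this appendix can be sketched as follows. This is an illustration under stated assumptions rather than a description of the BIOS logic: it reads each dash-separated group as one memory channel, each digit as the size in GB of the DIMM in one slot, and 0 as an empty slot. The function names are hypothetical, and the default DIMM rating of 1333 MHz is a parameter chosen for this sketch.</p>

```python
# Speed caps by DIMMs per channel, as stated in the text above. The value for
# a single DIMM per channel (1333 MHz) is taken from the configuration table.
SPEED_CAP_MHZ = {1: 1333, 2: 1067, 3: 800}
SOCKETS = 2  # both processors are populated identically in this study


def parse_config(config):
    """Split a string such as '210-210-200' into per-channel DIMM sizes in GB.

    A digit of 0 is read as an empty slot (an assumption about the notation).
    """
    return [[int(d) for d in channel if d != "0"] for channel in config.split("-")]


def total_memory_gb(config):
    """Total system memory across both processors."""
    return SOCKETS * sum(sum(channel) for channel in parse_config(config))


def memory_speed_mhz(config, dimm_rating_mhz=1333):
    """Speed set by the most heavily populated channel, limited by DIMM rating."""
    dimms_per_channel = max(len(channel) for channel in parse_config(config))
    return min(dimm_rating_mhz, SPEED_CAP_MHZ[dimms_per_channel])


def is_uniform(config):
    """True when every channel holds the same DIMMs."""
    channels = parse_config(config)
    return all(channel == channels[0] for channel in channels)


print(total_memory_gb("210-210-200"), memory_speed_mhz("210-210-200"))
```

<p>Applied to the rows of the table above, this reading reproduces the listed totals and speeds, provided the 400-400-400 case at 1067 MHz is evaluated with a DIMM rating of 1067 MHz.</p>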
<p class="sectionHeading"><br />Appendix B: Benchmark Results</p>
<p>Benchmarks from 16 HPC applications were used in this experiment. The applications were selected, with just one exception, to run in 12 GB of memory without swapping, since one of the test cases for this study was a configuration with 12 GB. The one exception was one of the I/O intensive workloads.<br />The results shown in the following tables are relative to the 400-400-400 1333 MHz case.</p>
<p><b><i>Applications with low memory bandwidth</i></b><br />Workloads from five applications that are characterized with a low memory bandwidth were selected for this study. They are:</p>
<ul>
<li>ABAQUS-std v6.8-2, s2a workload</li>
</ul>
<p>Abaqus/Standard is a general-purpose solver using a traditional implicit integration scheme to solve finite element analyses. The s2a workload is a mildly nonlinear static analysis of a flywheel with centrifugal loading. It is a 474,744 DOF model with a moderate iteration count.</p>
<ul>
<li>Amber v9, nine standard workloads</li>
</ul>
<p>A molecular dynamics program used to calculate properties of macromolecular systems. The value reported is the geometric mean of this application’s nine standard workloads.</p>
<ul>
<li>BlackScholes v3.0</li>
</ul>
<p>BlackScholes models the market for an equity using the Black Scholes formula.</p>
<ul>
<li>BLAST v2.2.18</li>
</ul>
<p>Bioinformatics code used to perform similarity searches against databases of genome or protein sequences.</p>
<ul>
<li>MonteCarlo v0.1</li>
</ul>
<p>Financial simulation engine using the Monte Carlo technique.</p>
<p>The expectation is that the benchmark results for applications with this characterization should show little difference on the twelve memory configurations tested. Their results confirmed this expectation as shown by the following table:</p>
<table class="tableFormat1" border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td valign="top" width="86">
<p align="center"><b>Memory<br />Configuration</b></p>
</td>
<td valign="top" width="86">
<p align="center"><b>ABAQUS-std<br />s2a</b></p>
</td>
<td valign="top" width="86">
<p align="center"><b>Amber</b></p>
</td>
<td valign="top" width="86">
<p align="center"><b>BlackScholes</b></p>
</td>
<td valign="top" width="86">
<p align="center"><b>BLAST</b></p>
</td>
<td valign="top" width="86">
<p align="center"><b>MonteCarlo</b></p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>12GB-1333<br />200-200-200</p>
</td>
<td width="86">
<p align="center">0.994</p>
</td>
<td width="86">
<p align="center">0.997</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">0.997</p>
</td>
<td width="86">
<p align="center">0.996</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>16GB-1067<br />220-220-000</p>
</td>
<td width="86">
<p align="center">0.982</p>
</td>
<td width="86">
<p align="center">0.967</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">0.995</p>
</td>
<td width="86">
<p align="center">0.997</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>16GB-1067<br />210-210-200</p>
</td>
<td width="86">
<p align="center">0.985</p>
</td>
<td width="86">
<p align="center">0.977</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">0.996</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>16GB-1067<br />220-200-200</p>
</td>
<td width="86">
<p align="center">0.982</p>
</td>
<td width="86">
<p align="center">0.970</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">0.996</p>
</td>
<td width="86">
<p align="center">1.003</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>18GB-800<br />111-111-111</p>
</td>
<td width="86">
<p align="center">0.988</p>
</td>
<td width="86">
<p align="center">0.974</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">0.997</p>
</td>
<td width="86">
<p align="center">0.993</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>18GB-1067<br />210-210-210</p>
</td>
<td width="86">
<p align="center">0.988</p>
</td>
<td width="86">
<p align="center">0.995</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">0.998</p>
</td>
<td width="86">
<p align="center">0.998</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>20GB-1067<br />220-220-110</p>
</td>
<td width="86">
<p align="center">0.982</p>
</td>
<td width="86">
<p align="center">0.979</p>
</td>
<td width="86">
<p align="center">1.000</p>
</td>
<td width="86">
<p align="center">0.999</p>
</td>
<td width="86">
<p align="center">0.991</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>20GB-1067<br />220-220-200</p>
</td>
<td width="86">
<p align="center">0.994</p>
</td>
<td width="86">
<p align="center">0.995</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">1.000</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>24GB-1067<br />220-220-220</p>
</td>
<td width="86">
<p align="center">0.997</p>
</td>
<td width="86">
<p align="center">0.993</p>
</td>
<td width="86">
<p align="center">0.997</p>
</td>
<td width="86">
<p align="center">1.005</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>24GB-1067<br />400-400-400</p>
</td>
<td width="86">
<p align="center">0.997</p>
</td>
<td width="86">
<p align="center">1.014</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">1.002</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>24GB-1333<br />400-400-400</p>
</td>
<td width="86">
<p align="center">1.000</p>
</td>
<td width="86">
<p align="center">1.000</p>
</td>
<td width="86">
<p align="center">1.000</p>
</td>
<td width="86">
<p align="center">1.000</p>
</td>
<td width="86">
<p align="center">1.000</p>
</td>
</tr>
<tr>
<td valign="top" width="86">
<p>36GB-800<br />222-222-222</p>
</td>
<td width="86">
<p align="center">0.980</p>
</td>
<td width="86">
<p align="center">1.000</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
<td width="86">
<p align="center">1.003</p>
</td>
<td width="86">
<p align="center">1.001</p>
</td>
</tr>
</tbody>
</table>
<p><br />The results for BlackScholes, BLAST and MonteCarlo applications show less than 1% difference across all twelve memory configurations. This is within the observed run-to-run variability for benchmarking these applications. ABAQUS-std shows about a 2% performance degradation on some of the smaller non-uniform memory configurations; Amber shows about a 3% performance degradation at these configurations. Although larger than the normal run-to-run variation, these degradations are not considered significant (more than 5%).</p>
<p><b><i>Applications with moderate to high I/O bandwidth</i></b><br />Four of the benchmarks used in this study perform a significant amount of disk I/O and can be characterized by having a moderate to high I/O bandwidth. I/O performance is partially a function of memory size and speed as the operating system maintains a file buffer cache to help improve I/O latency and performance. With more memory in the system, the OS can maintain a larger buffer cache.</p>
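<p>The 5% significance criterion can be applied to the tabulated values with a few lines of code. The sketch below uses the ABAQUS-std s2a and Amber columns from the table above; the names RESULTS and worst_degradation are illustrative, and measuring the shortfall against the baseline value of 1.000 is an assumption about how the criterion is meant to be applied.</p>

```python
# Relative performance copied from the table above (baseline = 1.000),
# listed in the table's row order.
RESULTS = {
    "ABAQUS-std s2a": [0.994, 0.982, 0.985, 0.982, 0.988, 0.988,
                       0.982, 0.994, 0.997, 0.997, 1.000, 0.980],
    "Amber": [0.997, 0.967, 0.977, 0.970, 0.974, 0.995,
              0.979, 0.995, 0.993, 1.014, 1.000, 1.000],
}
SIGNIFICANT = 0.05  # the study's threshold: more than 5%


def worst_degradation(values):
    """Largest shortfall relative to the baseline, as a fraction."""
    return 1.0 - min(values)


for name, values in RESULTS.items():
    worst = worst_degradation(values)
    verdict = "significant" if worst > SIGNIFICANT else "not significant"
    print(f"{name}: worst degradation {100 * worst:.1f}% ({verdict})")
```

<p>Both applications stay below the threshold, with worst cases of about 2% and 3% respectively.</p>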
<p>The applications in this group are:</p>
<ul>
<li>ABAQUS-std v6.8-2, s4b workload</li>
</ul>
<p>This is the same ABAQUS-std described in the previous group. The s4b workload is a mildly nonlinear static analysis that simulates bolting a cylinder head onto an engine block. It is a 5,000,000 DOF model with a low iteration count.</p>
<ul>
<li>Gaussian g03-E.01, apinefreq workload</li>
</ul>
<p>Gaussian is a quantum chemistry code.</p>
<ul>
<li>MD.NASTRAN R3, xl0xdy0 workload</li>
</ul>
<p>NASTRAN is a general purpose finite element analysis solution for small to complex assemblies. The xl0xdy0 workload is a model of a truck crash, has 286,216 DOF and uses the explicit nonlinear solution sequence.</p>
<ul>
<li>MD.NASTRAN R3, xx0cmd2</li>
</ul>
<p>This workload is a car body model with 1,315,340 DOF using the Normal Modes Analysis solution sequence with ACMS.</p>
<p>The Gaussian apinefreq workload has a relatively low I/O bandwidth. The results of this benchmark on the twelve memory configurations look similar to the low memory bandwidth applications shown above: there are some performance degradations, up to 3%, on the smaller non-uniform configurations.</p>
<table class="tableFormat1" border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td valign="top">
<p align="center"><b>Memory<br />Configuration</b></p>
</td>
<td valign="top">
<p align="center"><b>ABAQUS-std<br />s4b</b></p>
</td>
<td valign="top">
<p align="center"><b>Gaussian.E<br />apinefreq</b></p>
</td>
<td valign="top">
<p align="center"><b>MD.NASTRAN<br />xl0xdy0</b></p>
</td>
<td valign="top">
<p align="center"><b>MD.NASTRAN<br />xx0cmd2</b></p>
</td>
</tr>
<tr>
<td valign="top">
<p>12GB-1333<br />200-200-200</p>
</td>
<td>
<p align="center">0.798</p>
</td>
<td>
<p align="center">1.005</p>
</td>
<td>
<p align="center">0.952</p>
</td>
<td>
<p align="center"> </p>
</td>
</tr>
<tr>
<td valign="top">
<p>16GB-1067<br />220-220-000</p>
</td>
<td>
<p align="center">0.733</p>
</td>
<td>
<p align="center">0.970</p>
</td>
<td>
<p align="center">0.821</p>
</td>
<td>
<p align="center">0.417</p>
</td>
</tr>
<tr>
<td valign="top">
<p>16GB-1067<br />210-210-200</p>
</td>
<td>
<p align="center">0.727</p>
</td>
<td>
<p align="center">0.993</p>
</td>
<td>
<p align="center">0.935</p>
</td>
<td>
<p align="center">0.440</p>
</td>
</tr>
<tr>
<td valign="top">
<p>16GB-1067<br />220-200-200</p>
</td>
<td>
<p align="center">0.678</p>
</td>
<td>
<p align="center">0.979</p>
</td>
<td>
<p align="center">0.931</p>
</td>
<td>
<p align="center">0.407</p>
</td>
</tr>
<tr>
<td valign="top">
<p>18GB-800<br />111-111-111</p>
</td>
<td>
<p align="center">0.804</p>
</td>
<td>
<p align="center">0.984</p>
</td>
<td>
<p align="center">0.919</p>
</td>
<td>
<p align="center">0.695</p>
</td>
</tr>
<tr>
<td valign="top">
<p>18GB-1067<br />210-210-210</p>
</td>
<td>
<p align="center">0.840</p>
</td>
<td>
<p align="center">0.998</p>
</td>
<td>
<p align="center">0.932</p>
</td>
<td>
<p align="center">0.749</p>
</td>
</tr>
<tr>
<td valign="top">
<p>20GB-1067<br />220-220-110</p>
</td>
<td>
<p align="center">0.711</p>
</td>
<td>
<p align="center">0.989</p>
</td>
<td>
<p align="center">0.884</p>
</td>
<td>
<p align="center">0.845</p>
</td>
</tr>
<tr>
<td valign="top">
<p>20GB-1067<br />220-220-200</p>
</td>
<td>
<p align="center">0.717</p>
</td>
<td>
<p align="center">0.992</p>
</td>
<td>
<p align="center">0.931</p>
</td>
<td>
<p align="center">0.750</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24GB-1067<br />220-220-220</p>
</td>
<td>
<p align="center">1.006</p>
</td>
<td>
<p align="center">0.999</p>
</td>
<td>
<p align="center">1.023</p>
</td>
<td>
<p align="center">0.962</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24GB-1067<br />400-400-400</p>
</td>
<td>
<p align="center">1.027</p>
</td>
<td>
<p align="center">1.004</p>
</td>
<td>
<p align="center">0.982</p>
</td>
<td>
<p align="center">0.953</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24GB-1333<br />400-400-400</p>
</td>
<td>
<p align="center">1.000</p>
</td>
<td>
<p align="center">1.000</p>
</td>
<td>
<p align="center">1.000</p>
</td>
<td>
<p align="center">1.000</p>
</td>
</tr>
<tr>
<td valign="top">
<p>36GB-800<br />222-222-222</p>
</td>
<td>
<p align="center">1.949</p>
</td>
<td>
<p align="center">0.989</p>
</td>
<td>
<p align="center">0.874</p>
</td>
<td>
<p align="center">1.159</p>
</td>
</tr>
</tbody>
</table>
<p><br />The MD.NASTRAN xl0xdy0 workload has a moderate I/O bandwidth, and this benchmark shows a performance degradation of about 18% on some of the non-uniform memory configurations. The two memory configurations running at 800 MHz also show significant performance degradations.</p>
<p>The remaining two applications in this group, ABAQUS-std v6.8-2 s4b and MD.NASTRAN R3 xx0cmd2, show very different performance responses to the different memory configurations.</p>
<p>The s4b workload actually fits in memory for the 36 GB configuration. Consequently it runs almost twice as fast as the 24 GB baseline configuration. The performance of this benchmark also suffered on the non-uniform DIMM configurations.</p>
<p>The MD.NASTRAN R3 xx0cmd2 benchmark performed best on the 36 GB configuration, running about 16% faster than the 24 GB baseline. This shows the benefit of the operating system’s larger file buffer cache. The effect of memory speed is also evident in these results. The two 24 GB configurations running at 1067 MHz are about 4% slower than the 24 GB baseline running at 1333 MHz. Likewise the 18 GB configuration running at 800 MHz is about 5% slower than the other 18 GB configuration running at 1067 MHz.</p>
<p><b><i>Applications with moderate to high memory bandwidth</i></b><br />The third group of application benchmarks is characterized by a moderate to high memory bandwidth. The expectation is that applications in this group will show significant performance differences based on the memory configuration. The applications and workloads in this group are:</p>
<ul>
<li>E3D vFinal, SEG_Subsalt workload</li>
</ul>
<p>E3D is a seismic code used to “see” the underground geological formations of oil and gas reservoirs.</p>
<ul>
<li>Eclipse v2008.1, ONEM1 workload</li>
</ul>
<p>Eclipse is an oil reservoir simulation code.</p>
<ul>
<li>Fluent v12.0.9 Beta, aircraft_2m workload</li>
</ul>
<p>Fluent is a computational fluid dynamics code.</p>
<ul>
<li>Fluent v12.0.9 Beta, sedan_4m workload</li>
</ul>
<p>Fluent is a computational fluid dynamics code.</p>
<ul>
<li>LS-DYNA mpp971_R3.2.1, car2car workload</li>
</ul>
<p>LS-DYNA is a general purpose transient dynamic finite element program.</p>
<p>The car2car workload is a simulation of head-on collision of two vehicles. It is similar to car crash analysis models used by automotive companies.</p>
<ul>
<li>MILC v7.6.2b, Medium-NSFt2 workload</li>
</ul>
<p>MILC is a quantum chromodynamics code.</p>
<ul>
<li>POP v3.0, x1 workload</li>
</ul>
<p>POP is an ocean circulation model derived from earlier models of Bryan, Cox, Semtner and Chervin in which depth is used as the vertical coordinate.</p>
<ul>
<li>WRF v2.2.1,  conus12 workload</li>
</ul>
<p>The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs.</p>
<p>The conus12 workload is a 48-hour, 12 km resolution case over the Continental U.S. (CONUS) domain for October 24, 2001.</p>
<p>The applications selected for this group are examples of applications in the CAE, Energy, QCD and Numerical Weather Simulation classes of HPC applications. Their performance results are shown in the following table:</p>
<table class="tableFormat1" border="1" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td valign="top">
<p align="center"><b>Memory<br />Configuration</b></p>
</td>
<td valign="top">
<p align="center"><b>E3D<br />SEG_Subsalt</b></p>
</td>
<td valign="top">
<p align="center"><b>Eclipse<br />ONEM1</b></p>
</td>
<td valign="top">
<p align="center"><b>Fluent<br />sedan_4m</b></p>
</td>
<td valign="top">
<p align="center"><b>LS-DYNA<br />car2car</b></p>
</td>
<td valign="top">
<p align="center"><b>MILC<br />Medium-<br />NSFt2</b></p>
</td>
<td valign="top" width="56">
<p align="center"><b>POP<br />x1</b></p>
</td>
<td valign="top" width="60">
<p align="center"><b>WRF<br />conus12</b></p>
</td>
</tr>
<tr>
<td valign="top">
<p>12GB-1333<br />200-200-200</p>
</td>
<td>
<p align="center">0.996</p>
</td>
<td>
<p align="center">0.998</p>
</td>
<td>
<p align="center">1.004</p>
</td>
<td>
<p align="center">0.899</p>
</td>
<td>
<p align="center">0.998</p>
</td>
<td width="56">
<p align="center">0.999</p>
</td>
<td width="60">
<p align="center">0.912</p>
</td>
</tr>
<tr>
<td valign="top">
<p>16GB-1067<br />220-220-000</p>
</td>
<td>
<p align="center">0.649</p>
</td>
<td>
<p align="center">0.817</p>
</td>
<td>
<p align="center">0.799</p>
</td>
<td>
<p align="center">0.774</p>
</td>
<td>
<p align="center">0.698</p>
</td>
<td width="56">
<p align="center">0.776</p>
</td>
<td width="60">
<p align="center">0.731</p>
</td>
</tr>
<tr>
<td valign="top">
<p>16GB-1067<br />210-210-200</p>
</td>
<td>
<p align="center">0.791</p>
</td>
<td>
<p align="center">0.922</p>
</td>
<td>
<p align="center">0.911</p>
</td>
<td>
<p align="center">0.943</p>
</td>
<td>
<p align="center">0.861</p>
</td>
<td width="56">
<p align="center">0.902</p>
</td>
<td width="60">
<p align="center">0.836</p>
</td>
</tr>
<tr>
<td valign="top">
<p>16GB-1067<br />220-200-200</p>
</td>
<td>
<p align="center">0.658</p>
</td>
<td>
<p align="center">0.817</p>
</td>
<td>
<p align="center">0.806</p>
</td>
<td>
<p align="center">0.785</p>
</td>
<td>
<p align="center">0.703</p>
</td>
<td width="56">
<p align="center">0.781</p>
</td>
<td width="60">
<p align="center">0.712</p>
</td>
</tr>
<tr>
<td valign="top">
<p>18GB-800<br />111-111-111</p>
</td>
<td>
<p align="center">0.741</p>
</td>
<td>
<p align="center">0.850</p>
</td>
<td>
<p align="center">0.845</p>
</td>
<td>
<p align="center">0.889</p>
</td>
<td>
<p align="center">0.795</p>
</td>
<td width="56">
<p align="center">0.864</p>
</td>
<td width="60">
<p align="center">0.797</p>
</td>
</tr>
<tr>
<td valign="top">
<p>18GB-1067<br />210-210-210</p>
</td>
<td>
<p align="center">0.820</p>
</td>
<td>
<p align="center">0.945</p>
</td>
<td>
<p align="center">0.944</p>
</td>
<td>
<p align="center">0.912</p>
</td>
<td>
<p align="center">0.932</p>
</td>
<td width="56">
<p align="center">0.955</p>
</td>
<td width="60">
<p align="center">0.850</p>
</td>
</tr>
<tr>
<td valign="top">
<p>20GB-1067<br />220-220-110</p>
</td>
<td>
<p align="center">0.759</p>
</td>
<td>
<p align="center">0.904</p>
</td>
<td>
<p align="center">0.873</p>
</td>
<td>
<p align="center">0.821</p>
</td>
<td>
<p align="center">0.813</p>
</td>
<td width="56">
<p align="center">0.871</p>
</td>
<td width="60">
<p align="center">0.783</p>
</td>
</tr>
<tr>
<td valign="top">
<p>20GB-1067<br />220-220-200</p>
</td>
<td>
<p align="center">0.690</p>
</td>
<td>
<p align="center">0.843</p>
</td>
<td>
<p align="center">0.867</p>
</td>
<td>
<p align="center">0.893</p>
</td>
<td>
<p align="center">0.763</p>
</td>
<td width="56">
<p align="center">0.806</p>
</td>
<td width="60">
<p align="center">0.775</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24GB-1067<br />220-220-220</p>
</td>
<td>
<p align="center">0.928</p>
</td>
<td>
<p align="center">0.981</p>
</td>
<td>
<p align="center">0.965</p>
</td>
<td>
<p align="center">0.916</p>
</td>
<td>
<p align="center">0.949</p>
</td>
<td width="56">
<p align="center">0.961</p>
</td>
<td width="60">
<p align="center">0.914</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24GB-1067<br />400-400-400</p>
</td>
<td>
<p align="center">0.958</p>
</td>
<td>
<p align="center">0.987</p>
</td>
<td>
<p align="center">0.980</p>
</td>
<td>
<p align="center">0.925</p>
</td>
<td>
<p align="center">0.967</p>
</td>
<td width="56">
<p align="center">0.980</p>
</td>
<td width="60">
<p align="center">0.979</p>
</td>
</tr>
<tr>
<td valign="top">
<p>24GB-1333<br />400-400-400</p>
</td>
<td>
<p align="center">1.000</p>
</td>
<td>
<p align="center">1.000</p>
</td>
<td>
<p align="center">1.000</p>
</td>
<td>
<p align="center">1.000</p>
</td>
<td>
<p align="center">1.000</p>
</td>
<td width="56">
<p align="center">1.000</p>
</td>
<td width="60">
<p align="center">1.000</p>
</td>
</tr>
<tr>
<td valign="top">
<p>36GB-800<br />222-222-222</p>
</td>
<td>
<p align="center">0.738</p>
</td>
<td>
<p align="center">0.862</p>
</td>
<td>
<p align="center">0.843</p>
</td>
<td>
<p align="center">0.913</p>
</td>
<td>
<p align="center">0.798</p>
</td>
<td width="56">
<p align="center">0.868</p>
</td>
<td width="60">
<p align="center">0.822</p>
</td>
</tr>
</tbody>
</table>
<p><br />The applications selected for this group clearly show the effect of the memory speed on performance. Two of the configurations were able to run the DIMMS at 1333 MHz: the 24 GB baseline with 6 x 4GB DDR3-1333 DIMMS and the 12 GB configuration with 6 x 2GB DDR3-1333 DIMMS. Five of the seven applications in this set had almost identical results on these two 1333 MHz configurations. Both 1333 MHz configurations ran these benchmarks significantly faster than the two configurations with three DIMMS per channel, which ran the memory at 800 MHz.</p>
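<p>The comparison between the 1333 MHz and 800 MHz configurations can be reproduced from the table above. The sketch below is illustrative only: the arithmetic mean across applications, the 1% tolerance used for "almost identical", and the names ROWS and near_baseline are assumptions made here, not definitions taken from the study.</p>

```python
from statistics import mean

# Rows copied from the table above (baseline = 1.000). Column order:
# E3D, Eclipse, Fluent sedan_4m, LS-DYNA, MILC, POP, WRF.
ROWS = {
    "12GB-1333 200-200-200": [0.996, 0.998, 1.004, 0.899, 0.998, 0.999, 0.912],
    "18GB-800 111-111-111": [0.741, 0.850, 0.845, 0.889, 0.795, 0.864, 0.797],
    "36GB-800 222-222-222": [0.738, 0.862, 0.843, 0.913, 0.798, 0.868, 0.822],
}


def near_baseline(values, tolerance=0.01):
    """Count applications whose result is within the tolerance of 1.000."""
    return sum(1 for v in values if tolerance >= abs(v - 1.0))


for name, values in ROWS.items():
    print(f"{name}: mean {mean(values):.3f}, near baseline: {near_baseline(values)}")
```

<p>With a 1% tolerance, five of the seven applications on the 12 GB configuration fall within range of the baseline, while the two 800 MHz rows average roughly 17% below it.</p>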
<p>Five of the memory configurations used in this study were non-uniform in that the processor’s three memory channels did not have the same number or size of DIMMS. On three of these configurations (16 GB 220-220-000, 16 GB 220-200-200 and 20 GB 220-220-200), the benchmarks for these eight applications ran slowest among the twelve configurations tested. On average they were about 20% to 25% slower than the baseline.</p> ]]></description>
      <link>http://software.intel.com/en-us/articles/an-evaluation-of-the-impact-of-memory-configuration-on-the-performance-of-applications-running-on-intel-xeon-processor-5500-series-based-servers/</link>
      <pubDate>Wed, 28 Oct 2009 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/an-evaluation-of-the-impact-of-memory-configuration-on-the-performance-of-applications-running-on-intel-xeon-processor-5500-series-based-servers/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/an-evaluation-of-the-impact-of-memory-configuration-on-the-performance-of-applications-running-on-intel-xeon-processor-5500-series-based-servers/</guid>
      <category>ISN General</category>
      <category>Xeon</category>
    </item>
    <item>
      <title>Performance of MPI cluster applications with Intel(r) HyperThreading Technology</title>
      <description><![CDATA[  ]]></description>
      <link>http://software.intel.com/en-us/articles/performance-of-mpi-cluster-applications-with-intel-hyperthreading-technology/</link>
      <pubDate>Tue, 13 Oct 2009 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/performance-of-mpi-cluster-applications-with-intel-hyperthreading-technology/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/performance-of-mpi-cluster-applications-with-intel-hyperthreading-technology/</guid>
      <category>Parallel Programming</category>
      <category>Xeon</category>
    </item>
    <item>
      <title>Prana Studios leverages Intel® Xeon® Processor 5500 Series to get better 3D animation rendering</title>
      <description><![CDATA[ <p><b>Introduction:</b> Prana Studios is a leading animation house based in Mumbai and Los Angeles. Prana's core business focuses on four main areas: long-form CG content, location-based entertainment, game cinematics, and feature film effects. They began collaborating with Intel to resolve a technical challenge they faced while working on an animated movie co-produced with a leading Bollywood studio.</p>
<p> </p>
<p><b>The Challenge: </b>The technical challenge was to improve the renderer performance of displacements and 3D motion blur in the SitexGraphics* Air* product.</p>
<p><b><br />The Solution: </b>To meet this challenge, the Prana* team explored Air* multi-core software optimizations and also evaluated the latest hardware platform, powered by the Intel® Xeon® processor 5500 series with Intel® Hyper-Threading Technology (Intel® HT Technology).</p>
<p> </p>
<p><b>The Impact: </b>The Air* renderer ran significantly faster on the Intel® Xeon® processor 5500 series-based machine than on the earlier-generation Intel® Xeon® processor 5355 series-based platform, with average gains of 1.8X with Intel® HT Technology ON and 1.45X with Intel® HT Technology OFF.</p>
<p> </p>
<p><b>Application Optimizations in Air* Software: </b>During evaluation of the Intel® Xeon® processor 5500 series-based platform, the Prana* team found scalability issues with the Air* renderer when moving from 8-thread to 16-thread execution. This feedback was given to SitexGraphics*, who investigated the threading issues in the Air* renderer and released a fix that enabled 16-thread execution on the Intel® Xeon® processor 5500 series-based platform (with the Intel® HT Technology feature ON).</p>
<p> </p>
<p><b>Deploying the Intel® Xeon® Processor 5500 Series-based Platform:</b> Performance of the Air* renderer was evaluated using workloads consisting of foliage, fur, and texture scenes. Measurements on the Intel® Xeon® processor 5500 series-based platform were taken with Intel® HT Technology both ON and OFF. Rendering the three workloads took 199, 871, and 728 seconds on the Intel® Xeon® processor 5355 series-based platform; 137, 616, and 527 seconds on the Intel® Xeon® processor 5500 series-based platform with Intel® HT Technology OFF (8-thread execution); and 93, 495, and 419 seconds, respectively, with Intel® HT Technology ON (16-thread execution). The best rendering performance was therefore achieved on the Intel® Xeon® processor 5500 series-based platform with Intel® HT Technology ON (see Fig. 1).</p>
<p>The average performance gains on the Intel® Xeon® processor 5500 series-based platform with Intel® HT Technology OFF and ON were 1.45X and 1.8X respectively, when compared to the Intel® Xeon® processor 5355 series-based platform.</p>
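<p>The speedup arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the render times are copied from the text above, the assumption that each list follows the same workload order (foliage, fur, texture) is ours, and the article does not say how its averages were formed, so the sketch computes two common aggregates. All function and variable names are hypothetical.</p>

```python
# Render times in seconds, copied from the article text above.
# Assumption: each list is in the same workload order (foliage, fur, texture).
RENDER_SECONDS = {
    "5355": [199, 871, 728],
    "5500 HT OFF": [137, 616, 527],
    "5500 HT ON": [93, 495, 419],
}
WORKLOADS = ["foliage", "fur", "texture"]
BASELINE = "5355"


def per_workload_speedups(baseline, candidate):
    """Return baseline_time / candidate_time for each workload."""
    return [b / c for b, c in zip(baseline, candidate)]


def mean_speedup(baseline, candidate):
    """Arithmetic mean of the per-workload speedups."""
    ratios = per_workload_speedups(baseline, candidate)
    return sum(ratios) / len(ratios)


def total_time_speedup(baseline, candidate):
    """Speedup measured on the summed render time of all workloads."""
    return sum(baseline) / sum(candidate)


if __name__ == "__main__":
    base = RENDER_SECONDS[BASELINE]
    for name, times in RENDER_SECONDS.items():
        if name == BASELINE:
            continue
        ratios = per_workload_speedups(base, times)
        for workload, ratio in zip(WORKLOADS, ratios):
            print(f"{name:12} {workload:8} {ratio:.2f}X")
        print(f"{name:12} mean     {mean_speedup(base, times):.2f}X")
        print(f"{name:12} total    {total_time_speedup(base, times):.2f}X")
```

<p>Both aggregates are shown because the averaging method is not stated in the article; depending on the method, the computed values may differ slightly from the rounded averages quoted above.</p>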
<p> </p>
<p>
<table cellspacing="0" cellpadding="0" width="100%">
<tbody>
<tr>
<td>
<p><b>Fig. 1: Lower is better<br /><img title="Prana_fig+1.jpg" alt="Prana_fig+1.jpg" src="http://software.intel.com/file/20092" /></b></p>
</td>
</tr>
</tbody>
</table>
<a></a></p>
<p> </p>
<p><b>"Good things" about Intel® Xeon® processor 5500 series: </b>Intel® Xeon® processor-based servers provide reliable, efficient, and proven performance, designed from the ground up to meet data-demanding enterprise requirements. The Intel® Xeon® processors are the ideal choice for business-critical computing.</p>
<p><b><br />Configuration of the machines tested:</b></p>
<p><span >Intel® Xeon® processor 5355 series-based Platform<br /></span>•  <b>Hardware: </b>Dual Processors Intel® Xeon® CPU 5355 Series @ 2.66 GHz with 8 GB FBD2 800 MHz RAM<br />•  <b>OS:</b> Windows* XP* Professional x64 Edition v5.2.3790 Service Pack 2 Build 3790<br />•  <b>Software Stack: </b>Maya* 2008 Ext2 (32-bit), MayaMan* 2.0.15 (32-bit), SitexGraphics* Air* 8.09 (32-bit)</p>
<p><span >Intel® Xeon® processor 5500 series-based Platform<br /></span>•  <b>Hardware: </b>Dual Processors Intel® Xeon® CPU 5560 Series @ 2.8 GHz with 8 GB DDR3 1066 MHz RAM<br />•  <b>OS:</b> Windows* XP* Professional x64 Edition v5.2.3790 Service Pack 2 Build 3790<br />•  <b>Software Stack: </b>Maya* 2008 Ext2 (32-bit), MayaMan* 2.0.15 (32-bit), SitexGraphics* Air* 8.09 (32-bit)</p>
<br /><br /> 
<hr align="left" size="1" width="33%" />
*Other names and brands may be claimed as the property of others.  
<table border="1" rules="none" cellspacing="0" cellpadding="5">
<tbody>
<tr>
<th  valign="middle" align="left">Optimization Notice</th>
</tr>
<tr bgcolor="#ccecff">
<td>
<p>Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.</p>
<p align="right">Notice revision #20110804</p>
</td>
</tr>
</tbody>
</table> ]]></description>
      <link>http://software.intel.com/en-us/articles/prana-studios-leverages-intel-xeon/</link>
      <pubDate>Tue, 07 Jul 2009 00:00:00 -0700</pubDate>
      <comments>http://software.intel.com/en-us/articles/prana-studios-leverages-intel-xeon/#comments</comments>
      <guid isPermaLink="true">http://software.intel.com/en-us/articles/prana-studios-leverages-intel-xeon/</guid>
      <category>Xeon</category>
      <category>Visual Computing</category>
      <category>Game Development</category>
    </item>
  </channel></rss>
