With the recent launch of the next-generation Intel® Xeon® processor MP, IBM DB2* 9, and IBM N Series storage, businesses can now enjoy the rich processing power and performance benefits these products have to offer.
This paper unveils the combined performance of Dual-Core Intel® Xeon® processor 7100? Series, DB2 9 pureXML* technology, and IBM N5500 midrange storage servers. Several measurements from an XML-based data server environment are provided to illustrate the performance and scalability of these products. The test scenario simulates a data-centric¹ financial brokerage environment that is characterized by:
- High transaction volumes and concurrency.
- Small transaction size.
- Variable XML document structure.
First, performance metrics are presented for three workloads that stress XML data insertions, XML data reads, and a mixture of XML data processing tasks based on common functions such as selects, inserts, updates, and deletes. The scalability of DB2 9 is then demonstrated by increasing the number of concurrent users for each of the above XML workloads given a fixed number of processors. Last, the efficiency of the Intel Xeon processor 7100 Series is demonstrated by comparing it with the previous generation Xeon processor, and by studying the behavior of a fixed XML workload while increasing the number of processors in the test configuration.
Intel Xeon MP Server Processors
As businesses look to modernize their legacy infrastructure, or to automate new and innovative business processes, the demand for solutions built on a Service Oriented Architecture (SOA) is increasing. The underlying XML metadata format, instrumental to all SOA deployments, is prompting the evolution of enterprise software features and functions to support the richer solution capabilities enabled by this new industry standard. Leading this evolution, IBM has released DB2 9, the first hybrid database server featuring concurrent support for information stored in traditional relational formats and in pure XML data structures.
XML performance requirements, in turn, are demanding increased computational power beyond the incremental gains traditionally enabled by processor frequency increases. Toward this end, Intel has launched the Dual-Core Xeon processor 7100 Series for multi-processor server platforms. Each physical processor in this series contains two “execution cores.”
This expansion of parallel processing capability makes Intel® Xeon® processor-based servers a perfect fit for applications such as DB2 9, which can inherently take advantage of such parallelism.
Intel Xeon 7100 Series Processor
For this demonstration, the newly released Dual-Core Intel Xeon processor 7100 Series for multi-processor server systems (i.e., systems with 4 or more physical processors) will be used. Based upon Intel’s 65 nm manufacturing technology, the Xeon processor 7100 Series has numerous key features aimed at providing high performance and reliability.
Intel Xeon processor 7100 Series features:
- Intel NetBurst® microarchitecture
- Dual-core processing
- 64-bit addressing
- Intel® Hyper-Threading Technology+
- 1 MB L2 cache per core
- Up to 16 MB shared L3 cache
- Intel® Virtualization Technologyφ
- Intel® Cache Safe Technology
The Intel Xeon processor 7100 Series is designed for server workloads that demand high performance and maximum scalability. Throughput-oriented enterprise database applications can take advantage of each processor’s high single-thread performance. Furthermore, they can also benefit from the processor’s ability to execute multiple concurrent threads. Given that each processor contains two execution cores, and that each core is enabled with Intel Hyper-Threading technology, a single processor can execute up to four application threads simultaneously. For a four-processor server system, this equates to sixteen concurrent threads available to enterprise applications. Note, the Intel Xeon processor 7100 Series is also designed to readily integrate into larger symmetric multi-processing systems that can support up to 32 processors, and 128 simultaneous threads. Solution developers can enjoy this scalability and performance headroom in systems built with Intel Xeon Processors.
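The thread counts quoted above follow from simple multiplication. The short sketch below reproduces that arithmetic; the function and constant names are illustrative only and do not come from any Intel tool.

```python
# Hardware thread arithmetic for the figures quoted in the text.
CORES_PER_PROCESSOR = 2  # dual-core processor (from the text)
THREADS_PER_CORE = 2     # Hyper-Threading: two simultaneous threads per core


def hardware_threads(processors: int) -> int:
    """Return the number of simultaneous hardware threads in a system."""
    return processors * CORES_PER_PROCESSOR * THREADS_PER_CORE


# One processor, the four-processor test class, and the 32-processor maximum.
for processors in (1, 4, 32):
    print(processors, "processor(s):", hardware_threads(processors), "threads")
```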
Figure 1. The Intel Xeon processor 7100 Series features on-chip pathways and large L3 cache to improve performance.
SOA deployments often demand high transactional throughput capabilities. Consequently, developers must architect their solutions to minimize system latencies. While software techniques are available to address this concern, the underlying hardware must be architected efficiently to reduce latencies that often result from system interconnects. In addition to its parallel processing capabilities described above, the Intel Xeon processor 7100 Series also incorporates a new 16 MB Level 3 (“L3”) cache that is shared between the two execution cores of the processor. This large L3 cache allows large working datasets to be stored closer to the processor, which in turn reduces the number of high-latency accesses to system memory or to other processors in the system. In addition to reducing the average latency of requests from each core, the large L3 cache also reduces the total amount of traffic placed on the system front-side bus (FSB). In particular, the processor implements mechanisms for the two cores to communicate with each other through on-chip pathways. One example of such communication is the exchange of data held in their private caches (known as L1 and L2 caches), which further reduces FSB traffic and thus improves processor scaling on data-intensive workloads.
The Intel Xeon processor 7100 Series is designed for environments that require additional robustness and virtualization. For providing server reliability and maintaining up-time, Intel Cache Safe Technology allows errors in L3 cache to be overcome by disabling affected cache lines in the processor without reducing performance. In addition, new Intel Virtualization Technology has been enabled in each execution core to deliver more robust hardware-supported virtualization with less system overhead.
Such new features in the Intel Xeon processor 7100 Series allow it to outperform the Intel Xeon processor 7000? Series by up to 60% on business processing (ERP, SCM, CRM), up to 70% on transaction processing, and up to 100% on e-Commerce applications². These performance increases are combined with other design features to lower power consumption, resulting in a 2.8x performance per watt improvement over previous-generation Dual-Core Intel Xeon processors².
Additional performance data can be found at http://www.intel.com/performance/server/xeon_mp/server.htm.
For more information on the Intel Xeon 7100 Series processor, visit /sites/default/files/m/7/2/c/7100_prodbrief.pdf
Intel E8501 Chipset
Intel has also surrounded the Intel Xeon processor 7100 Series with a platform that provides the technologies necessary for providing balanced computing. The Intel® E8501 chipset supports up to 128 GB of DDR2-400 memory, and the latest I/O technology with PCI Express.* It also has a high-speed, 3-load, front-side system bus (800 MHz) that provides 12.8 GB/s system throughput. This platform also provides advanced RAS features, such as hot-plug I/O and memory, memory mirroring, and memory sparing, all of which are used to proactively protect data and improve security.
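The 12.8 GB/s figure can be sanity-checked with a short calculation. The sketch below assumes a 64-bit (8-byte) front-side bus data path and two independent buses, in line with the "dual 800 MHz FSB" listed in the test configuration; the bus width is an assumption and is not stated in this paper.

```python
# Peak front-side bus bandwidth under stated assumptions.
FSB_TRANSFERS_PER_SECOND = 800e6  # 800 MHz effective rate (from the text)
FSB_WIDTH_BYTES = 8               # assumption: 64-bit data path
FSB_COUNT = 2                     # dual FSB, per the test configuration


def peak_fsb_bandwidth_gb_s(buses: int = FSB_COUNT) -> float:
    """Return the theoretical peak bandwidth in GB/s across all buses."""
    return buses * FSB_TRANSFERS_PER_SECOND * FSB_WIDTH_BYTES / 1e9


print("Per bus: ", peak_fsb_bandwidth_gb_s(1), "GB/s")
print("Combined:", peak_fsb_bandwidth_gb_s(), "GB/s")
```

These are theoretical peaks; sustained throughput on a real workload will be lower.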
For more information on the Intel E8501 Chipset, visit http://www.intel.com/products/server/chipsets/e8501/e8501-overview.htm*.
DB2 9 delivers new features that address the needs of today's businesses, including integrating business data from across organizations, focusing limited IT resources on creating business value, and providing a secure and resilient information management system for valuable information assets.
Data server for an XML-based service-oriented architecture
More information is in XML format, or directly storable as XML format, than in relational data tables. Most of this XML information is neither protected nor utilized to the same extent as other data because, until now, doing so has been cost-prohibitive. DB2 9 introduces the first hybrid data server for the industry, serving data from both pure relational and pure XML structures. With DB2 9 pureXML technology, XML documents are stored and processed as type-annotated trees. This is unlike previous technology in commercial relational databases, which stores XML documents either as large objects (BLOBs or CLOBs) or by dividing XML into a set of relational tables. This technology delivers unprecedented application performance and development time/cost savings that make XML data cost-effective for the first time, enabling greater business insight faster, and at lower cost.
To provide this innovative support for managing XML data, DB2 9 features new XML-specific storage management, indexing, and optimization techniques. It also interfaces to a wide range of popular programming languages, allowing users to optionally validate their XML data prior to storage, and extends popular database utilities important for importing data and administering the environment.
IBM continues to simplify deployment of their DB2 offerings. With new features such as non-administrator installation on the Windows* platform, response file installation enhancements, and support for coexistence of multiple copies of the DB2 database system, DB2 9 allows IT staff to spend more time supporting their business needs instead of installing and deploying database systems. Further, the following new autonomic features help improve productivity by reducing the time required to administer and tune your database system:
New automated database administration features that improve productivity
- Self-tuning memory manager (STMM), which helps reduce or eliminate the task of configuring a DB2 server by continuously updating configuration parameters, resizing buffer pools, and dynamically distributing available memory resources.
- Automatic storage support, which automatically grows the size of the database across disks and file systems, is now available for multipartition databases.
- Automatic configuration, which tunes pre-fetchers and page cleaners based on DB2 database system environment characteristics.
- New automatic table and index reorganization policy options, which provide the database administrator (DBA) with more capabilities for managing table and index reorganization.
For more information on DB2 9 pureXML features, visit http://www-01.ibm.com/software/data/db2/xml/*.
The IBM N Series Storage System
The IBM N Series storage system provides a range of reliable, scalable storage solutions for a variety of storage requirements. N Series storage systems provide transport-independent, seamless data access using block-level and file-level protocols from the same platform. Block-level access is available over a Fibre Channel SAN fabric using FCP and over an IP-based Ethernet network using iSCSI, while file-level access is available over an IP-based Ethernet network using NFS, CIFS, HTTP, and FTP. A Web-based graphical interface (FilerView*), in addition to a command-line interface, provides an administrator with a simple-to-use mechanism for managing all aspects of an N Series storage system.
The N Series storage system delivers high performance because its operating system, a robust, tightly coupled, multi-tasking, real-time micro-kernel called Data ONTAP,* was designed and optimized from the ground up for network file service. Data ONTAP has a look and feel similar to UNIX,* but it is a proprietary kernel produced by Network Appliance, Inc. At the lowest level, the Data ONTAP kernel contains three basic elements:
- A network interface driver
- A RAID manager
- The Write Anywhere File Layout (WAFL*) file system
The network interface driver within Data ONTAP is responsible for receiving all incoming NFS, CIFS, FCP, iSCSI, HTTP, and FTP requests. It will log all incoming requests in non-volatile RAM (NVRAM), then send an immediate acknowledgement to the requestor, and finally initiate any processing needed to satisfy the request. Once initiated, this processing will run uninterrupted to completion.
The N Series storage system uses RAID (Redundant Array of Inexpensive Disks) to protect against data loss caused by disk failure. The N Series storage system uses RAID 4 technology that is heavily optimized to work in tandem with the WAFL file system.
This optimization provides all the benefits of RAID 4 protection without incurring the performance disadvantages that are often associated with general-purpose RAID 4 solutions. And because the N Series RAID 4 design does not interleave parity information as generic RAID 5 implementations do, the overall system can be expanded quickly and easily. The N Series also has the ability to support a new type of RAID protection known as RAID Double Parity or RAID-DP,* featuring fault tolerance protection that is 10,000 times more reliable than traditional RAID implementations.
WAFL is a unique file system that is optimized for network file access. In many ways, WAFL is similar to other UNIX file systems such as the Berkeley Fast File System (FFS) and the TransArc Episode file system. Specifically, it is a block-based file system that uses inodes to describe files, and 4 KB blocks with no fragments. But its unique value is its ability to store sufficient meta-data to enable it to function with any of the current mainstream operating systems (UNIX, Linux*, and Windows) as well as interoperate with block-level protocols such as FCP and iSCSI.
For more information on Data ONTAP, WAFL, RAID-DP, and FlexVol, visit
/sites/default/files/m/e/5/d/3356.pdf* (PDF 789KB)
/sites/default/files/m/b/1/2/3001.pdf* (PDF 238KB)
Test scenario: Online brokerage
This test scenario models an online brokerage. It is simplified for the purpose of this paper, but it is still representative of a real brokerage in terms of documents, transactions, and XML schemas. The main logical data entities in this scenario are:
- Customer: A single customer can have one or more accounts.
- Account: Each account contains one or more holdings.
- Holding: The number of shares of a security.
- Security: Identifier for a holding available for order.
- Order: Each order buys or sells exactly one security for exactly one account.
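The relationships in this list can be sketched as a small set of data classes. This is only an illustration of the cardinalities described above; the field names are hypothetical and do not reflect the actual CUSTACC, SECURITY, or FIXML schemas used in the test.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Security:
    symbol: str  # hypothetical identifier field
    name: str


@dataclass
class Holding:
    security: Security
    shares: int  # number of shares held of this security


@dataclass
class Account:
    account_id: str
    holdings: List[Holding] = field(default_factory=list)  # one or more


@dataclass
class Customer:
    customer_id: str
    accounts: List[Account] = field(default_factory=list)  # one or more


@dataclass
class Order:
    account_id: str  # exactly one account per order
    symbol: str      # exactly one security per order
    side: str        # "buy" or "sell"
    quantity: int


# A customer with one account holding 100 shares, plus one buy order.
example_security = Security(symbol="XYZ", name="Example Corp")
example_customer = Customer(
    customer_id="C1",
    accounts=[Account("A1", [Holding(example_security, 100)])],
)
example_order = Order(account_id="A1", symbol="XYZ", side="buy", quantity=10)
print(example_customer)
print(example_order)
```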
Figure 2 below shows the relationship between the entities.
For each customer, there is a CUSTACC document that contains all customer information, account information, and holding information for that customer. Orders are represented using FIXML 4.4. FIXML is an industry-standard XML schema for trade-related messages such as buy or sell orders (www.fixprotocol.org). ORDER documents have many attributes and a high ratio of nodes to data. SECURITY documents represent the majority of US-traded stocks and mutual funds, and use actual security symbols and names.
Three separate workloads are executed against the database.
- Insert workload.
- Read-only workload.
- Mixed (reads and writes) workload.
Figure 2: Data entities and their XML schema.
Table 1: Summary of XML queries.
|Query No.||Query Name||CUSTACC||SECURITY||ORDER||Description|
|1||get_order||-||-||X||Return full order document without the FIXML root element.|
|2||get_security||-||X||-||Return full security document.|
|3||customer_profile||X||-||-||Extract seven customer elements to construct a profile document.|
|4||search_securities||-||X||-||Extract elements from some securities, based on four predicates.|
|5||account_summary||X||-||-||Complex construction of an account statement.|
|6||get_security_price||-||X||-||Extract the price of a security.|
|7||customer_max_order||X||-||X||Join CUSTACC and ORDER to find the max of a customer's orders.|
All workloads are characterized by a large degree of concurrency. The workloads are executed by a Java* driver that spawns 1 to n concurrent threads. Each thread simulates a user that connects to the database and submits a stream of transactions without think time. Each stream represents a weighted random sequence of transactions that are picked from a list of transaction templates. Each transaction is assigned a weight that determines the transaction's percentage in the workload.
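The driver behavior described above can be sketched in a few lines. The example below is written in Python for brevity, whereas the driver used in the tests was a Java program; the template names and weights are placeholders, and no database calls are made.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transaction templates and their weights (percent of workload).
TEMPLATES = {"query_a": 70, "insert_b": 20, "delete_c": 10}


def run_user(user_id: int, transactions: int) -> Counter:
    """Simulate one user submitting a weighted random stream, no think time."""
    rng = random.Random(user_id)  # per-user generator keeps runs repeatable
    names = list(TEMPLATES)
    weights = list(TEMPLATES.values())
    picked = rng.choices(names, weights=weights, k=transactions)
    # A real driver would execute each picked template against the database.
    return Counter(picked)


def run_workload(users: int, transactions_per_user: int) -> Counter:
    """Run all simulated users concurrently and merge their counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = pool.map(run_user, range(users), [transactions_per_user] * users)
        for counts in per_user:
            total.update(counts)
    return total


if __name__ == "__main__":
    print(run_workload(users=4, transactions_per_user=1000))
```

Over a long enough stream, the share of each template converges on its weight, which is how a weight determines a transaction's percentage in the workload.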
The insert workload populates the database with approximately 50 GB of raw XML data:
- 3 million CUSTACC documents
- 15 million ORDER documents
- 20,833 SECURITY documents
The read-only workload consists of seven XML queries executing with different degrees of concurrency, specifically 25, 50, 75, 100, 125, and 150 concurrent users. The seven queries are equally weighted in the workload and have the following characteristics in common:
- They are written in standard-compliant SQL/XML notation, such as SQL with embedded XQuery, taking advantage of parameter markers. For more information, refer to Advancements in SQL/XML³. (PDF 167KB)
- They use the SQL/XML predicate XMLEXISTS to select XML documents based on one or multiple conditions that are expressed in XQuery notation.
- They use the SQL/XML function XMLQUERY to retrieve full or partial XML documents, or to construct new result documents that are different from the ones stored in the database.
- They use XML namespaces corresponding to the namespace in the XML data.
- They take advantage of one or multiple XML indexes to entirely avoid table scans.
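To make these characteristics concrete, the sketch below assembles one SQL/XML statement of the general shape described, using XMLEXISTS, XMLQUERY, a namespace declaration, and a parameter marker. It is not one of the seven test queries: the table name, column name, element names, and namespace URI are all hypothetical, and the statement is only built as a string, not executed.

```python
# Hypothetical namespace; the real documents use their own namespace URIs.
NAMESPACE = "http://example.com/security"


def security_price_query(namespace: str = NAMESPACE) -> str:
    """Build a SQL/XML statement that returns the price of one security."""
    prolog = f'declare default element namespace "{namespace}"; '
    return (
        # XMLQUERY extracts a fragment of each qualifying document.
        "SELECT XMLQUERY('" + prolog + "$d/Security/Price' "
        'PASSING sdoc AS "d") '
        "FROM security "
        # XMLEXISTS selects documents; the ? is a parameter marker.
        "WHERE XMLEXISTS('" + prolog + "$d/Security[Symbol = $sym]' "
        'PASSING sdoc AS "d", CAST(? AS VARCHAR(10)) AS "sym")'
    )


print(security_price_query())
```

The parameter marker lets the same prepared statement be reused for different symbols, and a predicate of this form is the kind an XML index on the hypothetical Symbol element could serve.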
Table 1 shows the seven queries in terms of their distinguishing characteristics and the tables they touch.
The mixed workload consists of the seven queries in addition to document updates, document deletes, and document inserts. Similar to the read-only workload, the mixed workload is executed with different degrees of concurrency using 25, 50, 75, 100, 125, and 150 concurrent users. The distribution of the transaction types is:
- 70% read operations: queries
- 30% write operations: 6% document updates, 12% document deletes, and 12% document inserts.
The following observations and resulting parameters are used in defining the update/delete/insert transactions:
- Customer accounts get updated to reflect trades (execution of orders), but not necessarily immediately after every order (3% CUSTACC updates).
- ORDER documents do not get updated in our scenario (hence no update order transaction).
- Security prices are updated regularly during a business day (3% security updates).
- The turnover of customers is low (2% CUSTACC inserts, 2% CUSTACC deletes).
- New orders arrive continuously; old orders get pruned from the system eventually and at the same rate (10% ORDER insert, 10% ORDER delete).
- The number of securities is fixed (no delete or insert transactions).
By combining and applying these objectives, the transaction mix, shown in Table 2, was produced.
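As a quick consistency check, the percentages listed in the observations above can be summed by operation type and compared with the stated 6% / 12% / 12% split. The labels below are descriptive only and are not the transaction names used in Table 2.

```python
from collections import defaultdict

READ_PERCENT = 70  # the seven queries, taken together

# (document type, operation, percent of total) from the observations above.
WRITE_MIX = [
    ("CUSTACC", "update", 3),
    ("SECURITY", "update", 3),
    ("CUSTACC", "insert", 2),
    ("ORDER", "insert", 10),
    ("CUSTACC", "delete", 2),
    ("ORDER", "delete", 10),
]


def totals_by_operation(mix):
    """Sum the workload percentage for each operation type."""
    totals = defaultdict(int)
    for _document, operation, percent in mix:
        totals[operation] += percent
    return dict(totals)


by_operation = totals_by_operation(WRITE_MIX)
write_percent = sum(by_operation.values())
print(by_operation)
print("reads + writes =", READ_PERCENT + write_percent)
```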
Update transactions first read a specific document based on an XQuery predicate, and then use that predicate to update the original copy of the document in the database. In reality, the document would be modified between the read and the update steps, but this is of low relevance for the purposes of this article and therefore avoided for simplicity. Last, insertions are performed without XML schema validation.
Documents in the database are randomly selected for update and delete operations. Each newly inserted ORDER and CUSTACC document becomes immediately eligible for update or delete by a subsequent transaction.
The test environment consisted of DB2 9 running on an Intel Xeon processor 7100 Series-based server. A secondary Intel Xeon processor 7000 Series-based server was configured as a client machine and was used to drive the workloads. Attached to the data server were two IBM N5500 storage systems running Data ONTAP version 7.0.4. Each N5500 was equipped with four shelves of disks, or fifty-six disks. The operating system installed on both the client and data server was Novell SUSE Linux Enterprise Server* 9 service pack 3.
Intel Xeon processor MP server
Intel® S3E3344 Software Development Platform:
- Four 3.4 GHz Intel Xeon processors 7140M? with 1 MB L2 cache, 16 MB L3 cache
- Intel E8501 chipset (dual 800 MHz FSB)
- 16 GB DDR2-400 memory
- BIOS version RC29
Table 2: Mixed workload transactions
|TRX #||Name||Type||Percent of Total|
The database was created using automatic storage on eight logical disks. Six table spaces were created and configured: three for the tables CUSTACC, ORDER, and SECURITY, and three for the indexes on each table. Four additional buffer pools, with an aggregate size of 11.3 GB, were defined for the CUSTACC table, the ORDER table, and their indexes. All other database memory was managed by the self-tuning memory manager (STMM) in DB2 9. Examples of memory that is managed by STMM include sortheap, lock lists, package cache, and the default buffer pool IBMDEFAULTBP.
N5500 Storage System
Each N5500 storage system was configured to contain a single large aggregate of forty-eight disks. The disks were organized into three 16-disk RAID groups, each containing one single parity disk and one double parity disk (i.e., three 16-disk RAID groups using RAID-DP). Four FlexVol volumes were created on each aggregate for the database tables; an extra FlexVol volume was created on one aggregate to hold the database logs.
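The disk layout described above implies the following split between data and parity disks. This is a simple arithmetic sketch based only on the numbers in this section; how the disks outside each aggregate were used is not stated in the paper, so they are left out.

```python
# Disk layout arithmetic for the aggregate described in the text.
DISKS_PER_AGGREGATE = 48
RAID_GROUP_SIZE = 16
PARITY_DISKS_PER_GROUP = 2  # RAID-DP: one parity disk plus one double-parity disk
STORAGE_SYSTEMS = 2         # two N5500 systems in the test environment

raid_groups = DISKS_PER_AGGREGATE // RAID_GROUP_SIZE
data_disks_per_group = RAID_GROUP_SIZE - PARITY_DISKS_PER_GROUP
data_disks_per_aggregate = raid_groups * data_disks_per_group
parity_disks_per_aggregate = raid_groups * PARITY_DISKS_PER_GROUP

print("RAID groups per aggregate: ", raid_groups)
print("Data disks per aggregate:  ", data_disks_per_aggregate)
print("Parity disks per aggregate:", parity_disks_per_aggregate)
print("Data disks, both systems:  ", STORAGE_SYSTEMS * data_disks_per_aggregate)
```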
An innovative backup and restore technology was used throughout this test environment. The N Series storage system features Network Appliance Snapshot* technology, which captures read-only copies of an entire file system at any given point in time. Each read-only copy of the file system is called a Snapshot copy, which can be used to return a FlexVol to the state it was in at the time the Snapshot copy was taken. For this test environment, a Snapshot copy of each FlexVol was captured after each database was created and populated. Each Snapshot copy was subsequently used to return the database to its initial state after a workload was processed. This approach allowed all test databases to be backed-up and restored in a matter of seconds, regardless of their size.
Figure 3: Performance Scalability of Read-only Workload
Figure 4: Performance Scalability for Mixed Workload.
DB2 9 pureXML Performance Measurements
The DB2 9 database was initially populated with 3 million customer account documents (CUSTACC), 15 million order documents (ORDER), and 20,823 securities (SECURITY). These documents were then inserted into the database to evaluate the system’s insert performance when working with XML documents.
Insertion of the CUSTACC documents occurred at a rate of 1400 documents per second at 50 concurrent users. ORDER documents were inserted at a rate of 4500 documents per second at 100 users.
Figure 3 shows the performance and scalability for the read-only workload with various numbers of concurrent users. The throughput peaks at around 125 users with a throughput of 4793 transactions/second (tps). The graph shows that all key metrics - query performance, CPU utilization, and IO throughput - scale linearly up to 100 users. From 100 to 125 users, however, a small drop in scalability is evident. This reduction is due to the nature of the workload driver, which attempts to maintain balanced execution of all seven queries. Specifically, as the system resources become limited with a larger number of users, the average response time for the most complex query (Q7) increases. And because the driver maintains an even distribution of query executions, it scales back execution of the other queries, thus not pushing the system to its full limits.
The mixed workload, depicted in Figure 4, shows good scaling with respect to CPU utilization. Throughput peaks at around 3,752 tps at 200 users, while scaling drops off at around 150 users. At this point, the I/O subsystem is near capacity, operating at approximately 13,000 I/O operations per second (IOPS). The throughput increase above 150 users is a result of better buffer pool hit rates.
Figure 5: Intel Xeon processor 7100 Series efficiency from L3 cache
++ Intel Xeon processor 7000 Series running at 3.0 GHz with 2 MB L2
Figure 6: Intel Xeon processor 7100 Series multiprocessor scalability
Intel Xeon 7100 Series Processing Efficiency on Read-Only Workload
The number of clock cycles that a processor takes to execute an instruction, frequently called clocks per instruction (CPI), reflects its processing efficiency. In the absence of other bottlenecks such as those arising from serialized execution or disk and network latencies, increases in CPI generally indicate hardware inefficiencies in processors or in the memory subsystem. Such inefficiencies, in turn, can impose limits on multiprocessor scalability. Using low-level processor event calibration methods, CPI measurements were obtained first to directly compare the efficiency of the Intel Xeon processor 7100 Series architecture with the previous-generation offering, then to test its scaling capability as the number of processors increases. For both studies, 125 users were simulated in the XML read-only query workload.
By comparison, the Intel Xeon processor 7100 Series-based system is similar in most respects to its predecessor, except for its 13% frequency improvement and its large 16 MB L3 cache, described earlier, that is shared between the two execution cores. Figure 5 reveals two significant advantages of the Intel Xeon processor 7100 Series architecture that must be attributed to more than just the incremental clock frequency: First, its XML throughput is 54% greater than that of the previous-generation CPU, and second, it requires 31% fewer CPU cycles than its predecessor to compute the same XML read-only workload. This data confirms that the large shared L3 cache on the Intel Xeon processor 7100 Series is extraordinarily effective in improving its processing performance and efficiency in XML environments.
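A rough calculation illustrates why clock frequency alone cannot account for the throughput gain. The sketch below uses the 3.4 GHz and 3.0 GHz clock speeds given for the two processors in this paper and the reported 54% throughput increase; the assumption that throughput would otherwise scale in direct proportion to frequency is a simplification made here, not a claim from the measurements.

```python
# How much of the throughput gain is left after accounting for clock speed?
NEW_CLOCK_GHZ = 3.4        # Xeon processor 7100 Series system under test
OLD_CLOCK_GHZ = 3.0        # previous-generation comparison system
THROUGHPUT_RATIO = 1.54    # 54% more XML transactions (from the text)

frequency_ratio = NEW_CLOCK_GHZ / OLD_CLOCK_GHZ
frequency_gain = frequency_ratio - 1.0

# Simplifying assumption: throughput scales linearly with frequency, so the
# remaining factor is attributed to everything else (cache, efficiency).
residual_ratio = THROUGHPUT_RATIO / frequency_ratio

print(f"Frequency gain: {frequency_gain:.1%}")
print(f"Gain not explained by frequency: {residual_ratio - 1.0:.1%}")
```

Under this simplification, roughly a third more throughput remains after the clock increase is factored out, which is consistent with the paper attributing the gain to more than frequency.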
To test the multiprocessor scalability of the Intel Xeon processor 7100 Series, the read-only query workload was again used while additional processors were added. Figure 6 shows the absolute and relative throughput variation as the number of processor execution cores increases from two, to four, to eight, with all multiprocessor CPI results normalized against the data for one processor. The graph shows a relatively constant CPI even as the number of processors and execution cores is increased. These results indicate, once again, that the large L3 cache on the Intel Xeon processor 7100 Series is very effective in reducing system latencies, which in turn translates to near-linear performance scaling in the read-only XML environment.
The goal of this performance study was to demonstrate operating performance characteristics using the latest Intel Xeon processor server hardware, IBM N Series storage, and DB2 9 software for an XML workload. A summary of the combined performance characteristics is as follows:
- The Insert workload experienced over 4,500 inserts per second with 100 concurrent users.
- The Read-Only workload experienced over 4,700 transactions per second at 125 concurrent users.
- The Mixed workload experienced over 3,700 transactions per second at 200 concurrent users.
The results also demonstrate how much more efficiently solutions based on the latest Intel Xeon processors run DB2 9 with an XML workload, as compared to those built with previous-generation Dual-Core Intel Xeon processors. Specifically, the Intel Xeon processor 7100 Series:
- Supports 54% more XML transactions than the previous generation, and
- Requires 31% fewer CPU cycles than the previous generation to process the same XML workload
Companies considering SOA deployments with large XML data-centric servers can look forward to adding business value with these compelling offerings from Intel and IBM.
Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability or infringement of any patent, copyright, or other intellectual property right. Intel products are not intended for use in medical, life-saving or life-sustaining applications.
¹ XML applications generally fall into two categories: data-centric and document-centric. Data-centric applications are characterized by higher volume of small transactions. Document-centric applications are characterized by variable volume rate with larger sized transactions.
² Source: Intel internally measured results as of August 1, 2006. See the benchmark section in /sites/default/files/m/7/2/c/7100_prodbrief.pdf (PDF 196KB) for details.
³ See http://www.sigmod.org/publications/sigmod-record/0409/11.JimMelton.pdf/view*. (PDF 167KB)
? Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.
+ Hyper-Threading Technology requires a computer system with an Intel® processor supporting HT Technology and a HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/products/ht/Hyperthreading_more.htm for additional information.
φ Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.