by Padma Apparao
Parsers based on either DOM or SAX each provide unique advantages; requirements such as throughput, response time, and resource usage should guide the choice between one or the other.
The benefits of XML spring primarily from the standard's inherent flexibility and its suitability for use with any kind of data at any depth of complexity. The extensibility features of XML enable one to define new tags as needed. XML also supports validation of a document's structural correctness through Document Type Definitions (DTDs) and schemas, and it provides media independence for publishing content in multiple formats.
The portability and extensibility of both XML and Java* technology make them a solid choice for the flexibility and wide availability requirements of the Web. Java revolutionized the programming world by providing a platform-independent programming language. XML takes the revolution a step further with a platform-independent language for interchanging data. The two together make the entire solution of building and deploying applications a portable one.
As XML is evolving as a standard for data communication, Intel is working to optimize the software stack on Intel® architecture, as well as looking for opportunities to build features into the silicon that will benefit XML processing technologies. This article reviews some of the things that Intel is doing to understand and enhance what lies under the hood for XML processing, to enhance our contributions in that space.
The Primary XML Parser Technologies are SAX and DOM
An XML document first has to be parsed, and after that, several operations can be performed on it. The processing can take the form of XPath, XSLT, XQuery, and other operations. In this paper, I will discuss only XML parsing and the performance results from a series of experiments. There are two main technologies for XML parsing: the Simple API for XML (SAX) and the Document Object Model API (DOM).
SAX is a public-domain API for an event-based parser. Event-based means that the parser reads an XML document from beginning to end, and each time it encounters a syntax construction, it notifies the application that is running it, and the application must implement the appropriate methods to handle the callbacks and get the functionality needed. The SAX parser can perform validation while parsing XML data, following the rules provided in the XML document's DTD. Details of the SAX API are available at http://www.saxproject.org/*.
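The callback model described above can be sketched in Java as follows. This is a minimal illustration, not the code used in the experiments: it relies on the JAXP parser bundled with the JDK, and the class name `SaxEventDemo`, the handler `RecordingHandler`, and the sample document are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventDemo {

    /** Records each callback as a line of text, in the order the parser raises it. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            events.add("startElement " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver one text node in several chunks; each chunk is recorded.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("characters " + text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement " + qName);
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }
    }

    /** Parses the XML text and returns the recorded event sequence. */
    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RecordingHandler handler = new RecordingHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, used only for illustration.
        for (String event : parse("<order><item>pen</item></order>")) {
            System.out.println(event);
        }
    }
}
```

The parser keeps no record of the document here; anything the application wants to retain must be stored by the handler itself, which is why the events are appended to a list.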
DOM, as defined by the W3C DOM working group, is a set of interfaces for building an in-memory representation of the parsed XML document. The object representation is in the form of a tree, and one can traverse and manipulate the tree with DOM methods like insert and delete, just like any other tree data structure. The DOM Level 1 specification defines these interfaces using the Interface Definition Language (IDL). Details of the DOM Level 1 specification are available at http://www.w3.org/TR/REC-DOM-Level-1/*.
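As a rough sketch of the tree model, the Java example below builds a DOM tree with the JDK's JAXP `DocumentBuilder`, inserts one node, deletes another, and then walks the remaining children. The class name `DomTreeDemo`, its helper methods, and the `order`/`item` sample document are assumptions made for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomTreeDemo {

    /** Builds the in-memory tree for the given XML text. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the first child that is an element, skipping text and comment nodes. */
    static Element firstElementChild(Element parent) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) c;
            }
        }
        return null;
    }

    /** Collects the text content of every element child, in document order. */
    static List<String> childTexts(Element parent) {
        List<String> texts = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                texts.add(c.getTextContent());
            }
        }
        return texts;
    }

    /** Appends a new item element to the root, removes the first element child, lists the rest. */
    static List<String> editOrder(String xml, String newItemText) throws Exception {
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();

        Element added = doc.createElement("item");   // insert
        added.setTextContent(newItemText);
        root.appendChild(added);

        Element first = firstElementChild(root);     // delete
        if (first != null) {
            root.removeChild(first);
        }
        return childTexts(root);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, used only for illustration.
        String xml = "<order><item>pen</item><item>ink</item></order>";
        System.out.println(editOrder(xml, "paper"));
    }
}
```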
SAX Requires Fewer Resources but Provides Fewer Features than DOM
Figure 1 shows a sample XML document. Figure 2 and Figure 3 show the corresponding SAX and DOM outputs. SAX generates events that are handled by the caller, and DOM generates the in-memory tree.
Figure 1: Sample XML Document
Figure 2: Events Generated by a SAX Parser
Figure 3: DOM Tree for the Sample Document
Unlike the SAX parser, a DOM parser allows random access to particular pieces of data in an XML document. A SAX parser is also limited to reading a document, whereas a DOM parser allows for manipulation of the document's contents. The DOM parser puts a great strain on the memory-resource subsystem, particularly if the document is large, since the entire tree is built in memory.
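To illustrate random access and in-place modification, the following Java sketch reads the second `item` element directly, rewrites the first, and serializes the result. It assumes the JDK's JAXP and `javax.xml.transform` APIs; the class `DomRandomAccess`, its methods, and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomRandomAccess {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Reads the text of the index-th element with the given tag name, in document order. */
    static String textOf(Document doc, String tag, int index) {
        return doc.getElementsByTagName(tag).item(index).getTextContent();
    }

    /** Replaces the text of the index-th element with the given tag name. */
    static void setText(Document doc, String tag, int index, String value) {
        doc.getElementsByTagName(tag).item(index).setTextContent(value);
    }

    /** Writes the (possibly modified) tree back out as XML text without a declaration. */
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, used only for illustration.
        Document doc = parse("<order><item>pen</item><item>ink</item></order>");
        System.out.println(textOf(doc, "item", 1)); // jump straight to the second item
        setText(doc, "item", 0, "pencil");          // modify the tree in place
        System.out.println(serialize(doc));
    }
}
```

Both operations depend on the whole tree being resident in memory, which is the source of the memory cost noted above.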
Using a SAX parser, one can scan and parse gigabytes worth of XML documents without reaching system limits, as it does not create any data structures in memory; rather, it simply raises events that the application should handle. For this reason, the SAX implementation is faster and requires fewer resources than the DOM implementation.
The SAX parser is generally the superior model when one wants to locate specific parts of documents. The parser raises an event when the specific element in the document is found. SAX implementation is the more challenging of the two, however, because the API requires the development of callback functions to handle the events. This characteristic makes the design less modular, since manipulation, serialization, and traversing of the XML document are left to the application developer.
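One way to locate a specific element with SAX is sketched below in Java. The handler collects the text of the first element whose name matches a target and then aborts the parse by throwing an exception, a common idiom rather than a feature of the SAX API itself. The names `SaxFindElement`, `FinderHandler`, and `FoundException` are hypothetical, and the sketch assumes the target element is not nested inside another element of the same name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFindElement {

    /** Thrown by the handler to stop parsing once the target element has been read. */
    static class FoundException extends SAXException {
        FoundException() {
            super("target element found");
        }
    }

    static class FinderHandler extends DefaultHandler {
        private final String target;
        private final StringBuilder text = new StringBuilder();
        private boolean inside = false;
        String result = null;

        FinderHandler(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equals(target)) {
                inside = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so it is accumulated until the end tag.
            if (inside) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (inside && qName.equals(target)) {
                result = text.toString();
                throw new FoundException(); // skip the rest of the document
            }
        }
    }

    /** Returns the text of the first matching element, or null if none is found. */
    static String findFirst(String xml, String elementName) throws Exception {
        FinderHandler handler = new FinderHandler(elementName);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (FoundException stop) {
            // Expected: the handler ended the parse early.
        }
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, used only for illustration.
        String xml = "<catalog><book><title>Hamlet</title></book>"
                + "<book><title>Macbeth</title></book></catalog>";
        System.out.println(findFirst(xml, "title"));
    }
}
```

The state flags and text buffer in the handler show the bookkeeping that SAX leaves to the application developer.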
In order to process XML data, every program needs a parser, which therefore forms a key building block for every XML application. Parsers differ not only in their support for checking and transforming documents, but also in the way they read documents. To read documents sequentially, you will need an event-based parser that follows the SAX API, while other uses may call for the DOM specification.
Most application developers will have to decide on the kind of parser they need for the applications based on criteria like the functionality, speed, memory requirements, and class footprint size. Dennis Sosnoski, in his article, "Java Document Model Usage" has compared Xerces*, Crimson*, EXML (Electric XML)*, JDOM*, dom4j*, and XPP (XML Pull Parser)*, based on their ease of use. Details are available at http://www-128.ibm.com/developerworks/xml/library/x-injava2/*.
Choice between SAX and DOM Depends upon the Application
The primary basis for the decision between SAX and DOM parser technologies is an individual application's needs in such areas as throughput, response time, and resource usage. The results obtained by our experiments show benefits to using either kind of parser. For the purposes of my experiments, I have chosen four different XML files to serve as input documents. The characteristics of these files are shown in Table 1. These experiments were performed on a system based on an Intel® Pentium® III processor with 1GB of memory.
The database input file consists of a large number of tags, of which only 0.01% are unique; the file represents a database table. The file play.xml is the play Hamlet written in XML, with tags averaging only five characters in length. The input file periodictable.xml is the chemical periodic table written in XML, and it is characterized by just one of the tags being unique. The last input document is the file soap.xml, which represents the typical SOAP file used in Web services. It has a very high tag content (69%), with very few of the tags being unique.
Table 1: Characteristics of the XML Input Documents
SAX Supports Faster Parsing Throughput than DOM
In Figure 4 we have compared the performance of open-source SAX and DOM parsers (Xerces2_1_0) on the different input files. In terms of the different input files, the number of tags parsed in database.xml is much higher than in the others, since the percentage of unique tags in this file is extremely low at 0.01%. This characteristic outweighs the large tag length; thus, a large number of tags are processed per second. Since most of the tags are duplicate tags, the cache behavior is quite good, and hash-table data-structure maintenance is simplified.
The file periodic.xml has larger percentages of unique tags (1%) and larger tag lengths, so it processes the tags much more slowly than others. The soap file and the play.xml file both have short tags, but the percentage of tags in the file drives the parsing performance, with the SOAP file parsing a far smaller number of tags per second.
Figure 4: Tags Parsed per Second
Figure 5 indicates that the SAX parser is able to parse about 30% to 50% more tags than the DOM parser. The reason for this disparity is that a SAX parser does not create any data structures and so has less turnaround time to get to the next tag in the input stream. The tag lengths seem to play a more significant part here than the uniqueness of the tags, as periodic.xml has larger tag lengths (13) than the other files (with the exception of database.xml). In the latter case, the low level of uniqueness in the tags overcomes the effect of tag length.
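A tags-per-second comparison of this kind could be set up along the following lines. This Java sketch is not the harness used for the experiments: it uses the JDK's bundled JAXP parsers rather than Xerces 2.1.0, counts start tags only, and runs on a synthetic table-like document generated by the hypothetical helper `syntheticTable`. Single-shot timings like these are also distorted by JIT warm-up, so the printed rates are only indicative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParseThroughput {

    /** Counts start tags with SAX; nothing is retained between events. */
    static int countTagsSax(String xml) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local, String qName, Attributes a) {
                        count[0]++;
                    }
                });
        return count[0];
    }

    /** Builds the full DOM tree, then counts every element in it. */
    static int countTagsDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("*").getLength();
    }

    /** Generates a table-like document with few unique tag names (hypothetical test data). */
    static String syntheticTable(int rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int i = 0; i < rows; i++) {
            sb.append("<row><id>").append(i).append("</id><name>n").append(i).append("</name></row>");
        }
        return sb.append("</table>").toString();
    }

    static double tagsPerSecond(int tags, long nanos) {
        return tags / (nanos / 1e9);
    }

    public static void main(String[] args) throws Exception {
        String xml = syntheticTable(20000);

        long t0 = System.nanoTime();
        int saxTags = countTagsSax(xml);
        long t1 = System.nanoTime();
        int domTags = countTagsDom(xml);
        long t2 = System.nanoTime();

        System.out.printf("SAX: %d tags, %.0f tags/s%n", saxTags, tagsPerSecond(saxTags, t1 - t0));
        System.out.printf("DOM: %d tags, %.0f tags/s%n", domTags, tagsPerSecond(domTags, t2 - t1));
    }
}
```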
The traversal of the DOM trees is much faster than the construction, since it is a matter of following pointers in the data structure, without need for allocating any new data structures. The height and the width of the tree will impact tree traversal.
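The pointer-following traversal can be sketched as below, along with the two shape measures mentioned above. In this Java example the definitions of height (number of element levels) and width (largest number of elements on any one level) are assumptions made for illustration, as are the class name `DomShape` and the sample document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomShape {

    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    /** Number of element levels below and including this node (a leaf element has height 1). */
    static int height(Node node) {
        int deepest = 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                deepest = Math.max(deepest, height(c));
            }
        }
        return 1 + deepest;
    }

    /** Largest number of elements found on any single level of the tree. */
    static int maxWidth(Element root) {
        List<Node> level = new ArrayList<>();
        level.add(root);
        int widest = 0;
        while (!level.isEmpty()) {
            widest = Math.max(widest, level.size());
            List<Node> next = new ArrayList<>();
            for (Node n : level) {
                for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (c.getNodeType() == Node.ELEMENT_NODE) {
                        next.add(c);
                    }
                }
            }
            level = next;
        }
        return widest;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, used only for illustration.
        Element root = parseRoot("<a><b><c/></b><b/></a>");
        System.out.println("height=" + height(root) + " width=" + maxWidth(root));
    }
}
```

Neither method allocates DOM nodes; each only follows the existing child and sibling links, which is why traversal is cheaper than construction.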
Figure 5: SAX vs. DOM Parsing Tags
DOM Allocates Far More Memory than SAX
The number of bytes allocated is quite dependent on the number of characters parsed. The database file database.xml is the largest byte consumer. The rest of the input documents have only 2-6% of the characters as the database file, hence their memory consumption is much less.
The difference between SAX and DOM here is that the DOM allocates anywhere from 1.5x to 2.5x more bytes than the corresponding SAX representation, except in the database file, where the DOM parser allocates far more bytes (15x) because of the number of tags. Again, SAX parsers still need to allocate byte and character strings to store the elements and attributes for parsing, and so they allocate almost as many bytes as DOM parsers.
Figure 6: Memory Usage for SAX and DOM
The true differentiation between SAX and DOM in this area lies in the number of objects allocated by each technology. My experiments indicate that the DOM parser allocates 30x more objects than the SAX parser on the database.xml file (Figure 7). For the other three files, DOM allocates about 3x more objects than SAX. Analysis of the object sizes has shown that most of the objects are small (about 24 bytes on average).
Figure 7: Object Allocation for SAX and DOM
Another point to observe is the type of objects allocated, as shown in Figure 8. Most of the objects are of type char and string. Using SAX, these two classes account for about 55%-60% of the objects, while in the case of DOM, the char and string objects form about 80-99% of the objects. These are all created by the toString method (Table 2). The method returns a string representation of the object that is passed to it. This method is called by the scanAttribute and scanContent methods, which are in turn called by scanStartElement, which is the key method used to parse elements in an XML document.
Figure 8: Types of Objects Allocated
Table 2: Methods Allocating SAX and DOM Objects
The current trend of XML is in data transmission and structured storage. Web Services will be one significant area where XML will be widely used for transmitting messages across distributed systems.
SAX technology consumes far less memory than DOM, but it also lacks many features. For instance, it will not allow any modification to the XML document, and it allows for only serial access. DOM, on the other hand, consumes far more memory, but it provides a richer set of functions, allows for random access to the document, and allows complex queries on the document.
Depending on the usage scenario, either SAX or DOM may be the appropriate choice for parsing functionality. There are several open-source parsers, native as well as Java, which greatly simplify the job of the application developer.
As shown by the results of experimentation with SAX and DOM parsers, the uniqueness of the tags plays an important role in the time it takes to parse a document, whether with SAX or DOM. The number of tags overall determines the memory usage of the parser, with DOM taking anywhere from 2x to 15x more memory than the corresponding SAX parser.
Most of the objects are string and char objects, which are rather small (about 24 bytes) on average. In the case of SAX technology, these objects comprise about 55-60% of the total objects allocated, and in the case of DOM, they make up anywhere from 80 to 99% of the total objects. Thus, enhancing the speed of allocating new string objects will improve the performance of XML parsing.
This paper has considered the analysis of XML parsing alone. The importance of efficient parsing depends on usage. Certain Web services could involve documents being parsed once and processed several times. Current trends appear to indicate that XML parsing accounts for about 30% of application processing time, while other XML processing technologies like XPath, XQuery, and XSLT take about 40%, and serialization of XML documents adds another 15-20%.
Where traffic consists of many documents that require little or no further processing, parsing efficiency is critical. The parse-once, process-many model, by contrast, relies more heavily on memory size. From a CPU perspective, making parsing efficient may be possible with parallel scanning of the input. SIMD approaches could help.
Another article in this series, "Java Applications Require Multi-Level Analysis," examines the necessity for Java performance optimization and profiling tools to operate at the system level, the application level, and the micro-architectural level.
Intel, the world's largest chipmaker, also provides an array of value-added products and information to software developers:
- The Intel® Software Partner Program provides software vendors with Intel's latest technologies, helping member companies to improve product lines and grow market share: http://www.intel.com/cd/software/partner/asmo-na/eng/index.htm
- Intel® Developer Services offers free articles and training to help software developers maximize code performance and minimize time and effort: /en-us/
- For information about Intel software development products, including Compilers, Performance Analyzers, Performance Libraries and Threading Tools, visit the Intel Software Development Products home page: /en-us/intel-sdp-home
- Intel® Solution Services is a worldwide consulting organization that helps solution providers, solution developers, and end customers develop cost-effective e-Business solutions to complex business problems: http://www.intel.com/references/
- IT@Intel, through a series of white papers, case studies, and other materials, describes the lessons it has learned in identifying, evaluating, and deploying new technologies: http://www.intel.com/it/
About the Author
Padma Apparao is a Senior Performance Architect working in the Managed Runtime Environments group within the Software Solutions Group at Intel. Padma has been with Intel seven years, working on performance analysis and optimizations on several workloads and industry-standard benchmarks like TPC-C, TPC-H, and SPECjbb2000. Her focus is currently on XML processing; she is involved in understanding the evolution of XML and how processor architecture can influence the future of XML performance. She obtained a PhD in Computer Science on Distributed Algorithms and Systems from the University of Florida, Gainesville in 1995.