Last modified: September 17, 2008 5:40 PM PDT
So far in this series I have emphasized that one of the great strengths of XML is how richly it adds to the developer's toolbox. Now we take a closer look at some of these tools. One of the most important areas is Application Programming Interfaces (APIs) that bind XML technologies to other programming and run-time environments. In this article, we shall look closely at the two preeminent APIs for XML: Simple API for XML (SAX) and Document Object Model (DOM). They represent two strongly contrasting processing models for XML, and as such have complementary sets of advantages and disadvantages. A basic understanding of XML is required, as well as familiarity with an object-oriented programming language.
SAX is a very interesting species. It was essentially created on a marathon thread on the XML-DEV mailing list, which has long been the prime habitat for XML experts. David Megginson led the discussion and the result was one of the most successful XML initiatives, with no large company or standards-body sponsorship.
SAX is an event-driven API. That is, the developer registers handler code for specific events triggered by different parts of XML mark-up (e.g. start and end tags, text, entities). The parser then sends a stream of these events based on the input XML, which the handler code processes in turn. Figure 1 is an illustration of this process.
The XML parser engine is tied to the SAX system through an appropriate driver, which causes SAX events to be fired in a stream as the parsing progresses. The developer typically registers handlers to capture these events and take appropriate action. This style of processing may be familiar to you if you have done user interface programming with popular systems such as Microsoft* Foundation Classes or many X Window System* APIs. If you are not familiar with this style, event-based processing might represent a somewhat novel way of thinking. Let us proceed with an example.
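Before turning to the listing, it may help to see the raw event stream itself. The following is only a minimal sketch, assuming Python's standard xml.sax module; the class name EventLogger, the helper log_events, and the sample memo document are hypothetical names invented for this illustration and are not part of the original listings.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records each SAX event as it is fired."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append("startElement " + name)

    def characters(self, content):
        # Skip whitespace-only chunks to keep the log short.
        if content.strip():
            self.events.append("characters " + repr(content))

    def endElement(self, name):
        self.events.append("endElement " + name)

    def endDocument(self):
        self.events.append("endDocument")


def log_events(xml_text):
    """Parse xml_text and return the events in the order they arrived."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    for event in log_events("<memo><to>Ann</to></memo>"):
        print(event)
```

The handler never asks the parser for anything. It simply reacts as each start tag, run of text, and end tag goes by, which is the essence of the event-driven model.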
Listing 1 is a small SAX program that draws a crude graph of the tree structure of an XML document. It is written in Python*, so users of almost any OO language should be able to follow along.
# This is a special form of string literal that can span multiple lines
In object-oriented languages, event-based interfaces are usually implemented by registering objects whose classes have special methods for each event. We define such a class, TreePrintHandler. If you compare with the event names in Figure 1, you will see that the methods in this class are similar to the event names. This is no coincidence: the SAX engine works by calling these particular methods whenever the corresponding event is dispatched. We derive TreePrintHandler from a built-in class, sax.ContentHandler, which provides a default in case we don't define a method to handle a particular event. Note that we don't implement every SAX event. We don't need to do anything special for the "end document" event, for example.
Very often in SAX programming, you need to keep track of where you are in the flow, or perhaps some values that will need to be used at some point. Managing such details is known as managing state. The information comes strictly in the order that the XML parser follows, which is not always the natural order for processing, so the SAX developer must be familiar with state management. Luckily, since we are using a class for event handling, this is not so difficult. You define instance variables that keep track of state. In the example program, the state to be managed is the depth of indentation reached.
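The depth-tracking pattern described above can be sketched independently of Listing 1. This is an illustration under stated assumptions rather than the article's own program: the class DepthTrackingHandler, the function outline, and the two-space indent are all hypothetical choices, and only the standard xml.sax module is used.

```python
import xml.sax


class DepthTrackingHandler(xml.sax.ContentHandler):
    """Hypothetical handler whose only state is the current nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # state: how many elements are currently open
        self.lines = []   # one indented line per element, in document order

    def startElement(self, name, attrs):
        self.lines.append("  " * self.depth + name)
        self.depth += 1   # children of this element are one level deeper

    def endElement(self, name):
        self.depth -= 1   # leaving the element restores the parent's depth

    # Other events (characters, endDocument, ...) fall through to the
    # do-nothing defaults inherited from ContentHandler.


def outline(xml_text):
    """Return an indented outline of the element structure of xml_text."""
    handler = DepthTrackingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines


if __name__ == "__main__":
    print("\n".join(outline("<top><outer><inner>center</inner></outer><side/></top>")))
```

Because the parser reports each end tag as well as each start tag, a single counter is enough to know the depth at any moment. More elaborate processing usually needs more instance variables, such as a stack of open element names.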
The following is an example of running this program using Python, assuming you've copied Listing 1 to a file named listing1.py:
> python listing1.py
The SAX example shows how an object-oriented language is used to deal with event-based processing. However, the W3C also developed an object model for XML documents that can be used more directly. The Document Object Model (DOM) is the result of this effort. DOM is well known to Web developers as the way to manipulate forms and the like in an HTML Web browser. The XML aspect of DOM is much the same as the HTML aspect. The basic idea is that a document is decomposed into a tree model, where each component of the XML syntax is represented by a node. For instance, the following XML document can be represented by the tree illustrated in Figure 2.
<?xml version = "1.0"?>
Figure 2
Each part of the document is a separate node in the tree. Notice the root node, labeled ("/"). This is a node that in effect represents the document as a whole. It has one element child, known as the document element (ADDR in our case). Elements can have other elements as children, as well as text nodes. Attributes are not quite considered children of their elements, so I represent them using a dashed line to what is known as their owner element. Attribute names are marked with an @ sign.
Imagine an API that allows you to navigate this tree, moving from parent to child node, to siblings, and other steps, taking advantage of special properties of certain types of nodes (for instance, elements can have attributes and text nodes have the text data). If you can imagine this, then you already have a basic handle on the DOM.
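To make that navigation concrete, here is a minimal sketch using xml.dom.minidom from Python's standard library. The SAMPLE document is a hypothetical stand-in: only the ADDR document element comes from the article, while the NAME and CITY elements, the type attribute, and the describe function are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical document; only the ADDR element name is taken from the article.
SAMPLE = '<ADDR><NAME type="person">Ann Example</NAME><CITY>Springfield</CITY></ADDR>'


def describe(xml_text):
    """Walk a few steps around the tree and report what each step finds."""
    doc = minidom.parseString(xml_text)   # the node for the document as a whole
    root = doc.documentElement            # its single element child
    first = root.firstChild               # parent -> first child
    second = first.nextSibling            # child -> next sibling
    return {
        "document_element": root.tagName,
        "first_child": first.tagName,
        "first_child_parent": first.parentNode.tagName,  # child -> parent
        "next_sibling": second.tagName,
        "attribute": first.getAttribute("type"),  # elements carry attributes
        "text": first.firstChild.data,            # text nodes carry the text data
    }


if __name__ == "__main__":
    for step, value in describe(SAMPLE).items():
        print(step, "=", value)
```

The sample is written without whitespace between tags on purpose. With indentation, the whitespace would appear as additional text nodes among the children, and firstChild would return one of those instead of an element.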
The DOM, like SAX, is designed to be language-neutral. In the case of the DOM, the generic interface definition language (IDL) standardized by the Object Management Group (OMG) is used to express the tree node interfaces. For instance, the interface for retrieving a particular attribute from an element is defined as follows:
DOMString getAttribute(in DOMString name);
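This uses the special type DOMString, which the DOM specification defines as a sequence of 16-bit units holding text encoded as UTF-16. The IDL operation is typically translated into a method of the same name in the implementation language, with DOMString mapped to that language's native string type.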
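As a rough illustration of that translation, Python's xml.dom.minidom exposes the operation as a getAttribute method on element objects that takes and returns ordinary Python strings. The sample element and the function name read_attributes below are hypothetical.

```python
from xml.dom import minidom


def read_attributes():
    """Call the Python rendering of the IDL getAttribute operation."""
    doc = minidom.parseString('<ADDR id="a1"/>')   # hypothetical sample element
    element = doc.documentElement
    present = element.getAttribute("id")        # plain Python str in, str out
    absent = element.getAttribute("missing")    # no such attribute
    return present, absent


if __name__ == "__main__":
    print(read_attributes())  # ('a1', '')
```

Note that a missing attribute yields an empty string rather than an error, so code that must distinguish "absent" from "empty" can check with hasAttribute first.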
Again, an example should help illustrate. Listing 2 is a program that navigates and mutates an XML document using DOM.
xml_source = "<top><outer><inner>center</inner></outer></top>"
Notice that some very important functions, such as converting XML source into a tree ready for DOM processing and converting such a tree back to text, are not yet covered in the DOM standard. DOM is evolving through several "levels," each of which builds on the prior one. Level 1 covered the basics; Level 2 added namespace support, iterators, and so forth. Namespaces are a way to avoid clashes in XML element and attribute names, and I'll discuss them in more detail in the next article in this series. Iterators are mechanisms for walking over the DOM tree, invoking some action for each node in turn. DOM Level 3 adds loading and saving trees, and other tools, but is still in development as of my writing. Be sure that the DOM library you choose supports the XML features you need.
As you can see from the example, DOM trees can be manipulated and not just read. One must do this manipulation in small steps, though, by dissociating children from their parents and attaching them to other parents. Using the DOM, you can also store references to any nodes in the tree and quickly jump to exactly the part you wish to process. If you copy Listing 2 to a file named listing2.py, you can try it out as follows:
> python listing2.py
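The detach-and-reattach style of mutation can be sketched in a few lines. This is not a reconstruction of Listing 2. It reuses only the xml_source string from that listing, the function name promote_inner is hypothetical, and parseString and toxml are conveniences of Python's xml.dom.minidom rather than part of DOM Level 1 or 2.

```python
from xml.dom import minidom

xml_source = "<top><outer><inner>center</inner></outer></top>"


def promote_inner(source):
    """Move the <inner> element from under <outer> to directly under <top>."""
    doc = minidom.parseString(source)   # library-specific: text -> DOM tree
    top = doc.documentElement
    outer = top.firstChild
    inner = outer.firstChild            # keep a direct reference to this node

    outer.removeChild(inner)            # step 1: dissociate child from parent
    top.appendChild(inner)              # step 2: attach it to a new parent

    return top.toxml()                  # library-specific: DOM tree -> text


if __name__ == "__main__":
    print(promote_inner(xml_source))
    # <top><outer/><inner>center</inner></top>
```

The text node "center" travels along with its element, since moving a node moves the whole subtree beneath it.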
DOM and SAX are quite different approaches to XML processing, and both are valuable tools to have around. In deciding when to use one or the other, there are several factors to consider.
It is not unusual to use both in the same task, for instance by skimming over a large document using SAX and then building a small DOM tree for only the portion of the document one wishes to process. Whether you need speed of development or speed of execution, and whatever processing patterns you employ for each particular XML task, one or the other is probably the key to getting things done among the tags.
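One way to combine the two approaches in Python is the standard xml.dom.pulldom module, which reads a document as a stream of parse events and builds a DOM subtree only when asked. The sketch below assumes that module. The catalog document, the item and title element names, and the function extract_item are hypothetical and chosen only for illustration.

```python
from xml.dom import pulldom

# Hypothetical document; imagine it being far too large to load as one tree.
CATALOG = (
    "<catalog>"
    '<item id="1"><title>First</title></item>'
    '<item id="2"><title>Second</title></item>'
    '<item id="3"><title>Third</title></item>'
    "</catalog>"
)


def extract_item(xml_text, wanted_id):
    """Skim the event stream and build a DOM subtree only for one item."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if (event == pulldom.START_ELEMENT
                and node.tagName == "item"
                and node.getAttribute("id") == wanted_id):
            events.expandNode(node)   # pull this element's children into a tree
            return node
    return None                       # the stream ended without a match


if __name__ == "__main__":
    item = extract_item(CATALOG, "2")
    print(item.toxml())  # <item id="2"><title>Second</title></item>
```

Items that do not match are never expanded into subtrees, so their contents pass by as events without being built into a tree, while the one matching item is available for ordinary DOM navigation.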
The XML-DEV mailing list home page:
http://www.xml.org/xml/xmldev.shtml
Simple API for XML (SAX) project home page:
http://www.saxproject.org/
The W3C's home page for DOM:
http://www.w3.org/DOM/
Read other articles in this series: Part 1, Part 2, Part 4, Part 5
