Taking Applications to the Next Level with XML, Part 4: XSLT--Programming by Transform

Last Modified On: September 17, 2008 5:00 PM PDT

Introduction

In my previous article I presented DOM and SAX, two technologies for processing XML from your general programming language of choice. This will be the most natural way to proceed for most developers moving from the likes of C/C++*, Java*, Visual Basic*, Python* and Perl*: you only have to get accustomed to how the processor views XML as a virtual hierarchy, or tree, of data elements. However, you will still encounter some contortions in taking accustomed tools into the unaccustomed territory of XML processing. Many aspects of such processing are more easily tackled using tools sharpened specifically for the task. Therefore we shall introduce a new tool into the XML developer's pouch: Extensible Stylesheet Language Transformations (XSLT).

XML's ancestor SGML required a similar processing style, and so a specialized styling and transformation language, DSSSL*, was crafted with an eye heavily on LISP*, which is well known for being amenable to processing trees. XSLT shows its DSSSL pedigree, though it is less of a purist's construction; the general purpose facilities built into it have produced a surprisingly capable, if verbose, language.

XSLT is a language for describing tree-to-tree transforms. This is the most important thing to learn about the language. It is not a general text processing language. Many try to use it as such and end up either disappointed or wasting development and execution time bending it into unaccustomed shapes. An XSLT "program," properly a transform, is itself an XML document. It operates on a primary XML source file, although it can also gather source from other XML files and, using special extensions, can treat intermediate results as source trees as well. The result is an output tree, which can be written out as XML, HTML, or text.

XSLT provides a set of instructions that shape the output of the transform, but for accessing the source tree, a different language is actually used: XPath*. XPath is a simple language, not in XML format, that allows you to access nodes from XML trees using simple instructions for navigating the branches from a given starting point, known as the context. Both XSLT and XPath also provide well-conceived mechanisms for extending the language to provide facilities that are not built in. 
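
To make this division of labor concrete, here is a small sketch in which XSLT instructions shape the output while the XPath expressions in the "select" attributes pick nodes out of the source tree. It assumes a source document whose document element is "memo", with "title", "date" (carrying a "form" attribute), "to" and "body" children, as in the memo format from part 2 of this series; the choice of plain text output is only for illustration.

```xml
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:output method="text"/>

  <!-- "/" matches the root node, which becomes the context node
       for the relative XPath expressions inside this template. -->
  <xsl:template match="/">
    <!-- Absolute path: start at the root, step to memo, then to title -->
    <xsl:value-of select="/memo/title"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Relative path from the context node; "@" steps to an attribute -->
    <xsl:value-of select="memo/date/@form"/>
    <xsl:text>&#10;</xsl:text>
    <!-- XPath also has functions: count the element children of memo -->
    <xsl:value-of select="count(memo/*)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:transform>
```

Run against an old-format memo like the one in Listing 1, this should print the title text, the value of the "form" attribute, and the number of child elements of "memo", each on its own line.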
 

The Tag Shuffle

As a first example, we shall show an XML-to-XML transform. In part 2 of this series we showed how XML accommodates a change in schema: we added a confidentiality element to a memo format. As a reminder, Listing 1 is the old form of the memos.

<?xml version='1.0' encoding='ISO-8859-1'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the
age's works.
</body>
</memo>

Listing 1: An example of an old format memo

Listing 2 is an example of the new form, using the optional new confidentiality element.

<memo> 
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<confidentiality>high</confidentiality>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the
age's works.
</body>
</memo>

Listing 2: An example of a new format memo

Suppose we decide not to take advantage of extensibility and make the confidentiality element a required one. Then we're back to the old problem of updating legacy data. We shall transform all old documents to add a confidentiality of "high" (paranoid by default). Listing 3 is a transform that does so.

<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <!-- Rebuild each "to" element and add the new element right after it -->
  <xsl:template match="to">
    <to><xsl:value-of select="."/></to>
    <confidentiality>high</confidentiality>
  </xsl:template>
  <!-- Copy everything else through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>

Listing 3: A transform from the old format to the new format memo

The top-level element, xsl:transform (xsl:stylesheet is a synonym), declares the standard XSLT namespace. A namespace is a way of disambiguating names. Using namespaces, you can use multiple elements or attributes of the same name without confusing a processor. A namespace is a group of XML names that is associated with a uniform resource identifier (URI). In an XML document, this association is made by declaring a prefix as an abbreviation for the URI in an attribute of the form:

xmlns:<prefix>="<namespace-URI>"

Any contained elements or attributes may then use the declared prefix to establish their namespace. This is why all the special XSLT instructions start with the "xsl" prefix. Note that the prefix is not important in namespace processing; it is the URI that is important. We could have declared any other prefix we chose, but the URI in question is mandatory for XSLT transforms.
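
To see that only the URI matters, here is a sketch of the transform from Listing 3 rewritten with a different prefix. The prefix "t" is an arbitrary choice made for this illustration; any conforming processor should treat the two versions identically, because both bind their prefix to the same namespace URI.

```xml
<!-- Same transform as Listing 3, but with the arbitrary prefix "t"
     bound to the mandatory XSLT namespace URI. -->
<t:transform
    xmlns:t="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <t:template match="to">
    <to><t:value-of select="."/></to>
    <confidentiality>high</confidentiality>
  </t:template>
  <t:template match="@*|node()">
    <t:copy>
      <t:apply-templates select="@*|node()"/>
    </t:copy>
  </t:template>
</t:transform>
```

Sticking with "xsl" is still the better habit, since that is what other readers of your transforms will expect.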

The version is also given in the top-level element, as "1.0". Specifying the version is mandatory.

Selecting and Matching

Then right away we get to the heart of XSLT. Templates are XSLT instructions that define chunks of processing steps. The most common way to trigger a template is to specify a match, which triggers the template when a particular pattern is encountered in the source document. Our first template matches any element named "to".

XSLT processing starts with the context set to the root node of the source document (similar to the document node from the DOM point of view). The processor then looks for a template that matches the root node. If it doesn't find one, it applies a built-in default, which is merely to iterate over all the children of the root node and apply any matching template to each in turn. That is what happens here: neither of our templates matches the root node, because as a pattern "node()" matches only nodes that are children of some other node, so the default rule hands processing on to the root's children.

The pattern "@*|node()" is designed to gobble up everything else: elements, attributes, text nodes, comments, and processing instructions. The document element "memo" is the first node handed over by the default rule, and our catch-all template intercepts it. Once this template is activated, it uses the xsl:copy instruction to copy the basic structure of the node, which for an element means the element itself without its attributes or children. Within this, we call xsl:apply-templates, which activates the search for templates that match a set of nodes we specify in the "select" attribute. This expression again selects all attributes and child nodes of the current one, so each is processed in turn. The first child element, "title", is processed the same way; its only child is a text node, which is copied to the output tree, and since text nodes have no children, the processing winds its way back up the stack of pending templates.
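
The default behavior the processor falls back on when no template matches can itself be written out as ordinary templates. The sketch below paraphrases the built-in rules described in the XSLT 1.0 recommendation; you never need to include them in a transform, and they are shown here as a standalone stylesheet only to make the defaults visible.

```xml
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <!-- Root node and elements: just process the children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- Text and attribute nodes: copy the string value to the output -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>
  <!-- Processing instructions and comments: produce nothing -->
  <xsl:template match="processing-instruction()|comment()"/>
</xsl:transform>
```

Note that xsl:apply-templates without a "select" attribute selects child nodes only, so attributes are never reached by default. A transform made of nothing but these rules therefore emits the text content of the source document and none of its markup.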

If this were the only template in the transform, processing would continue in this fashion, copying the input tree verbatim to the output tree. Such a transform is known as the identity transform, and it is the basis of many XML-to-XML transformation tasks.
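
As one illustration of building on the identity transform, the sketch below keeps the catch-all template and adds a single empty template. It assumes we want to go the other way, stripping the confidentiality element from new-format memos; that reverse task is our own example and not part of the migration this article describes.

```xml
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <!-- An empty template: matching nodes produce no output at all -->
  <xsl:template match="confidentiality"/>
  <!-- Identity: copy everything else through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

The pattern is typical: state the exceptions as small, specific templates and let the identity template handle the rest.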

However, we do break the monotony somewhat. When we get to the "to" element, we have two templates that can match. For this situation XSLT defines conflict resolution rules, which tend to favor more specific patterns over less specific ones. Therefore, in this case, the top template wins out. This one is simpler. It starts with a literal result element ("to"), which is an element that is not in the XSLT namespace and is copied to the output. Within this, there is an instruction, xsl:value-of, that computes a string from the XPath expression in the given "select" attribute and copies the result to the output. The "." is an XPath abbreviation for the context node, which is the "to" element from the source document. This node is converted to a string by concatenating all its descendant text nodes.
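
The conflict resolution can be made visible with the optional "priority" attribute on xsl:template. In the sketch below the priorities are spelled out using the values that XSLT 1.0 assigns by default to patterns of these shapes, so it should behave the same as Listing 3; the explicit attributes are there for illustration only.

```xml
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <!-- A pattern that names a specific element defaults to priority 0 -->
  <xsl:template match="to" priority="0">
    <to><xsl:value-of select="."/></to>
    <confidentiality>high</confidentiality>
  </xsl:template>
  <!-- Wildcard patterns such as @* and node() default to priority -0.5,
       so they lose to the named pattern above -->
  <xsl:template match="@*|node()" priority="-0.5">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

Setting a priority explicitly is mostly useful when two patterns of the same shape can match the same node and the defaults cannot separate them.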

And now here is the whole point of the transform. Right after we've cloned the "to" element to the output, we put out a literal "confidentiality" element, and "high" as literal result text.
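
One design choice is worth noting: xsl:value-of keeps only the text of the "to" element. If old memos might carry attributes or child markup on "to" (an assumption; the memos shown in this article do not), a variant of the first template can copy the element in full before adding the new one. This is a sketch of that alternative, not a change to Listing 3.

```xml
<xsl:transform
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:template match="to">
    <!-- Copy the element itself, then its attributes and children -->
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
    <!-- Then add the new element right after it -->
    <confidentiality>high</confidentiality>
  </xsl:template>
  <!-- Identity: copy everything else through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

For memos whose "to" element contains only text, both versions should produce the same result.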

The output of this transform is similar to Listing 2, differing only in its white space.

To give this a try, grab almost any XSLT processor. I used 4Suite's:

$ 4xslt listing1.xml listing3.xslt 
<?xml version='1.0' encoding='UTF-8'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form='ISO-8601'>1936-04-03</date>
<to>The Art World</to><confidentiality>high</confidentiality>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the
age's works.
</body>
</memo>


Most XSLT processors can be invoked the same way on the command line.

Conclusion

The most common use of XSLT currently is to generate HTML for Web display, given XML source files. I went through an XML-to-XML transform in part because fewer articles cover this important use of XSLT. As you learn more about the language, you will grow more comfortable identifying the tasks for which XSLT is best suited, and splitting up your XML processing between XSLT and more mainstream languages. Most XSLT processors have simple APIs for invocation from other programming languages, so this should be straightforward.

So far we've set up the basics of XML processing. Next we shall look at technology that underscores XML's particular contribution to information technology: we shall see how XML has the potential for realizing the long-standing promises of so-called "knowledge technology."

About the Author

Uche Ogbuji is a Computer Engineer, co-founder and CEO of Fourthought, Inc., a software vendor and consultancy specializing in open, standards-based XML solutions, especially as applicable to problems of knowledge management. He has worked with XML for several years, co-developing 4Suite*, an open-source platform for XML processing. He writes many articles and speaks at many conferences on the practical use of XML. A Nigerian immigrant, Mr. Ogbuji currently resides and works in lovely Boulder, Colorado.


Read other articles in this series: Part 1, Part 2, Part 3, Part 5