Taking applications to the next level with XML: Part 2

Last Modified On :   October 20, 2008 1:13 PM PDT

Introduction
Programmers, like all craftsmen, rely heavily on the contents of their toolboxes. Their expertise is honed as much by their ability to use their tools as it is by their understanding of the problem they are fixing. The modern programmer's toolbox is filled with structured programming techniques, fundamental data structures and algorithms, operating system facilities, revision control systems, text editors, and the various APIs for their programming language, to make a very brief selection. In the first article of this series I argued that much software development today involves the manipulation of semi-structured data, which is usually text with a varying degree of structure. The modern programmer's toolbox was surprisingly poor in instruments for generating and reading such data until the advent of XML.

We now turn our attention to the relevant instruments that XML adds to the toolbox, and in particular to some of the basic characteristics of XML as a format for input, output, and intermediate management of semi-structured data in situations typically dealt with by programmers. In this installment, we shall look at the basics, and at output techniques. In the next installment we shall look at input and parsing techniques.

You should be familiar with basic XML syntax. This article presents examples using the Python* programming language, chosen because of its popularity and extraordinary readability (executable pseudo-code, as it has been called). Programmers who have used any object-oriented programming language should be able to understand the examples.

Modeling a Memo
Say you have to model a memo in your program. Perhaps you are writing a specialized memo editor for your company, or maybe even a voice-to-text program that can take a memo as a virtual secretary. You will need to deal with fixed fields such as the date, sender, confidentiality, and recipient of the memo, as well as the free text passages that make up the core message. Even these free text portions might need sections marked in some way, perhaps just for emphasis, or maybe even to keep track of mentions of products for future searching and analysis. This is the essential sort of problem XML handles very well. But let us say XML is not yet in our toolbox.

Defining fields in an output format might be a matter of deciding that every field starts on a new physical line in the output format, and that the free text portion is all placed on one line. This was a common approach in simple record-based systems such as were pervasive in the mainframe days. Or perhaps the one field per line only applies for a few lines that give the fixed fields, after which the text can run freely. This is pretty much how Internet email is formatted according to the RFC-822 and MIME standards.

Let us say we have an object, memo, which encapsulates all the pertinent data for the memo.
class Memo:
    """Plain container for the memo's fields."""

memo = Memo()
memo.title = "With Usura Hath no Man a House of Good Stone"
memo.date = "3 April 1936"
memo.to = "The Art World"
line1 = "It has come to our attention that the basis for art production\n"
line2 = "Has shifted from keen patronage to vulgar commercial measure.\n"
line3 = "Management is concerned this will erode the lasting value of the age's works."
memo.body = line1 + line2 + line3
# Open the file for writing; the with block closes it when we are done
with open("mymemo.dat", "w") as f:
    # Each fixed field goes on its own line
    f.write(memo.title + "\n")
    f.write(memo.date + "\n")
    f.write(memo.to + "\n")
    # Remove all the newlines so that the body fits on a single line
    bits = memo.body.split("\n")
    body = " ".join(bits)
    # Write out the body
    f.write(body + "\n")

This doesn't seem so bad, as it goes. The main obvious pain is in dealing with the rule that the body cannot contain new line characters. And using the second approach (the one used in Internet mail), you needn't worry about this at all. However, there are many less obvious pitfalls. To name just a few:
  • Character sets and encodings. Files are stored as streams of bytes, and these might not be interpreted into actual characters and thus words in the same way across software or platforms.
  • Differences in how platforms represent new line characters. For instance, Windows uses a carriage return followed by a line feed, UNIX uses a line feed alone, and the classic Mac OS used a carriage return alone.
  • Error handling: files can get corrupted, and there is very little in this format that lets software check whether a file still follows the rules it expects.
But the greatest rub is what happens if the code above gets lost or becomes obsolete. The details of the file format are pretty much buried in the code. While one could figure it out in the simplified case shown above, real life is rarely so accommodating. The quest to rescue precious data from obscure formats, generated using forgotten code, is one of the most common problems of information technology. It's a good part of the reason why COBOL programmers are still very much in demand. It is also why there is always brisk trade in data conversion and parsing packages. The many lawyers who relied on WordPerfect* only to watch it become irrelevant in the face of the Microsoft Word* juggernaut would be in far less trouble if generic data formats were more widespread.
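The first two pitfalls are easy to reproduce. The following is a small sketch in Python 3, using sample strings chosen purely for illustration, that shows how the same text turns into different bytes under two encodings and how the same two-line body looks under each platform's new line convention.

```python
# The same text becomes different bytes under different encodings.
text = "caf\u00e9"  # "cafe" with an acute accent on the e
latin1_bytes = text.encode("iso-8859-1")  # one byte for the accented letter
utf8_bytes = text.encode("utf-8")         # two bytes for the accented letter

# Reading UTF-8 bytes as if they were ISO-8859-1 garbles the text.
misread = utf8_bytes.decode("iso-8859-1")

# The same two-line body as each platform convention would store it.
variants = {
    "Windows (CR LF)": "line one\r\nline two",
    "UNIX (LF)": "line one\nline two",
    "classic Mac OS (CR)": "line one\rline two",
}
# str.splitlines() recognizes all three conventions.
parsed = {name: raw.splitlines() for name, raw in variants.items()}

print(latin1_bytes, utf8_bytes, repr(misread))
for name, lines in parsed.items():
    print(name, lines)
```

Code that splits only on one convention, or that assumes one encoding, would silently mishandle files written elsewhere.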

XML Metadata Interchange
The structured data world tackled this problem a while ago, often in clumsy forms such as SQL dumps. The Object Database Management Group (ODMG) standard built in an object interchange format from the start. And software developers using the Unified Modeling Language* (UML) are beginning in earnest to enjoy the freedom that comes with interchanging object models in the XML Metadata Interchange (XMI) format.

XMI is an XML format, but XML's biggest splash has been in bringing such data format transparency to semi-structured data. One of its key strengths is that it provides not only the power to effectively store such data, but also the power to describe the data. There are many technologies for defining XML document formats, known as XML schema languages. We shall look at the schema language that is built into XML 1.0 itself: Document Type Definition (DTD). DTDs have become a bit dated, and other entrants may be better choices, but the DTD has the most widespread support, and is based on very mature technology from XML's SGML ancestor.

We might draft the XML document in a form such as the following:
<?xml version='1.0' encoding='ISO-8859-1'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<body>

It has come to our attention that the basis for art production Has shifted from keen patronage to vulgar commercial measure. Management is concerned this will erode the lasting value of the age's works.
</body>
</memo>

Right off the bat, in the XML declaration, the file communicates the file format and the character encoding used. It uses ISO-8859-1, also called Latin-1, a standard encoding used for many European languages, including English. The tags also clearly delimit the various fields, without any trickery or special rules. As a bonus, we sharpened the definition of the date field. Not only did we change it to a standard format, but we also added a metadata attribute defining the format used for the date: ISO-8601.
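A document like this draft can be generated from code rather than typed by hand. What follows is a minimal sketch, assuming Python 3 and its standard xml.etree.ElementTree module; the function name memo_to_xml and the sample body text are illustrative choices, not part of the original memo code.

```python
import xml.etree.ElementTree as ET

def memo_to_xml(title, date_iso, to, body):
    """Serialize memo fields to ISO-8859-1 encoded XML bytes."""
    root = ET.Element("memo")
    ET.SubElement(root, "title").text = title
    # The form attribute records which date format the text uses.
    date = ET.SubElement(root, "date", form="ISO-8601")
    date.text = date_iso
    ET.SubElement(root, "to").text = to
    ET.SubElement(root, "body").text = body
    # For a non-UTF-8 encoding, tostring() prepends an XML declaration.
    return ET.tostring(root, encoding="ISO-8859-1")

xml_bytes = memo_to_xml(
    "With Usura Hath no Man a House of Good Stone",
    "1936-04-03",
    "The Art World",
    "Patronage & commerce: is value < price?",  # hypothetical body text
)
print(xml_bytes.decode("ISO-8859-1"))
```

Note that the library escapes the ampersand and the less-than sign in the body on its own, which is one less special rule for the application code to carry.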

In order to communicate this format, you can write a DTD (call it memo.dtd):
<?xml version='1.0' encoding='ISO-8859-1'?>

<!ELEMENT memo (title, date, to, body)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ATTLIST date form CDATA #REQUIRED>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (#PCDATA)>

This defines an element memo as a sequence of other elements: title, date (with a required form attribute), to, and body. All these other elements are defined to have only text as content.
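To make these rules concrete, here is a hand-written sketch of the checks a validating parser would derive from memo.dtd. It assumes Python 3 and the standard xml.etree.ElementTree module, which parses XML but does not itself validate against a DTD; the function name check_memo and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# The child elements of memo, in the order the DTD requires.
EXPECTED_CHILDREN = ["title", "date", "to", "body"]

def check_memo(xml_text):
    """Return a list of problems; an empty list means the memo conforms."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "memo":
        problems.append("root element is not memo")
    children = [child.tag for child in root]
    if children != EXPECTED_CHILDREN:
        problems.append("unexpected children: %s" % children)
    for child in root:
        # (#PCDATA) means text only, so no nested elements are allowed.
        if len(child) > 0:
            problems.append("%s must contain only text" % child.tag)
        # The form attribute is declared #REQUIRED on date.
        if child.tag == "date" and "form" not in child.attrib:
            problems.append("date lacks the required form attribute")
    return problems

good = ("<memo><title>T</title><date form='ISO-8601'>1936-04-03</date>"
        "<to>X</to><body>B</body></memo>")
bad = "<memo><title>T</title><date>1936-04-03</date><body>B</body></memo>"
print(check_memo(good))
print(check_memo(bad))
```

The point of the DTD is that nobody has to write such a function: the rules live in a declarative file that any validating parser can apply.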

Using this DTD, you can define the XML format for the memo. You can even embed a reference to this DTD in the document instance itself. With this reference added, the first four lines of the example memo are as follows:
<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE memo SYSTEM "memo.dtd">
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>


XML and Extensibility
One of the most important features of XML is its extensibility. Interestingly enough, this is also one of the features that make it less likely to suffer from the legacy problem of a data format whose usefulness is past. Therefore XML helps code developers and maintainers in two ways. First of all, it provides a format with some well-defined common rules (known as well-formedness) and a system for describing specific rules (DTD or other schema). This makes it easier to develop future code to process this data, even if the original developers are unavailable. Secondly, because it is extensible, it makes it less likely that one will need to replace any particular XML data format entirely, which increases its longevity.

XML achieves extensibility in many ways. One of the most fundamental is the way we are allowed to specify the content model of elements. If we wanted to add a new field to the memo specifying its confidentiality, we could change the content model for the memo element to the following:
<!ELEMENT memo (title, date, to, confidentiality?, body)>
<!ELEMENT confidentiality (#PCDATA)>

Here the question mark indicates that the new confidentiality element is optional. This means that the following variation is valid:
<?xml version='1.0' encoding='ISO-8859-1'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<confidentiality>high</confidentiality>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the
age's works.
</body>
</memo>

But importantly, it also means that documents meeting the former DTD are valid as well. We have thus extended the format without breaking existing documents. There are a few problems with this approach. For one thing, DTDs do not have the power to make the extended format contingent on a version marker in the document itself. Luckily, there are other schema technologies, such as Schematron*, that do. Even within the XML space, it pays to have a deep toolbox.
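A content model behaves much like a regular expression over the sequence of child elements, which is one way to see why the optional element keeps old documents valid. The sketch below assumes Python 3 and the standard re and xml.etree.ElementTree modules; the names OLD_MODEL, NEW_MODEL, and matches, and the shortened sample documents, are illustrative only.

```python
import re
import xml.etree.ElementTree as ET

# Content models written as regular expressions over child tag names,
# with each tag followed by a comma.
OLD_MODEL = re.compile(r"title,date,to,body,")
NEW_MODEL = re.compile(r"title,date,to,(confidentiality,)?body,")

def matches(model, xml_text):
    """Check the children of the root element against a content model."""
    root = ET.fromstring(xml_text)
    sequence = "".join(child.tag + "," for child in root)
    return model.fullmatch(sequence) is not None

old_doc = ("<memo><title>T</title><date form='ISO-8601'>1936-04-03</date>"
           "<to>X</to><body>B</body></memo>")
new_doc = ("<memo><title>T</title><date form='ISO-8601'>1936-04-03</date>"
           "<to>X</to><confidentiality>high</confidentiality>"
           "<body>B</body></memo>")

print(matches(NEW_MODEL, old_doc), matches(NEW_MODEL, new_doc))
print(matches(OLD_MODEL, new_doc))

# Code reading either kind of memo can supply a default for the new field.
level = ET.fromstring(old_doc).findtext("confidentiality", default="unspecified")
print(level)
```

Both documents satisfy the extended model, while only the older one satisfies the original model, so the extension is backward compatible in one direction only.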

Conclusion
When you need a general-purpose mechanism for storing and exchanging data, XML is an option with the advantage of simplicity, extensibility, and standard mechanisms for describing formats. Once you have settled on an XML form, and can generate instances, the next step is to be able to read and manipulate that data in your code. In the next installment in this series, we shall discuss such techniques.

About the Author
Uche Ogbuji is a Computer Engineer, co-founder and CEO of Fourthought, Inc., a software vendor and consultancy specializing in open, standards-based XML solutions, especially as applicable to problems of knowledge management. He has worked with XML for several years, co-developing 4Suite*, an open-source platform for XML processing. He writes many articles and speaks at many conferences on the practical use of XML. A Nigerian immigrant, Mr. Ogbuji currently resides and works in lovely Boulder, Colorado.


Read other articles in this series: Part 1, Part 3, Part 4, Part 5