Getting Started with XML Programming, Part II

May 5, 1999

This article examines the question of processing XML documents. It continues where Part I left off. This part discusses the Document Object Model and presents concrete examples in Perl and Java.

	"Abstraction, abstraction and abstraction." This is the answer to the question, "What are the three most important words in programming?"
	Paul Hudak

Part I of this tutorial examined several methods for processing XML documents in Perl: traditional text processing methods, regular expressions, and using an XML parser. This part introduces the Document Object Model and presents some concrete examples in Perl and Java.

A Simple Application Revisited

In this article, we'll continue to refine the simple text processing application described in part I. The task is to process a simple XML document that contains user preferences and other application configuration data. A simple example of this format is shown in Figure 1.

Figure 1. A Simple INI file in XML

<configuration-file>
  <section name="section1">
    <entry name="name1" value="value1"/>
    <entry name="name2" value="value2"/>
  </section>
  <section name="section2">
    <entry name="someothername" value="someothervalue"/>
  </section>
</configuration-file>

The specific challenge is to write two functions: GetProfileString() for reading a name/value pair from a specific section of the file, and SetProfileString() to set a value.

Letting a parser do the work

In Part I, we worked our way from simple text processing methods, which really don't work, to using an XML parser.

The problem with using the parser directly is that interfacing to the parser can be tedious. We have to set up callbacks (events) for the elements that we're interested in and build our own data structures on the fly as these events occur.

For some classes of applications, this is the ideal way to process XML documents. If, for example, you're writing a streaming application that is expected to handle XML documents with as little latency as possible or to handle documents too large to fit entirely in memory, processing each event as it occurs is absolutely necessary.

But our application isn't anticipating either of these cases. In fact, what the "more complete solution" presented at the end of the previous article does is load the entire document into memory and build its own tree-like representation of the document. Rather than responding to each event, it'd be easier if the entire tree were simply loaded into memory for us; then we could walk that tree to get and set configuration information.
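For contrast, the event-driven style described above can be sketched in Java. This is only an illustration under stated assumptions: Java's SAX API stands in for the Perl XML::Parser callbacks used in Part I, and the class name EventDrivenConfig and its load() method are hypothetical. Note how the code has to remember which section is currently open and build its own map as the events arrive.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: builds section -> (entry name -> value) from parser events.
class EventDrivenConfig {
    static Map<String, Map<String, String>> load(String xml) {
        final Map<String, Map<String, String>> sections = new LinkedHashMap<>();

        DefaultHandler handler = new DefaultHandler() {
            // Entries of the section that is currently open, or null outside a section.
            private Map<String, String> current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("section")) {
                    current = new LinkedHashMap<>();
                    sections.put(atts.getValue("name"), current);
                } else if (qName.equals("entry") && current != null) {
                    current.put(atts.getValue("name"), atts.getValue("value"));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("section")) {
                    current = null;
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return sections;
    }

    public static void main(String[] args) {
        String xml = "<configuration-file>"
                + "<section name=\"section1\"><entry name=\"name1\" value=\"value1\"/></section>"
                + "</configuration-file>";
        System.out.println(load(xml).get("section1").get("name1"));
    }
}
```

Everything beyond reading the two attributes is bookkeeping that a tree-based interface would do for us.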

Adding a layer of abstraction

Just as the XML parser adds a layer of abstraction over the actual textual representation of the XML document, we'd like a layer of abstraction that gives us access to the entire document. Instead of considering elements to be composed of start- and end-tags, we'd like to consider them as nodes in a tree, with parents and children and attributes represented in a natural way, rather than as events on an input stream.

In modern jargon, this is called an "object model". There's nothing deeply mysterious about an object model; it's just a bunch of data and a set of methods for manipulating that data.

In the case of an XML document, one obvious object model consists of the document tree and methods like getDocumentElement(), getAttribute(), appendChild(), etc., that allow you to query and change the document tree.

In principle, an unlimited number of object models for XML documents could be constructed. In practice, there's a lot of benefit to be gained from standardizing such a model.

The Document Object Model Working Group of the W3C is doing precisely this.
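As a rough illustration of what such an object model feels like in use, here is a minimal Java sketch built on the methods named above (getDocumentElement(), getAttribute(), appendChild()). It assumes the DOM parser bundled with the JDK rather than any parser discussed in this article, and the class name ObjectModelDemo and the helper addEntry() are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ObjectModelDemo {
    // Parse an XML string into a DOM tree (assumes the JDK's built-in parser).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Hypothetical helper: append <entry name="..." value="..."/> to a section.
    static Element addEntry(Element section, String name, String value) {
        Element entry = section.getOwnerDocument().createElement("entry");
        entry.setAttribute("name", name);
        entry.setAttribute("value", value);
        section.appendChild(entry);
        return entry;
    }

    public static void main(String[] args) {
        Document doc = parse("<configuration-file><section name=\"section1\">"
                + "<entry name=\"name1\" value=\"value1\"/></section></configuration-file>");

        Element root = doc.getDocumentElement();
        // The string above contains no whitespace between tags,
        // so the first child of the root is the section element itself.
        Element section = (Element) root.getFirstChild();
        System.out.println(root.getTagName() + " / " + section.getAttribute("name"));

        addEntry(section, "name2", "value2");
        System.out.println(section.getChildNodes().getLength() + " entries");
    }
}
```

The point is only that elements, attributes, and children are reached through method calls on tree nodes, with no event handling in sight.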

The Document Object Model

The Document Object Model (DOM) Level 1 Specification defines a platform- and language-neutral interface to the structure and style (CSS properties) of XML (and HTML) documents. This interface allows a process to dynamically access and update the structure and style of documents. From the specification:

The Document Object Model provides a standard set of objects for representing HTML and XML documents, a standard model of how these objects can be combined, and a standard interface for accessing and manipulating them.

The Level 1 specification defines a core set of objects and interfaces to XML and HTML documents. Future specifications will provide higher-level access to the DOM and other, additional functionality.

A Brief Survey of the DOM

A DOM representation of an XML document consists, logically, of a tree of Nodes. Note that there's no requirement that an implementation actually build a tree of nodes; implementations are free to use any representation they wish. But in order to conform to the DOM, they must expose the interfaces described here, and these interfaces offer a tree view of the document.

Nodes come in a dozen flavors:

Document. A document tree has exactly one Document node; it is the root of the tree. This is not the node that represents the root element of the document; it's one level above that. The Document node may contain one Element node, which is the root element of the document. It may also contain nodes representing the document type, processing instructions, and comments that occur outside the root element of the document.
DocumentFragment. A document fragment is a lightweight wrapper for holding portions of a document. Unlike Document nodes, DocumentFragment nodes can have multiple element children.
Element. Each element in a document is represented by an element node.
Attr. Each attribute on each element is represented by an attribute node.
ProcessingInstruction. Each processing instruction is represented by a processing instruction node.
Text. Text nodes represent sequences of characters that do not contain any markup. When a document is first parsed, all adjacent characters are placed into a single text node, but subsequent updates to the tree can create adjacent text nodes.
Comment. Each comment is represented by a comment node.
CDATASection. Each CDATA section is represented by a CDATA section node.
EntityReference. Entity references are represented by entity reference nodes.
Entity. Entities are represented by entity nodes. These are either parsed or unparsed entities, not entity declarations.
Notation. Notations are represented by notation nodes.
DocumentType. A Document node may contain a document type node which represents the document type (DTD). In the Level 1 DOM, this node offers very little access to the DTD. This was intentional; it's clear that the Schema Working Group is going to introduce significant changes to the schema definition language for XML.
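To see several of these node types side by side, the following Java sketch walks a parsed tree and records the kind and name of each node. It is an illustration only: it assumes the JDK's built-in DOM parser, the class name NodeKinds is hypothetical, and it labels only the node types that commonly appear as children in a tree (attributes are not children, so Attr nodes are not visited).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKinds {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Map the numeric node type to the interface name used in the survey above.
    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "Document";
            case Node.DOCUMENT_TYPE_NODE: return "DocumentType";
            case Node.ELEMENT_NODE: return "Element";
            case Node.TEXT_NODE: return "Text";
            case Node.CDATA_SECTION_NODE: return "CDATASection";
            case Node.COMMENT_NODE: return "Comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "ProcessingInstruction";
            case Node.ENTITY_REFERENCE_NODE: return "EntityReference";
            default: return "Other";
        }
    }

    // Depth-first walk; each line is indented by two spaces per level.
    static void walk(Node node, String indent, List<String> out) {
        out.add(indent + kind(node) + " " + node.getNodeName());
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, indent + "  ", out);
        }
    }

    static List<String> describe(Node node) {
        List<String> out = new ArrayList<>();
        walk(node, "", out);
        return out;
    }

    public static void main(String[] args) {
        Document doc = parse("<!-- settings --><configuration-file>"
                + "<section name=\"section1\">text</section></configuration-file>");
        for (String line : describe(doc)) {
            System.out.println(line);
        }
    }
}
```

The first line of the output is always the Document node, with the root element one level below it, which matches the distinction drawn in the Document entry above.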

Each of these node types exposes an interface appropriate to its content. For example, the Element node exposes this interface:

tagName, contains the name of the element.
getAttribute(), returns the value of an attribute on the element.
setAttribute(), assigns a value to an attribute on the element.
removeAttribute(), removes an attribute from the element.
getAttributeNode(), returns the Attr node for an attribute.
setAttributeNode(), assigns an Attr node to the element.
removeAttributeNode(), removes the specified Attr node from the element.
getElementsByTagName(), returns a list of all the nodes that are descendants of the element and match the specified element name.
normalize(), coalesces adjacent text nodes in the subtree rooted at the element.

Each node type exposes methods appropriate to that node.
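A few of the Element methods listed above can be exercised in a short Java sketch. It assumes the JDK's built-in DOM implementation; the class name ElementInterfaceDemo and the helper buildEntry() are hypothetical, and the two text children are added purely to give normalize() something to coalesce (the entries in Figure 1 are empty elements).

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class ElementInterfaceDemo {
    // Create an empty in-memory document to build nodes with.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("could not create a document", e);
        }
    }

    // Hypothetical helper: an <entry> with two attributes and two adjacent text nodes.
    static Element buildEntry(Document doc) {
        Element entry = doc.createElement("entry");
        entry.setAttribute("name", "name1");
        entry.setAttribute("value", "value1");
        entry.appendChild(doc.createTextNode("a"));
        entry.appendChild(doc.createTextNode("b")); // adjacent to the first text node
        return entry;
    }

    public static void main(String[] args) {
        Element entry = buildEntry(newDocument());

        System.out.println(entry.getTagName());
        System.out.println(entry.getAttribute("value"));

        entry.removeAttribute("value");
        // getAttribute() returns an empty string for an attribute that is not present.
        System.out.println(entry.getAttribute("value").isEmpty());

        System.out.println(entry.getChildNodes().getLength()); // before normalize()
        entry.normalize();
        System.out.println(entry.getChildNodes().getLength()); // after normalize()
    }
}
```

The child count drops after normalize() because the two adjacent text nodes are merged into one.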

Using the DOM in Perl

Enno Derksen's XML::DOM module extends the XML::Parser module to provide DOM Level 1 access to XML documents.

Parsing a Document

Loading a document using the DOM is straightforward:

 1 |use XML::DOM;                                    # Use the DOM module
   |
   |$parser = XML::DOM::Parser->new(NoExpand => 1);  # Create a new parser
   |
 5 |$doc = $parser->parsefile($cfgfile);             # Parse a file

Line 1		In order to use the DOM module, you have to, uh, `use` it.
Line 3		Like the `XML::Parser` module, the DOM module exposes the parser as an object. The `NoExpand` option tells the parser to leave entity references in place, rather than expanding them.
Line 5		If no errors occur, the parser returns the DOM `Document` object of the document.

Walking the Tree

Now we can use the methods defined by the DOM to walk over the tree and locate, for example, a specific element. The following function searches the children of $parent and returns the section named $name:

  1 |sub findChild {
    |    my $parent = shift;
    |    my $name   = shift;
    |    my $node   = $parent->getFirstChild();                  # Get first child
  5 |
    |    while ($node) {
    |        if ($node->getNodeType() == XML::DOM::ELEMENT_NODE  # Is it an element?
    |            && $node->getTagName() eq "section"             # And a section?
    |            && $node->getAttribute("name") eq $name) {      # And named $name?
 10 |            return $node;
    |        }
    |        $node = $node->getNextSibling();                    # Get the next child
    |    }
    |
 15 |    return undef;
    |}

Line 4		`getFirstChild()` on an `Element` returns the first child of the element.
Line 7		`getNodeType()` returns the node type. When we were using the parser directly, we used the start-tag event to extract only element start tags. The DOM loads the whole tree, so we must check for comments, processing instructions, etc. If the node type is `XML::DOM::ELEMENT_NODE`, the node is an element.
Line 8		`getTagName()` returns the name of the element. If it's a "section", we've found a section.
Line 9		`getAttribute()` returns the value of the specified attribute on the element. If the value of the `name` attribute equals `$name`, we've found what we're looking for.
Line 12		`getNextSibling()` returns the next sibling of the current node (or `undef`, if there is no next sibling).

Finding Elements

An alternate method for finding the appropriate section can be written with getElementsByTagName(). This method returns a list of all the "section" nodes. Iteration over the set can be used to find the one with the desired name. Here's findChild() written that way:

  1 |sub findChild {
    |    my $self   = shift;                                       # Invocant; this version is called as a method
    |    my $parent = shift;
    |    my $name   = shift;
  5 |    my $nodelist  = $parent->getElementsByTagName("section"); # Get all the sections
    |    my $nodecount = $nodelist->getLength();                   # Count them
    |
    |    for (my $count = 0; $count < $nodecount; $count++) {
    |        my $node = $nodelist->item($count);                   # Look at each one
 10 |        if ($node->getAttribute("name") eq $name) {           # If it's named $name, we found it
    |            return $node;
    |        }
    |    }
    |
 15 |    return undef;
    |}

Line 5		`getElementsByTagName()` returns a list of all of the descendants of `$parent` that are named "section". An important distinction between this version and the previous one is that this version examines all the descendants, whereas the previous version examined only the children of `$parent`. If the tree below `$parent` is large or deeply nested, collecting every descendant can cost noticeably more than checking the children alone. Note also that the node list returned is in "document order", so the search is effectively depth-first, not breadth-first.
Line 6		`getLength()` returns the number of nodes in the node list.
Line 9		Now we can walk over each node. The `item()` method returns the nth node from the list.
Line 10		If the name matches, we've found it. Note that we don't have to check the node type or element name because `getElementsByTagName()` returns only the element nodes that we specified.
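The descendant-versus-child distinction is the same in any DOM binding. As a hedged illustration, here is the getElementsByTagName() approach sketched in Java, run against a document with a nested section. The nesting is hypothetical (the format in Figure 1 does not nest sections), as are the class name FindByTagName and the method findSection(); the JDK's built-in parser is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FindByTagName {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Search ALL descendants of parent for a <section> whose name attribute matches.
    static Element findSection(Element parent, String name) {
        NodeList sections = parent.getElementsByTagName("section");
        for (int i = 0; i < sections.getLength(); i++) {
            // Every item is an element named "section", so the cast is safe.
            Element section = (Element) sections.item(i);
            if (section.getAttribute("name").equals(name)) {
                return section;
            }
        }
        return null; // no section with that name
    }

    public static void main(String[] args) {
        Document doc = parse("<configuration-file><section name=\"outer\">"
                + "<section name=\"inner\"/></section></configuration-file>");
        Element root = doc.getDocumentElement();

        // "inner" is a grandchild of the root, yet it is still found.
        Element inner = findSection(root, "inner");
        Element parentOfInner = (Element) inner.getParentNode();
        System.out.println(parentOfInner.getAttribute("name"));
    }
}
```

A search that walks only the children of the root would not find the inner section here; whether that is a feature or a bug depends on the document format.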

Using the DOM in Java

One of the real advantages of a standardized, language-independent DOM is that programming with it is essentially the same in any language. The Java function below implements findChild() using the tree-walking algorithm:

// Requires: import org.w3c.dom.Element; import org.w3c.dom.Node;
private Element findChild(Element parent, String name) {
    // Find the 'section' child of parent named 'name' and return it;
    // return null if no such child exists.
    Node node = parent.getFirstChild();
    Element child;

    while (node != null) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            child = (Element) node; // Cast Node to Element
            if (child.getTagName().equals("section")
                    && child.getAttribute("name").equals(name)) {
                return child;
            }
        }
        node = node.getNextSibling();
    }

    return null;
}
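With a tree-walking search in hand, the two functions set as the challenge at the start of this article become short. The following is a sketch, not the code that accompanies the article: the class name ProfileStrings, the argument lists, the fallback parameter, and the choice to create missing sections and entries are all assumptions, and findChild() is generalized here to take the element name as well. The JDK's built-in parser is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ProfileStrings {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // First child element of parent with the given tag and name attribute, or null.
    static Element findChild(Element parent, String tag, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element child = (Element) n;
                if (child.getTagName().equals(tag)
                        && child.getAttribute("name").equals(name)) {
                    return child;
                }
            }
        }
        return null;
    }

    // Read a value; return fallback if the section or entry does not exist.
    static String getProfileString(Document doc, String section, String key, String fallback) {
        Element sec = findChild(doc.getDocumentElement(), "section", section);
        Element entry = (sec == null) ? null : findChild(sec, "entry", key);
        return (entry == null) ? fallback : entry.getAttribute("value");
    }

    // Set a value, creating the section and entry if necessary (an assumed policy).
    static void setProfileString(Document doc, String section, String key, String value) {
        Element root = doc.getDocumentElement();
        Element sec = findChild(root, "section", section);
        if (sec == null) {
            sec = doc.createElement("section");
            sec.setAttribute("name", section);
            root.appendChild(sec);
        }
        Element entry = findChild(sec, "entry", key);
        if (entry == null) {
            entry = doc.createElement("entry");
            entry.setAttribute("name", key);
            sec.appendChild(entry);
        }
        entry.setAttribute("value", value);
    }

    public static void main(String[] args) {
        Document doc = parse("<configuration-file><section name=\"section1\">"
                + "<entry name=\"name1\" value=\"value1\"/></section></configuration-file>");
        setProfileString(doc, "section2", "someothername", "someothervalue");
        System.out.println(getProfileString(doc, "section2", "someothername", "(none)"));
    }
}
```

Writing the modified tree back to a file is left out; how that is done depends on the parser or serializer in use.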

Part III

In Parts I and II of this article, we've been moving from the traditional text processing model for accessing XML documents through progressively higher abstractions. A next logical step from here would be query languages.

At the time of this writing, the query models for XML documents are just beginning to be standardized, so it's really too early to take that step.

However, looking ahead just a little bit, we might imagine a query language using an XSL-ish syntax. Using this language to find the value for key1 in sect1 might look something like this:

section[@name="sect1"]/entry[@name="key1"]/@value

Reading from right to left, this query returns the content of the value attribute (@value) on the entry named "key1" (entry[@name="key1"]) in the section named "sect1" (section[@name="sect1"]).

Queries can be used for extracting information from XML documents, but they're essentially read-only operations. Even when they're widely supported, other mechanisms, like the DOM, will still be required if you want to make changes to the document structure.
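The XSL-ish path shown above happens to be valid in XPath, the query syntax that the W3C later standardized, so it can be tried out today. The sketch below is an illustration under assumptions, not part of the code accompanying this article: it uses the javax.xml.xpath API and DOM parser bundled with the JDK, the class name QueryDemo and the method query() are hypothetical, and the sample document uses the names "sect1" and "key1" from the query rather than those in Figure 1. The path is relative, so it is evaluated with the root element as its starting point.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class QueryDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Evaluate a path expression relative to context and return the result as a string.
    // An expression that selects nothing yields the empty string.
    static String query(Node context, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, context);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<configuration-file><section name=\"sect1\">"
                + "<entry name=\"key1\" value=\"v1\"/></section></configuration-file>");
        String value = query(doc.getDocumentElement(),
                "section[@name=\"sect1\"]/entry[@name=\"key1\"]/@value");
        System.out.println(value);
    }
}
```

The query only reads the tree. Changing the value would still go through a DOM call such as setAttribute() on the entry element.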

Conclusion

We've now examined several ways to process XML documents, at varying levels of abstraction. Regular expressions, direct interfacing to the parser, the DOM, and queries all have their place. As with most design decisions, there are tradeoffs to be made. With this exploration in hand, I hope you're ready to tackle your next XML project. Let me know how it goes, and remember to send your XML questions to xmlqna@xml.com.

Appendix A. Getting the Code

Two small applications and a number of other examples accompany this article:

sample.xml, the sample configuration file.
cfgfile2.pm, a complete configuration file interface, and tcfgfile2.pl, a test application to drive cfgfile2.pm.

This code improves on cfgfile2.pm from Part I by using the DOM instead of driving the parser directly.
cfgFile.java, a Java implementation of the configuration file interface, and testapp.java.

This code uses the XML4J parser from IBM, but should be amenable to any parser that supports the DOM.

These applications were designed to be instructive, and there's room for improvement in each of them. If you're so inclined, you might consider how to make them more general or how to provide improved error handling, among other improvements.