The XPath 2.0 Data Model

February 2, 2005

In everything I've written in this column so far about XSLT 2.0, I've been cherry-picking—taking fun new features and plugging them into what were otherwise XSLT 1.0 stylesheets in order to demonstrate these features. As XSLT 2.0 and its companion specification XQuery 1.0 approach Recommendation status, it's time to step back and look at a more fundamental difference between 2.0 and 1.0: the underlying data models. A better understanding of the differences gives you a better understanding of what you can get out of XSLT 2.0 besides a wider selection of function calls.

The current XSLT 2.0 Working Draft's short Data Model section opens by saying, "The data model used by XSLT is the XPath 2.0 and XQuery 1.0 data model, as defined in [the W3C TR XQuery 1.0 and XPath 2.0 Data Model document]. XSLT operates on source, result, and stylesheet documents using the same data model." The XSLT 2.0 Data Model section goes on to describe a few details about issues such as white space and attribute type handling, but these concern XSLT processor developers more than stylesheet developers. The "XQuery 1.0 and XPath 2.0 Data Model" document that it mentions is what the majority of us really want to look at.

Before looking more closely at this document, however, let's back up for some historical context. Many people felt that the original XML 1.0 Recommendation released in February of 1998 failed to describe a rigorous data model. This raised the possibility of different applications interpreting document information differently, increasing the danger of incompatibilities. To remedy this, the W3C released the XML Information Set Recommendation (also known as "the infoset") in October of 2001. It described the exact information to expect from an XML document more formally and less ambiguously than the original 1998 XML Recommendation did.

Like the XSLT 2.0 spec, the XSLT 1.0 Recommendation includes a Data Model section that describes its basic dependence on the XPath data model (in this case, XPath 1.0) before going on to describe a few new details. The XPath 1.0 Recommendation's Data Model section, which is about six pages when printed out, provides a subsection for each possible node type in the XPath representation of an XML tree: root nodes, element nodes, text nodes, attribute nodes, namespace nodes, processing instruction nodes, and comment nodes. (Remember that there are different ways to model an XML document as a tree—for example, a DOM tree, an entity structure tree, an element tree, and an XPath tree—so certain basic tree structure ideas such as "root" will vary from one model to another.) This spec also includes a brief, non-normative appendix that describes how you can create an XPath 1.0 tree from an infoset, so that no guesswork is needed regarding the relationship of the XPath 1.0 data model to the official model describing the information to find in an XML document.

The Data Model sections of the XSLT and XPath specs mentioned so far are all fairly short. By contrast, when I printed the latest draft of the new XQuery 1.0 and XPath 2.0 Data Model document, it came to 90 pages, although nearly half of those consist of appendices that restate earlier information in more mechanical or tabular form. The document has a reputation for being complex, but once you have a good overview of its structure, the complexity appears more manageable.

The Data Model document's Introduction tells us that it's based on the infoset with two additions: "support for [W3C] XML Schema types" and "representation of collections of documents and of complex values."

This talk of "complex values" makes the second addition sound more complicated, but it's actually simpler, so I'll cover it first. This part of the data model has actually simplified a messier aspect of XPath 1.0, and we've already seen the payoff in an earlier column on using temporary trees in XSLT 2.0. Here are the key points, including some direct quotes from the document:

"Every instance of the data model is a sequence." (One class of "pervasive changes" from XSLT 1.0 to 2.0 is "support for sequences as a replacement for the node-sets of XPath 1.0.")
"A sequence is an ordered collection of zero or more items."
Every item is either an atomic value (like the number 14 or "this string") or a node of one of the seven types listed above: element node, attribute node, text node, and so forth. The list of seven now has "document node" in place of "root node," presumably because temporary trees can have root nodes that lack the special properties found in the root node of an actual document tree.

Sequences

There's nothing special that must happen to a node for it to be considered part of a sequence; a single item is treated as a sequence with one item in it. One sequence can't contain another sequence, which simplifies things: a sequence's items are the node items and atomic value items in that sequence, period. Whether an XPath expression selects a set of nodes or computes an atomic value, the value of that expression is a sequence.
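These rules can be checked with a few calls to the count function. The following is a minimal sketch, assuming an XSLT 2.0 processor that lets you start a transformation at a named template; the template name "main" is my own placeholder, not something the spec requires.

```xslt
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- "main" is a hypothetical entry point; no source document is needed -->
  <xsl:template name="main">
    <!-- A nested sequence is flattened: this counts four items, not three -->
    <xsl:value-of select="count((1, (2, 3), 4))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- A single atomic value is a sequence with one item in it -->
    <xsl:value-of select="count('this string')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- The empty sequence has zero items -->
    <xsl:value-of select="count(())"/>
  </xsl:template>
</xsl:stylesheet>
```

The three lines of output should be 4, 1, and 0.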

The idea of a "sequence constructor" comes up often in XSLT 2.0; the spec uses the term about 300 times. The "Sequence Constructors" section defines one as "a sequence of zero or more sibling nodes in the stylesheet that can be evaluated to return a sequence of nodes and atomic values." In 2.0, a template rule contains a sequence constructor that gets evaluated to return a sequence for use in your result tree, a temporary tree, or wherever you like. Variables, stylesheet-defined functions, and elements such as xsl:for-each, xsl:if, and xsl:element are all defined in the specification as having sequence constructors as their contents.
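As a rough illustration, the content of the xsl:variable element below is a sequence constructor that returns two atomic values and one element node. This is only a sketch: the "main" template name and the para element are placeholders of mine, and the "as" attribute is there so that the variable keeps the three items as a sequence instead of wrapping them in a temporary tree.

```xslt
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- "main" is a hypothetical entry point; no source document is needed -->
  <xsl:template name="main">
    <!-- The three children of xsl:variable form a sequence constructor -->
    <xsl:variable name="items" as="item()*">
      <xsl:sequence select="14"/>
      <xsl:sequence select="'this string'"/>
      <para>an element node</para>
    </xsl:variable>
    <!-- Counts the items the sequence constructor returned -->
    <xsl:value-of select="count($items)"/>
  </xsl:template>
</xsl:stylesheet>
```

The output should be 3. Without the "as" attribute, the same content would build a temporary tree, and the count would be 1 for its single document node.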

A node can have several properties, such as a node name, a parent, and children. Not all node types have the same properties; for example, a document node has no parent property. The data model document talks a lot about retrieving the values of node properties with "accessors," which are abstract functions that stand for ways to get those values. For example, given a node, the dm:parent accessor returns that node's parent. (The data model document uses the "dm:" prefix on all accessor functions without declaring a namespace URI for it, because these aren't real functions. If you want real functions, see the XQuery 1.0 and XPath 2.0 Functions and Operators document.)

A tree consists of a node and all the nodes that you can reach from it using the dm:children, dm:attributes, and dm:namespaces accessors. If a tree's root node is a document node, the tree is an XML document. If it's not, the tree is considered to be a fragment. This idea of tree fragments is an improvement over the XSLT 1.0 data model, with its concept of Result Tree Fragments, because the operations allowed on Result Tree Fragments were an often-frustrating subset of those permitted on node-sets. The ability to perform the same operations on a 2.0 source tree, a temporary result tree, a subtree of either, or a temporary tree created on the fly or stored in a variable gives you a lot more flexibility, because you can apply template rules to the nodes of any of them.
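To make the contrast with Result Tree Fragments concrete, here is a small sketch that stores a temporary tree in a variable and then applies template rules to nodes inside it. The element names (list, item) and the "main" entry template are placeholders I made up for the example.

```xslt
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- "main" is a hypothetical entry point; no source document is needed -->
  <xsl:template name="main">
    <!-- With no "as" attribute, the variable holds a temporary tree
         rooted at a document node -->
    <xsl:variable name="temp">
      <list>
        <item>one</item>
        <item>two</item>
      </list>
    </xsl:variable>
    <!-- An XSLT 1.0 processor would reject this path into a
         Result Tree Fragment; in 2.0 it is an ordinary tree -->
    <xsl:apply-templates select="$temp/list/item"/>
  </xsl:template>

  <!-- The same rule would fire for item elements in a source tree -->
  <xsl:template match="item">
    <xsl:value-of select="upper-case(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The output should be ONE and TWO on separate lines.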

The new data model offers more ways to address a collection of documents than the XPath 1.0 model did. XSLT 2.0 offers several ways to create a sequence. One simple XPath 2.0 way is to put a comma-delimited list of sequence items inside of parentheses, as I demonstrated in an earlier column on writing your own functions in XSLT 2.0. The list's items can even be entire documents, as shown in the following xsl:value-of instruction:

<xsl:value-of select="reverse((document('test1.xml'),document('test2.xml'),
                               document('test3.xml')))"/>

The outer parentheses belong to the reverse function, which expects a sequence as an argument, and the next parentheses in from those create a sequence of documents, with each document being pulled in with a call to the document function. (I used the reverse call to test whether my XSLT 2.0 processor would treat the list of documents as a proper sequence.)

The new collection function returns a collection of documents. The argument to pass to it (for example, the name of a directory containing documents, or a document listing document URIs) depends on the implementation.
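A call to the collection function might look like the following sketch. The URI 'docs-catalog.xml' is purely hypothetical; whether your processor expects a catalog file, a directory name, or something else there is exactly the implementation-dependent part, so check its documentation before trying this.

```xslt
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical collection URI; its meaning depends on the processor -->
  <xsl:param name="docs" select="'docs-catalog.xml'"/>

  <!-- "main" is a hypothetical entry point; no source document is needed -->
  <xsl:template name="main">
    <!-- Treat the returned collection like any other sequence of nodes -->
    <xsl:for-each select="collection($docs)">
      <xsl:value-of select="document-uri(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For each document node in the collection, this lists the document's URI, if the processor knows it.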

XSLT and W3C Schema Types

The Data Model document's Types section tells us that "the data model supports strongly typed languages such as [XPath 2.0] and [XQuery] that have a type system based on [Schema Part 1]." Note that it's based not on "Schema Part 2," the Datatypes part of the W3C Schema specification, but on XML Schema Part 1: Structures, which itself relies on Part 2. Part 2 defines built-in types such as xs:integer, xs:dateTime, and xs:boolean. It also lets you define new simple types by restricting the existing ones (for example, an employeeAge type defined as an integer between 15 and 70). Part 1 builds on that by letting you define complex types that have structure: content models (for example, an article type consisting of a title followed by one or more para elements) and attributes.

An XSLT 2.0 stylesheet can use the typing information provided by a schema to ensure the correctness of a source document, of a result document, and even of temporary trees and expressions in the stylesheet itself. You can declare stylesheet functions to return values of a certain type, and you can declare parameters and variables to be of specific types so that attempts to create invalid values for them will result in an error. These features help you find data and stylesheet errors earlier and with more helpful error messages.
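Type declarations on functions, parameters, and variables use the "as" attribute. The sketch below sticks to the built-in xs:integer type, so it shouldn't need a schema at all; the function name my:double, its namespace URI, and the "main" entry template are placeholders I invented for the example.

```xslt
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:my="http://example.com/ns/my-functions"
                exclude-result-prefixes="xs my">
  <xsl:output method="text"/>

  <!-- Hypothetical function declared to take and return an xs:integer -->
  <xsl:function name="my:double" as="xs:integer">
    <xsl:param name="n" as="xs:integer"/>
    <xsl:sequence select="$n * 2"/>
  </xsl:function>

  <!-- "main" is a hypothetical entry point; no source document is needed -->
  <xsl:template name="main">
    <xsl:variable name="age" as="xs:integer" select="35"/>
    <xsl:value-of select="my:double($age)"/>
    <!-- Passing a string instead, as in my:double('35'), would be
         reported as a type error rather than silently converted -->
  </xsl:template>
</xsl:stylesheet>
```

The output should be 70.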

Another nice feature of type-aware XSLT 2.0 processors is their ability to process source nodes by matching against their type. For example, a stylesheet can include a template rule that processes all elements and attributes of type xs:date.
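Such a template rule might look like the sketch below. It assumes a schema-aware processor and a source document that has already been validated, so that its nodes carry type annotations; with a basic processor, nothing would be annotated as xs:date and the rule would never fire. The date result element is a placeholder of mine.

```xslt
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs">

  <!-- Matches any element or attribute annotated as xs:date (or as a
       type derived from it), whatever the node's name happens to be -->
  <xsl:template match="element(*, xs:date) | attribute(*, xs:date)">
    <date><xsl:value-of select="."/></date>
  </xsl:template>
</xsl:stylesheet>
```

The point of the design is that the rule keys on what the value is and not on what the node is called, so one rule can cover a birthDate attribute and a shipped element alike.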

I mentioned above how the XPath 1.0 spec includes an appendix that describes the relationship of its data model to the infoset's data model. The XQuery 1.0 and XPath 2.0 Data Model document includes detailed descriptions of how to map its data model to an infoset and how to create each component of the data model from an infoset and from a PSVI. The latter is a huge job, accounting for a good portion of the new data model spec's length, because a Post Schema Validation Infoset can hold more information than a pre-validation infoset. A W3C schema can include information about typing, defaults, and more, so validation against that schema can turn an element such as <length>2</length> into <length unit="in">2.0</length> along with the associated information that the "2.0" represents a decimal number. In XPath 2.0, the type-name node property stores the name of the data type declared for the value in the schema, if available from a validation stage of the processing, and "xdt:untyped" otherwise. The rules for determining the value of the type-name property from an associated W3C schema are laid out in the Mapping PSVI Additions to Type Names section of the Data Model document, but be warned—only a big fan of W3C schemas could enjoy reading it.

Things are a bit simpler when the types that may come up in your XSLT processing are limited to the Schema Part 2 types. Without having your stylesheet refer to content models and other aspects of complex types, it can be handy to identify some values as integers, some as booleans, and some as URIs. The Data Model document adds five new types to the 19 primitive types defined in the Part 2 Recommendation: xdt:untyped, mentioned above; xdt:untypedAtomic, which, like xdt:untyped, serves more as an annotation about typing status than as the name of an actual type; xdt:anyAtomicType, an abstract type that plugs a newly discovered architectural hole; and xdt:dayTimeDuration and xdt:yearMonthDuration, two types that offer totally ordered ways to measure elapsed time. That is, a sequence of values of either of these types can be properly sorted, which wasn't always the case with the duration type offered in the Schema Part 2 specification, where comparing "one month" with "30 days" wouldn't always give you a clear answer.
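Here is a sketch of what total ordering buys you: a sequence of day-and-time durations sorted with an ordinary xsl:sort. One loud assumption: I've written the type as xs:dayTimeDuration in the XML Schema namespace, which is where the final versions of the specifications placed it. A processor built against the drafts discussed here would expect the xdt: prefix bound to that draft's own namespace URI. The "main" entry template is also a placeholder.

```xslt
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs">
  <xsl:output method="text"/>

  <!-- "main" is a hypothetical entry point; no source document is needed -->
  <xsl:template name="main">
    <!-- Assumption: the type lives in the xs namespace, as in the final specs -->
    <xsl:variable name="spans" as="xs:dayTimeDuration*"
        select="(xs:dayTimeDuration('P2D'),
                 xs:dayTimeDuration('PT36H'),
                 xs:dayTimeDuration('PT90M'))"/>
    <!-- Any two of these values can be compared, so the sort is well defined -->
    <xsl:for-each select="$spans">
      <xsl:sort select="."/>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The durations should come out shortest first: ninety minutes, then thirty-six hours, then two days.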

For now, remember that if you completely ignore type-aware XSLT 2.0 processing—which will be a perfectly valid approach for much XSLT 2.0 development—the big difference between the XSLT 1.0 and 2.0 data models is that a single node, a complete document, a subtree representing a fragment of a document, and any other set of nodes described by an XPath expression are all sequences, and that much of the processing is now described in terms of these sequences. From a practical standpoint, an XSLT 1.0 stylesheet with the version attribute of its xsl:stylesheet element set to "2.0" and a few new functions called here and there will still work, as we've seen in my earlier columns on XSLT 2.0. Also remember that the new Data Model document lays the groundwork for not only XPath 2.0 (and therefore XSLT 2.0), but also XQuery, so an appreciation of sequences will help you learn XQuery more quickly.

In a future column, I'll demonstrate what type-aware XSLT processing adds to your stylesheet and its handling of source and result documents.