A Data Model for Strongly Typed XML

December 18, 2002

Introduction

In many XML applications, the producers and consumers of XML documents are aware of the datatypes within those documents. Such applications can benefit from manipulating XML via a data model that presents a strongly typed view of the document. Although a number of abstractions exist for manipulating XML -- the XML DOM, XML Infoset, and XPath data model -- none of these views of XML takes into account usage scenarios involving strongly typed XML.

Many developers use XML in situations where type information is known at design or compile time, including interacting with relational databases and strongly typed programming languages like Java and C#. Thus, a significant proportion of the XML developer community would benefit from a data model that encourages looking at XML as typed data. This article is about my search for and discovery of this data model.

The XML Information Set

The W3C XML Information Set recommendation describes an abstract representation of an XML document. The XML Infoset is primarily meant to act as a set of definitions used by XML technologies to formally describe what parts of an XML document they operate on. Several W3C XML technologies are described in terms of the XML Infoset, including SOAP 1.2, XML Schema, and XQuery.

An XML document's information set consists of a number of information items. An information item is an abstract representation of a component of an XML document, such as an element, attribute, or processing instruction. Each information item has a set of associated named properties. Each property is either a collection of related information items or data about the information item; the [children] property of an element information item is an example of the former, while the [base URI] of a document information item is an example of the latter. An XML document's information set must contain a document information item from which all other information items belonging to the document can be accessed. The XML Infoset is thus a tree-based hierarchical representation of an XML document.
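As a rough illustration of the idea that every information item can be reached from the document information item, the sketch below walks a parsed document with Python's xml.dom.minidom. The DOM is not the Infoset; it is used here only as a convenient stand-in, and the helper name walk_items is made up for this example.

```python
from xml.dom import minidom


def walk_items(node, depth=0):
    """Yield (depth, kind, name) for a node and everything reachable from it."""
    yield depth, type(node).__name__, node.nodeName
    # Attributes hang off an element as a separate property, not as children.
    if node.nodeType == node.ELEMENT_NODE:
        for i in range(node.attributes.length):
            yield depth + 1, "Attr", node.attributes.item(i).name
    for child in node.childNodes:
        yield from walk_items(child, depth + 1)


# Hypothetical sample document, chosen only to show several kinds of items.
doc = minidom.parseString('<?pi data?><root id="r"><child>text</child></root>')
items = list(walk_items(doc))
for depth, kind, name in items:
    print("  " * depth + f"{kind}: {name}")
```

Starting from the single document node, the walk reaches the processing instruction, both elements, the attribute, and the text, which mirrors the tree shape described above.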

The W3C recommendation simply calls the infoset a list of definitions, effectively a glossary of sorts; some consider the XML Infoset to be an official attempt to define what counts as significant information in an XML document. For example, the infoset does not differentiate between character content that is written as character references, written as CDATA sections, or entered directly. So the following

    <test><![CDATA[ ]]>2</test>
    <test>&#32;2</test>
    <test> 2</test>

are considered equivalent according to the XML Infoset. Similarly, the kind of quote character used for attributes is not considered significant; thus, the elements

    <test attr='value'/>
    <test attr="value"/>

are considered equivalent from the perspective of the XML Infoset. Many people in the XML community feel that the XML Infoset isn't just an inventory of various aspects of an XML document but an inventory of the significant aspects of an XML document. Others consider the XML Infoset to be the data model for XML.
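One way to see these equivalences in practice is to parse the variants with an ordinary XML parser and compare what it reports. The sketch below uses Python's xml.etree.ElementTree and assumes the character reference stands for a plain space (&#32;); a parser is of course only an approximation of the abstract Infoset.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same character content: a space followed by "2".
variants = [
    "<test><![CDATA[ ]]>2</test>",  # space inside a CDATA section
    "<test>&#32;2</test>",          # space as a character reference
    "<test> 2</test>",              # space entered directly
]
texts = [ET.fromstring(v).text for v in variants]

# Two spellings of the same attribute, differing only in quote character.
attrs = [
    ET.fromstring(q).attrib
    for q in ("<test attr='value'/>", '<test attr="value"/>')
]

print(texts)
print(attrs)
```

After parsing, nothing records which spelling was used: the three text values compare equal, and so do the two attribute dictionaries.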

Deconstructing the PSVI

A conformant W3C XML Schema (WXS) processor accepts an XML Infoset as input and transforms it into a Post Schema Validation Infoset (PSVI) upon validation. The WXS recommendation specifies a number of required information set items and properties that must be present in the input XML Infoset. Of note is the fact that only information about elements, attributes, textual content, and namespaces is required to be passed to the WXS processor. This is understandable given that those information items are the only aspects of an XML document the WXS recommendation can constrain.

A PSVI is the original input XML Infoset with new information items added and new properties added to existing information items. The WXS recommendation lists the contributions to the Post Schema Validation Infoset. There are five broad classes of information contained within the PSVI:

Validation Outcomes: Information related to whether an element or attribute was successfully validated or not. The [schema error code] property added to attribute information items and the [validity] property added to element information items are examples of information in the PSVI related to validation outcomes.
Default Information: Indications as to whether the value of an element or attribute was obtained via default values specified in the schema or not. The [schema default] property added to attribute information items is an example of information in the PSVI related to default information.
Identity Constraint Relationships: Tables containing information related to WXS identity constraint mechanisms such as ID/IDREF and key/keyref/unique. The [ID/IDREF table] and [identity-constraint table] properties added to element information items are examples of information in the PSVI related to identity constraint relationships.
Type Annotations: References to schema components that may be type definitions or element and attribute declarations. Also includes values of elements and attributes after normalization based on their datatype. The [type definition] and [member type definition] properties added to element and attribute information items in the PSVI are examples of references to schema components that act as type annotations.
Notation Data: Information related to notation declarations in the schema. The [notation] and [notation public] properties added to element information items are examples of information in the PSVI related to notations.

For users of XML interested primarily in strongly typed data, the PSVI is unsatisfactory for a number of reasons. First, the PSVI contains a lot of heavyweight information that is irrelevant to developers of strongly typed applications, such as defaulting information and identity constraint relationships. In fact, I am unaware of any schema processor that provides access to all of the PSVI, and the WXS recommendation limits the responsibility of schema processors in exposing the PSVI. Second, the PSVI is too tightly coupled to the WXS recommendation and cannot stand on its own as a generic data model for strongly typed XML.

However, the type annotations in the PSVI are a good start: they are an attempt to couple type information to an abstract representation of an XML document.

Discovering the XQuery and XPath 2.0 Data Model

The XQuery and XPath 2.0 data model is the next iteration of the XPath data model. The data model is based on the XML Infoset with two significant additions:

The XQuery and XPath 2.0 data model supports identifying the datatypes associated with elements and attributes via an expanded name (i.e., the misnamed xs:QName type).
Sequences consisting of nodes, atomic values, or both can be represented in the XQuery and XPath 2.0 data model.

The ability to identify the datatype of nodes via a namespace URI and a local name (i.e., an expanded name) provides a loosely coupled mechanism for supporting W3C XML Schema datatypes and potentially any other type system where individual types can be identified by an expanded name. Describing sequences of nodes and atomic values in the data model is necessary to represent intermediate results of query processing. For instance, the data model has to be able to represent the results of queries that return atomic values, such as 1 + 1; queries that return nodes, such as //*; and queries that return a mixture of both, such as (5, <six />, 7).
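To make the two additions concrete, here is a minimal sketch of how typed atomic values and mixed sequences might be modeled. The class and function names (ExpandedName, AtomicValue, xs_integer, describe) are invented for this illustration and do not come from the working draft.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import NamedTuple, Union

XS = "http://www.w3.org/2001/XMLSchema"


class ExpandedName(NamedTuple):
    """A type identified by namespace URI plus local name."""
    namespace_uri: str
    local_name: str


@dataclass(frozen=True)
class AtomicValue:
    value: object
    type_name: ExpandedName


# An item in a sequence is either an atomic value or a node.
Item = Union[AtomicValue, ET.Element]


def xs_integer(n: int) -> AtomicValue:
    return AtomicValue(n, ExpandedName(XS, "integer"))


def describe(item: Item) -> str:
    if isinstance(item, AtomicValue):
        name = item.type_name
        return f"atomic {item.value} of type {{{name.namespace_uri}}}{name.local_name}"
    return f"node <{item.tag}>"


atomic_only = [xs_integer(1 + 1)]                          # result of 1 + 1
mixed = [xs_integer(5), ET.Element("six"), xs_integer(7)]  # (5, <six />, 7)

for item in mixed:
    print(describe(item))
```

Because the type is just a pair of strings, nothing in this shape depends on W3C XML Schema itself; a different type system could supply its own namespace URI and local names.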

The nodes in the XQuery and XPath 2.0 data model are analogous to information items in the XML Infoset. Although each node in the XQuery and XPath 2.0 data model has a corresponding information item, not every information item has a corresponding node. The notation, unexpanded entity reference, unparsed entity, and document type declaration information items do not have corresponding nodes in the XQuery and XPath 2.0 data model. There is also a correspondence between some of the properties added to an XML Infoset in the PSVI and the information present in the XQuery and XPath 2.0 data model. Type annotations and validation outcomes are mapped to type information using an algorithm described in the XQuery and XPath 2.0 working draft. The current algorithm specifies that if an element or attribute was successfully validated, the expanded name of its type is obtained from its type annotation; otherwise, the type is xs:anyType for an element and xs:anySimpleType for an attribute. PSVI information such as notation data, default information, and identity constraint relationships is not represented in the XQuery and XPath 2.0 data model.
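The mapping rule described above can be sketched in a few lines. This is only my reading of the rule as summarized in this article, not the working draft's exact algorithm; the function name node_type_name, its parameters, and the string values used for kind and validity are assumptions made for the example.

```python
from typing import NamedTuple, Optional

XS = "http://www.w3.org/2001/XMLSchema"


class ExpandedName(NamedTuple):
    namespace_uri: str
    local_name: str


def node_type_name(kind: str, validity: str,
                   type_annotation: Optional[ExpandedName]) -> ExpandedName:
    """Pick the type name for a node from (assumed) PSVI-style inputs.

    kind is "element" or "attribute"; validity is a string such as
    "valid" or "invalid"; type_annotation is the annotated type, if any.
    """
    # Successfully validated: trust the type annotation.
    if validity == "valid" and type_annotation is not None:
        return type_annotation
    # Otherwise fall back to the most general type for the node kind.
    if kind == "element":
        return ExpandedName(XS, "anyType")
    if kind == "attribute":
        return ExpandedName(XS, "anySimpleType")
    raise ValueError(f"unsupported node kind: {kind!r}")


integer = ExpandedName(XS, "integer")
print(node_type_name("element", "valid", integer))
print(node_type_name("element", "invalid", integer))
print(node_type_name("attribute", "invalid", None))
```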

In the XQuery and XPath 2.0 data model, information about nodes is obtained via accessor functions that can operate on any node. These accessor functions are analogous to an information item's named properties. The accessor functions are figurative and are not meant to specify a programming interface to the XQuery and XPath 2.0 data model. Instead they are intended to serve as a concise description of the information that must be exposed by the data model. The XQuery and XPath 2.0 data model also specifies a number of constructor functions whose purpose is to illustrate how nodes in the data model are constructed. This aspect of the data model is currently underspecified and the relationship between the data passed to a constructor and that retrieved by accessors on the constructed node is unclear. It is likely that these issues will be cleared up in future versions of the working draft.

The primary upgrade to nodes in XQuery and XPath 2.0 from XPath 1.0 is that they can now have a typed value and a type. The typed value of a node is a sequence of zero or more atomic values obtained from the content of the node; its type is a named type identified by an expanded name. These two additions make it possible to treat nodes, specifically element and attribute nodes, as receptacles of strongly typed data.
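As an informal sketch of what a typed-value accessor might look like, the code below converts an element's text content into a sequence of atomic values based on a type name. The typed_value function and the CONVERTERS table are hypothetical, cover only three types, and ignore most of the lexical rules a real implementation would need.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical converters from string content to a tuple of atomic values.
CONVERTERS = {
    (XS, "integer"): lambda s: (int(s),),
    (XS, "boolean"): lambda s: (s.strip() in ("true", "1"),),
    (XS, "NMTOKENS"): lambda s: tuple(s.split()),  # list type: many values
}


def typed_value(element, type_name):
    """Return the element's content as a tuple of atomic values."""
    text = "".join(element.itertext())
    # Unknown types fall back to a single untyped string.
    convert = CONVERTERS.get(type_name, lambda s: (s,))
    return convert(text)


count = typed_value(ET.fromstring("<count>42</count>"), (XS, "integer"))
tokens = typed_value(ET.fromstring("<tags>a b c</tags>"), (XS, "NMTOKENS"))
print(count, tokens)
```

The same element yields a Python int rather than the string "42", which is the practical payoff of attaching a type to a node.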

Points to Ponder

Although the XQuery and XPath 2.0 data model provides a mechanism for representing strongly typed XML, it does not fully meet the needs of users who want to work only with well-formed XML. To satisfy the unique needs of XML query processing, the data model is capable of representing XML documents that are not well-formed.

To make the XQuery and XPath 2.0 data model meet the needs of generic users of strongly typed XML, certain aspects of the data model may have to be made more strict. Specifically, in the XQuery and XPath 2.0 data model a document node allows more than one element node as a child and also permits text nodes as children. Although this is not a concern for general XML processing scenarios, it would be a concern if an attempt were made to serialize the data model and pass it to consumers of well-formed XML without ensuring the XML in the data model was well-formed. To restrict this to well-formed XML, one could mandate that a document node can only contain exactly one element node, zero or more comment nodes, and zero or more processing instruction nodes as children.
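The proposed restriction is simple enough to express as a check over the kinds of a document node's children. In this sketch the function name and the kind strings are my own choices, not terms from the working draft.

```python
ALLOWED_KINDS = {"element", "comment", "processing-instruction"}


def is_well_formed_document(child_kinds):
    """True if the children fit the stricter rule: exactly one element,
    plus any number of comments and processing instructions."""
    kinds = list(child_kinds)
    only_allowed = all(kind in ALLOWED_KINDS for kind in kinds)
    return only_allowed and kinds.count("element") == 1


# A conventional document: a comment, a PI, and a single root element.
print(is_well_formed_document(["comment", "processing-instruction", "element"]))
# Permitted by the data model, but not well-formed XML when serialized:
print(is_well_formed_document(["element", "element"]))
print(is_well_formed_document(["text", "element"]))
```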

There are certain edge cases where using the XQuery and XPath 2.0 data model can result in information loss. Consider the following XML instance and schema fragment:

  <xs:schema  xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="A" type="xs:integer" />
  </xs:schema>

  <A>1<!-- comment may be lost -->2</A>

The process described for mapping an element information item in the PSVI to an element node in the XQuery and XPath 2.0 data model could lose information about the comment node, since there is no requirement for comments or processing instructions to be exposed in the PSVI. This approach treats comments and processing instructions as second class citizens. However, this is typically the case in strongly typed applications that utilize XML.
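The effect can be imitated with an ordinary parser. The sketch below does no schema validation; it simply assumes, as in the fragment above, that A is declared as xs:integer, and shows that once the content is reduced to a typed value the comment is no longer recoverable.

```python
from xml.dom import minidom

doc = minidom.parseString("<A>1<!-- comment may be lost -->2</A>")
a = doc.documentElement

# The parsed tree still has the comment between two text nodes.
child_kinds = [type(child).__name__ for child in a.childNodes]

# Building the typed value only looks at the character content.
string_value = "".join(
    child.data for child in a.childNodes if child.nodeType == child.TEXT_NODE
)
typed = int(string_value)  # assumes A was declared as xs:integer

print(child_kinds)
print(typed)
```

Anything that consumes only the integer has no way of knowing a comment ever sat in the middle of it.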

So What's the Verdict?

The XQuery and XPath 2.0 Data Model is still a working draft; some of its details may change before it becomes a W3C recommendation. However, the core ideas behind the data model which this article explores are unlikely to change. This article is based on the November 15th version of the working draft.

The XQuery and XPath 2.0 data model presents itself as a viable data model for processing XML in strongly typed usage scenarios. The loose coupling to the W3C XML Schema type system is especially beneficial because it provides an interoperable set of types without limiting one to solely those types. The XQuery and XPath 2.0 data model stands out as the most credible data model for dealing with XML in strongly typed scenarios. Given that the XQuery and XPath 2.0 data model is based on the XML Infoset and also builds upon past experience with XPath 1.0, it's the best candidate for the Data Model for XML.

I'd like to thank Sergey Dubinets and Mark Fussell for their help with this article.