Beyond W3C XML Schema

April 10, 2002

XPath and XSLT for Validation

An XML developer who needs to validate documents as part of application flow may begin by writing a W3C XML Schema for those documents. This is natural enough, but W3C XML Schema is only one part of the validation story. In this article, we will develop a multiple-stage validation process that begins with schema validation but also uses XPath and XSLT to assert constraints on document content that are too complex or otherwise inappropriate for W3C XML Schema.

We can think of a schema as both expressive and prescriptive: it describes the intended structure and interpretation of a type of document, and in the same breath it spells out constraints on legal content. There is a bias toward the expressive, though: W3C XML Schema emphasizes "content models", which are good at defining document structure but insufficient to describe many constraint patterns.

This is where XPath and XSLT come in: we'll see that a transformation-based approach will let us assert many useful constraints and is in many ways a better fit to the validation problem. (In fact, one might define schema validation as no more than a special kind of transformation -- see van der Vlist.)

We'll begin by looking at some common constraint patterns that W3C XML Schema does not support very well and then develop a transformation-based approach to solving them.

Constraints -- Common Patterns

We'll look at two examples, each of which is awkward to express in W3C XML Schema. First, consider the schema shown below, modeling a home stereo system. It requires one of two configurations for sound amplification and then allows any number of sound sources in sequence. Finally, speakers are listed. (Note that for simplicity in this example we're leaving out data type information and focusing on structure. For more fully-worked examples and downloadable code, see the complete whitepaper.)

<?xml version="1.0" encoding="UTF-8" ?>

<xs:schema version="1.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
>

  <xs:element name="Stereo"><xs:complexType>
    <xs:sequence>
      <xs:choice>
        <xs:sequence>
          <xs:element name="Amplifier" />
          <xs:element name="Receiver" />
        </xs:sequence>
        <xs:element name="Tuner" />
      </xs:choice>
      <xs:element name="CDPlayer" minOccurs="0" maxOccurs="unbounded" />
      <xs:element name="Turntable" minOccurs="0" maxOccurs="unbounded" />
      <xs:element name="CassetteDeck" minOccurs="0" maxOccurs="unbounded" />
      <xs:element name="QuadraphonicDiscPlayer" minOccurs="0" maxOccurs="unbounded" />
      <xs:element name="Speaker" minOccurs="2" maxOccurs="6" />
    </xs:sequence>
  </xs:complexType></xs:element>

</xs:schema>

We have occurrence constraints that demand at least two speakers, but let's assume that a system with a quadraphonic sound source must have at least four speakers to be valid. So the following document is valid against the above schema, but for our broader purposes is incorrect:

<?xml version="1.0" encoding="UTF-8" ?>

<Stereo
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="Stereo.xsd"
>

  <Amplifier>Mondo Electronics</Amplifier>
  <Receiver>Mondo Electronics</Receiver>
  <QuadraphonicDiscPlayer>CSI Labs</QuadraphonicDiscPlayer>
  <Speaker>Moltman</Speaker>
  <Speaker>Moltman</Speaker>

</Stereo>

Now we could break the whole content model into an xs:choice between two types of systems, one of which included quadraphonia, but this solution is pretty ugly. In practice, this same pattern can occur over many more different types, in more possible combinations, and a model that enumerates all legal configurations rapidly becomes unwieldy -- difficult to read and to maintain.
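To make the cost concrete, here is one way the quadraphonic rule might be folded into the content model. This is only a sketch under our own assumptions, not part of the original schema: it splits the tail of the sequence into two branches, one for systems with a quadraphonic player and one for systems without.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema version="1.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
>
  <xs:element name="Stereo"><xs:complexType>
    <xs:sequence>
      <xs:choice>
        <xs:sequence>
          <xs:element name="Amplifier" />
          <xs:element name="Receiver" />
        </xs:sequence>
        <xs:element name="Tuner" />
      </xs:choice>
      <xs:element name="CDPlayer" minOccurs="0" maxOccurs="unbounded" />
      <xs:element name="Turntable" minOccurs="0" maxOccurs="unbounded" />
      <xs:element name="CassetteDeck" minOccurs="0" maxOccurs="unbounded" />
      <!-- Assumption: the rule is encoded by branching the end of the model. -->
      <xs:choice>
        <!-- Quadraphonic branch: one or more players, then 4 to 6 speakers. -->
        <xs:sequence>
          <xs:element name="QuadraphonicDiscPlayer" maxOccurs="unbounded" />
          <xs:element name="Speaker" minOccurs="4" maxOccurs="6" />
        </xs:sequence>
        <!-- Ordinary branch: no quadraphonic player, 2 to 6 speakers. -->
        <xs:element name="Speaker" minOccurs="2" maxOccurs="6" />
      </xs:choice>
    </xs:sequence>
  </xs:complexType></xs:element>
</xs:schema>
```

Even this small case works only because the quadraphonic player happens to sit immediately before the speakers, so the two branches begin with different elements. The Speaker declaration is now duplicated, and each further cross-element rule would multiply the branches again.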

Thus one common pattern is the need to analyze the document tree as a whole. W3C XML Schema focuses on the immediate relationships between elements and attributes, parents and children. A more direct approach to pure validation is to start from the document scope and make assertions from there, drilling down as far as necessary to express a constraint. As we'll see, XPath is far better suited to abstract tree analysis than is W3C XML Schema.

Secondly, consider weakly-typed designs. Weak types are generally to be discouraged in XML document design, but this pattern does tend to pop up at some point out of necessity. Instead of creating multiple subtypes to express specializations, one complex type is used that includes a "type" attribute, usually an enumeration with one possible value for each pseudo-subtype.

Based on the value of this attribute, other attributes and elements may or may not be meaningful. Thus from a W3C XML Schema perspective all these attributes and elements must be considered optional, and this weakens the prescriptive capability of the schema. This is an especially tough nut for Schema to crack, since nothing in the W3C XML Schema Recommendation allows for validation of structure based on values in the instance document.

A schema for credit transactions offers an example of a weakly-typed system. Various means of authenticating the human actor are defined, and the Means of effecting the transaction is our "type" attribute. Different means will require different combinations of authentications.

As in the Stereo example, this is primarily meant to illustrate what the schema cannot do: how could we express that if Means is "In person", then both SignatureVerifiedBy and VisuallyIdentifiedBy elements are required, but for an Internet sale a DigitalSignature is required instead?
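To picture the problem, consider a purely hypothetical instance fragment. Only Means, SignatureVerifiedBy, VisuallyIdentifiedBy and DigitalSignature come from the discussion above; the Sales and Sale element names, the Means value "Internet", and all content values are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Hypothetical instance: Sales, Sale and the value "Internet" are assumed. -->
<Sales>
  <!-- In-person sale carrying both verification elements. -->
  <Sale Means="In person">
    <SignatureVerifiedBy>clerk-17</SignatureVerifiedBy>
    <VisuallyIdentifiedBy>clerk-17</VisuallyIdentifiedBy>
  </Sale>
  <!-- Internet sale carrying a digital signature instead. -->
  <Sale Means="Internet">
    <DigitalSignature>placeholder</DigitalSignature>
  </Sale>
  <!-- In-person sale with no visual identification. A schema that makes
       every child optional accepts it, although the rule is broken. -->
  <Sale Means="In person">
    <SignatureVerifiedBy>clerk-17</SignatureVerifiedBy>
  </Sale>
</Sales>
```

Because each authentication element has to be optional for some value of Means, a schema of this shape has no way to reject the third record.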

XPath as Constraint Language -- Selecting What Shouldn't Exist

In order to develop a comprehensive architecture for XML document validation, it is clear that we will need more than W3C XML Schema is able to provide in the way of specifying content constraints. The need here is twofold: we need a language by which to define constraints and a mechanism by which to assert those constraints for a given XML document.

Generalizing from the examples in the previous section, we can see that our constraint language should allow us to express constraints of any scope -- up to document scope at least -- and any complexity. It should enable at least basic node selection by tag or attribute name, pattern recognition, existence tests and node counting, and simple numeric, string and boolean expressions for comparing values. XPath is clearly an excellent fit here. It is expression-based, allowing for arbitrarily complex constraints. It can do simple math and string manipulation and with little effort can perform some modestly complicated set arithmetic. Best of all, XPath expressions can evaluate to node sets, allowing for selection of all nodes that meet certain criteria.

The question, then, is what to select. It's intuitive to think in terms of selecting what's valid. Looking a short way ahead, though, it can be seen that validation is really a process of weeding out invalid data. So our aim should be to express constraints as assertions about unacceptable content patterns. The trick, in other words, is to select what shouldn't exist. For instance, the XPath expression Stereo[count (CDPlayer | Turntable | CassetteDeck | QuadraphonicDiscPlayer) = 0] would select any stereo with no sound sources, but would return an empty node set for a valid document. This is the form we want our assertions to take.

XSLT -- Transformation as Validation

XPath never stands alone; it was conceived as a useful common language for various purposes, including transformation, parsing, and even schema design. We need a way to apply this expression-based language to validation. XSLT gives us our solution by framing the process of validation as a transformation whose output will consist of error messages or will be empty.

The structure of the xsl:transform is quite simple:

There is one template that overrides the built-in XSLT template rules to suppress all automatic output, while still allowing every node in the document tree to be visited.
For each constraint, there is one template whose match attribute is the XPath expression that describes data that would be invalid against that constraint. Each such template will be instantiated once for each violation of the constraint and will produce output in the form of a warning or error message. (We'll use the text output method for these examples; note that a practical process would probably define an XML vocabulary for validation output or would leverage an existing one. See the full whitepaper for more on this.)

We'll now look at XSLT-based solutions to the two problems posed in the previous section. First, let's assert two constraints which are not addressed by the stereo schema:

If there is at least one quadraphonic sound source, then there must be at least four speakers.
The stereo must have at least one sound source.

We now introduce a second stage in the validation process: the application of a validating XSLT transform to the instance document. Following the strategy laid out above, the transform defines a template for each of the two constraints, producing the appropriate error message in each case:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>

  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Override the built-in rule so that text nodes produce no output. -->
  <xsl:template match="text ()" />

  <!-- Constraint 1: a quadraphonic source needs at least four speakers. -->
  <xsl:template match="Stereo[QuadraphonicDiscPlayer][count (Speaker) &lt; 4]" >
<xsl:text>ERROR:  Quadraphonic sound source without enough speakers.
</xsl:text>
  </xsl:template>

  <!-- Constraint 2: there must be at least one sound source. -->
  <xsl:template match="Stereo[count (CDPlayer | Turntable
                      | CassetteDeck | QuadraphonicDiscPlayer) = 0]">
<xsl:text>ERROR:  Stereo system must have at least one sound source.
</xsl:text>
  </xsl:template>

</xsl:transform>

Consider the instance document shown earlier, which should not be considered valid. It still validates against the schema but is now flunked by the validating transformation. Sample output for transforming the instance document using the above transform is

ERROR:  Quadraphonic sound source without enough speakers.

Now let's return to our weakly-typed transaction model. We define a validating transform to assert that, for instance, in-person transactions must be verified by checking the signature and visually identifying the other party based on a photo. Then the candidate document, one of whose sales records doesn't have both required verification steps, fails the validating transformation, producing

ERROR:  In-person sales must have verified signature and visual ID.
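A validating transform along these lines might look like the following sketch. It assumes that each sales record is a Sale element carrying Means as an attribute; the element name Sale, the Means value "Internet", and the wording of the second message are our own assumptions, not taken from the schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Suppress the built-in output of text nodes. -->
  <xsl:template match="text()" />

  <!-- Assumed names: Sale element, Means attribute. Selects in-person
       sales that lack either of the two required verification elements. -->
  <xsl:template match="Sale[@Means = 'In person']
                      [not(SignatureVerifiedBy and VisuallyIdentifiedBy)]">
<xsl:text>ERROR:  In-person sales must have verified signature and visual ID.
</xsl:text>
  </xsl:template>

  <!-- Assumed Means value "Internet": selects Internet sales
       that have no DigitalSignature child. -->
  <xsl:template match="Sale[@Means = 'Internet'][not(DigitalSignature)]">
<xsl:text>ERROR:  Internet sales must have a digital signature.
</xsl:text>
  </xsl:template>
</xsl:transform>
```

The two patterns are mutually exclusive because they test different values of Means, so at most one of them fires for any given record. Where several constraints could apply to the same node, it is safer to test them inside a single template, since XSLT instantiates only one template per node.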

Advantages and Limitations of XPath/XSLT

We've seen that XPath and XSLT can form a second line of defense against invalid data. The value of this second stage in the validation architecture will be judged by what it can do that W3C XML Schema cannot. Here's a short list of constraint patterns XPath can express well.

Structural analysis of the tree as a whole, as in the Stereo example
Weakly-typed designs, as in the Sales example
Finer control over use of subtypes -- say base types A and B are associated but subtype A2 should only see instances of B2, not B1 or B3, etc.
Single values based on numeric or string calculation -- a number that must be a multiple of three, a string that must list values in a certain order
Relationships between legal single values -- a checksum over a long list of values, or a rule limiting the total number of occurrences of a common token
Constraints that span multiple documents -- for instance a dynamic enumeration where the legal values are listed in a second document, and so cannot be hardcoded into a schema
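A few of the patterns listed above might be written as follows. This is a sketch only: the names PackSize, Batch, Item, Currency, the count attribute, and the file currencies.xml with its Currencies/Code structure are all hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Suppress the built-in output of text nodes. -->
  <xsl:template match="text()" />

  <!-- Calculated single value: PackSize must be a multiple of three.
       Non-numeric content evaluates to NaN and is flagged as well. -->
  <xsl:template match="PackSize[. mod 3 != 0]">
<xsl:text>ERROR:  Pack size must be a multiple of three.
</xsl:text>
  </xsl:template>

  <!-- Relationship between values: the declared count attribute must equal
       the number of Item children. A missing attribute is flagged too. -->
  <xsl:template match="Batch[not(@count = count(Item))]">
<xsl:text>ERROR:  Batch count does not match its number of items.
</xsl:text>
    <!-- Keep descending so constraints on descendants are still checked. -->
    <xsl:apply-templates />
  </xsl:template>

  <!-- Multiple documents: each Currency must appear in an external list.
       The relative URI is resolved against the stylesheet's location. -->
  <xsl:template
    match="Currency[not(. = document('currencies.xml')/Currencies/Code)]">
<xsl:text>ERROR:  Currency code is not in the list of legal values.
</xsl:text>
  </xsl:template>
</xsl:transform>
```

Note the xsl:apply-templates in the Batch template: a template that matches an element with children stops the default traversal at that element unless it asks for the traversal to continue.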

The third line of defense, if you will, is application code. Clearly, XPath and XSLT cannot do what this code can do; computational ability especially is limited. XPath has some math functions, and XSLT's flow-control constructs and variables can be used to perform simple calculations, such as a sum of products. This only scratches the surface of what a modern programming language can do. Still, XPath/XSLT will do whatever it can do in very few lines of simple code; we're only hoping that this stage can handle enough of the load to make its inclusion in the process worth the trouble. Code-level integration of XPath and XSLT offers great advantages, too, and may blur the line between the second and third stages as described here.
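As an illustration of the sum-of-products calculation mentioned above, the sketch below uses a recursive named template, the usual XSLT 1.0 idiom for such loops. The Order and Line elements and the total, price and quantity attributes are hypothetical, and whole-number values (prices in cents, say) are assumed so that floating-point rounding does not affect the comparison.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Suppress the built-in output of text nodes. -->
  <xsl:template match="text()" />

  <!-- Hypothetical constraint: an Order's total attribute must equal
       the sum of price * quantity over its Line children. -->
  <xsl:template match="Order">
    <xsl:variable name="computed">
      <xsl:call-template name="sum-of-products">
        <xsl:with-param name="lines" select="Line" />
      </xsl:call-template>
    </xsl:variable>
    <!-- A missing or non-numeric total gives NaN, which is flagged too. -->
    <xsl:if test="not(number(@total) = number($computed))">
<xsl:text>ERROR:  Order total does not match its line items.
</xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- Adds the first line's product to the running total, then recurses
       on the remaining lines; emits the total when none are left. -->
  <xsl:template name="sum-of-products">
    <xsl:param name="lines" />
    <xsl:param name="running" select="0" />
    <xsl:choose>
      <xsl:when test="$lines">
        <xsl:call-template name="sum-of-products">
          <xsl:with-param name="lines" select="$lines[position() > 1]" />
          <xsl:with-param name="running"
            select="$running + $lines[1]/@price * $lines[1]/@quantity" />
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$running" />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:transform>
```

The constraint is tested with xsl:if inside the template rather than in the match pattern, because XSLT 1.0 does not allow variable references in match patterns. A few lines of application code would do the same job more readably, which is exactly the trade-off described above.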

A frustration at the moment is that XPath has yet to catch up with XML Schema's datatypes. It would be nice, for instance, to use XPath to check that the flights in an itinerary are indeed sequential. XPath 1.0 doesn't have a date type, as XML Schema does, so this assertion would have to be effected using some fancy XPath/XSLT processing or be relegated to application code. The XPath 2.0 Requirements include enlargement of XPath's type model to include built-in XML Schema types.
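One form the "fancy processing" could take in XPath 1.0 is sketched below. It assumes hypothetical Flight elements that are siblings in itinerary order, each with a departs attribute holding a date in the form YYYY-MM-DD. Stripping the hyphens turns each date into a number that sorts in the same order as the date itself.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Suppress the built-in output of text nodes. -->
  <xsl:template match="text()" />

  <!-- Selects any Flight that departs before the Flight listed just ahead
       of it. For the first Flight there is no preceding sibling, the
       right-hand side is NaN, and the comparison is false. -->
  <xsl:template match="Flight[number(translate(@departs, '-', ''))
      &lt; number(translate(preceding-sibling::Flight[1]/@departs, '-', ''))]">
<xsl:text>ERROR:  Flight departs before the previous flight in the itinerary.
</xsl:text>
  </xsl:template>
</xsl:transform>
```

This handles dates only. Times of day, time zones, or other date formats would each need further string manipulation, which is where a real date type would pay off.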