Datatype Checking With XSLT 2.0

October 1, 2003

Bob DuCharme

Much of the added complexity of XSLT 2.0 as compared with 1.0 comes from the former's support for the W3C's XML Schema data typing system. As James Mason described in a recent Report from Extreme Markup Languages 2003, Jeni Tennison's presentation on complications introduced into XSLT 2.0 by the requirements for strong typing left many wondering "If Jeni's having problems, what hope is there for the rest of us?"

There's actually some good, even if ironic, news about data typing support in XSLT 2.0: if you're still using DTDs, and you're putting off a move to any schema format, you can use XSLT 2.0 stylesheets to add datatype checking to your system, further postponing a move to schemas.

XPath 2.0's castable keyword lets you check whether a given expression can be cast, or converted, to a particular datatype. The cast expression performs the actual conversion, but even without it you can use castable to create a simple stylesheet that checks whether elements and attributes in a document conform to particular datatypes. Projects such as DSDL and XPipe demonstrate a trend toward breaking up the tasks formerly done by monolithic parser/validators into individual, specific processes that can be configured into customized combinations. An easy way to check datatypes, and nothing else, can be very useful in a toolbox of such processes.

A castable expression returns a Boolean true or false depending on whether the expression provided can be cast or not. For example, "'3' castable as xs:integer" would return true, but "'three' castable as xs:integer" or "'3.14' castable as xs:integer" would return false.
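
To try these expressions yourself, here is a minimal sketch that evaluates each one and writes the result as text. It assumes an XSLT 2.0 processor and can be run against any input document, because its only template rule matches the document root and ignores the content.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- A string with a valid integer lexical form. -->
    <xsl:value-of select="'3' castable as xs:integer"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Not a number at all. -->
    <xsl:value-of select="'three' castable as xs:integer"/>
    <xsl:text>&#10;</xsl:text>
    <!-- A decimal string is not a valid integer string. -->
    <xsl:value-of select="'3.14' castable as xs:integer"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>

The three lines of output should read true, false, and false, in that order.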

Don't forget to declare a namespace to go with the "xs" prefix used with the type name. Saxon 7, the only current implementation of the XSLT 2.0 Working Draft, supports both the http://www.w3.org/2001/XMLSchema and the http://www.w3.org/2001/XMLSchema-datatypes namespaces described in the XML Schema Part 2: Datatypes Recommendation. The latter is more or less a subset of the former, according to Part 3.1 of the spec. I say "more or less" because of concerns expressed by Michael Kay, author of Saxon and editor of the XSLT 2.0 spec, about the relationship between the two namespaces. He recommends that you use types from the http://www.w3.org/2001/XMLSchema namespace.

Let's look at an example. I like to use the following document to test anything that does type checking. It has attributes and elements with data that should conform to specific primitive types, and the second of its two order elements has bad data for a type checker to flag.

 <typeTest>

  <order shipDate="2003-10-12">
    <itemNum>a123</itemNum>
    <price>12.99</price>
    <quantity>1</quantity>
    <shipped>true</shipped>
  </order>  

  <order shipDate="2003-02-30">
    <itemNum>b342</itemNum>
    <price>green</price>
    <quantity>3.14</quantity>
    <shipped>yo yo yo</shipped>
  </order>  
    
</typeTest>

Each order element has a shipDate attribute that should have an ISO 8601 date value, a price child element that should have a decimal value, a quantity element that should have an integer value, and a shipped element that should have a boolean value. The example's first order element has good values for all of these. The second has bad values; a type-checking process should catch and identify those errors.

The following XSLT 2.0 stylesheet does a pretty good job of this:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     version="2.0">

  <xsl:output method="text"/>

  <!-- Don't output text nodes, or the values of attributes that have
       no checking rule of their own. We just want error messages. -->
  <xsl:template match="text()|@*"/>

  <!-- Make sure that element attributes get processed. -->
  <xsl:template match="*">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:template>

  <!-- Type checking templates: -->

  <xsl:template match="quantity">
    <xsl:if test="not(string(.) castable as xs:integer)">
      <xsl:text>
Following quantity value not an integer: </xsl:text>
      <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="shipped">
    <xsl:if test="not(string(.) castable as xs:boolean)">
      <xsl:text>
Following shipped value not a boolean: </xsl:text>
      <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="price">
    <xsl:if test="not(string(.) castable as xs:decimal)">
      <xsl:text>
Following price value not a decimal number: </xsl:text>
      <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>

  <xsl:template match="@shipDate">
    <xsl:if test="not(string(.) castable as xs:date)">
      <xsl:text>
Following shipDate value not a date: </xsl:text>
      <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Before looking at how it works, let's look at what it does to the sample document with the two order elements using Saxon 7.6.5:

Following shipDate value not a date: 2003-02-30
Following price value not a decimal number: green
Following quantity value not an integer: 3.14
Following shipped value not a boolean: yo yo yo

The first error is a particularly good catch, because while "2003" is a legal year value, "02" is a legal month value, and "30" is a legal day value, they're illegal together, because February never has thirty days.

The stylesheet starts with two setup template rules. Because I wanted to write something that would be easy to plug into a pipelined process, it outputs nothing if it finds no errors, which is why the result shown above starts with the message about the bad date value in the second order element. Because of the stylesheet's diagnostic purpose, its first template rule tells the XSLT processor not to output any text nodes from the source tree. The second ensures that all attributes get processed because the XSLT built-in template rule for elements does not tell the XSLT processor to apply templates to attributes.

The remainder of the stylesheet is one template rule for each attribute or element type whose datatype we want to check. Each template rule has a single xsl:if instruction that uses the castable expression in its test condition. If the checked value can be cast to the specified type, then nothing gets output. Otherwise it outputs a message naming the expected type, followed by the value that can't be cast to it. (The use of the string() function to force the tested value to a string before checking its castable status is there to accommodate a quirk of Saxon 7.6.5; it will be unnecessary in future Saxon 7 releases, as the implementation conforms more closely to the XSLT 2.0 Working Draft.) The template rules' simple match patterns such as "quantity" and "shipped" will check any elements at all with those names, regardless of their context in the input document, but you could change this easily enough. For example, changing "quantity" to "order/quantity" would check only the quantity children of order elements, leaving other quantity elements alone.
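
A minimal sketch of that narrower check might look like the following. It is a cut-down variant of the stylesheet above, not a replacement for it: it keeps only the rule that suppresses text nodes and a single checking rule, and it relies on the built-in template rule for elements to walk down the tree.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     version="2.0">

  <xsl:output method="text"/>

  <!-- Don't output text nodes. We just want error messages. -->
  <xsl:template match="text()"/>

  <!-- Only check quantity elements that are children of order
       elements. Any other quantity elements go unchecked. -->
  <xsl:template match="order/quantity">
    <xsl:if test="not(string(.) castable as xs:integer)">
      <xsl:text>
Following quantity value not an integer: </xsl:text>
      <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Run against the sample document, this should report only the 3.14 value in the second order element.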

Where do the datatype names in the castable expressions come from? Which ones can you use? Many of the data typing topics that Jeni Tennison covered in her talk in Montreal concerned the typing implications of schema-aware XSLT processors. (By "schema," here, the W3C XSL Working Group meant "W3C Schema".) For our purposes, though, we don't care about schemas. We're using castable and the datatype names in an xsl:if element's test attribute, which stores an XPath expression, so we're more concerned with XPath 2.0's typing capabilities and limitations than XSLT's. The most recent XPath 2.0 Working Draft tells us that we can use the built-in datatypes from the namespace http://www.w3.org/2001/XMLSchema. As I mentioned above, Saxon 7 also supports the http://www.w3.org/2001/XMLSchema-datatypes namespace, but Michael Kay recommends that you use the http://www.w3.org/2001/XMLSchema ones in your stylesheets.
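
The same pattern should extend to other built-in types. As a hedged sketch, the rules below check two hypothetical pieces of data that are not part of the sample document: a received attribute expected to hold an xs:dateTime value and a weight element expected to hold an xs:double value. Both names are invented for illustration, and the sketch assumes an XSLT 2.0 processor that supports these two primitive types in castable expressions.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     version="2.0">

  <xsl:output method="text"/>

  <!-- Suppress text nodes and any attributes that have no
       checking rule of their own. -->
  <xsl:template match="text()|@*"/>

  <!-- Make sure that element attributes get processed. -->
  <xsl:template match="*">
    <xsl:apply-templates select="@*|node()"/>
  </xsl:template>

  <!-- Hypothetical attribute; a good value would look like
       2003-10-01T09:30:00 -->
  <xsl:template match="@received">
    <xsl:if test="not(string(.) castable as xs:dateTime)">
      <xsl:text>
Following received value not a dateTime: </xsl:text>
      <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>

  <!-- Hypothetical element; a good value would look like
       2.5 or 1.5e3 -->
  <xsl:template match="weight">
    <xsl:if test="not(string(.) castable as xs:double)">
      <xsl:text>
Following weight value not a double: </xsl:text>
      <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

The more specific @received pattern takes precedence over the catch-all text()|@* rule because of XSLT's default priority rules, so only attributes without a rule of their own are silently skipped.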

Schematron and XSLT 2.0

    

In an XML.com article titled Filling in the DTD Gaps with Schematron, I described how Schematron lets you describe potential problems to check for in an XML document, and that all you need to check an XML document against a Schematron schema is an XSLT processor. I also said that one of the few things I found missing from Schematron was the ability to check datatypes. While playing with the stylesheet shown above, I realized that an XSLT 2.0 processor should let you write Schematron rules that specify the datatypes of certain elements and attributes and then check documents against those rules. There was a bit of hand tweaking to do along the way to accommodate the settling process currently underway with both XSLT 2.0 and Schematron, but I did get it to work, and it gives a very positive view of Schematron's future.

The following Schematron schema does essentially the same thing as the stylesheet shown above. Each rule has an assert statement, which is Schematron's way of testing for something that should be true. If the boolean condition in its test attribute is not true, the message in the assert element's content gets output. This schema also uses Schematron's diagnostics feature to show the value that wasn't castable, which makes it easier to trace the error.

<schema xmlns="http://www.ascc.net/xml/schematron"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <pattern name="Check Types">

    <rule context="quantity">
      <assert test="string(.) castable as xs:integer"
              diagnostics="d1">
      All quantity values should be integers.
      </assert>
    </rule>

    <rule context="price">
      <assert test="string(.) castable as xs:decimal"
              diagnostics="d1">
      All price values should be decimal numbers.
      </assert>
    </rule>

    <rule context="shipped">
      <assert test="string(.) castable as xs:boolean"
              diagnostics="d1">
      All shipped values should be boolean.
      </assert>
    </rule>

    <!-- Can't use attribute value as rule 
         context, at least not in Schematron 1.5 -->
    <rule context="order">
      <assert test="string(@shipDate) castable as xs:date"
              diagnostics="date">
      All shipDate values should be valid dates.
      </assert>
    </rule>

  </pattern>


  <diagnostics>

    <diagnostic id="d1">
    Not true for "<value-of select="."/>".
    </diagnostic>

    <diagnostic id="date">
    Not true for "<value-of select="@shipDate"/>".
    </diagnostic>

  </diagnostics>

</schema>

For details of why it didn't work right away and the issues involved in making it do so, you can read the schematron love-in thread (1, 2, 3, 4) on the issue. It's enough to say that Rick Jelliffe, the man behind Schematron, is keeping a close watch on XSLT 2.0 and looking forward to the new capabilities that it can bring to Schematron, and when XSLT and XPath 2.0 become Recommendations, Schematron will be ready for them.

When asked about potential motivations for moving past DTDs to schemas, most people put data typing high on their list, but many are still intimidated by the accompanying baggage of making a complete conversion. It's nice to know that XSLT 2.0 stylesheets will let you add datatype checking to your system without changing the system itself, with no commitment to buying into all of the XSD/XSLT 2 interactions that Jeni warned of in Montreal.