XML.com: XML From the Inside Out

Validating XML with Schematron

November 22, 2000

Schematron is an XML schema language that can be used to validate XML documents. In this article I show how, assuming the reader is at least familiar with XML 1.0, DTDs, XSLT, and XPath.

The Need for Schemas

XML schemas are necessary for communicating the structure of an XML document type to a machine. For example, consider two XML fragments.


<vehicle name='Harley Davidson' type='motorcycle'>
    <wheel name='Front Tire'/>
    <wheel name='Rear Tire'/>
    <HeadLight name='Front Lamp' />
    <kickstand/>
</vehicle>

<vehicle name='Mitsubishi 3000 GT' type='car'>
    <wheel name='Front Right Tire'/>
    <wheel name='Front Left Tire'/>
    <wheel name='Rear Right Tire'/>
    <wheel name='Rear Left Tire'/>
    <HeadLight name='Front Right lamp'/>
    <HeadLight name='Front Left lamp'/>
    <SunRoof/>
</vehicle>

A person can easily interpret and understand both XML instances from the words used to describe their components. A person can also verify whether the documents adhere to a set of conventions about how vehicle elements should be used. For example, a person can tell that this XML instance is invalid:

<vehicle name='Harley Davidson' type='motorcycle'>
    <wheel name='Front Tire'/> 
    <SunRoof/>
</vehicle>

We know that a motorcycle typically has two wheels and doesn't have a sunroof. A piece of program logic, however, needs an XML schema against which it can validate XML instances.
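
To make that gap concrete, here is a minimal sketch of how those two human judgments could be written down as machine-checkable tests. It uses an XSLT 1.0 stylesheet only because this article assumes familiarity with XSLT. It also hardens "typically has two wheels" into "exactly two wheels", and the message wording is invented for the example.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the two rules below (two wheels, no sunroof) restate the
     conventions from the text; they are assumptions, not a published schema. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <!-- Check every vehicle whose type attribute is 'motorcycle'. -->
    <xsl:template match="vehicle[@type='motorcycle']">
        <xsl:if test="count(wheel) != 2">
            <xsl:text>Invalid: a motorcycle must have exactly two wheels.&#10;</xsl:text>
        </xsl:if>
        <xsl:if test="SunRoof">
            <xsl:text>Invalid: a motorcycle must not have a sunroof.&#10;</xsl:text>
        </xsl:if>
    </xsl:template>

    <!-- Suppress the built-in rule that would copy text nodes to the output. -->
    <xsl:template match="text()"/>
</xsl:stylesheet>
```

Applied to the invalid fragment above, both tests fail and both messages are written. Even this small check has to be written and maintained by hand, which is the work a schema language is meant to take over.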

XML validation is a crucial part of predictable and efficient processing of XML instances. Knowing the structure of an XML document saves the developer from writing unnecessary conditional application logic. Once a document is identified as belonging to a class of documents, many assumptions about its structure can be made.

Document Type Definitions (DTDs)

DTDs were the first standard mechanism for XML validation, and for all practical purposes still are. They define the roles and structure of XML elements. DTDs are written in a syntax other than XML's and rely upon post-processing for validation. For simple XML schemas, DTDs are sufficient. However, DTDs are a step behind the direction in which XML technologies are evolving: they don't support namespaces, and they use a non-XML syntax.

The most serious problem with DTDs is that they do not support namespaces, a critical flaw since namespaces are a very powerful aspect of XML. The inability to validate DTD-declared XML documents with namespaces prevents XML application developers from taking advantage of namespaces in their business logic.

Most XML technologies (RDF, XSLT, and XLink) and schema languages (RELAX, XML Schema, SOX) are represented as XML. This uniformity helps make these technologies easy to learn, and it means developers are able to leverage existing XML tools. This places DTDs at a disadvantage because developers must learn an additional syntax in order to define their XML schemas--but DTDs also have more severe restrictions.

DTDs are somewhat limited in their range of expression; therefore, they cannot be used to validate some XML document structures. Consider the following XML:

<TennisMatch tournament='US Open'> 
 <Competition type='Doubles' gender='Female'> 
   <Player name='Venus Williams'/> 
   <Player name='Serena Williams'/> ....  
   <Player name='Martina Hingis'/> 
   <Player name='Lindsay Davenport'/>
 </Competition> 
</TennisMatch>

A DTD cannot declare that a Competition element may contain only an even number of Player elements. As a second example, consider the following XML:

<shortStory author='AUTHOR1'>
    <character name='CHARACTER1'/>
    <character name='CHARACTER2'/>
</shortStory>

<anthology author='AUTHOR1'>
    <shortStory>
        <character name='CHARACTER1'/>
        <character name='CHARACTER2'/>
    </shortStory>
</anthology>

If one constraint on such a document is that a shortStory element may carry an author attribute only if it isn't the child of an anthology element, it wouldn't be possible to represent that constraint in a DTD.

These DTD handicaps aren't going unnoticed, and the W3C is presently developing an XML Schema language (currently a W3C Candidate Recommendation) that is more expressive and powerful than DTDs. The XML Schema language is an XML application and will likely become the standard way XML schemas are formally declared. However, we should take note of the REgular LAnguage description for XML (RELAX), an alternative XML schema language, developed by Murata Makoto, which has been submitted to the International Organization for Standardization (ISO) as a technical report. RELAX has been covered in previous XML.com articles. Until (and after) XML Schema is adopted as the standard for schema definitions, there are alternatives such as RELAX and Schematron. I've found Schematron to be the most promising of these.

Introducing Schematron

Schematron, created by Rick Jelliffe, defines a set of rules and checks that are applied to an XML instance. Schematron takes a unique approach to schemas in that it focuses on validating document instances instead of declaring a schema (as the other schema languages do).

Schematron relies almost entirely on XPath query patterns for defining these rules and checks. With just a subset of XPath, powerful XSLT stylesheets can be created to process very complex XML instances.

Before digging into Schematron, I'll demonstrate how XSLT can easily be used to validate XML instances. Let's go back to our previous example.

<shortStory author='AUTHOR1'> 
  <character name='CHARACTER1'/> 
  <character name='CHARACTER2'/>
</shortStory>

<anthology author='AUTHOR1'>
    <shortStory>
        <character name='CHARACTER1'/>
        <character name='CHARACTER2'/>
    </shortStory>
</anthology>

A template can be created that returns "Invalid XML" if a shortStory element has an author attribute when it's contained in an anthology element.

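<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match='shortStory'>
        <!-- parent::anthology is true only when this shortStory sits directly inside an anthology -->
        <xsl:if test='parent::anthology and @author'>
            Invalid XML
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>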
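
The same technique covers the tennis example from earlier, which a DTD could not express. The following is a sketch rather than a definitive implementation: it assumes Player elements are direct children of Competition, and the message text is made up for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <!-- Flag any Competition whose Player children cannot be paired up. -->
    <xsl:template match="Competition">
        <xsl:if test="count(Player) mod 2 != 0">
            <xsl:text>Invalid XML: odd number of Player elements&#10;</xsl:text>
        </xsl:if>
    </xsl:template>

    <!-- Suppress the built-in rule that would copy text nodes to the output. -->
    <xsl:template match="text()"/>
</xsl:stylesheet>
```

The whole constraint lives in the single XPath expression `count(Player) mod 2 != 0`. The rest of the stylesheet is scaffolding, and the four-player sample passes without producing any output.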

You can imagine other combinations of templates that validate more complex XML structures. This is essentially how Schematron works. It takes a Schematron schema definition (itself written in XML) that describes the constraints. A Schematron XSLT stylesheet converts that schema into a second stylesheet, and transforming an instance document with this second stylesheet performs the validation of that instance.
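
As a rough sketch of what such a schema definition can look like, here is the shortStory constraint written as a Schematron schema. It assumes the Schematron 1.5 namespace in use around the time of this article; the title, pattern name, and message wording are invented for the example.

```xml
<?xml version="1.0"?>
<!-- Sketch in Schematron 1.5 syntax; the title, pattern name, and message
     wording are hypothetical and chosen only for illustration. -->
<schema xmlns="http://www.ascc.net/xml/schematron">
    <title>Short story rules</title>
    <pattern name="Author placement">
        <!-- The rule fires for every shortStory directly inside an anthology. -->
        <rule context="anthology/shortStory">
            <!-- A report emits its message when the test evaluates to true. -->
            <report test="@author">A shortStory inside an anthology must not carry an author attribute.</report>
        </rule>
    </pattern>
</schema>
```

The context attribute plays the role of the template's match pattern, and the test attribute plays the role of the xsl:if test, so the schema states only the constraint and leaves the stylesheet scaffolding to the Schematron tooling.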
