XML.com: XML From the Inside Out

Validating XML with Schematron
by Chimezie Ogbuji

Structure of a Schematron Document

A Schematron XML document consists of a schema element in the Schematron namespace: http://www.ascc.net/xml/schematron. The schema element contains one or more pattern elements. Pattern elements allow the user to group schema constraints logically. Some examples of logical groupings are: Text Only Elements, Valid Root Element, Check for ID Attribute.

Pattern elements have a name attribute. They may also have a see attribute that refers to a URL for user documentation of the schema.
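
As a rough illustration of the structure just described, here is a minimal skeleton with two patterns. The pattern names come from the groupings listed above; the URL in the see attribute is a made-up placeholder, not a real documentation page.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <!-- Each pattern groups related constraints under a descriptive name. -->
  <!-- The see URL below is hypothetical and only shows where it goes. -->
  <pattern name="Valid Root Element"
           see="http://example.org/docs/root-element.html">
    <!-- rule elements go here -->
  </pattern>
  <pattern name="Check for ID Attribute">
    <!-- rule elements go here -->
  </pattern>
</schema>
```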

Rules

Rule elements define a collection of constraints on a particular context in a document instance (for example, on an element or collection of elements). This is very similar to XSLT templates, which are fired with respect to a node or group of nodes returned by an XPath expression. If we go back to the XSLT stylesheet we defined earlier:

<xsl:template match='shortStory'>

The match attribute causes the XSLT processor to evaluate the XPath expression shortStory and then instantiate the template relative to the shortStory element. The contents of a rule element operate within the context of the elements matched by its context attribute.

Rule elements may contain assert and report elements. Both elements are conditionally instantiated depending on the XPath evaluation of their test attribute. The only difference is that an assert element is instantiated if its XPath expression evaluates to false, while a report element is instantiated if its expression evaluates to true. (The general intent is that assert is used to detect errors, while report can be used to report affirmative qualities of an instance.)
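
To make the difference concrete, the following sketch puts an assert and a report with the same test side by side. The pattern name, the item context, and both messages are invented for illustration and are not taken from any published schema.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Item Titles">
    <rule context="item">
      <!-- Fires when the test is FALSE: the item has no title child. -->
      <assert test="title">
        An item element should contain a title element
      </assert>
      <!-- Fires when the test is TRUE: the item does have a title child. -->
      <report test="title">
        This item element has a title element
      </report>
    </rule>
  </pattern>
</schema>
```

For any given item element exactly one of the two messages is produced, because the shared test is either true or false.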

The assert/report mechanism is similar to the XSLT xsl:if element in our example stylesheet above, which also has a test attribute that determines if the contents of the xsl:if element are instantiated in the resulting XML tree.
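
The analogy can be sketched in plain XSLT. The stylesheet below is a hand-written approximation of what a rule on a hypothetical rss element with one assert and one report might correspond to; it is not the actual output of any Schematron stylesheet, and the messages are only examples.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Plays the role of a rule whose context is rss. -->
  <xsl:template match="rss">
    <!-- Like an assert with test="@version": message appears when the test fails. -->
    <xsl:if test="not(@version)">
      <xsl:text>An RSS version identifier should be supplied&#10;</xsl:text>
    </xsl:if>
    <!-- Like a report with test="@version != 0.91": message appears when the test holds. -->
    <xsl:if test="@version != 0.91">
      <xsl:text>This Schematron validator is for RSS 0.91 only&#10;</xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- Suppress the built-in rule that would copy text nodes to the output. -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

An assert is therefore an xsl:if on the negated test, while a report is an xsl:if on the test as written.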

Note that a node can only be the context of a single rule (the first matching rule the processor comes across) within a pattern. However, a node can be matched multiple times within different patterns. Thus pattern groupings are important. Every match of a context node can be considered a discrete constraint.
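
The following sketch shows how this first-match behavior might be put to use. The pattern names, contexts, and messages are assumptions made up for the example.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Title Placement">
    <!-- A title inside a channel matches this rule first. -->
    <rule context="channel/title">
      <assert test="string-length(.) &gt; 0">
        A channel title should not be empty
      </assert>
    </rule>
    <!-- Within this pattern, channel titles never reach this rule,
         so it only needs to deal with the remaining title elements. -->
    <rule context="title">
      <assert test="parent::item or parent::image or parent::textinput">
        A title outside a channel must belong to an item, image or textinput
      </assert>
    </rule>
  </pattern>
  <pattern name="Text Only Elements">
    <!-- A separate pattern: every title, including channel titles,
         is matched again here. -->
    <rule context="title">
      <report test="child::*">A title cannot contain sub-elements</report>
    </rule>
  </pattern>
</schema>
```

Ordering rules from most specific to most general within a pattern keeps the general rule from shadowing the specific one.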

The assert and report elements allow authors of Schematron schemas to provide functional, human-readable feedback about invalid XML instances. This user-defined feedback is one of the features that distinguishes Schematron's approach to schema declaration from that of other schema languages.

Finally, assert and report elements may contain a name element, which substitutes the name of an element into the output stream. The name element has an optional path attribute, an XPath expression that selects the node whose tag name will be inserted in place of the name element. If the path attribute isn't specified, the name of the current context node is used instead. The name element is often used to identify the tag name of an offending element within the validation message.
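
A small sketch of both forms of the name element follows. The rule and the message wording are illustrative assumptions; only the name element and its path attribute are taken from the description above.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Text Only Elements">
    <rule context="title|description|link">
      <report test="child::*">
        The <name/> element inside <name path="parent::*"/> cannot
        contain sub-elements
      </report>
    </rule>
  </pattern>
</schema>
```

For a title element with markup inside a channel, the message would read along the lines of "The title element inside channel cannot contain sub-elements": the bare name element yields the context node's name, and the one with a path yields the parent's name.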

Powered by XPath

The power of Schematron lies in its use of XPath expressions. They allow XML instances to be queried with powerful patterns, making it possible to validate constraints that DTDs cannot declare. Let's consider selected portions of the "Structural Validation" pattern from the downloadable RSS Schematron schema.

<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Structural Validation">
    <rule context="rss">
      <assert test="@version">
        An RSS version identifier should be supplied
      </assert>

Here the rule context is the rss element. The assert element tests for the existence of a version attribute with the @version XPath expression. If the matched rss element doesn't have a version attribute, the contents of the assert element are instantiated: that is, the text message is created in the output of the stylesheet to alert the user that a version identifier is required.

<report test="@version != 0.91">
  This Schematron validator is for RSS 0.91 only
</report>

This is an example of a report element whose content is instantiated only if the test expression evaluates to true. In this case, the Schematron is checking for a version number other than 0.91.
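
One detail worth keeping in mind is that comparing against the bare number 0.91 is a numeric comparison in XPath 1.0. The sketch below contrasts it with a string comparison; the pattern name, the second report, and its message are hypothetical additions for illustration.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Version Check">
    <rule context="rss">
      <!-- Numeric comparison: version="0.910" converts to the number 0.91,
           so this report stays silent for it. -->
      <report test="@version != 0.91">
        This Schematron validator is for RSS 0.91 only
      </report>
      <!-- String comparison: fires for any value other than the exact
           string 0.91, including 0.910. -->
      <report test="@version != '0.91'">
        The version attribute should be written exactly as 0.91
      </report>
    </rule>
  </pattern>
</schema>
```

Neither report fires when the version attribute is missing, because a comparison against an empty node-set is false in XPath 1.0. A separate existence check on @version is still needed to catch that case.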

<assert test="count(channel) = 1">
  An RSS element can only contain a single channel element
</assert>

Here we have a more complex constraint. It tests whether the context node (the rss element in this case) has exactly one channel child element. The test expression uses the XPath count function, one of the many XPath functions available to a Schematron schema.
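
If separate messages for the "missing" and "too many" cases are wanted, the single count test could be split in two. This is only a sketch of one possible design, with an invented pattern name and invented messages.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Channel Count">
    <rule context="rss">
      <!-- Fires when there is no channel child at all. -->
      <assert test="channel">
        An RSS element must contain a channel element
      </assert>
      <!-- Fires when there are two or more channel children. -->
      <report test="count(channel) &gt; 1">
        An RSS element cannot contain more than one channel element
      </report>
    </rule>
  </pattern>
</schema>
```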

<rule context="title|description|link">
  <assert test="parent::channel or parent::image 
       or parent::item or parent::textinput">
    A <name/> element can only be contained within a
    channel, image, item or textinput element.
  </assert>
  <report test="child::*">
    A <name/> element cannot contain sub-elements,
    remove any additional markup
  </report>
</rule>

This rule element's context node in the example above is either a title, description, or link element. The assert element checks that the context node's parent is either a channel, an image, an item, or a textinput. It uses the parent axis specifier for the check.

The report element ensures that neither the title, description, nor the link element contains a child element. It uses the child axis specifier.
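
To see what this rule would catch, consider the following made-up instance document. All element content and the example.org URL are placeholders.

```xml
<?xml version="1.0"?>
<rss version="0.91">
  <!-- Trips the assert: this title's parent is rss, not channel,
       image, item or textinput. -->
  <title>Misplaced title</title>
  <channel>
    <!-- Passes both checks: allowed parent, text-only content. -->
    <title>Example Channel</title>
    <link>http://example.org/</link>
    <!-- Trips the report: the description contains a child element. -->
    <description>News with <b>bold</b> markup</description>
  </channel>
</rss>
```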

<rule context="image">
 ...
  <assert test="count(width) = count(height)">
    Width and Height elements should be balanced
  </assert>
</rule>

This is another example of the count function used as a constraint: an image element must contain as many width elements as height elements. Once again, this is a constraint that a DTD could not express for validation.

<rule context="width">
  <assert test="preceding::height or following::height">
    A width should be accompanied by a height
  </assert>
</rule>

Finally, this rule shows a co-occurrence check. The assert element uses the preceding and following XPath axes to test that, wherever a width element occurs, a height element occurs somewhere before or after it in the document. Once again Schematron relies on XPath, this time on its axes, for its schema constraints.
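
Because the preceding and following axes range over the whole document, a height element in a different image would also satisfy the test. A stricter variant, shown here as a sketch with an invented pattern name and message, restricts the check to siblings of the width element.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Image Dimensions">
    <rule context="image/width">
      <!-- Sibling axes only look at elements that share the same parent. -->
      <assert test="preceding-sibling::height or following-sibling::height">
        A width should be accompanied by a height in the same image element
      </assert>
    </rule>
  </pattern>
</schema>
```

Whether the looser or the stricter form is appropriate depends on how much of the document the constraint is meant to cover.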

Putting a Schematron Schema into Action

After a Schematron schema is defined, a Schematron XSLT stylesheet is used to transform the schema to a validating stylesheet. This stylesheet can then be applied to XML instances for validation purposes. There are several such Schematron stylesheets, each of which provides special functionality. You can find these stylesheets on the Schematron web site.

The simplest is Schematron-basics, which generates a validating stylesheet that returns only the text output of the Schematron schema (the text of its assert and report elements). As the name suggests, this is the most basic of the Schematron stylesheets.

The schematron-message stylesheet generates validating stylesheets that can be used with an XSLT processor that knows how to handle xsl:message elements and send them to the standard output. This stylesheet is mainly used in conjunction with interactive editors such as Emacs and XED to validate an XML instance as it is being edited.

There are also the schematron-report and schematron-pretty stylesheets. These generate validating stylesheets that produce HTML-formatted messages. The schematron-report stylesheet produces output in a two-frame frameset. The top frame contains hyperlinked error messages organized by pattern. The bottom frame displays the offending XML source fragments corresponding to the selected error message. This stylesheet provides a helpful way to review validation errors interactively, and it's particularly useful when the XML instance source is too large to browse comfortably on its own.

Finally, there is schematron-xml, which generates validation messages in XML. The elements have a location attribute containing an XPath expression that evaluates to the offending element. This Schematron stylesheet allows users to plug Schematron validation into their existing XML application logic.

In addition to the RSS Schematron example, there are several widely used XML schemas written in Schematron. One is the schema behind Dan Connolly's Web Content Accessibility Checking Service, which checks web pages against the Web Content Accessibility Guidelines using the WAI example Schematron, downloadable from Rick Jelliffe's Schematron web page.