Validating XML with Schematron
by Chimezie Ogbuji
Structure of a Schematron Document
A Schematron XML document consists of a schema element in the Schematron namespace, http://www.ascc.net/xml/schematron. The schema element contains one or more pattern elements, which let the user group schema constraints logically. Examples of such groupings are: Text Only Elements, Valid Root Element, Check for ID Attribute.
Pattern elements have a name attribute. They may also have a see attribute that refers to a URL for user documentation of the schema.
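As a minimal sketch of this structure, the skeleton below uses two of the grouping names mentioned above. The rules inside the patterns, the item element, and the see URL are placeholders chosen for illustration; they are not taken from any published schema.

```xml
<?xml version="1.0"?>
<!-- The root element lives in the Schematron namespace -->
<schema xmlns="http://www.ascc.net/xml/schematron">

  <!-- Each pattern groups related constraints under a readable name;
       the optional see attribute points at user documentation
       (placeholder URL) -->
  <pattern name="Valid Root Element"
           see="http://example.org/docs/root-element.html">
    <rule context="/*">
      <assert test="self::rss">The root element should be rss</assert>
    </rule>
  </pattern>

  <!-- A second, independent grouping (item and id are hypothetical) -->
  <pattern name="Check for ID Attribute">
    <rule context="item">
      <assert test="@id">An item should carry an id attribute</assert>
    </rule>
  </pattern>

</schema>
```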
Rules
Rule elements define a collection of constraints on a particular context in a document instance (for example, on an element or collection of elements). This is very similar to XSLT templates, which are fired with respect to a node or group of nodes returned by an XPath expression. If we go back to the XSLT stylesheet we defined earlier:
<xsl:template match='shortStory'>
The match attribute causes the XSLT processor to evaluate
the XPath expression shortStory and then instantiate the
template relative to the shortStory element. The contents
of a rule element operate within the context of the elements matched
by its context attribute.
Rule elements may contain assert and report elements. Both are conditionally instantiated depending on how the XPath expression in their test attribute evaluates. The only difference is that an assert element is instantiated if the expression evaluates to false, while a report element is instantiated if it evaluates to true. (The general intent is that assert is used to detect errors, while report can be used to report affirmative qualities of an instance.)
The assert/report mechanism is similar to the XSLT xsl:if element in our example stylesheet above, which also has a test attribute that determines if the contents of the xsl:if element are instantiated in the resulting XML tree.
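To make the analogy concrete, here is a rough sketch of how a rule holding one assert and one report could be written by hand as an XSLT template. It illustrates the idea only and is not the exact output of any Schematron stylesheet; the tests and messages are borrowed from the RSS example later in this article.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- A rule with context="rss" corresponds to a template matching rss -->
  <xsl:template match="rss">
    <!-- assert: the message is emitted when the test is false,
         so the test is wrapped in not() -->
    <xsl:if test="not(@version)">
      <xsl:text>An RSS version identifier should be supplied&#10;</xsl:text>
    </xsl:if>
    <!-- report: the message is emitted when the test is true -->
    <xsl:if test="@version != 0.91">
      <xsl:text>This Schematron validator is for RSS 0.91 only&#10;</xsl:text>
    </xsl:if>
    <!-- Keep walking the tree so templates for other contexts can fire -->
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Override the built-in rule that would copy text nodes to the output -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```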
Note that a node can only be the context of a single rule (the first matching rule the processor comes across) within a pattern. However, a node can be matched multiple times within different patterns. Thus pattern groupings are important. Every match of a context node can be considered a discrete constraint.
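A small sketch of this behavior, using made-up pattern names and messages: within the first pattern, a title inside an item is claimed by the first rule, so the more general title rule never sees it. The second pattern matches the same node again, independently.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">

  <pattern name="First Match Wins">
    <!-- A title whose parent is an item is claimed by this rule... -->
    <rule context="item/title">
      <assert test="normalize-space(.)">
        An item title should not be empty
      </assert>
    </rule>
    <!-- ...so within this pattern the rule below only applies to
         title elements that are not children of an item -->
    <rule context="title">
      <report test="child::*">
        A title element cannot contain sub-elements
      </report>
    </rule>
  </pattern>

  <!-- A separate pattern gets its own chance at every title,
       including those already matched above -->
  <pattern name="No Markup in Titles">
    <rule context="title">
      <report test="child::*">
        A title element cannot contain sub-elements
      </report>
    </rule>
  </pattern>

</schema>
```

If a check must apply to every matching node regardless of more specific rules, placing it in its own pattern, as in the second pattern here, is one way to arrange that.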
The assert and report elements allow authors of Schematron schemas to provide functional, human-readable feedback about invalid XML instances. This user-defined feedback is a large part of what distinguishes Schematron's approach to schema declaration from that of other schema languages.
Finally, assert and report elements may contain a name element, which substitutes the name of an element into the output stream. The name element has an optional path attribute, an XPath expression that selects the node whose tag name is inserted in place of the name element. If the path attribute isn't specified, the name of the current context node is used instead. The name element is often used to identify the tag name of an offending element within the validation message.
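The following sketch shows both forms of the name element. The pattern name and message wording are invented for illustration; the path value ".." is ordinary XPath for the parent of the context node.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Name Substitution Example">
    <rule context="title|description|link">
      <!-- <name/> without a path: replaced by the name of the context
           node, i.e. title, description or link -->
      <report test="child::*">
        A <name/> element cannot contain sub-elements
      </report>
      <!-- <name path=".."/>: replaced by the name of the node the path
           selects, here the parent of the context node -->
      <assert test="parent::channel or parent::image
                    or parent::item or parent::textinput">
        A <name/> element was found inside a <name path=".."/> element
      </assert>
    </rule>
  </pattern>
</schema>
```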
Powered by XPath
The power of Schematron lies in its use of XPath expressions. They allow XML instances to be queried with expressive patterns, so a schema can validate constraints that a DTD cannot declare. Let's consider selected portions of the "Structural Validation" pattern inside the RSS Schematron (which can be downloaded).
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
<pattern name="Structural Validation">
<rule context="rss">
<assert test="@version">
An RSS version identifier should be supplied
</assert>
Here the rule context is the rss element. The assert
element tests for the existence of a version attribute with the
@version XPath expression. If the matched rss
element doesn't have a version attribute, the contents of the
assert element are instantiated: that is, the text message is
created in the output of the stylesheet to alert the user that a
version identifier is required.
<report test="@version != 0.91">
This Schematron validator is for RSS 0.91 only
</report>
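This is an example of a report element, whose content is instantiated only if the test expression evaluates to true. In this case, the Schematron is checking for a version number other than 0.91. If the version attribute is missing altogether, the comparison is false and the report stays silent; that case is covered by the preceding assert.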
<assert test="count(channel) = 1">
An RSS element can only contain a single channel element
</assert>
Here we have a more complex constraint. It tests whether the
context node (the rss element in this case) has exactly one
channel element. The test expression uses the XPath
count function, one of the many XPath functions
available to a Schematron schema.
<rule context="title|description|link">
  <assert test="parent::channel or parent::image
                or parent::item or parent::textinput">
    A <name/> element can only be contained with a
    channel, image, item or textinput element.
  </assert>
  <report test="child::*">
    A <name/> element cannot contain sub-elements,
    remove any additional markup
  </report>
</rule>
This rule element's context node in the example above is
either a title, description, or link element. The
assert element checks that the context node's parent is either
a channel, an image, an item, or a
textinput. It uses the parent axis specifier for
the check.
The report element ensures that neither the title,
description, nor the link element contains a child element.
It uses the child axis specifier.
<rule context="image">
  ...
  <assert test="count(width) = count(height)">
    Width and Height elements should be balanced
  </assert>
</rule>
This is another example of the count function being used to
express a constraint: an image must contain as many width elements
as height elements. It is also another situation where a DTD
could not express the constraint for validation.
<rule context="width">
  <assert test="preceding::height or following::height">
    A width should be accompanied by a height
  </assert>
</rule>
Finally, this rule shows a constraint that reaches beyond the
context node's own children. The assert element uses the
preceding and following XPath axis specifiers to test whether, if a
width element occurs, there is an accompanying
height element earlier or later in the document. Once again
Schematron leverages XPath, this time its axes, for its schema constraints.
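Pulling the excerpts together, a cut-down but well-formed version of the pattern might look like the sketch below. It contains only the rules quoted in this article. The checks elided from the image rule are left out rather than guessed at, so this is not the full RSS Schematron.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Structural Validation">

    <rule context="rss">
      <assert test="@version">
        An RSS version identifier should be supplied
      </assert>
      <report test="@version != 0.91">
        This Schematron validator is for RSS 0.91 only
      </report>
      <assert test="count(channel) = 1">
        An RSS element can only contain a single channel element
      </assert>
    </rule>

    <rule context="title|description|link">
      <assert test="parent::channel or parent::image
                    or parent::item or parent::textinput">
        A <name/> element can only be contained with a
        channel, image, item or textinput element.
      </assert>
      <report test="child::*">
        A <name/> element cannot contain sub-elements,
        remove any additional markup
      </report>
    </rule>

    <rule context="image">
      <!-- other checks from the original rule are omitted here -->
      <assert test="count(width) = count(height)">
        Width and Height elements should be balanced
      </assert>
    </rule>

    <rule context="width">
      <assert test="preceding::height or following::height">
        A width should be accompanied by a height
      </assert>
    </rule>

  </pattern>
</schema>
```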
Putting a Schematron Schema into Action
After a Schematron schema is defined, a Schematron XSLT stylesheet is used to transform the schema to a validating stylesheet. This stylesheet can then be applied to XML instances for validation purposes. There are several such Schematron stylesheets, each of which provides special functionality. You can find these stylesheets on the Schematron web site.
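As a concrete input for the second step, consider this small, deliberately flawed RSS instance. It is made up for illustration and its URLs are placeholders. Judged only against the rules quoted in this article, it should draw a message for each of the problems marked in the comments; the full RSS Schematron may of course report more.

```xml
<?xml version="1.0"?>
<!-- No version attribute: the assert on @version fails -->
<rss>
  <channel>
    <!-- Markup inside title: the report on child::* fires -->
    <title>Example <b>Feed</b></title>
    <link>http://example.org/</link>
    <description>A made-up channel</description>
    <image>
      <title>Logo</title>
      <url>http://example.org/logo.png</url>
      <link>http://example.org/</link>
      <!-- A width without any height: count(width) = count(height)
           fails on image, and the preceding/following check fails
           on width -->
      <width>88</width>
    </image>
  </channel>
</rss>
```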
First there is schematron-basic, which generates a stylesheet that simply returns the text output of the Schematron (the text of the assert and report elements). As the name suggests, this is the most basic of the Schematron stylesheets.
The schematron-message stylesheet generates validating
stylesheets that can be used with an XSLT processor that knows how to
handle xsl:message elements and send them to standard
output. This stylesheet is mainly used in conjunction with
interactive editors such as Emacs and XED to validate an XML instance
as it is being edited.
There are also schematron-report and schematron-pretty stylesheets. These generate validating stylesheets that produce HTML-formatted messages. The schematron-report stylesheet produces output in a two-frame frameset. The top frame contains hyperlinked error messages organized by pattern. The bottom frame displays the offending XML source fragments corresponding to the selected error message. This stylesheet provides a helpful way to interactively review validation errors in an XML instance, and it's particularly useful when the XML instance source is large enough to be a burden to browse separately.
Finally, there is schematron-xml, which generates validation messages in XML. The elements have a location attribute containing an XPath expression that evaluates to the offending element. This Schematron stylesheet allows users to plug Schematron validation into their existing XML application logic.
Besides the RSS Schematron example, several widely used XML schemas are written in Schematron. One is the schema in Dan Connolly's Web Content Accessibility Checking Service, which checks web pages against the Web Content Accessibility Guidelines using the WAI example Schematron, downloadable from Rick Jelliffe's Schematron web page.