An Introduction to Schematron
The Schematron schema language differs from most other XML schema languages in that it is a rule-based language that uses path expressions instead of grammars. This means that instead of creating a grammar for an XML document, a Schematron schema makes assertions applied to a specific context within the document. If the assertion fails, a diagnostic message that is supplied by the author of the schema can be displayed.
One advantage of a rule-based approach is that in many cases the
Schematron rules can be created simply by rephrasing the desired
constraint as written in plain English. For example, a simple content model can
be written like this: "The Person element should in the
XML instance document have an attribute Title and contain
the elements Name and Gender in that
order. If the value of the Title attribute is 'Mr' the
value of the Gender element must be 'Male'."
In this sentence the context in which the assertions should be applied is clearly stated as
the Person element while there are four different assertions:
- The context element (Person) should have an attribute Title
- The context element should contain two child elements, Name and Gender
- The child element Name should appear before the child element Gender
- If attribute Title has the value 'Mr' the element Gender must have the value 'Male'
In order to implement the path expressions used in the rules in Schematron, XPath is used with various extensions provided by XSLT. Since the path expressions are built on top of XPath and XSLT, it is also trivial to implement Schematron using XSLT, which is shown later in the section Schematron processing.
As already mentioned, Schematron makes various assertions based on a specific context in a document. The assertions and the context make up two of the four layers in Schematron's fixed four-layer hierarchy:
- phases (top-level)
- patterns
- rules (defines the context)
- assertions
Schematron hierarchy
This introduction covers only three of these layers (patterns, rules and assertions); these are the most important for using embedded Schematron rules in RELAX NG. For a full description of the Schematron schema language, see the Schematron specification.
The three layers covered in this section are constructed so that each assertion is grouped into rules and each rule defines a context. Each rule is then grouped into patterns, which are given a name that is displayed together with the error message (there is really more to patterns than just a grouping mechanism, but for this introduction this is sufficient).
The following XML document contains a very simple content model that helps explain the three layers in the hierarchy:
<Person Title="Mr">
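The listing above shows only the opening tag in this copy. As a rough sketch, an instance that satisfies the content model described earlier might look like the following; the text content of Name is a placeholder chosen for illustration and is not taken from the original listing.

```xml
<!-- Illustrative instance; the Name value is an assumed placeholder -->
<Person Title="Mr">
  <Name>John Doe</Name>
  <Gender>Male</Gender>
</Person>
```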
Assertions
The bottom layer in the hierarchy is the assertions, which are used
to specify the constraints that should be checked within a specific
context of the XML instance document. In a Schematron schema, the
typical element used to define assertions is assert. The
assert element has a test attribute, which
is an XPath expression that is evaluated as a boolean. In the preceding example, there were four assertions made
on the document in order to specify the content model, namely:
- The context element (Person) should have an attribute Title
- The context element should contain two child elements, Name and Gender
- The child element Name should appear before the child element Gender
- If attribute Title has the value 'Mr' the element Gender must have the value 'Male'
Written using Schematron assertions this would be expressed as
<assert test="@Title">The element Person must have a Title attribute.</assert>
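Only the first assertion survives in the listing above. The fragment below is a sketch of how all four assertions could be written, based on the descriptions in this section. The first and third messages are quoted from the text; the second and fourth messages, and the exact test expressions for the second and fourth assertions, are assumptions.

```xml
<!-- Fragment only: assert elements must appear inside a rule element -->

<!-- 1. The context element has a Title attribute -->
<assert test="@Title">The element Person must have a Title attribute.</assert>

<!-- 2. Exactly two children: one Name and one Gender (message wording assumed) -->
<assert test="count(*) = 2 and count(Name) = 1 and count(Gender) = 1">The element Person should have the child elements Name and Gender.</assert>

<!-- 3. Name comes first (expression as quoted in the text) -->
<assert test="*[1] = Name">The element Name must appear before element Gender.</assert>

<!-- 4. Title 'Mr' implies Gender 'Male' (expression and message assumed) -->
<assert test="(@Title = 'Mr' and Gender = 'Male') or @Title != 'Mr'">If the Title is "Mr" then the gender of the person must be "Male".</assert>
```

Note that in XPath 1.0 the expression `*[1] = Name` compares the string values of the two node-sets rather than the element names. A test such as `*[1][self::Name]` checks the name of the first child directly and may be the safer choice when Name and Gender could hold the same text.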
If you are familiar with XPath, these assertions are easy to
understand, but even for people with limited experience using XPath
they are rather straightforward. The first assertion simply tests for
the occurrence of an attribute Title. The second
assertion tests that the total number of children is equal to 2 and
that there is one Name element and one
Gender element. The third assertion tests that the first
child element is Name, and the last assertion tests that
if the person's title is 'Mr' the gender of the person must be
'Male'.
If the condition in the test attribute is not
fulfilled, the content of the assertion element is displayed to the
user. So, for example, if the third condition was broken (*[1] =
Name), the following message is displayed:
The element Name must appear before element Gender.
Each of these assertions has a condition that is evaluated, but the
assertion does not define where in the XML instance document this
condition should be checked. For example, the first assertion tests
for the occurrence of the attribute Title, but it is not
specified on which element in the XML instance document this assertion
is applied. The next layer in the hierarchy, the rules, specifies the
location of the contexts of assertions.
Rules
The rules in Schematron are declared by using the rule
element, which has a context attribute. The value of the
context attribute must be an XSLT pattern (a restricted form of XPath
expression) that is used to select one or more nodes in the
document. As the name suggests, the context attribute
is used to specify the context in the XML instance document where the
assertions should be applied. In the previous example the context was
specified to be the Person element, and a Schematron rule
with the Person element as context would simply be
<rule context="Person"></rule>
Since the rules are used to group together all the assertions that
share the same context, the rules are designed so that the assertions
are declared as children of the rule element. For the
previous example this means that the complete Schematron rule would
be
<rule context="Person">
This means that all the assertions in the rule will be tested on
every Person element in the XML instance document. If the
context is not all the Person elements, it is easy to
change the XPath location path to define a more restricted
context. The value Database/Person, for
example, sets the context to be all the Person elements
that have the element Database as their parent.
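As a small sketch of the restricted context just described, the rule below applies its assertion only to Person elements whose parent is a Database element. The Database element comes from the text; using the Title assertion here is simply an illustrative choice.

```xml
<!-- Fires only for Person elements that are children of Database -->
<rule context="Database/Person">
  <assert test="@Title">The element Person must have a Title attribute.</assert>
</rule>
```

A Person element that appears elsewhere in the instance document would not be checked by this rule.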
Patterns
The third layer in the Schematron hierarchy is the pattern,
declared using the pattern element, which is used to
group together different rules. The pattern element also
has a name attribute that will be displayed in the output
when the pattern is checked. For the preceding assertions, you could
have two patterns: one for checking the structure and another for
checking the co-occurrence constraint. Since patterns group together
different rules, Schematron is designed so that rules are declared as
children of the pattern element. This means that the
previous example, using the two patterns, would look like
<pattern name="Check structure">
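The listing above is cut off after its first line. One possible arrangement of the two patterns is sketched below. The name "Check structure" appears in the text; the second pattern name, the test expressions for the second and fourth assertions, and their messages are assumptions made for illustration.

```xml
<!-- Fragment only: pattern elements must appear inside a schema element -->
<pattern name="Check structure">
  <rule context="Person">
    <assert test="@Title">The element Person must have a Title attribute.</assert>
    <assert test="count(*) = 2 and count(Name) = 1 and count(Gender) = 1">The element Person should have the child elements Name and Gender.</assert>
    <assert test="*[1] = Name">The element Name must appear before element Gender.</assert>
  </rule>
</pattern>
<!-- The pattern name below is hypothetical -->
<pattern name="Check co-occurrence constraints">
  <rule context="Person">
    <assert test="(@Title = 'Mr' and Gender = 'Male') or @Title != 'Mr'">If the Title is "Mr" then the gender of the person must be "Male".</assert>
  </rule>
</pattern>
```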
The name of the pattern will always be displayed in the output,
regardless of whether the assertions fail or succeed. If the assertion
fails, the output will also contain the content of the assertion
element. However, there is also additional information displayed
together with the assertion text to help you locate the source of the
failed assertion. For example, if the co-occurrence constraint above
was violated by having Title='Mr' and
Gender='Female' then the following diagnostic would be
generated by Schematron:
From pattern "Check structure":
The pattern names are always displayed, while the assertion text is
only displayed when the assertion fails. The additional information
starts with an XPath expression that shows the location of the context
element in the instance document (in this case the first
Person element) and then on a new line the start tag of
the context element is displayed.
The assertion to test the co-occurrence constraint is not trivial,
and in fact this rule could be written in a simpler way by using an XPath predicate when
selecting the context. Instead of having the context set to all
Person elements, the co-occurrence constraint can be
simplified by only specifying the context to be all the
Person elements that have the attribute
Title='Mr'. If the rule was specified using this
technique the co-occurrence constraint could be described like
this
<rule context="Person[@Title='Mr']">
By moving some of the logic from the assertion to the specification of the context, the complexity of the rule has been decreased. This technique is often very useful when writing Schematron schemas.
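A sketch of the simplified rule follows. The context expression is the one shown in the text; the test expression and the message wording are assumptions consistent with the constraint as described.

```xml
<!-- The predicate restricts the context to Person elements titled 'Mr' -->
<rule context="Person[@Title='Mr']">
  <!-- Only the Gender value remains to be checked -->
  <assert test="Gender = 'Male'">If the Title is "Mr" then the gender of the person must be "Male".</assert>
</rule>
```

Because the predicate already filters on Title, the assertion no longer needs the `or @Title != 'Mr'` branch.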
This concludes the introduction of patterns; now all that is left
to do to complete the schema is to wrap the patterns in the Schematron
schema in a schema element, and to specify that all the
Schematron elements used should be defined in the Schematron
namespace, http://www.ascc.net/xml/schematron. The complete
Schematron schema for the example follows:
<?xml version="1.0" encoding="UTF-8"?>
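Only the XML declaration of the complete schema is visible above. The following is a minimal sketch of what such a schema could look like, assuming the namespace given in the text and the simplified co-occurrence rule. The second pattern name and the message wording that is not quoted in the text are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only; pattern names and some messages are assumed -->
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Check structure">
    <rule context="Person">
      <assert test="@Title">The element Person must have a Title attribute.</assert>
      <assert test="count(*) = 2 and count(Name) = 1 and count(Gender) = 1">The element Person should have the child elements Name and Gender.</assert>
      <assert test="*[1] = Name">The element Name must appear before element Gender.</assert>
    </rule>
  </pattern>
  <pattern name="Check co-occurrence constraints">
    <rule context="Person[@Title='Mr']">
      <assert test="Gender = 'Male'">If the Title is "Mr" then the gender of the person must be "Male".</assert>
    </rule>
  </pattern>
</schema>
```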
Namespaces and Schematron
Schematron can also be used to validate XML instance documents that
use namespaces. Each namespace used in the XML instance document
should be declared in the Schematron schema. The element used to
declare namespaces is the ns element, which should appear
as a child of the schema element. The ns
element has two attributes, uri and prefix,
which are used to define the namespace URI and the namespace
prefix. If the XML instance document in the example were defined in
the namespace http://www.topologi.com/example, the Schematron
schema would look like this:
<?xml version="1.0" encoding="UTF-8"?>
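The namespaced schema is also cut off in this copy. A sketch under the same assumptions as before is shown here, with an ns declaration that binds the prefix ex to the namespace URI named in the text. The uri and prefix attributes are the ones described above; the pattern names and unquoted messages remain hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <!-- Declares the prefix used in the context and test expressions below -->
  <ns uri="http://www.topologi.com/example" prefix="ex"/>
  <pattern name="Check structure">
    <rule context="ex:Person">
      <!-- Title is an unprefixed attribute, so it is in no namespace -->
      <assert test="@Title">The element Person must have a Title attribute.</assert>
      <assert test="count(*) = 2 and count(ex:Name) = 1 and count(ex:Gender) = 1">The element Person should have the child elements Name and Gender.</assert>
      <assert test="*[1] = ex:Name">The element Name must appear before element Gender.</assert>
    </rule>
  </pattern>
  <pattern name="Check co-occurrence constraints">
    <rule context="ex:Person[@Title='Mr']">
      <assert test="ex:Gender = 'Male'">If the Title is "Mr" then the gender of the person must be "Male".</assert>
    </rule>
  </pattern>
</schema>
```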
Note that all XPath expressions that test element values now
include the namespace prefix ex.
This Schematron schema would now validate the following instance:
<ex:Person Title="Mr" xmlns:ex="http://www.topologi.com/example">
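Since only the opening tag of the instance is shown, here is one way the full namespaced instance might look. The child element names follow the content model used throughout this section, and the Name value is an assumed placeholder.

```xml
<ex:Person Title="Mr" xmlns:ex="http://www.topologi.com/example">
  <!-- Child elements carry the same prefix; the Name value is illustrative -->
  <ex:Name>John Doe</ex:Name>
  <ex:Gender>Male</ex:Gender>
</ex:Person>
```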