Combining RELAX NG and Schematron
by Eddie Robertsson
Control over mixed text content
One of WXS's major advantages over previous schema languages is the ability to specify an extensive selection of datatypes, not only for attributes but also for elements with text content. In RELAX NG it is possible to use all the datatypes from WXS by specifying WXS as the datatype library. Unfortunately, this ability to control the text content of an element disappears if the element is defined to have mixed content (child elements mixed with text content). With the help of embedded Schematron rules it is possible to apply basic text validation even to mixed content elements.
An example of this is source XML data that is to be transformed into high-quality PDF documents. A very simple paragraph in the final document can be represented in XML like this:
<p>This is <b>ok</b> but this is<b> not</b> ok</p>
In this case it is very important where the space characters around
the b elements are situated. If the space character is
situated inside the b element then the bold font will
make the space character bigger than what it is supposed to be. For
this reason it is important that the text content inside
the b element does not start or end with a space
character. For the same reason the text preceding the b
element should always end with a space character and the text
following the b element should always start with a space
character. In the above example the spaces around the
first b element are correctly placed, while those around
the second b element are not.
The RELAX NG schema for the above example is very simple:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="p">
      <mixed>
        <zeroOrMore>
          <element name="b">
            <text/>
          </element>
        </zeroOrMore>
      </mixed>
    </element>
  </start>
</grammar>
The Schematron rules that are needed to check the extra constraints on the text content can be implemented like this:
<sch:pattern name="Check spaces around b tags">
  <sch:rule
      context="p/node()[following-sibling::b][preceding-sibling::b][1]">
    <sch:assert test="substring(., string-length(.)) = ' '">
      A space must be present before the b tag.
    </sch:assert>
    <sch:assert test="starts-with(., ' ')">
      A space must be present after the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/node()[following-sibling::b][1]">
    <sch:assert test="substring(., string-length(.)) = ' '">
      A space must be present before the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/node()[preceding-sibling::b][1]">
    <sch:assert test="starts-with(., ' ')">
      A space must be present after the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/b">
    <sch:assert test="not(starts-with(., ' '))">
      The text in the b tag cannot start with a space.
    </sch:assert>
    <sch:assert test="substring(., string-length(.)) != ' '">
      The text in the b tag cannot end with a space.
    </sch:assert>
  </sch:rule>
</sch:pattern>
The Schematron rules that check this constraint are divided into four parts (each part is one rule with a separate context), which are explained in the order they are declared:
- For all child nodes of the p element where the nearest preceding sibling and nearest following sibling is a b element, check that a space character is present immediately after the preceding b element and that a space character is present immediately before the following b element.
- For all child nodes of the p element where the nearest following sibling is a b element, check that a space character is present immediately before the b element.
- For all child nodes of the p element where the nearest preceding sibling is a b element, check that a space character is present immediately after the b element.
- For all child b elements, check that the text content does not begin or end with a space character.
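The "nearest sibling" wording in this list can also be written out literally in XPath. The following is a minimal sketch, not the author's listing: it assumes that the text nodes of p are the nodes of interest, and it names the adjacent node explicitly with preceding-sibling::node()[1] and following-sibling::node()[1] instead of relying on a trailing [1] predicate. The pattern name is made up for illustration.

```xml
<sch:pattern name="Check spaces around b tags (nearest-sibling sketch)"
    xmlns:sch="http://www.ascc.net/xml/schematron">
  <!-- Text directly between two b elements: needs a space at both ends -->
  <sch:rule context="p/text()[preceding-sibling::node()[1][self::b]]
                             [following-sibling::node()[1][self::b]]">
    <sch:assert test="starts-with(., ' ')">
      A space must be present after the b tag.
    </sch:assert>
    <sch:assert test="substring(., string-length(.)) = ' '">
      A space must be present before the b tag.
    </sch:assert>
  </sch:rule>
  <!-- Text directly before a b element: must end with a space -->
  <sch:rule context="p/text()[following-sibling::node()[1][self::b]]">
    <sch:assert test="substring(., string-length(.)) = ' '">
      A space must be present before the b tag.
    </sch:assert>
  </sch:rule>
  <!-- Text directly after a b element: must start with a space -->
  <sch:rule context="p/text()[preceding-sibling::node()[1][self::b]]">
    <sch:assert test="starts-with(., ' ')">
      A space must be present after the b tag.
    </sch:assert>
  </sch:rule>
  <!-- The b elements themselves: no leading or trailing space -->
  <sch:rule context="p/b">
    <sch:assert test="not(starts-with(., ' '))">
      The text in the b tag cannot start with a space.
    </sch:assert>
    <sch:assert test="substring(., string-length(.)) != ' '">
      The text in the b tag cannot end with a space.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

Because a node is only handled by the first rule in a pattern whose context matches it, the rule for text between two b elements has to come first. Written this way, text after the last b element and paragraphs with more than two b elements are treated the same as the simple case. Applied to the sample paragraph, this sketch would be expected to flag the text " but this is" (no space before the second b tag) and the second b element (its text starts with a space). Two b elements with no text at all between them are not caught, since there is no text node to test.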
The complete RELAX NG schema with embedded Schematron rules looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
    xmlns:sch="http://www.ascc.net/xml/schematron">
  <start>
    <element name="p">
      <sch:pattern name="Check spaces around b tags">
        <sch:rule
            context="p/node()[following-sibling::b][preceding-sibling::b][1]">
          <sch:assert
              test="substring(., string-length(.)) = ' '">
            A space must be present before the b tag.
          </sch:assert>
          <sch:assert
              test="starts-with(., ' ')">
            A space must be present after the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/node()[following-sibling::b][1]">
          <sch:assert
              test="substring(., string-length(.)) = ' '">
            A space must be present before the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/node()[preceding-sibling::b][1]">
          <sch:assert test="starts-with(., ' ')">
            A space must be present after the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/b">
          <sch:assert test="not(starts-with(., ' '))">
            The text in the b tag cannot start with a space.
          </sch:assert>
          <sch:assert
              test="substring(., string-length(.)) != ' '">
            The text in the b tag cannot end with a space.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <mixed>
        <zeroOrMore>
          <element name="b">
            <text/>
          </element>
        </zeroOrMore>
      </mixed>
    </element>
  </start>
</grammar>
This is of course a very simple example in which you only check for
space characters. In a more advanced example you would also need to
check for other whitespace characters (like tabs), and to account for
the fact that the last b element should not be followed by a space if
the immediately following character is a punctuation character.
However, the example still gives you an idea of the things you can do
with Schematron and mixed content.
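As a rough illustration of the first of these extensions, the rule for the b element could be widened from the space character to any XML whitespace. This is only a sketch under stated assumptions: the pattern name is invented, an empty b element is deliberately let through, and the punctuation case is not handled. It relies on the XPath 1.0 normalize-space() function, which strips spaces, tabs, carriage returns and line feeds.

```xml
<sch:pattern name="Check whitespace inside b tags (sketch)"
    xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule context="p/b">
    <!-- normalize-space() of a single character is the empty string
         exactly when that character is a space, tab, CR or LF -->
    <sch:assert
        test="string-length(.) = 0 or
              normalize-space(substring(., 1, 1)) != ''">
      The text in the b tag cannot start with whitespace.
    </sch:assert>
    <sch:assert
        test="string-length(.) = 0 or
              normalize-space(substring(., string-length(.))) != ''">
      The text in the b tag cannot end with whitespace.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

The string-length(.) = 0 guard is a design choice for this sketch: without it an empty b element would fail both assertions, because the one-character substring would itself be empty.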
Embedded Schematron using namespaces
Since Schematron is namespace-aware, as is RELAX NG, it is no
problem to embed Schematron rules in a RELAX NG schema that defines one
or more namespaces for the document. The preceding section showed
how Schematron schemas should be set up to use namespaces by
using the ns element. For embedded Schematron rules, this
works exactly the same way. Instead of embedding only the Schematron
rule that defines the extra constraint, you also need to embed
the ns elements that define the namespaces used. The example
from Namespaces and Schematron is reused here, but now RELAX NG
defines the structure while Schematron checks the co-occurrence
constraint. The instance example used was:
<ex:Person Title="Mr" xmlns:ex="http://www.topologi.com/example">
  <ex:Name>Eddie</ex:Name>
  <ex:Gender>Male</ex:Gender>
</ex:Person>
A RELAX NG schema for the above would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
    ns="http://www.topologi.com/example">
  <start>
    <element name="Person">
      <element name="Name"><text/></element>
      <element name="Gender">
        <choice>
          <value>Male</value>
          <value>Female</value>
        </choice>
      </element>
      <attribute name="Title"/>
    </element>
  </start>
</grammar>
The Schematron rule that needs to be embedded to check the
co-occurrence constraint (if title is "Mr" then the value of
element Gender must be "Male") will look like this (note
the use of the ex prefix):
<sch:pattern name="Check co-occurrence constraint">
  <sch:rule context="ex:Person[@Title='Mr']">
    <sch:assert test="ex:Gender = 'Male'">
      If the Title is "Mr" then the gender of the person must be "Male".
    </sch:assert>
  </sch:rule>
</sch:pattern>
If this rule were embedded on its own the Schematron validation
would fail because the prefix ex is not mapped to a
namespace URI. In order for this to work, the ns element
that defines this mapping must also be embedded:
<sch:ns prefix="ex"
    uri="http://www.topologi.com/example"
    xmlns:sch="http://www.ascc.net/xml/schematron"/>
I always insert these Schematron namespace mappings at the start of the host schema. This means that they are always declared in the same place and it is easy to see which mappings are included without having to search through the entire schema. The complete RELAX NG schema with the embedded rules would then look like this:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
    ns="http://www.topologi.com/example"
    xmlns:sch="http://www.ascc.net/xml/schematron">
  <!-- Include all the Schematron namespace mappings at the top -->
  <sch:ns prefix="ex" uri="http://www.topologi.com/example"/>
  <start>
    <element name="Person">
      <sch:pattern name="Check co-occurrence constraint">
        <sch:rule context="ex:Person[@Title='Mr']">
          <sch:assert test="ex:Gender = 'Male'">
            If the Title is "Mr" then the gender of the person must be "Male".
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <element name="Name"><text/></element>
      <element name="Gender">
        <choice>
          <value>Male</value>
          <value>Female</value>
        </choice>
      </element>
      <attribute name="Title"/>
    </element>
  </start>
</grammar>
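To see how the work is divided between the two languages, consider a hypothetical instance (not part of the original example) that differs only in the Gender value. It satisfies the RELAX NG grammar, since "Female" is one of the two allowed values, but the embedded assertion would be expected to report it once the Schematron rules are actually applied, because Title is "Mr" while ex:Gender is not "Male":

```xml
<ex:Person Title="Mr" xmlns:ex="http://www.topologi.com/example">
  <ex:Name>Eddie</ex:Name>
  <!-- Allowed by the RELAX NG choice, rejected by the Schematron assert -->
  <ex:Gender>Female</ex:Gender>
</ex:Person>
```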
Processing
Since embedded Schematron rules are not part of the RELAX NG specification, most RELAX NG processors will not recognize or enforce the validation constraints expressed by the rules. In fact, the embedded Schematron rules will be completely ignored by the processor since they are declared in a different namespace than RELAX NG's. This means that in order to use the Schematron rules for validation, this functionality must be added. Currently there are two options for how this can be achieved:
The embedded rules are extracted from the RELAX NG schema and concatenated into a Schematron schema. This schema can then be used for normal Schematron validation of the XML instance document. Since both RELAX NG and Schematron use XML syntax, it is fairly easy to perform this extraction using XSLT. This technique will be described in detail in the following section.
The RELAX NG processor can be modified to allow embedded Schematron-like rules and perform the validation as part of the normal RELAX NG validation. This technique is used in Sun's MSV which has an add-on that will validate XML instance documents against RELAX NG schemas annotated with rules and assertions. However, the way the rules are embedded in the RELAX NG schema is slightly different if this option is used compared to the method described in this chapter. Some of these differences include:
- The rules can only be embedded within a RELAX NG element
- The context for each rule or assertion is determined by the element where they are declared in the RELAX NG schema
More information and details about this are provided in the documentation included in the download of the MSV add-on.
It should be noted that the rules and assertions specified using this method do not really have anything to do with Schematron, other than that they use the same names for the elements.
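The first option, extracting the embedded rules into a standalone Schematron schema, can be sketched in a few lines of XSLT 1.0. This is a minimal illustration, not the stylesheet the article goes on to describe. It assumes that only sch:ns and sch:pattern elements are embedded, and that it is acceptable to collect them from anywhere in the host schema in document order.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sch="http://www.ascc.net/xml/schematron">

  <xsl:output method="xml" indent="yes"/>

  <!-- Build one Schematron schema from the whole RELAX NG document -->
  <xsl:template match="/">
    <sch:schema>
      <!-- Namespace mappings first, so the prefixes used in the
           rule contexts and tests are bound -->
      <xsl:copy-of select="//sch:ns"/>
      <!-- Then every embedded pattern, wherever it was declared -->
      <xsl:copy-of select="//sch:pattern"/>
    </sch:schema>
  </xsl:template>

</xsl:stylesheet>
```

Running a stylesheet like this over the Person schema above would be expected to yield a sch:schema element containing the one sch:ns mapping followed by the co-occurrence pattern, which can then be fed to an ordinary Schematron validator. Since xsl:copy-of also copies the in-scope namespace nodes, the output may carry the RELAX NG namespace declaration on the copied elements; this is harmless for validation.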