Combining RELAX NG and Schematron

February 11, 2004

Eddie Robertsson

Embedding Schematron Rules in RELAX NG

This article explains how to integrate two powerful XML schema languages, RELAX NG and Schematron. Embedding Schematron rules in RELAX NG is simple because a RELAX NG validator ignores all elements that are not in the RELAX NG namespace (http://relaxng.org/ns/structure/1.0). Schematron rules can therefore be embedded in any element and at any level of a RELAX NG schema.

Here is a very simple RELAX NG schema that only defines one element, Root:

<?xml version="1.0" encoding="UTF-8"?>
<element name="Root" xmlns="http://relaxng.org/ns/structure/1.0">
   <text/>
</element>

Now if a Schematron rule should have the Root element as its context, this rule could be added as an embedded Schematron rule within the element element that defines the pattern for Root:

<?xml version="1.0" encoding="UTF-8"?>
<element name="Root" xmlns="http://relaxng.org/ns/structure/1.0">
   <sch:pattern name="Test constraints on the Root element"
                xmlns:sch="http://www.ascc.net/xml/schematron">
     <sch:rule context="Root">
       <sch:assert test="test-condition">Error message when
         the assertion condition is broken...</sch:assert>
     </sch:rule>
   </sch:pattern>
   <text/>
</element>

The Schematron rules embedded in a RELAX NG schema are inserted on the pattern level and must be declared in the Schematron namespace (http://www.ascc.net/xml/schematron).
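Because the embedded rules live in their own namespace, a tool can separate them from the RELAX NG markup by namespace alone. The following is a minimal sketch in Python, using only the standard library, of how the embedded patterns might be collected into a standalone Schematron schema. The function name extract_schematron is hypothetical and is not part of any RELAX NG or Schematron toolkit.

```python
import xml.etree.ElementTree as ET

SCH_NS = "http://www.ascc.net/xml/schematron"

# The embedded example from above, without the XML declaration.
SCHEMA = """\
<element name="Root" xmlns="http://relaxng.org/ns/structure/1.0">
  <sch:pattern name="Test constraints on the Root element"
               xmlns:sch="http://www.ascc.net/xml/schematron">
    <sch:rule context="Root">
      <sch:assert test="test-condition">Error message</sch:assert>
    </sch:rule>
  </sch:pattern>
  <text/>
</element>
"""


def extract_schematron(schema_xml):
    """Collect every sch:pattern in the schema under a new sch:schema root.

    Hypothetical helper: patterns are found by namespace only, at any depth.
    """
    root = ET.fromstring(schema_xml)
    patterns = list(root.iter("{%s}pattern" % SCH_NS))
    sch_schema = ET.Element("{%s}schema" % SCH_NS)
    sch_schema.extend(patterns)
    return sch_schema


standalone = extract_schematron(SCHEMA)
for pattern in standalone:
    print(pattern.get("name"))
```

A RELAX NG validator does the opposite: it skips exactly the elements this sketch collects.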

Co-occurrence constraints

Although RELAX NG has better support for co-occurrence constraints than WXS, there are still many types of co-occurrence constraints that cannot be sufficiently defined. An example of such a co-occurrence constraint is when the relationship between two (or more) element/attribute values is expressed as a mathematical expression.

As an example, we use a schema that defines a very simple international purchase order. This purchase order specifies the following:

  • The date of the order

  • An address to which the purchased products will be delivered

  • The items being purchased, each including an id, a name, a quantity, and a price with currency information

  • Payment details including type of payment and total amount payable with currency information

Here is an example of an XML representation of such a purchase order:

<?xml version="1.0" encoding="UTF-8"?>
<purchaseOrder date="2002-10-22">
  <deliveryDetails>
    <name>John Doe</name>
    <address>123 Morgue Street, Death Valley</address>
    <phone>+61 2 9546 4146</phone>
  </deliveryDetails>
  <items>
    <item id="123-XY">
      <productName>Coffin</productName>
      <quantity>1</quantity>
      <price currency="AUD">2300</price>
      <totalAmount currency="AUD">2300</totalAmount>
    </item>
    <item id="112-AA">
      <productName>Shovel</productName>
      <quantity>2</quantity>
      <price currency="AUD">75</price>
      <totalAmount currency="AUD">150</totalAmount>
    </item>
  </items>
  <payment type="Prepaid">
    <amount currency="AUD">2450</amount>
  </payment>
</purchaseOrder>	

A real-life purchase order would be much more complex, but this example is sufficient for the purposes of this article. A RELAX NG schema for the purchase order could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <ref name="purchaseOrder"/>
  </start>
  <define name="purchaseOrder">
    <element name="purchaseOrder">
      <attribute name="date">
        <data type="date"/>
      </attribute>
      <ref name="deliveryDetails"/>
      <element name="items">
        <oneOrMore>
          <ref name="item"/>
        </oneOrMore>
      </element>
      <ref name="payment"/>
    </element>
  </define>
  <define name="deliveryDetails">
    <element name="deliveryDetails">
      <element name="name"><text/></element>
      <element name="address"><text/></element>
      <element name="phone"><text/></element>
    </element>
  </define>
  <define name="item">
    <element name="item">
      <attribute name="id">
        <data type="string">
          <param name="pattern">\d{3}-[A-Z]{2}</param>
        </data>
      </attribute>
      <element name="productName"><text/></element>
      <element name="quantity">
        <data type="int"/>
      </element>
      <element name="price">
        <ref name="currency"/>
      </element>
      <element name="totalAmount">
        <ref name="currency"/>
      </element>
    </element>
  </define>
  <define name="payment">
    <element name="payment">
      <attribute name="type">
        <choice>
          <value>Prepaid</value>
          <value>OnArrival</value>
        </choice>
      </attribute>
      <element name="amount">
        <ref name="currency"/>
      </element>
     </element>
  </define>
  <define name="currency">
    <attribute name="currency">
      <choice>
        <value>AUD</value>
        <value>USD</value>
        <value>SEK</value>
      </choice>
    </attribute>
    <data type="int"/>
  </define>
</grammar>	

This RELAX NG schema makes sure that all the required elements and attributes are present, and that some of these have the correct datatype. For example, all price information must have an integer value; the id of an item must be three digits, followed by a hyphen, followed by two uppercase letters; and the currency value must be one of AUD, USD or SEK. However, in a real world scenario it is more likely that you need to check more than the structure and the datatypes to make sure the purchase order is valid.
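To make the datatype constraints concrete, here is a rough Python equivalent of the id pattern and the currency choice. It is only an illustration: the names valid_item_id and valid_currency are made up for this sketch, and Python's re module is not identical to the XML Schema regular expression dialect, although the two agree on this simple pattern.

```python
import re

# XML Schema pattern facets must match the whole value, so fullmatch()
# is the closest Python analogue.
ITEM_ID = re.compile(r"\d{3}-[A-Z]{2}")
CURRENCIES = {"AUD", "USD", "SEK"}


def valid_item_id(value):
    """Three digits, a hyphen, then two uppercase letters."""
    return ITEM_ID.fullmatch(value) is not None


def valid_currency(value):
    """One of the three values listed in the schema's choice."""
    return value in CURRENCIES


print(valid_item_id("123-XY"), valid_item_id("1234-XY"), valid_currency("EUR"))
```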

For the purchase order, the following constraints cannot be checked by RELAX NG, but they would all be very useful for complete validation of the data:

  1. Each item specifies quantity, price and the totalAmount for that item. To make sure that the data is valid, the value of the totalAmount element must be equal to quantity * price.

  2. Both the price element and the totalAmount element specify a currency. For the data to be valid, the price and totalAmount elements must have the same currency value.

  3. The payment section of the purchase order specifies an amount element whose value must equal the sum of all the items' totalAmount values.

  4. Every item's currency value must equal the currency value of the amount element in the payment section.
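Before turning to a schema language, it may help to see the four constraints written as ordinary code. The sketch below checks them in Python with the standard library on a trimmed copy of the sample order (delivery details and product names are left out). The function name check_order and its messages are illustrative only.

```python
import xml.etree.ElementTree as ET

ORDER = """\
<purchaseOrder date="2002-10-22">
  <items>
    <item id="123-XY">
      <quantity>1</quantity>
      <price currency="AUD">2300</price>
      <totalAmount currency="AUD">2300</totalAmount>
    </item>
    <item id="112-AA">
      <quantity>2</quantity>
      <price currency="AUD">75</price>
      <totalAmount currency="AUD">150</totalAmount>
    </item>
  </items>
  <payment type="Prepaid">
    <amount currency="AUD">2450</amount>
  </payment>
</purchaseOrder>
"""


def check_order(order_xml):
    """Return one message per violated constraint (empty list if valid)."""
    order = ET.fromstring(order_xml)
    items = order.findall("items/item")
    amount = order.find("payment/amount")
    errors = []
    for item in items:
        item_id = item.get("id")
        quantity = int(item.findtext("quantity"))
        price = item.find("price")
        total = item.find("totalAmount")
        # Constraint 1: totalAmount = quantity * price
        if quantity * int(price.text) != int(total.text):
            errors.append("%s: totalAmount is not quantity * price" % item_id)
        # Constraint 2: price and totalAmount use the same currency
        if price.get("currency") != total.get("currency"):
            errors.append("%s: price and totalAmount currencies differ" % item_id)
        # Constraint 4: item currency equals the payment currency
        if total.get("currency") != amount.get("currency"):
            errors.append("%s: currency differs from the payment" % item_id)
    # Constraint 3: payment amount = sum of all item totals
    if sum(int(i.findtext("totalAmount")) for i in items) != int(amount.text):
        errors.append("payment amount is not the sum of the item totals")
    return errors


print(check_order(ORDER))
```

Hand-written checks like this work, but they live outside the schema and must be maintained separately, which is the gap the embedded rules are meant to close.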

Schematron can easily check all of these constraints, and the context definition in the language provides a logical grouping of the constraints. The first two constraints apply to each item element in the purchase order, and hence this element is the context. Here is an example of how you can specify the Schematron rule needed to express these constraints:

<sch:pattern name="Check that the pricing and currency of an item is correct."
  xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule context="purchaseOrder/items/item">
    <sch:assert test="number(price) * number(quantity) = number(totalAmount)">
      The total amount for the item doesn't add up to (quantity * price).
    </sch:assert>
    <sch:assert test="price/@currency = totalAmount/@currency">
      The currency in price doesn't match the currency in totalAmount.
    </sch:assert>
  </sch:rule>
</sch:pattern>

The Schematron rule specifies its context as all item elements with a parent items element and a grandparent purchaseOrder element. For each item element that matches this criterion, the first assertion checks that the value of the price child element multiplied by the value of the quantity child element matches the value of the totalAmount child element. The second assertion makes sure that the currency value of the price child element matches the currency value of the totalAmount child element.

The last two constraints both apply to the amount element in the payment section, so this element is the context for the Schematron rule that checks them. Here is an example of how this rule can be specified:

<sch:pattern name="Check that the total amount is correct and that the currencies match" 
xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule context="purchaseOrder/payment/amount">
    <sch:assert 
    test="number(.) = sum(/purchaseOrder/items/item/totalAmount)">
      The total purchase amount doesn't match the cost of all items.
    </sch:assert>
    <sch:assert 
    test = "not(/purchaseOrder/items/item/totalAmount/@currency != @currency)">
      The currency in at least one of the items doesn't match the 
      currency for the total amount.
    </sch:assert>
  </sch:rule>
</sch:pattern>	

The first assertion uses XPath's sum() function to check that the sum of all the item elements' totalAmount values is equal to the value of the context node (the amount element). The second assertion makes sure that every item's currency value matches the currency value of the amount element. Note that the following (similar) assertion does not perform the same check:

<sch:assert test = "/purchaseOrder/items/item/totalAmount/@currency = @currency"
    >...</sch:assert>			

This assertion checks only that at least one of the items' currency values matches the currency in the amount element, because an XPath comparison involving a node-set is true as soon as any node in the set satisfies it. In this case we want to make sure that all the items' currency values match, and hence we negate both the assertion expression (using XPath's not() function) and the operator used inside the assertion ('=' becomes '!='). This technique is often used in Schematron rules to express a constraint that must hold for every selected node.
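The difference between the two assertions can be modelled in a few lines of Python. This is only a sketch of XPath 1.0 comparison semantics for the case used here (a node-set compared with a single value); the function names are made up for the illustration.

```python
def xpath_equals(node_set, value):
    """Model of 'node-set = value': true if ANY node equals the value."""
    return any(node == value for node in node_set)


def xpath_not_equals(node_set, value):
    """Model of 'node-set != value': true if ANY node differs from the value."""
    return any(node != value for node in node_set)


# Hypothetical order in which one item uses the wrong currency.
item_currencies = ["AUD", "USD"]
amount_currency = "AUD"

# test="... = @currency": passes although one item is wrong.
weak_check = xpath_equals(item_currencies, amount_currency)

# test="not(... != @currency)": fails, as intended.
strict_check = not xpath_not_equals(item_currencies, amount_currency)

print(weak_check, strict_check)
```

In Python terms, '=' behaves like any() and not(... != ...) behaves like all().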

Now that all the Schematron rules are defined, the only remaining task is to insert them into the main RELAX NG schema. As already mentioned, a RELAX NG schema allows any element not in the RELAX NG namespace to appear anywhere in the schema where markup is allowed. However, to keep the RELAX NG schema well organized and easy to read, it is recommended that you embed the Schematron rules in one of two places:

  1. Insert all the embedded Schematron rules at the beginning of the RELAX NG schema as a child of the top-level element. Then you always know that if you have embedded rules, they will be specified together and in the same place.

  2. Specify each Schematron rule on the element pattern that specifies the context of the embedded rule. In the previous example this means that one of the Schematron rules would be embedded on the element pattern for the item element and the other on the element pattern for the amount element in the payment section.

I prefer to embed each Schematron rule in the element that defines the context, but it is really up to the developer which method to use. Another good rule to follow is to always declare the Schematron namespace on the top-level element in the RELAX NG schema. That way you know that if the top-level element contains a declaration for the Schematron namespace, then the schema contains embedded Schematron rules. The complete RELAX NG schema for the purchase order with embedded Schematron rules might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" 
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" 
xmlns:sch="http://www.ascc.net/xml/schematron">
  <start>
    <ref name="purchaseOrder"/>
  </start>
  <define name="purchaseOrder">
    <element name="purchaseOrder">
      <attribute name="date">
        <data type="date"/>
      </attribute>
      <ref name="deliveryDetails"/>
      <element name="items">
        <oneOrMore>
          <ref name="item"/>
        </oneOrMore>
      </element>
      <ref name="payment"/>
    </element>
  </define>
  <define name="deliveryDetails">
    <element name="deliveryDetails">
      <element name="name"><text/></element>
      <element name="address"><text/></element>
      <element name="phone"><text/></element>
    </element>
  </define>
  <define name="item">
    <element name="item">
      <sch:pattern 
      name="Check that the pricing and currency of an item is correct.">
        <sch:rule context="purchaseOrder/items/item">
          <sch:assert 
          test="number(price) * number(quantity) = number(totalAmount)">
            The total amount for the item doesn't add up to (quantity * price).
          </sch:assert>
          <sch:assert 
          test="price/@currency = totalAmount/@currency">
            The currency in price doesn't match the currency in totalAmount.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <attribute name="id">
        <data type="string">
          <param name="pattern">\d{3}-[A-Z]{2}</param>
        </data>
      </attribute>
      <element name="productName"><text/></element>
      <element name="quantity">
        <data type="int"/>
      </element>
      <element name="price">
        <ref name="currency"/>
      </element>
      <element name="totalAmount">
        <ref name="currency"/>
      </element>
    </element>
  </define>
  <define name="payment">
    <element name="payment">
      <attribute name="type">
        <choice>
          <value>Prepaid</value>
          <value>OnArrival</value>
        </choice>
      </attribute>
      <element name="amount">
        <sch:pattern 
        name="Check that the total amount is correct and that the currencies match">
          <sch:rule context="purchaseOrder/payment/amount">
           <sch:assert 
           test="number(.) = sum(/purchaseOrder/items/item/totalAmount)">
             The total purchase amount doesn't match the cost of all items.
           </sch:assert>
           <sch:assert 
           test="not(/purchaseOrder/items/item/totalAmount/@currency != @currency)">
             The currency in at least one of the items doesn't match the 
             currency for the total amount.
           </sch:assert>
         </sch:rule>
        </sch:pattern>
        <ref name="currency"/>
      </element>
    </element>
  </define>
  <define name="currency">
    <attribute name="currency">
      <choice>
        <value>AUD</value>
        <value>USD</value>
        <value>SEK</value>
      </choice>
    </attribute>
    <data type="int"/>
  </define>
</grammar>

Dependency between XML documents

Like most other XML schema languages, RELAX NG lacks the ability to specify constraints between XML instance documents. In many XML applications, this is very useful functionality. A typical example would be to check whether a certain ID reference has a corresponding ID in a different document. For the purchase order example in the preceding section, the other document could be a simple database file where all the available products are listed. Typically, such a database would contain the following information:

  • Date when the database was updated

  • One or more products

  • Each product has an id, a name, a description, a price and the number of items in stock

A sample XML instance document for the database would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<products lastUpdated="2002-10-22">
  <product id="123-XY">
    <productName>Coffin</productName>
    <description>Standard coffin, Size 200x80x50cm</description>
    <numberInStock>4</numberInStock>
    <price currency="AUD">2300</price>
  </product>
  <product id="112-AA">
    <productName>Shovel</productName>
    <description>Plastic grip shovel</description>
    <numberInStock>2</numberInStock>
    <price currency="AUD">75</price>
  </product>
</products>

With the corresponding RELAX NG schema:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <ref name="products"/>
  </start>
  <define name="products">
    <element name="products">
      <attribute name="lastUpdated">
        <data type="date"/>
      </attribute>
      <oneOrMore>
        <ref name="product"/>
      </oneOrMore>
    </element>
  </define>
  <define name="product">
    <element name="product">
      <attribute name="id">
        <data type="string">
          <param name="pattern">\d{3}-[A-Z]{2}</param>
        </data>
      </attribute>
      <element name="productName"><text/></element>
      <element name="description"><text/></element>
      <element name="numberInStock">
        <data type="int"/>
      </element>
      <element name="price">
        <ref name="currency"/>
      </element>
    </element>
  </define>
  <define name="currency">
    <attribute name="currency">
      <choice>
        <value>AUD</value>
        <value>USD</value>
        <value>SEK</value>
      </choice>
    </attribute>
    <data type="int"/>
  </define>
</grammar>		

Looking back at the purchase order in the preceding section, each purchased item was specified as:

<item id="123-XY">
  <productName>Coffin</productName>
  <quantity>1</quantity>
  <price currency="AUD">2300</price>
  <totalAmount currency="AUD">2300</totalAmount>
</item>

Since there also exists a database for each product available for purchase, there are now at least two more constraints that can be checked for each purchase order:

  1. Make sure that each item's id exists as a product id in the database

  2. Make sure that the quantity ordered is less than or equal to the total number of products in stock for each item in the purchase order

Since these constraints require checks between XML documents, they can only be checked by Schematron processors that support XSLT's document() function (or similar functionality). If a Schematron processor based on XSLT is used, this is not a problem; but most XPath implementations of Schematron do not have this type of functionality. If you use an XSLT implementation, the Schematron rule for the first constraint can be specified like this:

<sch:pattern name="Check that the item exists in the database." 
  xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule context="purchaseOrder/items/item">
    <sch:assert test = "document('Products.xml')/products/product/@id = @id"
      >The item doesn't exist in the database.</sch:assert>
  </sch:rule>
</sch:pattern>		

Here the document() function is used to access the XML instance document that contains the available products. Once the document() function has retrieved the external document, you can use normal XPath expressions to select the nodes of interest. In this example, the id of all the product elements with a parent products is compared to the id of the item that is currently being checked. If an item element's id value does not exist in the database (Products.xml), the assertion will fail.

The easiest way to check the second constraint is to use a different rule where the context is restricted using predicates. Here is an example of how this can be specified:

<sch:pattern name="Check that there are enough items in stock for the purchase." 
  xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule 
  context="purchaseOrder/items/item[@id = document('Products.xml')/products/product/@id]">
    <sch:assert 
    test="number(document('Products.xml')/products/product[@id = current()/@id]/numberInStock)
	>= number(quantity)">
      There are not enough items of this type in stock for this quantity.
    </sch:assert>
  </sch:rule>
</sch:pattern>		

This rule is a bit more complicated than the previous ones. The first thing that is different is that the context specification for this rule is using a predicate to limit the number of elements checked. In this case, the predicate is used because instead of selecting all the item elements in the document, only the item elements with an id that exists in the database should be selected. This ensures that when the processor checks the assertion, it is certain that the item being validated exists in the database.

In this case the assertion test itself also specifies a predicate, in conjunction with the document() function. Here the predicate is used to select the product element whose id matches the id of the item element that is currently being checked. The assertion then checks that the numberInStock child element (of product) has a value that is greater than or equal to the value of the quantity child element (of item).

Now we know how the rule selects the context node, and how the assertion performs the validation, but what is the reason for the added restriction on which item elements are selected? Why can't the context simply be all the item elements in the document and then the assertion for both the above constraints can be included in the same rule?

The answer has its roots in the fact that a Schematron assertion will fire if its test condition evaluates to false. Part of the assertion expression looks like this:

document('Products.xml')/products/product[@id = current()/@id]

This part of the assertion selects the product element from the database that has the same id as the item currently being checked. If no such product exists, the expression selects an empty node-set. The number() function converts an empty node-set to NaN, and any comparison with NaN is false, so the whole assertion fails. This is not the desired result, since this assertion should only check that there are enough products in stock to make the purchase. By specifying a rule that selects only the item elements that do exist in the database, this situation will never occur.
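The effect of a missing product can be reproduced outside XPath. The following Python sketch models number() applied to a node-set and the stock comparison. The STOCK dictionary stands in for Products.xml, and all function names are hypothetical. Python's float() accepts some strings that XPath's number() does not, which does not matter for this example.

```python
import math

# Stand-in for Products.xml: product id -> numberInStock.
STOCK = {"123-XY": 4, "112-AA": 2}


def xpath_number(node_set):
    """Model of XPath number(): first node's value, NaN if empty or non-numeric."""
    if not node_set:
        return math.nan
    try:
        return float(node_set[0])
    except ValueError:
        return math.nan


def stock_nodes(item_id):
    """Model of .../product[@id = current()/@id]/numberInStock."""
    return [str(STOCK[item_id])] if item_id in STOCK else []


def enough_in_stock(item_id, quantity):
    """Model of the assertion test: number(...) >= number(quantity)."""
    return xpath_number(stock_nodes(item_id)) >= quantity


print(enough_in_stock("123-XY", 1))  # product exists, stock is sufficient
print(enough_in_stock("112-AA", 3))  # product exists, stock is too low
print(enough_in_stock("999-ZZ", 1))  # unknown product: NaN >= 1 is false
```

The last two calls both report a failed assertion, so the stock message alone cannot tell a low stock level apart from an unknown product.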

Another important issue when defining the context of a rule is that an element can be used as the context only once in each pattern. This means that if more than one rule in the same pattern matches the same element, only the first matching rule is used. If a pattern defines multiple rules with the same context element, the most restrictive rule must therefore be specified first, followed by the other rules in order of decreasing restrictiveness. For programmers, this is analogous to how a long if-else chain is specified: you start with the most restrictive condition and finish with the most general condition. If done in reverse order, the first condition will always be true and the others will never execute. To illustrate, we will take a look at how to specify the above two rules in one pattern, since both rules use the same context (the item element).

<sch:pattern name="Combined pattern." 
  xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule context="purchaseOrder/items/item">
    ...
  </sch:rule>
  <sch:rule 
  context="purchaseOrder/items/item[@id = document('Products.xml')/products/product/@id]">
    ...
  </sch:rule>
</sch:pattern>

If the rules were specified in the above order (which is the order in which they were defined and specified in the example), validation would not be performed correctly. The reason is that both rules specify the same context element, and in this case the most general rule (context="purchaseOrder/items/item") is specified first. Since this rule will match all the item elements, there will not be any item elements left to match the second rule. To make this work as expected, the rules must be specified in the reverse order (the most restrictive rule first):

<sch:pattern name="Combined pattern." 
  xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule 
  context="purchaseOrder/items/item[@id = document('Products.xml')/products/product/@id]">
    ...
  </sch:rule>
  <sch:rule context="purchaseOrder/items/item">
    ...
  </sch:rule>
</sch:pattern>

Now validation will be performed as expected. Since the most restrictive rule (selects only the item elements that do exist in the database) is specified first, the second rule will still be applied to all item elements that do not exist in the database. This means that the assertion in the second rule can be simplified to always fail (test="false()") because if the assertion is ever checked, it is certain that it is an invalid item that does not exist in the database.
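The if-else analogy can be made literal. The sketch below models a pattern in Python as an ordered list of (context test, assertion) pairs in which only the first matching rule is applied to each item. The data structure, the STOCK dictionary and the function names are assumptions made for this illustration, not a description of how any Schematron processor is implemented.

```python
# Stand-in for Products.xml: product id -> numberInStock.
STOCK = {"123-XY": 4, "112-AA": 2}


def in_database(item):
    return item["id"] in STOCK


def any_item(item):
    return True


def check_stock(item):
    if STOCK[item["id"]] >= item["quantity"]:
        return None
    return "There are not enough items of this type in stock for this quantity."


def always_fail(item):
    return "The item doesn't exist in the database."


# Most restrictive context first, as in the corrected pattern above.
RULES = [(in_database, check_stock), (any_item, always_fail)]


def run_pattern(items, rules):
    """Apply only the first rule whose context matches each item."""
    messages = []
    for item in items:
        for matches, assertion in rules:
            if matches(item):
                message = assertion(item)
                if message:
                    messages.append((item["id"], message))
                break  # remaining rules in this pattern are skipped
    return messages


ITEMS = [
    {"id": "123-XY", "quantity": 1},
    {"id": "112-AA", "quantity": 3},
    {"id": "999-ZZ", "quantity": 1},
]
for item_id, message in run_pattern(ITEMS, RULES):
    print(item_id, message)
```

Passing the rules in the opposite order, run_pattern(ITEMS, list(reversed(RULES))), makes every item hit the general rule, so all three items are reported as missing from the database.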

Here is the complete specification of the pattern for the two constraints after the appropriate changes have been made:

<sch:pattern name="Check each item against the database." 
  xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule 
  context="purchaseOrder/items/item[@id = document('Products.xml')/products/product/@id]">
    <sch:assert 
    test="number(document('Products.xml')/products/product[@id = current()/@id]/numberInStock) 
	>= number(quantity)">
      There are not enough items of this type in stock for this quantity.
    </sch:assert>
  </sch:rule>
  <sch:rule context="purchaseOrder/items/item">
    <sch:assert test="false()"
      >The item doesn't exist in the database.</sch:assert>
  </sch:rule>
</sch:pattern>

The complete RELAX NG schema with embedded Schematron rules for both co-occurrence constraints and the database checks will look like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" 
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
xmlns:sch="http://www.ascc.net/xml/schematron">
  <start>
    <ref name="purchaseOrder"/>
  </start>
  <define name="purchaseOrder">
    <element name="purchaseOrder">
      <attribute name="date">
        <data type="date"/>
      </attribute>
      <ref name="deliveryDetails"/>
      <element name="items">
        <oneOrMore>
          <ref name="item"/>
        </oneOrMore>
      </element>
      <ref name="payment"/>
    </element>
  </define>
  <define name="deliveryDetails">
    <element name="deliveryDetails">
      <element name="name"><text/></element>
      <element name="address"><text/></element>
      <element name="phone"><text/></element>
    </element>
  </define>
  <define name="item">
    <element name="item">
      <sch:pattern name="Validate each item.">
        <sch:rule 
        context="purchaseOrder/items/item[@id = document(
        'Products.xml')/products/product/@id]">
          <sch:assert 
          test="number(document('Products.xml')
          /products/product[@id = current()/@id]/numberInStock) >= number(quantity)">
            There are not enough items of this type in stock for this quantity.
          </sch:assert>
          <sch:assert 
          test="number(price) * number(quantity) = number(totalAmount)">
            The total amount for the item doesn't add up to (quantity * price).
          </sch:assert>
          <sch:assert test="price/@currency = totalAmount/@currency"
            >The currency in price doesn't match the currency in totalAmount.
          </sch:assert>
        </sch:rule>
        <sch:rule context="purchaseOrder/items/item">
          <sch:assert test="false()"
            >The item doesn't exist in the database.</sch:assert>
        </sch:rule>
      </sch:pattern>
      <attribute name="id">
        <data type="string">
          <param name="pattern">\d{3}-[A-Z]{2}</param>
        </data>
      </attribute>
      <element name="productName"><text/></element>
      <element name="quantity">
        <data type="int"/>
      </element>
      <element name="price">
        <ref name="currency"/>
      </element>
      <element name="totalAmount">
        <ref name="currency"/>
      </element>
    </element>
  </define>
  <define name="payment">
    <element name="payment">
      <attribute name="type">
        <choice>
          <value>Prepaid</value>
          <value>OnArrival</value>
        </choice>
      </attribute>
      <element name="amount">
        <sch:pattern 
        name="Check that the total amount is correct and that the currencies match">
          <sch:rule 
          context="purchaseOrder/payment/amount">
            <sch:assert 
            test="number(.) = sum(/purchaseOrder/items/item/totalAmount)">
              The total purchase amount doesn't match the cost of all items.
            </sch:assert>
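            <sch:assert
            test="not(/purchaseOrder/items/item/totalAmount/@currency != @currency)">
              The currency in at least one of the items doesn't match the 
              currency for the total amount.
            </sch:assert>
          </sch:rule>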
        </sch:pattern>
        <ref name="currency"/>
      </element>
    </element>
  </define>
  <define name="currency">
    <attribute name="currency">
      <choice>
        <value>AUD</value>
        <value>USD</value>
        <value>SEK</value>
      </choice>
    </attribute>
    <data type="int"/>
  </define>
</grammar>

Control over mixed text content

One of WXS's major advantages over previous schema languages is the ability to specify an extensive selection of datatypes, not only for attributes but also for elements with text content. In RELAX NG it is possible to use all the datatypes from WXS by specifying WXS as the datatype library. Unfortunately, this ability to control the text content of an element disappears if the element is defined to have mixed content (child elements mixed with text content). With the help of embedded Schematron rules, it is possible to apply basic text validation even to mixed content elements.

An example of this is source XML data that should be transformed into high-quality PDF documents. A very simple paragraph in the final document can be represented in XML like this:

<p>This is <b>ok</b> but this is<b> not</b> ok</p>

In this case it is very important where the space characters around the b elements are situated. If a space character is situated inside the b element, the bold font will make the space wider than it is supposed to be. For this reason it is important that the text content inside the b element does not start or end with a space character. For the same reason, the text preceding the b element should always end with a space character and the text following the b element should always start with a space character. In the above example, the spaces around the first b element are correctly located, while those around the second b element are not.

The RELAX NG schema for the above example is very simple:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="p">
      <mixed>
        <zeroOrMore>
          <element name="b">
            <text/>
          </element>
        </zeroOrMore>
      </mixed>
    </element>
  </start>
</grammar>

The Schematron rules that are needed to check the extra constraints on the text content can be implemented like this:

<sch:pattern name="Check spaces around b tags"
  xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule
  context="p/node()[following-sibling::b][preceding-sibling::b][1]">
    <sch:assert test="substring(., string-length(.)) = ' '">
      A space must be present before the b tag.
    </sch:assert>
    <sch:assert test="starts-with(., ' ')">
      A space must be present after the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/node()[following-sibling::b][1]">
    <sch:assert test="substring(., string-length(.)) = ' '">
      A space must be present before the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/node()[preceding-sibling::b][1]">
    <sch:assert test="starts-with(., ' ')">
      A space must be present after the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/b">
    <sch:assert test="not(starts-with(., ' '))">
      The text in the b tag cannot start with a space.
    </sch:assert>
    <sch:assert test="substring(., string-length(.)) != ' '">
      The text in the b tag cannot end with a space.
    </sch:assert>
  </sch:rule>
</sch:pattern>

The Schematron rules that check this constraint are divided into four parts (each part is one rule with a separate context), which are explained in the order they are declared:

  1. For all child nodes of the p element where the nearest preceding sibling and the nearest following sibling are both b elements, check that a space character is present immediately after the preceding b element and that a space character is present immediately before the following b element.

  2. For all child nodes of the p element where the nearest following sibling is a b element, check that a space character is present immediately before the b element.

  3. For all child nodes of the p element where the nearest preceding sibling is a b element, check that a space character is present immediately after the b element.

  4. For all child b elements, check that the text content does not begin or end with a space character.

The complete RELAX NG schema with embedded Schematron rules looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
xmlns:sch="http://www.ascc.net/xml/schematron"> 
  <start>
    <element name="p">
      <sch:pattern name="Check spaces around b tags">
        <sch:rule
        context="p/node()[following-sibling::b][preceding-sibling::b][1]">
          <sch:assert 
          test="substring(., string-length(.)) = ' '">
            A space must be present before the b tag.
          </sch:assert>
          <sch:assert
          test="starts-with(., ' ')">
            A space must be present after the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/node()[following-sibling::b][1]">
          <sch:assert
          test="substring(., string-length(.)) = ' '">
            A space must be present before the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/node()[preceding-sibling::b][1]">
          <sch:assert test="starts-with(., ' ')">
            A space must be present after the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/b">
          <sch:assert test="not(starts-with(., ' '))">
            The text in the b tag cannot start with a space.
          </sch:assert>
          <sch:assert 
          test="substring(., string-length(.)) != ' '">
            The text in the b tag cannot end with a space.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <mixed>
        <zeroOrMore>
          <element name="b">
            <text/>
          </element>
        </zeroOrMore>
      </mixed>
    </element>
  </start>
</grammar>

This is, of course, a very simple example that checks only for space characters. A more advanced version would also need to check for other whitespace characters (such as tabs) and to handle the fact that the last b element should not be followed by a space when the next character is a punctuation character. However, the example still gives you an idea of the things you can do with Schematron and mixed content.
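As a rough illustration of the whitespace point, the tests can be rewritten with the XPath 1.0 function normalize-space(), which strips spaces, tabs, carriage returns and line feeds. The following is only a sketch under stated assumptions: the pattern name is invented for this illustration, the rule contexts are copied unchanged from the second and fourth rules above, and the punctuation case is not handled at all.

```xml
<sch:pattern name="Check whitespace around b tags"
    xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:rule context="p/node()[following-sibling::b][1]">
    <!-- normalize-space() strips space, tab, CR and LF, so the last
         character is whitespace exactly when it normalizes to '' -->
    <sch:assert test="string-length(.) != 0 and
        normalize-space(substring(., string-length(.))) = ''">
      A whitespace character must be present before the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/b">
    <!-- An empty b element passes both tests, as in the original rule -->
    <sch:assert test="string-length(.) = 0 or
        normalize-space(substring(., 1, 1)) != ''">
      The text in the b tag cannot start with whitespace.
    </sch:assert>
    <sch:assert test="string-length(.) = 0 or
        normalize-space(substring(., string-length(.))) != ''">
      The text in the b tag cannot end with whitespace.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

The same substitution would apply to the starts-with() tests in the first and third rules, by checking substring(., 1, 1) instead of the last character.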

Embedded Schematron using namespaces

Since Schematron is namespace-aware, as is RELAX NG, it is no problem to embed Schematron rules in a RELAX NG schema that defines one or more namespaces for the document. The preceding section showed how Schematron schemas are set up to use namespaces by means of the ns element. For embedded Schematron rules, this works exactly the same way: instead of embedding only the Schematron rule that defines the extra constraint, you also need to embed the ns elements that define the namespaces used. The example from Namespaces and Schematron is reused here, but now RELAX NG defines the structure while Schematron checks the co-occurrence constraint. The instance example used was:

<ex:Person Title="Mr" xmlns:ex="http://www.topologi.com/example">
   <ex:Name>Eddie</ex:Name>
   <ex:Gender>Male</ex:Gender>
</ex:Person>

A RELAX NG schema for the above would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
ns="http://www.topologi.com/example">
  <start>
    <element name="Person">
      <element name="Name"><text/></element>
      <element name="Gender">
        <choice>
          <value>Male</value>
          <value>Female</value>
        </choice>
      </element>
      <attribute name="Title"/>
    </element>
  </start>
</grammar>

The Schematron rule that needs to be embedded to check the co-occurrence constraint (if title is "Mr" then the value of element Gender must be "Male") will look like this (note the use of the ex prefix):

<sch:pattern name="Check co-occurrence constraint">
  <sch:rule context="ex:Person[@Title='Mr']">
    <sch:assert test="ex:Gender = 'Male'">
      If the Title is "Mr" then the gender of the person must be "Male".
    </sch:assert>
  </sch:rule>
</sch:pattern>

If this rule were embedded on its own, the Schematron validation would fail because the prefix ex is not mapped to a namespace URI. For this to work, the ns element that defines this mapping must also be embedded:

<sch:ns prefix="ex" 
uri="http://www.topologi.com/example" 
xmlns:sch="http://www.ascc.net/xml/schematron"/>

I always insert these Schematron namespace mappings at the start of the host schema. This means that they are always declared in the same place and it is easy to see which mappings are included without having to search through the entire schema. The complete RELAX NG schema with the embedded rules would then look like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
ns="http://www.topologi.com/example" 
xmlns:sch="http://www.ascc.net/xml/schematron">
  <!-- Include all the Schematron namespace mappings at the top -->
  <sch:ns prefix="ex" uri="http://www.topologi.com/example"/>
  <start>
    <element name="Person">
      <sch:pattern name="Check co-occurrence constraint">
        <sch:rule context="ex:Person[@Title='Mr']">
          <sch:assert test="ex:Gender = 'Male'">
            If the Title is "Mr" then the gender of the person must be "Male".
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <element name="Name"><text/></element>
      <element name="Gender">
        <choice>
          <value>Male</value>
          <value>Female</value>
        </choice>
      </element>
      <attribute name="Title"/>
    </element>
  </start>
</grammar> 

Processing

Since embedded Schematron rules are not part of the RELAX NG specification, most RELAX NG processors will not recognize the rules or enforce the constraints they express. In fact, the embedded Schematron rules will be completely ignored by the processor since they are declared in a different namespace than RELAX NG's. This means that, in order to use the Schematron rules for validation, this functionality must be added. Currently there are two options for how this can be achieved:

  1. The embedded rules are extracted from the RELAX NG schema and concatenated into a Schematron schema. This schema can then be used for normal Schematron validation of the XML instance document. Since both RELAX NG and Schematron use XML syntax, it is fairly easy to perform this extraction using XSLT. This technique will be described in detail in the following section.

  2. The RELAX NG processor can be modified to allow embedded Schematron-like rules and perform the validation as part of the normal RELAX NG validation. This technique is used in Sun's MSV which has an add-on that will validate XML instance documents against RELAX NG schemas annotated with rules and assertions. However, the way the rules are embedded in the RELAX NG schema is slightly different if this option is used compared to the method described in this chapter. Some of these differences include:

    • The rules can only be embedded within a RELAX NG element
    • The context for each rule or assertion is determined by the element where they are declared in the RELAX NG schema

    More information and details about this are provided in the documentation included in the download of the MSV add-on.

    It should be noted that the rules and assertions specified using this method have little to do with Schematron beyond using the same element names.

Validation using Extraction

To extract the embedded Schematron rules from the RELAX NG schema, the RNG2Schtrn.xsl stylesheet can be used. This stylesheet will also extract Schematron rules that have been declared in RELAX NG modules that are included in or referenced from the base schema.
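To give a feel for what such an extraction involves, here is a minimal XSLT 1.0 sketch. It is not the RNG2Schtrn.xsl stylesheet itself, only an illustration written under simplifying assumptions: it collects just the sch:ns and sch:pattern elements from a single schema file, it does not follow included or referenced RELAX NG modules, and it ignores any other Schematron elements that might be embedded.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sch="http://www.ascc.net/xml/schematron">

  <xsl:output method="xml" indent="yes"/>

  <!-- Build one Schematron schema from everything embedded in the
       RELAX NG schema that is given as the source document -->
  <xsl:template match="/">
    <sch:schema>
      <!-- Namespace mappings first, then the patterns, each group
           in document order -->
      <xsl:copy-of select="//sch:ns"/>
      <xsl:copy-of select="//sch:pattern"/>
    </sch:schema>
  </xsl:template>

</xsl:stylesheet>
```

Because xsl:copy-of also copies the namespace declarations that are in scope, the extracted patterns will carry the RELAX NG namespace declaration along with them. This should be harmless, since unprefixed names in XPath 1.0 expressions never use the default namespace.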

The result from the script is a complete Schematron schema that can be used to validate the XML instance document using a Schematron processor as described in the section Introduction to Schematron. The XML instance document is then validated against the RELAX NG schema using a normal RELAX NG processor that will ignore all the embedded rules. This means that validation results are available from both Schematron validation and RELAX NG validation and if needed the results can be merged into one report. The whole process is described in the following figure:

As shown in the figure, there are two distinct paths in the validation process, which means that if timing requirements are important, each path can be implemented as a separate process and the two can be executed in parallel.

A batch file that would (using the Win32 executable of Jing and Saxon) validate an XML instance document against both a RELAX NG schema and its embedded Schematron rules can look like this:

echo Running Jing validation on Sample.xml...

   jing PurchaseOrder.rng Sample.xml

echo Creating Schematron schema from PurchaseOrder.rng...

   saxon -o PurchaseOrder.sch PurchaseOrder.rng RNG2Schtrn.xsl

echo Running Basic Schematron validation on file Sample.xml...

   saxon -o validate.xsl PurchaseOrder.sch schematron-basic.xsl
   saxon Sample.xml validate.xsl

So, first, the XML instance document is validated against the RELAX NG schema using Jing, and then it is validated with the embedded Schematron rules using Saxon. An output example could look like this:

Running Jing validation on Sample.xml...

Error at URL "file:/C:/Sample.xml", line number 7: unknown element "BogusElement"

Creating Schematron schema from PurchaseOrder.rng...

Running Basic Schematron validation on file Sample.xml...

From pattern "Check that each team is registered in the tournament":
   Assertion fails: "The item doesn't exist in the database." at 
    /purchaseOrder[1]/items[1]/item[2]
     <item id="112-AX">...</>

Done.		

The Topologi Schematron Validator is a free graphical validator that can validate an XML instance document against a RELAX NG schema with embedded Schematron rules.

Summary

Schematron is a very good complement to RELAX NG, and there is little that cannot be validated by the combination of the two. This article has shown how to embed Schematron rules in a RELAX NG schema and has provided guidelines for how to perform validation. A Java implementation of Schematron that works as a wrapper around Xalan can be downloaded from Topologi. This implementation also contains classes to perform RELAX NG validation (using Jing) with embedded Schematron rules.

It is up to each project and use-case to evaluate if embedding Schematron rules in RELAX NG schemas is a suitable technique to achieve more powerful validation. Following is a list of some advantages to take into account:

  • By combining the power of RELAX NG and Schematron, the limit for what can be performed in terms of validation is raised to a new level.

  • Many of the constraints that previously had to be checked in the application can now be moved out of the application and into the schema.

  • Since Schematron lets you provide your own error messages (the content of the assertion elements), you can ensure that each message is as explanatory as needed.

And some disadvantages:

  • In time-critical applications, the overhead of processing the embedded Schematron rules may be too high. This is especially true if XSLT implementations of Schematron are used in conjunction with the extraction method in the preceding section. Extensive use of XSLT's document() function is also very resource-demanding and time-consuming.

  • Since the extraction of Schematron rules from a RELAX NG schema is performed with XSLT, embedded Schematron rules are only supported in RELAX NG schemas that use the full XML syntax.

The ability to combine embedded Schematron rules with a different schema language is not unique to RELAX NG and should be possible in all XML schema languages that use XML syntax and have an extensibility mechanism. The only thing needed is to modify the XSLT extractor stylesheet to accommodate the extension mechanism in the host XML schema language used.

Acknowledgements

I would like to thank Rick Jelliffe and Mike Fitzgerald for comments and suggestions on this article.

Resources