XML.com: XML From the Inside Out

Combining RELAX NG and Schematron
by Eddie Robertsson

Control over mixed text content

One of WXS's major advantages over previous schema languages is the ability to specify an extensive selection of datatypes, not only for attributes but also for elements with text content. In RELAX NG it is possible to use all the datatypes from WXS by specifying WXS as the datatype library. Unfortunately, this ability to control the text content of an element disappears if the element is defined to have mixed content (child elements mixed with text content). With the help of embedded Schematron rules, it is possible to apply basic text validation even to mixed-content elements.

An example of this is source XML data that must be transformed into high-quality PDF documents. A very simple paragraph in the final document could be represented in XML like this:

<p>This is <b>ok</b> but this is<b> not</b> ok</p>

In this case it matters where the space characters around the b elements are located. If a space character is inside the b element, the bold font will make it larger than it is supposed to be. For this reason the text content inside the b element must not start or end with a space character. For the same reason, the text preceding the b element should always end with a space character, and the text following the b element should always start with one. In the above example the spaces around the first b element are correctly located, while those around the second b element are not.

The RELAX NG schema for the above example is very simple:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="p">
      <mixed>
        <zeroOrMore>
          <element name="b">
            <text/>
          </element>
        </zeroOrMore>
      </mixed>
    </element>
  </start>
</grammar>

The Schematron rules that are needed to check the extra constraints on the text content can be implemented like this:

<sch:pattern name="Check spaces around b tags">
  <sch:rule
      context="p/node()[preceding-sibling::node()[1][self::b]]
                       [following-sibling::node()[1][self::b]]">
    <sch:assert test="substring(., string-length(.)) = ' '">
      A space must be present before the b tag.
    </sch:assert>
    <sch:assert test="starts-with(., ' ')">
      A space must be present after the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/node()[following-sibling::node()[1][self::b]]">
    <sch:assert test="substring(., string-length(.)) = ' '">
      A space must be present before the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/node()[preceding-sibling::node()[1][self::b]]">
    <sch:assert test="starts-with(., ' ')">
      A space must be present after the b tag.
    </sch:assert>
  </sch:rule>
  <sch:rule context="p/b">
    <sch:assert test="not(starts-with(., ' '))">
      The text in the b tag cannot start with a space.
    </sch:assert>
    <sch:assert test="substring(., string-length(.)) != ' '">
      The text in the b tag cannot end with a space.
    </sch:assert>
  </sch:rule>
</sch:pattern>

The Schematron rules that check this constraint are divided into four parts (each part is one rule with a separate context), which are explained in the order they are declared:

  1. For all child nodes of the p element where both the nearest preceding sibling and the nearest following sibling are b elements, check that a space character is present immediately after the preceding b element and that a space character is present immediately before the following b element.

  2. For all child nodes of the p element where the nearest following sibling is a b element, check that a space character is present immediately before the b element.

  3. For all child nodes of the p element where the nearest preceding sibling is a b element, check that a space character is present immediately after the b element.

  4. For all child b elements, check that the text content does not begin or end with a space character.
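
The four checks can also be restated outside Schematron, which may help when experimenting with the constraint. The following is only a sketch in Python using the standard library's ElementTree, not the article's method: the function name check_spaces_around_b is hypothetical, and the sketch assumes that p contains only text and b children, as the RELAX NG schema above requires. It does not report anything when there is no text at all between two nodes.

```python
import xml.etree.ElementTree as ET


def check_spaces_around_b(p):
    """Return one message per spacing problem found in a single p element."""
    errors = []
    children = list(p)
    for i, child in enumerate(children):
        if child.tag != "b":
            continue
        # ElementTree stores the text before the first child in p.text and
        # the text after each child in that child's tail.
        before = p.text if i == 0 else children[i - 1].tail
        after = child.tail
        inner = child.text or ""
        if before and not before.endswith(" "):
            errors.append("A space must be present before the b tag.")
        if after and not after.startswith(" "):
            errors.append("A space must be present after the b tag.")
        if inner.startswith(" "):
            errors.append("The text in the b tag cannot start with a space.")
        if inner.endswith(" "):
            errors.append("The text in the b tag cannot end with a space.")
    return errors


p = ET.fromstring("<p>This is <b>ok</b> but this is<b> not</b> ok</p>")
for message in check_spaces_around_b(p):
    print(message)
```

For the example paragraph this reports two problems, both around the second b element: the text before it does not end with a space, and its own text starts with one.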

The complete RELAX NG schema with embedded Schematron rules looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
xmlns:sch="http://www.ascc.net/xml/schematron"> 
  <start>
    <element name="p">
      <sch:pattern name="Check spaces around b tags">
        <sch:rule
        context="p/node()[preceding-sibling::node()[1][self::b]]
                         [following-sibling::node()[1][self::b]]">
          <sch:assert
          test="substring(., string-length(.)) = ' '">
            A space must be present before the b tag.
          </sch:assert>
          <sch:assert
          test="starts-with(., ' ')">
            A space must be present after the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/node()[following-sibling::node()[1][self::b]]">
          <sch:assert
          test="substring(., string-length(.)) = ' '">
            A space must be present before the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/node()[preceding-sibling::node()[1][self::b]]">
          <sch:assert test="starts-with(., ' ')">
            A space must be present after the b tag.
          </sch:assert>
        </sch:rule>
        <sch:rule context="p/b">
          <sch:assert test="not(starts-with(., ' '))">
            The text in the b tag cannot start with a space.
          </sch:assert>
          <sch:assert 
          test="substring(., string-length(.)) != ' '">
            The text in the b tag cannot end with a space.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <mixed>
        <zeroOrMore>
          <element name="b">
            <text/>
          </element>
        </zeroOrMore>
      </mixed>
    </element>
  </start>
</grammar>

This is of course a very simple example in which you only check for space characters. A more advanced example would also need to check for other whitespace characters (such as tabs) and account for the fact that the last b element should not be followed by a space if the immediately following character is a punctuation character. However, the example still gives you an idea of the things you can do with Schematron and mixed content.
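
To make the more advanced checks concrete, here is a small Python sketch of what the relaxed tests might look like. It is an illustration only, under stated assumptions: the three helper names are hypothetical, "whitespace" means whatever Python's str.isspace() accepts, and "punctuation" is limited to the ASCII characters in string.punctuation. The punctuation exception is applied to the text after any b element, which is one possible reading of the requirement.

```python
import string


def text_before_b_ok(text):
    """Text preceding a b element must end with a whitespace character."""
    return text[-1:].isspace()


def text_after_b_ok(text):
    """Text following a b element must start with whitespace or punctuation."""
    first = text[:1]
    # bool(first) guards against empty text, since "" is "in" every string.
    return bool(first) and (first.isspace() or first in string.punctuation)


def b_text_ok(text):
    """Text inside a b element must not start or end with whitespace."""
    return text == text.strip()


print(text_before_b_ok("this is\t"))  # a trailing tab now counts
print(text_after_b_ok(", and more"))  # punctuation may follow directly
print(b_text_ok(" not"))              # a leading space is still rejected
```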

Embedded Schematron using namespaces

Since Schematron is namespace-aware, as is RELAX NG, it is no problem to embed Schematron rules in a RELAX NG schema that defines one or more namespaces for the document. The preceding section showed how Schematron schemas are set up to use namespaces with the ns element. For embedded Schematron rules, this works exactly the same way: instead of embedding only the Schematron rule that defines the extra constraint, you also need to embed the ns elements that define the namespaces used. The example from Namespaces and Schematron is reused here, but now RELAX NG defines the structure while Schematron checks the co-occurrence constraint. The instance example was:

<ex:Person Title="Mr" xmlns:ex="http://www.topologi.com/example">
   <ex:Name>Eddie</ex:Name>
   <ex:Gender>Male</ex:Gender>
</ex:Person>

A RELAX NG schema for the above would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
ns="http://www.topologi.com/example">
  <start>
    <element name="Person">
      <element name="Name"><text/></element>
      <element name="Gender">
        <choice>
          <value>Male</value>
          <value>Female</value>
        </choice>
      </element>
      <attribute name="Title"/>
    </element>
  </start>
</grammar>

The Schematron rule that needs to be embedded to check the co-occurrence constraint (if title is "Mr" then the value of element Gender must be "Male") will look like this (note the use of the ex prefix):

<sch:pattern name="Check co-occurrence constraint">
  <sch:rule context="ex:Person[@Title='Mr']">
    <sch:assert test="ex:Gender = 'Male'">
      If the Title is "Mr" then the gender of the person must be "Male".
    </sch:assert>
  </sch:rule>
</sch:pattern>

If this rule were embedded on its own, the Schematron validation would fail because the prefix ex is not mapped to a namespace URI. For this to work, the ns element that defines this mapping must also be embedded:

<sch:ns prefix="ex" 
uri="http://www.topologi.com/example" 
xmlns:sch="http://www.ascc.net/xml/schematron"/>
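
The need for an explicit prefix mapping is not specific to Schematron. As a rough analogy only, the following Python sketch queries the instance document with the standard library's ElementTree: the path expression cannot be resolved until a prefix map is supplied, even though the instance itself declares xmlns:ex. The names NS and title_matches_gender are hypothetical, and the function is a plain restatement of the co-occurrence constraint, not a Schematron implementation.

```python
import xml.etree.ElementTree as ET

INSTANCE = """\
<ex:Person Title="Mr" xmlns:ex="http://www.topologi.com/example">
   <ex:Name>Eddie</ex:Name>
   <ex:Gender>Male</ex:Gender>
</ex:Person>"""

person = ET.fromstring(INSTANCE)

# Without a prefix map, ElementTree cannot resolve "ex" in the path.
try:
    person.find("./ex:Gender")
    unmapped_error = None
except SyntaxError as exc:
    unmapped_error = str(exc)
print("without mapping:", unmapped_error)

# Supplying the mapping plays the role that sch:ns plays for the rule.
NS = {"ex": "http://www.topologi.com/example"}


def title_matches_gender(person):
    """If Title is "Mr", the Gender element must contain "Male"."""
    if person.get("Title") == "Mr":
        return person.findtext("./ex:Gender", namespaces=NS) == "Male"
    return True


print("constraint holds:", title_matches_gender(person))
```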

I always insert these Schematron namespace mappings at the start of the host schema. This means that they are always declared in the same place and it is easy to see which mappings are included without having to search through the entire schema. The complete RELAX NG schema with the embedded rules would then look like this:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
ns="http://www.topologi.com/example" 
xmlns:sch="http://www.ascc.net/xml/schematron">
  <!-- Include all the Schematron namespace mappings at the top -->
  <sch:ns prefix="ex" uri="http://www.topologi.com/example"/>
  <start>
    <element name="Person">
      <sch:pattern name="Check co-occurrence constraint">
        <sch:rule context="ex:Person[@Title='Mr']">
          <sch:assert test="ex:Gender = 'Male'">
            If the Title is "Mr" then the gender of the person must be "Male".
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <element name="Name"><text/></element>
      <element name="Gender">
        <choice>
          <value>Male</value>
          <value>Female</value>
        </choice>
      </element>
      <attribute name="Title"/>
    </element>
  </start>
</grammar> 

Processing

Since embedded Schematron rules are not part of the RELAX NG specification, most RELAX NG processors will not recognize them or enforce the constraints they express. In fact, the embedded Schematron rules will be completely ignored by the processor, since they are declared in a different namespace than RELAX NG's. This means that the functionality must be added before the Schematron rules can be used for validation. Currently there are two options for achieving this:

  1. The embedded rules are extracted from the RELAX NG schema and concatenated into a Schematron schema. This schema can then be used for normal Schematron validation of the XML instance document. Since both RELAX NG and Schematron use XML syntax, it is fairly easy to perform this extraction using XSLT. This technique will be described in detail in the following section.

  2. The RELAX NG processor can be modified to allow embedded Schematron-like rules and perform the validation as part of the normal RELAX NG validation. This technique is used in Sun's MSV which has an add-on that will validate XML instance documents against RELAX NG schemas annotated with rules and assertions. However, the way the rules are embedded in the RELAX NG schema is slightly different if this option is used compared to the method described in this chapter. Some of these differences include:

    • The rules can only be embedded within a RELAX NG element
    • The context for each rule or assertion is determined by the element where they are declared in the RELAX NG schema

    More information and details about this are provided in the documentation included in the download of the MSV add-on.

    It should be noted that the rules and assertions specified using this method don't really have anything to do with Schematron, other than that they use the same names for the elements.
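
The idea behind the first option can be sketched in a few lines. The article's own approach uses XSLT; the Python below is a swapped-in illustration using the standard library's ElementTree, and the function name extract_schematron is hypothetical. It collects every embedded sch:ns and sch:pattern element from a RELAX NG schema and wraps them in a single sch:schema element, with the namespace mappings placed before the patterns.

```python
import xml.etree.ElementTree as ET

SCH = "http://www.ascc.net/xml/schematron"
ET.register_namespace("sch", SCH)  # serialize with the sch prefix

RELAXNG = """\
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.topologi.com/example"
         xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:ns prefix="ex" uri="http://www.topologi.com/example"/>
  <start>
    <element name="Person">
      <sch:pattern name="Check co-occurrence constraint">
        <sch:rule context="ex:Person[@Title='Mr']">
          <sch:assert test="ex:Gender = 'Male'">
            If the Title is "Mr" then the gender of the person must be "Male".
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <element name="Name"><text/></element>
      <element name="Gender">
        <choice>
          <value>Male</value>
          <value>Female</value>
        </choice>
      </element>
      <attribute name="Title"/>
    </element>
  </start>
</grammar>"""


def extract_schematron(relaxng_source):
    """Collect embedded sch:ns and sch:pattern elements into one sch:schema."""
    host = ET.fromstring(relaxng_source)
    schema = ET.Element(f"{{{SCH}}}schema")
    # Namespace mappings first, then patterns, each in document order.
    for name in ("ns", "pattern"):
        for node in host.iter(f"{{{SCH}}}{name}"):
            schema.append(node)
    return schema


schema = extract_schematron(RELAXNG)
print(ET.tostring(schema, encoding="unicode"))
```

The resulting schema could then be handed to an ordinary Schematron validator, while the original file is still validated by the RELAX NG processor.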
