Using Customized Schema Constraints

November 10, 2004

Last month we looked at XSLT's role in the reference implementation for Schematron, a schema language that lets you express many constraints that can't be expressed in RELAX NG (RNG) or the W3C Schema language (XSD). Rather than define an entire schema, a Schematron rule set usually supplements the structural and typing rules described in a schema written in one of the other two languages, or perhaps even in a DTD.

You can also insert the Schematron rules inside of an RNG or XSD schema. Because RNG and XSD schemas let you add elements from outside of their specialized namespaces, you can add customizations to your schemas without breaking anything. For example, you can add highly structured documentation, Schematron rules, or new classes of constraints designed around your system's needs.

Once you've added these foreign elements to a schema, how do you pull them out and use them when you need them? With XSLT!

Pulling Schematron Rules Out of a RELAX NG Schema

Sun provides an add-on to its msv (multi-schema-validator) program that supports Schematron assert statements inside of RNG element patterns in RNG schemas. Any other use of Schematron rules embedded in an RNG or XSD schema requires you to pull the Schematron rules out of the schema, save them in their own separate file, and then check your data against the rules in the extracted file. A short batch file or shell script used with the following style sheet can automate these steps, making their execution very simple.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:sc="http://www.ascc.net/xml/schematron"
                version="1.0">

  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <!-- copy Schematron elements and their attributes -->
  <xsl:template match="sc:* | sc:*/@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- output schematron text nodes -->
  <xsl:template match="sc:*/text()">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- suppress any others -->
  <xsl:template match="text()"/>

</xsl:stylesheet>

The style sheet simply copies anything from the Schematron namespace to the result tree. When run against an RNG schema that has Schematron rules embedded in it, it pulls out a Schematron rule set such as the following:

<st:schema xmlns:st="http://www.ascc.net/xml/schematron"
           xmlns="http://relaxng.org/ns/structure/1.0">
   <st:pattern name="doc schematron constraints">
      <st:rule context="doc">
         <st:assert test="@endDate &gt;= @startDate">
           The endDate attribute value must be equal
           to or greater than the startDate value.
         </st:assert>
      </st:rule>
      <st:rule context="fn">
         <st:report test=".//fn">
           Footnotes are not allowed inside of other footnotes.
         </st:report>
      </st:rule>
   </st:pattern>
</st:schema>

The first rule checks whether a doc element's endDate attribute has a value greater than or equal to the value of its startDate attribute, and the second rule checks that a given fn element has no fn elements anywhere inside of it.
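
As a rough sketch of the kind of input involved, an RNG schema with these rules embedded in it might look something like the following. The pattern names and content models are assumptions made up for illustration, not the schema that the rule set above was actually extracted from; only the st: elements correspond to that rule set.

<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:st="http://www.ascc.net/xml/schematron">

  <!-- Schematron rules stored as foreign-namespace elements -->
  <st:schema>
    <st:pattern name="doc schematron constraints">
      <st:rule context="doc">
        <st:assert test="@endDate &gt;= @startDate">
          The endDate attribute value must be equal
          to or greater than the startDate value.
        </st:assert>
      </st:rule>
      <st:rule context="fn">
        <st:report test=".//fn">
          Footnotes are not allowed inside of other footnotes.
        </st:report>
      </st:rule>
    </st:pattern>
  </st:schema>

  <!-- hypothetical structural rules; names are illustrative only -->
  <start>
    <ref name="doc-pattern"/>
  </start>

  <define name="doc-pattern">
    <element name="doc">
      <attribute name="startDate"/>
      <attribute name="endDate"/>
      <oneOrMore>
        <ref name="para-pattern"/>
      </oneOrMore>
    </element>
  </define>

  <!-- a para holds text mixed with footnotes -->
  <define name="para-pattern">
    <element name="para">
      <mixed>
        <zeroOrMore>
          <ref name="fn-pattern"/>
        </zeroOrMore>
      </mixed>
    </element>
  </define>

  <!-- a footnote holds paras -->
  <define name="fn-pattern">
    <element name="fn">
      <oneOrMore>
        <ref name="para-pattern"/>
      </oneOrMore>
    </element>
  </define>

</grammar>

An RNG validator treats the st:schema element and everything inside it as an annotation and ignores it when validating.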

The second rule bears a closer look. Why would a DTD or schema allow a footnote to be inside of a footnote? If your content models let you put a footnote inside of a paragraph and a paragraph inside of a footnote, you have a circular condition that makes it valid to have a footnote within a footnote.
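
To make the circular condition concrete, here is a hypothetical document instance (the text and attribute values are invented for illustration). If a para can contain an fn and an fn contains para elements, a purely structural schema would accept it, while the second Schematron rule above would report the outer fn element.

<!-- dates are written as yyyymmdd on the assumption that the rules
     are evaluated with XPath 1.0, where >= compares numbers -->
<doc startDate="20040101" endDate="20041231">
  <para>Some text.<fn>
    <para>A footnote.<fn>
      <para>A footnote inside of a footnote.</para>
    </fn></para>
  </fn></para>
</doc>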

SGML had a feature called exclusion exceptions that let an element declaration specify one or more elements that were not allowed anywhere, whether as a child or as a more distant descendant, in that element. For example, an SGML DTD might have this element declaration:

<!ELEMENT fn - - (para+) -(fn) >

(The first two hyphens are a bit of SGML syntax not necessary in XML DTDs; they specify that start- and end-tags are required for fn elements.) The hyphen and fn in parentheses at the end of the declaration show that no fn elements are allowed at any level within the element being declared. With this declaration, an SGML validator would flag any fn element found inside of another as an error, even if fn was explicitly declared in the content model of the para element that is part of the fn element's content model.

This is just the kind of feature that got thrown out of the SGML profile known as XML in order to make the coding of a parser easier and its memory footprint smaller. Many people with an SGML background miss it; as we can see here, a schema/Schematron combination can give it back to them.

Customized Schema Constraints

A key reason to use a schema language is that its base syntax is XML, unlike the specialized syntax used by XML 1.0 DTDs. This lets you use your collection of XML manipulation tools on the schemas themselves, which gives you a lot more flexibility in how you use the schemas. We saw above how a short XSLT stylesheet can pull Schematron rules out of a schema into a separate file so that a Schematron implementation can check a document instance file against those rules. We can take this even further: you can make up your own elements and attributes to describe customized schema constraints and then use an XSLT stylesheet to pull them out and save them in a format which can be used by a separate program that augments your data quality checking workflow.

Let's take SGML exclusion exceptions as an example. Instead of adding a Schematron rule to my next RNG schema, I've created a new element called exclusion. It could be an attribute of the RNG element instead of a child element, but making it a child element makes it easier to specify more than one element to exclude from an element's subtree.

The following RNG schema excerpt shows the use of my exclusion element in a declaration for the fn element. Because additional elements added to an RNG schema must be from a separate namespace, I've declared the "sn" prefix to represent the http://www.snee.com/ns/misc/ namespace and have specified that the exclusion elements come from that namespace.

<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
         xmlns:sn="http://www.snee.com/ns/misc/">

<!-- most of schema skipped -->

  <define name="fn-pattern">
    <element name="fn">
      <sn:exclusion name="fn"/>
      <sn:exclusion name="foo"/>
      <oneOrMore>
        <ref name="para"/>
      </oneOrMore>
    </element>
  </define>
</grammar>

An RNG validator will ignore the sn:exclusion elements. So how do we make sure that the fn elements in our data have no fn or foo descendants? By using XSLT to convert the custom elements into something that another program can read, understand, and use to validate the constraints.

We've already seen how exclusion exceptions can be checked using Schematron, so the style sheet below converts the sn:exclusion elements to Schematron rules. The first template rule adds the wrapper for a set of Schematron rules to the result tree, and the second adds a Schematron rule to the result tree for each sn:exclusion element that it finds. (The two lines that use the XPath expression "../@name" need slight changes for this style sheet to extract sn:exclusion elements from XSD schemas, as described further on.)

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:st="http://www.ascc.net/xml/schematron"
                xmlns:sn="http://www.snee.com/ns/misc/"
                version="1.0">

  <!-- wrap the generated rules in a Schematron schema and pattern -->
  <xsl:template match="/">
    <st:schema>
      <st:pattern name="schema customizations">
        <xsl:apply-templates/>
      </st:pattern>
    </st:schema>
  </xsl:template>

  <!-- one rule per sn:exclusion; ../@name is the name of the
       element that the exclusion applies to -->
  <xsl:template match="sn:exclusion">
    <st:rule context="{../@name}">
      <st:report test=".//{@name}">
        This <xsl:value-of select="../@name"/> element has
        a <xsl:value-of select="@name"/> element inside of it,
        which is prohibited.
      </st:report>
    </st:rule>
  </xsl:template>

  <!-- keep text nodes found in the schema out of the result -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
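
Run against the RNG excerpt above, the style sheet should produce a result along these lines. This is a sketch with whitespace tidied by hand and the XML declaration left out; the exact indentation and the order of the namespace declarations will depend on your XSLT processor.

<st:schema xmlns:st="http://www.ascc.net/xml/schematron"
           xmlns:sn="http://www.snee.com/ns/misc/">
  <st:pattern name="schema customizations">
    <st:rule context="fn">
      <st:report test=".//fn">
        This fn element has
        a fn element inside of it,
        which is prohibited.
      </st:report>
    </st:rule>
    <st:rule context="fn">
      <st:report test=".//foo">
        This fn element has
        a foo element inside of it,
        which is prohibited.
      </st:report>
    </st:rule>
  </st:pattern>
</st:schema>

The sn namespace declaration shows up on the st:schema element because literal result elements carry along the namespaces declared in the style sheet; it is harmless here.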

This style sheet actually creates something very similar to the Schematron rule set that I showed earlier. Why use customized elements instead of inserting the Schematron rules directly into your schema? There's nothing wrong with adding Schematron rules to your schema, but designing customized elements and converting them to a separate Schematron implementation gives you the flexibility to convert them to an implementation based on something besides Schematron if you wish. I happened to choose Schematron here because I knew that it could achieve the purpose of the sn:exclusion elements.

Perhaps you have an existing data quality checking program designed around data issues specific to your industry. Perhaps you have to write a new, specialized program in your favorite programming language to handle these custom constraints. The main point is that instead of storing the validation constraints that your schema language can handle in a schema and storing other constraints in other metadata configuration files specific to your processes, you can store them in a single schema and then generate the other files as needed. This centralized control makes maintenance of the rules easier to coordinate and track, and XSLT is the tool that lets you put together such a system with a minimum amount of trouble.
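
As a sketch of what a non-Schematron target could look like, the following style sheet writes each exclusion as a line of tab-delimited text instead of as a Schematron rule. The two-column format is an assumption invented for this example; a real checking program would dictate its own input format.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:sn="http://www.snee.com/ns/misc/"
                version="1.0">

  <xsl:output method="text"/>

  <!-- one line per exclusion: containing element, tab, excluded element -->
  <xsl:template match="sn:exclusion">
    <xsl:value-of select="../@name"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="@name"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- suppress all text nodes from the schema itself -->
  <xsl:template match="text()"/>

</xsl:stylesheet>

For the RNG excerpt shown earlier, this writes two lines: fn and fn separated by a tab, then fn and foo separated by a tab.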

Customizing XSD Schemas

RNG lets you add elements from foreign namespaces to just about anywhere in a schema. In an XSD Schema, these must go inside of an xs:appinfo element inside of an xs:annotation element, but because most XSD schema components can include an xs:annotation element, you still have plenty of flexibility in where you put customization elements. The following shows an XSD declaration for the fn element that includes the two sn:exclusion elements which we saw in the RNG excerpt above:

<xs:element name="fn">
  <xs:annotation>
    <xs:appinfo>
      <sn:exclusion name="fn"/>
      <sn:exclusion name="foo"/>
    </xs:appinfo>
  </xs:annotation>
  <xs:complexType>
    <xs:sequence>
      <xs:element maxOccurs="unbounded" ref="para"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

To implement the sn:exclusion elements (that is, to pull them out of this schema and express them as a set of usable Schematron rules), we can use the same style sheet that pulled them out of the RNG schema, with small modifications to the two lines that use the XPath expression "../@name". To find the element that an exclusion applies to, the original version of the style sheet looks at the name attribute of the sn:exclusion element's parent element. Because the XSD version requires the sn:exclusion element to be inside of an xs:appinfo element inside of an xs:annotation element inside of the xs:element element, the XPath expressions must look a few levels higher to find what they need. Changing them to "../../../@name" makes them look in the name attribute of the great-grandparent element for the name of the element that the exclusion applies to.
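
With that change applied, the XSD version of the style sheet might look like the following sketch. It assumes that every sn:exclusion element sits exactly three levels below the xs:element that it applies to, as in the excerpt above.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:st="http://www.ascc.net/xml/schematron"
                xmlns:sn="http://www.snee.com/ns/misc/"
                version="1.0">

  <xsl:template match="/">
    <st:schema>
      <st:pattern name="schema customizations">
        <xsl:apply-templates/>
      </st:pattern>
    </st:schema>
  </xsl:template>

  <!-- ../../../@name climbs from sn:exclusion past xs:appinfo and
       xs:annotation to the name attribute of the xs:element -->
  <xsl:template match="sn:exclusion">
    <st:rule context="{../../../@name}">
      <st:report test=".//{@name}">
        This <xsl:value-of select="../../../@name"/> element has
        a <xsl:value-of select="@name"/> element inside of it,
        which is prohibited.
      </st:report>
    </st:rule>
  </xsl:template>

  <!-- suppress text nodes from the schema (xs:documentation content,
       for example) so that they do not leak into the rule set -->
  <xsl:template match="text()"/>

</xsl:stylesheet>

Run against the XSD declaration above, this should generate the same two rules for fn that the RNG version generates.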

This technique takes advantage of two important properties that RNG and XSD have in common: they're both expressed in XML, and they both allow the addition of elements (and attributes) from foreign namespaces. The storing of RDBMS schemas in the same kinds of tables where we store RDBMS data itself means that you can use many of the same tricks on the schemas that you use on the data; similarly, the fact that RNG and W3C XSD schemas are stored in XML means that you can use XML manipulation tools such as XSLT on the schemas themselves. When combined with the ability to add customized elements to schemas, this gives us great new possibilities in how we define and control our XML data.