Filling in the DTD Gaps with Schematron

May 15, 2002

Many XML developers, just when they've gotten used to DTDs, are hearing about alternatives and wondering what to do with them. W3C schemas, RELAX NG, Schematron -- which should they go with? What will each buy them? What software support does each have? How much of their current systems will they still be able to use? The feeling of unease behind these questions can be summed up with one question: if I leave DTDs behind to use one of the others, will I regret it?

One nice thing about Schematron is its ability to work as an adjunct to the others, including DTDs, so you don't have to leave DTDs behind to take advantage of Schematron. To use Schematron in combination with RELAX NG, Sun's msv validator has an add-on library that lets you check a document against a RELAX NG schema with embedded Schematron rules, but you don't need a combination validator like msv to take advantage of Schematron. There's nothing wrong with checking a document against one type of schema and then checking it against a set of Schematron rules as well. In fact, more and more XML developers are realizing that a pipeline of specialized processes that each check for one class of problems can serve their needs better than a monolithic processor that does most of what they need and several more things that they don't need.

This turns out to be the answer to the prayers of many developers wondering about the best way to move on from DTDs. If you have a working system built around DTDs and want to take advantage of the Schematron features that are unavailable in DTDs, you can go ahead and write the Schematron rules that fill in those gaps and continue using your DTD-based system.

While writing a DTD for PRISM ("Publishing Requirements for Industry Standard Metadata"), I was frustrated to realize that several constraints described in the PRISM specification could not be expressed using XML 1.0 DTDs. RELAX NG could express all of the PRISM spec's constraints, but that wouldn't help me create valid XML documents using Emacs with PSGML. I then realized that a DTD expressing most of what I need would let me create the documents, and a straightforward Schematron schema less than half the length of the DTD could ensure that the document met the additional constraints, and I would have everything I needed.

Exclusive ORs

"Exclusive or" is programmer talk for ensuring that one and only one of a number of conditions is true. For example, the PRISM spec says that the dc:identifier element must have a value as content between the dc:identifier start- and end-tags or as the value of an rdf:resource attribute, but cannot have both. This is easy to express in RELAX NG, where the choice element can specify that one or the other must be there:

<element name="dc:identifier">
  <choice>

    <text/>

    <attribute name="rdf:resource">
      <text/>
    </attribute>

  </choice>
</element>

In a DTD, both the element's content and the rdf:resource attribute have to be declared as optional. The content model can be #PCDATA, because an empty element will still validate against it, and the attribute must be #IMPLIED to show that it's optional:

<!ELEMENT dc:identifier (#PCDATA)>
<!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>
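To make the gap concrete, here is a small sketch of a test document with these two declarations in its internal DTD subset. The samples wrapper element and the identifier values are invented for illustration and are not part of PRISM. All four dc:identifier elements are valid against the DTD, even though only the first two satisfy the PRISM spec:

<?xml version="1.0"?>
<!DOCTYPE samples [
  <!-- "samples" is a hypothetical wrapper, declared only for this test -->
  <!ELEMENT samples (dc:identifier)*>
  <!ATTLIST samples xmlns:dc  CDATA #IMPLIED
                    xmlns:rdf CDATA #IMPLIED>
  <!ELEMENT dc:identifier (#PCDATA)>
  <!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>
]>
<samples xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <!-- content only: allowed by the spec -->
  <dc:identifier>urn:example:article-42</dc:identifier>

  <!-- attribute only: allowed by the spec -->
  <dc:identifier rdf:resource="http://example.com/articles/42"/>

  <!-- both: forbidden by the spec, accepted by the DTD -->
  <dc:identifier rdf:resource="http://example.com/articles/42">urn:example:article-42</dc:identifier>

  <!-- neither: forbidden by the spec, accepted by the DTD -->
  <dc:identifier/>

</samples>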

So how do we ensure that there is either a value for the rdf:resource attribute or a value between the dc:identifier tags, but not both? With a Schematron pattern.

A Schematron pattern can contain assertions, which declare a condition that must be true if there is to be no error message, and reports, which describe problems that, if found, should trigger error messages. The following pattern has two report elements that check for potential problems in dc:identifier elements. The first checks whether the element has both content of more than zero characters and an rdf:resource attribute. The second checks whether the element has neither content nor an rdf:resource attribute. Both report elements include the appropriate error message to output.

These two report elements demonstrate the simple, straightforward way that Schematron lets you pair a description of a condition whose verification can be easily automated with a natural language description of the condition that can be used for intuitive error output. These natural language messages include the Schematron name element, which inserts the name of the element type.

<pattern name="Check exclusive OR of content and rdf:resource attribute">

  <rule context="dc:identifier">

    <report test="(string-length(.) &gt; 0) and (@rdf:resource)"
            diagnostics="resourceAttrVal">
      <name/> element may not have both content and rdf:resource value.
    </report>

    <report test="(string-length(.) = 0) and (not(@rdf:resource))">
      <name/> element must have either content or rdf:resource value.
    </report>

  </rule>

</pattern>

The first report element's diagnostics attribute names a routine to use to provide further information about the problem. The "resourceAttrVal" diagnostic outputs a message that includes the value of the rdf:resource attribute:

<diagnostic id="resourceAttrVal">
  (rdf:resource value "<value-of select='@rdf:resource'/>")
</diagnostic>

This makes it easier to find which dc:identifier element has both content and an rdf:resource value. (There's no point in using a diagnostic with the other report, because an empty element with no rdf:resource value has no useful information to pass along.)
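For readers who haven't seen a complete Schematron schema, the following sketch shows one way the pattern and the diagnostic above could fit together in a single file. It assumes Schematron 1.5 syntax and the usual Dublin Core and RDF namespace URIs; the title text is invented, and the real PRISM schema may be organized differently:

<schema xmlns="http://www.ascc.net/xml/schematron">

  <title>PRISM constraints that the DTD cannot express (sketch)</title>

  <!-- Bind the prefixes used in the XPath expressions below -->
  <ns prefix="dc"  uri="http://purl.org/dc/elements/1.1/"/>
  <ns prefix="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>

  <pattern name="Check exclusive OR of content and rdf:resource attribute">
    <rule context="dc:identifier">

      <report test="(string-length(.) &gt; 0) and (@rdf:resource)"
              diagnostics="resourceAttrVal">
        <name/> element may not have both content and rdf:resource value.
      </report>

      <report test="(string-length(.) = 0) and (not(@rdf:resource))">
        <name/> element must have either content or rdf:resource value.
      </report>

    </rule>
  </pattern>

  <!-- Diagnostics are collected in one container after the patterns -->
  <diagnostics>
    <diagnostic id="resourceAttrVal">
      (rdf:resource value "<value-of select='@rdf:resource'/>")
    </diagnostic>
  </diagnostics>

</schema>

The ns elements matter: without them, the processor has no way to resolve the dc: and rdf: prefixes in the context and test expressions.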

The PRISM spec actually designates many more elements that may have either content or an rdf:resource attribute value. Changing the rule above to account for them all merely means adding them to the context attribute in the rule element's start-tag:

<rule context="dc:identifier | dc:creator | dc:contributor |
               dc:coverage | dc:publisher | dc:rights |
               dc:subject | dc:type | prism:category |
               prism:distributor | prism:event | prism:industry |
               prism:location | prism:object | prism:organization |
               prism:person">

Cardinality and Unordered Content

For my PRISM DTD, I only wanted to declare the constraints and structures described in the PRISM spec. This spec doesn't describe any structural context in which to use these elements, and I didn't want to put all these different metadata elements inside of a single large, flat element holding different collections of metadata serving different purposes. I declared containers for the different functional categories of PRISM elements in a separate DTD of "PRISM containers." For example, I declared a timeConstraints element as a container for the prism:creationTime, prism:expirationTime, prism:modificationTime, and related elements that store timestamp information. This was for the purposes of my own sample data; other PRISM users may use the PRISM elements differently.

The PRISM spec requires that there be no more than one of each of these timestamps. I didn't want to impose a specific order on these elements -- who cares whether prism:creationTime appears before or after prism:modificationTime within the timeConstraints element? This has always been a classic XML modeling problem: DTD syntax offers no way to say that the order of an element's subelements doesn't matter but that each subelement may appear no more than once. For example, if you want element videoRental to have the children customerID, videoID, and rentalDate, you could do it like this:

<!ELEMENT videoRental (customerID, videoID, rentalDate)>

But what if you wanted to allow the three subelements to show up in any order? You would need a repeatable choice group, like this:

<!ELEMENT videoRental (customerID | videoID | rentalDate)*>

Unfortunately, a videoRental element with thirteen customerID subelements and no videoID or rentalDate subelements would be perfectly valid according to this model. SGML DTDs, RELAX NG, and, with some limitations, W3C schemas, all offer syntax to let you say "one of each, in any order," but XML 1.0 doesn't. (This was considered one of the more difficult parts of SGML to implement, so it was dropped from the "SGML Lite" effort originally known as "WebSGML" and later known as "XML.")

How can Schematron fill in this gap in XML 1.0 DTD syntax? By letting you check on the cardinality, or number of occurrences of a given subelement, Schematron can catch the potential problems when a DTD uses a repeatable choice group to provide flexibility in the ordering of an element's subelements.

The following Schematron rule, when used with the videoRental choice-group declaration shown above, allows the videoRental element's subelements to appear in any order while still ensuring that each appears exactly once. Instead of report elements, which report on something that shouldn't happen but did, this Schematron rule has assert elements, which assert conditions that must be true. A Schematron processor outputs an assert element's natural language message if the condition in its test attribute is not true.

<rule context="videoRental">

  <assert test="count(videoID) = 1">
    Each videoRental element must have one videoID subelement.
  </assert>

  <assert test="count(rentalDate) = 1">
    Each videoRental element must have one rentalDate subelement.
  </assert>

  <assert test="count(customerID) = 1">
    Each videoRental element must have one customerID subelement.
  </assert>

</rule>
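As a quick illustration of what this rule catches, here is a sketch of a test document. The rentals wrapper element and the ID and date values are made up for the example. Both videoRental elements are valid against the choice-group declaration, but only the first passes all three assertions:

<?xml version="1.0"?>
<!DOCTYPE rentals [
  <!-- "rentals" is a hypothetical wrapper, declared only for this test -->
  <!ELEMENT rentals (videoRental)*>
  <!ELEMENT videoRental (customerID | videoID | rentalDate)*>
  <!ELEMENT customerID (#PCDATA)>
  <!ELEMENT videoID (#PCDATA)>
  <!ELEMENT rentalDate (#PCDATA)>
]>
<rentals>

  <!-- One of each child, in an order the sequence model would reject:
       valid against the DTD, and no Schematron messages -->
  <videoRental>
    <rentalDate>2002-05-15</rentalDate>
    <customerID>c1041</customerID>
    <videoID>v0272</videoID>
  </videoRental>

  <!-- Two customerID children and no videoID: valid against the DTD,
       but the videoID and customerID assertions both fail -->
  <videoRental>
    <customerID>c1041</customerID>
    <customerID>c2210</customerID>
    <rentalDate>2002-05-15</rentalDate>
  </videoRental>

</rentals>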

For my PRISM timeConstraints container, I put an element type declaration into prismCont.dtd that has a similar structure to the second element declaration for videoRental above:

<!ELEMENT timeConstraints (prism:creationTime | prism:expirationTime |
                           prism:modificationTime | prism:publicationTime |
                           prism:releaseTime | prism:receptionTime)*>

I could have written a rule with "timeConstraints" as its context that counted the occurrences of timeConstraints children, but I wanted my Schematron rules to work with all PRISM documents and not just those that used the container elements defined in my prismCont supplement to the PRISM DTD. So, I wrote a Schematron pattern for each timeConstraints subelement that follows this model:

<pattern name="Check prism:creationTime occurrences">

  <rule context="*[prism:creationTime]">
    <report test="count(prism:creationTime) &gt; 1">
      Only one creationTime subelement allowed.
    </report>
  </rule>

</pattern>

This rule looks a bit different from the videoRental example of cardinality checking because the timestamp elements are optional -- instead of ensuring that there is exactly one of each, we only need to check that there is no more than one of each. This prism:creationTime rule has a context expression telling the Schematron processor to check elements of any name (*) that have a prism:creationTime subelement. If such an element has more than one, the processor reports "Only one creationTime subelement allowed" as an error message. When a Schematron processor uses these patterns to check a document that is valid against prismCont.dtd, an error-free run tells us that each timeConstraints element has no more than one of each of its children. (In addition to the count() function used here, all the functions listed in the Additional Functions section of the XSLT Recommendation are at Schematron's disposal.)
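Because the context expression already guarantees at least one prism:creationTime child, the same check can also be phrased as an assertion. The following is a sketch of an equivalent pattern, not one taken from the PRISM schema itself:

<pattern name="Check prism:creationTime occurrences (assert form)">

  <!-- The context only matches elements with at least one creationTime
       child, so "exactly one" here means "no more than one" -->
  <rule context="*[prism:creationTime]">
    <assert test="count(prism:creationTime) = 1">
      Only one creationTime subelement allowed.
    </assert>
  </rule>

</pattern>

Whether to use report or assert here is largely a matter of which reads more naturally: report states the problem, assert states the requirement.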

Controlling Vocabularies

According to the PRISM spec, a prl:usage element's rdf:resource attribute should have a value of "#none", "#use", "#notApplicable", or "#permissionsUnknown". (These are relative URI references to more detailed descriptions of these values in documents referenced by the base URIs in table 12 of the PRISM spec.) Can an XML 1.0 DTD ensure that a prl:usage element always has one of these four values?

Almost, but not quite. One option for specifying an attribute type in a DTD is an enumerated type, in which the attribute declaration lists name tokens and a validating parser makes sure that values for that attribute are from the list. For example, the color attribute for a shirt element might be declared like this:

<!ATTLIST shirt color ( red | green | blue ) #REQUIRED>

The problem is the "name tokens" part. The XML 1.0 Recommendation defines a name token as any mixture of name characters, which includes letters, digits, and a few punctuation characters. It excludes spaces, which means that "navy blue" could not be included as a potential shirt color above. It also excludes the pound sign (#), which prevents the use of "#none", "#use", and the other potential values of the PRISM prl:usage element's rdf:resource attribute.
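With an enumerated type ruled out, the closest a DTD can come is a plain CDATA attribute. The declaration below is a sketch of that fallback (the actual declaration in the PRISM DTD isn't reproduced in this article, and #IMPLIED is an assumption); it documents that the attribute exists but lets any string through as its value:

<!-- CDATA accepts "#none" and "#use", but also any misspelling of them -->
<!ATTLIST prl:usage rdf:resource CDATA #IMPLIED>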

This is not a problem for Schematron. A rule element can easily check whether each prl:usage element's rdf:resource attribute has one of the values allowed by the spec:

<pattern name="Check for valid values in prl:usage element's
               rdf:resource attribute">
  <rule context="prl:usage[@rdf:resource]">
    <assert test="(normalize-space(@rdf:resource) = '#none') or
                  (normalize-space(@rdf:resource) = '#use') or
                  (normalize-space(@rdf:resource) = '#notApplicable') or
                  (normalize-space(@rdf:resource) = '#permissionsUnknown')"
            diagnostics="resourceAttrVal">
      Section 3.6 of PRISM 1.1 spec: rdf:resource attribute of prl:usage element
      must have #none, #use, #notApplicable, or #permissionsUnknown as value.
    </assert>
  </rule>
</pattern>

The same "resourceAttrVal" diagnostic used to check dc:identifier elements earlier will add the illegal attribute value to the error message, making the offending element easier to track down. (Schematron can take advantage of XSLT engines that return line numbers to make it even easier to track down problems, but to my knowledge only XT does this for now.) The normalize-space() function strips leading and trailing whitespace from the @rdf:resource value, and collapses any internal runs of whitespace to a single space, before the value is compared to each legal value.

A "controlled vocabulary" refers to a preset list of values that may have been compiled for a specific application or may already be a well-known standard such as the ISO 3166 list of country codes. Requiring that an attribute or element value come from a controlled vocabulary adds consistency to data, making it easier for applications to use. A Schematron rule can have a hardcoded list of values to use, like the rule above, or it can use the XSLT document() and key() functions to check an external document for legal values to use.

Note that I said attribute or element value -- with DTDs, the limited enumeration possible for attribute values can't be done at all with element declarations.
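The external lookup mentioned above can be sketched as follows. The file name usageValues.xml and its usageValues/value structure are hypothetical, invented for this example. The list of legal values lives in its own small document:

<?xml version="1.0"?>
<!-- usageValues.xml: hypothetical controlled vocabulary file -->
<usageValues>
  <value>#none</value>
  <value>#use</value>
  <value>#notApplicable</value>
  <value>#permissionsUnknown</value>
</usageValues>

A rule can then compare the attribute value against every value element in that file. In XPath 1.0, comparing a string to a node-set with = is true if any node in the set has that string value, so a single comparison covers the whole list:

<pattern name="Check prl:usage rdf:resource against an external list">
  <rule context="prl:usage[@rdf:resource]">
    <!-- document() loads the vocabulary file; a relative URI is assumed
         to resolve relative to the stylesheet that runs the check -->
    <assert test="normalize-space(@rdf:resource) =
                  document('usageValues.xml')/usageValues/value"
            diagnostics="resourceAttrVal">
      rdf:resource attribute of prl:usage element must have a value
      listed in usageValues.xml.
    </assert>
  </rule>
</pattern>

The advantage over the hardcoded version is that the vocabulary can change without anyone touching the Schematron schema. Using an element as the rule context, with normalize-space(.) in place of the attribute reference, should apply the same check to element content.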

Great for Now

It would be nice to specify some data typing constraints and see them enforced. For example, using my DTD/Schematron combination I can't specify that the prism:contentLength value must always be an integer or that the date values described earlier must be in ISO 8601 format. (Well, I actually could use some XSLT functions to put together constraints that do some basic type checking better than XML 1.0 DTDs can, but I want a simpler way to do it.) Once XPath 2.0 is implemented in some XSLT processors, I should be able to do type checking properly without moving beyond my Schematron+DTD combination. And I do have a lot right now: all the new possibilities of Schematron for specifying data constraints without giving up any of the features of DTDs and the extensive support available for them. Schematron support only requires an XSLT processor, and there are plenty of those around. The fact that all of the examples in this article with the exception of the videoRental and shirt elements were real problems that I had to solve for a project unrelated to this article made it clear to me: Schematron can add a lot to XML-based systems currently in production or in development without forcing us to leave DTDs behind until we're good and ready to.
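For the curious, the kind of basic type checking that XPath 1.0 functions already allow might look like the following sketch. It is an illustration rather than a rule from the PRISM schema, and it only checks that prism:contentLength contains nothing but digits:

<pattern name="Check that prism:contentLength is a non-negative integer">
  <rule context="prism:contentLength">
    <!-- translate() deletes every digit; if anything is left over,
         the value contained a non-digit character -->
    <assert test="string-length(normalize-space(.)) &gt; 0 and
                  translate(normalize-space(.), '0123456789', '') = ''">
      <name/> element must contain only digits.
    </assert>
  </rule>
</pattern>

It works, but a check for ISO 8601 dates written the same way would be much longer, which is why a simpler way to express such constraints would be welcome.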