Filling in the DTD Gaps with Schematron
Many XML developers, just when they've gotten used to DTDs, are hearing about alternatives and wondering what to do with them. W3C schemas, RELAX NG, Schematron -- which should they go with? What will each buy them? What software support does each have? How much of their current systems will they still be able to use? The feeling of unease behind these questions can be summed up with one question: if I leave DTDs behind to use one of the others, will I regret it?
One nice thing about Schematron is its ability to work as an adjunct to the others, including DTDs, so you don't have to leave DTDs behind to take advantage of it. To use Schematron in combination with RELAX NG, Sun's Multi-Schema Validator (msv) has an add-on library that lets you check a document against a RELAX NG schema with embedded Schematron rules. You don't need a combination validator like msv, though. There's nothing wrong with checking a document against one type of schema and then checking it against a set of Schematron rules as well. In fact, more and more XML developers are realizing that a pipeline of specialized processes, each checking for one class of problems, can serve their needs better than a monolithic processor that does most of what they need plus several things they don't.
This turns out to be the answer to the prayers of many developers wondering about the best way to move on from DTDs. If you have a working system built around DTDs and want to take advantage of the Schematron features that are unavailable in DTDs, you can go ahead and write the Schematron rules that fill in those gaps and continue using your DTD-based system.
While writing a DTD for PRISM ("Publishing Requirements for Industry Standard Metadata"), I was frustrated to realize that several constraints described in the PRISM specification could not be expressed using XML 1.0 DTDs. RELAX NG could express all of the PRISM spec's constraints, but that wouldn't help me create valid XML documents using Emacs with PSGML. I then realized that a DTD expressing most of what I need would let me create the documents, and a straightforward Schematron schema less than half the length of the DTD could ensure that the document met the additional constraints, and I would have everything I needed.
"Exclusive or" is programmer talk for ensuring that one and only one of a number of conditions is true. For example, the PRISM spec says that the dc:identifier element must have a value as content between the dc:identifier start- and end-tags or as the value of an rdf:resource attribute, but cannot have both. This is easy to express in RELAX NG, where the choice element can specify that one or the other must be there:
<element name="dc:identifier">
  <choice>
    <text/>
    <attribute name="rdf:resource">
      <text/>
    </attribute>
  </choice>
</element>
In a DTD, where either the element's content or the rdf:resource attribute may or may not be there, the content model can be #PCDATA, because an empty element will still validate against it. The attribute must be #IMPLIED to show that it's optional:
<!ELEMENT dc:identifier (#PCDATA)>
<!ATTLIST dc:identifier
          rdf:resource CDATA #IMPLIED>
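To make the gap concrete, the following sketch shows four dc:identifier elements. The two declarations above accept all four, while the PRISM constraint allows only the first two. The examples wrapper element and the identifier values are invented for illustration (a real DTD would also have to declare the wrapper and the namespace attributes); the namespace URIs are the standard Dublin Core and RDF ones.

```xml
<!-- Hypothetical wrapper element, present only to carry the namespace declarations -->
<examples xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <!-- Content only: satisfies the PRISM constraint -->
  <dc:identifier>http://example.com/articles/42</dc:identifier>

  <!-- rdf:resource only: satisfies the PRISM constraint -->
  <dc:identifier rdf:resource="http://example.com/articles/42"/>

  <!-- Both content and rdf:resource: accepted by the DTD, forbidden by PRISM -->
  <dc:identifier rdf:resource="http://example.com/articles/42">http://example.com/articles/42</dc:identifier>

  <!-- Neither content nor rdf:resource: accepted by the DTD, forbidden by PRISM -->
  <dc:identifier/>

</examples>
```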
So how do we ensure that there is either a value for the rdf:resource attribute or a value between the dc:identifier tags, but not both? With a Schematron pattern.
A Schematron pattern can contain assertions, which declare a condition that must be true if there is to be no error message, and reports, which describe problems that, if found, should trigger error messages. The following pattern has two report elements that check for potential problems in dc:identifier elements. The first checks whether the element has both content of more than zero characters and an rdf:resource attribute. The second checks whether the element has neither content nor an rdf:resource attribute. Each report element includes the error message to output.
These two report elements show how Schematron pairs a condition that is easy to verify automatically (the test attribute) with a natural-language description of the problem that can be used for intuitive error output. The messages include the Schematron name element, which inserts the name of the element type.
<pattern name="Check exclusive OR of content and rdf:resource attribute">
  <rule context="dc:identifier">
    <report test="(string-length(.) > 0) and (@rdf:resource)"
            diagnostics="resourceAttrVal">
      <name/> element may not have both content and rdf:resource value.
    </report>
    <report test="(string-length(.) = 0) and (not(@rdf:resource))">
      <name/> element must have either content or rdf:resource value.
    </report>
  </rule>
</pattern>
The first report element's diagnostics attribute names a routine to use to provide further information about the problem. The "resourceAttrVal" diagnostic outputs a message that includes the value of the rdf:resource attribute:
<diagnostic id="resourceAttrVal">
  (rdf:resource value "<value-of select='@rdf:resource'/>")
</diagnostic>
This makes it easier to find which dc:identifier element has both content and an rdf:resource value. (There's no point in using a diagnostic with the other report, because an empty element with no rdf:resource value has no useful information to pass along.)
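Because a pattern can hold assertions as well as reports, the same exclusive-or constraint could also be sketched as a single assert whose test must be true for every dc:identifier element. This is an alternative formulation, not the one used in this article. It relies on XPath 1.0 comparing two boolean values with the != operator, and the pattern name is made up. Like the pattern above, it is a fragment that belongs inside a Schematron schema element.

```xml
<pattern name="Exclusive OR as a single assertion">
  <rule context="dc:identifier">
    <!-- The test is true only when exactly one of the two conditions holds:
         the element has non-empty content, or it has an rdf:resource attribute. -->
    <assert test="(string-length(.) > 0) != boolean(@rdf:resource)">
      <name/> element must have either content or an rdf:resource value, but not both.
    </assert>
  </rule>
</pattern>
```

The trade-off is that one assertion yields one generic message, whereas the two reports can say which of the two mistakes was made and attach a diagnostic to only one of them.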
The PRISM spec actually designates many more elements that may have either content or an rdf:resource attribute value. Changing the rule above to account for them all merely means adding them to the context attribute in the rule element's start-tag:
<rule context="dc:identifier | dc:creator | dc:contributor |
               dc:coverage | dc:publisher | dc:rights | dc:subject |
               dc:type | prism:category | prism:distributor |
               prism:event | prism:industry | prism:location |
               prism:object | prism:organization | prism:person">
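For context, here is a sketch of how the pieces shown so far might fit together in one complete schema. It assumes the Schematron 1.5 namespace and the standard Dublin Core and RDF namespace URIs. The title text is invented, and the context lists only three of the dc elements so that no PRISM namespace declaration is needed.

```xml
<schema xmlns="http://www.ascc.net/xml/schematron">

  <!-- Invented title, for illustration only -->
  <title>Constraints that the DTD cannot express</title>

  <!-- Bind the prefixes used in the context and test expressions -->
  <ns prefix="dc" uri="http://purl.org/dc/elements/1.1/"/>
  <ns prefix="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>

  <pattern name="Check exclusive OR of content and rdf:resource attribute">
    <!-- Shortened context: three of the elements named in the article -->
    <rule context="dc:identifier | dc:creator | dc:publisher">
      <report test="(string-length(.) > 0) and (@rdf:resource)"
              diagnostics="resourceAttrVal">
        <name/> element may not have both content and rdf:resource value.
      </report>
      <report test="(string-length(.) = 0) and (not(@rdf:resource))">
        <name/> element must have either content or rdf:resource value.
      </report>
    </rule>
  </pattern>

  <!-- Diagnostics are collected here and referenced by id from the reports -->
  <diagnostics>
    <diagnostic id="resourceAttrVal">
      (rdf:resource value "<value-of select='@rdf:resource'/>")
    </diagnostic>
  </diagnostics>

</schema>
```

Keeping the diagnostic in its own diagnostics section means several reports or assertions can refer to the same id without repeating the message text.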