
Web Content Validation with XML::Schematron
Introduction
A fair part of the Web's initial popularity was based on the relative simplicity of HTML authoring. Love it or hate it, HTML offered a standard, ubiquitous markup language that one could expect would be viewable as more or less intended by anyone requesting the document. This ubiquity made web-based applications possible. By having a common, albeit limited, language from which to build user interfaces, client-server applications could often abandon the use of platform- and application-specific client-side executables in favor of accessing data and logic on the server through CGI or a Web server extension.
The importance of HTML's ubiquity in web applications is especially noticeable in the class of applications I'll call "in-browser content editors". The details vary widely, but the basic interface and functionality are the same: there is a section of the page that contains a largish <textarea> for entering HTML markup and a preview area where that markup is displayed. When the form is submitted the preview section is updated. What makes this type of application so popular is that it is drop-dead easy to implement. The same markup entered in the textarea is printed as-is in the preview section; and since you are using an HTML browser to view the HTML content you're authoring, if the contents of the preview section look right then you can be reasonably sure that the document that contains that markup will look right. That is, the validation of the markup content is handled implicitly by virtue of using an application that is specifically designed to render that markup in a predictable way.
Choosing XML to mark up web content knocks that implicit validation into a cocked hat. With the exception of XHTML, XML languages are completely foreign to HTML browsers. You may get a nice colorized tree representing an entire XML document in some, but that is a far cry from the "if it renders correctly here, it will render correctly most anywhere" that goes along with checking HTML markup in an HTML browser.
How then do you ensure that the XML content being authored is correct? There are DTDs, which can be used with validating parsers, but DTDs require that the entire content model be explicitly described, which can be tricky for mixed content (e.g., elements that can contain both character data and other elements). There are W3C Schemas, but there, too, the entire model must be described, and the technology itself seems a bit biased toward the stricter "data transfer" uses of XML rather than the looser models that characterize human communication. DTDs and W3C Schemas have their place, but the learning curve involved in getting them right in order to provide a useful level of content validation makes their use for most applications impractical.
Enter the Schematron. Created by Rick Jelliffe, Schematron is a simple XML application language designed to make validating the structures of XML documents as straightforward and painless as possible. It uses the XPath syntax to define a series of rules that should or should not be true about a given document's structure. Those rules, and the context in which they are evaluated, can be as coarse or as finely-grained as the task at hand requires. Content models may be open or closed; you can declare a document structurally valid based on a single all-important rule; or you can create rules for each and every element and attribute that may appear in the document -- the choice is yours.
This month we will be looking at the Perl implementation of the
Schematron: my XML::Schematron.
Writing Schematron Schemas
Before we dig into XML::Schematron let's take a quick
look at a Schematron schema. The basic rules for writing schemas
are very simple:
- The schema will contain a single top-level <schema> element.
- The <schema> element will contain one or more <pattern> elements.
- Each <pattern> element will contain one or more <rule> elements.
- Each <rule> element will contain a context attribute consisting of an XPath expression that provides the context for evaluation, and a mix of one or more <assert> or <report> elements.
- Each <assert> element will contain a test attribute consisting of an XPath expression, and text content containing a descriptive message that will be delivered to the user if the expression contained by the test attribute evaluates to false.
- Each <report> element will contain a test attribute consisting of an XPath expression, and text content containing a descriptive message that will be delivered to the user if the expression contained by the test attribute evaluates to true.
Let's look at a sample schema to see how these rules take shape.
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Example HTML Schematron Schema">
    <rule context="/">
      <assert test="html">
        The root element of an HTML page must be named 'html'.
      </assert>
    </rule>
After declaring the top-level <schema> element, we create a
single <pattern> element. Patterns allow for the logical
grouping of tests but our needs are modest in this case so we'll
have only one. Next, we have a <rule> element with the
required context attribute. This attribute takes an
XPath expression that provides the context in which the enclosed
<assert> and <report> tests will be evaluated. In this
case, the context is "/", the abstract root of the document. Within
that rule we have a single <assert> element with the required
test attribute. Here, too, the attribute takes an XPath
expression. The expression in an assert element says, in essence,
"here is some test expression that should be true within
the context provided by the enclosing rule, but if it evaluates to
false, I'll print my warning message". In this example we
are checking the document being validated for the presence of an
<html> element in the context of the abstract root. If that is
not the case, if the top-level element were called something else,
the text contained by the assert element would be delivered to the user
as an indication that the rule failed.
    <rule context="html">
      <report test="count(*) != count(head | body)">
        The html element may only contain head and body elements.
      </report>
      <assert test="count(body) = 1">
        The html element must contain a single body element.
      </assert>
    </rule>
After checking in the previous rule that the top-level element is
named 'html', we define a rule with that element as the
context so that we may examine its contents. Like the
<assert> element, the <report> element requires a
test attribute that takes an XPath expression. The
difference is that the test in an assert element contains an expression
that should evaluate to true in the given context for the structure
to be valid; a report element's test expression creates a validity
rule that should evaluate to false in the given context for
it to pass. Here, we want to ensure that the <html> element
contains only <head> and <body> elements so we create a
report test that contains the XPath expression count(*) !=
count(head | body); or, in English, "the number of all child
elements, regardless of name, is not equal to the number of child
elements named 'head' and 'body'". Remember, this is a
report element, so the test expression should evaluate to
false for the structure to be valid.
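Because an assert is effectively a report with its test negated, the same constraint can be phrased either way. The fragment below is a minimal sketch that restates the report above as an assert. The inverted test expression is an illustration only and does not appear in the sample schema.

```xml
<!-- Illustrative only: equivalent to the report above, with the XPath test inverted -->
<rule context="html">
  <assert test="count(*) = count(head | body)">
    The html element may only contain head and body elements.
  </assert>
</rule>
```

Which form to use is largely a matter of which one reads more naturally for the rule at hand.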
Next, we create an <assert> with the test expression
count(body) = 1. This ensures that the <html>
element contains a <body> element; but only one, since having
multiple body sections in a document is likely to drive browsers
crazy.
Note that the combination of these two tests creates an open content model. That is, both <head> and <body> elements are allowed, but only the <body> element is required to pass our definition of structural validity.
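To see how the two tests in this rule interact, consider a few small documents. These are hypothetical examples made up for illustration, and the comments describe what the XPath expressions above would evaluate to for each one. The first document passes both tests, since <head> is optional and there is exactly one <body>:

```xml
<html>
  <body>
    <p>Hello</p>
  </body>
</html>
```

The second triggers the report, because count(*) is 3 while count(head | body) is 2:

```xml
<html>
  <head><title>Example</title></head>
  <body/>
  <!-- div is neither head nor body, so the report's test is true -->
  <div/>
</html>
```

The third contains only allowed children, so the report stays silent, but the assert fails because count(body) is 2:

```xml
<html>
  <head><title>Example</title></head>
  <body/>
  <body/>
</html>
```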
  </pattern>
</schema>
Finally, we close the <pattern> and <schema> elements to complete the schema.
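For reference, here is the schema assembled from the fragments discussed above. Apart from indentation added for readability, nothing in it is new.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Example HTML Schematron Schema">
    <!-- The document's top-level element must be html -->
    <rule context="/">
      <assert test="html">
        The root element of an HTML page must be named 'html'.
      </assert>
    </rule>
    <!-- html may hold only head and body, and exactly one body -->
    <rule context="html">
      <report test="count(*) != count(head | body)">
        The html element may only contain head and body elements.
      </report>
      <assert test="count(body) = 1">
        The html element must contain a single body element.
      </assert>
    </rule>
  </pattern>
</schema>
```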
This basic schema only hints at Schematron's power. Any valid XPath expression that can be evaluated as true or false can be used to test a document's structure. For example,
<rule context="a">
  <assert test="@href or @name">
    An a element must contain either an 'href' or
    'name' attribute.
  </assert>
</rule>
creates a rule that ensures all <a> elements contain either a name or
href attribute. And
<rule context="mytag">
  <assert test="@boolean='true' or @boolean='false'">
    The mytag element's boolean attribute must be
    set to either true or false.
  </assert>
</rule>
verifies that the <mytag> element's boolean
attribute contains either true or false.
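One detail worth noting about this last test: it also fails when the boolean attribute is missing altogether, since neither comparison can be true for an absent attribute. If separate messages are wanted for the two cases, the rule could be split along the lines of the sketch below. The element and attribute names are carried over from the example above. The split itself and the wording of the first message are illustrative assumptions.

```xml
<rule context="mytag">
  <!-- Fails only when the attribute is absent -->
  <assert test="@boolean">
    The mytag element must have a boolean attribute.
  </assert>
  <!-- Fires only when the attribute is present but holds an unexpected value -->
  <report test="@boolean and not(@boolean='true' or @boolean='false')">
    The mytag element's boolean attribute must be
    set to either true or false.
  </report>
</rule>
```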
Now that you have a basic working overview of Schematron, let's get down to business.