Web Content Validation with XML::Schematron

January 23, 2002

Introduction

A fair part of the Web's initial popularity was based on the relative simplicity of HTML authoring. Love it or hate it, HTML offered a standard, ubiquitous markup language that one could expect would be viewable as more or less intended by anyone requesting the document. This ubiquity made web-based applications possible. By having a common, albeit limited, language from which to build user interfaces, client-server applications could often abandon the use of platform- and application-specific client-side executables in favor of accessing data and logic on the server through CGI or a Web server extension.

The importance of HTML's ubiquity in web applications is especially noticeable in the class of applications I'll call "in-browser content editors". The details vary widely, but the basic interface and functionality are the same: there is a section of the page that contains a largish <textarea> for entering HTML markup and a preview area where that markup is displayed. When the form is submitted, the preview section is updated. What makes this type of application so popular is that it is drop-dead easy to implement. The same markup entered in the textarea is printed as-is in the preview section; and since you are using an HTML browser to view the HTML content you're authoring, if the contents of the preview section look right, then you can be reasonably sure that the document that contains that markup will look right. That is, the validation of the markup content is handled implicitly by virtue of using an application that is specifically designed to render that markup in a predictable way.

Choosing XML to mark up web content knocks that implicit validation into a cocked hat. With the exception of XHTML, XML languages are completely foreign to HTML browsers. You may get a nice colorized tree representing an entire XML document in some, but that is a far cry from the "if it renders correctly here, it will render correctly most anywhere" that goes along with checking HTML markup in an HTML browser.

How then do you ensure that the XML content being authored is correct? There are DTDs, which can be used with validating parsers, but DTDs require that the entire content model be explicitly described, which can be tricky for mixed content (e.g., elements that can contain both character data and other elements). There are W3C Schemas, but there, too, the entire model must be described, and the technology itself seems a bit biased toward the stricter "data transfer" uses of XML rather than the looser models that characterize human communication. DTDs and W3C Schemas have their place, but the learning curve involved in getting them right in order to provide a useful level of content validation makes their use impractical for most applications.

Enter the Schematron. Created by Rick Jelliffe, Schematron is a simple XML application language designed to make validating the structures of XML documents as straightforward and painless as possible. It uses the XPath syntax to define a series of rules that should or should not be true about a given document's structure. Those rules, and the context in which they are evaluated, can be as coarse or as finely-grained as the task at hand requires. Content models may be open or closed; you can declare a document structurally valid based on a single all-important rule; or you can create rules for each and every element and attribute that may appear in the document -- the choice is yours.

This month we will be looking at the Perl implementation of the Schematron: my XML::Schematron.

Writing Schematron Schemas

Before we dig into XML::Schematron let's take a quick look at a Schematron schema. The basic rules for writing schemas are very simple:

The schema will contain a single top-level <schema> element.
The <schema> element will contain one or more <pattern> elements.
Each <pattern> element will contain one or more <rule> elements.
Each <rule> element will contain a context attribute consisting of an XPath expression that provides the context for evaluation, and a mix of one or more <assert> or <report> elements.
Each <assert> element will contain a test attribute consisting of an XPath expression, and text content containing a descriptive message that will be delivered to the user if the expression contained by the test attribute evaluates to false.
Each <report> element will contain a test attribute consisting of an XPath expression, and text content containing a descriptive message that will be delivered to the user if the expression contained by the test attribute evaluates to true.

Let's look at a sample schema to see how these rules take shape.


<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Example HTML Schematron Schema">
    <rule context="/">
      <assert test="html">
        The root element of an HTML page must be named 'html'.
      </assert>
    </rule>

After declaring the top-level <schema> element, we create a single <pattern> element. Patterns allow for the logical grouping of tests, but our needs are modest in this case so we'll have only one. Next, we have a <rule> element with the required context attribute. This attribute takes an XPath expression that provides the context in which the enclosed <assert> and <report> tests will be evaluated. In this case, the context is "/", the abstract root of the document. Within that rule we have a single <assert> element with the required test attribute. Here, too, the attribute takes an XPath expression. The expression in an assert element says, in essence, "here is some test expression that should be true within the context provided by the enclosing rule, but if it evaluates to false, I'll print my warning message". In this example we are checking the document being validated for the presence of an <html> element in the context of the abstract root. If that is not the case, that is, if the top-level element were called something else, the text contained by the assert element would be delivered to the user as an indication that the rule failed.


    <rule context="html">
      <report test="count(*) != count(head | body)">
        The html element may only contain head and body elements.
      </report>
      <assert test="count(body) = 1">
        The html element must contain a single body element.
      </assert>
    </rule>

After checking in the previous rule that the top-level element is named 'html', we define a rule with that element as the context so that we may examine its contents. Like the <assert> element, the <report> element requires a test attribute that takes an XPath expression. The difference is that the test in an assert element contains an expression that should evaluate to true in the given context for the structure to be valid; a report element's test expression creates a validity rule that should evaluate to false in the given context for it to pass. Here, we want to ensure that the <html> element contains only <head> and <body> elements, so we create a report test that contains the XPath expression count(*) != count(head | body); or, in English, "the number of all child elements, regardless of name, is not equal to the number of child elements named 'head' and 'body'". Remember, this is a report element, so the test expression should evaluate to false for the structure to be valid.

Next, we create an <assert> with the test expression count(body) = 1. This ensures that the <html> element contains a <body> element; but only one, since having multiple body sections in a document is likely to drive browsers crazy.

Note that the combination of these two tests creates an open content model. That is, both <head> and <body> elements are allowed, but only the <body> element is required to pass our definition of structural validity.


  </pattern>
</schema>

Finally, we close the <pattern> and <schema> elements to complete the schema.
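To make the assert/report distinction concrete, here is a minimal sketch that evaluates the two tests from the html rule by hand against three sample documents. It does not use XML::Schematron at all; it assumes XML::LibXML is installed, and the sample documents and the xpath_true helper are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample documents, keyed by a descriptive name.
my %samples = (
    stray_child => '<html><head/><body/><div/></html>',
    no_body     => '<html><head/></html>',
    valid       => '<html><head/><body/></html>',
);

# Evaluate an XPath test with $node as the context and return 1 or 0.
# Wrapping the test in boolean() mirrors how a test attribute is judged.
sub xpath_true {
    my ( $node, $test ) = @_;
    return $node->find("boolean($test)")->value ? 1 : 0;
}

my $parser = XML::LibXML->new;
for my $name ( sort keys %samples ) {
    my $html = $parser->parse_string( $samples{$name} )->documentElement;

    # A <report> delivers its message when the test is TRUE.
    print "$name: html may only contain head and body\n"
        if xpath_true( $html, 'count(*) != count(head | body)' );

    # An <assert> delivers its message when the test is FALSE.
    print "$name: html must contain a single body\n"
        unless xpath_true( $html, 'count(body) = 1' );
}
```

Run as written, the stray_child sample should trigger only the report message, the no_body sample only the assert message, and the valid sample neither.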

This basic schema only hints at Schematron's power. Any valid XPath expression that can be evaluated as true or false can be used to test a document's structure. For example,


<rule context="a">
  <assert test="@href or @name">
    An a element must contain either an 'href' or
    'name' attribute.
  </assert>
</rule>

creates a rule that ensures all <a> elements contain either a name or href attribute. And


<rule context="mytag">
  <assert test="@boolean='true' or @boolean='false'">
    The mytag element's boolean attribute must be
    set to either true or false.
  </assert>
</rule>

verifies that the <mytag> element's boolean attribute contains either true or false.

Now that you have a basic working overview of Schematron, let's get down to business.

Using XML::Schematron

Basic usage of XML::Schematron is very simple and best shown by example. The following script takes a path to a Schematron schema and an XML document and prints any validation errors to STDOUT:


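#!/usr/bin/perl -w
use strict;
use XML::Schematron::LibXSLT;

# Paths to the Schematron schema and the XML document to check.
my $schema_file = $ARGV[0];
my $xml_file    = $ARGV[1];

die "Usage: perl schematron.pl schemafile XMLfile\n"
    unless defined $schema_file and defined $xml_file;

my $tron = XML::Schematron::LibXSLT->new();

# Load the schema, then validate the document against it.
$tron->schema($schema_file);
my $ret = $tron->verify($xml_file);

# Print only when there is something to report, so a valid document is silent.
print $ret . "\n" if defined $ret and length $ret;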

After collecting the filenames from the command line, this script creates a new instance of XML::Schematron::LibXSLT, then sets the schema to use for validation using that object's schema method, validates the XML file using the verify method, and prints any results to standard output. If the script runs silently, then the document in question is structurally valid by the definition provided by the schema.

Careful readers will have noticed that we imported the Perl Schematron library with use XML::Schematron::LibXSLT; rather than use XML::Schematron;. The reason is that the Schematron module actually ships with several backends that can be chosen based on the type of processor that you want to use. Schematron's secret is that it's most often implemented as an XSLT stylesheet: the Schematron stylesheet is applied to the schema, and the result of that transformation is applied as a stylesheet to the document being validated. The same is true of most flavors of XML::Schematron, except that the stylesheet is created dynamically and all the details are hidden from view. Currently, the Sablotron and LibXSLT processors are supported, but if you do not have or want an XSLT processor installed, you may use XML::Schematron::XPath, a pure Perl implementation built upon Matt Sergeant's XML::XPath.
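Because the backends are chosen by class name, a script can fall back from one to another at runtime. The following is only a sketch under a few assumptions: that XML::Schematron::XPath offers the same new, schema, and verify methods as the LibXSLT flavor, and that verify returns a list of messages in list context (as the editor example later in this article relies on). The new_validator helper and the validate.pl script name are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Try each backend class in order of preference and return a validator
# object built from the first one that loads.
sub new_validator {
    for my $class (qw( XML::Schematron::LibXSLT XML::Schematron::XPath )) {
        ( my $file = "$class.pm" ) =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
        return $class->new() if eval { require $file; 1 };
    }
    die "No XML::Schematron backend could be loaded\n";
}

# Usage: perl validate.pl schemafile XMLfile
if ( @ARGV == 2 ) {
    my ( $schema_file, $xml_file ) = @ARGV;
    my $tron = new_validator();
    $tron->schema($schema_file);

    # List context: assumed to yield one entry per message.
    my @messages = $tron->verify($xml_file);
    print "$_\n" for @messages;
}
```

The rest of the script stays the same whichever class is loaded, which is the practical benefit of the shared interface.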

Example -- A Browser-based XML-friendly Content Editor

For our second and final example we will create a browser-based XML content editor that uses XML::Schematron to validate the content being authored. To keep things nice and tidy we will use Christian Glahn's astonishingly cool CGI::XMLApplication, which we learned about last month. First, though, we need to decide on the XML language that we want to use to capture our content. To keep things simple, we will choose a very minimal subset of DocBook-XML, which will nevertheless provide more semantic richness than plain HTML. Here's the simplified Schematron schema:


<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Basic Web Site Content Validator">
    <rule context="/">
      <assert test="article">
        The root element of a content page must be named
        'article'.
      </assert>
    </rule>
    <rule context="article">
      <assert test="count(*) = count(title|section|copyright|abstract)">
        Unexpected element(s) found in element 'article'.
        An article element should contain only title,
        section, copyright, or abstract elements.
      </assert>
      <assert test="title">
        An article element must contain a title element.
      </assert>
      <assert test="section">
        An article element must contain a section element.
      </assert>
      <assert test="copyright">
        An article element must contain a copyright element.
      </assert>
    </rule>
    <rule context="title">
      <assert test="string-length() > 0 and string-length() &lt; 51">
        The title element must contain between 1 and 50 characters.
      </assert>
    </rule>
    <rule context="copyright">
      <assert test="count(*) = count(name | date)">
        Unexpected element(s) found: the copyright element may
        only contain name and date elements.
      </assert>
      <assert test="name">
        A copyright element must contain a name element.
      </assert>
      <assert test="date">
        A copyright element must contain a date element.
      </assert>
    </rule>
  </pattern>
</schema>

Next, we create the CGI::XMLApplication interface that will validate the submitted content and warn the user about any validation errors that were encountered. To avoid information overload we will focus only on the parts that are directly relevant to that task. However, the complete working application is available in this month's sample code for you to peruse, install, or extend as desired.

First, we will create the application's verify_content method.


sub verify_content {
    my ( $self, $context ) = @_;
    my $content = $context->{CONTENT};

    # Debugging aid: echo the submitted content to the server's error log.
    warn "content $content \n";

    my $tron = XML::Schematron::LibXSLT->new( );
    $tron->schema( $context->{SCHEMA} );
    my @messages = ();

    # verify() dies if the content cannot be parsed, so trap that here.
    eval {
        @messages = $tron->verify( $content );
    };
    if ( $@ ) {
        my $error = "Error processing XML document: $@";
        push @{$context->{ERRORS}}, $error;
    }
    else {
        push @{$context->{ERRORS}}, @messages;
    }
}

In the verify_content method we create an instance of XML::Schematron::LibXSLT and set the schema to the value contained by the $context->{SCHEMA} field. Then we verify the XML content contained in $context->{CONTENT}. Note that the call to the XML::Schematron::LibXSLT object's verify method is wrapped in an eval block. This ensures that any well-formedness errors encountered can also be captured cleanly and sent to the user without causing a server error. If no parsing errors are encountered we push any structural validity errors that may have resulted from applying our schema to the document on to the $context->{ERRORS} array reference for later use.
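The eval idiom can be seen in isolation with a plain parser. This sketch assumes XML::LibXML and uses its parse_string method directly instead of going through XML::Schematron; the well_formedness_errors helper and the sample markup are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Return a list of error strings for $content; the list is empty if the
# content parses cleanly.
sub well_formedness_errors {
    my ($content) = @_;
    my @errors;

    # parse_string dies on malformed input, so trap the exception with eval.
    my $ok = eval { XML::LibXML->new->parse_string($content); 1 };
    push @errors, "Error processing XML document: $@" unless $ok;

    return @errors;
}

# The <title> element is never closed, so this should report a parse error.
print "$_\n" for well_formedness_errors('<article><title>Unclosed</article>');
```

Having the eval block return 1 on success, and testing that return value, avoids depending on the state of $@ after the block has finished.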

Now we create the requestDOM method that CGI::XMLApplication uses to build the content sent to the browser:


sub requestDOM {
    my ($self, $context) = @_;
    my $dom = XML::LibXML::Document->new();
    my $root = $dom->createElement( 'document' );
    $dom->setDocumentElement( $root );

    # add errors if any (ERRORS may be unset if nothing was validated)
    my @errors = @{ $context->{ERRORS} || [] };
    if ( @errors ) {
        my $errors = $dom->createElement( 'errors' );

        foreach my $message ( @errors ) {
            $errors->appendTextChild( 'error', $message );
        }

        $root->appendChild( $errors );
    }

    return $dom;
}
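To see the shape of the tree that a requestDOM method like this one hands to the stylesheet, the same XML::LibXML calls can be run standalone and the result serialized. This is only an illustration: the two messages are made-up stand-ins for real validation output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical messages standing in for what validation might collect.
my $context = {
    ERRORS => [
        'A copyright element must contain a name element.',
        'The title element must contain between 1 and 50 characters.',
    ],
};

# Build <document><errors><error>...</error>...</errors></document>.
my $dom  = XML::LibXML::Document->new();
my $root = $dom->createElement('document');
$dom->setDocumentElement($root);

my $errors = $dom->createElement('errors');
$errors->appendTextChild( 'error', $_ ) for @{ $context->{ERRORS} };
$root->appendChild($errors);

# Serialize with indentation (format level 1) to inspect the structure.
print $dom->toString(1);
```

The printed document should show a <document> root holding a single <errors> element with one <error> child per message, which is the structure a stylesheet would need to match.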

The requestDOM method creates a new DOM tree using XML::LibXML::Document's new method and adds a top-level element named 'document'. If any errors were pushed on to $context->{ERRORS} during validation, we create a child of the <document> element called 'errors' and loop over the errors encountered, adding an <error> element to it for each error found. Finally, we return the new DOM tree. The XSLT stylesheet that renders the returned DOM will check for the presence of the <errors> element and print a list of validation errors to the user.

Resources

• Download the sample code.

• The Schematron Homepage

• Zvon's Schematron Tutorial

• The W3C XPath Recommendation

Conclusions

Validating XML content does not have to be a painful process. With XML::Schematron and a good working knowledge of the XPath syntax you can add a powerful layer of structural validation to your Perl XML processing in a fraction of the time required by other solutions. Schematron may not completely replace DTDs or W3C Schemas for stricter XML systems, but the value that it provides for the minimal time investment makes it a big winner in my book.