XML.com: XML From the Inside Out

Web Content Validation with XML::Schematron
by Kip Hampton

Using XML::Schematron

Basic usage of XML::Schematron is very simple and best shown by example. The following script takes a path to a Schematron schema and an XML document and prints any validation errors to STDOUT:

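#!/usr/bin/perl -w
use strict;
use XML::Schematron::LibXSLT;

# Paths to the schema and the document come from the command line.
my $schema_file = $ARGV[0];
my $xml_file    = $ARGV[1];

die "Usage: perl schematron.pl schemafile XMLfile.\n"
    unless defined $schema_file and defined $xml_file;

my $tron = XML::Schematron::LibXSLT->new();

# Set the schema, then validate the document against it.
$tron->schema($schema_file);
my $ret = $tron->verify($xml_file);

# Print messages only when there are some, so that a valid
# document produces no output at all.
print $ret . "\n" if defined $ret and length $ret;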

After collecting the file names from the command line, the script creates a new instance of XML::Schematron::LibXSLT, sets the schema with that object's schema method, validates the XML file with the verify method, and prints any resulting messages to standard output. If the script reports no errors, the document is structurally valid by the definition that the schema provides.

Careful readers will have noticed that we imported the Perl Schematron library with use XML::Schematron::LibXSLT; rather than use XML::Schematron;. The reason is that the module ships with several backends, and you choose one based on the processor you want to use. Schematron's secret is that it is most often implemented in XSLT: the Schematron stylesheet is applied to the schema, and the result of that transformation is in turn applied as a stylesheet to the document being validated. The same is true of most flavors of XML::Schematron, except that the stylesheet is created dynamically and all the details are hidden from view. Currently the Sablotron and LibXSLT processors are supported, but if you do not have or do not want an XSLT processor installed, you may use XML::Schematron::XPath, a pure Perl implementation built upon Matt Sergeant's XML::XPath.
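The two-stage process described above can be sketched directly with XML::LibXML and XML::LibXSLT. This is an illustration only, not how XML::Schematron is implemented internally: the meta-stylesheet below is a deliberately tiny stand-in for the real Schematron stylesheet that handles nothing beyond rule and assert, and the sample schema, the documents, and the validate function are made up for the example.

```perl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;
use XML::LibXSLT;

# Toy meta-stylesheet (an assumption for illustration, NOT the real
# Schematron stylesheet): each <rule> becomes an XSLT template and each
# <assert> becomes an <xsl:if> that emits the message when the test fails.
my $meta_xsl = <<'XSL';
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sch="http://www.ascc.net/xml/schematron"
    xmlns:out="urn:x-example:generated-xslt">
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:template match="sch:schema">
    <out:stylesheet version="1.0">
      <out:output method="text"/>
      <out:template match="text()"/>
      <xsl:apply-templates select="sch:pattern/sch:rule"/>
    </out:stylesheet>
  </xsl:template>
  <xsl:template match="sch:rule">
    <out:template match="{@context}">
      <xsl:apply-templates select="sch:assert"/>
      <out:apply-templates/>
    </out:template>
  </xsl:template>
  <xsl:template match="sch:assert">
    <out:if test="not({@test})">
      <out:text><xsl:value-of select="concat(normalize-space(.), '&#10;')"/></out:text>
    </out:if>
  </xsl:template>
</xsl:stylesheet>
XSL

# Made-up one-rule schema, just enough to exercise the pipeline.
my $schema_xml = <<'SCH';
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Toy example">
    <rule context="article">
      <assert test="title">An article must contain a title.</assert>
    </rule>
  </pattern>
</schema>
SCH

my $parser = XML::LibXML->new();
my $xslt   = XML::LibXSLT->new();

# Run both stages and return the validator's text output.
sub validate {
    my ($document_xml) = @_;

    # Stage 1: meta-stylesheet + schema => validating stylesheet.
    my $meta      = $xslt->parse_stylesheet( $parser->parse_string($meta_xsl) );
    my $generated = $meta->output_string(
        $meta->transform( $parser->parse_string($schema_xml) ) );

    # Stage 2: validating stylesheet + document => messages.
    my $validator = $xslt->parse_stylesheet( $parser->parse_string($generated) );
    my $messages  = $validator->output_string(
        $validator->transform( $parser->parse_string($document_xml) ) );

    return defined $messages ? $messages : '';
}

print validate('<article><section/></article>');            # title is missing
print validate('<article><title>Hello</title></article>');  # passes the rule
```

The generated stylesheet is serialized and parsed again between the stages purely to keep the two steps visibly separate; a real implementation would more likely build or cache it once per schema.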

Example -- A Browser-based XML-friendly Content Editor

For our second and final example we will create a browser-based XML content editor that uses XML::Schematron to validate the content being authored. To keep things nice and tidy we will use Christian Glahn's astonishingly cool CGI::XMLApplication, which we learned about last month. First, though, we need to decide on the XML language that will capture our content. To keep things simple, we will choose a very minimal subset of DocBook-XML, which nevertheless provides more semantic richness than plain HTML. Here's the simplified Schematron schema:

<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Basic Web Site Content Validator">
    <rule context="/">
      <assert test="article">
        The root element of a content page must be named
        'article'.
      </assert>
    </rule>
    <rule context="article">
      <assert test="count(*) = count(title|section|copyright|abstract)">
        Unexpected element(s) found in element 'article'.
        An article element should contain only title,
        section, copyright, or abstract elements.
      </assert>
      <assert test="title">
        A document element must contain a title element.
      </assert>
      <assert test="section">
        A document element must contain a section element.
      </assert>
      <assert test="copyright">
        A document element must contain a copyright element.
      </assert>
    </rule>
    <rule context="title">
      <assert test="string-length() > 0 and string-length() &lt; 51">
        The title element must contain between 1 and 50 characters.
      </assert>
    </rule>
    <rule context="copyright">
      <assert test="count(*) = count(name | date)">
        Unexpected element(s) found: the copyright element may
        only contain name and date elements.
      </assert>
      <assert test="name">
        A copyright element must contain a name element.
      </assert>
      <assert test="date">
        A copyright element must contain a date element.
      </assert>
    </rule>
  </pattern>
</schema>
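Because each assert is simply an XPath expression evaluated relative to the rule's context node, it can help to try the tests by hand before committing them to a schema. The sketch below does that with XML::LibXML for the 'article' rule. The sample document and the check_rule helper are invented for illustration, the helper only understands contexts that are plain element names, and none of this is a substitute for XML::Schematron itself.

```perl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;

# Hypothetical sample document: it has no copyright element and carries
# an unexpected <author> child, so two of the assertions should fail.
my $xml = <<'XML';
<article>
  <title>Validating Web Content</title>
  <author>Someone</author>
  <section>Some text.</section>
</article>
XML

my $doc = XML::LibXML->new->parse_string($xml);

# check_rule is a made-up helper: for every element named $context,
# evaluate each XPath test and collect the message when it is false.
sub check_rule {
    my ( $doc, $context, @asserts ) = @_;
    my @failed;
    foreach my $node ( $doc->findnodes("//$context") ) {
        foreach my $assert (@asserts) {
            my ( $test, $message ) = @$assert;
            push @failed, $message
                unless $node->find("boolean($test)")->value;
        }
    }
    return @failed;
}

my @messages = check_rule(
    $doc, 'article',
    [ 'count(*) = count(title|section|copyright|abstract)',
      "Unexpected element(s) found in element 'article'." ],
    [ 'title',     'A document element must contain a title element.' ],
    [ 'copyright', 'A document element must contain a copyright element.' ],
);

print "$_\n" for @messages;
```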

Next, we create the CGI::XMLApplication interface that validates the submitted content and warns the user about any validation errors. To avoid information overload we will focus on the parts that are directly relevant to those two tasks. The complete working application is available in this month's sample code for you to peruse, install, or extend as desired.

We begin with the application's verify_content method.

sub verify_content {
    my ( $self, $context ) = @_;
    my $content = $context->{CONTENT};
    warn "content $content \n";    # debugging output on STDERR
    my $tron = XML::Schematron::LibXSLT->new( );
    $tron->schema( $context->{SCHEMA} );
    my @messages = ();

    # Trap the exception raised when the content is not well-formed.
    eval {
        @messages = $tron->verify( $content );
    };
    if ( $@ ) {
        # Parsing failed: report the parser's error to the user.
        my $error = "Error processing XML document: $@";
        push @{$context->{ERRORS}}, $error;
    }
    else {
        # Parsing succeeded: report any failed assertions.
        push @{$context->{ERRORS}}, @messages;
    }
}

In the verify_content method we create an instance of XML::Schematron::LibXSLT and set the schema to the value held in the $context->{SCHEMA} field. Then we verify the XML content contained in $context->{CONTENT}. Note that the call to the XML::Schematron::LibXSLT object's verify method is wrapped in an eval block. This ensures that any well-formedness errors can also be captured cleanly and sent to the user without causing a server error. If no parsing errors are encountered, we push any structural validity errors that resulted from applying our schema to the document onto the $context->{ERRORS} array reference for later use.
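The reason for the eval can be seen in isolation with XML::LibXML, which dies when asked to parse text that is not well-formed. The following sketch is independent of XML::Schematron; the parse_or_error helper and the sample strings are hypothetical.

```perl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;

# Hypothetical helper: returns an empty list on success, or a one-item
# list holding the parser's error message if the text is not well-formed.
sub parse_or_error {
    my ($xml) = @_;
    my $parser = XML::LibXML->new();
    eval { $parser->parse_string($xml) };    # dies on malformed input
    return $@ ? ("Error processing XML document: $@") : ();
}

my @ok  = parse_or_error('<article><title>Fine</title></article>');
my @bad = parse_or_error('<article><title>Oops</article>');   # unclosed <title>

print "well-formed input: ", scalar(@ok),  " error(s)\n";
print "malformed input:   ", scalar(@bad), " error(s)\n";
print $bad[0], "\n" if @bad;
```

Without the eval, the second call would terminate the program, which in a CGI setting would surface to the user as a server error rather than a readable message.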

Now we create the requestDOM method that CGI::XMLApplication uses to build the content sent to the browser:

sub requestDOM {
    my ($self, $context) = @_;
    my $dom = XML::LibXML::Document->new();
    my $root = $dom->createElement( 'document' );
    $dom->setDocumentElement( $root );

    # add errors if any
    if ( scalar( @{$context->{ERRORS}} ) > 0 ) {
        my $errors = $dom->createElement( 'errors' );

        foreach my $message ( @{$context->{ERRORS}} ) {
            $errors->appendTextChild( 'error', $message );
        }

        $root->appendChild( $errors );
    }

    return $dom;
}
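To inspect the shape of the document that this method hands to the stylesheet, the same DOM calls can be run outside CGI::XMLApplication. In this sketch the build_dom function and the sample messages are stand-ins invented for the example; only the XML::LibXML calls are taken from the method above.

```perl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;

# Stand-in for requestDOM: builds the same tree from a plain list of
# messages instead of the application's $context hash.
sub build_dom {
    my @messages = @_;
    my $dom  = XML::LibXML::Document->new();
    my $root = $dom->createElement('document');
    $dom->setDocumentElement($root);

    # Add an <errors> element only when there is something to report.
    if (@messages) {
        my $errors = $dom->createElement('errors');
        $errors->appendTextChild( 'error', $_ ) for @messages;
        $root->appendChild($errors);
    }
    return $dom;
}

# Hypothetical messages, as verify_content might have collected them.
my $dom = build_dom(
    'A document element must contain a title element.',
    'A document element must contain a copyright element.',
);

# Serialize with indentation to look at the result.
print $dom->toString(1);
```

Calling build_dom with no arguments yields a bare document element, which is the case the stylesheet treats as "nothing to report".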

In requestDOM we create a new DOM tree using XML::LibXML::Document's new method and add a top-level element named 'document'. If any errors were pushed onto $context->{ERRORS} during validation, we create a child of the <document> element called 'errors' and loop over the errors, adding an <error> element to it for each one. Finally, we return the new DOM tree. The XSLT stylesheet that renders the returned DOM checks for the presence of the <errors> element and prints the list of validation errors for the user.

Conclusions

Validating XML content does not have to be a painful process. With XML::Schematron and a good working knowledge of XPath syntax, you can add a powerful layer of structural validation to your Perl XML processing in a fraction of the time required by other solutions. Schematron may not completely replace DTDs or W3C XML Schemas for stricter XML systems, but the value that it provides for the minimal time investment makes it a big winner in my book.

Resources

Download the sample code.

The Schematron Homepage

Zvon's Schematron Tutorial

The W3C XPath Recommendation