Web Content Validation with XML::Schematron
by Kip Hampton
Using XML::Schematron
Basic usage of XML::Schematron is very simple and best
shown by example. The following script takes a path to a Schematron
schema and an XML document and prints any validation errors to
STDOUT:
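#!/usr/bin/perl -w
use strict;
use XML::Schematron::LibXSLT;

# The schema and document paths come from the command line.
my $schema_file = $ARGV[0];
my $xml_file    = $ARGV[1];

die "Usage: perl schematron.pl schemafile XMLfile.\n"
    unless defined $schema_file and defined $xml_file;

my $tron = XML::Schematron::LibXSLT->new();
$tron->schema($schema_file);

# In scalar context, verify returns the messages as a single string.
my $ret = $tron->verify($xml_file);

# Print only when there is something to report.
print "$ret\n" if defined $ret and length $ret;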
After collecting the filenames from the command line, this script
creates a new instance of XML::Schematron::LibXSLT,
sets the schema with that object's schema method,
validates the XML file with the verify method, and
prints any results to standard output. If the script runs silently,
then the document in question is structurally valid according to the
schema.
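When validation runs from a build script or a scheduled job, an exit status is often more useful than printed text. The following is a minimal sketch, assuming that verify returns one entry per message when called in list context, as the module's documentation describes. The validate_file and exit_status helpers are our own names, not part of XML::Schematron.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the Schematron messages for one file.
sub validate_file {
    my ($schema_file, $xml_file) = @_;
    require XML::Schematron::LibXSLT;    # loaded only when actually needed
    my $tron = XML::Schematron::LibXSLT->new();
    $tron->schema($schema_file);
    # List context: one element per message. Blank entries are dropped
    # defensively in case the output contains empty lines.
    my @messages = grep { /\S/ } $tron->verify($xml_file);
    return @messages;
}

# Hypothetical helper: map a list of messages to a shell-style status.
sub exit_status {
    my @messages = @_;
    return @messages ? 1 : 0;
}

# Only run when both filenames were supplied on the command line.
if (@ARGV == 2) {
    my @messages = validate_file(@ARGV);
    print STDERR "$_\n" for @messages;
    exit exit_status(@messages);
}
```

Run as perl check.pl schemafile XMLfile, the script exits with 0 for a valid document and 1 otherwise, so it can gate a later step in a shell pipeline.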
Careful readers will have noticed that we imported the Perl
Schematron library with use XML::Schematron::LibXSLT;
rather than use XML::Schematron;. The reason is that
the Schematron module actually ships with several backends that can
be chosen based on the type of processor that you want to
use. Schematron's secret is that it's most often implemented as an
XSLT stylesheet, in which the Schematron stylesheet is applied to
the schema and the result of that transformation is applied as a
stylesheet to the document being validated. The same is true with
most flavors of XML::Schematron except that the
stylesheet is created dynamically and all the details hidden from
view. Currently, the Sablotron and LibXSLT processors are supported,
but if you do not have or want an XSLT processor installed, you may
use XML::Schematron::XPath, a pure Perl implementation
built upon Matt Sergeant's XML::XPath.
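Because the backends share the same interface, switching between them can be reduced to choosing a class name. Here is a small sketch of that idea, assuming all three backend classes accept the new, schema, and verify calls used earlier. The backend_class and new_validator helpers and the short backend names are our own inventions.

```perl
use strict;
use warnings;

# Short names (our own) mapped to the backend classes in the distribution.
my %BACKEND_CLASS = (
    libxslt   => 'XML::Schematron::LibXSLT',
    sablotron => 'XML::Schematron::Sablotron',
    xpath     => 'XML::Schematron::XPath',
);

# Hypothetical helper: resolve a short name to a class name.
sub backend_class {
    my ($name) = @_;
    my $class = $BACKEND_CLASS{ lc $name }
        or die "Unknown Schematron backend '$name'\n";
    return $class;
}

# Hypothetical helper: load the chosen backend and hand back a validator
# with its schema already set.
sub new_validator {
    my ($name, $schema_file) = @_;
    my $class = backend_class($name);
    (my $file = "$class.pm") =~ s{::}{/}g;    # XML::Foo -> XML/Foo.pm
    require $file;
    my $tron = $class->new();
    $tron->schema($schema_file);
    return $tron;
}
```

A caller without an XSLT processor installed would then write new_validator('xpath', $schema_file) and use the returned object exactly as in the first script.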
Example -- A Browser-based XML-friendly Content Editor
For our second and final example we will create a browser-based XML
content editor that uses XML::Schematron to validate
the content being authored. To keep things nice and tidy we will use
Christian Glahn's astonishingly cool
CGI::XMLApplication which we learned about
last month. First, though, we need to decide on the XML language
that we want to use to capture our content. To keep things simple, we
will choose a very minimal subset of DocBook-XML which
will, nevertheless, provide more semantic richness than plain
HTML. Here's the simplified Schematron schema:
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern name="Basic Web Site Content Validator">
    <rule context="/">
      <assert test="article">
        The root element of a content page must be named
        'article'.
      </assert>
    </rule>
    <rule context="article">
      <assert test="count(*) = count(title|section|copyright|abstract)">
        Unexpected element(s) found in element 'article'.
        An article element should contain only title,
        section, copyright, or abstract elements.
      </assert>
      <assert test="title">
        A document element must contain a title element.
      </assert>
      <assert test="section">
        A document element must contain a section element.
      </assert>
      <assert test="copyright">
        A document element must contain a copyright element.
      </assert>
    </rule>
    <rule context="title">
      <assert test="string-length() > 0 and string-length() &lt; 51">
        The title element must contain between 1 and 50 characters.
      </assert>
    </rule>
    <rule context="copyright">
      <assert test="count(*) = count(name | date)">
        Unexpected element(s) found: the copyright element may
        only contain name and date elements.
      </assert>
      <assert test="name">
        A copyright element must contain a name element.
      </assert>
      <assert test="date">
        A copyright element must contain a date element.
      </assert>
    </rule>
  </pattern>
</schema>
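Each assert in the schema is simply an XPath expression evaluated against the node its rule selects. To make that concrete, the sketch below evaluates three of the tests from the 'article' rule by hand with XML::LibXML. It is not how XML::Schematron works internally; the sample document, the labels, and the check_asserts helper are our own.

```perl
use strict;
use warnings;
use XML::LibXML;

# A sample document (our own) with a stray element and no copyright.
my $xml = <<'XML';
<article>
  <title>Hello, Schematron</title>
  <section>Some content.</section>
  <sidebar>Not allowed here.</sidebar>
</article>
XML

# Three tests copied from the 'article' rule, each with a short label.
my @asserts = (
    [ 'only known children' =>
      'count(*) = count(title|section|copyright|abstract)' ],
    [ 'has title'           => 'title' ],
    [ 'has copyright'       => 'copyright' ],
);

# Hypothetical helper: evaluate each test against the first article
# element and return a label => 1/0 hash.
sub check_asserts {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->new->parse_string($xml_string);
    my ($context) = $doc->findnodes('//article');
    die "No article element found\n" unless $context;
    my %passed;
    for my $assert (@asserts) {
        my ($label, $test) = @$assert;
        # boolean() applies XPath truth rules: a non-empty node-set is true.
        my $result = $context->findvalue("boolean($test)");
        $passed{$label} = $result eq 'true' ? 1 : 0;
    }
    return %passed;
}

my %passed = check_asserts($xml);
printf "%-20s %s\n", $_, $passed{$_} ? 'ok' : 'FAILED' for sort keys %passed;
```

For this sample, the title test passes while the other two fail: the sidebar element makes the two counts differ, and there is no copyright child.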
Next, we create the CGI::XMLApplication interface that
will validate the submitted content and warn the user about any
validation errors. To avoid information overload, we will focus on
the parts that are directly relevant to that task. However, the
complete working application is available in this month's sample
code for you to peruse, install, or extend as desired.
We begin with the application's verify_content method.
sub verify_content {
    my ( $self, $context ) = @_;
    my $content = $context->{CONTENT};
    warn "content $content \n";    # debugging aid: logs the submitted XML

    my $tron = XML::Schematron::LibXSLT->new( );
    $tron->schema( $context->{SCHEMA} );

    # Trap parser exceptions so malformed XML does not cause a server error.
    my @messages = ();
    eval {
        @messages = $tron->verify( $content );
    };
    if ( $@ ) {
        my $error = "Error processing XML document: $@";
        push @{$context->{ERRORS}}, $error;
    }
    else {
        push @{$context->{ERRORS}}, @messages;
    }
}
In the verify_content method we create an instance of
XML::Schematron::LibXSLT and set the schema to the
value contained by the $context->{SCHEMA} field. Then
we verify the XML content contained in
$context->{CONTENT}. Note that the call to the
XML::Schematron::LibXSLT object's verify
method is wrapped in an eval block. This ensures that any
well-formedness errors encountered can also be captured cleanly and
sent to the user without causing a server error. If no parsing
errors are encountered, we push any structural validity errors that
resulted from applying our schema to the document onto the
$context->{ERRORS} array reference for later use.
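The eval pattern is easy to try in isolation. The sketch below uses XML::LibXML's parser directly instead of XML::Schematron, since a parser is what raises the well-formedness error in the first place. The wellformedness_error helper is our own, and having the eval block return 1 on success is a common variation on testing $@ afterward.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return an error string for malformed XML,
# or undef when the string parses cleanly.
sub wellformedness_error {
    my ($xml_string) = @_;
    # The block's final 1 is only reached if parse_string did not die.
    my $ok = eval {
        XML::LibXML->new->parse_string($xml_string);
        1;
    };
    return $ok ? undef : "Error processing XML document: $@";
}

# The title element is never closed, so the parser throws an exception.
my $error = wellformedness_error('<article><title>Oops</article>');
print defined $error ? $error : "Document is well-formed.\n";
```

Because the exception never escapes the helper, a CGI application that calls it can show the message to the author rather than returning a server error page.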
Now we create the requestDOM method that
CGI::XMLApplication uses to build the content sent to
the browser:
sub requestDOM {
    my ($self, $context) = @_;
    my $dom  = XML::LibXML::Document->new();
    my $root = $dom->createElement( 'document' );
    $dom->setDocumentElement( $root );

    # add errors if any (ERRORS may be unset if validation never ran)
    my @errors = @{ $context->{ERRORS} || [] };
    if ( @errors ) {
        my $errors = $dom->createElement( 'errors' );
        foreach my $message ( @errors ) {
            $errors->appendTextChild( 'error', $message );
        }
        $root->appendChild( $errors );
    }
    return $dom;
}
Here we have created a new DOM tree using
XML::LibXML::Document's new method and
added a top-level element named 'document'. If any errors were
pushed onto $context->{ERRORS} during validation, we
create a child of the <document> element called 'errors' and
add one <error> element to it for each error found. Finally, we
return the new DOM tree. The XSLT stylesheet that renders the
returned DOM checks for the presence of the <errors> element and
prints a list of validation errors for the user.
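To see the XML that the stylesheet receives, the same DOM-building steps can be run outside the web application. This is a standalone sketch: the errors_dom function is our own stand-in for requestDOM and takes the messages as plain arguments rather than reading them from $context.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for requestDOM: build the <document> tree
# from a plain list of error messages.
sub errors_dom {
    my @messages = @_;
    my $dom  = XML::LibXML::Document->new();
    my $root = $dom->createElement('document');
    $dom->setDocumentElement($root);
    if (@messages) {
        my $errors = $dom->createElement('errors');
        # appendTextChild adds <error> with the message as its text;
        # markup characters are escaped when the tree is serialized.
        $errors->appendTextChild('error', $_) for @messages;
        $root->appendChild($errors);
    }
    return $dom;
}

# Serialize with indentation so the structure is easy to read.
print errors_dom(
    'A document element must contain a title element.',
    'A document element must contain a copyright element.',
)->toString(1);
```

The output is a <document> element holding a single <errors> child with one <error> per message. With no messages, the <errors> element is absent, which is the condition the stylesheet tests for.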
Conclusions
Validating XML content does not have to be a painful process. With
XML::Schematron and a good working knowledge of the
XPath syntax you can add a powerful layer of structural validation
to your Perl XML processing in a fraction of the time required by
other solutions. Schematron may not completely replace DTDs or W3C
Schemas for stricter XML systems, but the value that it provides for
the minimal time investment makes it a big winner in my book.