Simple XML Validation with Perl
The Problem: Although XML Schemas and RELAX promise fine-grained validation for XML documents, neither is presently available in the Perl world. You need a way to validate the structure of your documents now. Today. Preferably before lunch.
The Solution: Combine the simplicity of Test.pm from the standard Perl distribution with the flexibility of XPath.
Overcoming Test Anxiety
Before we show how Perl can make XML validation simple, we need to
take a small detour through the standard Test module. For
those not familiar with it, the Test module was designed to
give the harried hacker an easy way to ensure that his or her code
passes a series of basic functional tests before it is unleashed on the
world, and, in the case of writing modules, that those same tests are
passed on the system on which the code is being installed. It is not
surprising, then, that using Test.pm is a very
straightforward proposition. Each test is defined as a call to the
function ok(), which takes up to three arguments: a test,
an expected return value, and an optional message to display upon
failure. If the evaluated values of the first two arguments match,
the test succeeds. As an example, consider the following two
tests:
ok('good','good', 'its all good');
# this test passes because the first two
# arguments return the same values.
ok(sub { 2+2 }, 5, '2 and 2 is 4');
# this test fails for the obvious mathematical
# reason and prints a descriptive error.
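To try ok() on its own, a minimal self-contained sketch might look like the following. The helper name run_examples and the last two checks are illustrative additions, not part of the original examples; they assume Test.pm's documented behavior that a single argument is simply tested for truth and that an expected value given as a qr// pattern is matched rather than compared for equality.

```perl
use strict;
use warnings;
use Test;

BEGIN { plan tests => 4 }

# Hypothetical helper: runs four passing checks and returns what ok() returned.
sub run_examples {
    my @results;
    push @results, ok('good', 'good', "it's all good");   # plain string comparison
    push @results, ok(sub { 2 + 2 }, 4, '2 and 2 is 4');   # the code ref is called first
    push @results, ok(2 + 2 == 4);                         # one argument: truth test
    push @results, ok('Acme Rocket', qr/^Acme/);           # expected value as a pattern
    return @results;
}

my @results = run_examples();
```

Each call prints an "ok N" or "not ok N" line, and the return value of ok() tells the script whether that check passed.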
Following the XPath
Now what does Test have to do with validating an XML document? The answer lies in its combination with the XML::XPath module. The XPath language provides a simple, powerful syntax for navigating the logical structure of an XML document. XML::XPath allows us to take advantage of that power from within Perl.
XPath's syntax is quite accessible. For example, the XPath
expression /foo/bar will find all of the "bar" elements
that are children of a "foo" element at the top of the document
(the leading "/" denotes the document root). By contrast, the expression
/foo/bar/* selects not the "bar" elements themselves but every
element that is a direct child of one of them.
XPath also provides a number of functions and shortcuts that further
simplify examining a document's structure. For instance,
count(/foo/bar[@name]) will return the number of "bar"
elements that have the attribute "name". As we will soon see, combining
Test.pm's compact syntax with the simple power of XPath expressions
will allow us to tackle the task of validating an XML document simply and
efficiently.
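The three expressions just described can be tried against a small document. This is only a sketch: the XML is made up for illustration, and names_for is a hypothetical helper, not part of XML::XPath.

```perl
use strict;
use warnings;
use XML::XPath;

# Made-up document for illustration only.
my $xml = <<'XML';
<foo>
  <bar name="first"><baz>one</baz><qux>two</qux></bar>
  <bar><baz>three</baz></bar>
</foo>
XML

# Hypothetical helper: returns the names of the elements an expression selects.
sub names_for {
    my ($xp, $path) = @_;
    return map { $_->getName } $xp->findnodes($path);
}

my $xp = XML::XPath->new(xml => $xml);

# The two "bar" elements.
print join(',', names_for($xp, '/foo/bar')), "\n";

# The element children of those "bar" elements.
print join(',', names_for($xp, '/foo/bar/*')), "\n";

# How many "bar" elements carry a "name" attribute.
print $xp->find('count(/foo/bar[@name])')->value, "\n";
```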
Rolling Our Own XML Validator
Let's try out what we've covered so far by creating our own simple
XML validation tool. To do this, we will need a sample XML file, a
test script, and a simple Perl "wrapper" script to allow our tool to
validate more than a single type of document. We begin with the Perl
script, which we will call xml_test.pl. (You can also download the script.)
use Test::Harness qw(&runtests $verbose);
use strict;

while (@ARGV > 2) {
    my $arg = shift @ARGV;
    if ($arg eq '-d') {
        $verbose = 1;
    }
}

if (@ARGV < 2) {
    usage();
    exit(0);
}

sub usage {
    warn "Usage: xml_test.pl [-d] testscript xmlfile\n";
}

$ENV{XMLFILE} = $ARGV[1];
runtests $ARGV[0];
This script allows us to be more flexible in our testing, by providing a way to specify both the XML file and the test file from the command line. Let's move on to creating a sample XML instance that we intend to validate. (Download the sample file here.)
<?xml version="1.0" standalone="yes"?>
<order>
  <customer>
    <name>Coyote, Ltd.</name>
    <shipping_info>
      <address>1313 Desert Road</address>
      <city>Nowheresville</city>
      <state>AZ</state>
      <zip>90210</zip>
    </shipping_info>
  </customer>
  <item>
    <product id="1111">Acme Rocket Jet Pack</product>
    <quantity type="each">1</quantity>
  </item>
  <item>
    <product id="2222">Roadrunner Chow</product>
    <quantity type="bag">10</quantity>
  </item>
</order>
Now let's consider what tests would be appropriate to validate this type of document. At the very least, we need to verify that the document contains an order, and that the order contains a customer, a shipping address, and a list of items. Beyond that, we should also verify that each item contains a product and a quantity. So, we'll need five tests to verify the basic structure.
Let's create a small test script named order.t (download order.t
here) and begin with the basics.
use Test;
BEGIN { plan tests => 5 }
use XML::XPath;
my $xp = XML::XPath->new(filename => $ENV{XMLFILE} || 'customer_order.xml');
my (@nodes, $test); # pre-declare a few vars
First we'll define a test that checks whether or not the document root is indeed an "order" element. We will do this by attempting to select the nodeset for an "order" element at the document root into an array, testing that the resulting array contains only one element, and then verifying that our test is true.
@nodes = $xp->findnodes('/order');
ok(@nodes == 1, 1, "the root element must be an 'order'");
Next we need to confirm that our order document contains a "customer" element, and that the "customer" element contains a "shipping_info" element. Rather than running separate tests for each, we can combine these tests into a single expression and, if either element is missing or misplaced, our test will fail.
@nodes = $xp->findnodes('/order/customer/shipping_info');
ok(@nodes == 1, 1,
   "an order must contain a 'customer' element with a 'shipping_info' child");
As the Perl mantra goes, "There's More Than One Way To Do It", and
the same is true with XML::XPath. Rather than selecting the
nodes into an array and evaluating that array in a scalar context to
get the number of matches, we can use the XPath count()
function to achieve the same effect. Note that we will be using
XML::XPath's find() method instead of
findnodes(), since a count() expression
returns a number rather than a set of document nodes.
$test = $xp->find('count(/order/item)');
ok($test > 0, 1, "an order must contain at least one 'item' element");
Finally, we need to be sure that every "item" element contains both
a "product" element and a "quantity" element. Here we'll get a little
fancier and use XPath's boolean() function which returns
true (1) only if the entire expression we pass to it evaluates to
true. We need only check that the number of "item" elements is equal
to the number of "item" elements that have the child elements we are
testing for.
$test = $xp->find(
    'boolean(count(/order/item[product])=count(/order/item))');
ok($test == 1, 1,
    "a 'item' element must contain a 'product' element.");
$test = $xp->find(
    'boolean(count(/order/item[quantity])=count(/order/item))');
ok($test == 1, 1,
    "a 'item' element must contain a 'quantity' element.");
We now have tests to cover the five basic areas that we defined earlier as
the most critical in terms of structural validation. Having saved our test
file as order.t, let's fire up our xml_test.pl
script.
% perl xml_test.pl -d order.t customer_order.xml
order.................1..5
OK 1
OK 2
OK 3
OK 4
OK 5
OK
All tests successful.
Files=1, Tests=5, 1 wallclock secs ( 0.69 cusr + 0.07 csys = 0.76 CPU)
Great, our sample document passed muster. But what if it didn't? To find
out, open customer_order.xml, remove one of the "quantity"
elements, save the file, and run the script again.
order.................1..5
OK 1
OK 2
OK 3
OK 4
not OK 5
# Test 5 got: '' (order.t at line 26)
# Expected: '1' (a 'item' element must contain a 'quantity' element.)
FAILED test 5
Failed 1/5 tests, 80.00% okay
Failed Test  Status Wstat Total Fail  Failed  List of failed
-------------------------------------------------------------
order.t                       5    1  20.00%  5
Failed 1/1 test scripts, 0.00% okay. 1/5 subtests failed, 80.00% okay.
Admittedly the output is not very pretty, but it is functional. We now know that the current document is invalid, and we also know why.
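One subtlety in count comparisons like those in the last two tests is worth keeping in mind: counting child elements is not the same as counting the parents that have such a child. The sketch below uses a made-up order in which one item has two products and the other has none. The helper xpath_true is hypothetical, and the sketch assumes that the object find() returns for a boolean() expression exposes its 1 or 0 through value().

```perl
use strict;
use warnings;
use XML::XPath;

# Made-up order: the first item has two products, the second has none.
my $xml = <<'XML';
<order>
  <item><product id="1111">Jet Pack</product><product id="3333">Spare Fuel</product></item>
  <item><quantity type="bag">10</quantity></item>
</order>
XML

# Hypothetical helper: evaluates an XPath expression and returns 1 or 0.
sub xpath_true {
    my ($xp, $expr) = @_;
    return $xp->find("boolean($expr)")->value ? 1 : 0;
}

my $xp = XML::XPath->new(xml => $xml);

# Two products and two items: the counts match although one item has no product.
print xpath_true($xp, 'count(/order/item/product)=count(/order/item)'), "\n";

# Only one item has a "product" child, so this stricter comparison is false.
print xpath_true($xp, 'count(/order/item[product])=count(/order/item)'), "\n";
```

The predicate form, item[product], counts items rather than products, which is usually what a rule like "every item must contain a product" intends.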
The handful of tests that we currently have clearly would not be
sufficient validation for a production environment, but with these
few examples, you hopefully have a clear view of the basics and could
extend the test script to handle nearly any case. You could, for
example, iterate over the "quantity" elements and test each
text() node against a regular expression to ensure that
each contains only a numeric value. You are limited only by your
imagination.
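As a rough sketch of that last idea, the following checks that every "quantity" holds only digits. The helper name non_numeric_quantities, the inline sample document, and the choice of /^\d+$/ as the definition of "numeric" are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Test;
use XML::XPath;

BEGIN { plan tests => 1 }

# Hypothetical helper: returns the text of every "quantity" that is not all digits.
sub non_numeric_quantities {
    my ($xp) = @_;
    return grep { !/^\d+$/ }
           map  { $_->string_value }
           $xp->findnodes('/order/item/quantity');
}

# Made-up sample; a real test script would load the file under test instead.
my $xp = XML::XPath->new(xml => <<'XML');
<order>
  <item><quantity type="each">1</quantity></item>
  <item><quantity type="bag">10</quantity></item>
</order>
XML

my @bad = non_numeric_quantities($xp);
ok(scalar(@bad), 0, "every 'quantity' must be numeric; found: @bad");
```

Collecting the offending values first keeps the plan at a single test no matter how many items the order contains, and lets the failure message name the bad values.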
The Same Old Scheme?
Other Resources: Using XSL as a Validation Language, by Rick Jelliffe
As much as I would love to take credit for the basic ideas presented here, I admit that the notion of using XPath expressions to validate an XML document's structure is not at all new. In fact, this concept is the foundation of Rick Jelliffe's popular Schematron. Thanks should also go to Matt Sergeant, the author of AxKit, for pointing out that Perl's Test and Test::Harness modules would make a nifty environment for a Perl Schematronesque clone. The goal here has been to spark your imagination, to get you to experiment, and, hopefully, to point to the ability of Perl and its modules to make even the more complex XML tasks, like validation, easy to solve.