Simple XML Validation with Perl

November 8, 2000

The Problem: Although XML Schemas and RELAX promise fine-grained validation for XML documents, neither is presently available in the Perl world. You need a way to validate the structure of your documents now. Today. Preferably before lunch.

The Solution: Combine the simplicity of Test.pm from the standard Perl distribution with the flexibility of XPath.

Overcoming Test Anxiety

Before we show how Perl can make XML validation simple, we need to take a small detour through the standard Test module. For those not familiar with it, the Test module was designed to give the harried hacker an easy way to ensure that his or her code passes a series of basic functional tests before it is unleashed on the world and, in the case of modules, that those same tests pass on the system on which the code is being installed. It is not surprising, then, that using Test.pm is a very straightforward proposition. Each test is defined as a call to the function ok(), which takes up to three arguments: a test, an expected return value, and an optional message to display upon failure. If the interpolated values of the first two arguments match, the test succeeds. As an example, consider the following two tests:


ok('good', 'good', 'its all good');
# this test passes because the first two
# arguments return the same values.

ok(sub { 2+2 }, 5, '2 and 2 is 4');
# this test fails for the obvious mathematical
# reason and prints a descriptive error.
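To see these calls in context, here is a minimal, self-contained sketch of a complete Test.pm script. The plan line declares how many tests the script expects to run; the values being tested are illustrative only, and all four tests are written to pass.

```perl
use strict;
use warnings;
use Test;

# Declare up front how many tests the script will run.
BEGIN { plan tests => 4 }

# One argument: passes if the value is true.
ok(1);

# Two arguments: passes if both stringify to the same value.
ok('good', 'good');

# A code reference is called and its return value is compared.
ok(sub { 2 + 2 }, 4, '2 and 2 is 4');

# The expected value may also be a compiled regular expression.
ok('validation', qr/valid/, 'string should mention "valid"');
```

Run directly with perl, the script prints the plan line followed by one result line per test; Test::Harness reads the same output when it runs the script for you.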

Following the XPath

Now what does Test have to do with validating an XML document? The answer lies in its combination with the XML::XPath module. The XPath language provides a simple, powerful syntax for navigating the logical structure of an XML document. XML::XPath allows us to take advantage of that power from within Perl.

XPath's syntax is quite accessible. For example, the XPath expression /foo/bar will find all of the "bar" elements that are children of a "foo" root element (the leading "/" denotes the document root). Alternatively, the expression /foo/bar/* will select every child element of those "bar" elements, and /foo/bar//* will select all of their descendant elements.

XPath also provides a number of functions and shortcuts that further simplify examining a document's structure. For instance, count(/foo/bar[@name]) will return the number of "bar" elements that have the attribute "name". As we will soon see, combining Test.pm's compact syntax with the simple power of XPath expressions will allow us to tackle the task of validating an XML document simply and efficiently.
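As a rough illustration of the expressions just described, the following sketch evaluates them with XML::XPath against a tiny inline document. The document itself is made up for this example and is not part of the article's sample files.

```perl
use strict;
use warnings;
use XML::XPath;

# A tiny illustrative document (an assumption for this sketch).
my $xml = <<'XML';
<foo>
  <bar name="first"><baz/><qux/></bar>
  <bar><baz/></bar>
</foo>
XML

my $xp = XML::XPath->new(xml => $xml);

# In list context, findnodes() returns the matching nodes.
my @bars     = $xp->findnodes('/foo/bar');    # the "bar" elements
my @children = $xp->findnodes('/foo/bar/*');  # child elements of each "bar"

# find() returns an object for non-nodeset results; value() unwraps it.
my $named = $xp->find('count(/foo/bar[@name])')->value;

printf "bar elements: %d\n",             scalar @bars;
printf "children of bar: %d\n",          scalar @children;
printf "bar elements with \@name: %d\n", $named;
```

Here the first expression matches both "bar" elements, the second matches their three child elements, and the count() call finds the single "bar" that carries a "name" attribute.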

Rolling Our Own XML Validator

Let's try out what we've covered so far by creating our own simple XML validation tool. To do this, we will need a sample XML file, a test script, and a simple Perl "wrapper" script to allow our tool to validate more than a single type of document. We begin with the Perl script, which we will call xml_test.pl. (You can also download the script.)


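use strict;
use Test::Harness qw(&runtests $verbose);

# Consume leading options until only the test script
# and the XML file remain.
while (@ARGV > 2) {
    my $arg = shift @ARGV;
    if ($arg eq '-d') {
        $verbose = 1;   # have Test::Harness show each test's output
    }
}

# Both a test script and an XML file are required.
if (@ARGV < 2) {
    usage();
    exit(1);
}

sub usage {
    warn "Usage: xml_test.pl [-d] testscript xmlfile\n";
}

# Pass the XML file name to the test script via the environment.
$ENV{XMLFILE} = $ARGV[1];

runtests $ARGV[0];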

This script allows us to be more flexible in our testing, by providing a way to specify both the XML file and the test file from the command line. Let's move on to creating a sample XML instance that we intend to validate. (Download the sample file here.)


<?xml version="1.0" standalone="yes"?>
<order>
  <customer>
    <name>Coyote, Ltd.</name>
    <shipping_info>
      <address>1313 Desert Road</address>
      <city>Nowheresville</city>
      <state>AZ</state>
      <zip>90210</zip>
    </shipping_info>
  </customer>
  <item>
    <product id="1111">Acme Rocket Jet Pack</product>
    <quantity type="each">1</quantity>
  </item>
  <item>
    <product id="2222">Roadrunner Chow</product>
    <quantity type="bag">10</quantity>
  </item>
</order>

Now let's consider what tests would be appropriate to validate this type of document. At the very least, we need to verify that the document contains an order, and that the order contains a customer, a shipping address, and a list of items. Beyond that, we should also verify that each item contains a product and a quantity. So, we'll need five tests to verify the basic structure.

Let's create a small test script named order.t (download order.t here) and begin with the basics.

use Test;
BEGIN { plan tests => 5 }
use XML::XPath;

# xml_test.pl passes the name of the XML file in $ENV{XMLFILE}
my $xp = XML::XPath->new(filename => $ENV{XMLFILE});
my (@nodes, $test); # pre-declare a few vars

First we'll define a test that checks whether or not the document root is indeed an "order" element. We will do this by attempting to select the nodeset for an "order" element at the document root into an array, testing that the resulting array contains only one element, and then verifying that our test is true.

@nodes = $xp->findnodes('/order');
ok(@nodes == 1, 1, "the root element must be an 'order'");

Next we need to confirm that our order document contains a "customer" element, and that the "customer" element contains a "shipping_info" element. Rather than running separate tests for each, we can combine these tests into a single expression and, if either element is missing or misplaced, our test will fail.


@nodes = $xp->findnodes('/order/customer/shipping_info');
ok(@nodes == 1, 1,
   "an order must contain a 'customer' element with a 'shipping_info' child");

As the Perl mantra goes, "There's More Than One Way To Do It", and the same is true with XML::XPath. Rather than selecting the nodes into an array and evaluating that array in a scalar context to get the number of matches, we can use the XPath count() function to achieve the same effect. Note that we will be using XML::XPath's find() function instead of findnodes() since the type of test we are performing returns a literal value instead of a set of document nodes.


$test = $xp->find('count(/order/item)');
ok($test > 0, 1, "an order must contain at least one 'item' element");

Finally, we need to be sure that every "item" element contains both a "product" element and a "quantity" element. Here we'll get a little fancier and use XPath's boolean() function, which returns true (1) only if the entire expression we pass to it evaluates to true. We need only check that the number of "product" (or "quantity") elements found inside "item" elements is equal to the number of "item" elements. This comparison assumes that no "item" contains more than one of either child.


$test = $xp->find(
   'boolean(count(/order/item/product)=count(/order/item))');
ok($test == 1, 1,
   "a 'item' element must contain a 'product' element.");

$test = $xp->find(
   'boolean(count(/order/item/quantity)=count(/order/item))');
ok($test == 1, 1,
   "a 'item' element must contain a 'quantity' element.");

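Comparing counts can be fooled: an "item" with two "product" children and another with none would still balance. A stricter variation, sketched here as one possible approach rather than the article's own tests, uses an XPath predicate to count the "item" elements that lack a required child. The helper name items_missing and the inline document are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical helper: number of "item" elements lacking a given child.
sub items_missing {
    my ($xp, $child) = @_;
    my $path = 'count(/order/item[not(' . $child . ')])';
    return $xp->find($path)->value;
}

# Illustrative document: the second item has no "quantity".
my $xp = XML::XPath->new(xml => <<'XML');
<order>
  <item>
    <product id="1111">Acme Rocket Jet Pack</product>
    <quantity type="each">1</quantity>
  </item>
  <item>
    <product id="2222">Roadrunner Chow</product>
  </item>
</order>
XML

printf "items without a product: %d\n",  items_missing($xp, 'product');
printf "items without a quantity: %d\n", items_missing($xp, 'quantity');
```

In a test script, the helper's result could be passed to ok() with an expected value of 0, so that a failure reports how many items are incomplete.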
We now have tests to cover the five basic areas that we defined earlier as the most critical in terms of structural validation. Having saved our test file as order.t, let's fire up our xml_test.pl script.


% perl xml_test.pl -d order.t customer_order.xml

order.................1..5
ok 1
ok 2
ok 3
ok 4
ok 5
ok
All tests successful.
Files=1, Tests=5, 1 wallclock secs ( 0.69 cusr + 0.07 csys = 0.76 CPU)

Great, our sample document passed muster. But what if it didn't? To find out, open customer_order.xml, remove one of the "quantity" elements, save the file, and run the script again.


order.................1..5
ok 1
ok 2
ok 3
ok 4
not ok 5
# Test 5 got: '' (order.t at line 26)
# Expected: '1' (a 'item' element must contain a 'quantity' element.)
FAILED test 5
        Failed 1/5 tests, 80.00% okay
Failed Test  Status Wstat Total Fail  Failed  List of failed
-------------------------------------------------------------
order.t                       5    1  20.00%  5
Failed 1/1 test scripts, 0.00% okay. 1/5 subtests failed, 80.00% okay.

Admittedly the output is not very pretty, but it is functional. We now know that the current document is invalid, and we also know why.

The handful of tests that we currently have would clearly not be sufficient validation for a production environment, but with these few examples you hopefully have a clear view of the basics and could extend the test script to handle nearly any case. You could, for example, iterate over the "quantity" elements and test each text() node against a regular expression to ensure that each contains only a numeric value. You are limited only by your imagination.
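The regular-expression idea might be sketched as follows. This is only one way to do it: the helper name non_numeric_quantities, the choice of pattern (whole non-negative integers), and the inline document are all assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical helper: return the text of every "quantity" element
# that is not a plain non-negative integer.
sub non_numeric_quantities {
    my ($xp) = @_;
    my @bad;
    for my $node ($xp->findnodes('/order/item/quantity')) {
        my $text = $node->string_value;
        push @bad, $text unless $text =~ /^\s*\d+\s*$/;
    }
    return @bad;
}

# Illustrative document: the second quantity is not numeric.
my $xp = XML::XPath->new(xml => <<'XML');
<order>
  <item><quantity type="each">1</quantity></item>
  <item><quantity type="bag">ten</quantity></item>
</order>
XML

my @bad = non_numeric_quantities($xp);
print "non-numeric quantities: @bad\n";
```

Inside order.t, the number of offending values could be handed to ok() with an expected value of 0, remembering to raise the planned test count accordingly.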

The Same Old Scheme?

As much as I would love to take credit for the basic ideas presented here, I admit that the notion of using XPath expressions to validate an XML document's structure is not at all new. In fact, this concept is the foundation of Rick Jelliffe's popular Schematron. Thanks should also go to Matt Sergeant, the author of AxKit, for pointing out that Perl's Test and Test::Harness modules would make a nifty environment for a Perl Schematronesque clone. The goal here has been to spark your imagination, to get you to experiment, and, hopefully, to point to the ability of Perl and its modules to make even the more complex XML tasks, like validation, easy to solve.

Other Resources

• Using XSL as a Validation Language by Rick Jelliffe

• Introducing the Schematron by Uche Ogbuji