XML.com: XML From the Inside Out

XML::LibXML - An XML::Parser Alternative

November 14, 2001

Introduction

The vast majority of Perl's XML modules are built on top of XML::Parser, Larry Wall and Clark Cooper's Perl interface to James Clark's expat parser. But the expat-XML::Parser combination is not the only full-featured XML parser available in the Perl world. This month we'll look at XML::LibXML, Matt Sergeant and Christian Glahn's Perl interface to Daniel Veillard's libxml2.

Why Would You Want Yet Another XML Parser?

Expat and XML::Parser have proven themselves to be quite capable, but they are not without limitations. Expat was among the first XML parsers available and, as a result, its interfaces reflect the expectations of users at the time it was written. Expat and XML::Parser do not implement the Document Object Model, SAX, or XPath language interfaces (things that most modern XML users take for granted) because, at the time they were written, each of those interfaces either did not exist or was still being heavily evaluated and not yet considered "standard".

The somewhat unfortunate result of this is that most of the available Perl XML modules are built upon one of XML::Parser's non- or not-quite-standard interfaces with the presumption that the input will be some sort of textual representation of an XML document (file, filehandle, string, socket stream) that must be parsed before proceeding. While this works for many simple cases, most advanced XML applications need to do more than one thing with a given document and that means that for each stage in the process, the document must be serialized to a string and then re-parsed by the next module.

By contrast, libxml2 was written after the DOM, XPath, and SAX interfaces became common, and so it implements all three. In-memory trees can be built by parsing documents stored in files, strings, and so on, or generated from a series of SAX events. Those trees can then be operated on using the W3C DOM and XPath interfaces or used to generate SAX events that are handed off to external event handlers. This added flexibility, which reflects current XML processing expectations, makes XML::LibXML a strong contender for XML::Parser's throne.
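
As a minimal sketch of that idea, the following parses a document from a string once and then reads the resulting in-memory tree through both the DOM and XPath interfaces, with no serialize-and-reparse step in between. The sample document and the variable names are made up for illustration; only parse_string, getDocumentElement, nodeName, and findvalue are XML::LibXML calls.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, invented for this sketch.
my $xml = <<'XML';
<camelids>
  <species name="Lama glama">
    <common-name>Llama</common-name>
  </species>
</camelids>
XML

# Parse once; the tree stays in memory for every later step.
my $parser = XML::LibXML->new();
my $tree   = $parser->parse_string($xml);

# DOM view: walk to the root element and ask for its name.
my $root      = $tree->getDocumentElement;
my $root_name = $root->nodeName;

# XPath view of the very same tree.
my $common_name = $tree->findvalue('/camelids/species/common-name');

print "$root_name: $common_name\n";
```

Both lookups operate on the same tree object, so a later processing stage could be handed $tree directly instead of a string.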

Using XML::LibXML

This month's column may be seen as an addendum to the Perl/XML Quickstart Guide published earlier this year, when XML::LibXML was in its infancy, and we'll use the same tests from the Quickstart to put XML::LibXML through its paces. For a detailed overview of the test cases, see the first installment in the Quickstart; to summarize, the two tests illustrate how to extract and print data from an XML document, and how to build and print, programmatically, an XML document from data stored in a Perl hash, using the facilities offered by a given XML module.

Reading

For accessing the data stored in XML documents, XML::LibXML provides a standard W3C DOM interface. Documents are treated as a tree of nodes and the data those nodes contain are accessed by calling methods on the node objects themselves.

use strict;
use warnings;
use XML::LibXML;

# Parse the file into an in-memory DOM tree
my $file = 'files/camelids.xml';
my $parser = XML::LibXML->new();
my $tree = $parser->parse_file($file);
my $root = $tree->getDocumentElement;

# Collect every <species> element below the root
my @species = $root->getElementsByTagName('species');

foreach my $camelid (@species) {
    # The Latin name is an attribute of <species>
    my $latin_name = $camelid->getAttribute('name');
    # The common name is the text inside the first <common-name> child
    my @name_node  = $camelid->getElementsByTagName('common-name');
    my $common_name = $name_node[0]->getFirstChild->getData;
    # The status is an attribute of the first <conservation> child
    my @c_node  = $camelid->getElementsByTagName('conservation');
    my $status =  $c_node[0]->getAttribute('status');
    print "$common_name ($latin_name) $status \n";
}

One of the more exciting features of XML::LibXML is that, in addition to the DOM interface, it allows you to select nodes using the XPath language. The following illustrates how to achieve the same effect as the previous example using XPath to select the desired nodes:

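use strict;
use warnings;
use XML::LibXML;

# Parse the file into an in-memory tree, exactly as before
my $file = 'files/camelids.xml';
my $parser = XML::LibXML->new();
my $tree = $parser->parse_file($file);
my $root = $tree->getDocumentElement;

# The relative path 'species' selects the <species> children of the root
foreach my $camelid ($root->findnodes('species')) {
    # Each expression is evaluated relative to the current <species> node
    my $latin_name = $camelid->findvalue('@name');
    my $common_name = $camelid->findvalue('common-name');
    my $status =  $camelid->findvalue('conservation/@status');
    print "$common_name ($latin_name) $status \n";
}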

What makes this exciting is that you can mix and match methods from the DOM and XPath interfaces to best suit the needs of your application, while operating on the same tree of nodes.
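
A short sketch of that mixing, under some assumptions: the inline document below is hypothetical, shaped only after what the examples above expect from camelids.xml, and its names and status values are invented for illustration. An XPath predicate selects the nodes, a DOM method reads an attribute from each, and XPath is used again for the child text.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical data; element names follow the examples above,
# the values are made up for this sketch.
my $xml = <<'XML';
<camelids>
  <species name="Camelus dromedarius">
    <common-name>Dromedary</common-name>
    <conservation status="no special status"/>
  </species>
  <species name="Vicugna vicugna">
    <common-name>Vicuna</common-name>
    <conservation status="endangered"/>
  </species>
</camelids>
XML

my $parser = XML::LibXML->new();
my $tree   = $parser->parse_string($xml);
my $root   = $tree->getDocumentElement;

my @lines;
# XPath selects only the species whose status matches...
foreach my $camelid ($root->findnodes('species[conservation/@status="endangered"]')) {
    # ...a DOM method reads the attribute from the selected node...
    my $latin_name = $camelid->getAttribute('name');
    # ...and XPath fetches the text of the child element.
    my $common_name = $camelid->findvalue('common-name');
    push @lines, "$common_name ($latin_name)";
}

print "$_\n" for @lines;
```

Filtering in the XPath expression keeps the loop body free of conditionals, while the DOM call remains the more direct way to read a single attribute from a node already in hand.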
