XML::LibXML - An XML::Parser Alternative
The vast majority of Perl's XML modules are built on top of
XML::Parser, Larry Wall and Clark Cooper's Perl interface
to James Clark's expat parser. The
expat-XML::Parser combination is
not the only full-featured XML parser available in the Perl
world. This month we'll look at XML::LibXML, Matt
Sergeant and Christian Glahn's Perl interface to Daniel Veillard's
libxml2.
Expat and XML::Parser have proven themselves to be
quite capable, but they are not without limitations. Expat was among
the first XML parsers available and, as a result, its interfaces
reflect the expectations of users at the time it was written. Expat
and XML::Parser do not implement the Document Object
Model, SAX, or XPath language interfaces (things that most modern XML
users take for granted) because either the given interface did not
exist or was still being heavily evaluated and not considered
"standard" at the time it was written.
The somewhat unfortunate result of this is that most of the
available Perl XML modules are built upon one of
XML::Parser's non- or not-quite-standard interfaces with
the presumption that the input will be some sort of textual
representation of an XML document (file, filehandle, string, socket
stream) that must be parsed before proceeding. While this works for
many simple cases, most advanced XML applications need to do more than
one thing with a given document and that means that for each stage in
the process, the document must be serialized to a string and then
re-parsed by the next module.
By contrast, libxml2 was written after the DOM, XPath,
and SAX interfaces became common, and so it implements all three.
In-memory trees can be built by parsing documents stored in files,
strings, and so on, or generated from a series of SAX events. Those
trees can then be operated on using the W3C DOM and XPath interfaces
or used to generate SAX events that are handed off to external event
handlers. This added flexibility, which reflects current XML
processing expectations, makes XML::LibXML a strong
contender for XML::Parser's throne.
XML::LibXML
This month's column may be seen as an addendum to the Perl/XML
Quickstart Guide published earlier this year, when
XML::LibXML was in its infancy, and we'll use the same
tests from the Quickstart to put XML::LibXML through its
paces. For a detailed overview of the test cases, see the first
installment in the Quickstart. To summarize, the two tests
illustrate how to extract and print data from an XML document, and how
to build and print, programmatically, an XML document from data stored
in a Perl hash using the facilities offered by a given XML module.
For accessing the data stored in XML documents,
XML::LibXML provides a standard W3C DOM interface.
Documents are treated as a tree of nodes and the data those nodes
contain are accessed by calling methods on the node objects
themselves.
use strict;
use XML::LibXML;

my $file = 'files/camelids.xml';

# Parse the file into an in-memory DOM tree
my $parser = XML::LibXML->new();
my $tree = $parser->parse_file($file);
my $root = $tree->getDocumentElement;

# Collect every <species> element below the root
my @species = $root->getElementsByTagName('species');

foreach my $camelid (@species) {
    my $latin_name = $camelid->getAttribute('name');

    # The common name is the text child of <common-name>
    my @name_node = $camelid->getElementsByTagName('common-name');
    my $common_name = $name_node[0]->getFirstChild->getData;

    # The status is an attribute of <conservation>
    my @c_node = $camelid->getElementsByTagName('conservation');
    my $status = $c_node[0]->getAttribute('status');

    print "$common_name ($latin_name) $status \n";
}
One of the more exciting features of XML::LibXML is
that, in addition to the DOM interface, it allows you to select nodes
using the XPath language. The following illustrates how to achieve
the same effect as the previous example using XPath to select the
desired nodes:
use strict;
use XML::LibXML;

my $file = 'files/camelids.xml';
my $parser = XML::LibXML->new();
my $tree = $parser->parse_file($file);
my $root = $tree->getDocumentElement;

foreach my $camelid ($root->findnodes('species')) {
    my $latin_name = $camelid->findvalue('@name');
    my $common_name = $camelid->findvalue('common-name');
    my $status = $camelid->findvalue('conservation/@status');
    print "$common_name ($latin_name) $status \n";
}
What makes this exciting is that you can mix and match methods from the DOM and XPath interfaces to best suit the needs of your application, while operating on the same tree of nodes.
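As a minimal sketch of that mixing, the following selects nodes with an XPath expression and then reads and modifies them with DOM methods. The inline document, its status values, and the "flagged" attribute are hypothetical stand-ins invented for illustration. Only the element and attribute names are taken from the examples above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data; the status values are placeholders.
my $xml = <<'XML';
<camelids>
  <species name="Camelus bactrianus">
    <common-name>Bactrian Camel</common-name>
    <conservation status="status-a"/>
  </species>
  <species name="Lama glama">
    <common-name>Llama</common-name>
    <conservation status="status-b"/>
  </species>
</camelids>
XML

my $doc  = XML::LibXML->new()->parse_string($xml);
my $root = $doc->getDocumentElement;

my @lines;

# XPath picks out only the species with a particular status...
foreach my $camelid ($root->findnodes('species[conservation/@status="status-a"]')) {
    # ...and DOM methods read and change the selected nodes.
    my $latin_name  = $camelid->getAttribute('name');
    my $common_name = $camelid->findvalue('common-name');
    $camelid->setAttribute('flagged', 'yes');
    push @lines, "$common_name ($latin_name)";
}

print "$_\n" for @lines;

# The change made through the DOM is visible to later XPath queries,
# because both interfaces operate on the same tree.
my @flagged = $root->findnodes('species[@flagged="yes"]');
print scalar(@flagged), " species flagged\n";
```

Because both interfaces work on the same nodes, there is no copying or re-parsing between the selection step and the modification step.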
To create an XML document programmatically with XML::LibXML you simply use
the provided DOM interface:
use strict;
use XML::LibXML;

# %camelid_links is the hash of link data from the Quickstart tests
# (name => { url => ..., description => ... }); under "use strict" it
# must be declared and filled before this point.

my $doc = XML::LibXML::Document->new();
my $root = $doc->createElement('html');
$doc->setDocumentElement($root);
my $body = $doc->createElement('body');
$root->appendChild($body);

# Add one <a> element per entry in the hash
foreach my $item (keys (%camelid_links)) {
    my $link = $doc->createElement('a');
    $link->setAttribute('href', $camelid_links{$item}->{url});
    my $text = XML::LibXML::Text->new($camelid_links{$item}->{description});
    $link->appendChild($text);
    $body->appendChild($link);
}

print $doc->toString;
An important difference between XML::LibXML and
XML::DOM is that libxml2's object model
conforms to the W3C DOM Level 2 interface, which is better able to
cope with documents containing XML Namespaces. So, where
XML::DOM is limited to:
    @nodeset = $doc->getElementsByTagName($element_name);
and
    $node = $doc->createElement($element_name);
XML::LibXML also provides:
    @nodeset = $doc->getElementsByTagNameNS($namespace_uri, $element_name);
and
    $node = $doc->createElementNS($namespace_uri, $element_name);
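To make the difference concrete, here is a small sketch, not taken from the original tests, that places two elements with the same name but different namespaces in one document. It relies on the fact that both XHTML and SVG define an "a" element. The variable names are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xhtml_ns = 'http://www.w3.org/1999/xhtml';
my $svg_ns   = 'http://www.w3.org/2000/svg';

my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElementNS($xhtml_ns, 'html');
$doc->setDocumentElement($root);

my $body = $doc->createElementNS($xhtml_ns, 'body');
$root->appendChild($body);

# Two elements named "a", each in a different namespace
my $xhtml_link = $doc->createElementNS($xhtml_ns, 'a');
my $svg_link   = $doc->createElementNS($svg_ns, 'a');
$body->appendChild($xhtml_link);
$body->appendChild($svg_link);

# The name-only lookup cannot tell the two apart...
my @by_name = $root->getElementsByTagName('a');

# ...while the namespace-aware lookup returns only the requested kind.
my @xhtml_only = $root->getElementsByTagNameNS($xhtml_ns, 'a');
my @svg_only   = $root->getElementsByTagNameNS($svg_ns, 'a');

printf "by name: %d, XHTML: %d, SVG: %d\n",
    scalar(@by_name), scalar(@xhtml_only), scalar(@svg_only);
```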
We've seen the DOM and XPath goodness that
XML::LibXML provides, but the story does not end there.
The libxml2 library also offers a SAX interface that can
be used to create DOM trees from SAX events or generate SAX events
from DOM trees.
The following creates a DOM tree programmatically from a SAX
driver built on XML::SAX::Base. In this example, the
initial SAX events are generated from a custom driver implemented in
the CamelDriver class that calls the handler events in
the XML::LibXML::SAX::Builder class to build the DOM
tree.
use XML::LibXML;
use XML::LibXML::SAX::Builder;

# The builder is a SAX handler that assembles a DOM tree from events
my $builder = XML::LibXML::SAX::Builder->new();
my $driver = CamelDriver->new(Handler => $builder);
my $doc = $driver->parse(%camelid_links);

# doc is an XML::LibXML::Document object
print $doc->toString;

package CamelDriver;
use base qw(XML::SAX::Base);

sub parse {
    my $self = shift;
    my %links = @_;

    $self->SUPER::start_document;
    $self->SUPER::start_element({Name => 'html'});
    $self->SUPER::start_element({Name => 'body'});

    # Iterate over the hash passed in, not the caller's global
    foreach my $item (keys (%links)) {
        $self->SUPER::start_element({Name => 'a',
                                     Attributes => {
                                         'href' => $links{$item}->{url}
                                     }
                                    });
        $self->SUPER::characters({Data => $links{$item}->{description}});
        $self->SUPER::end_element({Name => 'a'});
    }

    $self->SUPER::end_element({Name => 'body'});
    $self->SUPER::end_element({Name => 'html'});

    # The value returned by the handler's end_document is passed back
    $self->SUPER::end_document;
}

1;
You can also generate SAX events from an existing DOM tree using
XML::LibXML::SAX::Generator. In the following snippet,
the DOM tree created by parsing the file camelids.xml is
handed to XML::LibXML::SAX::Generator's
generate() method, which in turn calls the event handlers
in XML::Handler::XMLWriter to print the document to
STDOUT.
use strict;
use XML::LibXML;
use XML::LibXML::SAX::Generator;
use XML::Handler::XMLWriter;

my $file = 'files/camelids.xml';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($file);

my $handler = XML::Handler::XMLWriter->new();
my $driver = XML::LibXML::SAX::Generator->new(Handler => $handler);

# generate SAX events that are captured
# by a SAX Handler or Filter.
$driver->generate($doc);
This ability to accept and emit SAX events is especially useful in
light of the recent discussion in this column of
generating SAX events from non-XML data and writing SAX
filter chains. You could, for example, use a SAX driver written in
Perl to emit events based on data returned from a database query that
creates a DOM object, which is then transformed in C-space for display
using XSLT and the mind-numbingly fast libxslt library
(which expects libxml2 DOM objects), and then emit SAX
events from that transformed DOM tree for further processing using
custom SAX filters to provide the finishing touches -- all without
once having had to serialize the document to a string for re-parsing.
Wow.
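The middle step of such a pipeline might look roughly like the sketch below. It assumes the XML::LibXSLT module (the Perl binding to libxslt) is installed. The source tree is built directly with DOM calls as a stand-in for one assembled from SAX events, and the "links"/"link" element names, the URL, and the stylesheet are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Stand-in source tree; in the pipeline described above it would come
# from a SAX driver feeding XML::LibXML::SAX::Builder.
my $source = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root   = $source->createElement('links');
$source->setDocumentElement($root);

my $link = $source->createElement('link');
$link->setAttribute('url', 'http://example.org/camels');   # hypothetical URL
$link->appendChild($source->createTextNode('Example camel page'));
$root->appendChild($link);

# A small illustrative stylesheet that turns <link> elements into <a> elements
my $style_doc = XML::LibXML->new()->parse_string(<<'XSL');
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/links">
    <html>
      <body>
        <xsl:for-each select="link">
          <a href="{@url}"><xsl:value-of select="."/></a>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
XSL

my $stylesheet = XML::LibXSLT->new()->parse_stylesheet($style_doc);

# transform() takes a DOM tree and returns a DOM tree; nothing is
# serialized to a string in between.
my $result = $stylesheet->transform($source);

# The result can be queried like any other XML::LibXML tree
my @anchors = $result->findnodes('/html/body/a');
foreach my $anchor (@anchors) {
    print $anchor->getAttribute('href'), ' => ', $anchor->textContent, "\n";
}
```

The tree held in $result could then be passed to a SAX generator for further filtering, as described above.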
As we have seen, XML::LibXML offers a fast, updated
approach to XML processing that may be superior to the
first-generation XML::Parser for many cases. Do not
misunderstand: XML::Parser and its dependents are still
quite useful and well-supported, and they are not likely to go away any time
soon. But XML::Parser is not the only game in town, and given the added
flexibility that XML::LibXML provides, I would strongly
encourage you to give XML::LibXML a closer look before
beginning your next Perl/XML project.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.