Perl XML Quickstart: The Perl XML Interfaces
Introduction
A recent flurry of questions to the Perl-XML mailing list points to the need for a document that gives new users a quick, how-to overview of the various Perl XML modules. For the next few months I will be devoting this column solely to that purpose.
The XML modules available from CPAN can be divided into three main categories: modules that provide unique interfaces to XML data (usually concerned with translating data between an XML instance and Perl data structures), modules that implement one of the standard XML APIs, and special-purpose modules that seek to simplify the execution of some specific XML-related task. This month we will be looking at the first of these, the Perl-specific XML interfaces.
use Disclaimer qw(:standard);
This is not an exercise in comparative performance benchmarking, nor is it my intention to suggest that any one module is inherently more useful than another. Choosing the right XML module for your project depends largely upon the nature of the project and your past experience. Different interfaces lend themselves to different kinds of tasks and to different kinds of people. My only goal is to offer working examples of the various interfaces by defining two simple tasks, and then showing how to achieve the same net result using each of the selected modules.
The Tasks
While the uses for XML are rich and varied, most XML-related tasks can be divided into two groups: those related to extracting data from existing XML documents, and those related to creating new XML documents using data from other sources. With this in mind, the examples that we will use for our module introductions will consist of extracting a specific set of data from an XML file, and marking up a Perl data structure in a specific XML format.
Task One: Extracting Information
First, consider the following XML fragment:
<?xml version="1.0"?>
<camelids>
<species name="Camelus dromedarius">
<common-name>Dromedary, or Arabian Camel</common-name>
<physical-characteristics>
<mass>300 to 690 kg.</mass>
<appearance>
The dromedary camel is characterized by a long-curved
neck, deep-narrow chest, and a single hump.
...
</appearance>
</physical-characteristics>
<natural-history>
<food-habits>
The dromedary camel is an herbivore.
...
</food-habits>
<reproduction>
The dromedary camel has a lifespan of about 40-50 years
...
</reproduction>
<behavior>
With the exception of rutting males, dromedaries show
very little aggressive behavior.
...
</behavior>
<habitat>
The camels prefer desert conditions characterized by a
long dry season and a short rainy season.
...
</habitat>
</natural-history>
<conservation status="no special status">
<detail>
Since the dromedary camel is domesticated, the camel has
no special status in conservation.
</detail>
</conservation>
</species>
...
</camelids>
Now let's say that the complete document (available with this
month's sample code) contains the same information for all the members
of the Camelidae family, not just our friend the single-humped
Dromedary Camel. To illustrate how each module might be used to
extract a subset of the data stored in this document, we will write a
tiny script that parses the camelids.xml document and, for
each species found, prints a line to STDOUT containing
that species' common name, Latin name (in parentheses), and
conservation status. So, having processed the entire document, the
output of each script should yield the following result:
Bactrian Camel (Camelus bactrianus) endangered
Dromedary, or Arabian Camel (Camelus dromedarius) no special status
Llama (Lama glama) no special status
Guanaco (Lama guanicoe) special concern
Vicuna (Vicugna vicugna) endangered
Task Two: Creating An XML Document
To demonstrate how each of the selected modules may be used to create XML documents from other data sources, we will write a small script that marks up a simple Perl hash containing URLs to a few cool camelid-related pages on the Web as a simple XHTML document.
Here's the hash:
my %camelid_links = (
one => { url => 'http://www.online.discovery.com/news/picture/may99/photo20.html',
description => 'Bactrian Camel in front of Great ' .
'Pyramids in Giza, Egypt.'},
two => { url => 'http://www.fotos-online.de/english/m/09/9532.htm',
description => 'Dromedary Camel illustrates the ' .
'importance of accessorizing.'},
three => { url => 'http://www.eskimo.com/~wallama/funny.htm',
description => 'Charlie - biography of a narcissistic llama.'},
four => { url => 'http://arrow.colorado.edu/travels/other/turkey.html',
description => 'A visual metaphor for the perl5-porters ' .
'list?'},
five => { url => 'http://www.galaonline.org/pics.htm',
description => 'Many cool alpacas.'},
six => { url => 'http://www.thpf.de/suedamerikareise/galerie/vicunas.htm',
description => 'Wild Vicunas in a scenic landscape.'}
);
And here is an example of the document that we hope to create from that hash:
<?xml version="1.0"?>
<html>
<body>
<a href="http://www.eskimo.com/~wallama/funny.htm">Charlie -
biography of a narcissistic llama.</a>
<a href="http://www.online.discovery.com/news/picture/may99/photo20.html">Bactrian
Camel in front of Great Pyramids in Giza, Egypt.</a>
<a href="http://www.fotos-online.de/english/m/09/9532.htm">Dromedary
Camel illustrates the importance of accessorizing.</a>
<a href="http://www.galaonline.org/pics.htm">Many cool alpacas.</a>
<a href="http://arrow.colorado.edu/travels/other/turkey.html">A visual
metaphor for the perl5-porters list?</a>
<a href="http://www.thpf.de/suedamerikareise/galerie/vicunas.htm">Wild
Vicunas in a scenic landscape.</a>
</body>
</html>
It's important to note that while the resulting XML is indented for readability (as shown above), this sort of fine-grained whitespace handling is not part of our sample requirement. All we care about is that the resulting document is well-formed XML, and that it accurately reflects the data stored in our hash.
With our tasks defined, let's get straight to the code samples.
Samples of the Perl-specific XML Interfaces
XML::Simple
Originally created to simplify the task of reading and writing config files
in an XML format, XML::Simple translates data between XML
documents and native Perl data structures with no intervening abstract
interface. Elements and attributes are accessed using nested references.
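To make "nested references" concrete, here is a minimal sketch of the kind of structure XML::Simple hands back for our camelids document. The exact shape is an assumption based on the module's default behavior of folding elements on their name attribute, and the hash is written out by hand (trimmed to two species) so that the snippet runs without any XML modules installed.

```perl
use strict;
use warnings;

# Assumed shape of the data returned for camelids.xml, trimmed to two
# species. Each <species> is keyed by its name attribute; child elements
# and attributes both appear as ordinary hash keys.
my $doc = {
    species => {
        'Camelus dromedarius' => {
            'common-name' => 'Dromedary, or Arabian Camel',
            conservation  => { status => 'no special status' },
        },
        'Camelus bactrianus' => {
            'common-name' => 'Bactrian Camel',
            conservation  => { status => 'endangered' },
        },
    },
};

# Walking the structure is plain hash dereferencing.
my $latin = 'Camelus bactrianus';
my $camel = $doc->{species}{$latin};
my $line  = sprintf '%s (%s) %s',
    $camel->{'common-name'}, $latin, $camel->{conservation}{status};

print "$line\n";
```

Note that nothing in the structure records whether a given key started life as an element or as an attribute.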
Reading
use XML::Simple;
my $file = 'files/camelids.xml';
my $xs1 = XML::Simple->new();
my $doc = $xs1->XMLin($file);
foreach my $key (keys (%{$doc->{species}})){
print $doc->{species}->{$key}->{'common-name'} . ' (' . $key . ') ';
print $doc->{species}->{$key}->{conservation}->{status} . "\n";
}
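One caveat with the reading sample: it relies on the document containing several species elements, so that XML::Simple folds them into a hash keyed by the name attribute. With only one species present, the default settings return that element's own hash instead, and the loop would iterate over its fields. The sketch below, which assumes XML::Simple is installed and uses a made-up one-species document, shows how the ForceArray and KeyAttr options keep the shape consistent.

```perl
use strict;
use warnings;
use XML::Simple;

# A hypothetical document holding a single species.
my $xml = <<'XML';
<camelids>
  <species name="Camelus dromedarius">
    <common-name>Dromedary, or Arabian Camel</common-name>
    <conservation status="no special status">
      <detail>Domesticated.</detail>
    </conservation>
  </species>
</camelids>
XML

# ForceArray treats <species> as a list even when only one is present;
# KeyAttr then folds that list into a hash keyed by the name attribute.
my $doc = XMLin(
    $xml,
    ForceArray => ['species'],
    KeyAttr    => { species => 'name' },
);

foreach my $key (sort keys %{ $doc->{species} }) {
    my $species = $doc->{species}{$key};
    print "$species->{'common-name'} ($key) $species->{conservation}{status}\n";
}
```

Sorting the keys is optional; it simply makes the output order repeatable from one run to the next.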
Writing
use XML::Simple;
require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();
my $xsimple = XML::Simple->new();
print $xsimple->XMLout(\%camelid_links,
noattr => 1,
xmldecl => '<?xml version="1.0"?>');
Note that the requirements of the data-to-document task reveal one
of XML::Simple's few weaknesses: it doesn't allow us to
decide which keys in our hash should be returned as elements and which
should be returned as attributes. The output from the sample above
would be close to the requirement, but it wouldn't be close
enough. For those cases where we prefer to manipulate the contents of
an XML document using native Perl data structures, but need finer
control over the output, a combination of XML::Simple and
XML::Writer works nicely.
The following illustrates how to use XML::Writer to
meet the output requirement.
use XML::Writer;
require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();
my $writer = XML::Writer->new();
$writer->xmlDecl();
$writer->startTag('html');
$writer->startTag('body');
foreach my $item ( keys (%camelid_links) ) {
$writer->startTag('a', 'href' => $camelid_links{$item}->{url});
$writer->characters($camelid_links{$item}->{description});
$writer->endTag('a');
}
$writer->endTag('body');
$writer->endTag('html');
$writer->end();
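The writing samples load their data through get_camelid_data() from files/camelid_links.pl, which ships with the sample code and is not reproduced in this article. As a rough sketch only, such a helper might look like the following; the body of the function is an assumption (and is trimmed to two entries), not the actual file. The snippet also shows one way to get a repeatable link order, since keys returns hash keys in no particular order.

```perl
use strict;
use warnings;

# Hypothetical stand-in for files/camelid_links.pl. If saved as a separate
# file and loaded with require, the file must end with a true value (1;).
sub get_camelid_data {
    return (
        one => { url         => 'http://www.online.discovery.com/news/picture/may99/photo20.html',
                 description => 'Bactrian Camel in front of Great Pyramids in Giza, Egypt.' },
        two => { url         => 'http://www.fotos-online.de/english/m/09/9532.htm',
                 description => 'Dromedary Camel illustrates the importance of accessorizing.' },
    );
}

my %camelid_links = get_camelid_data();

# Sort the keys by description so every run emits the links in the same order.
my @ordered = sort {
    $camelid_links{$a}{description} cmp $camelid_links{$b}{description}
} keys %camelid_links;

print "$camelid_links{$_}{url}\n" for @ordered;
```

Replacing keys (%camelid_links) with a sorted list like @ordered in any of the writing samples would make their output order stable.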
XML::SimpleObject
XML::SimpleObject provides an object-oriented
interface to XML data using accessor methods that are reminiscent of
the Document Object Model.
Reading
use XML::Parser;
use XML::SimpleObject;
my $file = 'files/camelids.xml';
my $parser = XML::Parser->new(ErrorContext => 2, Style => "Tree");
my $xso = XML::SimpleObject->new( $parser->parsefile($file) );
foreach my $species ($xso->child('camelids')->children('species')) {
print $species->child('common-name')->{VALUE};
print ' (' . $species->attribute('name') . ') ';
print $species->child('conservation')->attribute('status');
print "\n";
}
Writing
XML::SimpleObject has no facility for creating new XML
documents from scratch. It can, however, easily be used in conjunction
with XML::Writer in the way illustrated in the
XML::Simple example above.
XML::TreeBuilder
The XML::TreeBuilder distribution ships with two
modules: XML::Element, for creating or accessing the
contents of XML element nodes, and XML::TreeBuilder, a
factory package that simplifies the building of document trees from
existing XML files. Those who have had past experience with the
venerable HTML::Element and HTML::Tree
modules will find XML::TreeBuilder very easy to use,
since the interfaces are identical apart from a few XML-specific
methods.
Reading
use XML::TreeBuilder;
my $file = 'files/camelids.xml';
my $tree = XML::TreeBuilder->new();
$tree->parse_file($file);
foreach my $species ($tree->find_by_tag_name('species')){
print $species->find_by_tag_name('common-name')->as_text;
print ' (' . $species->attr_get_i('name') . ') ';
print $species->find_by_tag_name('conservation')->attr_get_i('status');
print "\n";
}
Writing
use XML::Element;
require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();
my $root = XML::Element->new('html');
my $body = XML::Element->new('body');
my $xml_pi = XML::Element->new('~pi', text => 'xml version="1.0"');
$root->push_content($body);
foreach my $item ( keys (%camelid_links) ) {
my $link = XML::Element->new('a', 'href' => $camelid_links{$item}->{url});
$link->push_content($camelid_links{$item}->{description});
$body->push_content($link);
}
print $xml_pi->as_XML;
print $root->as_XML();
XML::Twig
XML::Twig stands apart from the other Perl-only XML
interfaces in that it combines an inventive Perlish interface with
many of the features found in the standard XML APIs. For a more
detailed introduction to XML::Twig, see the XML.com
article Using XML::Twig, listed under Resources.
Reading
use XML::Twig;
my $file = 'files/camelids.xml';
my $twig = XML::Twig->new();
$twig->parsefile($file);
my $root = $twig->root;
foreach my $species ($root->children('species')){
print $species->first_child_text('common-name');
print ' (' . $species->att('name') . ') ';
print $species->first_child('conservation')->att('status');
print "\n";
}
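The sample above loads the whole document before walking it. XML::Twig can also hand each element to a callback as soon as it has been parsed, which is one of the features that sets it apart from the other modules covered here. The following is a minimal sketch of that style, assuming XML::Twig is installed; the inline two-species document is made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# A hypothetical, trimmed-down stand-in for camelids.xml.
my $xml = <<'XML';
<camelids>
  <species name="Lama glama">
    <common-name>Llama</common-name>
    <conservation status="no special status"/>
  </species>
  <species name="Vicugna vicugna">
    <common-name>Vicuna</common-name>
    <conservation status="endangered"/>
  </species>
</camelids>
XML

my @lines;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each <species>, after its end tag has been read.
        species => sub {
            my ($t, $species) = @_;
            push @lines, sprintf '%s (%s) %s',
                $species->first_child_text('common-name'),
                $species->att('name'),
                $species->first_child('conservation')->att('status');
            $t->purge;    # release the parts of the tree already handled
        },
    },
);

$twig->parse($xml);
print "$_\n" for @lines;
```

Because each species is purged once it has been handled, memory use stays roughly flat however many species the document holds.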
Writing
use XML::Twig;
require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();
my $root = XML::Twig::Elt->new('html');
my $body = XML::Twig::Elt->new('body');
$body->paste($root);
foreach my $item ( keys (%camelid_links) ) {
my $link = XML::Twig::Elt->new('a');
$link->set_att('href', $camelid_links{$item}->{url});
$link->set_text($camelid_links{$item}->{description});
$link->paste('last_child', $body);
}
print qq|<?xml version="1.0"?>|;
$root->print;
These examples have illustrated the basic usage for the more
generic Perl XML modules. My goal has been to give just enough example
code to give you a feel for what it is like to work with each of these
modules. Next month we will look at those Perl modules that implement
one of the standard XML interfaces; specifically,
XML::DOM, XML::XPath, and the various SAX
and SAX-like modules.
Resources
- Download sample code.
- A complete list of the XML modules available from CPAN
- Perl-XML mailing list archives
- Using XML::Twig