Perl XML Quickstart: The Perl XML Interfaces

April 18, 2001

Introduction

A recent flurry of questions to the Perl-XML mailing list points to the need for a document that gives new users a quick, how-to overview of the various Perl XML modules. For the next few months I will be devoting this column solely to that purpose.

The XML modules available from CPAN can be divided into three main categories: modules that provide unique interfaces to XML data (usually concerned with translating data between an XML instance and Perl data structures), modules that implement one of the standard XML APIs, and special-purpose modules that seek to simplify the execution of some specific XML-related task. This month we will be looking at the first of these, the Perl-specific XML interfaces.

use Disclaimer qw(:standard);

This is not an exercise in comparative performance benchmarking, nor is it my intention to suggest that any one module is inherently more useful than another. Choosing the right XML module for your project depends largely upon the nature of the project and your past experience. Different interfaces lend themselves to different kinds of tasks and to different kinds of people. My only goal is to offer working examples of the various interfaces by defining two simple tasks, and then showing how to achieve the same net result using each of the selected modules.

The Tasks

While the uses for XML are rich and varied, most XML-related tasks can be divided into two groups: those related to extracting data from existing XML documents, and those related to creating new XML documents using data from other sources. With this in mind, the examples that we will use for our module introductions will consist of extracting a specific set of data from an XML file, and marking up a Perl data structure in a specific XML format.

Task One: Extracting Information

First, consider the following XML fragment:

<?xml version="1.0"?>
<camelids>
  <species name="Camelus dromedarius">
    <common-name>Dromedary, or Arabian Camel</common-name>
    <physical-characteristics>
      <mass>300 to 690 kg.</mass>
      <appearance>
        The dromedary camel is characterized by a long-curved
        neck, deep-narrow chest, and a single hump.
        ...
      </appearance>
    </physical-characteristics>
    <natural-history>
       <food-habits>
         The dromedary camel is an herbivore.
         ...
       </food-habits>
       <reproduction>
         The dromedary camel has a lifespan of about 40-50 years
         ...
       </reproduction>
       <behavior>
         With the exception of rutting males, dromedaries show
         very little aggressive behavior.
         ...
       </behavior>
       <habitat>
         The camels prefer desert conditions characterized by a
         long dry season and a short rainy season.
         ...
       </habitat>
    </natural-history>
    <conservation status="no special status">
      <detail>
        Since the dromedary camel is domesticated, the camel has
        no special status in conservation.
      </detail>
    </conservation>
  </species>
  ...
</camelids>

Now let's say that the complete document (available with this month's sample code) contains the same information for all the members of the Camelidae family, not just our friend the single-humped Dromedary Camel. To illustrate how each module might be used to extract a subset of the data stored in this document, we will write a tiny script that parses the camelids.xml document and, for each species found, prints a line to STDOUT containing that species' common name, Latin name (in parentheses), and conservation status. So, having processed the entire document, each script should produce the following output:

Bactrian Camel (Camelus bactrianus) endangered
Dromedary, or Arabian Camel (Camelus dromedarius) no special status
Llama (Lama glama) no special status
Guanaco (Lama guanicoe) special concern
Vicuna (Vicugna vicugna) endangered

Task Two: Creating An XML Document

To demonstrate how each of the selected modules may be used to create XML documents from other data sources, we will write a small script that marks up a simple Perl hash containing URLs to a few cool camelid-related pages on the Web as a simple XHTML document.

Here's the hash:

my %camelid_links = (
    one   => { url         => 'http://www.online.discovery.com/news/picture/may99/photo20.html',
               description => 'Bactrian Camel in front of Great ' .
                              'Pyramids in Giza, Egypt.'},
    two   => { url         => 'http://www.fotos-online.de/english/m/09/9532.htm',
               description => 'Dromedary Camel illustrates the ' .
                              'importance of accessorizing.'},
    three => { url         => 'http://www.eskimo.com/~wallama/funny.htm',
               description => 'Charlie - biography of a narcissistic llama.'},
    four  => { url         => 'http://arrow.colorado.edu/travels/other/turkey.html',
               description => 'A visual metaphor for the perl5-porters ' .
                              'list?'},
    five  => { url         => 'http://www.galaonline.org/pics.htm',
               description => 'Many cool alpacas.'},
    six   => { url         => 'http://www.thpf.de/suedamerikareise/galerie/vicunas.htm',
               description => 'Wild Vicunas in a scenic landscape.'}
);
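The writing scripts later in this article load this hash with require "files/camelid_links.pl" followed by a call to get_camelid_data(). That file comes with the sample code and is not reproduced here, so the following is only a guess at its shape: a file that defines the sub and ends with a true value so that require succeeds. Only two of the six entries are repeated to keep the sketch short.

```perl
# files/camelid_links.pl (hypothetical contents; the real file is part of
# the sample code and may differ)
use strict;
use warnings;

# Return the link data as a key/value list so that callers can write
#     my %camelid_links = get_camelid_data();
sub get_camelid_data {
    my %links = (
        three => { url         => 'http://www.eskimo.com/~wallama/funny.htm',
                   description => 'Charlie - biography of a narcissistic llama.' },
        five  => { url         => 'http://www.galaonline.org/pics.htm',
                   description => 'Many cool alpacas.' },
        # ... the remaining four entries go here, exactly as in the hash above
    );
    return %links;
}

1;    # a required file must end with a true value
```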

And here is an example of the document that we hope to create from that hash:

<?xml version="1.0"?>
<html>
  <body>
    <a href="http://www.eskimo.com/~wallama/funny.htm">Charlie -
      biography of a narcissistic llama.</a>
    <a href="http://www.online.discovery.com/news/picture/may99/photo20.html">Bactrian
      Camel in front of Great Pyramids in Giza, Egypt.</a>
    <a href="http://www.fotos-online.de/english/m/09/9532.htm">Dromedary
      Camel illustrates the importance of accessorizing.</a>
    <a href="http://www.galaonline.org/pics.htm">Many cool alpacas.</a>
    <a href="http://arrow.colorado.edu/travels/other/turkey.html">A visual
      metaphor for the perl5-porters list?</a>
    <a href="http://www.thpf.de/suedamerikareise/galerie/vicunas.htm">Wild
      Vicunas in a scenic landscape.</a>
  </body>
</html>

It's important to note that while the resulting XML is indented for readability (as shown above), this sort of fine-grained whitespace handling is not part of our sample requirement. All we care about is that the resulting document is well-formed XML, and that it accurately reflects the data stored in our hash.
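Well-formedness is easy to check mechanically. The sketch below is not part of the original sample code; it assumes that XML::Parser is installed and wraps a parse in eval, since XML::Parser dies on the first well-formedness error. The helper name is_well_formed and the two test strings are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns 1 if $xml parses as well-formed XML, 0 otherwise.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new( ErrorContext => 2 );
    # parse() dies on the first error, so trap the exception
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

my $good = qq{<?xml version="1.0"?>\n}
         . '<html><body><a href="http://www.galaonline.org/pics.htm">'
         . 'Many cool alpacas.</a></body></html>';
my $bad  = '<html><body><a>unclosed link</body></html>';

print is_well_formed($good) ? "well-formed\n" : "not well-formed\n";
print is_well_formed($bad)  ? "well-formed\n" : "not well-formed\n";
```

A check like this only confirms that the markup is syntactically sound; whether the links match the hash still has to be verified by reading the output.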

With our tasks defined, let's get straight to the code samples.

Samples of the Perl-specific XML Interfaces

XML::Simple

Originally created to simplify the task of reading and writing config files in an XML format, XML::Simple translates data between XML documents and native Perl data structures with no intervening abstract interface. Elements and attributes are accessed using nested references.

Reading

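use XML::Simple;

my $file = 'files/camelids.xml';
my $xs1 = XML::Simple->new();

# XMLin returns a hash reference; species are keyed on their name attribute
my $doc = $xs1->XMLin($file);

foreach my $key (keys (%{$doc->{species}})){
   print $doc->{species}->{$key}->{'common-name'} . ' (' . $key . ') ';
   # the status attribute is an ordinary hash key, not a method call
   print $doc->{species}->{$key}->{conservation}->{status} . "\n";
}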
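The nested-reference access above depends on how XMLin folds the document: by default, repeated elements that carry a name attribute are turned into a hash keyed on that attribute, which is why the Latin name shows up as $key. The sketch below is not from the original sample code. It spells that folding out with the ForceArray and KeyAttr options, so that it would also apply to a file holding a single species, and it sorts the keys because Perl hashes have no fixed order. The inline two-species document and the species_lines helper are invented for illustration.

```perl
use strict;
use warnings;
use XML::Simple;

# A cut-down stand-in for camelids.xml (illustrative data only)
my $xml = <<'XML';
<camelids>
  <species name="Vicugna vicugna">
    <common-name>Vicuna</common-name>
    <conservation status="endangered"/>
  </species>
  <species name="Lama glama">
    <common-name>Llama</common-name>
    <conservation status="no special status"/>
  </species>
</camelids>
XML

my $doc = XML::Simple->new->XMLin(
    $xml,
    ForceArray => ['species'],              # always treat <species> as a list
    KeyAttr    => { species => 'name' },    # fold that list into a hash keyed on "name"
);

# Hypothetical helper: one "Common name (Latin name) status" line per species,
# ordered by Latin name.
sub species_lines {
    my ($doc) = @_;
    my $species = $doc->{species};
    my @lines;
    for my $name ( sort keys %$species ) {
        my $entry = $species->{$name};
        push @lines,
            "$entry->{'common-name'} ($name) $entry->{conservation}{status}";
    }
    return @lines;
}

print "$_\n" for species_lines($doc);
```

Sorting trades document order for repeatable output. If document order matters, one of the tree-based modules shown later is a better fit.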

Writing

use XML::Simple;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $xsimple = XML::Simple->new();

print $xsimple->XMLout(\%camelid_links,
                       noattr => 1,
                       xmldecl => '<?xml version="1.0"?>');

Note that the requirements of the data-to-document task reveal one of XML::Simple's few weaknesses: it doesn't allow us to decide which keys in our hash should be returned as elements and which should be returned as attributes. The output from the sample above would be close to the requirement, but it wouldn't be close enough. For those cases where we prefer to manipulate the contents of an XML document using native Perl data structures, but need finer control over the output, a combination of XML::Simple and XML::Writer works nicely.

The following illustrates how to use XML::Writer to meet the output requirement.

use XML::Writer;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $writer = XML::Writer->new();

$writer->xmlDecl();
$writer->startTag('html');
$writer->startTag('body');

foreach my $item ( keys (%camelid_links) ) {
    $writer->startTag('a', 'href' => $camelid_links{$item}->{url});
    $writer->characters($camelid_links{$item}->{description});
    $writer->endTag('a');
}

$writer->endTag('body');
$writer->endTag('html');

$writer->end();
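The script above writes the links in whatever order keys happens to return them and sends the document straight to STDOUT. The variation below is a sketch, not part of the original sample code: it sorts the keys for repeatable output, uses dataElement as shorthand for the startTag/characters/endTag sequence, and captures the result in a string. It assumes a version of XML::Writer recent enough to accept a scalar reference for OUTPUT, and the two-entry hash stands in for the full data.

```perl
use strict;
use warnings;
use XML::Writer;

# Two entries standing in for the full %camelid_links hash
my %camelid_links = (
    three => { url         => 'http://www.eskimo.com/~wallama/funny.htm',
               description => 'Charlie - biography of a narcissistic llama.' },
    five  => { url         => 'http://www.galaonline.org/pics.htm',
               description => 'Many cool alpacas.' },
);

my $xml    = '';
my $writer = XML::Writer->new(
    OUTPUT      => \$xml,    # collect the document in $xml instead of STDOUT
    DATA_MODE   => 1,        # put each element on its own line
    DATA_INDENT => 2,        # indent nested elements by two spaces
);

$writer->xmlDecl();
$writer->startTag('html');
$writer->startTag('body');

foreach my $item ( sort keys %camelid_links ) {
    # dataElement writes the start tag, the escaped text, and the end tag
    $writer->dataElement( 'a', $camelid_links{$item}{description},
                          href => $camelid_links{$item}{url} );
}

$writer->endTag('body');
$writer->endTag('html');
$writer->end();

print $xml;
```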

XML::SimpleObject

XML::SimpleObject provides an object-oriented interface to XML data using accessor methods that are reminiscent of the Document Object Model.

Reading

use XML::Parser;
use XML::SimpleObject;

my $file = 'files/camelids.xml';

my $parser = XML::Parser->new(ErrorContext => 2, Style => "Tree");
my $xso = XML::SimpleObject->new( $parser->parsefile($file) );

foreach my $species ($xso->child('camelids')->children('species')) {
    print $species->child('common-name')->{VALUE};
    print ' (' . $species->attribute('name') . ') ';
    print $species->child('conservation')->attribute('status');
    print "\n";
}

Writing

XML::SimpleObject has no facility for creating new XML documents from scratch. It can, however, easily be used in conjunction with XML::Writer in the way illustrated in the XML::Simple example above.

XML::TreeBuilder

The XML::TreeBuilder distribution ships with two modules: XML::Element, for creating or accessing the contents of XML element nodes, and XML::TreeBuilder, a factory package that simplifies the building of document trees from existing XML files. Those who have experience with the venerable HTML::Element and HTML::Tree modules will find XML::TreeBuilder very easy to use, since the interfaces are identical apart from a few XML-specific methods.

Reading

use XML::TreeBuilder;

my $file = 'files/camelids.xml';
my $tree = XML::TreeBuilder->new();

$tree->parse_file($file);

foreach my $species ($tree->find_by_tag_name('species')){
    print $species->find_by_tag_name('common-name')->as_text;
    print ' (' . $species->attr_get_i('name') . ') ';
    print $species->find_by_tag_name('conservation')->attr_get_i('status');
    print "\n";
}

Writing

use XML::Element;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $root = XML::Element->new('html');
my $body = XML::Element->new('body');
my $xml_pi = XML::Element->new('~pi', text => 'xml version="1.0"');
$root->push_content($body);

foreach my $item ( keys (%camelid_links) ) {
    my $link = XML::Element->new('a', 'href' => $camelid_links{$item}->{url});
    $link->push_content($camelid_links{$item}->{description});
    $body->push_content($link);
}

print $xml_pi->as_XML;
print $root->as_XML();

XML::Twig

XML::Twig stands apart from the other Perl-only XML interfaces in that it combines an inventive Perlish interface with many of the features found in the standard XML APIs. For a more detailed introduction to XML::Twig see this XML.com article.

Reading

use XML::Twig;

my $file = 'files/camelids.xml';
my $twig = XML::Twig->new();

$twig->parsefile($file);

my $root = $twig->root;

foreach my $species ($root->children('species')){
    print $species->first_child_text('common-name');
    print ' (' . $species->att('name') . ') ';
    print $species->first_child('conservation')->att('status');
    print "\n";
}
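The script above loads the whole document before walking it. One of the features that XML::Twig shares with stream-style APIs is the ability to handle each element as soon as it has been parsed. The sketch below, which is not part of the original sample code, registers a handler for species elements through the twig_handlers option and calls purge to release the parts of the tree that have already been processed. The inline document is a cut-down stand-in for camelids.xml.

```perl
use strict;
use warnings;
use XML::Twig;

# A cut-down stand-in for camelids.xml (illustrative data only)
my $xml = <<'XML';
<camelids>
  <species name="Lama glama">
    <common-name>Llama</common-name>
    <conservation status="no special status"/>
  </species>
  <species name="Vicugna vicugna">
    <common-name>Vicuna</common-name>
    <conservation status="endangered"/>
  </species>
</camelids>
XML

my @lines;

my $twig = XML::Twig->new(
    twig_handlers => {
        # called each time a complete <species> element has been parsed
        species => sub {
            my ( $t, $species ) = @_;
            push @lines, sprintf '%s (%s) %s',
                $species->first_child_text('common-name'),
                $species->att('name'),
                $species->first_child('conservation')->att('status');
            $t->purge;    # discard the parts of the tree already handled
        },
    },
);

$twig->parse($xml);

print "$_\n" for @lines;
```

Because the handler fires in document order, the lines come out in the order the species appear in the file, and memory use stays roughly proportional to one species rather than to the whole document.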

Writing

use XML::Twig;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $root = XML::Twig::Elt->new('html');
my $body = XML::Twig::Elt->new('body');
$body->paste($root);

foreach my $item ( keys (%camelid_links) ) {
    my $link = XML::Twig::Elt->new('a');
    $link->set_att('href', $camelid_links{$item}->{url});
    $link->set_text($camelid_links{$item}->{description});
    $link->paste('last_child', $body);
}

print qq|<?xml version="1.0"?>|;
$root->print;
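XML::Twig::Elt->new also accepts an attribute hash and content after the tag name, which shortens the loop. The sketch below is a variation on the script above rather than part of the original sample code. It sorts the keys for repeatable output and uses sprint to return the document as a string instead of printing it directly. The two-entry hash stands in for the full data.

```perl
use strict;
use warnings;
use XML::Twig;

# Two entries standing in for the full %camelid_links hash
my %camelid_links = (
    three => { url         => 'http://www.eskimo.com/~wallama/funny.htm',
               description => 'Charlie - biography of a narcissistic llama.' },
    five  => { url         => 'http://www.galaonline.org/pics.htm',
               description => 'Many cool alpacas.' },
);

my $root = XML::Twig::Elt->new('html');
my $body = XML::Twig::Elt->new('body');
$body->paste( first_child => $root );

foreach my $item ( sort keys %camelid_links ) {
    # tag name, then a hash ref of attributes, then the text content
    my $link = XML::Twig::Elt->new(
        'a',
        { href => $camelid_links{$item}{url} },
        $camelid_links{$item}{description},
    );
    $link->paste( last_child => $body );
}

# sprint returns the serialized element rather than printing it
my $xml = qq|<?xml version="1.0"?>\n| . $root->sprint;
print "$xml\n";
```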

These examples have illustrated the basic usage of the more generic Perl XML modules. My goal has been to provide just enough example code to give you a feel for what it is like to work with each of these modules. Next month we will look at those Perl modules that implement one of the standard XML interfaces; specifically, XML::DOM, XML::XPath, and the various SAX and SAX-like modules.