XML.com: XML From the Inside Out

Perl XML Quickstart: The Standard XML Interfaces

May 16, 2001

Introduction


O'Reilly Open Source Convention Featured Speaker

Kip Hampton is speaking at the O'Reilly Open Source Convention in San Diego, CA, July 23 - 27, 2001. Rub elbows with open source leaders while relaxing on the beautiful Sheraton San Diego Hotel and Marina waterfront. For more information, visit our conference home page.


This is the second part in a series of articles meant to quickly introduce some of the more popular Perl XML modules. This month we look at the Perl implementations of the standard XML APIs: the Document Object Model (DOM), the XPath language, and the Simple API for XML (SAX).

As stated in part one, this series is not concerned with comparing the relative merits of the various XML modules. My only goal is to provide enough sample code to help you decide for yourself which module or approach best fits your situation. To that end, each module is shown performing the same two simple tasks: 1) extracting data from an XML document and 2) producing an XML document from a Perl hash. Please see last month's column for a complete description of the sample requirements.

Samples of the Perl Implementations of the Standard XML Interfaces

The Document Object Model (XML::DOM)

The Document Object Model, or DOM for short, provides a language-neutral interface to XML data. It represents the document's contents as a hierarchical structure of objects whose properties describe the relationships between one object and another. The Perl implementation of the DOM is called, unsurprisingly, XML::DOM.

Reading


use XML::DOM;

my $file = 'files/camelids.xml';
my $parser = XML::DOM::Parser->new();

my $doc = $parser->parsefile($file);

foreach my $species ($doc->getElementsByTagName('species')){
  print $species->getElementsByTagName('common-name')->item(0)
            ->getFirstChild->getNodeValue;
  print ' (' . $species->getAttribute('name') . ') ';
  print $species->getElementsByTagName('conservation')->item(0)
            ->getAttribute('status');
  print "\n";
}

Writing

use XML::DOM;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $doc = XML::DOM::Document->new;
my $xml_pi = $doc->createXMLDecl ('1.0');
my $root = $doc->createElement('html');
my $body = $doc->createElement('body');
$root->appendChild($body);

foreach my $item ( keys (%camelid_links) ) {
  my $link = $doc->createElement('a');
  $link->setAttribute('href', $camelid_links{$item}->{url});
  my $text = $doc->createTextNode($camelid_links{$item}->{description});
  $link->appendChild($text);
  $body->appendChild($link);
}

print $xml_pi->toString;
print $root->toString;

XPath (XML::XPath)

Originally developed to provide a node-matching syntax for XSL Transformations (XSLT) and, later, for XPointer, the XPath language provides access to an XML document's contents through a compact set of expressions and functions. Like the DOM, it treats the data as a tree of nodes. XPath differs significantly from the DOM in that it allows developers fine-grained access to a document's contents based on both the structural relationships between nodes (paths) and the properties of those nodes (expression evaluation). For example, in XPath syntax you can say, "give me all the div elements that have a background attribute with the value of blue" by writing //div[@background="blue"].
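
As a minimal sketch of that expression in use, the snippet below runs it against a small inline document. The document and its contents are invented purely for illustration; the point is only that a single expression combines a path (//div) with a property test ([@background="blue"]).

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical sample document, made up for this sketch.
my $xml = <<'XML';
<html>
  <body>
    <div background="blue">first</div>
    <div background="red">second</div>
    <p><div background="blue">third</div></p>
  </body>
</html>
XML

# XML::XPath can parse from a string as well as from a file.
my $xp = XML::XPath->new(xml => $xml);

# Collect the text content of every div, at any depth, whose
# background attribute equals "blue".
my @texts;
foreach my $div ($xp->findnodes('//div[@background="blue"]')) {
    push @texts, $div->string_value;
}

print join(', ', @texts), "\n";
```

With this sample input, the loop should pick up the first and third div elements and skip the red one, even though the third is nested one level deeper than the others.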

Reading

use XML::XPath;

my $file = 'files/camelids.xml';
my $xp = XML::XPath->new(filename => $file);

foreach my $species ($xp->find('//species')->get_nodelist){
    print $species->find('common-name')->string_value;
    print ' (' . $species->find('@name') . ') ';
    print $species->find('conservation/@status');
    print "\n";
}

Writing

use XML::XPath;

require "files/camelid_links.pl";
my %camelid_links = get_camelid_data();

my $xp = XML::XPath->new();
my $xml_pi = XML::XPath::Node::PI->new('xml', 'version="1.0"');
my $root = XML::XPath::Node::Element->new('html');
my $body = XML::XPath::Node::Element->new('body');
$root->appendChild($body);

foreach my $item ( keys (%camelid_links) ) {
    my $link = XML::XPath::Node::Element->new('a');
    my $href = XML::XPath::Node::Attribute->new('href', 
         $camelid_links{$item}->{url});
    $link->appendAttribute($href);
    my $text = XML::XPath::Node::Text->new(
         $camelid_links{$item}->{description});
    $link->appendChild($text);
    $body->appendChild($link);
}

print $xml_pi->toString;
print $root->toString;

SAX 1 (XML::Parser::PerlSAX)

The SAX, or Simple API for XML, interface provides access to XML data using an event model in which the contents of an XML document are made available through callback subroutines, which it calls handlers. In contrast to the DOM and XPath APIs, the SAX interface does not build an internal representation of the entire XML document. Instead, data is passed to the handlers in response to the various events (the beginning of an element, the end of an element, etc.) that occur as the document is parsed. This makes SAX extremely fast and memory efficient, but it leaves the task of defining node relationships entirely up to the developer.

Reading

use XML::Parser::PerlSAX;
my $file = "files/camelids.xml";

my $handler = CamelHandler->new();
my $parser = XML::Parser::PerlSAX->new(Handler => $handler);

$parser->parse(Source => { SystemId => $file});

package CamelHandler;

use strict;

sub new {
    my $type = shift;
    return bless {}, $type;
}

my $current_element = '';
my $latin_name = '';
my $common_name = '';

sub start_element {
    my ($self, $element) = @_;

    my %attrs = %{$element->{Attributes}};
    $current_element = $element->{Name};

    if ($current_element eq 'species') {
        $latin_name = $element->{Attributes}->{'name'};
    }
    elsif ($current_element eq 'conservation') {
        print $common_name .' (' . $latin_name .') '
        .  $element->{Attributes}->{'status'} . "\n";
    }
}

sub end_element {
    my ($self, $element) = @_;

    # SAX 1 events carry only Name; LocalName is a SAX 2 property.
    if ($element->{Name} eq 'species') {
        $common_name = '';
        $latin_name  = '';
    }
}

sub characters {
    my ($self, $characters) = @_;
    my $text = $characters->{Data};
    $text =~ s/^\s*//;
    $text =~ s/\s*$//;
    return '' unless $text;

    if ($current_element eq 'common-name') {
        $common_name = $text;
    }
}

1;

Writing

Unlike DOM and XPath, SAX offers no in-memory representation of an XML document and, consequently, has no API facilities for directly creating such a representation. However, there is theoretically no limit to the logic that can be embedded in the various event handlers, so creating one or more XML documents based on the SAX events generated by another is quite common.
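
To make that idea concrete, here is a rough sketch of a handler that builds a new document from the events of another. The ListWriter package, its escape helper, and the inline sample document are all hypothetical; the element and attribute names are assumed from the reading samples above, not taken from the real camelids.xml.

```perl
use strict;
use warnings;

# Hypothetical handler: turns each species element into an HTML list item.
package ListWriter;

sub new {
    my $class = shift;
    return bless { output => '', in_name => 0 }, $class;
}

# Escape characters that are special in XML text.
sub escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

sub start_document { my $self = shift; $self->{output}  = "<ul>\n" }
sub end_document   { my $self = shift; $self->{output} .= "</ul>\n" }

sub start_element {
    my ($self, $element) = @_;
    if ($element->{Name} eq 'species') {
        $self->{latin}  = $element->{Attributes}{name};
        $self->{common} = '';
    }
    $self->{in_name} = 1 if $element->{Name} eq 'common-name';
}

sub characters {
    my ($self, $chars) = @_;
    # Text may arrive in several chunks, so append rather than assign.
    $self->{common} .= $chars->{Data} if $self->{in_name};
}

sub end_element {
    my ($self, $element) = @_;
    $self->{in_name} = 0 if $element->{Name} eq 'common-name';
    if ($element->{Name} eq 'species') {
        $self->{output} .= '  <li>'
            . escape("$self->{common} ($self->{latin})") . "</li>\n";
    }
}

package main;

use XML::Parser::PerlSAX;

# Assumed input shape, invented for this sketch.
my $xml = <<'XML';
<camelids>
  <species name="Camelus dromedarius">
    <common-name>Dromedary</common-name>
  </species>
  <species name="Lama glama">
    <common-name>Llama</common-name>
  </species>
</camelids>
XML

my $handler = ListWriter->new;
my $parser  = XML::Parser::PerlSAX->new(Handler => $handler);
$parser->parse(Source => { String => $xml });

print $handler->{output};
```

Keeping the state inside the handler object, rather than in file-level variables, means the generated markup can be read back from the handler once parsing finishes. That choice is one option among several; a handler could just as easily print each fragment as its event fires.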

SAX 2 (Orchard::SAXDriver::Expat)

The most important difference between the SAX 1 and SAX 2 APIs is SAX 2's support for XML namespaces. A complete SAX 2 implementation is available as part of Ken MacLeod's Orchard project. Since a sample for Orchard::SAXDriver::Expat would look largely the same as the previous SAX 1 example, I omit it here. However, if you are curious, you can browse orchard_saxdriver_read.pl in this month's sample code.

Familiarity with the standard XML APIs, their strengths and weaknesses relative to a given task, is key to a mature understanding of XML technology. Much has been written about the interfaces covered here, and I strongly encourage you to follow the links in this month's "Resources" section for more information.

Up to this point, each module we've looked at has shared the common goal of providing a generic interface to the contents of any well-formed XML document. Next month we will depart from this pattern a bit by exploring some of the modules that, while perhaps less generically useful, seek to simplify the execution of some specific XML-related task.

Resources