Processing XML with Perl

April 5, 2000

Michel Rodriguez



Table of Contents

Introduction
XML::Parser
SAX
Tree Processing Modules
Other XML Technologies
Other Modules
Benchmarking Processors
Closing Comments

Perl is one of the most powerful (and even the most devout Python zealots will agree here) and widely used text processing languages. Its use on the Web is particularly widespread. It is easy to understand, then, why a whole host of modules has been developed so that the power of Perl (and especially its regular expression language) can be applied to XML.

In this article I will review the main Perl XML modules, from the venerable XML::Parser to DOM, XQL, XSLT, XPath implementations and more. I'll give the main characteristics of each module and, as much as possible, examples of how to use them.

XML::Parser

XML::Parser is the ultimate ancestor and cornerstone of XML processing in Perl. Nearly all of the modules that read XML use it. It was developed initially by Larry Wall, and is now maintained by Clark Cooper. XML::Parser in turn is based on the expat non-validating parser written by James Clark.

XML::Parser can parse one or more XML documents. As it is based on a non-validating parser, it checks only that the document is well-formed, and does not fill in implied attributes. A user-defined handler can be called on each event encountered by the parser, allowing processing of the document.

Besides its basic interface, XML::Parser offers "styles" that improve its ease of use. Predefined styles include "Stream," "Object," and "Tree." More styles can be created by calling scripts.

As a lot of other modules are based on XML::Parser, I think it's worth mentioning a couple of peculiarities that may surprise the newcomer (and believe me, they will bite you!).

  • Following the XML specification, the parser (and usually the calling script) dies after finding an error in the XML document and displaying an error message. The solution to this is to enclose the call to the parser in an eval block, so that the error can be trapped, and processing--but not parsing--can be resumed.

  • All parsed strings are returned encoded in UTF-8. This is usually not a problem for English-only documents, as UTF-8 and regular ASCII are identical for English characters. However, this can be a real pain for, let's say, French or German users working with non-UTF-8 systems, for whom all accented characters are transcoded. It is possible, though, to get the original string back from XML::Parser (except that you then have to manually extract attributes from the tag string). The Unicode::String module can also be used to go from UTF-8 to extended ASCII. Also, a module associated with XML::Parser -- XML::Encoding -- lets you define additional encodings besides the built-in UTF-8, ISO-8859-1, UTF-16, and US-ASCII.

  • Expat is fast, I mean really fast! In order to achieve that speed it uses sophisticated caching techniques. At the same time, the XML specification states: "An XML processor must always pass all characters in a document that are not markup through to the application" (translated as "if it ain't markup it's data"). The conjunction of these two factors has the following effect on the character handler in XML::Parser:

    • It is called for all characters, including \n or spaces added in the markup to make it more readable for human consumption ("non-significant spaces"). It is the responsibility of the calling application to discard those characters it does not want to process.

    • The strings an application receives may be split arbitrarily, i.e., the content of a single element can cause several successive calls to the character handler, each with a part of the complete string, especially when the string includes entities.
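The peculiarities above can be sketched in a few lines. This is only a minimal illustration, assuming a hypothetical document with title elements: the parse is wrapped in an eval so that a well-formedness error does not kill the script, and the character handler appends to a buffer because the text of one element may arrive in several pieces.

```perl
use strict;
use warnings;
use XML::Parser;

# Collect the text of every <title> element (a hypothetical element name).
my @titles;
my $buffer;    # undef outside <title>, a string while inside it

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element, %atts) = @_;
            $buffer = '' if $element eq 'title';
        },
        # May be called several times for one run of text, so append.
        Char => sub {
            my ($expat, $string) = @_;
            $buffer .= $string if defined $buffer;
        },
        End => sub {
            my ($expat, $element) = @_;
            return unless $element eq 'title';
            $buffer =~ s/\A\s+|\s+\z//g;    # drop non-significant whitespace
            push @titles, $buffer;
            undef $buffer;
        },
    },
);

# Returns an array ref of titles, or undef if the document is not well-formed.
sub titles_in {
    my ($xml) = @_;
    @titles = ();
    undef $buffer;
    # parse() dies on the first error; the eval traps it.
    my $ok = eval { $parser->parse($xml); 1 };
    return $ok ? [@titles] : undef;
}

my $good = titles_in("<doc>\n  <title> Tom &amp; Jerry </title>\n</doc>");
print "title: $_\n" for @$good;

my $bad = titles_in('<doc><title>unclosed</doc>');
print "parse failed, but the script is still running\n" unless defined $bad;
```

After a failed parse the error message is available in $@, so the script can log it and move on to the next document.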

More information on how to use XML::Parser can be found in Using The Perl XML::Parser Module by Clark Cooper, in XML and scripting languages by Parand Tony Darugar, and on Perl Month.

Here is a simple example of a script using XML::Parser's Stream mode.
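A script using the Stream style might look like the following sketch. The package name TitleLister and the element names are made up for the illustration. In this style the parser calls subs named StartTag, EndTag, and Text in the package given by Pkg, and passes accumulated text in $_.

```perl
use strict;
use warnings;
use XML::Parser;

package TitleLister;    # hypothetical package holding the Stream callbacks

our @found;             # text of every <title> element seen
my $in_title = 0;

sub StartTag {
    my ($expat, $element) = @_;
    $in_title = 1 if $element eq 'title';
}

sub EndTag {
    my ($expat, $element) = @_;
    $in_title = 0 if $element eq 'title';
}

# Called just before a start or end tag, with the accumulated text in $_.
sub Text {
    push @found, $_ if $in_title;
}

package main;

my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'TitleLister');
$parser->parse(
    '<list><title>First</title><note>skip</note><title>Second</title></list>'
);

print "$_\n" for @TitleLister::found;
```

Because the Stream style accumulates text before calling Text, the callback does not have to deal with split strings itself.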

XML::Parser is in mature state.

SAX

SAX defines an event-oriented interface that allows various XML processors to communicate. XML::DOM, XML::Grove, XML::XPath and XML::XQL, amongst others, offer a SAX interface.

The XML::Parser::PerlSAX module is (oddly enough!) a Perl SAX parser.

In use by various other XML modules, XML::Parser::PerlSAX can be considered quite robust. It is included in the libxml bundle, which includes a whole bunch of XML modules, including XML::Grove, XML::Handler, and XML::PatAct.
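As a rough sketch of the event-oriented interface, the handler object below counts elements and gathers text. The class name ElementCounter and the sample document are assumptions made for the illustration; the parser calls start_element and characters on whatever object is passed as Handler.

```perl
use strict;
use warnings;
use XML::Parser::PerlSAX;

package ElementCounter;    # hypothetical handler class

sub new {
    my ($class) = @_;
    return bless { count => {}, text => '' }, $class;
}

# Called once per start tag; $element is a hash with Name and Attributes.
sub start_element {
    my ($self, $element) = @_;
    $self->{count}{ $element->{Name} }++;
}

# Called for character data; the text is in $chars->{Data}.
sub characters {
    my ($self, $chars) = @_;
    $self->{text} .= $chars->{Data};
}

package main;

my $handler = ElementCounter->new;
my $parser  = XML::Parser::PerlSAX->new(Handler => $handler);
$parser->parse(
    Source => { String => '<list><item>a</item><item>b</item></list>' }
);

for my $name (sort keys %{ $handler->{count} }) {
    print "$name: $handler->{count}{$name}\n";
}
```

Any other module that emits the same events could drive this handler unchanged, which is the point of a shared interface.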

Tree Processing Modules

These modules load documents (or parts of documents) into memory and allow access to the elements, attributes, sometimes the DTD, etc. They usually also facilitate the outputting of the document in XML.

XML::DOM

XML::DOM is a Perl implementation of the W3C's DOM Level 1, plus some extensions. It is one of the most widely used Perl XML modules, and the de facto standard for XML transformation with Perl.

Here is an example of an XML::DOM script. A more detailed introduction to XML::DOM can be found on Perl Month.

XML::DOM is based on XML::Parser, and offers a SAX interface. It is distributed as part of the libxml-enno bundle. Being widely used, it is probably one of the most robust XML modules.
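A short XML::DOM script might look like the following sketch. The catalog document and the seen attribute are invented for the illustration; the calls themselves follow the DOM Level 1 method names.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical sample document.
my $xml = '<catalog><book id="b1"><title>Perl</title></book>'
        . '<book id="b2"><title>XML</title></book></catalog>';

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse($xml);

my @summary;
my $books = $doc->getElementsByTagName('book');    # a NodeList
for my $i (0 .. $books->getLength - 1) {
    my $book  = $books->item($i);
    my $title = $book->getElementsByTagName('title')->item(0);
    my $text  = $title->getFirstChild->getData;    # text node under <title>

    $book->setAttribute('seen', 'yes');            # update the tree in place
    push @summary, $book->getAttribute('id') . ": $text";
}

my $serialized = $doc->toString;    # the modified document as XML
$doc->dispose;                      # break circular references to free memory

print "$_\n" for @summary;
print "$serialized\n";
```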

XML::Simple

XML::Simple was first written to allow easy loading and updating of configuration files written in XML. It can be used to process other kinds of simple XML documents. One limitation of XML::Simple is that it does not grok mixed content (<p>this is <b>mixed</b>content</p>). You might consider using this module for configuration files as it offers a straightforward interface, much simpler than the DOM for example.

XML::Simple is based on XML::Parser, and is in beta state.
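For a configuration file, usage could look like the sketch below. The config layout is hypothetical, and the ForceArray and KeyAttr options are set explicitly so that the resulting Perl structure does not depend on how many server elements the file happens to contain.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

# Hypothetical configuration document.
my $xml = '<config debug="0">'
        . '<server name="www" port="80"/>'
        . '<server name="mail" port="25"/>'
        . '</config>';

# Always give <server> as an array, and do not fold on the name attribute.
my $config = XMLin($xml, ForceArray => ['server'], KeyAttr => []);

my %port_for = map { $_->{name} => $_->{port} } @{ $config->{server} };
print "$_ listens on port $port_for{$_}\n" for sort keys %port_for;

# Update a value and write the structure back out as XML.
$config->{debug} = 1;
my $updated = XMLout($config, RootName => 'config');
print $updated;
```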

XML::Twig

XML::Twig offers another tree-oriented interface to XML documents. It allows loading of only parts of the document in order to keep memory requirements to a minimum. If your documents are too big to fit in memory (and consider that all tree-oriented modules have a huge, typically around 10 times, expansion factor), but you still want tree access to parts of the document, then consider using XML::Twig.

<commercial-break>As the author of XML::Twig, I personally think it's a terrific module! Due to popular demand, I might add support for at least a subset of the DOM, and a SAX(2) interface.</commercial-break>

Here is an example of an XML::Twig script. More information is available on XML::Twig at the XML::Twig page.

XML::Twig is based on XML::Parser, and is somewhere between beta and mature.
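A minimal sketch of the partial-loading approach, assuming a hypothetical catalog of book elements: a handler is called each time a complete book has been parsed, and the twig is then purged so that memory use stays roughly constant however long the document is.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document.
my $xml = '<catalog><book id="b1"><title>Perl</title></book>'
        . '<book id="b2"><title>XML</title></book></catalog>';

my @summary;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once each <book> element has been completely parsed.
        book => sub {
            my ($t, $book) = @_;
            push @summary,
                $book->att('id') . ': ' . $book->first_child('title')->text;
            $t->purge;    # free everything parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @summary;
```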

XML::Grove

XML::Grove loads an XML document in memory and creates a tree of Perl objects that can be accessed and manipulated. Its interface is more perlish than the DOM one, including the capability for creating visitor classes on a Grove. XML::Grove can also be used on SGML and HTML documents. You might want to use XML::Grove if you don't care about using the DOM standard, you prefer its style over the other tree-oriented modules, and/or you want to process XML, HTML, and SGML documents.

Here is an example of an XML::Grove script. XML::Grove is based on XML::Parser::PerlSAX and XML::Grove::Builder. It has a SAX interface. It is in mature state, and is included in the libxml bundle.

Support for Other XML Technologies

There are also Perl modules covering most XML-related technologies, from XSLT to XPath and XQL.

XSLT

XML::XSLT implements the W3C's XSLT specification. XML::XSLT is based on XML::DOM. It is still in alpha state and does not cover all of the XSLT specification.

RSS

XML::RSS allows for the creation and updating of RSS (Rich Site Summary) files, which are used by (amongst others) Slashdot and Freshmeat. It also allows for the conversion of RSS to HTML. RSS is primarily used by content authors who want to create a Netscape Netcenter channel, or have their content flowed into aggregators such as O'Reilly Network's Meerkat. However, that doesn't exclude us from using it in other applications. For example, you may want to distribute daily news headlines to partners and customers who convert it to some other format, like HTML.

Here is an example that uses XML::RSS to convert an RSS file to HTML.

XML::RSS is based on XML::Parser, and seems to be in beta stage.
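Converting an RSS file to HTML could be sketched as follows. The feed content and the escape_html helper are made up for the illustration; after parsing, the channel fields and the list of items are read from the XML::RSS object.

```perl
use strict;
use warnings;
use XML::RSS;

# Hypothetical RSS 0.91 feed.
my $feed = <<'RSS';
<?xml version="1.0"?>
<rss version="0.91">
 <channel>
  <title>Example News</title>
  <link>http://www.example.com/</link>
  <description>Hypothetical channel</description>
  <language>en-us</language>
  <item>
   <title>XML &amp; Perl</title>
   <link>http://www.example.com/1</link>
  </item>
 </channel>
</rss>
RSS

# Small helper (not part of XML::RSS) to make text safe for HTML output.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

my $rss = XML::RSS->new;
$rss->parse($feed);

my $html = '<h1>' . escape_html($rss->{channel}{title}) . "</h1>\n<ul>\n";
for my $item (@{ $rss->{items} }) {
    $html .= sprintf qq{ <li><a href="%s">%s</a></li>\n},
        escape_html($item->{link}), escape_html($item->{title});
}
$html .= "</ul>\n";

print $html;
```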

XQL

Although the W3C has not yet standardized a query language, XQL (a query language submitted as a proposal to the W3C) support is offered by 2 modules: XML::XQL and XML::miniXQL.

XML::XQL performs XQL queries on a DOM document. It offers strict XQL support plus some extensions (such as regexp support) and allows users to define additional extensions. Alternative tree structures can also be plugged in.

XML::XQL comes with an XQL tutorial. The module offers SAX and DOM interfaces, and is part of the libxml-enno bundle. It is still in alpha.

XML::miniXQL performs stream-based XQL queries. It only offers a subset of XQL (as it does not access the whole document), but is faster than XML::XQL. It is in alpha.

Note that the XML::QL module implements an alternative query language, described in XML-QL: A Query Language for XML. (For a scary list of XML query languages, have a look at this list from the 1998 W3C Workshop.)

XPath

XML::XPath implements the W3C's XPath specification, allowing users to address and select parts of XML documents. It also allows users to define additional extensions.

Here is an example using XML::XPath. The module is based on XML::Parser and has a SAX interface. It is in alpha state.
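A small XML::XPath script might look like this sketch, in which the catalog document and the path expressions are hypothetical. findnodes returns the matching nodes and findvalue returns the string value of an expression.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical sample document.
my $xml = '<catalog><book id="b1"><title>Perl</title></book>'
        . '<book id="b2"><title>XML</title></book></catalog>';

my $xp = XML::XPath->new(xml => $xml);

# Select elements with a predicate, then read an attribute from each.
my @ids = map { $_->getAttribute('id') }
          $xp->findnodes('/catalog/book[title="XML"]');

# Get the text of a single element directly; stringify the returned object.
my $title = '' . $xp->findvalue('/catalog/book[@id="b1"]/title');

print "books titled XML: @ids\n";
print "title of b1: $title\n";
```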

XML::Checker

XML::Checker is a validating parser that operates on XML documents or on DOM trees. It offers error processing (including a user-defined handler).

XML::Checker is still in alpha state. It has SAX and DOM interfaces, and is distributed as part of the libxml-enno bundle.

DBIx::XML_RDB

DBIx::XML_RDB exports data from a DBI database (nearly any relational database). The result of a SELECT clause is turned into an XML document. It is distributed with a standalone tool that has the same functionality and a Win32 OLE wrapper, allowing any OLE software to use the module. It is still in alpha.

XMLNews

XMLNews includes 2 sub-modules, XMLNews::HTMLTemplate and XMLNews::Meta, which perform creation, processing, and conversion to HTML of XMLNews conformant documents. This article's original source is actually an XML document tagged according to the XMLNews DTD.

These modules use XML::Parser.

XML::Writer

XML::Writer is a module used to create XML, à la CGI.pm. It allows various checks to be performed on the document, and takes care of special character encoding. You want to use XML::Writer if you have to generate XML from non-XML sources, and if you are familiar (or want to be) with CGI.pm's style. XML::Writer is in beta.
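As an illustration, generating a tiny document could look like the sketch below. It assumes a version of XML::Writer that accepts a reference to a string for OUTPUT; the element and attribute names are invented. Note that characters() takes care of escaping, and end() checks that every tag has been closed.

```perl
use strict;
use warnings;
use XML::Writer;

my $out = '';
# Assumption: this XML::Writer accepts a scalar reference as OUTPUT.
my $writer = XML::Writer->new(OUTPUT => \$out);

$writer->startTag('greeting', class => 'simple');
$writer->characters('Hello & welcome <world>');    # escaped automatically
$writer->endTag('greeting');
$writer->end;    # dies if the document is incomplete

print $out;
```

Calling endTag with the wrong name, or end() with a tag still open, makes the writer die, which catches many structural mistakes at generation time.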

Note that XML::Generator performs a similar function; it is in alpha.

XML::PYX

XML::PYX is an implementation of the PYX notation and processing model as described in Pyxie by Sean McGrath. PYX offers an interesting approach. It delivers a minimal set of parsing information about a document in a line-oriented way. This does not necessarily allow you to output a document that's equivalent to the input one, but you can still do reports and checks on the document, as well as content extraction.

XML::PYX comes with 3 standalone tools: pyx (non-validating) and pyxv (validating) output the PYX version of the document, and pyxw writes an XML version of a PYX stream.

PYX facilitates one-liners such as:

pyx file.xml | perl -n -e '$nb{$1}++ if( m/\A\((.*)\n/); END { map { print "$_ used $nb{$_} time(s)\n";} sort keys %nb;}'

or


pyx file.xml | perl -n -e '($id)=( m/\AAid (.*)\n/) or next; print "duplicate id: $id\n" if($id{$id}); $id{$id}=1;'

Here is an example using XML::PYX. The module is based on XML::Parser. You will need XML::Checker to use pyxv. It has just been released, but it is simple enough that it should already be quite robust.

XML::EP

XML::EP aims at being a Perl equivalent to Cocoon. This module is still in pre-alpha; it actually does not pass the build tests on my Solaris 2.7 box.

Other Modules

I have tried to list here most of the other XML modules, but I might have missed some, especially those not in the XML namespace on CPAN. You can go to the CPAN XML Documentation page for the latest list of modules in the XML namespace. I have tried to estimate the development state of each module, but obviously I have not used all of them, so take the ratings with a grain of salt (basically you should try them on a representative subset of your data and figure out whether they work fine, need some easy fixing, or are just unsuitable for you).

So here we go:

  • Edifact: XML::Edifact allows conversion between XML and Edifact messages; still in alpha state.

  • XML::CGI converts CGI variables to and from XML, which can be useful for debugging, or simply for exchanging data between applications in a human-readable and easy-to-process way. In beta.

  • XCatalog: XML::Catalog implements catalog resolution as described in this XML Catalog proposal. This allows a script to resolve PUBLIC identifiers into SYSTEM files before subsequent XML processing. In alpha.

  • XML serializers save and restore complex Perl data as XML documents: XML::Dumper is in mature state and performs the basic operations, while the SOAP and WDDX modules implement their respective protocols.

  • Pretty printers: XML::Handler can output canonical (XML::Handler::CanonXMLWriter) or human-readable (XML::Handler::XMLWriter) XML, and even pretty print an XML document (XML::Handler::YAWriter - Yet another Perl SAX XML Writer, based on XML::Parser::PerlSAX). These modules are in mature state.

  • Stream processing of XML documents: XML::Handler, XML::Node (allows you to register handlers on specific elements or attributes; mature state), XML::PatAct (defines a Pattern/Action framework used by various sub-modules to allow different transformation models; mature state), XML::Handler::Subs (lets you define callbacks associated with elements; mature), XML::DT (Omnimark-like transformations; alpha state; here is an example). The Stream style of XML::Parser is also a candidate here.

  • XML::RegExp adds XML extensions to Perl regular expressions: it defines the various XML tokens such as Name, AttValue so that they can be included in regular expressions working on raw XML data. Useful (but maybe dangerous!) if you don't want to use a full-blown parser.

  • XML::Stream allows a script to listen to a port for XML data, store it in a tree (an XML::Parser Tree structure), and call back a user handler every time a predefined element is found. Useful for processing a stream of XML-encoded messages.

  • GoXML::XQI is an interface to the GoXML search engine. This lets you connect to the search engine, send a query, and retrieve the results for further XML processing.

  • Frontier modules, including Frontier::Client, Frontier::Daemon and Frontier::RPC2, let you exchange XML-RPC messages between a Frontier client and server; these modules are mature.

Benchmarking Perl XML Processors

Here is a simple benchmark for all the examples. Remember that speed is not the only criterion for choosing a module (in fact it is not the most important factor as often as you'd think). These figures are meant only to give a feel for how fast the various modules perform a simple transformation, and how much memory they need.

The data was gathered from a Sun Sparc Ultra1, running Solaris 2.7 and perl 5.005_03, quite a slow machine nowadays (see the XML::Twig Page for XML::Twig benchmarks on different systems to get an idea of how your system compares to mine).

Stream-Oriented

Module     XML::Parser   XML::PYX    XML::DT   XML::Twig
Version    v 2.27        v 0.5       v 0.11    v 1.10
Time       4s            5s          18s       5s
Memory     3.4M          3M+2M+2M    10M       4.0M

Tree-Oriented

Module     XML::DOM   XML::Grove   XML::XPath
Version    v 1.25     v 0.46       v 0.21
Time       18s        17s          40s
Memory     14M        9.3M         13M

Notes: In all fairness I also have to say that the nature of the example test favors stream-oriented modules. Some more complex transformations are really difficult to perform using only XML::Parser or XML::PYX. If you have to load the whole document in memory (as in this example), the figures for XML::Twig become 14s and 9.6M.

Remember also that all of these modules can and will evolve in the future, so this data will soon be outdated.

Closing Comments

Most of those modules are supported on the Perl-XML mailing list, which also offers announcements, advice, tricks, and more. Requests for improvements or bug fixes are also generally well-received by module authors, so don't hesitate to ask!

XML processing should also benefit from Perl's native Unicode support in version 5.6, which will make it easy to use the full power of regular expressions on any string, even including double-byte characters.

Resources

Mailing List

To subscribe to the Perl-XML mailing list, you can either use the web interface or send an email to Lyris@ActiveState.com, including the text SUBSCRIBE Perl-XML (or SET Perl-XML DIGEST for the digest). The list archives are also on the ActiveState site at http://www.activestate.com/support/mailing_lists.html#XML.