Processing XML with Perl

Processing XML with Perl - Part 2

April 5, 2000

Support for Other XML Technologies

There are also Perl modules covering most XML-related technologies, from XSLT to XPath and XQL.

XSLT

XML::XSLT is a partial implementation of the W3C's XSLT specification, built on XML::DOM. It is still in alpha state and does not yet cover the whole specification.

RSS

XML::RSS creates and updates RSS (Rich Site Summary) files, which are used by Slashdot and Freshmeat, among others. It can also convert RSS to HTML. RSS is primarily used by content authors who want to create a Netscape Netcenter channel, or have their content fed into aggregators such as O'Reilly Network's Meerkat. However, nothing prevents us from using it in other applications. For example, you may want to distribute daily news headlines to partners and customers, who then convert them to some other format, such as HTML.

Here is an example that uses XML::RSS to convert an RSS file to HTML.

XML::RSS is based on XML::Parser, and seems to be in beta stage.
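As a rough illustration of the RSS-to-HTML conversion described above, here is a minimal sketch. It assumes the parse method and the channel and items data structures documented for XML::RSS; the feed content and the escape_html helper are invented for this example and are not part of the module.

```perl
use strict;
use warnings;
use XML::RSS;

# A tiny RSS 0.91 document, inlined so the sketch is self-contained.
# The channel and item values are invented for illustration.
my $rss_source = <<'RSS';
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Daily headlines</description>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/1.html</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://www.example.com/2.html</link>
    </item>
  </channel>
</rss>
RSS

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

my $rss = XML::RSS->new;
$rss->parse($rss_source);    # parsefile($path) reads from a file instead

# Build a heading from the channel title and one list entry per item.
my $html = '<h1>' . escape_html( $rss->{channel}{title} ) . "</h1>\n<ul>\n";
for my $item ( @{ $rss->{items} } ) {
    $html .= sprintf qq{  <li><a href="%s">%s</a></li>\n},
        escape_html( $item->{link} ), escape_html( $item->{title} );
}
$html .= "</ul>\n";

print $html;
```

The same loop could just as well emit plain text or a database insert; only the formatting code changes.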

XQL

Although the W3C has not yet standardized a query language, XQL (a query language submitted as a proposal to the W3C) is supported by two modules: XML::XQL and XML::miniXQL.

XML::XQL performs XQL queries on a DOM document. It offers strict XQL support plus some extensions (such as regexp support) and allows users to define additional extensions. Alternative tree structures can also be plugged in.

XML::XQL comes with an XQL tutorial. The module offers SAX and DOM interfaces, and is part of the libxml-enno bundle. It is still in alpha.

XML::miniXQL performs stream-based XQL queries. It only offers a subset of XQL (as it does not access the whole document), but is faster than XML::XQL. It is in alpha.

Note that the XML::QL module implements an alternative query language, described in XML-QL: A Query Language for XML. (For a scary list of XML query languages, have a look at this list from the 1998 W3C Workshop.)

XPath

XML::XPath implements the W3C's XPath specification, allowing users to address and select parts of an XML document (elements, attributes, text) with path expressions. It also allows users to define additional extensions.

Here is an example using XML::XPath. The module is based on XML::Parser and has a SAX interface. It is in alpha state.
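To give a feel for what selecting nodes with path expressions looks like, here is a small sketch. It assumes the findnodes and string_value methods as documented in current releases of XML::XPath; the alpha version discussed here may differ. The sample document is invented.

```perl
use strict;
use warnings;
use XML::XPath;

# Invented sample document, inlined so the sketch is self-contained.
my $xml = <<'XML';
<library>
  <book id="b1"><title>Programming Perl</title></book>
  <book id="b2"><title>Learning Perl</title></book>
</library>
XML

my $xp = XML::XPath->new( xml => $xml );

# Select every title element below a book element.
my @titles = map { $_->string_value } $xp->findnodes('/library/book/title');

# Select the title of the book whose id attribute is "b2".
my @selected = map { $_->string_value }
    $xp->findnodes('/library/book[@id="b2"]/title');

print "all titles: ", join( ', ', @titles ), "\n";
print "book b2:    ", join( ', ', @selected ), "\n";
```

Passing filename => $path instead of xml => $string should make the module read the document from a file.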

XML::Checker

XML::Checker is a validating parser that operates on XML documents or on DOM trees. It offers error processing (including a user-defined handler).

XML::Checker is still in alpha state. It has SAX and DOM interfaces, and is distributed as part of the libxml-enno bundle.

DBIx::XML_RDB

DBIx::XML_RDB exports data from any database reachable through DBI (nearly any relational database). The result of a SELECT statement is turned into an XML document. It is distributed with a standalone tool that has the same functionality and a Win32 OLE wrapper, allowing any OLE software to use the module. It is still in alpha.

XMLNews

XMLNews includes two sub-modules, XMLNews::HTMLTemplate and XMLNews::Meta, which create and process XMLNews-conformant documents and convert them to HTML. This article's original source is actually an XML document tagged according to the XMLNews DTD.

These modules use XML::Parser.

XML::Writer

XML::Writer is a module used to create XML, à la CGI.pm. It allows various checks to be performed on the document, and takes care of special character encoding. You want to use XML::Writer if you have to generate XML from non-XML sources, and if you are familiar (or want to be) with CGI.pm's style. XML::Writer is in beta.

Note that XML::Generator performs a similar function; it is in alpha.
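The following sketch shows the general shape of generating XML from a non-XML source with XML::Writer. It assumes the startTag, characters, endTag and end methods from the module's documentation; the record data and element names are invented for the example.

```perl
use strict;
use warnings;
use IO::Handle;
use XML::Writer;

# Invented records standing in for a non-XML source (a database, a log...).
my @rows = (
    { id => 1, name => 'Ben & Jerry' },
    { id => 2, name => 'a < b' },
);

# Collect the output in a string through an in-memory filehandle.
open my $out, '>', \my $xml or die "cannot open in-memory handle: $!";
my $writer = XML::Writer->new( OUTPUT => $out );

$writer->startTag('customers');
for my $row (@rows) {
    # Attributes are passed as name/value pairs after the element name.
    $writer->startTag( 'customer', id => $row->{id} );
    $writer->characters( $row->{name} );    # special characters are escaped
    $writer->endTag('customer');
}
$writer->endTag('customers');
$writer->end;    # complains if the document is not complete
close $out;

print $xml, "\n";
```

Because the writer tracks open elements, a mismatched or missing endTag call is reported when the script runs rather than surfacing later as a broken document.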

XML::PYX

XML::PYX is an implementation of the PYX notation and processing model as described in Pyxie by Sean McGrath. PYX offers an interesting approach. It delivers a minimal set of parsing information about a document in a line-oriented way. This does not necessarily allow you to output a document that's equivalent to the input one, but you can still do reports and checks on the document, as well as content extraction.

XML::PYX comes with three standalone tools: pyx (non-validating) and pyxv (validating) output the PYX version of a document, and pyxw writes an XML version of a PYX stream.
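To make the line-oriented model concrete, here is a small pure-Perl sketch that walks a PYX stream. It does not use XML::PYX itself: the stream is hard-coded in the form the pyx tool would be expected to emit for the document shown in the comment, using the PYX line prefixes "(" for a start tag, "A" for an attribute, "-" for character data and ")" for an end tag. Variable names are made up for the example.

```perl
use strict;
use warnings;

# PYX stream for: <doc id="d1"><p class="intro">Hello</p><p>World</p></doc>
my @pyx = (
    '(doc', 'Aid d1',
    '(p',   'Aclass intro', '-Hello', ')p',
    '(p',   '-World',       ')p',
    ')doc',
);

my ( %count, @path, @text, @attrs );
for my $line (@pyx) {
    # The first character is the event type, the rest is its payload.
    next unless $line =~ /\A(.)(.*)\z/s;
    my ( $type, $rest ) = ( $1, $2 );

    if ( $type eq '(' ) {        # start tag: count it and track nesting
        $count{$rest}++;
        push @path, $rest;
    }
    elsif ( $type eq ')' ) {     # end tag: leave the element
        pop @path;
    }
    elsif ( $type eq 'A' ) {     # attribute: "name value"
        my ( $name, $value ) = split / /, $rest, 2;
        push @attrs, $path[-1] . '@' . "$name=$value";
    }
    elsif ( $type eq '-' ) {     # character data, tagged with its location
        push @text, join( '/', @path ) . ": $rest";
    }
}

print "$_ used $count{$_} time(s)\n" for sort keys %count;
print "$_\n" for @attrs, @text;
```

Each line carries exactly one event, which is why standard line-oriented tools and short perl -n scripts work well on PYX output.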

PYX facilitates one-liners such as:

    pyx file.xml | perl -n -e '$nb{$1}++ if m/\A\((.*)\n/;
        END { print "$_ used $nb{$_} time(s)\n" for sort keys %nb }'

    pyx file.xml | perl -n -e '($id) = m/\AAid (.*)\n/ or next;
        print "duplicate id: $id\n" if $id{$id};
        $id{$id} = 1;'

Here is an example using XML::PYX. The module is based on XML::Parser. You will need XML::Checker to use pyxv. It has just been released, but it is simple enough that it should already be quite robust.

XML::EP

XML::EP aims at being a Perl equivalent to Cocoon. This module is still in pre-alpha; it actually does not pass the build tests on my Solaris 2.7 box.

Other Modules

I have tried to list here most of the other XML modules, but I might have missed some, especially those not in the XML namespace on CPAN. You can go to the CPAN XML Documentation page for the latest list of modules in the XML namespace. I have tried to estimate the development state of each module, but obviously I have not used all of them, so take the ratings with a grain of salt (basically you should try them on a representative subset of your data and figure out whether they work fine, need some easy fixing, or are just unsuitable for you).

So here we go:

Edifact: XML::Edifact allows conversion between XML and Edifact messages; still in alpha state.
XML::CGI converts CGI variables to and from XML, which can be useful for debugging, or simply for exchanging data between applications in a human-readable and easy-to-process way. In beta.
XCatalog: XML::Catalog implements a proposal for catalog resolution, as proposed in this XML Catalog proposal. This allows a script to resolve PUBLIC identifiers into SYSTEM files before subsequent XML processing. In alpha.
XML serializers save and restore complex Perl data as XML documents: XML::Dumper is in mature state and performs the basic operations, while the SOAP and WDDX modules implement their respective protocols.
Pretty printers: XML::Handler can output canonical (XML::Handler::CanonXMLWriter) or human-readable (XML::Handler::XMLWriter) XML, and even pretty print an XML document (XML::Handler::YAWriter - Yet another Perl SAX XML Writer, based on XML::Parser::PerlSAX). These modules are in mature state.
Stream processing of XML documents: XML::Handler, XML::Node (allows you to register handlers on specific elements or attributes; mature state), XML::PatAct (defines a Pattern/Action framework used by various sub-modules to allow different transformation models; mature state), XML::Handler::Subs (lets you define callbacks associated with elements; mature), XML::DT (Omnimark-like transformations; alpha state; here is an example). The Stream style of XML::Parser is also a candidate here.
XML::RegExp adds XML extensions to Perl regular expressions: it defines the various XML tokens such as Name, AttValue so that they can be included in regular expressions working on raw XML data. Useful (but maybe dangerous!) if you don't want to use a full-blown parser.
XML::Stream allows a script to listen to a port for XML data, store it in a tree (an XML::Parser Tree structure), and call back a user handler every time a predefined element is found. Useful for processing a stream of XML-encoded messages.
GoXML::XQI is an interface to the GoXML search engine. This lets you connect to the search engine, send a query, and retrieve the results for further XML processing.
Frontier modules, including Frontier::Client, Frontier::Daemon and Frontier::RPC2, let you exchange XML-RPC messages between a Frontier client and server; these modules are mature.

Benchmarking Perl XML Processors

Here is a simple benchmark for all the examples. Remember that speed is not the only criterion for choosing a module (in fact, it is the most important factor less often than you'd think). These figures are meant only to give a feel for how fast the various modules perform a simple transformation, and how much memory they need.

The data was gathered from a Sun Sparc Ultra1, running Solaris 2.7 and perl 5.005_03, quite a slow machine nowadays (see the XML::Twig Page for XML::Twig benchmarks on different systems to get an idea of how your system compares to mine).
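For readers who want to collect comparable timings on their own machine, the sketch below shows one possible approach using Perl's standard Benchmark module around an XML::Parser run. It is not the script used to produce the figures in this article: the synthetic input and the trivial handler are assumptions made for illustration, and memory use is not measured.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);
use XML::Parser;

# Synthetic input: 1,000 small elements under one root (invented for the sketch).
my $doc = '<doc>' . ( '<item n="1">text</item>' x 1000 ) . '</doc>';

my $elements = 0;    # counts start tags so the handler does some work

# Parse the document 5 times and collect the timing.
my $timing = timeit(
    5,
    sub {
        my $parser = XML::Parser->new(
            Handlers => { Start => sub { $elements++ } },
        );
        $parser->parse($doc);
    }
);

print "XML::Parser, 5 runs: ", timestr($timing), "\n";
print "start tags seen: $elements\n";
```

Replacing the body of the anonymous sub with an equivalent transformation written for another module gives a like-for-like comparison, provided the same input document is used throughout.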

Stream-Oriented

Module    XML::Parser   XML::PYX   XML::DT   XML::Twig
Version   v 2.27        v 0.5      v 0.11    v 1.10
Time      4s            5s         18s       5s
Memory    3.4M          3M+2M+2M   10M       4.0M

Tree-Oriented

Module    XML::DOM   XML::Grove   XML::XPath
Version   v 1.25     v 0.46       v 0.21
Time      18s        17s          40s
Memory    14M        9.3M         13M

Notes: In all fairness I also have to say that the nature of the example test favors stream-oriented modules. Some more complex transformations are really difficult to perform using only XML::Parser or XML::PYX. If you have to load the whole document in memory (as in this example), the figures for XML::Twig become 14s and 9.6M.

Remember also that all of these modules can and will evolve in the future, so this data will soon be outdated.

Closing Comments

Most of those modules are supported on the Perl-XML mailing list, which also offers announcements, advice, tricks, and more. Requests for improvements or bug fixes are also generally well-received by module authors, so don't hesitate to ask!

XML processing should also benefit from Perl's native Unicode support in version 5.6, which will make it easy to use the full power of regular expressions on any string, even one containing multi-byte characters.

Resources

Mailing List

To subscribe to the Perl-XML mailing list, you can either use the web interface or send an email to Lyris@ActiveState.com, including the text SUBSCRIBE Perl-XML (or SET Perl-XML DIGEST for the digest). The list archives are also on the ActiveState site at http://www.activestate.com/support/mailing_lists.html#XML.