Processing XML with Perl
Perl is one of the most powerful (and even the most devout Python zealots will agree here) and widely used text processing languages. Its use on the Web is particularly widespread. It is then easy to understand why a whole host of modules have been developed so that the power of Perl (and especially its regular expression language) can be applied to XML.
In this article I will review the main Perl XML modules, from the venerable XML::Parser to DOM, XQL, XSLT, XPath implementations and more. I'll give the main characteristics of each module and, as much as possible, examples of how to use them.
XML::Parser is the ultimate ancestor and cornerstone of XML processing in Perl. Nearly all of the modules that read XML use it. It was developed initially by Larry Wall, and is now maintained by Clark Cooper. XML::Parser in turn is based on the expat non-validating parser written by James Clark.
XML::Parser can parse one or more XML documents. As it is based on a non-validating parser, it checks only that the document is well-formed, and does not fill in implied attributes. User-defined handlers can be registered for each kind of event the parser encounters (start tag, end tag, character data, and so on), and this is how the document gets processed.
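To make the handler mechanism concrete, here is a minimal sketch that counts how many times each element appears. The sample document and the %count hash are my own inventions for illustration; the Handlers option and the Start handler's argument list are those of XML::Parser's basic interface.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document, used for illustration only.
my $xml = '<library><book id="b1"/><book id="b2"/><cd id="c1"/></library>';

my %count;    # element name => number of start tags seen

my $parser = XML::Parser->new(
    Handlers => {
        # A Start handler receives the parser object, the element name,
        # and then the attributes as a flat name/value list.
        Start => sub {
            my ($expat, $element, %attr) = @_;
            $count{$element}++;
        },
    },
);

$parser->parse($xml);

print "$_: $count{$_}\n" for sort keys %count;
```

End and Char handlers are registered the same way, with one entry each in the Handlers hash.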
Besides its basic interface, XML::Parser offers "styles" that improve its ease of use. Predefined styles include "Stream," "Objects," and "Tree." Calling scripts can also define styles of their own.
As a lot of other modules are based on XML::Parser, I think it's worth mentioning a couple of peculiarities that may surprise the newcomer (and believe me, they will bite you!).
Following the XML specification, the parser (and usually the calling script) dies after finding an error in the XML document and displaying an error message. The solution to this is to enclose the call to the parser in an eval block, so that the error can be trapped, and processing--but not parsing--can be resumed.
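A minimal sketch of the eval trick, assuming a deliberately broken document of my own making (the $bad_xml string and the variable names are illustrative only):

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new;

# Hypothetical input: the <b> element is never closed.
my $bad_xml = '<doc><b>oops</doc>';

# parse() dies on the first well-formedness error, so wrap it in eval.
# The trailing 1 makes the block return true only if parse() survived.
my $ok    = eval { $parser->parse($bad_xml); 1 };
my $error = $@;    # copy the message before anything can overwrite $@

if ($ok) {
    print "document is well-formed\n";
}
else {
    # The message gives the reason and the position of the error.
    print "parse failed: $error";
    # The script is still alive here and can move on to other work.
}
```

Note that the parse itself cannot be resumed: once expat has hit an error, the only thing left to do with that document is report it.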
All parsed strings are returned encoded in UTF-8. This is usually not a problem for English-only documents, as UTF-8 and plain ASCII are identical for English characters. However, it can be a real pain for, let's say, French or German users working with non-UTF-8 systems, since every accented character comes back transcoded. It is possible, though, to get the original string back from XML::Parser (except you then have to manually extract attributes from the tag string). The Unicode::String module can also be used to go from UTF-8 to extended ASCII (Latin-1). Also, a module associated with XML::Parser, XML::Encoding, lets you define additional encodings besides the built-in UTF-8, ISO-8859-1, UTF-16, and US-ASCII.
Expat is fast, I mean really fast! In order to achieve that speed it uses sophisticated caching techniques. At the same time, the XML specification states: "An XML processor must always pass all characters in a document that are not markup through to the application" (translated as "if it ain't markup it's data"). The conjunction of these two factors has the following effect on the character handler in XML::Parser:
It is called for all characters, including \n or spaces added in the markup to make it more readable for human consumption ("non-significant spaces"). It is the responsibility of the calling application to discard those characters it does not want to process.
The strings an application receives may be split arbitrarily, i.e., the content of a single element can cause several successive calls to the character handler, each with a part of the complete string, especially when the string includes entities.
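Both points can be handled with the same idiom: buffer the text in the character handler and only use it once the end tag arrives. Here is a minimal sketch, assuming a made-up document and made-up variable names ($buffer, $in_title, @titles):

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical document: indentation between elements, entities in the text.
my $xml = "<doc>\n  <title>Tom &amp; Jerry</title>\n"
        . "  <title>Laurel &amp; Hardy</title>\n</doc>";

my $in_title = 0;     # true while we are inside a <title> element
my $buffer   = '';    # text accumulated for the current <title>
my @titles;

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            if ($element eq 'title') { $in_title = 1; $buffer = ''; }
        },
        # May be called several times for one element: always append.
        # Text outside <title> (the indentation here) is simply dropped.
        Char => sub {
            my ($expat, $string) = @_;
            $buffer .= $string if $in_title;
        },
        # Only at the end tag is the text known to be complete.
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'title') { push @titles, $buffer; $in_title = 0; }
        },
    },
);

$parser->parse($xml);

print "$_\n" for @titles;
```

However the parser chooses to split the strings, @titles ends up holding the two complete titles, with the entities already expanded to "&".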
Here is a simple example of a script using XML::Parser's Stream style.
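What follows is a minimal sketch of such a script, not necessarily the listing this article first shipped with. The sample document and the @events array are assumptions of mine. StartTag, EndTag, and Text are the sub names the Stream style looks for in the package given by Pkg; the current tag or text arrives in $_, and the attributes of a start tag in %_.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document.
my $xml = '<doc><para id="p1">Hello</para><para id="p2">World</para></doc>';

my @events;    # a log of what the parser reported, in order

# Called for each start tag; the attributes are in %_.
sub StartTag {
    my ($expat, $element) = @_;
    my $id = defined $_{id} ? $_{id} : '-';
    push @events, "start $element id=$id";
}

# Called for each end tag.
sub EndTag {
    my ($expat, $element) = @_;
    push @events, "end $element";
}

# Called with the accumulated text in $_ just before a start or end tag.
sub Text {
    my ($expat) = @_;
    push @events, "text $_";
}

# Pkg names the package holding the subs above.
my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'main');
$parser->parse($xml);

print "$_\n" for @events;
```

Any of the three subs can be left out, in which case the Stream style just prints the corresponding piece of the document to standard output. That makes it handy for filters that change a few things and pass the rest through.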
XML::Parser is in a mature state.
SAX defines an event-oriented interface that allows various XML processors to communicate. XML::DOM, XML::Grove, XML::Path and XML::XQL, amongst others, offer a SAX interface.
The XML::Parser::PerlSAX module is (oddly enough!) a Perl SAX parser.
In use by various other XML modules, XML::Parser::PerlSAX can be considered quite robust. It is included in the libxml-perl bundle, which gathers a whole bunch of XML modules, including XML::Grove, XML::Handler, and XML::PatAct.
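To give a feel for the SAX way of doing things, here is a minimal sketch of a handler object plugged into XML::Parser::PerlSAX. The ElementLister class and the sample document are hypothetical. The start_element method name and the Name key of the element hash are as I understand them from the PerlSAX interface, so check them against the version you have installed.

```perl
use strict;
use warnings;

# Hypothetical handler class: it remembers the name of every element it sees.
package ElementLister;

sub new {
    my ($class) = @_;
    return bless { names => [] }, $class;
}

# PerlSAX calls this for every start tag, passing a hash reference
# that describes the element.
sub start_element {
    my ($self, $element) = @_;
    push @{ $self->{names} }, $element->{Name};
}

package main;

use XML::Parser::PerlSAX;

my $handler = ElementLister->new;
my $parser  = XML::Parser::PerlSAX->new(Handler => $handler);

# The document to parse is described by the Source hash.
$parser->parse(Source => { String => '<doc><para/><para/></doc>' });

print join(', ', @{ $handler->{names} }), "\n";
```

The point of the exercise is that the handler knows nothing about XML::Parser: any module that emits the same events could drive it.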
Tree-oriented modules load documents (or parts of documents) into memory and allow access to the elements, attributes, sometimes the DTD, etc. They usually also make it easy to output the document as XML.
XML::DOM is based on XML::Parser, and offers a SAX interface. It is distributed as part of the libxml-enno bundle. Being widely used, it is probably one of the most robust XML modules.
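A minimal sketch of DOM-style access, assuming a sample document of my own. The method names are the standard DOM ones, which XML::DOM follows, plus its own dispose method to free the tree.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical sample document.
my $xml = '<list><item id="a">apple</item><item id="b">banana</item></list>';

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse($xml);    # the whole document is now in memory

my @ids;
my $items = $doc->getElementsByTagName('item');
for my $i (0 .. $items->getLength - 1) {
    my $item = $items->item($i);
    push @ids, $item->getAttribute('id');
}

print "ids: @ids\n";

# The tree contains circular references, so free it explicitly.
$doc->dispose;
```

It is verbose compared to the more perlish modules, but the same calls will look familiar to anyone who has used the DOM from another language.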
XML::Simple was first written to allow easy loading and updating of configuration files written in XML. It can be used to process other kinds of simple XML documents. One limitation of XML::Simple is that it does not grok mixed content (<p>this is <b>mixed</b> content</p>). You might consider using this module for configuration files, as it offers a straightforward interface, much simpler than the DOM for example.
XML::Simple is based on XML::Parser, and is in beta state.
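Here is a minimal sketch of the configuration-file use case. The configuration itself is made up, and I pass the ForceArray and KeyAttr options explicitly only so the resulting structure does not depend on the module's defaults.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical configuration; XMLin() accepts a string as well as a file name.
my $xml = '<config debug="1"><server>localhost</server><port>8080</port></config>';

# The document becomes a plain hash reference: attributes and
# single child elements both end up as keys.
my $config = XMLin($xml, ForceArray => 0, KeyAttr => []);

print "server: $config->{server}\n";
print "port:   $config->{port}\n";
print "debug:  $config->{debug}\n";
```

No handlers, no tree walking: for this kind of document the interface really is as simple as the module's name promises.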
XML::Twig offers another tree-oriented interface to XML documents. It allows loading of only parts of the document in order to keep memory requirements to a minimum. If your documents are too big to fit in memory (and consider that all tree-oriented modules have a huge, typically around 10 times, expansion factor), but you still want tree access to parts of the document, then consider using XML::Twig.
<commercial-break>As the author of XML::Twig, I personally think it's a terrific module! Due to popular demand, I might add support for at least a subset of the DOM, and a SAX(2) interface.</commercial-break>
XML::Twig is based on XML::Parser, and is somewhere between beta and mature.
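A minimal sketch of the partial-loading idea, assuming a made-up catalog document and the twig_handlers spelling of the option used by current XML::Twig releases. The handler fires each time an item element has been completely parsed, and purge then throws away what has been processed so far.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document; imagine thousands of <item> elements.
my $xml = '<catalog>'
        . '<item><title>First</title></item>'
        . '<item><title>Second</title></item>'
        . '</catalog>';

my @titles;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called with the twig and the element once </item> is reached.
        item => sub {
            my ($t, $item) = @_;
            push @titles, $item->first_child_text('title');
            $t->purge;    # release the memory used by the parsed part
        },
    },
);

$twig->parse($xml);

print "$_\n" for @titles;
```

Inside the handler you get full tree access to the item, yet at any moment only one item's worth of tree needs to be in memory.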
XML::Grove loads an XML document in memory and creates a tree of Perl objects that can be accessed and manipulated. Its interface is more perlish than the DOM one, including the capability for creating visitor classes on a Grove. XML::Grove can also be used on SGML and HTML documents. You might want to use XML::Grove if you don't care about using the DOM standard, you prefer its style over the other tree-oriented modules, and/or you want to process XML, HTML, and SGML documents.