Using XML::Twig
If your problem is finding a fast, memory-efficient way to handle large XML documents, but the needs of your application make using the SAX interface overly complex, the solution is to use XML::Twig.
If you've been working with XML for a while, it's often tempting to frame
solutions to new problems in the context of the tools you've used
successfully in the past. In other words, if you are most familiar
with the DOM interface, you're likely to approach new challenges from
a more-or-less DOMish perspective. While there's plenty to be said for
doing what you know will work, experience shows that there is no one
right way to process XML. With this in mind, Michel Rodriguez's
XML::Twig embodies Perl's penchant for borrowing the best
features of the tools that have come before. XML::Twig combines the
efficiency and small footprint of SAX processing with the power of
XPath's node selection syntax, and it adds a few clever tricks of its
own.
To use XML::Twig successfully, you need to recognize that XML document trees are typically composed of smaller tree-like structures, called twigs. Consider the following simplified representation of an XHTML document tree:
          html
         /    \
      head    body
      /  \       \
 script  title   div
                 /  \
                h1   p
We see that the head and body elements are branches (or twigs)
connected to the root html element, and, in turn, those elements
contain smaller tree-like structures (the script, title and div
elements), and so on. XML::Twig lets us operate on all or
part of the document tree by accessing the individual twigs
themselves. We can operate on twigs by using a subset of the XPath
syntax to select only those structures that are relevant to the task
at hand. This ability to pick and choose some of the twigs of the
larger tree, while passing over the rest, gives XML::Twig
its power, speed, and flexibility.
Passed to XML::Twig's object constructor, the
TwigRoots argument accepts a single hash reference, the
keys of which are XPath-like expressions that define the elements in
the input document one wants to include in the output tree. If one or
more TwigRoots are defined, only those elements defined
as Roots will be included in the result tree.
Let's say, for example, that we need to create a table of contents
for the XML version of one of the books available through
Project Gutenberg. These electronic books are often quite large, but
our table of contents need only include the title of the book and the
titles of the various chapters. Fortunately, this is just the sort of
task that TwigRoots was designed to handle.
First let's look at a simplified excerpt from Homer's Iliad:
<gutbook>
  ...
  <book>
    <frontmatter>
      <titlepage>
        <title>THE ILIAD</title>
        <author>HOMER</author>
      </titlepage>
    </frontmatter>
    <bookbody>
      <chapter>
        <title>BOOK I</title>
        <para>
          Sing, O goddess, the anger of Achilles son of Peleus, that
          brought countless ills upon the Achaeans.
          ...
        </para>
        ...
      </chapter>
      ...
    </bookbody>
  </book>
</gutbook>
To capture all of the <title> elements contained in the document, we need only define a single TwigRoot, using the expression 'title' as the hash key:
use XML::Twig;
my $file = $ARGV[0];
my $twig = XML::Twig->new(TwigRoots => {title => 1});
$twig->parsefile($file);
$twig->print;
After processing, the output looks like this:
<gutbook> <title>THE ILIAD</title> <title>BOOK I</title> <title>BOOK II</title> <title>BOOK III</title> ... </gutbook>
This is not a very descriptive table of contents, but it illustrates
how TwigRoots allows us to capture only the elements we
need in the output tree.
Remember that the expressions that define the TwigRoots
are XPath-like, so, for example, if we wanted to build our table of
contents from only those <title> elements with a <chapter>
element as a parent, we would change the key in our TwigRoots
hash to
TwigRoots => {'chapter/title' => 1}
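To see the effect of a path expression without downloading a full e-book, here is a minimal sketch that parses a tiny inline document shaped like the excerpt above. The sample string and the variable names are made up for illustration, and the paragraph contents are left as placeholders.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical miniature of the Gutenberg markup shown earlier.
my $xml = <<'XML';
<gutbook>
  <book>
    <frontmatter>
      <titlepage><title>THE ILIAD</title><author>HOMER</author></titlepage>
    </frontmatter>
    <bookbody>
      <chapter><title>BOOK I</title><para>...</para></chapter>
      <chapter><title>BOOK II</title><para>...</para></chapter>
    </bookbody>
  </book>
</gutbook>
XML

# Keep only <title> elements whose parent is <chapter>.
my $twig = XML::Twig->new(TwigRoots => { 'chapter/title' => 1 });
$twig->parse($xml);

# The kept twigs hang directly off the root of the result tree.
my @chapter_titles = map { $_->text } $twig->root->children('title');
print "$_\n" for @chapter_titles;
```

The book's own title sits under <titlepage>, not <chapter>, so it should not appear in the list.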
In the same way that TwigRoots allows us to prune the
output tree to include only those structures that we care about,
TwigHandlers allow us to operate on specific subtrees
within the document, while leaving the rest of the tree untouched. We
achieve this by binding callbacks (subroutine handlers) to the
expressions that define the twigs themselves.
Returning to our table of contents script, let's set callbacks for the two different types of <title> elements; each callback adds a descriptive attribute to its type of element:
my $twig_handlers = {'titlepage/title' => \&book_title,
                     'chapter/title'   => \&chapter_title};
my $twig = XML::Twig->new(TwigRoots    => {title => 1},
                          TwigHandlers => $twig_handlers);
$twig->parsefile($file);
$twig->print;
sub book_title {
    my ($twig, $title) = @_;
    $title->set_att('type', 'book');
}
sub chapter_title {
    my ($twig, $title) = @_;
    $title->set_att('type', 'chapter');
}
With this addition, our output will now look something like this:
<gutbook> <title type="book">THE ILIAD</title> <title type="chapter">BOOK I</title> <title type="chapter">BOOK II</title> ... </gutbook>
The entire contents of the twigs are processed before passing them along to the callbacks, so any child elements they may contain (branches within the twig) are also available. So, if we had chosen to define a handler for the <chapter> elements, rather than those matching the path "chapter/title", we could access the chapter's title with
sub chapter_handler {
    my ($twig_obj, $chapter_element) = @_;
    my $title_element = $chapter_element->first_child('title');
    ...
}
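A fuller version of such a chapter handler might look like the sketch below. The inline sample document, the @toc array, and the entry format are assumptions made for the example. The call to purge is one way to release chapters that have already been handled, which is what keeps memory use low on large books.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample; paragraph contents are placeholders.
my $xml = <<'XML';
<bookbody>
  <chapter><title>BOOK I</title><para>...</para><para>...</para></chapter>
  <chapter><title>BOOK II</title><para>...</para></chapter>
</bookbody>
XML

my @toc;    # one entry per chapter: "title: N para(s)"

sub chapter_handler {
    my ($twig_obj, $chapter_element) = @_;
    # The whole <chapter> twig is built by now, so its children are reachable.
    my $title_element = $chapter_element->first_child('title');
    my @paras         = $chapter_element->children('para');
    push @toc, sprintf '%s: %d para(s)', $title_element->text, scalar @paras;
    # Discard everything parsed so far; we no longer need this chapter.
    $twig_obj->purge;
}

my $twig = XML::Twig->new(TwigHandlers => { chapter => \&chapter_handler });
$twig->parse($xml);
print "$_\n" for @toc;
```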
In addition to TwigHandlers, XML::Twig
allows you to set callbacks for handling DTD events, SAX-style
(start_element, character, end_element) events, and a host of
others. Each element within a twig has a wide range of possible
methods available to help make the task of processing as easy and
flexible as possible. Unfortunately, space does not permit me to
cover these in detail. I encourage you to run perldoc
XML::Twig for the complete list of possible handlers and
element methods.
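As one illustration of those other callbacks, the sketch below uses the start_tag_handlers option, which fires when an element's opening tag is parsed, before its content has been read. The sample document and its id attributes are invented for the example; consult perldoc XML::Twig for the exact behavior in your installed version.

```perl
use strict;
use warnings;
use XML::Twig;

# Invented sample: the id attributes are not part of the Gutenberg markup.
my $xml = '<bookbody><chapter id="c1"><title>BOOK I</title></chapter>'
        . '<chapter id="c2"><title>BOOK II</title></chapter></bookbody>';

my @opened;    # ids recorded as each <chapter> start tag is seen

my $twig = XML::Twig->new(
    start_tag_handlers => {
        chapter => sub {
            my ($twig_obj, $element) = @_;
            # Only the attributes exist at this point; children come later.
            push @opened, $element->att('id');
        },
    },
);
$twig->parse($xml);
print "opened: @opened\n";
```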
For our final example, let's use what we've learned so far to build
a simple command line tool that will allow us to perform keyword
searches on the contents of an e-book. This script presumes that you
have already processed the book using the gut2xhtml.pl
script, available with this month's sample code, that translates the
Gutenberg XML files to simple XHTML and adds named anchors for each
chapter and paragraph.
use XML::Twig;
use HTML::Entities;
my ($match_word, $file) = @ARGV;
my ($current_chapter, $last_chapter, $global_match);
my $twig = XML::Twig->new(TwigHandlers => { 'p' => \&paragraph, 'h2' => \&chapter_heading },
                          TwigRoots    => {body => 1});
$twig->parsefile($file); # build the twig
$twig->print;
warn "Sorry, no matches found for '$match_word'\n" unless $global_match;
So far our search script is similar to the previous examples. We have
initialized a few variables and created a new XML::Twig
object, setting the <body> element as the sole
TwigRoot. We have also set TwigHandlers for
all <p> and <h2> elements in the document. Let's move on
to the TwigHandler callbacks.
sub paragraph {
    my ($twig, $para) = @_;
    my $para_text = $para->text;
    $para_text =~ s/\n/ /g;
    if ($para_text =~ /\b(.{0,30}\b$match_word.{0,30}\b)/is) {
        my $snippet = $1;
        $snippet = decode_entities($snippet);
        $global_match++;
Here we've copied the paragraph's text into the
$para_text variable, then checked to see if
$para_text contains the word or phrase that the user
passed from the command line. If we have a match, we extract a small
snippet of the paragraph text (30 characters to the left and right of
the match, if they exist) and increment our global match counter.
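The snippet pattern can be tried on its own, outside of any XML processing. The helper below is a hypothetical wrapper around the same regular expression, run against the opening line of the Iliad excerpt shown earlier. One deliberate difference is the \Q...\E around the search term, an addition that treats the term as literal text in case it contains regex metacharacters.

```perl
use strict;
use warnings;

# Return up to 30 characters of context on either side of the match,
# trimmed to word boundaries, or undef when the text has no match.
sub snippet_for {
    my ($text, $match_word) = @_;
    $text =~ s/\n/ /g;    # flatten line breaks, as the handler does
    # \Q...\E is an addition here: it makes the search term literal.
    return $text =~ /\b(.{0,30}\b\Q$match_word\E.{0,30}\b)/is ? $1 : undef;
}

my $text = "Sing, O goddess, the anger of Achilles son of Peleus, that\n"
         . "brought countless ills upon the Achaeans.";

my $snippet = snippet_for($text, 'son of peleus');
print defined $snippet ? "...$snippet...\n" : "no match\n";
```

Because the pattern starts at the first word boundary within 30 characters of the match, the snippet may begin or end with punctuation or a space; a real tool might want to trim those.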
        my $anchor = $para->first_child;
        my $para_ref = $anchor->att('name');
        my $link = XML::Twig::Elt->new('a');
        $link->set_att('href', $file . '#' . $para_ref);
        $link->set_text($para_ref);
        $para->set_text(" - ...$snippet...");
        $link->paste('first_child', $para);
Now we've retrieved the value of the paragraph's named anchor attribute and created a new HTML hyperlink element (<a>); we've added an 'href' attribute that points to the paragraph's location in the original XHTML document and set the text of the link to the same value. This link lets users jump directly to the matching paragraph in the original document if they want to view the match in a broader context.
        if ((!$last_chapter) || ($last_chapter ne $current_chapter)) {
            my $header = XML::Twig::Elt->new('h2');
            $header->set_text($current_chapter);
            $header->paste('first_child', $para);
        }
        $last_chapter = $current_chapter;
Here we've checked whether the current match falls in the same chapter as the previous one; if it doesn't, we add a new <h2> heading to keep the results visually organized.
    }
    else {
        $para->delete;
    }
}
The last part of the paragraph handler deletes the twig from the result tree if the paragraph didn't contain a match for the specified keyword. This ensures that only those paragraphs containing a match will make it into the final output.
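The keep-or-delete pattern can be exercised on a very small input. The sketch below is a condensed, self-contained variant of the handler above: the XHTML fragment, the file name, and the fixed search term are all assumptions made for the example, and the snippet logic is replaced with a constant string to keep it short.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical XHTML fragment with a named anchor at the start of each paragraph.
my $xhtml = '<body><p><a name="1.1"/>Sing, O goddess, the anger of Achilles.</p>'
          . '<p><a name="1.2"/>...</p></body>';

my $file = 'iliad.html';    # assumed file name, used only to build the href

sub paragraph {
    my ($twig, $para) = @_;
    if ($para->text =~ /achilles/i) {
        my $para_ref = $para->first_child('a')->att('name');
        my $link     = XML::Twig::Elt->new('a');
        $link->set_att('href', "$file#$para_ref");
        $link->set_text($para_ref);
        $para->set_text(' - match');           # replaces the old children
        $link->paste('first_child', $para);    # then the link goes in front
    }
    else {
        $para->delete;                         # drop non-matching paragraphs
    }
}

my $twig = XML::Twig->new(TwigHandlers => { p => \&paragraph });
$twig->parse($xhtml);
my $result = $twig->sprint;
print "$result\n";
```

Only the first paragraph mentions Achilles, so the second one should be absent from the printed result.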
sub chapter_heading {
    my ($twig, $chapter_heading) = @_;
    $current_chapter = $chapter_heading->text;
    $chapter_heading->delete;
}
And now we've created a handler for the original document's chapter
headings. Here we need only set the global
$current_chapter variable for use in the paragraph
handler and delete the element from the output.
Saving this script as xhtml_search.pl, let's use it to
search our XHTML version of The Iliad for all
references to the sons of the Trojan king, Priam.
$ perl xhtml_search.pl 'son of priam' /home/books/illiad.html
<html>
<body>
<p>
<h2>BOOK IV</h2>
<a href="/home/books/illiad.html#4.36">4.36</a> - ... of the gleaming
corslet, son of Priam, hurled a spear at Ajax from ...</p>
<p>
<h2>BOOK V</h2>
<a href="/home/books/illiad.html#5.52">5.52</a> - ... besought him, saying,
"Son of Priam, let me not be here to fall ...
</p>
...
</body>
</html>
XML::Twig is an excellent example of thinking
Perlishly about XML. Developers familiar with the DOM, SAX, or XPath
interfaces may struggle a bit with some of
XML::Twig's naming conventions, but the power it
provides, combined with the ways in which it simplifies tasks that
would be troublesome using one of the standard APIs, makes
Twig a strong addition to any Perl-XML developer's bag of
tricks. If you're intrigued by this short tutorial, I suggest a visit
to Michel Rodriguez's xmltwig.com
for more information.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.