Using XML::Twig

March 21, 2001

If your problem is finding a fast, memory-efficient way to handle large XML documents, but the needs of your application make using the SAX interface overly complex, the solution is to use XML::Twig.

Why XML::Twig?

If you've been working with XML for a while, it's often tempting to frame solutions to new problems in the context of the tools you've used successfully in the past. In other words, if you are most familiar with the DOM interface, you're likely to approach new challenges from a more-or-less DOMish perspective. While there's plenty to be said for doing what you know will work, experience shows that there is no one right way to process XML. With this in mind, Michel Rodriguez's XML::Twig embodies Perl's penchant for borrowing the best features of the tools that have come before. XML::Twig combines the efficiency and small footprint of SAX processing with the power of XPath's node selection syntax, and it adds a few clever tricks of its own.

Understanding Twigs

To use XML::Twig successfully, you've got to realize that XML document trees are typically composed of smaller tree-like structures, which are called twigs. Consider the following simplified representation of an XHTML document tree:


          html

          /   \

       head    body

        / \      \

   script title  div

                 / \

               h1   p

We see that head and body elements are branches (or twigs) connected to the root html element, and, in turn, those elements contain smaller tree-like structures (the script, title and div elements), and so on. XML::Twig lets us operate on all or part of the document tree by accessing the individual twigs themselves. We can operate on twigs by using a subset of the XPath syntax to select only those structures that are relevant to the task at hand. This ability to pick and choose some of the twigs of the larger tree, while passing over the rest, gives XML::Twig its power, speed, and flexibility.
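To make the idea of twigs within twigs concrete, here is a minimal sketch that loads a small XHTML document and walks it one branch at a time. The inline document and the show_twig helper are illustrative assumptions, not part of the original article; the sketch builds the whole tree, which is fine for a document this small.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document mirroring the tree diagram above.
my $xhtml = <<'XHTML';
<html>
  <head><script/><title>Sample</title></head>
  <body><div><h1>Heading</h1><p>Text</p></div></body>
</html>
XHTML

my $twig = XML::Twig->new;
$twig->parse($xhtml);

my @tags;    # tags in document order, collected for later inspection

# Recursively print each element's tag, indented by its depth in the tree.
sub show_twig {
    my ($elt, $depth) = @_;
    push @tags, $elt->tag;
    print '  ' x $depth, $elt->tag, "\n";
    # '#ELT' restricts the child list to elements, skipping text nodes.
    show_twig($_, $depth + 1) for $elt->children('#ELT');
}

show_twig($twig->root, 0);
```

Each call to show_twig treats the element it receives as the root of its own small tree, which is the same view of the document that XML::Twig's handlers work with.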

TwigRoots

Passed to XML::Twig's object constructor, the TwigRoots argument accepts a single hash reference, the keys of which are XPath-like expressions that define the elements in the input document one wants to include in the output tree. If one or more TwigRoots are defined, only those elements defined as Roots will be included in the result tree.

Let's say, for example, that we need to create a table of contents for the XML version of one of the books available through Project Gutenberg. These electronic books are often quite large, but our table of contents need only include the title of the book and the titles of the various chapters. Fortunately, this is just the sort of task that TwigRoots was designed to handle.

First let's look at a simplified excerpt from Homer's Iliad:


<gutbook>

...

  <book>

    <frontmatter>

      <titlepage>

        <title>THE ILIAD</title>

        <author>HOMER</author>

      </titlepage>

    </frontmatter>

    <bookbody>

      <chapter>

      <title>BOOK I</title>

        <para>

          Sing, O goddess, the anger of Achilles son of Peleus, that

          brought countless ills upon the Achaeans.

          ...

        </para>

        ...

       </chapter>

       ...

    </bookbody>

  </book>

</gutbook>

To capture all of the <title> elements contained in the document we need only define a single TwigRoot, passing it the expression 'title' as the key.


use XML::Twig;



my $file = $ARGV[0];

my $twig = XML::Twig->new(TwigRoots => {title => 1});

$twig->parsefile($file);

$twig->print;

After processing, the output looks like this:


<gutbook>

  <title>THE ILIAD</title>

  <title>BOOK I</title>

  <title>BOOK II</title>

  <title>BOOK III</title>

  ...

</gutbook>

This is not a very descriptive table of contents, but it illustrates how TwigRoots allows us to capture only the elements we need in the output tree.

Remember that the expressions that define the TwigRoots are XPath-like, so, for example, if we wanted to build our table of contents from only those <title> elements with a <chapter> element as a parent, we would change the key in our TwigRoots hash to


TwigRoots => {'chapter/title' => 1}
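The following self-contained sketch shows the narrower root in action. It assumes a cut-down inline version of the sample document and uses twig_roots, the lowercase spelling that XML::Twig accepts for the same option. Instead of printing the result tree, it collects the text of the titles that survived the pruning.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical, abbreviated version of the sample document.
my $xml = <<'XML';
<gutbook>
  <book>
    <frontmatter>
      <titlepage><title>THE ILIAD</title><author>HOMER</author></titlepage>
    </frontmatter>
    <bookbody>
      <chapter><title>BOOK I</title><para>...</para></chapter>
      <chapter><title>BOOK II</title><para>...</para></chapter>
    </bookbody>
  </book>
</gutbook>
XML

# Only <title> elements whose parent is <chapter> are built into the tree.
my $twig = XML::Twig->new(twig_roots => { 'chapter/title' => 1 });
$twig->parse($xml);

# Every <title> left in the result tree is a chapter title.
my @chapter_titles = map { $_->text } $twig->root->descendants('title');
print "$_\n" for @chapter_titles;
```

The book's own title never reaches the tree because its parent is <titlepage>, not <chapter>.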

TwigHandlers

In the same way that TwigRoots allows us to prune the output tree to include only those structures that we care about, TwigHandlers allow us to operate on specific subtrees within the document, while leaving the rest of the tree untouched. We achieve this by binding callbacks (subroutine handlers) to the expressions that define the twigs themselves.

Returning to our table of contents script, let's set two callbacks, one for each type of <title> element, that add a descriptive attribute to the elements they handle:


my $twig_handlers = {'titlepage/title' =>  \&book_title,

                     'chapter/title'   =>  \&chapter_title};





my $twig = XML::Twig->new(TwigRoots    => {title => 1},

                          TwigHandlers => $twig_handlers);



$twig->parsefile($file);

$twig->print;



sub book_title{

    my ($twig, $title) = @_;

    $title->set_att('type', 'book');

}



sub chapter_title {

    my ($twig, $title) = @_;

    $title->set_att('type', 'chapter');

}

With this addition, our output will now look something like


<gutbook>

  <title type="book">THE ILIAD</title>

  <title type="chapter">BOOK I</title>

  <title type="chapter">BOOK II</title>

  ...

</gutbook>

The entire contents of the twigs are processed before passing them along to the callbacks, so any child elements they may contain (branches within the twig) are also available. So, if we had chosen to define a handler for the <chapter> elements, rather than those matching the path "chapter/title", we could access the chapter's title with


sub chapter_handler {

    my ($twig_obj, $chapter_element) = @_;

    my $title_element = $chapter_element->first_child('title');

    ...

}
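As a fuller illustration of a handler on <chapter>, the sketch below records each chapter's title and paragraph count, then calls purge to release the parts of the tree that have already been processed. The inline document and the @toc structure are assumptions made for the example; purge is the XML::Twig method for discarding the tree built so far without printing it.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical, abbreviated version of the sample document.
my $xml = <<'XML';
<gutbook><book><bookbody>
  <chapter><title>BOOK I</title><para>...</para><para>...</para></chapter>
  <chapter><title>BOOK II</title><para>...</para></chapter>
</bookbody></book></gutbook>
XML

my @toc;    # one entry per chapter: { title => ..., paragraphs => ... }

my $twig = XML::Twig->new(
    twig_handlers => { chapter => \&chapter_handler },
);
$twig->parse($xml);

printf "%s (%d paragraphs)\n", $_->{title}, $_->{paragraphs} for @toc;

sub chapter_handler {
    my ($twig_obj, $chapter_element) = @_;

    # The whole chapter has been parsed, so its children are available.
    my $title_element = $chapter_element->first_child('title');
    my @paragraphs    = $chapter_element->children('para');

    push @toc, {
        title      => $title_element->text,
        paragraphs => scalar @paragraphs,
    };

    # Free the memory used by everything parsed up to this point.
    $twig_obj->purge;
    return 1;
}
```

Purging inside the handler is what keeps memory use low on large books: only one chapter's worth of tree exists at any moment.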

Other Handlers and Methods

In addition to TwigHandlers, XML::Twig allows you to set callbacks for handling DTD events, SAX-style (start_element, character, end_element) events, and a host of others. Each element within a twig has a wide range of possible methods available to help make the task of processing as easy and flexible as possible. Unfortunately, space does not permit me to cover these in detail. I encourage you to run perldoc XML::Twig for the complete list of possible handlers and element methods.
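As one small taste of those other handlers, here is a hedged sketch using start_tag_handlers, which fire when an element's start tag has been parsed rather than when the element is complete. The inline document and the two counters are assumptions made for the example; consult perldoc XML::Twig for the full behavior of the option.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical two-chapter document.
my $xml = '<bookbody>'
        . '<chapter><title>BOOK I</title><para>...</para></chapter>'
        . '<chapter><title>BOOK II</title><para>...</para></chapter>'
        . '</bookbody>';

my ($chapters_opened, $chapters_closed) = (0, 0);

my $twig = XML::Twig->new(
    # Called as soon as a <chapter> start tag has been parsed.
    start_tag_handlers => { chapter => sub { $chapters_opened++; 1 } },
    # Called once the whole <chapter> twig has been built.
    twig_handlers      => { chapter => sub { $chapters_closed++; 1 } },
);
$twig->parse($xml);

print "opened: $chapters_opened, closed: $chapters_closed\n";
```

A start tag handler sees the element's name and attributes but not yet its children, so it suits bookkeeping that must happen before the element's content arrives.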

Putting It Together

For our final example, let's use what we've learned so far to build a simple command line tool that will allow us to perform keyword searches on the contents of an e-book. This script presumes that you have already processed the book using the gut2xhtml.pl script, available with this month's sample code, that translates the Gutenberg XML files to simple XHTML and adds named anchors for each chapter and paragraph.


use XML::Twig;

use HTML::Entities;



my ($match_word, $file) = @ARGV;

my ($current_chapter, $last_chapter, $global_match);



my $twig = XML::Twig->new(TwigHandlers => { 'p'  => \&paragraph, 'h2' => \&chapter_heading},

                          TwigRoots    => {body => 1});



$twig->parsefile($file);    # build the twig



$twig->print;



warn "Sorry, no matches found for '$match_word'\n" unless $global_match;

So far our search script is similar to the previous examples. We have initialized a few variables and created a new XML::Twig object, setting the <body> element as the sole TwigRoot. We have also set TwigHandlers for all <p> and <h2> elements in the document. Let's move on to the TwigHandler callbacks.


sub paragraph {

    my ($twig, $para) = @_;

    my $para_text = $para->text;

    $para_text =~ s/\n/ /g;



    if ($para_text =~ /\b(.{0,30}\b$match_word.{0,30}\b)/is) {

        my $snippet = $1;



        $snippet = decode_entities($snippet);

        $global_match++;

Here we've copied the paragraph's text into the $para_text variable, then checked to see if $para_text contains the word or phrase that the user passed from the command line. If we have a match, we extract a small snippet of the paragraph text (up to 30 characters on either side of the match, trimmed to word boundaries) and increment our global match counter.


        my $anchor = $para->first_child;

        my $para_ref = $anchor->att('name');

        my $link = XML::Twig::Elt->new('a');

        $link->set_att('href', $file . '#' . $para_ref);

        $link->set_text($para_ref);

        $para->set_text(" - ...$snippet...");

        $link->paste('first_child', $para);

Now we've retrieved the name of the paragraph's anchor and created a new HTML hyperlink element (<a>). We've added an 'href' attribute that points to the paragraph's location in the original XHTML document and used the anchor name as the link's text. Finally, we've replaced the paragraph's content with the snippet and pasted the link in front of it. This link lets users jump directly to the matching paragraph in the original document if they want to view the match in a broader context.


        if ((!$last_chapter) || ($last_chapter ne $current_chapter)) {

            my $header = XML::Twig::Elt->new('h2');

            $header->set_text($current_chapter);

            $header->paste('first_child', $para);

        }



        $last_chapter = $current_chapter;

Here we've simply checked whether or not our current match is within the previous chapter; if not, we add a new <h2> heading to keep the result visually organized.


    }

    else {

        $para->delete;

    }

}

The last part of the paragraph handler deletes the twig from the result tree if the paragraph didn't contain a match for the specified keyword. This ensures that only those paragraphs containing a match will make it into the final output.


sub chapter_heading {

    my ($twig, $chapter_heading) = @_;

    $current_chapter = $chapter_heading->text;

    $chapter_heading->delete;

}

And now we've created a handler for the original document's chapter headings. Here we need only set the global $current_chapter variable for use in the paragraph handler and delete the element from the output.

Saving this script as xhtml_search.pl, let's use it to search our XHTML version of The Iliad for all references to the sons of the Trojan king, Priam.

$ perl xhtml_search.pl 'son of priam' /home/books/illiad.html



<html>

  <body>

    <p>

      <h2>BOOK IV</h2>

      <a href="/home/books/illiad.html#4.36">4.36</a> - ... of the gleaming

      corslet, son of Priam, hurled a spear at Ajax from ...</p>

    <p>

      <h2>BOOK V</h2>

      <a href="/home/books/illiad.html#5.52">5.52</a> - ... besought him, saying,

      "Son of Priam, let me not be here to fall ...

    </p>

    ...

  </body>

</html>
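The snippet extraction and link building done by the paragraph handler can also be tried in isolation on a single inline paragraph. The sketch below makes several assumptions: the file name iliad.html is hypothetical, the paragraph markup only approximates what gut2xhtml.pl produces, and the pattern wraps the search phrase in \Q...\E so that it is matched literally, which the script above does not do.

```perl
use strict;
use warnings;
use XML::Twig;

my $file       = 'iliad.html';      # hypothetical name, used only in the href
my $match_word = 'son of priam';

# Hypothetical paragraph: a named anchor followed by the paragraph text.
my $twig = XML::Twig->new;
$twig->parse('<p><a name="4.36"/>Then Antiphus of the gleaming corslet, '
           . 'son of Priam, hurled a spear at Ajax from amid the crowd.</p>');
my $para = $twig->root;

(my $para_text = $para->text) =~ s/\n/ /g;

# \Q...\E is an addition to the article's pattern: it escapes any regex
# metacharacters the user may have typed in the search phrase.
if ($para_text =~ /\b(.{0,30}\b\Q$match_word\E.{0,30}\b)/is) {
    my $snippet  = $1;
    my $para_ref = $para->first_child('a')->att('name');

    # Build the link back to the paragraph in the original document.
    my $link = XML::Twig::Elt->new('a');
    $link->set_att(href => "$file#$para_ref");
    $link->set_text($para_ref);

    # Replace the paragraph's content with the snippet, then put the link first.
    $para->set_text(" - ...$snippet...");
    $link->paste(first_child => $para);
}

print $para->sprint, "\n";
```

Without the escaping, a search phrase containing characters such as parentheses or a question mark would be interpreted as part of the regular expression.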

Conclusions

XML::Twig is an excellent example of thinking Perlishly about XML. Developers familiar with the DOM, SAX, or XPath interfaces may struggle a bit with some of XML::Twig's naming conventions, but the power it provides, combined with the ways in which it simplifies tasks that would be troublesome using one of the standard APIs, makes Twig a strong addition to any Perl-XML developer's bag of tricks. If you're intrigued by this short tutorial, I suggest a visit to Michel Rodriguez's xmltwig.com for more information.