XML.com: XML From the Inside Out

High-Performance XML Parsing With SAX

February 14, 2001

The problem: The XML documents you have to parse are getting too large to load the entire document tree into memory; performance is suffering. The solution: use SAX.

Understanding Event-Driven XML Processing

SAX (Simple API for XML) is an event-driven model for processing XML. Most XML processing models (DOM and XPath, for example) build an internal, tree-shaped representation of the XML document. The developer then uses that model's API (getElementsByTagName in the DOM, or findnodes with XPath) to access the contents of the document tree. The SAX model is quite different. Rather than building a complete representation of the document, a SAX parser fires off a series of events as it reads the document from beginning to end. Those events are passed to event handlers, which provide access to the contents of the document.

Event Handlers

There are three classes of event handlers: DTDHandlers, for accessing the contents of XML Document-Type Definitions; ErrorHandlers, for low-level access to parsing errors; and, by far the most often used, DocumentHandlers, for accessing the contents of the document. For clarity's sake, I'll only cover DocumentHandler events.

A SAX processor will pass the following events to a DocumentHandler:

  • The start of the document.
  • A processing instruction.
  • A comment.
  • The beginning of an element, including that element's attributes.
  • The text contained within an element.
  • The end of an element.
  • The end of the document.

Consider the following XML fragment.

<doc>
  <quote>There are more things in heaven
  and earth, Horatio, Than are dreamt
  of in your philosophy.</quote>
</doc>

Now consider the same fragment with the various DocumentHandler events drawn in.

start_document-->
start_element--->
                <doc>
start_element--->
                <quote>
            --->There are more things in heaven
characters-|    and earth, Horatio, Than are dreamt
            --->of in your philosophy.
                </quote>
end_element----->
                </doc>
end_element----->
end_document---->
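
To watch these events arrive for yourself, a handler can simply log each call it receives. The following is a minimal sketch, assuming XML::Parser::PerlSAX is installed; EventLogger and its log_event method are names invented for this illustration, and the document is passed in as a string rather than read from a file.

```perl
use strict;
use warnings;
use XML::Parser::PerlSAX;

# EventLogger is a hypothetical handler, written only for this sketch.
package EventLogger;

sub new {
    my $class = shift;
    return bless { events => [] }, $class;
}

# Remember each event and echo it to STDOUT.
sub log_event {
    my ($self, $line) = @_;
    push @{ $self->{events} }, $line;
    print "$line\n";
}

sub start_document { my ($self) = @_; $self->log_event('start_document') }
sub end_document   { my ($self) = @_; $self->log_event('end_document') }

sub start_element {
    my ($self, $element) = @_;
    $self->log_event("start_element <$element->{Name}>");
}

sub end_element {
    my ($self, $element) = @_;
    $self->log_event("end_element </$element->{Name}>");
}

sub characters {
    my ($self, $characters) = @_;
    # Report only the size; the text itself may be whitespace or a partial chunk
    $self->log_event('characters (' . length($characters->{Data}) . ' chars)');
}

package main;

my $xml = <<'XML';
<doc>
  <quote>There are more things in heaven
  and earth, Horatio, Than are dreamt
  of in your philosophy.</quote>
</doc>
XML

my $logger = EventLogger->new;
my $parser = XML::Parser::PerlSAX->new(Handler => $logger);
$parser->parse(Source => { String => $xml });
```

The element and document events appear in the order drawn above. The number of characters lines depends on how the parser chooses to split the text, so it is worth running the script rather than guessing.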

The behavior of character events often misleads newcomers to SAX, who usually expect the entire contents of an element to be delivered to the handler in one lump. The underlying implementation breaks up character blocks on newlines, so a single block of text may fire several character events. This means that you can't be sure you have all the text in an element until you are sent the end-element event.
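
One common way to cope with split character events is to append each chunk to a buffer and only treat the text as complete in end_element. The sketch below illustrates the idea without a parser: QuoteCollector is a hypothetical handler, and the events are fired by hand in the way a parser might split the quote.

```perl
use strict;
use warnings;

# QuoteCollector is a hypothetical handler, written only to show buffering.
package QuoteCollector;

sub new {
    my $class = shift;
    return bless { buffer => '', quotes => [] }, $class;
}

sub start_element {
    my ($self, $element) = @_;
    # Start with an empty buffer each time a <quote> opens
    $self->{buffer} = '' if $element->{Name} eq 'quote';
}

sub characters {
    my ($self, $characters) = @_;
    # Each call may carry only part of the element's text, so append
    $self->{buffer} .= $characters->{Data};
}

sub end_element {
    my ($self, $element) = @_;
    # Only at this point is the text known to be complete
    push @{ $self->{quotes} }, $self->{buffer} if $element->{Name} eq 'quote';
}

package main;

# Drive the handler by hand, one characters event per line of text
my $collector = QuoteCollector->new;
$collector->start_element({ Name => 'quote' });
$collector->characters({ Data => "There are more things in heaven\n" });
$collector->characters({ Data => "and earth, Horatio, Than are dreamt\n" });
$collector->characters({ Data => "of in your philosophy." });
$collector->end_element({ Name => 'quote' });

print $collector->{quotes}[0], "\n";
```

The three chunks come out as one three-line quote, which is what a handler would have to do whether the text arrived in one event or several.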

Using XML::Parser::PerlSAX

For purposes of this article, we'll imagine that we have an XML document that acts as a sort of mail queue. Since we cannot know how many messages the queue will hold, nor the length of the messages, we can avoid swamping the system's memory by using Ken MacLeod's XML::Parser::PerlSAX and a custom SAX handler rather than DOM or XPath.

With humblest apologies to Will S., a sample of the mail queue XML document looks like this:

<?xml version="1.0"?>
<messages>
 <message>
   <from>claudius@elsinore.gov</from>
   <to>maddog@elsinore.gov</to>
   <subject>Re: [RSVP] Impromptu Theatrical Performance Today!</subject>
   <body>
     Hamlet,

     The Queen and I sincerely look forward to attending your play.
     Glad to see that you're feeling better.

     Your Uncle and King,
     Claudius
   </body>
 </message>
 <message>
   <from>rosencrantz@elsinore.gov</from>
   <to>claudius@elsinore.gov</to>
   <subject>Project Update</subject>
   <body>
     My King,

     He suspects nothing. Guildenstern and I should be home
     within the week.

     -rosey
   </body>
 </message>
</messages>

Let's write a script to send the messages from our queue. The main body of the script consists of the initialization of the XML::Parser::PerlSAX and SAXMailHandler (our custom handler) objects and a method call to set the parser in motion.

# SAX mail - A simple SAX handler that sends e-mail

use strict;
use XML::Parser::PerlSAX;

my $handler = SAXMailHandler->new();
my $parser = XML::Parser::PerlSAX->new(Handler => $handler);
my $file = "mail.xml";

my %parser_args = (Source => {SystemId => $file});
$parser->parse(%parser_args);

exit;

In the same file, we add the handler as an in-line Perl package. The handler is a simple one. In a nutshell, it copies the contents of the various children of the <message> element into a hash named %mail_args; then, upon reaching the end of the <message> parent, passes that hash as the argument to the sendmail function in Mail::Sendmail.

# begin the in-line package
package SAXMailHandler;
use strict;
use Mail::Sendmail;

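# State shared by the handler methods. These are declared but not assigned
# here: the main script exits before this line is reached at run time, so an
# initializer at this point would never execute.
my (%mail_args, $current_element, $message_count, $sent_count);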

sub new {
    my $type = shift;
    return bless {}, $type;
}

After a bit of initialization and a simplified constructor method, we begin the handler methods. Remember that a SAX parser does not keep any of the document tree in memory. Thus, even while handling the character content of a particular element, the SAX API does not offer access to the name of the element whose text we are processing. So, in the start_element method of the handler, we set the package-wide $current_element to the name of the current element so we can access the name further downstream.

Note, too, that the XML document holds the main contents of each message as the child of the <body> element, while the sendmail function expects that information to be passed as the value for a key named message. We work around this by hard-coding $current_element to the value "message" if the current element's name is "body".

sub start_element {
    my ($self, $element) = @_;

    if ($element->{Name} eq 'message') {
        %mail_args = ();
        $message_count++;
    }
    elsif ($element->{Name} eq 'body') {
        $current_element = 'message';
    }
    else {
        $current_element = $element->{Name};
    }
}
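
A single $current_element variable is enough here because the mail queue is flat. For more deeply nested documents, one possible alternative (not what this article's script does) is to keep a stack of element names, pushing in start_element and popping in end_element. The sketch below assumes nothing beyond core Perl; PathTracker is a hypothetical name, and the events are fired by hand.

```perl
use strict;
use warnings;

# PathTracker is a hypothetical handler, written only for this sketch.
package PathTracker;

sub new {
    my $class = shift;
    return bless { stack => [], seen => [] }, $class;
}

sub start_element {
    my ($self, $element) = @_;
    push @{ $self->{stack} }, $element->{Name};    # entering an element
}

sub end_element {
    my ($self, $element) = @_;
    pop @{ $self->{stack} };                       # leaving it again
}

sub characters {
    my ($self, $characters) = @_;
    return unless $characters->{Data} =~ /\S/;     # skip whitespace-only chunks
    # The stack, from bottom to top, is the path to the element owning this text
    my $path = join '/', @{ $self->{stack} };
    push @{ $self->{seen} }, "${path}: $characters->{Data}";
}

package main;

my $tracker = PathTracker->new;
$tracker->start_element({ Name => 'messages' });
$tracker->start_element({ Name => 'message' });
$tracker->start_element({ Name => 'subject' });
$tracker->characters({ Data => 'Project Update' });
$tracker->end_element({ Name => 'subject' });
$tracker->characters({ Data => "\n " });
$tracker->end_element({ Name => 'message' });
$tracker->end_element({ Name => 'messages' });

print "$_\n" for @{ $tracker->{seen} };   # messages/message/subject: Project Update
```

Because the stack shrinks again in end_element, whitespace that follows a closing tag is attributed to the enclosing element rather than to the one that just ended.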

In the characters handler we first strip all the leading and trailing whitespace from the data, then append what's left, if anything, to the appropriate key in %mail_args. Note that we avoid altering the data if $current_element is set to "message", since we want the contents of the message to be passed to the mailer without modification.

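sub characters {
    my ($self, $characters) = @_;
    my $text = $characters->{Data};
    # Trim everything except the message body, which is passed through as-is
    unless ($current_element eq 'message') {
        $text =~ s/^\s+//;
        $text =~ s/\s+$//;
    }
    # Test length rather than truth so that a chunk of "0" is not dropped
    $mail_args{$current_element} .= $text if length $text;
}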

In the end_element handler we determine if we've reached the end of a <message> element, and, if so, we pass %mail_args to the sendmail function to send the message. If any errors occur while sending, they are printed to the terminal.

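sub end_element {
    my ($self, $element) = @_;

    # Forget the element we just left; otherwise whitespace that follows
    # </body> would still be appended, untrimmed, to the message text.
    $current_element = '';

    if ($element->{Name} eq 'message') {
        # sendmail returns true on success, so count only successful sends
        if (Mail::Sendmail::sendmail(%mail_args)) {
            $sent_count++;
        }
        else {
            warn "Mail Error: $Mail::Sendmail::error";
        }
    }
}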

Finally, we add a little user-friendly sugar to our script by using the start_document and end_document handlers to print informative messages to STDOUT.

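sub start_document {
    my ($self) = @_;
    # Start both counters at zero so the final report never prints an empty value
    ($message_count, $sent_count) = (0, 0);
    print "Starting SAX Mailer\n";
}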

sub end_document {
    my ($self) = @_;
    print "SAX Mailer Finished\n$sent_count of $message_count message(s) sent\n";
}

1; #Ye Olde 'Return True' for the in-line package...

SAX handlers can be, and often are, far more complex than the one here, but this example illustrates the fundamentals of SAX processing. Run perldoc XML::Parser::PerlSAX for more detailed coverage.

The Future of SAX in Perl

XML::Parser::PerlSAX offers a complete SAX1 API but, as you may be aware, SAX2 is now considered the standard. If you're wondering about SAX2 support for Perl, you should know that Ken MacLeod, author of XML::Parser::PerlSAX, as well as other top-notch XML Perl modules, has announced full SAX2 support for Perl using his excellent Orchard project.

Orchard provides a lightning-fast element/property model upon which developers can easily implement a wide range of XML APIs (or, for that matter, any node-based property set, not just XML). In addition to SAX2, the 2.0 beta release of Matt Sergeant's XML::XPath is also built upon Orchard and the performance gains are quite astonishing. If you are serious about high-performance XML processing in Perl, I strongly encourage you to visit the Orchard project for more information.

Resources