Writing SAX Drivers for Non-XML Data

September 19, 2001

Kip Hampton

In a previous column, we covered the basics of the Simple API for XML (SAX) and the modules that implement that interface in Perl. Over the course of the next two months we will move beyond these basic topics to look at two slightly more advanced ones: creating drivers that generate SAX events from non-XML sources and writing custom SAX filters. If you are not familiar with the way SAX works, please read High-Performance XML Parsing With SAX before proceeding.

What Is A SAX Driver, And Why Would You Want One?

SAX is an event-driven API in which the contents of an XML document are accessed through callback subroutines that fire based on various XML parsing events (the beginning of an element, the end of an element, character data, etc.). For the purposes of this article, a SAX driver (sometimes called a SAX generator) can be understood to mean any Perl class that can generate these SAX events.

In the most common case, a SAX driver acts as a proxy between an XML parser and the one or more handler classes written by the developer. The handler methods detailed in the SAX API are called as the parser makes its way through the document, thereby providing access to the contents of that XML document. In fact, this is precisely what SAX was designed for: to provide a simple means to access information stored in XML. As we will see, however, it is often handy to be able to generate these events from data sources other than XML documents.
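To make the driver/handler relationship concrete, here is a minimal sketch of the receiving side. The EventLogger class is a hypothetical handler invented for this illustration (it is not part of any SAX distribution); it simply records each SAX1-style event it is sent. The calls at the bottom play the role of a driver.

```perl
use strict;
use warnings;

# Hypothetical handler: records each SAX1-style event it receives.
package EventLogger;

sub new { my $class = shift; return bless({ log => [] }, $class) }

sub start_document { my $self = shift; push @{ $self->{log} }, 'start_document' }
sub end_document   { my $self = shift; push @{ $self->{log} }, 'end_document' }

sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{log} }, "start_element $el->{Name}";
}

sub end_element {
    my ($self, $el) = @_;
    push @{ $self->{log} }, "end_element $el->{Name}";
}

sub characters {
    my ($self, $ch) = @_;
    push @{ $self->{log} }, "characters '$ch->{Data}'";
}

package main;

my $handler = EventLogger->new;

# Any code that makes these calls in a sensible order is acting as a driver.
$handler->start_document;
$handler->start_element({ Name => 'greeting', Attributes => {} });
$handler->characters({ Data => 'hello' });
$handler->end_element({ Name => 'greeting' });
$handler->end_document;

print "$_\n" for @{ $handler->{log} };
```

Nothing in the handler knows or cares whether the events came from an XML parser or from hand-written calls like these, which is what makes non-XML drivers possible.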

A Simple Example: Dumping A Perl Hash As An XML Document

Before we look at our first example, it's important to note that a SAX driver is useless without a handler that receives the generated events and does something with the data passed. While the basics of writing SAX handlers are quite easy to grasp, the handlers themselves can sometimes be quite complex. Our focus here is on generating events, not handling them; so, for simplicity's sake, we will use Ken MacLeod's XML::Handler::XMLWriter (which takes a SAX event stream and prints it to STDOUT as an XML document) as the default handler throughout this article.

To show how to write a SAX driver we will create a simple inlined class that translates a typical Perl hash into a well-formed XML document where the keys are the element names and the values are the character data contained by those elements.

use strict;
use XML::Handler::XMLWriter;

The main portion of the script consists of nothing more than initialization of the new handler and driver objects and the call to the driver's parse method. We use the parse method to kick off the SAX event stream, passing the hash we wish to dump to XML as the sole argument. In this case we will use Perl's venerable built-in system environment dictionary %ENV.

my $writer = XML::Handler::XMLWriter->new();
my $driver = SAXDriverHash2XML->new(Handler => $writer);

$driver->parse(%ENV);

Next we create our driver class beginning with a typical constructor method.

package SAXDriverHash2XML;
# generate SAX1 events from a simple Perl HASH.
use strict;

# standard constructor
sub new {
    my ($proto, %args) = @_;
    my $class = ref($proto) || $proto;
    my $self = \%args;
    bless ($self, $class);
    return $self;
}

Finally we get to the substantial part of our driver, the parse method.

# generate the events
sub parse {
    my $self = shift;
    my %passed_hash = @_;

After slurping the sole argument into the local %passed_hash, we begin firing off the necessary SAX events to create our XML document. Recall that we passed a blessed instance of XML::Handler::XMLWriter as the default handler for our driver. Generating the SAX events is as simple as calling the appropriate handler methods on that object and passing the data through as arguments in the format that the handler expects. This is the essence of writing a custom SAX driver.

We begin the SAX event stream by calling the required start_document handler.

$self->{Handler}->start_document();

Now a Perl hash is a list of key-value pairs; but for our XML document to be well-formed, it must have a single top-level element. To meet the well-formedness requirement, we will add a top-level wrapper element named "root".

Pay special attention to the arguments we pass to the start_element handler. Perl SAX1 implementations expect a hash reference of named properties where the Name property is a string containing the element's name, and the Attributes property is another hash reference that contains the XML attributes attached to that element (empty in this case).

$self->{Handler}->start_element({Name => 'root', Attributes => {}});

Next we loop over the elements of %passed_hash using the each function, firing a start_element, characters, and end_element event for each entry. Note that the argument to the characters handler is a hash reference with a single property (Data) that contains the character data that will become the text content of the surrounding element.

while (my ($name, $value) = each(%passed_hash)) {
  $name = lc($name); # we like lower-case tag names
  $self->{Handler}->start_element({Name => $name,Attributes => {}});
  $self->{Handler}->characters({Data => $value});
  $self->{Handler}->end_element({Name => $name});
}

Finally we call the end_element event on the "root" wrapper element, followed by the end_document handler which signals the handler class that the "parse" is complete.

    $self->{Handler}->end_element({Name => 'root'});
    $self->{Handler}->end_document();
}
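For reference, the fragments above can be assembled into a single file along the following lines. This is a sketch rather than the article's exact script: it swaps XML::Handler::XMLWriter for a small hypothetical handler (TinyWriter, invented here) that builds a string and does no escaping, so that it runs without any CPAN modules. It also walks the keys in sorted order purely to make the output predictable, whereas each returns entries in no particular order.

```perl
use strict;
use warnings;

# Hypothetical stand-in for XML::Handler::XMLWriter: collects markup in a
# string instead of printing it, and performs no escaping.
package TinyWriter;

sub new            { my $class = shift; return bless({ out => '' }, $class) }
sub start_document { my $self = shift; $self->{out} = ''; return }
sub end_document   { my $self = shift; return $self->{out} }
sub start_element  { my ($self, $el) = @_; $self->{out} .= "<$el->{Name}>" }
sub end_element    { my ($self, $el) = @_; $self->{out} .= "</$el->{Name}>" }
sub characters     { my ($self, $ch) = @_; $self->{out} .= $ch->{Data} }

package SAXDriverHash2XML;

sub new {
    my ($proto, %args) = @_;
    my $class = ref($proto) || $proto;
    my $self  = { %args };
    return bless $self, $class;
}

sub parse {
    my ($self, %passed_hash) = @_;
    my $handler = $self->{Handler};

    $handler->start_document();
    $handler->start_element({ Name => 'root', Attributes => {} });

    for my $key (sort keys %passed_hash) {    # sorted only for stable output
        my $name = lc $key;
        $handler->start_element({ Name => $name, Attributes => {} });
        $handler->characters({ Data => $passed_hash{$key} });
        $handler->end_element({ Name => $name });
    }

    $handler->end_element({ Name => 'root' });
    return $handler->end_document();          # hand back whatever the handler returns
}

package main;

my $driver = SAXDriverHash2XML->new(Handler => TinyWriter->new);
my $xml    = $driver->parse(USER => 'kip', SHELL => '/bin/bash');
print "$xml\n";
```

Returning the value of end_document from parse is a design choice made for this sketch; it lets a handler that accumulates a result pass it back to the caller.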

Running this script on my machine yields an XML document that is similar to the following:

<?xml version="1.0"?>
<root>
  <bash_env>/home/kip/.bashrc</bash_env>
  <ostype>linux</ostype>
  <histsize>1000</histsize>
  <hostname>duchamp.hampton.ws</hostname>
  <user>kip</user>
  <hosttype>i386</hosttype>
  <home>/home/kip</home>
  <term>linux</term>
  <logname>kip</logname>
  <path>/usr/local/bin:/bin:/usr/bin:/usr/x11r6/bin:
/home/kip/bin</path>
  <shell>/bin/bash</shell>
  <mail>/var/spool/mail/kip</mail>
  <lang>en_us</lang>
</root>

Note that I said that the output of this script is similar to the snippet above. Actually, the resulting document puts all the elements and data on a single line. Remember, XML elements can have mixed content (containing both child elements and text) so all character data is important. The spaces and newline characters added to this example to make it more readable here are, in truth, text data contained by the "root" element and would have to be explicitly added via calls to the characters handler to produce an exact match.
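As a small illustration of that last point, the sketch below produces an indented document by sending the whitespace as explicit characters events. TinyWriter is again a hypothetical string-building handler used only so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical minimal handler that concatenates events into a string.
package TinyWriter;

sub new           { my $class = shift; return bless({ out => '' }, $class) }
sub start_element { my ($self, $el) = @_; $self->{out} .= "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; $self->{out} .= "</$el->{Name}>" }
sub characters    { my ($self, $ch) = @_; $self->{out} .= $ch->{Data} }

package main;

my $handler = TinyWriter->new;

$handler->start_element({ Name => 'root', Attributes => {} });
for my $pair ([user => 'kip'], [term => 'linux']) {
    my ($name, $value) = @$pair;
    # The newline and indent are ordinary text content of <root>.
    $handler->characters({ Data => "\n  " });
    $handler->start_element({ Name => $name, Attributes => {} });
    $handler->characters({ Data => $value });
    $handler->end_element({ Name => $name });
}
$handler->characters({ Data => "\n" });    # line break before </root>
$handler->end_element({ Name => 'root' });

print $handler->{out}, "\n";
```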

Simplifying Event Generation

As we have seen, generating SAX events is as simple as calling the appropriate method on the handler object from within our driver class. However, writing $obj->{Handler}->method_name($appropriate_hashref) for each event can be cumbersome and error-prone. Not only is it a lot of typing, it also requires intimate knowledge of the properties that each event expects, and we have to get those properties right each and every time. If we do not mind a little extra overhead, we can make life easier by creating wrapper methods within our driver class. These allow us to write our parse() method in a simpler, more Perlish way, while ensuring that the handler receives the data passed by the event in the format that it expects.

For our second and final example we will write a simple driver that produces an XML document from a genetic sequence record stored in the FASTA file format. We will use the Bio::SeqIO module from the bioperl project to read the sequence record, calling convenience methods in our driver class to generate the required SAX events to translate that record into an XML format. Again, we will use XML::Handler::XMLWriter as the default handler for our driver.

The main portion of our script is more or less identical to that of the previous example; we create new instances of the handler and driver classes and call the parse method on the driver object. This time, though, we pass the location of the FASTA file that we want to translate to XML as the sole argument to parse.

use strict;
use XML::Handler::XMLWriter;

my $sequence_file = 'files/seq1.fasta';

my $writer = XML::Handler::XMLWriter->new();
my $driver = SAXDriverFastaFile->new(Handler => $writer);

$driver->parse($sequence_file);

Now we begin our driver class.

package SAXDriverFastaFile;
# generate SAX1 events from a fasta sequence record.
use strict;
use Bio::SeqIO;
use vars qw($AUTOLOAD);

sub new {
    my ($proto, %args) = @_;
    my $class = ref($proto) || $proto;
    my $self = \%args;
    bless ($self, $class);
    return $self;
}

We have decided that we would rather pass simple structures to the event generators that are used most often (rather than the hash references that the SAX handler expects), so we will implement the start_element, end_element, and characters methods inside our driver class to accept these simpler arguments and forward the data to the handler in the expected format.

sub start_element {
    my $self = shift;
    my $element_name = shift;
    my %attributes = @_;

    $self->{Handler}->start_element({Name => $element_name,
                              Attributes => \%attributes});
}

sub end_element {
    my ($self, $element_name) = @_;
    $self->{Handler}->end_element({Name => $element_name});
    $self->newline;
}

sub characters {
    my ($self, $data) = @_;
    $self->{Handler}->characters({Data => $data});
}

With these methods in place we can now write:

$obj->start_element('element_name', (attr1 => 'some value', attr2 => 'some other value'))

rather than the more verbose:

$obj->{Handler}->start_element({Name => 'element_name',
                          Attributes => {attr1 => 'some value',
                                  attr2 => 'some other value'}
                                   })

In analyzing the task at hand, we notice that we often want to produce simple XML data elements in the format <name>value</name>. To make generating these events easier, we will add the following data_element method to our driver class which will allow us to produce these elements by calling $obj->data_element('name', 'value').

sub data_element {
    my ($self, $element_name, $data) = @_;

    $self->{Handler}->start_element({Name       => $element_name,
                                     Attributes => {}});

    $self->{Handler}->characters({Data => $data});

    $self->{Handler}->end_element({Name => $element_name});
    $self->newline;
}

Did you notice the call to the mysterious newline method? The handler for this driver does nothing more than present the SAX events as an XML document, and we have decided that the resulting document should have at least some sort of minimal formatting to make it easier to look at in a text editor. In this case, having each element on a separate line will suffice. Inserting newlines into the document is likely to be very common, so, rather than calling $obj->characters("\n") for every line break we have created the following newline method that does that for us.

sub newline {
    my $self = shift;
    $self->{Handler}->characters({Data => "\n"});
}

With the convenience methods out of the way, we have only to write the code that translates the FASTA record into XML. To keep things nice and tidy, we will break things up a bit. The parse method initializes the Bio::SeqIO object that processes the file passed from the main section of the script, starts the SAX event stream with the call to start_document, and opens the required top-level element (<fasta_sequence>). After looping over the gene sequences contained in the file and passing them off to the seq2sax1 method to handle the details, parse closes the root element and ends the event stream with a call to end_document. Note the calls to our newline method along the way to ensure that the document produced is in the proper format.

sub parse {
  my $self = shift;
  my $seq_file = shift;
  my $seq_in = Bio::SeqIO->new(-file => $seq_file, -format => 'fasta');
  $self->start_document();
  $self->start_element('fasta_sequence');
  $self->newline;
  while (my $seq = $seq_in->next_seq()) {
    $self->seq2sax1($seq->{primary_seq});
  }
  $self->end_element('fasta_sequence');
  $self->newline;
  $self->end_document();

}

The seq2sax1 method is very similar to the parse method from the earlier example. Each sequence is represented as a hash reference of key-value pairs and we need only loop over the elements of that hash, calling our various convenience methods as we go. Note that each sequence is wrapped in a <primary_seq> element to ensure that the resulting XML data reflects the information captured by Bio::SeqIO.

sub seq2sax1 {
  my ($self, $seq) = @_;
  my %attrs;
  $attrs{display_id} = $seq->{display_id};
  $attrs{primary_id} = $seq->{primary_id};
  $self->start_element('primary_seq', %attrs);
  $self->newline;
  while ( my ($name, $value) = each (%{$seq})) {
    next if $name =~ /_id$/; # display_id and primary_id are already attributes
    $self->data_element($name, $value);
  }
  $self->end_element('primary_seq');
}

Careful readers will have noticed that we call the start_document and end_document methods on our driver object even though the driver class does not implement these methods. This would normally cause Perl to die with an error about its inability to locate these object methods. We have kept the event generator interface localized to the driver class by using Perl's built-in AUTOLOAD subroutine to forward these methods to the handler for us.

# expensive, but handy
sub AUTOLOAD {
  my $self = shift;
  my $called_sub = $AUTOLOAD;
  $called_sub =~ s/.+:://; # snip pkg name...
  return if $called_sub eq 'DESTROY'; # never forward object destruction
  if (my $method = $self->{Handler}->can($called_sub)) {
    $method->($self->{Handler}, @_);
  }
  else {
    warn "Method '$called_sub' not implemented by handler $self->{Handler}\n";
  }
}
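If the cost of AUTOLOAD is a concern, one alternative (not the approach used in this article) is to install the forwarding methods once, when the class is loaded. The sketch below does this for start_document and end_document; the ForwardingDriver and CountingHandler classes are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

package ForwardingDriver;

sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}

# Install one forwarding method per event name instead of relying on AUTOLOAD.
for my $event (qw(start_document end_document)) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$event" } = sub {
        my $self   = shift;
        my $method = $self->{Handler}->can($event)
            or return;                     # handler does not implement this event
        return $self->{Handler}->$method(@_);
    };
}

# Hypothetical handler that just notes which events it was sent.
package CountingHandler;

sub new            { my $class = shift; return bless({ calls => [] }, $class) }
sub start_document { my $self = shift; push @{ $self->{calls} }, 'start_document'; return }
sub end_document   { my $self = shift; push @{ $self->{calls} }, 'end_document'; return 'done' }

package main;

my $handler = CountingHandler->new;
my $driver  = ForwardingDriver->new(Handler => $handler);

$driver->start_document;
my $result = $driver->end_document;

print join(', ', @{ $handler->{calls} }), " -> $result\n";
```

The trade-off is that the list of forwarded events has to be maintained by hand, whereas AUTOLOAD picks up anything the handler happens to implement.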

Below is an abbreviated snippet of the document produced by running this script on the sample FASTA record file that ships with the bioperl distribution. The complete file can be seen in this month's source code.

<?xml version="1.0"?>
<fasta_sequence>
<primary_seq display_id="gi|2981175" primary_id="gi|2981175">
<moltype>protein</moltype>
<desc>deltex</desc>
<seq>MSRPGHGGLMPVNGLGFPPQNVARVVVWECLNEHSRWR...</seq>
<_rootI_verbose>0</_rootI_verbose>
</primary_seq>
<primary_seq display_id="gi|927067" primary_id="gi|927067">
<moltype>protein</moltype>
<desc>longation factor 1-alpha 1</desc>
<seq>MQSERGITIDISLWKFETSKYYVTIIDAPGHRDFIQNM...</seq>
<_rootI_verbose>0</_rootI_verbose>
</primary_seq>
...
</fasta_sequence>

The result is a bit scary to look at, perhaps, but it is accurate. Most importantly, we did not have to reinvent the wheel to translate our data to an XML format.

Avoiding Common Traps

We have seen how easy it can be to produce SAX event streams (and, hence, XML documents) from non-XML data, but there are a few common gotchas to be aware of before you begin writing your own custom SAX drivers.

In the standard SAX model, where an XML parser is the application that fires the events, the parser is responsible for making sure that the incoming data meets the requirements of well-formed XML; that is, that the document contains a single root element, that each start tag has a corresponding end tag, that the tags are nested properly, and so on. When we remove the XML parser from the equation and call the methods of a SAX handler directly, there is no such safety net. It is entirely possible to write SAX drivers whose resulting documents would not meet XML's well-formedness requirements, and driver authors should take care to ensure that the event streams being produced actually meet those requirements.
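One way to get part of that safety net back during development is to place a checking handler between the driver and the real handler. The following is a minimal sketch under stated assumptions: NestingChecker is a hypothetical class, and it only checks element nesting and the single-root rule, not every well-formedness constraint (it does not, for example, validate element names or reject text outside the root element).

```perl
use strict;
use warnings;

# Hypothetical pass-through handler: checks nesting, then forwards each event.
package NestingChecker;

sub new {
    my ($class, %args) = @_;
    my $self = { Handler => $args{Handler}, stack => [], roots => 0 };
    return bless $self, $class;
}

sub start_document {
    my $self = shift;
    $self->{stack} = [];
    $self->{roots} = 0;
    return $self->_forward('start_document', @_);
}

sub start_element {
    my ($self, $el) = @_;
    if (!@{ $self->{stack} }) {
        die "more than one top-level element: <$el->{Name}>\n" if $self->{roots}++;
    }
    push @{ $self->{stack} }, $el->{Name};
    return $self->_forward('start_element', $el);
}

sub end_element {
    my ($self, $el) = @_;
    my $open = pop @{ $self->{stack} };
    die "unexpected end of <$el->{Name}>\n" unless defined $open;
    die "end of <$el->{Name}> while <$open> is still open\n" if $open ne $el->{Name};
    return $self->_forward('end_element', $el);
}

sub characters {
    my ($self, $ch) = @_;
    return $self->_forward('characters', $ch);
}

sub end_document {
    my $self = shift;
    die "unclosed element <$self->{stack}[-1]>\n" if @{ $self->{stack} };
    die "no root element\n" unless $self->{roots};
    return $self->_forward('end_document', @_);
}

# Pass the event to the downstream handler, if there is one and it cares.
sub _forward {
    my ($self, $event, @args) = @_;
    my $next   = $self->{Handler} or return;
    my $method = $next->can($event) or return;
    return $next->$method(@args);
}

package main;

my $checker = NestingChecker->new;    # no downstream handler: check only
my $ok = eval {
    $checker->start_document;
    $checker->start_element({ Name => 'a', Attributes => {} });
    $checker->start_element({ Name => 'b', Attributes => {} });
    $checker->end_element({ Name => 'a' });    # wrong: <b> is still open
    1;
};
my $error = $ok ? '' : $@;
print "caught: $error";
```

In real use the checker would be constructed with Handler => $writer and passed to the driver in the writer's place, so that a broken event stream fails early instead of producing a broken document.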

Similarly, driver developers need to make sure that any characters that XML treats as special are replaced by their corresponding entities or wrapped in CDATA sections before passing the data on to a characters handler. Specifically, the characters &, <, >, and ' should be replaced by &amp;, &lt;, &gt;, and &apos;, respectively (or wrapped in a CDATA section).
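A helper along these lines can do the replacement. It is a sketch that assumes the handler writes the Data property out verbatim; escape_xml_text is a hypothetical name, and if the writer you use already escapes character data itself, applying this as well would escape the text twice.

```perl
use strict;
use warnings;

# The four replacements named above.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    "'" => '&apos;',
);

# Hypothetical helper: escape text before handing it to a characters handler.
# A single pass over the string means the "&" in each replacement is never
# escaped a second time.
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/([&<>'])/$entity{$1}/g;
    return $text;
}

my $escaped = escape_xml_text("Tom & Jerry's <show>");
print "$escaped\n";
```

Inside a driver this would typically be called from the characters wrapper, for example $self->{Handler}->characters({Data => escape_xml_text($data)}).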

The easiest way to ensure that the event streams produced by a given driver are legitimate XML is to hook a writer handler to the driver (as in the examples above), save the results to a file, and attempt to parse the resulting document using your favorite XML parser. For example, if you have the Gnome Project's libxml2 installed, you can check the resulting output by typing xmllint --noout myfile.xml, or, using XML::Parser, perl -MXML::Parser -e 'XML::Parser->new(ErrorContext => 2)->parsefile(q(myfile.xml))' at the command line. In both cases, if the parser does not complain, then you know that your driver is producing well-formed XML.

Conclusions

I'm certain that there are XML purists out there for whom this technique -- using a non-XML class to produce SAX event streams -- will seem like heresy. Indeed, you do need to be a bit more careful when letting your own custom module stand in for an XML parser (for the reasons stated above), but, in my opinion, the benefits far outweigh the costs. Writing custom SAX drivers provides a predictable, memory-efficient, and easy way to take advantage of Perl's advanced built-in data handling capabilities and its vast collection of non-XML parsers and other data interfaces to create XML document streams. Could bypassing an XML parser and calling a SAX handler's methods directly be considered a hack? Perhaps. If so, it is a darn good and useful one.

If you are intrigued by the notion of custom SAX drivers, or think that you may have a place for them in your work, I strongly encourage you to have a look at the code for Ilya Sterin's XML::SAXDriver::Excel and XML::SAXDriver::CSV as well as Matt Sergeant's XML::Generator::DBI, and Petr Cimprich's XML::Directory::SAXGenerator for ideas.

Be sure to tune in next month for part two of our advanced SAX series where we will learn how to write our own custom SAX filters.

Resources