
Writing SAX Drivers for Non-XML Data
In a previous column, we covered the basics of the Simple API for XML (SAX) and the modules that implement that interface in Perl. Over the course of the next two months we will move beyond these basic topics to look at two slightly more advanced ones: creating drivers that generate SAX events from non-XML sources and writing custom SAX filters. If you are not familiar with the way SAX works, please read High-Performance XML Parsing With SAX before proceeding.
What Is A SAX Driver, And Why Would You Want One?
SAX is an event-driven API in which the contents of an XML document are accessed through callback subroutines that fire based on various XML parsing events (the beginning of an element, the end of an element, character data, etc.). For the purpose of this article, a SAX driver (sometimes called a SAX generator) can be understood to mean any Perl class that can generate these SAX events.
In the most common case, a SAX driver acts as a proxy between an XML parser and the one or more handler classes written by the developer. The handler methods detailed in the SAX API are called as the parser makes its way through the document, thereby providing access to the contents of that XML document. In fact, this is precisely what SAX was designed for: to provide a simple means to access information stored in XML. As we will see, however, it is often handy to be able to generate these events from data sources other than XML documents.
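To make the idea concrete, here is a minimal sketch of the handler side of that relationship. EventLogger is a hypothetical class invented for this illustration (it is not part of any SAX distribution), and the hash-reference arguments follow the Perl SAX1 convention described later in this article. Whatever code makes these method calls is, in effect, playing the role of the driver.

```perl
use strict;
use warnings;

# EventLogger is a hypothetical handler: it records the name of each SAX
# event it receives, plus the element name or text where relevant.
{
    package EventLogger;

    sub new {
        my $class = shift;
        return bless { events => [] }, $class;
    }

    sub start_document { push @{ $_[0]{events} }, 'start_document' }
    sub end_document   { push @{ $_[0]{events} }, 'end_document' }

    sub start_element {
        my ($self, $el) = @_;
        push @{ $self->{events} }, "start_element:$el->{Name}";
    }

    sub end_element {
        my ($self, $el) = @_;
        push @{ $self->{events} }, "end_element:$el->{Name}";
    }

    sub characters {
        my ($self, $chars) = @_;
        push @{ $self->{events} }, "characters:$chars->{Data}";
    }
}

# The calls below are what a driver does: fire events at the handler.
my $handler = EventLogger->new;
$handler->start_document();
$handler->start_element({ Name => 'greeting', Attributes => {} });
$handler->characters({ Data => 'hello' });
$handler->end_element({ Name => 'greeting' });
$handler->end_document();

print "$_\n" for @{ $handler->{events} };
```

The handler neither knows nor cares whether the events came from an XML parser or from hand-written code, which is what makes non-XML drivers possible.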
A Simple Example: Dumping A Perl Hash As An XML Document
Before we look at our first example, it's important to note that a
SAX driver without a handler that receives the generated events and
does something with the data passed is useless. While the
basics of writing SAX handlers are quite easy to grasp, the handlers
themselves can sometimes be quite complex. Our focus here is on
generating events, not handling them; so, for simplicity's sake, we
will use Ken MacLeod's XML::Handler::XMLWriter (which
takes a SAX event stream and prints it to STDOUT as an XML
document) as the default handler throughout this article.
To show how to write a SAX driver we will create a simple inlined class that translates a typical Perl hash into a well-formed XML document where the keys are the element names and the values are the character data contained by those elements.
use strict;
use XML::Handler::XMLWriter;
The main portion of the script consists of nothing
more than initialization of the new handler and driver objects and the
call to the driver's parse method. We use the
parse method to kick off the SAX event stream, passing
the hash we wish to dump to XML as the sole argument. In this case we
will use Perl's venerable built-in system environment dictionary
%ENV.
my $writer = XML::Handler::XMLWriter->new();
my $driver = SAXDriverHash2XML->new(Handler => $writer);
$driver->parse(%ENV);
Next we create our driver class beginning with a typical constructor method.
package SAXDriverHash2XML;
# generate SAX1 events from a simple Perl HASH.
use strict;
# standard constructor
sub new {
    my ($proto, %args) = @_;
    my $class = ref($proto) || $proto;
    my $self = \%args;
    bless ($self, $class);
    return $self;
}
Finally we get to the substantial part of our driver, the
parse method.
# generate the events
sub parse {
    my $self = shift;
    my %passed_hash = @_;
After slurping the sole argument into the local
%passed_hash, we begin firing off the necessary SAX
events to create our XML document. Recall that we passed a blessed
instance of XML::Handler::XMLWriter as the default
handler for our driver. Generating the SAX events is as simple as
calling the appropriate handler methods on that object and passing the
data through as arguments in the format that the handler expects. This
is the essence of writing a custom SAX driver.
We begin the SAX event stream by calling the required
start_document handler.
    $self->{Handler}->start_document();
Now a Perl hash is a list of key-value pairs; but for our XML document to be well-formed, it must have a single top-level element. To meet the well-formedness requirement, we will add a top-level wrapper element named "root".
Pay special attention to the arguments we pass to the
start_element handler. Perl SAX1 implementations expect a
hash reference of named properties where the Name
property is a string containing the element's name, and the
Attributes property is another hash reference that
contains the XML attributes attached to that element (empty in this
case).
    $self->{Handler}->start_element({Name => 'root', Attributes => {}});
Next we loop over the elements of %passed_hash using
the each function. For each entry, we fire
start_element, characters, and
end_element events on the handler. Note that the
argument to the characters handler is a hash reference with a
single property (Data) that contains the character data
that will become the text content of the surrounding element.
    while (my ($name, $value) = each(%passed_hash)) {
        $name = lc($name); # we like lower-case tag names
        $self->{Handler}->start_element({Name => $name, Attributes => {}});
        $self->{Handler}->characters({Data => $value});
        $self->{Handler}->end_element({Name => $name});
    }
Finally we call the end_element event on the "root"
wrapper element, followed by the end_document handler
which signals the handler class that the "parse" is complete.
    $self->{Handler}->end_element({Name => 'root'});
    $self->{Handler}->end_document();
}
Running this script on my machine yields an XML document that is similar to the following:
<?xml version="1.0"?>
<root>
<bash_env>/home/kip/.bashrc</bash_env>
<ostype>linux</ostype>
<histsize>1000</histsize>
<hostname>duchamp.hampton.ws</hostname>
<user>kip</user>
<hosttype>i386</hosttype>
<home>/home/kip</home>
<term>linux</term>
<logname>kip</logname>
<path>/usr/local/bin:/bin:/usr/bin:/usr/x11r6/bin:
/home/kip/bin</path>
<shell>/bin/bash</shell>
<mail>/var/spool/mail/kip</mail>
<lang>en_us</lang>
</root>
Note that I said that the output of this script is similar
to the snippet above. Actually, the resulting document puts all the
elements and data on a single line. Remember, XML elements can have
mixed content (containing both child elements and text) so
all character data is important. The spaces and newline
characters added to this example to make it more readable here are, in
truth, text data contained by the "root" element and would have to be
explicitly added via calls to the characters handler to
produce an exact match.
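As a minimal sketch of what those extra calls look like, the following fires a newline through the characters handler after the opening "root" tag and after each child element. TinyWriter is a hypothetical stand-in for a writer handler, invented here so the example runs without any non-core modules; it ignores attributes and does no escaping. The keys are sorted only to make the output repeatable.

```perl
use strict;
use warnings;

# TinyWriter is a hypothetical stand-in for a writer handler: it appends
# each event to a string instead of printing it. It ignores attributes
# and performs no escaping.
{
    package TinyWriter;
    sub new            { my $class = shift; return bless { out => '' }, $class }
    sub start_document { }
    sub end_document   { }
    sub start_element  { $_[0]{out} .= "<$_[1]{Name}>" }
    sub end_element    { $_[0]{out} .= "</$_[1]{Name}>" }
    sub characters     { $_[0]{out} .= $_[1]{Data} }
}

my %data   = (user => 'kip', shell => '/bin/bash');
my $writer = TinyWriter->new;

$writer->start_document();
$writer->start_element({ Name => 'root', Attributes => {} });
$writer->characters({ Data => "\n" });        # line break after <root>

for my $name (sort keys %data) {              # sorted for repeatable output
    $writer->start_element({ Name => $name, Attributes => {} });
    $writer->characters({ Data => $data{$name} });
    $writer->end_element({ Name => $name });
    $writer->characters({ Data => "\n" });    # line break after each child
}

$writer->end_element({ Name => 'root' });
$writer->end_document();

print $writer->{out}, "\n";
```

Each newline here is ordinary character data belonging to the "root" element, exactly like the text inside the child elements.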
Simplifying Event Generation
As we have seen, generating SAX events is as simple as calling the
appropriate method on the handler object from within our driver class.
However, writing
$obj->{Handler}->method_name($appropriate_hashref)
for each event can be cumbersome and error-prone. Not only is it a lot
of typing, it also requires intimate knowledge of the properties that
each event expects, and we have to get those properties right each and
every time. If we do not mind a little extra overhead, we can make
life a little easier by creating wrapper methods within our driver
class that allow us to write our parse() method in a
simpler, more Perlish way, while ensuring that the handler receives the
data passed by the event in the format that it expects.
For our second and final example we will write a simple driver
that produces an XML document from a genetic sequence record stored in
the FASTA file format. We will use the Bio::SeqIO module
from the bioperl project to read the
sequence record, calling convenience methods in our driver class to
generate the required SAX events to translate that record into an XML
format. Again, we will use XML::Handler::XMLWriter as the
default handler for our driver.
The main portion of our script is more or less
identical to that of the previous example; we create new instances of
the handler and driver classes and call the parse method
on the driver object. This time, though, we pass the location of the
FASTA file that we want to translate to XML as the sole argument to
parse.
use strict;
use XML::Handler::XMLWriter;
my $sequence_file = 'files/seq1.fasta';
my $writer = XML::Handler::XMLWriter->new();
my $driver = SAXDriverFastaFile->new(Handler => $writer);
$driver->parse($sequence_file);
Now we begin our driver class.
package SAXDriverFastaFile;
# generate SAX1 events from a fasta sequence record.
use strict;
use Bio::SeqIO;
use vars qw($AUTOLOAD);
sub new {
    my ($proto, %args) = @_;
    my $class = ref($proto) || $proto;
    my $self = \%args;
    bless ($self, $class);
    return $self;
}
We have decided that we would rather pass simple structures to the
event generators that are used most often (rather than the hash
references that the SAX handler expects), so we will implement the
start_element, end_element, and
characters methods inside our driver class to accept
these simpler arguments and forward the data to the handler in the
expected format.
sub start_element {
    my $self = shift;
    my $element_name = shift;
    my %attributes = @_;
    $self->{Handler}->start_element({Name => $element_name,
                                     Attributes => \%attributes});
}
sub end_element {
    my ($self, $element_name) = @_;
    $self->{Handler}->end_element({Name => $element_name});
    $self->newline;
}
sub characters {
    my ($self, $data) = @_;
    $self->{Handler}->characters({Data => $data});
}
With these methods in place we can now write:
$obj->start_element('element_name', (attr1 => 'some value', attr2 => 'some other value'))
rather than the more verbose:
$obj->{Handler}->start_element({Name => 'element_name',
                                Attributes => {attr1 => 'some value',
                                               attr2 => 'some other value'}
                               })
In analyzing the task at hand, we notice that we often want to
produce simple XML data elements in the format
<name>value</name>. To make generating these
events easier, we will add the following data_element
method to our driver class which will allow us to produce these
elements by calling $obj->data_element('name',
'value').
sub data_element {
    my ($self, $element_name, $data) = @_;
    $self->{Handler}->start_element({Name => $element_name,
                                     Attributes => {}});
    $self->{Handler}->characters({Data => $data});
    $self->{Handler}->end_element({Name => $element_name});
    $self->newline;
}
Did you notice the call to the mysterious newline
method? The handler for this driver does nothing more than present
the SAX events as an XML document, and we have decided that the
resulting document should have at least some sort of minimal
formatting to make it easier to look at in a text editor. In this
case, having each element on a separate line will suffice. Inserting
newlines into the document is likely to be very common, so, rather
than calling $obj->characters("\n") for every line break
we have created the following newline method that does
that for us.
sub newline {
    my $self = shift;
    $self->{Handler}->characters({Data => "\n"});
}
With the convenience methods out of the way, we have only to write
the code that translates the FASTA record into XML. To keep things
nice and tidy, we will break things up a bit. The parse
method initializes the Bio::SeqIO object that processes
the file passed from the main section of the script,
starts the SAX event stream with the call to
start_document, and opens the required top-level element
(<fasta_sequence>). After looping over the gene
sequences contained in the file, and passing them off to the
seq2sax1 method to handle the details, parse
then closes the root element and ends the event stream with a call to
end_document. Note the calls to our newline
method along the way to ensure that the document produced is in the
proper format.
sub parse {
    my $self = shift;
    my $seq_file = shift;
    my $seq_in = Bio::SeqIO->new(-file => $seq_file, -format => 'fasta');
    $self->start_document();
    $self->start_element('fasta_sequence');
    $self->newline;
    while (my $seq = $seq_in->next_seq()) {
        $self->seq2sax1($seq->{primary_seq});
    }
    $self->end_element('fasta_sequence');
    $self->newline;
    $self->end_document();
}
The seq2sax1 method is very similar to the
parse method from the earlier example. Each sequence is
represented as a hash reference of key-value pairs and we need only
loop over the elements of that hash, calling our various convenience
methods as we go. Note that each sequence is wrapped in a
<primary_seq> element to ensure that the resulting
XML data reflects the information captured by
Bio::SeqIO.
sub seq2sax1 {
    my ($self, $seq) = @_;
    my %attrs;
    $attrs{display_id} = $seq->{display_id};
    $attrs{primary_id} = $seq->{primary_id};
    $self->start_element('primary_seq', %attrs);
    $self->newline;
    while ( my ($name, $value) = each (%{$seq})) {
        next if $name =~ /_id$/; # display_id and primary_id are already attributes
        $self->data_element($name, $value);
    }
    $self->end_element('primary_seq');
}
Careful readers will have noticed that we call the
start_document and end_document methods on
our driver object but the driver class does not implement these
methods. This would normally cause Perl to die with an error about its
inability to locate these object methods. To keep the event
generator interface localized to the driver class, we use Perl's
built-in AUTOLOAD mechanism to forward these methods to
the handler for us.
# expensive, but handy
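sub AUTOLOAD {
    my $self = shift;
    my $called_sub = $AUTOLOAD;
    $called_sub =~ s/.+:://; # snip pkg name...
    # Perl also routes DESTROY through AUTOLOAD when the driver object is
    # destroyed; skip it so we neither forward it nor warn about it.
    return if $called_sub eq 'DESTROY';
    if (my $method = $self->{Handler}->can($called_sub)) {
        $method->($self->{Handler}, @_);
    }
    else {
        warn "Method '$called_sub' not implemented by handler $self->{Handler}\n";
    }
}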
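If the cost of AUTOLOAD is a concern, one alternative (a different technique from the one used in this article) is to install a small forwarding method for each event name once, when the class is loaded. The sketch below assumes two hypothetical classes invented for the illustration, EventCounter and ForwardingDriver, and forwards only start_document and end_document; the list of names would need to be extended for any other events the driver should pass through.

```perl
use strict;
use warnings;

# EventCounter is a hypothetical handler that counts document events.
{
    package EventCounter;
    sub new {
        my $class = shift;
        return bless { started => 0, ended => 0 }, $class;
    }
    sub start_document { $_[0]{started}++ }
    sub end_document   { $_[0]{ended}++ }
}

# ForwardingDriver is a hypothetical driver class. Instead of AUTOLOAD, it
# installs one forwarding method per event name when this block runs.
{
    package ForwardingDriver;

    sub new {
        my ($class, %args) = @_;
        return bless {%args}, $class;
    }

    for my $event (qw(start_document end_document)) {
        my $glob = __PACKAGE__ . "::$event";
        no strict 'refs';                  # needed to assign to a named glob
        *{$glob} = sub {
            my $self = shift;
            return $self->{Handler}->$event(@_);
        };
    }
}

my $counter = EventCounter->new;
my $driver  = ForwardingDriver->new(Handler => $counter);
$driver->start_document();
$driver->end_document();

print "started=$counter->{started} ended=$counter->{ended}\n";
```

The trade-off is that a misspelled event name now fails with Perl's usual "Can't locate object method" error instead of a warning, and the driver has to list the events it forwards.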
Below is an abbreviated snippet of the document produced by running this script on the sample FASTA record file that ships with the bioperl distribution. The complete file can be seen in this month's source code.
<?xml version="1.0"?>
<fasta_sequence>
<primary_seq display_id="gi|2981175" primary_id="gi|2981175">
<moltype>protein</moltype>
<desc>deltex</desc>
<seq>MSRPGHGGLMPVNGLGFPPQNVARVVVWECLNEHSRWR...</seq>
<_rootI_verbose>0</_rootI_verbose>
</primary_seq>
<primary_seq display_id="gi|927067" primary_id="gi|927067">
<moltype>protein</moltype>
<desc>longation factor 1-alpha 1</desc>
<seq>MQSERGITIDISLWKFETSKYYVTIIDAPGHRDFIQNM...</seq>
<_rootI_verbose>0</_rootI_verbose>
</primary_seq>
...
</fasta_sequence>
The result is a bit scary to look at, perhaps, but it is accurate. Most importantly, we did not have to reinvent the wheel to translate our data to an XML format.
Avoiding Common Traps
We have seen how easy it can be to produce SAX event streams (and, hence, XML documents) from non-XML data, but there are a few common gotchas to be aware of before you begin writing your own custom SAX drivers.
In the standard SAX model, where an XML parser is the application that fires the events, the parser is responsible for making sure that the incoming data meets the requirements of well-formed XML; that is, that the document contains a single root element, that each start tag has a corresponding end tag, that the tags are nested properly, and so on. When we remove the XML parser from the equation and call the methods of a SAX handler directly, there is no such safety net. It is entirely possible to write SAX drivers whose resulting documents would not meet XML's well-formedness requirements, and driver authors should take care to ensure that the event streams being produced actually meet those requirements.
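One way to catch nesting mistakes during development is to point the driver at a handler that does nothing but check the structure of the event stream. The following is a minimal sketch under that assumption: NestingChecker is a hypothetical class written for this illustration, and it verifies only that start and end events are balanced and that there is a single top-level element. It does not check element names, attributes, or character data, so it is no substitute for a real parser.

```perl
use strict;
use warnings;

# NestingChecker is a hypothetical handler that verifies element nesting:
# every end_element must match the most recent unclosed start_element,
# and there must be exactly one top-level element.
{
    package NestingChecker;

    sub new {
        my $class = shift;
        return bless { stack => [], roots => 0 }, $class;
    }

    sub start_document { }
    sub characters     { }

    sub start_element {
        my ($self, $el) = @_;
        if (!@{ $self->{stack} }) {
            die "more than one top-level element: <$el->{Name}>\n"
                if $self->{roots}++;
        }
        push @{ $self->{stack} }, $el->{Name};
    }

    sub end_element {
        my ($self, $el) = @_;
        my $open = pop @{ $self->{stack} };
        die "unexpected </$el->{Name}>\n" unless defined $open;
        die "</$el->{Name}> closes <$open>\n" unless $open eq $el->{Name};
    }

    sub end_document {
        my $self = shift;
        die "unclosed element <$self->{stack}[-1]>\n" if @{ $self->{stack} };
        die "no top-level element\n" unless $self->{roots};
    }
}

# A deliberately broken event stream: <a><b></a>
my $checker = NestingChecker->new;
my $ok = eval {
    $checker->start_document();
    $checker->start_element({ Name => 'a', Attributes => {} });
    $checker->start_element({ Name => 'b', Attributes => {} });
    $checker->end_element({ Name => 'a' });
    $checker->end_document();
    1;
};
my $err = $ok ? '' : $@;

print $ok ? "event stream looks balanced\n" : "problem: $err";
```

Passing such a checker as the Handler argument to a driver under test turns a silent nesting error into an immediate failure.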
Similarly, driver developers need to make sure that any characters
that XML treats as special are replaced by their corresponding
entities or wrapped in CDATA sections before passing the data on to a
characters handler. Specifically, the characters &,
<, >, and ' should be replaced by &amp;, &lt;, &gt;,
and &apos; respectively (or wrapped in CDATA sections).
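A replacement like this can be sketched in a few lines. The helper name escape_xml is hypothetical (it is not part of any SAX module), and the double quote is included here as an assumption, since it is the remaining predefined XML entity. If the handler in use already escapes character data when it writes output, applying this in the driver as well would escape the text twice, so check the handler's behavior first.

```perl
use strict;
use warnings;

# escape_xml is a hypothetical helper name, not part of any SAX module.
my %entity_for = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    "'" => '&apos;',
    '"' => '&quot;',
);

sub escape_xml {
    my ($text) = @_;
    # A single pass over the text, so the '&' in a replacement
    # entity is never escaped a second time.
    $text =~ s/([&<>'"])/$entity_for{$1}/g;
    return $text;
}

print escape_xml(q{AT&T <said> 'hi'}), "\n";
```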
The easiest way to ensure that the event streams produced by a
given driver are legitimate XML is to hook a writer handler to the
driver (as in the examples above), save the results to a file, and
attempt to parse the resulting document using your favorite XML
parser. For example, if you have the Gnome Project's
libxml2 installed, you can check the resulting output by
typing xmllint --noout myfile.xml, or, using
XML::Parser, perl -MXML::Parser -e
'XML::Parser->new(ErrorContext => 2)->parsefile(q(myfile.xml))'
at the command line. In both cases, if the parser does not complain, then
you know that your driver is producing well-formed XML.
Conclusions
I'm certain that there are XML purists out there for whom this technique (using a non-XML class to produce SAX event streams) will seem like heresy. Indeed, you do need to be a bit more careful when letting your own custom module stand in for an XML parser (for the reasons stated above), but, in my opinion, the benefits far outweigh the costs. Writing custom SAX drivers provides a predictable, memory-efficient, and easy way to take advantage of Perl's advanced built-in data handling capabilities and vast collection of non-XML parsers and other data interfaces to create XML document streams. Could bypassing an XML parser and calling a SAX handler's methods directly be considered a hack? Perhaps. If so, it is a darn good and useful one.
If you are intrigued by the notion of custom SAX drivers, or think that
you may have a place for them in your work, I strongly encourage you to have
a look at the code for Ilya Sterin's XML::SAXDriver::Excel and
XML::SAXDriver::CSV as well as Matt Sergeant's
XML::Generator::DBI, and Petr Cimprich's
XML::Directory::SAXGenerator for ideas.
Be sure to tune in next month for part two of our advanced SAX series where we will learn how to write our own custom SAX filters.
Resources
- Download the sample code.
- Perl XML Quickstart: The Standard XML Interfaces
- High-Performance XML Parsing With SAX
- David Megginson's SAX Pages