by Kip Hampton
Last month we began our exploration of more advanced SAX topics with a look at how SAX events can be generated from non-XML data. This month, we conclude the series with a short introduction to SAX filters and how they can be used to transform XML data.
A SAX filter is simply a class that is passed as the event handler to another class that generates SAX events, and that then forwards all or some of those events on to the next handler (or filter) in the processing chain. A filter may prune the document tree by not forwarding events for elements with a given name (or that meet some other condition), while in other cases a filter might generate its own new events to add parent or child elements to certain elements in the existing document stream. Element attributes can also be added or removed, or the character data altered in some way. Really, any class that is able to receive SAX events and then call event methods on another SAX handler in a way that alters the document stream can be defined as a SAX filter.
In practice, SAX filters may be thought of as the conceptual cousins of many of the standard UNIX tools. By themselves, these tools often perform only a single, simple task, but when piped together they are capable of astonishing feats. In the same way, the real power of SAX filters is derived from the fact that simpler, easy-to-maintain filters may be chained together to produce complex XML data transformations.
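To make the pipe analogy concrete, here is a minimal sketch that needs no parser at all. The three packages (`Collector`, `UpcaseFilter`, `SquashFilter`) are hypothetical names invented for this illustration, and a single hand-fired `characters` event stands in for a real event generator.

```perl
use strict;
use warnings;

# Hypothetical end-of-chain handler: collects the text it receives.
package Collector;
sub new {
    my ($class) = @_;
    my $self = { Text => '' };
    return bless $self, $class;
}
sub characters {
    my ($self, $chars) = @_;
    $self->{Text} .= $chars->{Data};
    return;
}

# Hypothetical filter: upper-cases character data, then forwards it.
package UpcaseFilter;
sub new {
    my ($class, %options) = @_;
    my $self = { %options };
    return bless $self, $class;
}
sub characters {
    my ($self, $chars) = @_;
    $self->{Handler}->characters({ Data => uc $chars->{Data} });
    return;
}

# Hypothetical filter: collapses runs of whitespace, then forwards the text.
package SquashFilter;
sub new {
    my ($class, %options) = @_;
    my $self = { %options };
    return bless $self, $class;
}
sub characters {
    my ($self, $chars) = @_;
    (my $text = $chars->{Data}) =~ s/\s+/ /g;
    $self->{Handler}->characters({ Data => $text });
    return;
}

package main;

# Build the chain back to front: the last handler is created first.
my $collector = Collector->new;
my $upcase    = UpcaseFilter->new(Handler => $collector);
my $squash    = SquashFilter->new(Handler => $upcase);

# Stand in for a parser by firing one event at the head of the chain.
$squash->characters({ Data => "simple   tools,\n piped together" });
print $collector->{Text}, "\n";
```

Each filter does one small job and knows nothing about its neighbours beyond the fact that its `Handler` accepts the same event.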
For our first example we will create a simple SAX filter that transforms the character data passed from XML::Parser::PerlSAX, then hands it on to Michael Koehne's XML::Handler::YAWriter to produce the final XML document.
    use strict;
    use XML::Parser::PerlSAX;
    use XML::Handler::YAWriter;
    use IO::File;

    my $file = $ARGV[0] || die "Please pass a file name to process\n";
With the necessary modules included, we get to the section that reveals exactly how SAX filters work. Notice that we create a new instance of XML::Handler::YAWriter, then pass that object as the Handler for our custom filter, the instance of which is in turn passed as the Handler to XML::Parser::PerlSAX. When the script is executed, the parser will call its SAX events on the methods in our FilterPorcus class which, in turn, will call the event methods on the writer class to print the result to STDOUT.
Note that when defining event chains, the classes are created in reverse order, with the first handler being the last class that is actually called. This may seem a bit confusing at first but with a little practice, you will get the hang of it.
    my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));
    my $filter = FilterPorcus->new(Handler => $writer);
    my $parser = XML::Parser::PerlSAX->new(Handler => $filter);

    my %parser_args = (Source => {SystemId => $file});
    $parser->parse(%parser_args);

    # end main
Next we create our custom filter class as an inline Perl package. Pay special attention to the fact that our class inherits from Matt Sergeant's XML::Filter::Base class. This allows us to implement only those handler methods that are relevant to our filter, since XML::Filter::Base by default automatically forwards all SAX events to the next handler class in the chain. If our class were not a subclass of XML::Filter::Base, we would have to explicitly forward each and every event that the previous class could potentially generate.
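To see what such a base class saves us, here is a rough sketch of the idea in plain Perl. `ForwardingBase` is a hypothetical stand-in written for this illustration (it is not the source of XML::Filter::Base), and it forwards only five event names; `ShoutFilter` and `Recorder` are likewise made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a forwarding base class.
package ForwardingBase;
sub new {
    my ($class, %options) = @_;
    my $self = { %options };
    return bless $self, $class;
}
# Generate a pass-through method for each event name we care about.
for my $event (qw(start_document end_document start_element end_element characters)) {
    no strict 'refs';
    *{"ForwardingBase::$event"} = sub {
        my ($self, $data) = @_;
        return $self->{Handler}->$event($data);
    };
}

# A subclass overrides only the event it wants to change.
package ShoutFilter;
our @ISA = ('ForwardingBase');
sub characters {
    my ($self, $chars) = @_;
    return $self->{Handler}->characters({ Data => uc $chars->{Data} });
}

# Hypothetical final handler that records the events it sees.
package Recorder;
sub new {
    my ($class) = @_;
    my $self = { Log => [] };
    return bless $self, $class;
}
sub start_element { my ($self, $el) = @_; push @{ $self->{Log} }, "<$el->{Name}>";  return }
sub end_element   { my ($self, $el) = @_; push @{ $self->{Log} }, "</$el->{Name}>"; return }
sub characters    { my ($self, $c)  = @_; push @{ $self->{Log} }, $c->{Data};       return }

package main;

my $recorder = Recorder->new;
my $filter   = ShoutFilter->new(Handler => $recorder);

# start_element and end_element pass through untouched; characters is altered.
$filter->start_element({ Name => 'p' });
$filter->characters({ Data => 'hello' });
$filter->end_element({ Name => 'p' });
print join('', @{ $recorder->{Log} }), "\n";
```

Without the base class, `ShoutFilter` would need its own `start_element` and `end_element` methods just to keep those events from being silently lost.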
    # silly text transformer
    package FilterPorcus;
    use strict;
    use base qw(XML::Filter::Base);

    sub new {
        my $class = shift;
        my %options = @_;
        return bless \%options, $class;
    }
Our filter is only interested in transforming the text nodes of the input document, so we will only implement the characters method. After passing the character data to the local porcus subroutine for transformation, we forward the result to the next handler by calling the characters event on that handler.
    sub characters {
        my ($self, $chars) = @_;
        my $out = $self->porcus($chars->{Data});
        $self->{Handler}->characters({Data => $out});
    }
Finally we get to the porcus method that returns the string passed to it transformed into the desired format using a little regular expression voodoo.
    sub porcus {
        my ($self, $chars) = @_;
        $chars =~ tr/A-Z/a-z/;
        $chars =~ s/\b([aeiou])/w$1/g;
        my $cons = q{[bcfghjklmnpqrstvwxz]};
        $chars =~ s/\b(qu|$cons($cons$cons?)?|[a-z])([a-z]*)/$3$1ay/g;
        return $chars;
    }
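For readers who want to experiment with the regular expressions outside of a SAX chain, the same three substitutions can be tried as a plain function. This is only a sketch: the function takes the string as its sole argument (no `$self`), the comments are our reading of each step, and the sample phrase is the opening of the passage used as input in this article.

```perl
use strict;
use warnings;

# Same transformation as the porcus method, as a plain function.
sub porcus {
    my ($chars) = @_;
    $chars =~ tr/A-Z/a-z/;            # lower-case everything
    $chars =~ s/\b([aeiou])/w$1/g;    # prefix vowel-initial words with "w"
    my $cons = q{[bcfghjklmnpqrstvwxz]};
    # Move "qu", a leading cluster of up to three of these consonants,
    # or a single letter to the end of the word, then append "ay".
    $chars =~ s/\b(qu|$cons($cons$cons?)?|[a-z])([a-z]*)/$3$1ay/g;
    return $chars;
}

print porcus('To me, one of the most'), "\n";
# prints "otay emay, oneway ofway ethay ostmay"
```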
Feeding this script a snippet of Larry Wall's latest Perl 6 Apocalypse produces the following result:
    <html>
      <body>
        <p>
          otay emay, oneway ofway ethay ostmay agonizingway aspectsway ofway anguage lay
          esignday isway omingcay upway ithway away usefulway ystemsay ofway operatorsway.
          otay otherway anguagelay esignersday, isthay aymay eemsay ikelay away illysay
          ingthay otay agonizeway overway. afterway allway, ouyay ancay iewvay allway
          operatorsway asway eremay yntacticsay ugarsay -- operatorsway areway ustjay
          unnyfay ookinglay unctionfay allscay.
        </p>
      </body>
    </html>
Okay, the result is admittedly pretty silly -- there may even be those who would argue that converting Uncle Larry's prose to pig latin is a bit redundant -- but the script does illustrate the basics of creating a simple SAX filter.
If we also wanted to transform the element names and the attribute names and values in addition to the text data, we need only add the following start_element and end_element handlers.
    sub start_element {
        my ($self, $element) = @_;
        # Build a fresh hash rather than adding keys to the one being
        # walked, since inserting into a hash during each() is unsafe.
        my %attrs;
        while ( my ($name, $value) = each %{$element->{Attributes}} ) {
            $attrs{ $self->porcus($name) } = $self->porcus($value);
        }
        $element->{Attributes} = \%attrs;
        $element->{Name} = $self->porcus($element->{Name});
        $self->{Handler}->start_element($element);
    }

    sub end_element {
        my ($self, $element) = @_;
        $element->{Name} = $self->porcus($element->{Name});
        $self->{Handler}->end_element($element);
    }
Again, the principles are the same: accept events, alter the data, then forward that altered data by calling events on the filter's designated handler.
Enough silliness, let's look at a more practical example.
For our final example, we will demonstrate how a SAX filter can be used to alter the structure of an XML document by creating a filter that partially implements the current version of the W3C's XInclude working draft.
XInclude suggests a compact DTD- and Schema-agnostic way to include external XML documents (or document fragments) into the current document being processed. For example:
    <?xml version="1.0"?>
    <article xmlns="http://localhost/myns"
             xmlns:xi="http://www.w3.org/2001/XInclude">
      <para>
        All brontosauruses are thin at one end, much much thicker
        in the middle, and then thin again at the far end.
      </para>
      <xi:include href="disclaimer.xml"/>
    </article>
would signal an XInclude-aware processor to include the contents of the file disclaimer.xml in the current document, between the end tag of the para element and the end tag of the top-level article element.
And speaking of disclaimers, it should be pointed out that our implementation here by no means covers the requirements of the full XInclude draft in that it will only allow inclusion of complete documents from the local file system -- XInclude itself is far more flexible and robust. Our goal here is merely to demonstrate the principles of writing SAX filters.
    use strict;
    use XML::Parser::PerlSAX;
    use XML::Filter::SAX2toSAX1;
    use XML::Filter::SAX1toSAX2;
    use XML::Handler::YAWriter;
    use IO::File;

    my $file = $ARGV[0] || die "Please pass a filename to process. . .\n";
After the required imports we are ready to build our SAX filter/handler chain. The chain is more complex in this case since XML::Parser::PerlSAX generates SAX1 events and XML::Handler::YAWriter expects SAX1 events, but our XInclude filter requires the more sophisticated namespace processing provided by SAX2. We work around this by adding the filters XML::Filter::SAX1toSAX2 and XML::Filter::SAX2toSAX1 to the chain immediately before and after our custom filter. This allows for proper namespace processing while ensuring that the other parts of the handler chain are able to generate and receive the data for the given events in the format that each expects.
    my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));
    $writer->{Pretty}->{NoProlog} = 1;

    my $sax1_filter = XML::Filter::SAX2toSAX1->new(Handler => $writer);
    my $handler     = FilterXInclude->new(Handler => $sax1_filter);
    my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $handler);
    my $parser      = XML::Parser::PerlSAX->new(Handler => $sax2_filter);

    my %parser_args = (Source => {SystemId => $file});
    $parser->parse(%parser_args);

    # end main
We now begin our XInclude filter module. Note that, again, we inherit from XML::Filter::Base to make our lives easier. Also notice that we add a BaseURI property to the filter object. This gives us a place to store the path that provides the context in which to resolve any relative URIs offered by the include elements. We set the default for this property to the current directory that the script is being executed in.
    # minimal XInclude Implementation
    package FilterXInclude;
    use strict;
    use base qw(XML::Filter::Base);
    use XML::Parser::PerlSAX;
    use XML::Filter::SAX2toSAX1;
    use XML::Filter::SAX1toSAX2;

    sub new {
        my $class = shift;
        my %options = @_;
        $options{BaseURI} ||= './';
        return bless \%options, $class;
    }

    sub start_element {
        my ($self, $element) = @_;
        my %attrs = %{$element->{Attributes}};
As we begin the start_element handler, we first check for an xml:base attribute in the current element. The xml:base attribute is the recommended way to set the base URI for applications that are expected to cope with relative URIs. In this case, if an xml:base attribute is found, we set the value of the filter object's BaseURI property to its value.
It is worth noting here that the structure of SAX2 attributes differs significantly from that of SAX1. In Perl implementations of SAX1, attributes are a simple HASH reference of name/value pairs. This causes problems with more modern documents that employ XML namespaces since they allow for cases where two attributes may have the same name, but are bound to different namespace URIs. Simple key => value pairs are not enough to capture the "X, in namespace Y, equals Z" relationships provided by namespaced attributes.
After much discussion on the perl-xml mailing list, it was decided that in SAX2 implementations attributes should remain a HASH, but should employ a notation first advanced by James Clark where the insufficient name => value structure is replaced by {namespace_uri}localname => \%attribute_properties. So, in the following block, when we say $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value} this can be understood to mean "give me the 'Value' property of the attribute that is bound to the 'http://www.w3.org/XML/1998/namespace' namespace whose local name is 'base'".
        if (defined $attrs{'{http://www.w3.org/XML/1998/namespace}base'}) {
            $self->{BaseURI} = $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value};
            $self->{BaseURI} =~ s|^file://||;
        }
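For readers new to this notation, the following sketch shows roughly what such an attribute hash might look like and how a lookup works. The property names other than Value (Name, LocalName, Prefix, NamespaceURI) are assumed to follow common Perl SAX2 conventions, the second attribute and its namespace are invented to show a name clash, and `attr_value` is a hypothetical helper rather than part of any module.

```perl
use strict;
use warnings;

# Illustrative data only: two attributes that share the local name "base"
# but are bound to different namespaces (the second one is made up).
my %attrs = (
    '{http://www.w3.org/XML/1998/namespace}base' => {
        Name         => 'xml:base',
        LocalName    => 'base',
        Prefix       => 'xml',
        NamespaceURI => 'http://www.w3.org/XML/1998/namespace',
        Value        => 'file://files/',
    },
    '{http://localhost/myns}base' => {
        Name         => 'my:base',
        LocalName    => 'base',
        Prefix       => 'my',
        NamespaceURI => 'http://localhost/myns',
        Value        => '10',
    },
);

# Hypothetical helper: build the "{namespace_uri}localname" key and
# return the attribute's Value, or undef if no such attribute exists.
sub attr_value {
    my ($attrs, $ns_uri, $local_name) = @_;
    my $attr = $attrs->{ '{' . $ns_uri . '}' . $local_name };
    return defined $attr ? $attr->{Value} : undef;
}

print attr_value(\%attrs, 'http://www.w3.org/XML/1998/namespace', 'base'), "\n";
print attr_value(\%attrs, 'http://localhost/myns', 'base'), "\n";
```

An attribute with no namespace would use an empty URI, which gives keys of the form `{}href`.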
Next, we check to see if the current element is in the XInclude namespace and has the local name of 'include' and, if so, we send the value of that element's href attribute off to our include_proc method to include the document at that URI in the current document stream.
Also notice that we do not forward the events for the include elements, since we do not want those elements to actually appear in the result document. This, coupled with the events generated by the include_proc method, has the effect of replacing the include elements with the documents that they point to.
        if ($element->{NamespaceURI} eq 'http://www.w3.org/2001/XInclude'
            and $element->{LocalName} eq 'include') {
            $self->include_proc($attrs{'{}href'}->{Value});
        }
        else {
            $self->{Handler}->start_element($element);
        }
    }
It is not enough to exclude the include elements from being forwarded in the start_element handler; we must do the same in the end_element handler. Otherwise, the resulting document would still contain the end tags for the include elements, which would make it ill-formed.
    sub end_element {
        my ($self, $element) = @_;
        unless ($element->{NamespaceURI} eq 'http://www.w3.org/2001/XInclude'
                and $element->{LocalName} eq 'include') {
            $self->{Handler}->end_element($element);
        }
    }
I should also point out that if you want to prune elements that may contain character data from a document, you must also implement a characters handler that conditionally blocks the forwarding of text events. Otherwise the text contained by the excluded elements will become part of the text of the nearest parent element, which is not likely to produce the desired result! We need not worry in this case since all of the include elements are empty.
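One way to handle that case is to keep a depth counter while inside a pruned element and to suppress every event until the counter returns to zero. The sketch below is only an illustration of that approach: `PruneFilter`, its `Prune` option, and the `Recorder` handler are hypothetical names, it uses simple SAX1-style element names, and the events are fired by hand in place of a parser.

```perl
use strict;
use warnings;

# Hypothetical final handler that rebuilds a crude string from the events.
package Recorder;
sub new {
    my ($class) = @_;
    my $self = { Out => '' };
    return bless $self, $class;
}
sub start_element { my ($self, $el) = @_; $self->{Out} .= "<$el->{Name}>";  return }
sub end_element   { my ($self, $el) = @_; $self->{Out} .= "</$el->{Name}>"; return }
sub characters    { my ($self, $c)  = @_; $self->{Out} .= $c->{Data};       return }

# Hypothetical filter that drops the element named in Prune, with its contents.
package PruneFilter;
sub new {
    my ($class, %options) = @_;
    my $self = { Depth => 0, %options };
    return bless $self, $class;
}
sub start_element {
    my ($self, $el) = @_;
    # Entering a pruned element, or nesting deeper inside one.
    if ($self->{Depth} > 0 or $el->{Name} eq $self->{Prune}) {
        $self->{Depth}++;
        return;
    }
    $self->{Handler}->start_element($el);
    return;
}
sub end_element {
    my ($self, $el) = @_;
    if ($self->{Depth} > 0) {
        $self->{Depth}--;
        return;
    }
    $self->{Handler}->end_element($el);
    return;
}
sub characters {
    my ($self, $chars) = @_;
    return if $self->{Depth} > 0;    # text inside a pruned element is dropped too
    $self->{Handler}->characters($chars);
    return;
}

package main;

my $recorder = Recorder->new;
my $filter   = PruneFilter->new(Handler => $recorder, Prune => 'note');

# Events for: <p>keep <note>drop <b>this</b></note>me</p>
$filter->start_element({ Name => 'p' });
$filter->characters({ Data => 'keep ' });
$filter->start_element({ Name => 'note' });
$filter->characters({ Data => 'drop ' });
$filter->start_element({ Name => 'b' });
$filter->characters({ Data => 'this' });
$filter->end_element({ Name => 'b' });
$filter->end_element({ Name => 'note' });
$filter->characters({ Data => 'me' });
$filter->end_element({ Name => 'p' });

print $recorder->{Out}, "\n";
```

Counting depth rather than setting a simple flag keeps the filter correct when the pruned element contains child elements, or another element of the same name.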
Finally we get to the include_proc method, which is responsible for parsing and including the requested documents. Here we simply create a new instance of XML::Filter::SAX1toSAX2, passing the current instance of our filter as the handler, then pass that as the handler for a new instance of XML::Parser::PerlSAX, and tell the parser to parse the document passed to the subroutine in the context of the BaseURI property.
The result of this is that the events fired from these included documents are inserted into the current document stream at the precise location previously taken by the include elements.
    sub include_proc {
        my ($self, $file) = @_;
        $file = $self->{BaseURI} . $file;
        my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $self);
        my $parser = XML::Parser::PerlSAX->new({Handler => $sax2_filter,
                                                Source  => {SystemId => $file} });
        $parser->parse;
    }
Passing the following XML document to this script. . .
    <?xml version="1.0"?>
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:xi="http://www.w3.org/2001/XInclude"
          xml:base="file://files/">
      <head>
        <title> Templating With XInclude and SAX2 </title>
      </head>
      <body>
        <xi:include href="header.xml"/>
        <hr width="80%"/>
        <xi:include href="content.xml"/>
        <hr width="80%"/>
        <xi:include href="footer.xml"/>
      </body>
    </html>
might result in a document like the following:
    <html xml:base="file://files/"
          xmlns="http://www.w3.org/1999/xhtml"
          xmlns:xi="http://www.w3.org/2001/XInclude">
      <head>
        <title> Templating With XInclude and SAX2 </title>
      </head>
      <body>
        <div class="header">
          <h1>Common Header</h1>
        </div>
        <hr width="80%"></hr>
        <div class="content">
          <p>
            Now is the winter of our discontent made glorious summer by the son of York.
          </p>
        </div>
        <hr width="80%"></hr>
        <div class="footer">
          <p>Common Footer</p>
        </div>
      </body>
    </html>
SAX is an important XML technology that, like Perl, keeps simple things simple and makes hard things possible. Knowing how to generate SAX events from non-XML data and how to use SAX filters to transform existing document streams is key to a mature understanding of the power that SAX can offer. We have only scratched the surface of what SAX filters and generators can do, but I hope that we have at least covered the basics well enough to pique your curiosity and get you to experiment on your own.