Transforming XML With SAX Filters
Last month we began our exploration of more advanced SAX topics with a look at how SAX events can be generated from non-XML data. This month, we conclude the series by introducing SAX filters and their use in XML data transformation.
A SAX filter is simply a class that is passed as the event handler to another class that generates SAX events, then forwards all or some of those events on to the next handler (or filter) in the processing chain. A filter may prune the document tree by not forwarding events for elements with a given name (or that meet some other condition), while in other cases, a filter might generate its own new events to add parent or child elements to certain elements in the existing document stream. Element attributes can also be added or removed, or the character data altered in some way. Really, any class that is able to receive SAX events, then call event methods on another SAX handler in a way that alters the document stream, can be seen as a SAX filter.
In practice, SAX filters are like conceptual cousins of many of the standard UNIX tools. By themselves, these tools often perform only a single, simple task, but when piped together they are capable of astonishing feats. In the same way, the real power of SAX filters is derived from the fact that simpler, easy-to-maintain filters may be chained together to produce complex XML data transformations.
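The pipe analogy can be tried without a parser. The following is a minimal sketch, not code from any CPAN module: the Collector, UpcaseText, and SquashSpace classes are hypothetical names invented for illustration, and a single characters event is fired by hand in place of a real SAX driver.

```perl
use strict;
use warnings;

# Hypothetical end-of-chain handler: collects the text it receives.
package Collector;
sub new        { my $class = shift; return bless { Text => '' }, $class }
sub characters { my ($self, $chars) = @_; $self->{Text} .= $chars->{Data} }

# Hypothetical filter: upper-cases character data, then forwards it.
package UpcaseText;
sub new { my ($class, %options) = @_; return bless \%options, $class }
sub characters {
    my ($self, $chars) = @_;
    $self->{Handler}->characters({ Data => uc $chars->{Data} });
}

# Hypothetical filter: collapses runs of whitespace, then forwards the result.
package SquashSpace;
sub new { my ($class, %options) = @_; return bless \%options, $class }
sub characters {
    my ($self, $chars) = @_;
    (my $text = $chars->{Data}) =~ s/\s+/ /g;
    $self->{Handler}->characters({ Data => $text });
}

package main;

# Build the chain back to front: the last handler is created first.
my $collector = Collector->new;
my $upcase    = UpcaseText->new(Handler => $collector);
my $squash    = SquashSpace->new(Handler => $upcase);

# Stand in for a parser by firing one event by hand.
$squash->characters({ Data => "thin  at\none   end" });
print $collector->{Text}, "\n";    # prints "THIN AT ONE END"
```

Each class does one small job and knows nothing about its neighbours beyond the Handler it was given, which is what lets the filters be rearranged or reused.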
For our first example we will create a simple SAX filter that transforms
the character data passed from XML::Parser::PerlSAX then hands
it on to Michael Koehne's XML::Handler::YAWriter to produce the
final XML document.
use strict;
use XML::Parser::PerlSAX;
use XML::Handler::YAWriter;
use IO::File;

my $file = $ARGV[0] || die "Please pass a file name to process\n";
With the necessary modules included, we get to the section that reveals
just exactly how SAX filters work. Notice that we create a new instance of
XML::Handler::YAWriter, then pass that object as the Handler
for our custom filter, the instance of which is passed as the
Handler to XML::Parser::PerlSAX. When the script
is executed, the parser will call its SAX events on the methods in our
FilterPorcus class, which, in turn, will call the event methods
on the writer class to print the result to STDOUT.
Note that when defining event chains, the classes are created in reverse order, with the first handler being the last class that is actually called. This may seem a bit confusing at first but with a little practice, you will get the hang of it.
my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));
my $filter = FilterPorcus->new(Handler => $writer);
my $parser = XML::Parser::PerlSAX->new(Handler => $filter);
my %parser_args = (Source => {SystemId => $file});
$parser->parse(%parser_args);
# end main
Next we create our custom filter class as an inline Perl package. Pay
special attention to the fact that our class inherits from Matt Sergeant's
XML::Filter::Base class. This allows us to implement only those
handler methods that are relevant to our filter since
XML::Filter::Base automatically forwards, by default, all SAX events
to the next handler class in the chain. If our class were not a subclass of
XML::Filter::Base we would have to explicitly forward each and every
event that the previous class could potentially generate.
# silly text transformer
package FilterPorcus;
use strict;
use base qw(XML::Filter::Base);

sub new {
    my $class = shift;
    my %options = @_;
    return bless \%options, $class;
}
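To see what default forwarding buys us, here is a rough sketch of the idea using core Perl only. TinyFilterBase, ShoutFilter, and EventLog are hypothetical classes written for this illustration; this is not how XML::Filter::Base is actually implemented, only an approximation of the behaviour described above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a forwarding base class: every event it
# knows about is passed, unchanged, to the next handler in the chain.
package TinyFilterBase;
sub new { my ($class, %options) = @_; return bless \%options, $class }
for my $event (qw(start_document end_document start_element end_element characters)) {
    no strict 'refs';
    *{"TinyFilterBase::$event"} = sub {
        my ($self, $data) = @_;
        my $next = $self->{Handler};
        return $next->$event($data) if $next && $next->can($event);
        return;
    };
}

# The subclass overrides only the one event it cares about.
package ShoutFilter;
use parent -norequire, 'TinyFilterBase';
sub characters {
    my ($self, $chars) = @_;
    $self->{Handler}->characters({ Data => uc $chars->{Data} });
}

# Hypothetical final handler that logs each event it receives.
package EventLog;
sub new           { return bless { Events => [] }, shift }
sub start_element { my ($self, $el) = @_; push @{ $self->{Events} }, "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; push @{ $self->{Events} }, "</$el->{Name}>" }
sub characters    { my ($self, $ch) = @_; push @{ $self->{Events} }, $ch->{Data} }

package main;

my $log   = EventLog->new;
my $shout = ShoutFilter->new(Handler => $log);

$shout->start_element({ Name => 'p', Attributes => {} });   # forwarded by the base class
$shout->characters({ Data => 'hello' });                    # handled by the subclass
$shout->end_element({ Name => 'p' });                       # forwarded by the base class

print join(' ', @{ $log->{Events} }), "\n";    # prints "<p> HELLO </p>"
```

Without the base class, ShoutFilter would need its own start_element and end_element methods just to pass those events along untouched.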
Our filter is only interested in transforming the text nodes of the
input document, so we will only implement the characters
method. After passing the character data to the local porcus
subroutine for transformation, we forward the result to the next handler by
calling the characters event on that handler.
sub characters {
    my ($self, $chars) = @_;
    my $out = $self->porcus($chars->{Data});
    $self->{Handler}->characters({Data => $out});
}
Finally we get to the porcus method that returns the string
passed to it transformed into the desired format using a little regular
expression voodoo.
sub porcus {
    my ($self, $chars) = @_;
    $chars =~ tr/A-Z/a-z/;
    $chars =~ s/\b([aeiou])/w$1/g;
    my $cons = q{[bcfghjklmnpqrstvwxz]};
    $chars =~ s/\b(qu|$cons($cons$cons?)?|[a-z])([a-z]*)/$3$1ay/g;
    return $chars;
}
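The substitutions are easier to follow when run on a short string. This sketch repeats the same three steps as a plain function (dropping the $self argument is the only change, made so it can run on its own) and adds a comment to each step.

```perl
use strict;
use warnings;

# Same substitutions as the porcus method, written as a plain function
# so that it can be tried without a parser.
sub porcus {
    my ($chars) = @_;
    $chars =~ tr/A-Z/a-z/;              # lower-case everything
    $chars =~ s/\b([aeiou])/w$1/g;      # words starting with a vowel get a leading "w"
    my $cons = q{[bcfghjklmnpqrstvwxz]};
    # Move "qu", a cluster of up to three listed consonants, or a single
    # leading letter to the end of the word, then append "ay".
    $chars =~ s/\b(qu|$cons($cons$cons?)?|[a-z])([a-z]*)/$3$1ay/g;
    return $chars;
}

print porcus('To me'), "\n";         # prints "otay emay"
print porcus('one of the'), "\n";    # prints "oneway ofway ethay"
```

Note that the inserted "w" is what later turns "one" into "oneway": the second substitution treats it as the leading consonant and moves it to the end.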
Feeding this script a snippet of Larry Wall's latest Perl 6 Apocalypse produces the following result:
<html> <body> <p> otay emay, oneway ofway ethay ostmay agonizingway aspectsway ofway anguagelay esignday isway omingcay upway ithway away usefulway ystemsay ofway operatorsway. otay otherway anguagelay esignersday, isthay aymay eemsay ikelay away illysay ingthay otay agonizeway overway. afterway allway, ouyay ancay iewvay allway operatorsway asway eremay yntacticsay ugarsay -- operatorsway areway ustjay unnyfay ookinglay unctionfay allscay. </p> </body> </html>
Okay, the result is admittedly pretty silly -- there may even be those who would argue that converting Uncle Larry's prose to pig latin is a bit redundant -- but the script does illustrate the basics of creating a simple SAX filter.
If we also wanted to transform the element and attribute names and
values in addition to the text data we would only need to add the following
start_element and end_element handlers.
sub start_element {
    my ($self, $element) = @_;
    # Build a fresh hash: adding and deleting keys in a hash while
    # iterating over it with each() is not safe in Perl.
    my %attrs;
    while ( my ($name, $value) = each %{$element->{Attributes}} ) {
        $attrs{ $self->porcus($name) } = $self->porcus($value);
    }
    $element->{Attributes} = \%attrs;
    $element->{Name} = $self->porcus($element->{Name});
    $self->{Handler}->start_element($element);
}
sub end_element {
    my ($self, $element) = @_;
    my $elname = $self->porcus($element->{Name});
    $element->{Name} = $elname;
    $self->{Handler}->end_element($element);
}
Again, the principles are the same: accept events, alter the data, then forward that altered data by calling events on the filter's designated handler.
Enough silliness, let's look at a more practical example.
For our final example, we will demonstrate how a SAX filter can be used to alter the structure of an XML document by creating a filter that partially implements the current version of the W3C's XInclude working draft.
XInclude suggests a compact, DTD- and Schema-agnostic way to include external XML documents (or document fragments) into the current document being processed. For example,
<?xml version="1.0"?>
<article
    xmlns="http://localhost/myns"
    xmlns:xi="http://www.w3.org/2001/XInclude">
  <para>
    All brontosauruses are thin at one end,
    much much thicker in the middle, and
    then thin again at the far end.
  </para>
  <xi:include href="disclaimer.xml"/>
</article>
would signal an XInclude-aware processor to include the contents of the file
disclaimer.xml into the current document between the end tag of the para element and
the end tag of the top-level article element.
And speaking of disclaimers, it should be pointed out that our implementation here by no means covers the requirements of the full XInclude draft; it will only allow inclusion of complete documents from the local file system. XInclude itself is far more flexible and robust. Our goal here is merely to demonstrate the principles of writing SAX filters.
use strict;
use XML::Parser::PerlSAX;
use XML::Filter::SAX2toSAX1;
use XML::Filter::SAX1toSAX2;
use XML::Handler::YAWriter;
use IO::File;

my $file = $ARGV[0] || die "Please pass a filename to process. . .\n";
After the required imports we are ready to build our SAX filter-handler
chain. The chain is more complex in this case since
XML::Parser::PerlSAX generates SAX1 events and
XML::Handler::YAWriter expects SAX1 events, but our XInclude
filter requires the more sophisticated namespace processing provided by
SAX2. We work around this by adding the filters
XML::Filter::SAX1toSAX2 and
XML::Filter::SAX2toSAX1 to the chain immediately before and
after our custom filter. This allows for proper namespace processing while
ensuring that the other parts of the handler chain are able to generate and
receive the data for the given events in the format that each expects.
my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));
$writer->{Pretty}->{NoProlog} = 1;
my $sax1_filter = XML::Filter::SAX2toSAX1->new(Handler => $writer);
my $handler = FilterXInclude->new(Handler => $sax1_filter);
my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $handler);
my $parser = XML::Parser::PerlSAX->new(Handler => $sax2_filter);
my %parser_args = (Source => {SystemId => $file});
$parser->parse(%parser_args);
# end main
We now begin our XInclude filter module. Note that, again, we inherit
from XML::Filter::Base to make life a little easier. Also
notice that we add a BaseURI property to the filter object.
This gives us a place to store the path that provides the context in which
to resolve any relative URIs offered by the include elements. We set the
default for this property to the current directory that the script is being
executed in.
# minimal XInclude Implementation
package FilterXInclude;
use strict;
use base qw(XML::Filter::Base);
use XML::Parser::PerlSAX;
use XML::Filter::SAX2toSAX1;
use XML::Filter::SAX1toSAX2;

sub new {
    my $class = shift;
    my %options = @_;
    $options{BaseURI} ||= './';
    return bless \%options, $class;
}

sub start_element {
    my ($self, $element) = @_;
    my %attrs = %{$element->{Attributes}};
As we begin the start_element handler, we first check for
an xml:base attribute in the current element. The
xml:base attribute is the recommended way to set the base URI
for applications that are expected to cope with relative URIs. In this case
if an xml:base attribute is found, we set the value of the
filter object's BaseURI property to its value.
It is worth noting here that the structure of SAX2 attributes differs significantly from that of SAX1. In Perl implementations of SAX1, attributes are a simple HASH reference of name/value pairs. This causes problems with more modern documents that employ XML namespaces since they allow for cases where two attributes may have the same name, but are bound to different namespace URIs. Simple key => value pairs are not enough to capture the "X, in namespace Y, equals Z" relationships provided by namespaced attributes.
After much discussion on the perl-xml mailing list, it was decided that
in SAX2 implementations attributes should remain a HASH, but should employ a
notation first advanced by James Clark where the insufficient name =>
value structure is replaced by {namespace_uri}localname =>
\%attribute_properties. So, in the following block, when we say
$attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value}
this should be understood to mean "give me the 'Value' property of the
attribute that is bound to the 'http://www.w3.org/XML/1998/namespace'
namespace whose local name is 'base'".
    if (defined $attrs{'{http://www.w3.org/XML/1998/namespace}base'}) {
        $self->{BaseURI} =
            $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value};
        $self->{BaseURI} =~ s|^file://||;
    }
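The {namespace_uri}localname keys are easier to picture with a concrete hash in hand. The sketch below builds one by hand for a made-up element carrying xml:base, href, and xi:href attributes. The clark helper is hypothetical, and the property names other than Value (Name, LocalName, NamespaceURI) are assumptions about what a SAX2 driver might supply rather than a statement of any module's exact output.

```perl
use strict;
use warnings;

my $XML_NS = 'http://www.w3.org/XML/1998/namespace';
my $XI_NS  = 'http://www.w3.org/2001/XInclude';

# Hypothetical helper: builds a "{namespace_uri}localname" hash key.
sub clark {
    my ($uri, $local) = @_;
    return '{' . $uri . '}' . $local;
}

# Hand-built attribute hash for an imaginary element:
#   <a xml:base="file://files/" href="header.xml" xi:href="other.xml"/>
my %attrs = (
    clark($XML_NS, 'base') => {
        Name => 'xml:base', LocalName => 'base',
        NamespaceURI => $XML_NS, Value => 'file://files/',
    },
    clark('', 'href') => {
        Name => 'href', LocalName => 'href',
        NamespaceURI => '', Value => 'header.xml',
    },
    clark($XI_NS, 'href') => {
        Name => 'xi:href', LocalName => 'href',
        NamespaceURI => $XI_NS, Value => 'other.xml',
    },
);

# Two attributes share the local name "href" without colliding.
print $attrs{ clark('', 'href') }{Value}, "\n";        # prints "header.xml"
print $attrs{ clark($XI_NS, 'href') }{Value}, "\n";    # prints "other.xml"
print $attrs{ clark($XML_NS, 'base') }{Value}, "\n";   # prints "file://files/"
```

An attribute that is in no namespace gets the empty-braces key {}href, which is the form the filter looks up when it reads the href of an include element.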
Next, we check to see if the current element is in the XInclude
namespace and has the local name of 'include' and, if so, we send the value
of that element's href attribute off to our
include_proc method to include the document at that URI into
the current document stream.
Also notice that we do not forward the events for the
include elements since we do not want those elements to
actually appear in the result document. This, coupled with the results
included from the include_proc method, has the effect of
replacing the include elements with the documents that
they point to.
    if ($element->{NamespaceURI} eq 'http://www.w3.org/2001/XInclude'
        and $element->{LocalName} eq 'include') {
        $self->include_proc($attrs{'{}href'}->{Value});
    }
    else {
        $self->{Handler}->start_element($element);
    }
}
It is not enough to exclude the include elements from being
forwarded in the start_element handler; we must do the
same in the end_element handler. Otherwise, the
resulting document would still contain the end tags for the
include elements, causing the resulting XML document to be
ill-formed.
sub end_element {
    my ($self, $element) = @_;
    unless ($element->{NamespaceURI} eq
            'http://www.w3.org/2001/XInclude'
            and $element->{LocalName} eq 'include') {
        $self->{Handler}->end_element($element);
    }
}
I should also point out that if you want to prune elements that may
contain character data from a document, you must also implement a
characters handler that conditionally blocks the forwarding of
text events. Otherwise the text contained by the excluded elements will
become part of the text of the nearest parent element, which is not likely
to produce the desired result. We need not worry in this case since all of
the include elements are empty.
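For the case where the pruned elements are not empty, one possible approach is to keep a depth counter and block every event while it is non-zero. The sketch below is an illustration only: PruneNotes and Serializer are hypothetical classes, the element name note is made up, and the events are fired by hand using SAX1-style Name keys.

```perl
use strict;
use warnings;

# Hypothetical filter that drops <note> elements and everything inside them.
package PruneNotes;
sub new {
    my ($class, %options) = @_;
    my $self = { %options, Skip => 0 };    # Skip = depth inside a pruned element
    return bless $self, $class;
}
sub start_element {
    my ($self, $element) = @_;
    if ($self->{Skip} or $element->{Name} eq 'note') {
        $self->{Skip}++;                   # entering (or nesting inside) a pruned element
        return;
    }
    $self->{Handler}->start_element($element);
}
sub end_element {
    my ($self, $element) = @_;
    if ($self->{Skip}) {
        $self->{Skip}--;                   # leaving one level of the pruned subtree
        return;
    }
    $self->{Handler}->end_element($element);
}
sub characters {
    my ($self, $chars) = @_;
    return if $self->{Skip};               # text inside a pruned element is dropped too
    $self->{Handler}->characters($chars);
}

# Hypothetical final handler that rebuilds a string from the events.
package Serializer;
sub new           { return bless { Out => '' }, shift }
sub start_element { my ($self, $el) = @_; $self->{Out} .= "<$el->{Name}>" }
sub end_element   { my ($self, $el) = @_; $self->{Out} .= "</$el->{Name}>" }
sub characters    { my ($self, $ch) = @_; $self->{Out} .= $ch->{Data} }

package main;

my $serializer = Serializer->new;
my $prune      = PruneNotes->new(Handler => $serializer);

# Events for: <p>keep <note>drop <b>this</b></note>me</p>
my @events = (
    [ start_element => { Name => 'p' } ],
    [ characters    => { Data => 'keep ' } ],
    [ start_element => { Name => 'note' } ],
    [ characters    => { Data => 'drop ' } ],
    [ start_element => { Name => 'b' } ],
    [ characters    => { Data => 'this' } ],
    [ end_element   => { Name => 'b' } ],
    [ end_element   => { Name => 'note' } ],
    [ characters    => { Data => 'me' } ],
    [ end_element   => { Name => 'p' } ],
);
for my $event (@events) {
    my ($method, $data) = @$event;
    $prune->$method($data);
}

print $serializer->{Out}, "\n";    # prints "<p>keep me</p>"
```

Counting depth rather than using a simple on/off flag keeps the filter correct when the pruned element contains children of its own, or another element of the same name.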
Finally we get to the include_proc method which is
responsible for parsing and including the requested documents. Here we
simply create a new instance of XML::Filter::SAX1toSAX2,
passing the current instance of our filter as the handler, then pass that as
the handler for a new instance of XML::Parser::PerlSAX, and
tell the parser to parse the document passed to the subroutine in the
context of the BaseURI property.
The result of this is that the events fired from these included documents
are inserted into the current document stream at the precise location
previously taken by the include elements.
sub include_proc {
    my ($self, $file) = @_;
    $file = $self->{BaseURI} . $file;
    my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $self);
    my $parser = XML::Parser::PerlSAX->new({Handler => $sax2_filter,
                                            Source  => {SystemId => $file}
                                           });
    $parser->parse;
}
Passing the following XML document to this script. . .
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xi="http://www.w3.org/2001/XInclude"
      xml:base="file://files/">
  <head>
    <title>
      Templating With XInclude and SAX2
    </title>
  </head>
  <body>
    <xi:include href="header.xml"/>
    <hr width="80%"/>
    <xi:include href="content.xml"/>
    <hr width="80%"/>
    <xi:include href="footer.xml"/>
  </body>
</html>
might result in a document like
<html
    xml:base="file://files/"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xi="http://www.w3.org/2001/XInclude">
  <head>
    <title>
      Templating With XInclude and SAX2
    </title>
  </head>
  <body>
    <div class="header">
      <h1>Common Header</h1>
    </div>
    <hr width="80%"></hr>
    <div class="content">
      <p>
        Now is the winter of our
        discontent made glorious
        summer by the son of York.
      </p>
    </div>
    <hr width="80%"></hr>
    <div class="footer">
      <p>Common Footer</p>
    </div>
  </body>
</html>
SAX is an important XML technology that, like Perl, keeps simple things simple and makes hard things possible. Knowing how to generate SAX events from non-XML data and using SAX filters to transform existing document streams are key to a mature understanding of the power that SAX offers. We have only scratched the surface of what SAX filters and generators can do, but I hope that we have at least covered the basics well enough to pique your curiosity and provoke experimentation.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.