Transforming XML With SAX Filters
by Kip Hampton
Transforming Document Structure
For our final example, we will demonstrate how a SAX filter can be used to alter the structure of an XML document by creating a filter that partially implements the current version of the W3C's XInclude working draft.
XInclude suggests a compact, DTD- and Schema-agnostic way to include external XML documents (or document fragments) into the current document being processed. For example,
<?xml version="1.0"?>
<article
  xmlns="http://localhost/myns"
  xmlns:xi="http://www.w3.org/2001/XInclude">
  <para>
    All brontosauruses are thin at one end,
    much much thicker in the middle, and
    then thin again at the far end.
  </para>
  <xi:include href="disclaimer.xml"/>
</article>
would signal an XInclude-aware processor to insert the contents of the file
disclaimer.xml into the current document, between the end tag of the para
element and the end tag of the top-level article element.
And speaking of disclaimers, it should be pointed out that our implementation here by no means covers the requirements of the full XInclude draft; it will only allow inclusion of complete documents from the local file system. XInclude itself is far more flexible and robust. Our goal here is merely to demonstrate the principles of writing SAX filters.
use strict;
use XML::Parser::PerlSAX;
use XML::Filter::SAX2toSAX1;
use XML::Filter::SAX1toSAX2;
use XML::Handler::YAWriter;
use IO::File;
my $file = $ARGV[0] || die "Please pass a filename to process. . .\n";
After the required imports we are ready to build our SAX filter-handler
chain. The chain is more complex in this case since
XML::Parser::PerlSAX generates SAX1 events and
XML::Handler::YAWriter expects SAX1 events, but our XInclude
filter requires the more sophisticated namespace processing provided by
SAX2. We work around this by adding the filters
XML::Filter::SAX1toSAX2 and
XML::Filter::SAX2toSAX1 to the chain immediately before and
after our custom filter. This allows for proper namespace processing while
ensuring that the other parts of the handler chain are able to generate and
receive the data for the given events in the format that each expects.
my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));
$writer->{Pretty}->{NoProlog} = 1;
my $sax1_filter = XML::Filter::SAX2toSAX1->new(Handler => $writer);
my $handler = FilterXInclude->new(Handler => $sax1_filter);
my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $handler);
my $parser = XML::Parser::PerlSAX->new(Handler => $sax2_filter);
my %parser_args = (Source => {SystemId => $file});
$parser->parse(%parser_args);
# end main
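The chain above is built back to front because each stage must be handed its downstream handler when it is constructed. The following is a minimal sketch of that idea using only core Perl. ToyStage and the stage names are hypothetical stand-ins for illustration, not part of any SAX module; each stage simply records its name and forwards the event.

```perl
use strict;
use warnings;

# Hypothetical pass-through stage (not a real SAX class): it notes that
# it saw the event, then forwards the event to its Handler, if any.
package ToyStage;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub start_element {
    my ($self, $element) = @_;
    push @{ $element->{Seen} }, $self->{Name};
    $self->{Handler}->start_element($element) if $self->{Handler};
}

package main;

# Construct the last stage first, since every earlier stage needs a
# reference to the stage that follows it.
my $writer  = ToyStage->new(Name => 'writer');
my $to_sax1 = ToyStage->new(Name => 'sax2-to-sax1', Handler => $writer);
my $filter  = ToyStage->new(Name => 'xinclude',     Handler => $to_sax1);
my $to_sax2 = ToyStage->new(Name => 'sax1-to-sax2', Handler => $filter);

# Fire one event into the head of the chain and show the path it took.
my $event = { Name => 'para', Seen => [] };
$to_sax2->start_element($event);
print join(' -> ', @{ $event->{Seen} }), "\n";
# prints: sax1-to-sax2 -> xinclude -> sax2-to-sax1 -> writer
```

Events travel in the opposite order to construction: the parser feeds the first converter, and the writer sees each event last.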
We now begin our XInclude filter module. Note that, again, we inherit
from XML::Filter::Base to make life a little easier. Also
notice that we add a BaseURI property to the filter object.
This gives us a place to store the path that provides the context in which
to resolve any relative URIs offered by the include elements. We set the
default for this property to the current directory that the script is being
executed in.
# minimal XInclude Implementation
package FilterXInclude;
use strict;
use base qw(XML::Filter::Base);
use XML::Parser::PerlSAX;
use XML::Filter::SAX2toSAX1;
use XML::Filter::SAX1toSAX2;

sub new {
    my $class = shift;
    my %options = @_;
    $options{BaseURI} ||= './';
    return bless \%options, $class;
}

sub start_element {
    my ($self, $element) = @_;
    my %attrs = %{$element->{Attributes}};
As we begin the start_element handler, we first check for
an xml:base attribute in the current element. The
xml:base attribute is the recommended way to set the base URI
for applications that are expected to cope with relative URIs. If the
current element carries an xml:base attribute, we copy its value into
the filter object's BaseURI property.
It is worth noting here that the structure of SAX2 attributes differs significantly from that of SAX1. In Perl implementations of SAX1, attributes are a simple hash reference of name/value pairs. This causes problems with more modern documents that employ XML namespaces, since namespaces allow two attributes to have the same name while being bound to different namespace URIs. Simple key => value pairs are not enough to capture the "X, in namespace Y, equals Z" relationships provided by namespaced attributes.
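To make the difference concrete, here is a small sketch that builds both shapes by hand with plain hashes. It assumes the Perl SAX2 convention of keying attributes as "{NamespaceURI}LocalName", with each value being a hash of Name, Value, NamespaceURI, Prefix, and LocalName; the attribute values themselves are invented for illustration.

```perl
use strict;
use warnings;

my $xml_ns = 'http://www.w3.org/XML/1998/namespace';

# SAX1 style: one flat hash, keyed by the attribute name as written.
# The prefix is only a string here; nothing ties it to a namespace URI.
my %sax1_attrs = (
    'xml:base' => 'includes/',   # illustrative value
    'base'     => 'ten',         # illustrative value
);

# SAX2 style (assumed layout): the key combines namespace URI and local
# name, and the value carries every part of the attribute separately.
my $xml_base_key   = '{' . $xml_ns . '}base';
my $plain_base_key = '{}base';   # attribute in no namespace

my %sax2_attrs = (
    $xml_base_key => {
        Name         => 'xml:base',
        LocalName    => 'base',
        Prefix       => 'xml',
        NamespaceURI => $xml_ns,
        Value        => 'includes/',
    },
    $plain_base_key => {
        Name         => 'base',
        LocalName    => 'base',
        Prefix       => '',
        NamespaceURI => '',
        Value        => 'ten',
    },
);

# Two attributes share the local name "base" yet remain distinct entries.
my %seen;
$seen{ $_->{LocalName} }++ for values %sax2_attrs;
print "attributes with local name 'base': $seen{base}\n";

# Looking up xml:base by namespace does not depend on the prefix used.
print "xml:base value: $sax2_attrs{$xml_base_key}{Value}\n";
```

Because the lookup key contains the namespace URI rather than the prefix, the same test would still find the attribute if a document bound that namespace to a different prefix.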