XML.com: XML From the Inside Out

Transforming XML With SAX Filters
by Kip Hampton | Pages: 1, 2, 3, 4

Transforming Document Structure

For our final example, we will show how a SAX filter can alter the structure of an XML document by writing a filter that partially implements the current version of the W3C's XInclude working draft.

XInclude offers a compact, DTD- and Schema-agnostic way to include external XML documents (or document fragments) in the document currently being processed. For example,

<?xml version="1.0"?>
<article
  xmlns="http://localhost/myns"
  xmlns:xi="http://www.w3.org/2001/XInclude">
  <para>
    All brontosauruses are thin at one end,
    much much thicker in the middle, and
    then thin again at the far end.
  </para>
  <xi:include href="disclaimer.xml"/>
</article>

would signal an XInclude-aware processor to insert the contents of the file disclaimer.xml into the current document, between the end tag of the para element and the end tag of the top-level article element.

And speaking of disclaimers, it should be pointed out that our implementation here by no means covers the requirements of the full XInclude draft; it will only allow inclusion of complete documents from the local file system. XInclude itself is far more flexible and robust. Our goal here is merely to demonstrate the principles of writing SAX filters.

use strict;
use XML::Parser::PerlSAX;
use XML::Filter::SAX2toSAX1;
use XML::Filter::SAX1toSAX2;
use XML::Handler::YAWriter;
use IO::File;

my $file = $ARGV[0] || die "Please pass a filename to process. . .\n";

After the required imports we are ready to build our SAX filter-handler chain. The chain is more complex in this case because XML::Parser::PerlSAX generates SAX1 events and XML::Handler::YAWriter expects SAX1 events, while our XInclude filter requires the more sophisticated namespace processing provided by SAX2. We work around this by placing XML::Filter::SAX1toSAX2 immediately before our custom filter and XML::Filter::SAX2toSAX1 immediately after it. This allows for proper namespace processing while ensuring that every other part of the chain generates and receives event data in the format it expects.

my $writer = XML::Handler::YAWriter->new(Output => IO::File->new( ">-" ));
$writer->{Pretty}->{NoProlog} = 1;
my $sax1_filter = XML::Filter::SAX2toSAX1->new(Handler => $writer);
my $handler = FilterXInclude->new(Handler => $sax1_filter);
my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $handler);
my $parser = XML::Parser::PerlSAX->new(Handler => $sax2_filter);

my %parser_args = (Source => {SystemId => $file});
$parser->parse(%parser_args);

# end main
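Each Handler argument in the chain above names the next stage downstream, which is why the chain is built from the writer backwards to the parser. As a rough illustration of that forwarding pattern, here is a minimal sketch in plain Perl that uses no SAX modules at all. The packages Collector and UpcaseFilter are hypothetical stand-ins invented for this sketch, not part of any library used in this article.

```perl
use strict;
use warnings;

# Hypothetical end-of-chain handler: records each event it receives.
package Collector;

sub new {
    my ($class) = @_;
    return bless { Events => [] }, $class;
}

sub start_element {
    my ($self, $element) = @_;
    push @{ $self->{Events} }, "start:$element->{Name}";
}

sub end_element {
    my ($self, $element) = @_;
    push @{ $self->{Events} }, "end:$element->{Name}";
}

# Hypothetical filter: upper-cases element names, then forwards each
# event to whatever object was passed in as Handler.
package UpcaseFilter;

sub new {
    my ($class, %options) = @_;
    return bless {%options}, $class;
}

sub start_element {
    my ($self, $element) = @_;
    $self->{Handler}->start_element({ %$element, Name => uc($element->{Name}) });
}

sub end_element {
    my ($self, $element) = @_;
    $self->{Handler}->end_element({ %$element, Name => uc($element->{Name}) });
}

package main;

# Build the chain from the far end backwards, as in the script above.
my $collector = Collector->new;
my $filter    = UpcaseFilter->new(Handler => $collector);

# Stand in for the parser by firing events at the head of the chain.
$filter->start_element({ Name => 'para' });
$filter->end_element({ Name => 'para' });

print join(', ', @{ $collector->{Events} }), "\n";
```

The filter never needs to know what sits downstream of it; it only calls the same event methods on its Handler. That is what lets the two SAX version converters be slotted in on either side of our custom filter.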

We now begin our XInclude filter module. Note that, again, we inherit from XML::Filter::Base to make life a little easier. Also notice that we add a BaseURI property to the filter object. This gives us a place to store the path against which any relative URIs offered by the include elements are resolved. By default, this property is set to the directory in which the script is executed.

# minimal XInclude Implementation
package FilterXInclude;
use strict;
use base qw(XML::Filter::Base);
use XML::Parser::PerlSAX;
use XML::Filter::SAX2toSAX1;
use XML::Filter::SAX1toSAX2;

sub new {
    my $class = shift;
    my %options = @_;
    $options{BaseURI} ||= './';
    return bless \%options, $class;
}

sub start_element {
  my ($self, $element) = @_;
  my %attrs = %{$element->{Attributes}};

As we begin the start_element handler, we first check for an xml:base attribute in the current element. The xml:base attribute is the recommended way to set the base URI for applications that are expected to cope with relative URIs. If one is found, we store its value in the filter object's BaseURI property.
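To make the role of BaseURI concrete, the following is a small sketch of how a base path and a relative href might be combined. It is not the filter's actual code. The helper names effective_base and resolve_href are hypothetical, the attributes are assumed to be a flat name => value hash for simplicity, and the values are treated as plain local file paths rather than full URIs.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the base path in effect for an element.
# Assumes a flat name => value attribute hash (a simplification).
sub effective_base {
    my ($current_base, $attrs) = @_;
    return exists $attrs->{'xml:base'} ? $attrs->{'xml:base'} : $current_base;
}

# Hypothetical helper: resolve an include href against a base directory.
# Absolute paths are returned unchanged; relative ones are joined to the base.
sub resolve_href {
    my ($base, $href) = @_;
    return $href if File::Spec->file_name_is_absolute($href);
    return File::Spec->catfile($base, $href);
}

# An element carrying xml:base overrides the default of './'.
my $base = effective_base('./', { 'xml:base' => 'docs' });
print resolve_href($base, 'disclaimer.xml'), "\n";
```

A complete implementation would resolve xml:base values as URIs and would restore the previous base when the element that set it ends. This sketch leaves both out.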

It is worth noting here that the structure of SAX2 attributes differs significantly from that of SAX1. In Perl implementations of SAX1, attributes are a simple hash reference of name/value pairs. This causes problems with more modern documents that employ XML namespaces, since namespaces allow two attributes to have the same name while being bound to different namespace URIs. Simple key => value pairs are not enough to capture the "X, in namespace Y, equals Z" relationships provided by namespaced attributes.
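The collision described above can be shown with two plain Perl hashes. In this sketch the namespace URIs are made up for illustration and clark_key is a hypothetical helper. Keying attributes as "{NamespaceURI}LocalName" is the convention used by the Perl SAX 2 binding; whether the converter filter used in this article produces exactly this shape is an assumption.

```perl
use strict;
use warnings;

# Two made-up namespace URIs, used only for illustration.
my $ns_a = 'http://localhost/ns-a';
my $ns_b = 'http://localhost/ns-b';

# Flat name => value storage keyed by local name: the second href
# silently overwrites the first.
my %flat;
$flat{href} = 'a.xml';
$flat{href} = 'b.xml';

# Hypothetical helper: build a "{NamespaceURI}LocalName" key.
sub clark_key {
    my ($uri, $local) = @_;
    return '{' . $uri . '}' . $local;
}

# Namespace-aware storage: each attribute keeps its URI, local name,
# and value, so both href attributes survive side by side.
my %namespaced = (
    clark_key($ns_a, 'href') => {
        NamespaceURI => $ns_a, LocalName => 'href', Value => 'a.xml',
    },
    clark_key($ns_b, 'href') => {
        NamespaceURI => $ns_b, LocalName => 'href', Value => 'b.xml',
    },
);

print scalar(keys %flat), " flat, ", scalar(keys %namespaced), " namespaced\n";
```

Looking up an attribute then means asking for a specific namespace and local name together, for example $namespaced{ clark_key($ns_b, 'href') }{Value}, rather than a bare name.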
