XML.com: XML From the Inside Out

Transforming XML With SAX Filters
by Kip Hampton

After much discussion on the perl-xml mailing list, it was decided that in SAX2 implementations attributes should remain a HASH, but should employ a notation first advanced by James Clark, in which the insufficient name => value structure is replaced by {namespace_uri}localname => \%attribute_properties. So, in the following block, when we say $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value}, this should be understood to mean "give me the 'Value' property of the attribute that is bound to the 'http://www.w3.org/XML/1998/namespace' namespace and whose local name is 'base'".

  if (defined $attrs{'{http://www.w3.org/XML/1998/namespace}base'}) {
    $self->{BaseURI} =
        $attrs{'{http://www.w3.org/XML/1998/namespace}base'}->{Value};
    $self->{BaseURI} =~ s|^file://||;
  }
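To make the key notation concrete, here is a minimal sketch in plain Perl that builds the same kind of key by hand. The attr_value helper and the sample attribute hash are hypothetical and exist only for illustration; they are not part of any SAX module, and the properties shown next to Value are simply the ones I would expect a SAX2 attribute to carry.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any SAX module): return an attribute's
# Value given its namespace URI and local name, using the
# {namespace_uri}localname key notation described above.
sub attr_value {
    my ($attrs, $ns_uri, $local_name) = @_;
    my $key = '{' . $ns_uri . '}' . $local_name;
    return exists $attrs->{$key} ? $attrs->{$key}{Value} : undef;
}

# Sample data shaped like the attribute hash of a SAX2 start_element event.
my %attrs = (
    '{http://www.w3.org/XML/1998/namespace}base' => {
        Name         => 'xml:base',
        LocalName    => 'base',
        Prefix       => 'xml',
        NamespaceURI => 'http://www.w3.org/XML/1998/namespace',
        Value        => 'file://files/',
    },
    '{}href' => {
        Name         => 'href',
        LocalName    => 'href',
        Prefix       => '',
        NamespaceURI => '',
        Value        => 'header.xml',
    },
);

my $base = attr_value(\%attrs, 'http://www.w3.org/XML/1998/namespace', 'base');
my $href = attr_value(\%attrs, '', 'href');   # no namespace: key is '{}href'
print "$base $href\n";
```

Note that an attribute without a namespace still gets the braces, which is why the filter looks up '{}href' rather than plain 'href'.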

Next, we check to see if the current element is in the XInclude namespace and has the local name 'include'. If so, we send the value of that element's href attribute off to our include_proc method, which includes the document at that URI in the current document stream.

Also notice that we do not forward the events for the include elements since we do not want those elements to actually appear in the result document. This, coupled with the results included from the include_proc method, has the effect of replacing the include elements with the documents that they point to.

  if ($element->{NamespaceURI} eq 'http://www.w3.org/2001/XInclude'
      and $element->{LocalName} eq 'include') {
      $self->include_proc($attrs{'{}href'}->{Value});
  }
  else {
    $self->{Handler}->start_element($element);
  }
}

It is not enough to keep the include elements from being forwarded in the start_element handler; we must do the same in the end_element handler. Otherwise, the resulting document would still contain the end tags for the include elements and would no longer be well-formed.

sub end_element {
  my ($self, $element) = @_;
  unless ($element->{NamespaceURI} eq
          'http://www.w3.org/2001/XInclude'
      and $element->{LocalName} eq 'include') {
      $self->{Handler}->end_element($element);
  }
}

I should also point out that if you want to prune elements that may contain character data from a document, you must also implement a characters handler that conditionally blocks the forwarding of text events. Otherwise the text contained by the excluded elements will become part of the text of the nearest parent element, which is not likely to produce the desired result. We need not worry in this case since all of the include elements are empty.
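A rough sketch of that kind of pruning follows. The PruneNotes filter, the 'note' element it drops, and the Recorder stand-in handler are all hypothetical names invented for this example, and the events are fired by hand rather than by a real parser. A depth counter keeps track of whether we are inside a pruned element, so nested elements and their text are dropped too.

```perl
use strict;
use warnings;

# Hypothetical filter that prunes <note> elements together with their text.
package PruneNotes;

sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler}, Skip => 0 }, $class;
}

sub start_element {
    my ($self, $element) = @_;
    # Track nesting depth so descendants of a pruned element are dropped too.
    if ($self->{Skip} or $element->{LocalName} eq 'note') {
        $self->{Skip}++;
        return;
    }
    $self->{Handler}->start_element($element);
}

sub end_element {
    my ($self, $element) = @_;
    if ($self->{Skip}) {
        $self->{Skip}--;
        return;
    }
    $self->{Handler}->end_element($element);
}

sub characters {
    my ($self, $chars) = @_;
    return if $self->{Skip};    # block text inside pruned elements
    $self->{Handler}->characters($chars);
}

# Stand-in handler that records the events it receives.
package Recorder;

sub new {
    my $class = shift;
    return bless { Log => [] }, $class;
}
sub start_element { push @{ $_[0]{Log} }, "<$_[1]{LocalName}>" }
sub end_element   { push @{ $_[0]{Log} }, "</$_[1]{LocalName}>" }
sub characters    { push @{ $_[0]{Log} }, $_[1]{Data} }

package main;

my $recorder = Recorder->new;
my $filter   = PruneNotes->new(Handler => $recorder);

# Hand-fired events for <p>keep<note>drop</note></p>.
$filter->start_element({ LocalName => 'p' });
$filter->characters({ Data => 'keep' });
$filter->start_element({ LocalName => 'note' });
$filter->characters({ Data => 'drop' });
$filter->end_element({ LocalName => 'note' });
$filter->end_element({ LocalName => 'p' });

my $result = join '', @{ $recorder->{Log} };
print "$result\n";   # prints <p>keep</p>
```

Without the check in the characters handler, the word 'drop' would be forwarded and end up as part of the text of the enclosing p element.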

Finally we get to the include_proc method which is responsible for parsing and including the requested documents. Here we simply create a new instance of XML::Filter::SAX1toSAX2, passing the current instance of our filter as the handler, then pass that as the handler for a new instance of XML::Parser::PerlSAX, and tell the parser to parse the document passed to the subroutine in the context of the BaseURI property.

The result of this is that the events fired from these included documents are inserted into the current document stream at the precise location previously taken by the include elements.

sub include_proc {
  my ($self, $file) = @_;
  $file = $self->{BaseURI} . $file;
  my $sax2_filter = XML::Filter::SAX1toSAX2->new(Handler => $self);
  my $parser = XML::Parser::PerlSAX->new({Handler => $sax2_filter,
                                          Source => {SystemId => $file}
                                        });
  $parser->parse;
}
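To see what file name the parser ends up with, the two string operations involved (stripping the file:// prefix in start_element and prepending BaseURI in include_proc) can be traced in isolation. The resolve_include helper below is hypothetical and only mirrors those two steps.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the filter's two steps:
# strip the file:// prefix from the xml:base value, then prepend the
# result to the href taken from the include element.
sub resolve_include {
    my ($xml_base, $href) = @_;
    (my $base = $xml_base) =~ s|^file://||;
    return $base . $href;
}

my $path = resolve_include('file://files/', 'header.xml');
print "$path\n";   # prints files/header.xml
```

This is plain string concatenation rather than full relative URI resolution, so an absolute href or one containing '../' would not be handled; for the simple templating case shown here that is enough.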

Passing the following XML document to this script...

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xi="http://www.w3.org/2001/XInclude"
      xml:base="file://files/">
  <head>
    <title>
      Templating With XInclude and SAX2
    </title>
  </head>
  <body>
   <xi:include href="header.xml"/>
   <hr width="80%"/>
   <xi:include href="content.xml"/>
   <hr width="80%"/>
   <xi:include href="footer.xml"/>
  </body>
</html>

...might result in a document like this:

<html
  xml:base="file://files/"
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:xi="http://www.w3.org/2001/XInclude">
  <head>
    <title>
      Templating With XInclude and SAX2
    </title>
  </head>
  <body>
<div class="header">
 <h1>Common Header</h1>
</div>
<hr width="80%"></hr>
<div class="content">
 <p>
   Now is the winter of our
   discontent made glorious
   summer by the son of York.
 </p>
</div>
<hr width="80%"></hr>
<div class="footer">
 <p>Common Footer</p>
</div>
  </body>
</html>

Conclusions

SAX is an important XML technology that, like Perl, keeps simple things simple and makes hard things possible. Knowing how to generate SAX events from non-XML data and how to use SAX filters to transform existing document streams are key to a mature understanding of the power that SAX offers. We have only scratched the surface of what SAX filters and generators can do, but I hope that we have at least covered the basics well enough to pique your curiosity and provoke experimentation.