XPathScript: An Alternative To XSLT

July 5, 2000

Introduction

Table of Contents

• The Syntax
• The XPathScript API
• Declarative Templates
• A Complete Example
• Stepping Through The Example
• The Template Hash
• The "testcode" Option
• Copying Styles
• Conclusion

XPathScript is a stylesheet language for transforming XML documents into other formats. It has only a few features, but by combining those features with the power and flexibility of Perl, XPathScript is a very capable system. Like all XML stylesheet languages, including XSLT, an XPathScript style sheet is always executed in the context of a source XML file. In many cases, the source XML file will actually define which style sheets to use via the <?xml-stylesheet?> processing instruction.

XPathScript was conceived as part of AxKit--an application server environment for Apache servers running mod_perl (see my Introduction to AxKit article, which is recommended reading before continuing with this one). XPathScript's primary goal was to achieve the kind of transformations that XSLT can do, without being restricted by XSLT's XML-based syntax, and to provide full programming facilities within that environment. I also wanted it to be completely agnostic about output formats, without having to program in special after-effect filters. The result is a language for server-side transformation that provides the power and flexibility of XSLT, combined with the full capabilities of the Perl language, and the ability to produce style sheets in any ASP-capable or ordinary text editor.

The Syntax

XPathScript follows the basic ASP syntax of introducing code with the <% %> delimiters. Here's a brief example of a fully compatible XPathScript style sheet:


<html>

 <body>

  <%= 5+5 %>

 </body>

</html>

This simply outputs the value 10 in an HTML document. The delimiters used here are the <%= %> delimiters, which are slightly different in that they send the results of the expression to the browser (or to the next processing stage in AxKit). This example does absolutely nothing with the source XML file, which is completely separate from this style sheet. Here's another example:


<html>

 <body>

  <% $foo = 'World' %>

Hello

  <%= $foo %> !!!

 </body>

</html>

This outputs the text "Hello World !!!". Again, we're not actually doing anything here with our source document, so all XML files using this style sheet will look identical. This seems rather uninteresting, until we discover the library of functions that are accessible to our XPathScript style sheets for accessing the source document contents.
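
To make the two delimiter forms concrete, here is a minimal sketch of how an ASP-style template could be turned into Perl and run. It is only an illustration under stated assumptions: render_template is a hypothetical helper invented for this example, not part of XPathScript or AxKit, and the real processor may work quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XPathScript): compile an ASP-style
# template into Perl source, run it, and return the generated text.
sub render_template {
    my ($template) = @_;
    my $code = "my \$out = '';\n";

    # The capturing group makes split keep the <% ... %> blocks as chunks.
    for my $chunk (split /(<%=?.*?%>)/s, $template) {
        next if $chunk eq '';
        if ($chunk =~ /^<%=(.*)%>$/s) {
            # <%= expr %>: evaluate the expression and append its value.
            $code .= "\$out .= do { $1 };\n";
        }
        elsif ($chunk =~ /^<%(.*)%>$/s) {
            # <% code %>: run the code as-is; it adds nothing to the output.
            $code .= "$1;\n";
        }
        else {
            # Literal text: escape quotes and backslashes, append verbatim.
            (my $text = $chunk) =~ s/([\\'])/\\$1/g;
            $code .= "\$out .= '$text';\n";
        }
    }
    $code .= "\$out;\n";

    my $result = eval $code;
    die $@ if $@;
    return $result;
}

print render_template(q{<b><% my $foo = 'World' %>Hello <%= $foo %> !!!</b>}), "\n";
```

Because this sketch runs under strict, the template declares $foo with my; the plain <% %> block contributes nothing to the output, while the <%= %> block contributes the value of its expression.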

The XPathScript API

Along with the code delimiters, XPathScript provides stylesheet developers with a full API for accessing and transforming the source XML file. This API can be used in conjunction with the delimiters above to provide a stylesheet language that is as powerful as XSLT, and yet provides all the features of a full programming language (in this case, Perl, but I'm certain that other implementations such as Python or Java would be possible).

Extracting Values

A simple example to get us started is to use the API to bring in the title from a DocBook article. A DocBook article title looks like this:


<article>

 <artheader>

  <title>XPathScript: An Alternative To XSLT</title>

  ...

The XPath expression to retrieve the text in the title element is


/article/artheader/title/text()

To make this text into the HTML title, we need the following XPathScript style sheet:


<html>

 <head>

  <title><%= findvalue("/article/artheader/title/text()") %></title>

 </head>

 <body>

  This was a DocBook Article. We're only extracting the title for now!

  <p>

  The title was: <%= findvalue("/article/artheader/title/text()") %>

 </body>

</html>

The syntax we are using to find the document node we want is XPath. XPath is a W3C Recommendation for finding and matching XML document nodes. The specification is fairly readable and is at http://www.w3.org/TR/xpath. Alternatively, I can recommend Norm Walsh's XPath introduction. It covers a slightly older version of the specification, but I didn't notice anything in it that is missing from, or different in, the current Recommendation.

Extracting Nodes

The above example showed us how to extract single values, but what if we wish to extract a list of values? Here's how we might get a table of contents from DocBook article sections:


...

<%

for my $sect1 (findnodes("/article/sect1")) {

 print findvalue("title/text()", $sect1), "<br>\n";

 for my $sect2 (findnodes("sect2", $sect1)) {

  print " + ", findvalue("title/text()", $sect2), "<br>\n";

  for my $sect3 (findnodes("sect3", $sect2)) {

   print " + + ", findvalue("title/text()", $sect3), "<br>\n";

  }

 }

}

%>

...

This gives us a table of contents down to three levels (adding links to the actual part of the document is left as an exercise). The first call to findnodes gives us all sect1 nodes that are children of the root element (article). The XPath expressions following that are relative to the current node. You can see that by the absence of the leading /.

Note in the above how we specify the current $sectX variable in the calls to the API. This is the context for the XPath expression, and it is vital so that we get the right values for the expression. The context in XPathScript is never set automatically. XSLT authors, who are used to having the context set for them, may find this surprising. This way, however, we have some added flexibility, in that you can always specify your own context, and pass context nodes around in your script.

Declarative Templates

The examples up to now have all been based around a single global template with search/replace type functionality from the source XML document. This is a powerful concept in itself, especially when combined with loops and the ability to change the context of searches. But that style of template is best suited to well-structured data; it is much less useful for processing large documents. In order to ease the processing of documents, XPathScript includes a declarative template processing model too, so that you can simply specify the format for a particular element and let XPathScript do the work for you.

In order to support this method, XPathScript introduces one more API function: apply_templates(). The name is intended to appeal to people already familiar with XSLT. The apply_templates() function takes either a list of start nodes, or an XPath expression (which must result in a node set) and optional context. Starting at the start nodes, it traverses the document tree applying the templates defined by the $t hash reference.

First, a simple example to introduce this feature. Let's assume for a moment that our source XML file is valid XHTML, and we want to change all anchor links to italics. Here is the very simple XPathScript template that will do that:


<%

$t->{'a'}{pre} = '<i>';

$t->{'a'}{post} = '</i>';

$t->{'a'}{showtag} = 1;

%>

<%= apply_templates() %>

Note that apply_templates() has to be called using <%= %>. That's because apply_templates() actually returns a string representation of the transformation--it doesn't do the output to the browser for you.

The first thing this example does is set up a hash reference $t that XPathScript knows about. The keys of $t are element names (including namespace prefix, if we are using namespaces). The hash can have the following sub-keys:

pre
post
showtag
testcode

We'll cover testcode in more depth later in The Template Hash, but we'll note here that it is a placeholder for code that allows for more complex templates.

Unlike XSLT's declarative transformation syntax, the keys of $t do not specify XPath match expressions. Instead they are simple element names. This is a trade-off between speed of execution and flexibility. Perl hash lookups are extremely quick compared to XPath matching. Luckily, because of the testcode option, more complex matches are quite possible with XPathScript.

The simple explanation for now is that pre specifies output to appear before the tag, post specifies output to appear after the tag, and showtag specifies that the tag itself should be output as well as the pre and post values.

A Complete Example

Now let's put all these ideas together into an (almost) complete example. This is part of the style sheet I use to process my DocBook articles online:


<!--#include file="docbook_tags.xps"-->

<%



my %links;

my $linkid = 0;

$t->{'ulink'}{testcode} = sub { 

  my $node = shift;

  my $t = shift;

  my $url = findvalue('@url', $node);

  if (!exists $links{$url}) {

   $linkid++;

   $links{$url} = $linkid;

  }

  my $link_number = $links{$url};

  $t->{pre} = "<i><a href=\"$url\">";

  $t->{post} = " [$link_number]</a></i>";

  return 1;

 };



%>

<html>

<head>

 <title><%= findvalue('/article/artheader/title/text()') %></title>

</head>

<body bgcolor="white">



<%

# display title/TOC page

print apply_templates('/article/artheader/*');

%>



<hr>



<%

# display particular page

foreach my $section (findnodes("/article/sect1")) {

 print apply_templates($section);

}

%>



<h1>List of Links</h1>

<table border="1">

<tr><th>URL</th></tr>

<%

for my $link (sort {$links{$a} <=> $links{$b}} keys %links) {

%>

<tr>

<td><%= "[$links{$link}] $link" %></td>

</tr>

<% } %>

</table>



</body>

</html>

The first line imports a library of tags that are shared between this style sheet and one that is easier for web viewing with clickable links between sections (which can be downloaded here). The import system is based on Server Side Includes (SSI), although only SSI file includes are supported at this time (SSI virtual includes can be implemented using mod_include). Here is part of the docbook_tags.xps file:


<%



$t->{'attribution'}{pre} = "<i>";

$t->{'attribution'}{post} = "</i><br>\n";



$t->{'para'}{pre} = '<p>';

$t->{'para'}{post} = '</p>';



$t->{'ulink'}{testcode} = sub { 

  my $node = shift;

  my $t = shift;

  $t->{pre} = "<i><a href=\"" .

      findvalue('./@url', $node) . "\">";

  $t->{post} = '</a></i>';

  return 1;

 };



$t->{'title'}{testcode} = sub { 

  my $node = shift;

  my $t = shift;

  if (findvalue('parent::blockquote', $node)) {

   $t->{pre} = "<b>";

   $t->{post} = "</b><br>\n";

  }

  elsif (findvalue('parent::artheader', $node)) {

   $t->{pre} = "<h1>";

   $t->{post} = "</h1>";

  }

  else {

   my $parent = findvalue('name(..)', $node);

   if (my ($level) = $parent =~ m/sect(\d+)$/) {

    $t->{pre} = "<h$level>";

    $t->{post} = "</h$level>";

   }

  }



  return 1;

 };



%>

Stepping Through The Example

Careful readers will note that the first thing we see is a $t specification for <ulink> tags, and that the included docbook_tags.xps file also contains a specification for <ulink>. This is to override the default behavior for <ulink> tags in the print version of my articles, in order to contain a reference that we can use later in a list of links. We can also see that this specification uses a testcode parameter that we haven't encountered before. We'll see how and why that's used later in The Template Hash.

Next, we see the findvalue() function used exactly as we saw above in Extracting Values. Then we have a section with a comment marked "display title/TOC page". This uses the apply_templates() function with an XPath expression. Note that rather than use the <%= %> delimiters around the apply_templates() call, we simply use the print function. This has the same effect, and is used here to show the flexibility in this approach.

The main part of the code loops through all sect1 tags, and calls apply_templates() on those nodes. Note how this is another demonstration of Perl's TMTOWTDI (There's More Than One Way To Do It) approach--the same code could have been written as follows:


<%= apply_templates("/article/sect1") %>

Finally, because this is the print version of our article, we provide a list of links so that people viewing a printed version can type in those links, and so that they can also refer to the link by reference number, as we saw earlier. We use the hash of links in the %links variable that we built in the testcode handler for our ulink template.
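
The link bookkeeping can be tried outside of a style sheet. The following is a minimal sketch of the same idea in plain Perl; link_number is a helper name invented for this illustration, and the URLs are sample values standing in for the url attributes of <ulink> tags.

```perl
use strict;
use warnings;

my %links;       # URL => reference number
my $linkid = 0;  # last number handed out

# Return the reference number for a URL, assigning the next free
# number the first time that URL is seen.
sub link_number {
    my ($url) = @_;
    $links{$url} = ++$linkid unless exists $links{$url};
    return $links{$url};
}

# Sample URLs standing in for the url attributes of <ulink> tags.
link_number($_) for qw(
    http://axkit.org/
    http://www.w3.org/TR/xpath
    http://axkit.org/
);

# List the links in the order in which they were first referenced.
for my $url (sort { $links{$a} <=> $links{$b} } keys %links) {
    print "[$links{$url}] $url\n";
}
```

A URL that appears more than once keeps the number it was given the first time, which is why the hash is checked with exists before a new number is handed out.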

The other file, docbook_tags.xps, is included (only in part here) to demonstrate a few of the transformations we're applying to various DocBook article tags. We can see that we're turning <para> tags into <p> tags, and doing some more complex processing with testcode to <title> tags. The next section provides more detail on what can be achieved with testcode.

The Template Hash

The apply_templates() function iterates over the nodes specified as parameters, applying the templates in the $t hash reference. This is the most important feature of XPathScript, because it allows you to define the appearance for individual tags without having to do it programmatically. This is the declarative part of XPathScript.

There is an important point to make here: XSLT is a purely declarative syntax, and people are having to work procedural code into XSLT via workarounds. XPathScript takes a much more pragmatic approach (much like Perl itself)--it is both declarative and procedural, allowing you the flexibility to use real code for real problems. It is important to note that apply_templates() returns a string, so you must either call print apply_templates('path') from within a Perl section of code, or use <%= apply_templates('path') %>.

The keys of $t are the names of the elements, including namespace prefixes. When you call apply_templates(), every element visited is looked up in the $t hash, and the template items stored in that hash are applied to the node. It's worth noting at this point that, unlike XSLT, XPathScript does not perform tree transformations from one tree to another. It simply sends its output to the browser directly. This has advantages and disadvantages, a discussion of which is beyond the scope of this article.

The following sub-keys define the transformation:

pre - the output to occur before the tag.
post - the output to occur after the tag.
prechildren - the output to occur before the children of this tag are output.
postchildren - the output to occur after the children of this tag are output.
prechild - the output to occur before each child of this tag.
postchild - the output to occur after each child of this tag.
showtag - set to a true value to display the tag as well as the pre and post values. If unset or false, the tag itself is not displayed.
testcode - code to execute upon visiting this tag. See below.

The showtag option is mostly equivalent to the XSLT <xsl:copy> tag, only less verbose. The pre and post options are useful, because generally in transformations we want to specify what comes before and after a tag. For example, to change an HTML A tag to be in italics but still have the link, we would use the following:


$t->{A}{pre} = "<i>";

$t->{A}{post} = "</i>";

$t->{A}{showtag} = 1;
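
To show how the sub-keys might fit together, here is a toy tree walker in plain Perl. It is a sketch under several assumptions, not the XPathScript implementation: nodes are modelled as nested hashes rather than real XML nodes, attributes are ignored, the order in which the pieces are emitted is my reading of the descriptions above, and elements without a template are assumed to be copied with their tags (as the XHTML example earlier suggests).

```perl
use strict;
use warnings;

# Assumed node shape for this sketch: text nodes are plain strings and
# elements are hashes like { name => 'a', children => [ ... ] }.
sub apply {
    my ($t, $node) = @_;
    return $node unless ref $node;    # text node: output as-is

    # Assumption: an element with no template is copied with its tags.
    my $tmpl = $t->{ $node->{name} } || { showtag => 1 };
    my $get  = sub { defined $tmpl->{ $_[0] } ? $tmpl->{ $_[0] } : '' };

    my $out = $get->('pre');
    $out .= "<$node->{name}>" if $tmpl->{showtag};
    $out .= $get->('prechildren');
    for my $child (@{ $node->{children} || [] }) {
        $out .= $get->('prechild') . apply($t, $child) . $get->('postchild');
    }
    $out .= $get->('postchildren');
    $out .= "</$node->{name}>" if $tmpl->{showtag};
    $out .= $get->('post');
    return $out;
}

my $t = {
    list => { prechildren => '<ul>', postchildren => '</ul>',
              prechild    => '<li>', postchild    => '</li>' },
    item => {},    # no showtag: the tag is dropped, the children are kept
    a    => { pre => '<i>', post => '</i>', showtag => 1 },
};

my $doc = { name => 'list', children => [
    { name => 'item', children => ['one'] },
    { name => 'item', children => [ { name => 'a', children => ['two'] } ] },
] };

print apply($t, $doc), "\n";
```

The list and item names are made up for the example; the point is that prechild and postchild wrap each child, while prechildren and postchildren wrap the whole group.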

The "testcode" Option

The testcode option is where we perform really powerful transformations. It's how we can do more complex tests on the nodes, and locally modify the transformation based on what we find.

The value stored in testcode is simply a reference to a subroutine. In Perl, these are incredibly simple to create using the anonymous sub keyword. The sub is called every time one of these elements is visited. The subroutine is passed two parameters: the node itself, and an empty hash reference that you can populate using the pre, post, prechildren, prechild, postchildren, postchild and showtag values that we've discussed already. Unlike the global $t hashref, you don't have to first specify the element name as a key. Here's the <ulink> example from the global tags code above:


$t->{'ulink'}{testcode} = sub { 

 my ($node, $t) = @_;

 $t->{pre} = '<i><a href="' . findvalue('@url', $node) . '">';

 $t->{post} = '</a></i>';

 return 1;

};

The equivalent XSLT code looks like this:


<xsl:template match="ulink">

 <i><a>

  <xsl:attribute name="href">

   <xsl:value-of select="@url"/>

  </xsl:attribute>

  <xsl:apply-templates/>

 </a></i>

</xsl:template>

Note in the XPathScript above that the inner $t is lexically scoped, so changes to it don't affect the outer $t. To save some confusion we might have named that variable $localtransforms, but some people, like me, hate typing....

The return value from the testcode subroutine is important. A return value of 1 means to process this node and continue processing all the children of this node. A return value of -1 means to process this node and stop, and a return value of 0 means do not process this node at all. This is useful in conditional tests, where you may not wish to process the nodes under certain conditions.
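
The three return values can be illustrated with a small stand-alone dispatcher. This is a sketch under assumptions rather than the real implementation: nodes are nested hashes, the kind key stands in for an attribute, only pre and post are handled, and the hash filled in by testcode is assumed to replace the global template for that node.

```perl
use strict;
use warnings;

# Assumed node shape: strings are text, hashes are elements.
sub apply {
    my ($t, $node) = @_;
    return $node unless ref $node;

    my $tmpl   = $t->{ $node->{name} } || {};
    my $action = 1;                       # default: node and children
    if (my $code = $tmpl->{testcode}) {
        my %local;                        # the empty hash handed to testcode
        $action = $code->($node, \%local);
        $tmpl   = \%local;
    }
    return '' if $action == 0;            # 0: skip this node entirely

    my $out = defined $tmpl->{pre} ? $tmpl->{pre} : '';
    if ($action == 1) {                   # 1: descend into the children
        $out .= apply($t, $_) for @{ $node->{children} || [] };
    }
    # With -1 we arrive here without having visited the children.
    $out .= defined $tmpl->{post} ? $tmpl->{post} : '';
    return $out;
}

my $t = {
    note => {
        testcode => sub {
            my ($node, $t) = @_;
            my $kind = $node->{kind} || 'normal';
            return 0 if $kind eq 'draft';          # no output at all
            $t->{pre}  = $kind eq 'private' ? '<p>[withheld]' : '<p>';
            $t->{post} = '</p>';
            return $kind eq 'private' ? -1 : 1;    # -1: pre/post only
        },
    },
};

my @notes = (
    { name => 'note', children => ['public text'] },
    { name => 'note', kind => 'private', children => ['secret text'] },
    { name => 'note', kind => 'draft',   children => ['unfinished text'] },
);
print apply($t, $_), "\n" for @notes;
```

The first note is rendered with its text, the second only with its pre and post values, and the third produces nothing.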

We can do things in XPathScript based on XPath lookups, just as we can in XSLT. While it is a little more verbose than a simple XSLT pattern match, the trade-off is in performance. An example: in XSLT you might match artheader/title and elsewhere you might match title[name(..) != "artheader"]. In XPathScript we can only match "title" in the template hash. But we can use the testcode section to extend the match:


$t->{'title'}{testcode} = sub { 

 my $node = shift;

 my $t = shift;

 if (findvalue('parent::blockquote', $node)) {

  $t->{pre} = "<b>";

  $t->{post} = "</b><br>\n";

 }

 elsif (findvalue('parent::artheader', $node)) {

  $t->{pre} = "<h1>";

  $t->{post} = "</h1>";

 }

 else {

  my $parent = findvalue('name(..)', $node);

  if (my ($level) = $parent =~ m/sect(\d+)$/) {

   $t->{pre} = "<h$level>";

   $t->{post} = "</h$level>";

  }

 }



 return 1;

};

In this code, we check the parent node before performing our modification to the local $t hashref. Particularly useful is the ability to use Perl regular expressions to extract values.

Copying Styles

One feature of XPathScript that is really hard to do with XSLT is to be able to copy a style completely:


<%

$t->{'foo'}{pre} = "<i>";

$t->{'foo'}{post} = "</i>";

$t->{'foo'}{showtag} = 1;



$t->{'bar'} = $t->{'foo'};

%>

While this would be possible in XSLT using entities, it's certainly not very practical or neat. With XPathScript, many tags can share the same template. Be careful though--this is a reference copy, not a deep copy, so the following may not do what you think it should:


<%

$t->{'foo'}{pre} = "<i>";

$t->{'foo'}{post} = "</i>";

$t->{'foo'}{showtag} = 1;



$t->{'bar'} = $t->{'foo'};

$t->{'bar'}{post} = "</i><br>";

%>

Because this is a reference, the last line changes the values for 'foo' as well as 'bar'.
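
If you want 'bar' to start from the 'foo' style and then diverge, one option is to copy the contents into a new hash rather than sharing the reference. The snippet below is a plain Perl sketch of that idea, runnable outside a style sheet (where $t would already exist).

```perl
use strict;
use warnings;

my $t = {};
$t->{foo} = { pre => '<i>', post => '</i>', showtag => 1 };

# Copy the contents into a new anonymous hash instead of sharing the
# reference. A shallow copy is enough here because the values are
# plain scalars.
$t->{bar} = { %{ $t->{foo} } };
$t->{bar}{post} = '</i><br>';

print "foo post: $t->{foo}{post}\n";
print "bar post: $t->{bar}{post}\n";
```

After the copy, changing 'bar' leaves 'foo' untouched.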

A "Catch All"?

Does XPathScript have a "catch all" option for elements that don't have a $t entry? Yes indeed! Simply set $t->{'*'} to the template you want to use. You can even do some really clever things, such as using the testcode section to output a warning to the Apache error log about an unrecognized tag, rather than having to place some output in the resulting document and bother your users!
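
As a rough sketch of the idea, the following plain Perl models the lookup and a catch-all testcode that records unknown tags. The template_for helper and the @unknown array are inventions for this illustration (the array stands in for the error log, which a real style sheet would presumably write to, for example with Perl's warn), and the node is a simple hash rather than a real XML node.

```perl
use strict;
use warnings;

my @unknown;    # stands in for the Apache error log in this sketch

my $t = {
    para => { pre => '<p>', post => '</p>' },
    '*'  => {
        testcode => sub {
            my ($node, $t) = @_;
            push @unknown, "unrecognized tag <$node->{name}>";
            return 1;    # still process the element's children
        },
    },
};

# Hypothetical lookup helper: exact element name first, then the catch-all.
sub template_for {
    my ($t, $name) = @_;
    return exists $t->{$name} ? $t->{$name} : $t->{'*'};
}

for my $name (qw(para blink)) {
    my $tmpl = template_for($t, $name);
    $tmpl->{testcode}->({ name => $name }, {}) if $tmpl->{testcode};
}
print "$_\n" for @unknown;
```

Only the blink element falls through to the catch-all, so it is the only one reported.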

Conclusion

XPathScript brings the power of XPath into a more familiar environment for most web developers. It enables developers to retain their existing investment in mod_perl pages while moving to using XML for underlying content. Its pragmatic mix of the declarative and procedural ensures flexibility and performance.

Resources

• AxKit
• Introduction to AxKit
• XPath
• Norm Walsh's XPath Introduction