
XSLT 2 and Delimited Lists
As part of his work as the editor of the XSLT 2.0 specification, Michael Kay has been prototyping the new features of XSLT 2.0 and XPath 2.0 in a separate development branch of his well-known Saxon XSLT processor. As I write this, his most recent prototype release is 7.4. (His recommended stable implementation of XSLT 1.0 is at release 6.5.2; see the project homepage for details on the progress of these two branches.) Release 7.4 lets us play with many of XSLT 2.0's new features.
The XSLT 2.0 specification is still a Working Draft, so you don't want to build production code around it, but it's fun to try out some of the features offered by the next generation of XSLT and XPath. In the next few columns, I'll look at some of them. Because XSLT and XPath share most of their functions with XQuery, those functions are defined in a specification of their own rather than in the XPath 2.0 spec: XQuery 1.0 and XPath 2.0 Functions and Operators.
One class of "pervasive
changes" from XSLT 1.0 to 2.0 is "support for sequences as a
replacement for the node-sets of XPath 1.0." Three functions that take
advantage of this let you manipulate tokenized strings:
tokenize(), item-at(), and index-of(). In
theory, start-tags and end-tags are the only delimiters anyone ever needs
in XML, but in practice, plenty of data out there uses other delimiters,
if only for size reasons. Compare the following SVG polygon
element
<polygon points="100,100 140,220 40,145 160,145 60,220"/>
with one that delimits everything with tags:
<poly>
<point><x>100</x><y>100</y></point>
<point><x>140</x><y>220</y></point>
<point><x>40</x><y>145</y></point>
<point><x>160</x><y>145</y></point>
<point><x>60</x><y>220</y></point>
</poly>
The nearly four-fold increase in size makes a big difference for
pictures of any complexity. XSLT developers have longed for some
equivalent of Perl and Python's split functions, which take a string and
an indication of the delimiter to look for and then return an array of
the substrings found between the delimiters. While some XSLT processors
offered an equivalent as an extension function, the tokenize()
function's place on the W3C-specified list of required XSLT 2.0 functions
lets us count on wide, consistent implementation of this function.
Let's look at a demonstration of the tokenize() function and two other
new functions that work very nicely with it. The following stylesheet
works with any input, because it executes all of its instructions upon
seeing the root of the source document and ignores the document's
contents. (All sample stylesheets, input, and output are available in this zip file.)
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:variable name="sampleString">XML,XSLT,XPath,SVG,XPointer</xsl:variable>
<xsl:variable name="tokenizedSample" select="tokenize($sampleString,',')"/>
<xsl:for-each select="$tokenizedSample">
<xsl:value-of select="."/>
<xsl:text>! </xsl:text>
</xsl:for-each>
Second item in tokenizedSample:
{<xsl:value-of select="item-at($tokenizedSample,2)"/>}
Tenth item in tokenizedSample:
{<xsl:value-of select="item-at($tokenizedSample,10)"/>}
Position of SVG in tokenizedSample:
{<xsl:value-of select="index-of($tokenizedSample,'SVG')"/>}
Position of XSL-FO in tokenizedSample:
{<xsl:value-of select="index-of($tokenizedSample,'XSL-FO')"/>}
End of test.
</xsl:template>
</xsl:stylesheet>
The single template rule stores a comma-delimited list of W3C XML
standard names in a variable called sampleString and then passes
that as a parameter to the tokenize() function used to create the
tokenizedSample variable, which stores a sequence of strings. The
second parameter passed to the function, which tells it where to split the
string in the first parameter, is a one-character string consisting of a
comma. The second parameter doesn't have to be a single character: it is
treated as a regular expression, so you can use a pattern such as "\s+"
for "one or more whitespace characters," and an optional third parameter
gives you greater control over the regular expression's behavior.
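As a rough sketch of those two options, the stylesheet below splits one string on runs of whitespace and another on the word "and" regardless of case. The sample strings are made up for illustration, and the 'i' flag assumes the flag letters described in the Functions and Operators spec, which a prototype processor may not support yet.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- '\s+' treats each run of whitespace as a single delimiter,
         so this yields the three strings XML, XSLT, and XPath. -->
    <xsl:for-each select="tokenize('XML   XSLT XPath', '\s+')">
      <xsl:text>[</xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>]</xsl:text>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
    <!-- The third parameter 'i' (an assumption about the processor)
         makes the match case-insensitive, so both "AND" and "and"
         count as delimiters: red, green, blue. -->
    <xsl:for-each select="tokenize('red AND green and blue', '\s+and\s+', 'i')">
      <xsl:text>[</xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>]</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```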
The stylesheet's xsl:for-each loop iterates through the string
sequence, outputting an exclamation point and a single space after
each. The comma does not show up in any of the strings, because the
tokenize() function that split the string at the commas throws
these delimiters out.
The next two instructions in the stylesheet try to pull out specific
strings from the sequence based on their position there. As the
stylesheet's output below illustrates, the first call to item-at()
is successful, returning the string "XSLT". The lack of text between the
second pair of curly braces in the output shows that the second call to
item-at() returns nothing, because the
tokenizedSample sequence has no tenth item.
XML! XSLT! XPath! SVG! XPointer!
Second item in tokenizedSample:
{XSLT}
Tenth item in tokenizedSample:
{}
Position of SVG in tokenizedSample:
{4}
Position of XSL-FO in tokenizedSample:
{}
End of test.
The last two instructions in the stylesheet call the
index-of() function, which returns a number showing the
position of the second parameter in the first one. In the first call of
this function, it returns a 4 for "SVG" as the fourth string in the input
sequence, and in the second call it returns an empty sequence, which
xsl:value-of outputs as nothing, because it didn't find "XSL-FO" in
the sequence.
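Because index-of() hands back a sequence rather than a single number, it can report more than one position, and an empty result is something you can test for. The following is a minimal sketch of both cases, not a tested Saxon 7.4 example: the list is made up, and the empty() function and the separator attribute of xsl:value-of are assumptions taken from the XSLT 2.0 and Functions and Operators drafts that a given prototype may not implement.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Hypothetical list in which "XML" appears twice. -->
    <xsl:variable name="langs" select="tokenize('XML,XSLT,XML,SVG', ',')"/>
    <!-- index-of() returns every matching position: here 1 and 3. -->
    <xsl:text>Positions of XML: </xsl:text>
    <xsl:value-of select="index-of($langs, 'XML')" separator=","/>
    <xsl:text>&#10;</xsl:text>
    <!-- No match gives an empty sequence, which empty() detects. -->
    <xsl:if test="empty(index-of($langs, 'XSL-FO'))">
      <xsl:text>XSL-FO is not in the list.&#10;</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```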
The index-of() and item-at() functions aren't limited to
sequences of strings. You can also use
them with sequences of nodes, making all kinds of element searching and
manipulation tasks easier. For example, with the following input,
<colors>
<color>red</color>
<color>green</color>
<color>blue</color>
<color>yellow</color>
</colors>
this template rule
<xsl:template match="colors">
{<xsl:value-of select="item-at((color),3)"/>}
{<xsl:value-of select="index-of((color),'green')"/>}
</xsl:template>
produces the following output, because "(color)" represents the
sequence of color elements within the colors context
node:
{blue}
{2}
The XPath 2.0 spec has more about the new sequences.
Tokenizing an SVG Attribute
Let's look at a tokenizing example that attacks a more realistic problem, the SVG polygon element shown above.
<xsl:template match="polygon">
  <poly>
    <xsl:for-each select="tokenize(@points,'\s+')">
      <point>
        <x><xsl:value-of select="substring-before(.,',')"/></x>
        <y><xsl:value-of select="substring-after(.,',')"/></y>
      </point>
    </xsl:for-each>
  </poly>
</xsl:template>
The beginning of this column shows the input and output. The input is
an SVG polygon element that has a space between each pair of
numbers that represent a point of the polygon and a comma between the x
and y coordinates of each pair. Instead of storing the tokenized sequence
in a separate variable as the previous example did, this example calls
tokenize() directly in the select attribute of its xsl:for-each
instruction, which iterates through the sequence of returned strings and
outputs the contents of each inside of a point element. The
tokenize() function would have worked on the polygon input if the
second parameter passed to it had been a simple, one-character string of a
single space, but the regular expression "\s+" is even better, because
specifying that the delimiter is one or more whitespace characters in a
row lets the function handle any combination of carriage returns, line
feeds, tabs, and spacebar spaces between number pairs.
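To see why the choice of delimiter matters, here is a small sketch that counts the tokens each pattern produces for a made-up points value with two spaces after the first pair. It assumes the behavior described in the Functions and Operators spec, where two adjacent delimiters produce a zero-length string between them.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Hypothetical value: note the two spaces between the pairs. -->
    <xsl:variable name="points" select="'100,100  140,220'"/>
    <!-- A single-space delimiter splits at each space, leaving an
         empty string between the two pairs: three items. -->
    <xsl:value-of select="count(tokenize($points, ' '))"/>
    <xsl:text> vs. </xsl:text>
    <!-- '\s+' treats the run of spaces as one delimiter: two items. -->
    <xsl:value-of select="count(tokenize($points, '\s+'))"/>
  </xsl:template>
</xsl:stylesheet>
```

With the single-space version, the template rule above would emit an extra point element with empty x and y children for the empty string.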
Within the point element, the template could have used the
tokenize() function to split apart the x and y values, but it's
less code to just use the XPath 1.0 substring-before() and
substring-after() functions. The tokenizing function is great
when you don't know how long a list is, but when there are always exactly
two items, one on either side of a single delimiter, it only takes two
function calls to pull them out.
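For comparison, here is a sketch of the tokenize() alternative for the x and y values. The variable name xy is made up, and the positional predicates [1] and [2] stand in for item-at() on the assumption that predicates on a sequence are the more widely supported way to pick an item by position.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="polygon">
    <poly>
      <xsl:for-each select="tokenize(@points, '\s+')">
        <!-- Split one "x,y" pair into a two-item sequence. -->
        <xsl:variable name="xy" select="tokenize(., ',')"/>
        <point>
          <!-- Positional predicates pick the first and second items. -->
          <x><xsl:value-of select="$xy[1]"/></x>
          <y><xsl:value-of select="$xy[2]"/></y>
        </point>
      </xsl:for-each>
    </poly>
  </xsl:template>
</xsl:stylesheet>
```

It needs an extra variable and does the same job, which is why the substring functions win here.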
Tokenizing Past and Future XML Data
The combination of tokenize(), item-at(), and
index-of() lets you take advantage of something that's always been
around in XML 1.0, but that you couldn't do much with before: attributes
of type NMTOKENS. You could always declare an attribute to be of this type and
then store multiple values in it separated by spaces, but splitting up
these lists required either the Perl split function, its equivalent in
another language, or lots of code to split it up when using a language
that didn't offer such a function, like XSLT 1.0. Now a single function
can split it for us, another can check the list for a particular value,
and another can pull out a particular item from the list based on its
order in the list. I know I'll be using these functions often.
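Here is a minimal sketch of that workflow. The doc element and its status attribute are hypothetical, the exists() function is assumed from the Functions and Operators spec, and a positional predicate is used in place of item-at() to pick an item by position.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Hypothetical input: <doc status="draft internal reviewed"/> -->
  <xsl:template match="doc">
    <!-- normalize-space() trims leading and trailing whitespace and
         collapses runs of it, so a single space is a safe delimiter
         even when no DTD has normalized the attribute value. -->
    <xsl:variable name="flags"
      select="tokenize(normalize-space(@status), ' ')"/>
    <!-- Check the list for a particular value. -->
    <xsl:if test="exists(index-of($flags, 'internal'))">
      <xsl:text>Internal document.&#10;</xsl:text>
    </xsl:if>
    <!-- Pull out an item by its position in the list. -->
    <xsl:text>First flag: </xsl:text>
    <xsl:value-of select="$flags[1]"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```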