XSLT 2 and Delimited Lists

May 7, 2003

As part of his work as the editor of the XSLT 2.0 specification, Michael Kay has been prototyping the new features of XSLT 2.0 and XPath 2.0 in a separate development branch of his well-known Saxon XSLT processor. As I write this, his most recent prototype release is 7.4. (His recommended stable implementation of XSLT 1.0 is at release 6.5.2; see the project homepage for details on the progress of these two branches.) Release 7.4 lets us play with many of XSLT 2.0's new features.

The XSLT 2.0 specification is still a Working Draft, so you don't want to build production code around it, but it's still fun to try out some of the new features offered by the next generation of XSLT and XPath. In the next few columns, I'll look at some of these features. Most of the functions have been moved out of the XPath 2.0 spec into their own specification, XQuery 1.0 and XPath 2.0 Functions and Operators, because they're shared with XQuery.

One class of "pervasive changes" from XSLT 1.0 to 2.0 is "support for sequences as a replacement for the node-sets of XPath 1.0." Three functions that take advantage of this let you manipulate tokenized strings: tokenize(), item-at(), and index-of(). In theory, start-tags and end-tags are the only delimiters anyone ever needs in XML, but in practice, plenty of data out there uses other delimiters, if only for size reasons. Compare the following SVG polygon element

<polygon points="100,100 140,220 40,145 160,145 60,220"/>

with one that delimits everything with tags:

<poly>
  <point><x>100</x><y>100</y></point>
  <point><x>140</x><y>220</y></point>
  <point><x>40</x><y>145</y></point>
  <point><x>160</x><y>145</y></point>
  <point><x>60</x><y>220</y></point>
</poly>

The nearly four-fold increase in size makes a big difference for pictures of any complexity. XSLT developers have longed for some equivalent of Perl and Python's split functions, which take a string and an indication of the delimiter to look for and then return an array of the substrings they found between the delimiters. While some XSLT processors offered an equivalent as an extension function, the tokenize() function's place on the W3C-specified list of required XSLT 2.0 functions lets us count on wide, consistent implementation of this function.

Let's look at a demonstration of tokenize() and two other new functions that work very nicely with it. The following stylesheet works with any input, because it executes all of its instructions upon seeing the root of the source document and ignores the document's contents. (All sample stylesheets, input, and output are available in this zip file.)

<xsl:stylesheet version="2.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">

    <xsl:variable name="sampleString">XML,XSLT,XPath,SVG,XPointer</xsl:variable>

    <xsl:variable name="tokenizedSample" select="tokenize($sampleString,',')"/>

    <xsl:for-each select="$tokenizedSample">
      <xsl:value-of select="."/>
      <xsl:text>! </xsl:text>
    </xsl:for-each>

Second item in tokenizedSample: 
    {<xsl:value-of select="item-at($tokenizedSample,2)"/>}

Tenth item in tokenizedSample: 
    {<xsl:value-of select="item-at($tokenizedSample,10)"/>}

Position of SVG in tokenizedSample:
    {<xsl:value-of select="index-of($tokenizedSample,'SVG')"/>}

Position of XSL-FO in tokenizedSample:
    {<xsl:value-of select="index-of($tokenizedSample,'XSL-FO')"/>}

End of test.
  </xsl:template>

</xsl:stylesheet>

The single template rule stores a comma-delimited list of W3C XML standard names in a variable called sampleString and then passes that as a parameter to the tokenize() function used to create the tokenizedSample variable, which stores a sequence of strings. The second parameter passed to the function, which tells it where to split the string in the first parameter, is a one-character string consisting of a comma. You don't have to pass a single character as the tokenize() function's second parameter; you can even use a regular expression such as "\s+" for "one or more whitespace characters," with an optional third parameter to the function giving you greater control over the regular expression's behavior.
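To make the regular expression and the optional third parameter more concrete, here is a minimal sketch. The sample string and the variable name list are made up for illustration, and the sketch assumes that your processor accepts "i" (case-insensitive matching) as a flag in the third parameter, as the Functions and Operators spec describes.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">

    <!-- Hypothetical sample: names separated by the word "and" in mixed case. -->
    <xsl:variable name="list" select="'XML and XSLT AND XPath'"/>

    <!-- The pattern matches "and" plus the whitespace around it.
         The 'i' flag (an assumption about flag support) ignores case,
         so "AND" counts as a delimiter too. -->
    <xsl:for-each select="tokenize($list, '\s+and\s+', 'i')">
      <xsl:value-of select="."/>
      <xsl:text>! </xsl:text>
    </xsl:for-each>

  </xsl:template>

</xsl:stylesheet>
```

If the flag is honored, this should write the three names XML, XSLT, and XPath, each followed by an exclamation point and a space, with both spellings of the delimiter thrown away.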

The stylesheet's xsl:for-each loop iterates through the string sequence, outputting an exclamation point and a single space after each. The comma does not show up in any of the strings, because the tokenize() function that split the string at the commas throws these delimiters out.

The next two instructions in the stylesheet try to pull out specific strings from the sequence based on their position there. As the stylesheet's output below illustrates, the first call to item-at() is successful, returning the string "XSLT". The lack of text between the second pair of curly braces in the output shows that the second call to item-at() returns nothing, because the tokenizedSample sequence has no tenth item.

XML! XSLT! XPath! SVG! XPointer! 

Second item in tokenizedSample: 
    {XSLT}

Tenth item in tokenizedSample: 
    {}

Position of SVG in tokenizedSample:
    {4}

Position of XSL-FO in tokenizedSample:
    {}

End of test.

The last two instructions in the stylesheet call the index-of() function, which returns the position of its second parameter within the sequence passed as its first. In the first call of this function, it returns a 4 for "SVG" as the fourth string in the input sequence. In the second call it returns an empty sequence, which xsl:value-of outputs as nothing, because it didn't find "XSL-FO" in the sequence.
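If you try these calls on a processor that tracks a later version of the specs than Saxon 7.4 does, note that item-at() is not part of the final XPath 2.0 function library; a positional predicate such as $tokens[2] does the same job. The following sketch uses a made-up list with a repeated value to show the predicate form, and to show that index-of() returns every matching position rather than only the first. The variable name tokens is hypothetical, and the separator attribute on xsl:value-of assumes a processor that supports the finished XSLT 2.0 syntax.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">

    <!-- Hypothetical list in which "XML" appears twice. -->
    <xsl:variable name="tokens" select="tokenize('XML,XSLT,XML,SVG', ',')"/>

    <!-- Positional predicate: selects the second item, like item-at($tokens,2). -->
    <xsl:text>{</xsl:text>
    <xsl:value-of select="$tokens[2]"/>
    <xsl:text>}&#10;</xsl:text>

    <!-- There is no tenth item, so the predicate selects an empty sequence
         and nothing is written between the braces. -->
    <xsl:text>{</xsl:text>
    <xsl:value-of select="$tokens[10]"/>
    <xsl:text>}&#10;</xsl:text>

    <!-- index-of() returns a sequence of all matching positions;
         the separator attribute puts a comma between them. -->
    <xsl:text>{</xsl:text>
    <xsl:value-of select="index-of($tokens, 'XML')" separator=","/>
    <xsl:text>}&#10;</xsl:text>

  </xsl:template>

</xsl:stylesheet>
```

Under those assumptions, the three lines of output should be {XSLT}, {}, and {1,3}.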

The index-of() and item-at() functions aren't limited to sequences of strings. You can also use them with sequences of nodes, making all kinds of element searching and manipulation tasks easier. For example, with the following input,

<colors>
  <color>red</color>
  <color>green</color>
  <color>blue</color>
  <color>yellow</color>
</colors>

this template rule

  <xsl:template match="colors">
    {<xsl:value-of select="item-at((color),3)"/>}
    {<xsl:value-of select="index-of((color),'green')"/>}
  </xsl:template>

produces the following output, because "(color)" represents the sequence of color elements within the colors context node:

    {blue}
    {2}

The XPath 2.0 spec has more about the new sequences.

Tokenizing an SVG Attribute

Let's look at a tokenizing example that attacks a more realistic problem, the SVG polygon element shown above.

<xsl:template match="polygon">
  <poly>
    <xsl:for-each select="tokenize(@points,'\s+')">
      <point>
        <x><xsl:value-of select="substring-before(.,',')"/></x>
        <y><xsl:value-of select="substring-after(.,',')"/></y>
      </point>
    </xsl:for-each>
  </poly>
</xsl:template>

The beginning of this column shows the input and output. The input is an SVG polygon element that has a space between each pair of numbers that represent a point of the polygon and a comma between the x and y coordinates of each pair. Without storing the tokenized sequence in a separate variable as the previous example did, this example's tokenize() function splits them up and its xsl:for-each loop iterates through the sequence of returned strings, outputting the contents of each inside of a point element. The tokenize() function would have worked on the polygon input if the second parameter passed to it had been a simple, one-character string of a single space, but the regular expression "\s+" is even better, because specifying that the delimiter is one or more space characters in a row lets the function handle any combination of carriage returns, tabs, and spacebar spaces between each number pair.

Within the point element, the template could have used the tokenize() function to split apart the x and y values, but it's less code to just use the XPath 1.0 substring-before() and substring-after() functions. The tokenizing function is great when you don't know how long a list is, but when there are always exactly two items, one on either side of a single delimiter, it only takes two function calls to pull them out.
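For comparison, here is a sketch of the tokenize() alternative for splitting each pair. It is a little longer than the substring version, which is the point of the comparison. The variable name xy is made up, the positional predicates $xy[1] and $xy[2] stand in for item-at() on the assumption that your processor supports predicates on sequences of strings, and the call to normalize-space() is my own addition to keep leading or trailing whitespace in the attribute from producing empty tokens.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="polygon">
    <poly>
      <!-- Outer split: one string per "x,y" pair. -->
      <xsl:for-each select="tokenize(normalize-space(@points), '\s+')">
        <!-- Inner split: a two-item sequence holding x and then y. -->
        <xsl:variable name="xy" select="tokenize(., ',')"/>
        <point>
          <x><xsl:value-of select="$xy[1]"/></x>
          <y><xsl:value-of select="$xy[2]"/></y>
        </point>
      </xsl:for-each>
    </poly>
  </xsl:template>

</xsl:stylesheet>
```

The inner tokenize() call would start to pay off if each point carried more than two values, because the predicates scale to any position in the list while substring-before() and substring-after() do not.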

Tokenizing Past and Future XML Data

The combination of tokenize(), item-at(), and index-of() lets you take advantage of something that's always been around in XML 1.0, but that you couldn't do much with before: attributes of type NMTOKENS. You could always declare an attribute to be of this type and then store multiple values in it separated by spaces, but splitting up these lists required either the Perl split function, its equivalent in another language, or, in a language that didn't offer such a function, like XSLT 1.0, lots of code. Now a single function can split the list for us, another can check the list for a particular value, and another can pull out a particular item from the list based on its order in the list. I know I'll be using these functions often.
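As a rough illustration of the NMTOKENS case, the following sketch works on a hypothetical doc element whose keywords attribute holds a space-separated list. The element name, attribute name, and variable name are all invented for the example. A positional predicate stands in for item-at(), on the assumption that your processor supports predicates on sequences of strings.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical input: <doc keywords="xml xslt  xpath"/> -->
  <xsl:template match="doc">

    <!-- normalize-space() trims the ends and collapses runs of whitespace,
         so the split never yields empty strings. -->
    <xsl:variable name="words"
      select="tokenize(normalize-space(@keywords), '\s+')"/>

    <!-- How many values the list holds. -->
    <xsl:text>Count: </xsl:text>
    <xsl:value-of select="count($words)"/>

    <!-- Pull out an item by its position in the list. -->
    <xsl:text>&#10;Second: </xsl:text>
    <xsl:value-of select="$words[2]"/>

    <!-- Check the list for a particular value. -->
    <xsl:text>&#10;Position of xpath: </xsl:text>
    <xsl:value-of select="index-of($words, 'xpath')"/>
    <xsl:text>&#10;</xsl:text>

  </xsl:template>

</xsl:stylesheet>
```

With the sample input shown in the comment, this should report a count of 3, a second item of xslt, and a position of 3 for xpath.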