New and Improved String Handling

August 6, 2003

Bob DuCharme

In my June column last year, I discussed XSLT 1.0 techniques for comparing two strings for equality and doing the equivalent of a "search and replace" on your source document. XSLT 2.0 makes both of these so much easier that describing the new techniques won't quite fill up a column, so I'll also describe some 1.0 and 2.0 functions for concatenating strings. Notice that I say "1.0" and "2.0" without saying "XSLT"; that's because these are actually XPath functions available to XQuery users as well as XSLT 2.0 users. The examples we'll look at demonstrate what they bring to XSLT development.

String Comparison

The string comparison techniques described before were really boolean tests that told you whether two strings were equal or not. The new compare() function does more than that: it tells whether the first string is less than, equal to, or greater than the second according to the rules of collation used. "Rules of collation" refers to the sorting rules, which can apparently be tweaked to account for the spoken language of the content. (The XQuery 1.0 and XPath 2.0 Functions and Operators document tells us that "Some collations, especially those based on the Unicode Collation Algorithm can be 'tailored' for various purposes. This document does not discuss such tailoring.")

The following stylesheet, which can be run with any document as a source document, has six calls to the compare() function. (All XSLT 2.0 examples were tested with version 7.6.5 of Michael Kay's Saxon XSLT processor.)

<xsl:stylesheet version="2.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:variable name="color">red</xsl:variable>
  <xsl:template match="/">
    1. 'qed' and 'red': <xsl:value-of select="compare('qed','red')"/>
    2. 'red' and $color: <xsl:value-of select="compare('red',$color)"/>
    3. 'red' and ' red ': <xsl:value-of select="compare('red',' red ')"/>
    4. 'red' and normalize-space(' red '): <xsl:value-of
              select="compare('red',normalize-space(' red '))"/>
    5. 'RED' and $color: <xsl:value-of select="compare('RED',$color)"/>
    6. upper-case('RED') and upper-case($color): <xsl:value-of
              select="compare(upper-case('RED'),upper-case($color))"/>
  </xsl:template>
</xsl:stylesheet>

Before discussing the individual calls, let's look at the result of running the stylesheet:

    1. 'qed' and 'red': -1
    2. 'red' and $color: 0
    3. 'red' and ' red ': 1
    4. 'red' and normalize-space(' red '): 0
    5. 'RED' and $color: -1
    6. upper-case('RED') and upper-case($color): 0

The compare() function returns a -1 if a sort would put the string in its first argument before the one in its second argument, 1 if it would come after, and 0 if the two arguments are equal. (The function only works with strings. Use the <, =, and > operators to compare other data types such as numbers, dates, and booleans.) Line 1 of the stylesheet result shows that "qed" is alphabetically less than "red" because "q" comes before "r" in the alphabet. Line 2 shows the use of a variable as an argument to compare(); a variable storing the string "red" is equal to the literal string "red".

    

Lines 3 and 4 demonstrate an issue from my earlier column on comparing strings: dealing with extra spaces. A space character sorts before any letter of the alphabet, so " red " sorts before "red" and the call to compare() in line 3 returns 1. Line 4 shows that enclosing the string " red " in a call to the normalize-space() function trims the leading and trailing spaces, thereby passing the string "red" to the compare() function. This is particularly handy when comparing the contents of an element to another string because the use of spaces in XML documents is often inconsistent.

The last two lines demonstrate the effect of case on string comparison. Line 5 shows that a sort would put the upper-case string "RED" before the lower-case string "red". While the compare() function offers no option for a case-insensitive string comparison, it's easy enough to do: use the new upper-case() function to convert both arguments to upper-case and compare those. This way, whether your two arguments are "red" and "RED" or "rEd" and "ReD", the string comparison won't care about the case of the letters.
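As a minimal sketch of how the two normalizing tricks combine, the following stylesheet trims spaces and folds case on both arguments before comparing them, then branches on the result with xsl:choose. The test strings are made up for illustration, and the stylesheet assumes a processor that implements the XPath 2.0 function library as finally specified, using its default collation; I have not run it against the Saxon version mentioned above.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:variable name="color">red</xsl:variable>
  <xsl:template match="/">
    <!-- Hypothetical test values with mixed case and stray spaces. -->
    <xsl:for-each select="('red', ' RED ', 'rEd', 'blue')">
      <!-- Trim spaces and fold case on both sides before comparing. -->
      <xsl:variable name="result"
        select="compare(upper-case(normalize-space(.)),
                        upper-case(normalize-space($color)))"/>
      <xsl:value-of select="concat('[', ., '] ')"/>
      <!-- Branch on the three possible return values. -->
      <xsl:choose>
        <xsl:when test="$result = 0">matches</xsl:when>
        <xsl:when test="$result lt 0">sorts before</xsl:when>
        <xsl:otherwise>sorts after</xsl:otherwise>
      </xsl:choose>
      <xsl:value-of select="concat(' [', $color, ']&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The first three test values should be reported as matching "red" despite their differences in case and spacing, while "blue" should be reported as sorting before it.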

Search and Replace

The XPath 1.0 translate() function lets you map individual characters to other characters, but if your search target or replacement string is more than one character long, it isn't much help. A recursive named template can do the job, but it's a lot of trouble for programmers used to text manipulation languages such as awk, Perl, and Python, where arbitrary string replacement can be done with much less code. The XPath 2.0 replace() function makes this much easier.

The function takes three required arguments: the string to act on, the pattern to search for in that string, and the string that replaces each match. The pattern is treated as a regular expression, so characters such as "." and "*" have special meaning unless you escape them. An optional fourth argument lets you specify flags, including "m" to operate in multiline mode and "i" to ignore case.

The function returns a copy of the first argument after making any replacements. This is so much simpler than the XSLT 1.0 hack for doing the same thing (which certainly didn't bother with multiline mode or case sensitivity options) that the 43-line stylesheet from my earlier column on comparing and replacing strings can be rewritten in 15 lines using 2.0:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="2.0">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="text()">
    <xsl:value-of select="replace(.,'finish','FINISH')"/>
  </xsl:template>

  <xsl:template match="@*|*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The input and output are identical to those of the 1.0 version shown in the earlier column.
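The stylesheet above uses only the three required arguments. As a rough sketch of the optional flags argument, and of the fact that the search pattern is a regular expression, the following stylesheet can be run with any source document. The sample strings are invented for illustration, and it assumes a processor that follows the final XPath 2.0 definition of replace().

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- The "i" flag makes the pattern match "Finish" and "finish" alike. -->
    <xsl:value-of select="replace('Finish the finish', 'finish', 'FINISH', 'i')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- The pattern is a regular expression, so a literal period is escaped. -->
    <xsl:value-of select="replace('v1.0', '\.', '-')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Unescaped, the period matches every character in the string. -->
    <xsl:value-of select="replace('v1.0', '.', '-')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The three calls should produce "FINISH the FINISH", "v1-0", and "----" respectively. The last one is a reminder to escape regular expression metacharacters when you mean them literally.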

Performing Multiple Search and Replace Operations

What if, in addition to replacing "finish" with "FINISH" as shown above, I also want to replace the string "flavors" with "tastes" and "11" with "22"? In a procedural programming language, you might do this to a string called "testString" with code following this model:

  ; pseudo-code. NOT XSLT!
  testString = replace(testString,'finish','FINISH');
  testString = replace(testString,'flavors','tastes');
  testString = replace(testString,'11','22');

XSLT, however, is not a procedural language. Like its ancestors Lisp and Scheme, it's a functional language. We don't write a series of instructions to be executed one after the other; we combine functions into expressions that return values. In the following revision of the match="text()" template rule from above, the string returned by each call to replace() is passed as the first argument to another call, so the innermost replacement happens first:

<xsl:template match="text()">
  <xsl:value-of select="replace(
                          replace(
                            replace(.,'11','22'),
                            'flavors',
                            'tastes'
                          ),
                          'finish',
                          'FINISH'
                        )"/>
</xsl:template>

I tried to use some Lisp/Scheme whitespace conventions to make it more readable, but as you can see, it wasn't entirely successful.
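One possible alternative, offered only as a sketch, is to bind each intermediate result to its own variable. Nothing is reassigned here, so the approach stays within XSLT's functional model; each variable is simply a name for one step of the nested expression. The names $step1 and $step2 are my own invention.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="text()">
    <!-- Each variable holds the result of one replacement step. -->
    <xsl:variable name="step1" select="replace(., '11', '22')"/>
    <xsl:variable name="step2" select="replace($step1, 'flavors', 'tastes')"/>
    <xsl:value-of select="replace($step2, 'finish', 'FINISH')"/>
  </xsl:template>

  <!-- Copy everything else unchanged, as in the earlier stylesheet. -->
  <xsl:template match="@*|*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

This trades a few extra lines for an order of operations that reads top to bottom, which may be easier to maintain as the list of replacements grows.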

Concatenating Strings

The XPath 1.0 concat() function returns the two or more strings passed to it as one string. We saw its use in the column on the XSLT 1.0 version of search and replace, as well as in the column on Setting and Using Variables and Parameters. Of course, adding two text nodes to the result tree one right after the other essentially concatenates them, and this is used even more often than the concat() function.
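To make the two approaches concrete, here is a small sketch that builds the same sentence both ways. It uses only XSLT 1.0 features and can be run with any source document; the $color variable and the sentence itself are just illustrative.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:variable name="color" select="'red'"/>
  <xsl:template match="/">
    <!-- concat() joins its arguments in order with nothing between them. -->
    <xsl:value-of select="concat('The color is ', $color, '.')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Adjacent text written to the result tree ends up concatenated too. -->
    <xsl:text>The color is </xsl:text>
    <xsl:value-of select="$color"/>
    <xsl:text>.</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Both halves of the template should write "The color is red." to the output.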

One classic XML element manipulation problem is the output of a collection of nodes as a delimited list. For example, to output the values of the color elements in the following source document as a comma-delimited list, we can't just output each one with a comma after it, because we don't want to put a comma after the last one.

<colors>
  <color>red</color>
  <color>blue</color>
  <color>yellow</color>
  <color>green</color>
</colors>

A typical XSLT 1.0 approach is to use an xsl:for-each element to output them and an xsl:if to output a comma after each one that isn't the last in the selected node list.

<xsl:template match="colors">
  <xsl:for-each select="color">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">
      <xsl:text>, </xsl:text>
    </xsl:if>
  </xsl:for-each>
</xsl:template>

XPath 2.0's string-join() function lets you do this much more concisely. It takes two arguments: a sequence (an "ordered collection of zero or more items" according to the XQuery 1.0 and XPath 2.0 Data Model document) and a delimiter to use when returning the list. Look how much less code is necessary to achieve the same result in XSLT 2.0:

<xsl:template match="colors">
  <xsl:value-of select="string-join(color,', ')"/>
</xsl:template>

This is really doing the opposite of the tokenize() function that we learned about in the May column.
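A quick way to see that relationship is to split a string with tokenize() and then reassemble the pieces with string-join(). The following is only a sketch with a made-up input string and delimiter, assuming a processor that implements both functions as finally specified in XPath 2.0.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- tokenize() splits one string into a sequence of strings. -->
    <xsl:variable name="colors"
      select="tokenize('red,blue,yellow,green', ',')"/>
    <!-- string-join() glues a sequence back into one string. -->
    <xsl:value-of select="string-join($colors, ' | ')"/>
  </xsl:template>
</xsl:stylesheet>
```

The output should be "red | blue | yellow | green": the same four values, with the original commas swapped for a different delimiter.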

New features such as data typing and a new data model may make XSLT and XPath 2.0 look radically different from their 1.0 counterparts, but many of these new features are straightforward functions that are familiar from other popular programming languages. The compare(), replace(), and string-join() functions, which will make common coding tasks go more quickly with less room for error, are great examples of this.