
Splitting and Manipulating Strings

May 1, 2002

Bob DuCharme

XSLT is a language for manipulating XML documents, and XML documents are text. When you're manipulating text, functions for searching strings and pulling out substrings are indispensable for rearranging documents to create new documents. The XPath string functions incorporated by XSLT give you a lot of power when you're manipulating element character data, attribute values, and any other strings of text that your stylesheet can access. We'll start by looking at ways to use these functions to split up strings and how a PCDATA element might be split into subelements.

To demonstrate the first few functions, we'll use the following simple document:

<poem>
  <verse>Seest thou yon dreary Plain, forlorn and wild,</verse>
  <verse>

       The seat of desolation, void of light,

</verse>
</poem>

(Note how the second verse element begins and ends with some extra spaces and carriage returns -- we'll learn about a function that tells the XSLT processor to ignore them.) The following template adds the complete contents of each verse element in the sample document above to the result tree at line 1 and then demonstrates various ways to pull substrings out of them. Curly braces in the result make it easier to see exactly which substrings are getting pulled out of the verse elements. (Complete stylesheets with these sample templates, along with the input and output used to demonstrate them, are available in this zip file.)

<!-- xq319.xsl: converts xq318.xml into xq320.txt -->

<xsl:template match="verse">
  1. By itself: {<xsl:value-of select="."/>}
  2. {<xsl:value-of select="substring(.,7,6)"/>}
  3. {<xsl:value-of select="substring(.,12)"/>}
  4. {<xsl:value-of select="substring-before(.,'dreary')"/>}
  5. {<xsl:value-of select="substring-after(.,'desolation')"/>}
</xsl:template>

Before talking about the individual functions, let's look at what this stylesheet does to the sample document:

  1. By itself: {Seest thou yon dreary Plain, forlorn and wild,}
  2. {thou y}
  3. {yon dreary Plain, forlorn and wild,}
  4. {Seest thou yon }
  5. {}

  
  1. By itself: {

       The seat of desolation, void of light,

}
  2. {   The}
  3. {e seat of desolation, void of light,

}
  4. {}
  5. {, void of light,

}

The source document has two verse elements, so the "verse" template rule adds two sets of lines 1 through 5 to the result. Each line 1 in the result shows the complete contents of the verse element. For the second verse element, line 1 includes the extra whitespace around the source document's text.

Lines 2 and 3 of the stylesheet demonstrate the substring() function. In line 2, the function call substring(.,7,6) takes the verse element's contents (because "." abbreviates self::node()) and, starting at its seventh character, gets six characters. For the first verse element, it skips the first six characters ("Seest ") to start at the seventh and get the six-character string "thou y". For the second verse element, the six characters to skip on the way to that seventh character are two carriage returns and four spaces, so that the six-character string starting at the seventh character is "   The" (three spaces followed by the three letters you see). Line 3 of the stylesheet has no third parameter to specify the length of the substring to extract, so the substring(.,12) function call starts at the twelfth character and gets everything to the end of the string. For the second verse element, this includes the two carriage returns that end it.
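As a small sketch of how substring() counts characters, the following template applies the function to literal strings. The template itself is hypothetical and not part of the sample files; the point it illustrates is that positions start at 1, not 0.

```xml
<!-- Hypothetical template for illustration; not one of the xq*.xsl files. -->
<xsl:template match="poem">
  <!-- Positions are 1-based: returns "234". -->
  a. {<xsl:value-of select="substring('12345',2,3)"/>}
  <!-- No length argument: returns "345", everything from position 3 on. -->
  b. {<xsl:value-of select="substring('12345',3)"/>}
  <!-- A start position of 0 lies before the string, so only "12" comes back. -->
  c. {<xsl:value-of select="substring('12345',0,3)"/>}
</xsl:template>
```

The third call is worth remembering if you are used to languages that count from zero: the length is still measured from position 0, so one of the three requested characters falls off the front of the string.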

The function call substring-before(.,'dreary') in line 4 of the stylesheet looks for the string passed as the second argument in the string passed as the first argument (., or self::node()). If it finds it, it returns everything in the first argument's string before the first occurrence of the second string. When looking for "dreary" in the first verse element, the function finds it and returns the string "Seest thou yon "; in the second verse element, it doesn't find it, so it returns an empty string and nothing appears between the curly braces of the fourth line for that element.

The function call substring-after(.,'desolation') resembles substring-before except that if it finds the second argument in the first argument's text, it returns the string after that text. The first verse element doesn't have the string "desolation", so nothing appears between the curly braces of the first line 5. The second verse element does have this string, and the XSLT processor puts the characters after it (the string ", void of light," followed by two carriage returns) between the curly braces of the result document's second line 5.
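The two functions can also be nested to pull out text that sits between two markers. Here is a minimal sketch, using a hypothetical template (not one of the sample files) against the same poem document:

```xml
<!-- Hypothetical template for illustration; not one of the xq*.xsl files. -->
<xsl:template match="verse">
  <!-- Inner call: everything after "yon ".
       Outer call: everything in that result before the first comma.
       For the first verse this yields "dreary Plain". The second verse
       has no "yon ", so the inner call returns an empty string and
       the outer call does too. -->
  {<xsl:value-of select="substring-before(substring-after(.,'yon '),',')"/>}
</xsl:template>
```

Because both functions return an empty string when the search string is missing, a nested call like this fails quietly rather than raising an error, which is worth keeping in mind when the input is not as regular as you expect.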

The next stylesheet demonstrates a more diverse group of XPath string functions.

<!-- xq321.xsl: converts xq318.xml into xq322.txt -->

<xsl:template match="verse">
  1. {<xsl:value-of select="concat('length: ',string-length(.))"/>}
  2. <xsl:if test="contains(.,'light')">
       <xsl:text>light: yes!</xsl:text>
     </xsl:if>
  3. <xsl:if test="starts-with(.,'Seest')">
       <xsl:text>Yes, starts with "Seest"</xsl:text>
     </xsl:if>
  4. {<xsl:value-of select="normalize-space(.)"/>}
  5. {<xsl:value-of select="translate(.,'abcde','ABCD')"/>}

</xsl:template>

With the same source document as the previous example, this new stylesheet creates this result:

  1. {length: 46}
  2. 
     
  3. Yes, starts with "Seest"
     
  4. {Seest thou yon dreary Plain, forlorn and wild,}
  5. {Sst thou yon DrAry PlAin, forlorn AnD wilD,}


  
  1. {length: 49}
  2. light: yes!
     
  3. 
     
  4. {The seat of desolation, void of light,}
  5. {

       Th sAt of DsolAtion, voiD of light,

}

Line 1 of this stylesheet demonstrates two functions: string-length(), which returns the number of characters in the string passed as an argument, and concat(), which concatenates its argument strings into one string. The function call concat('length: ',string-length(.)) shows that its arguments don't have to be literal strings; you can also use functions that return strings (or values that can easily be converted into strings, like the number returned by the string-length() function) as arguments. This, along with its ability to accept any number of arguments from two upward, makes concat() a very flexible function.
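To illustrate concat() with more than two arguments, here is a short sketch. The template is hypothetical and only meant to show different kinds of arguments being mixed in one call:

```xml
<!-- Hypothetical template for illustration; not one of the xq*.xsl files. -->
<xsl:template match="verse">
  <!-- Four arguments: a substring, a literal, a number, and another literal.
       The number from string-length() is converted to a string automatically.
       For the first verse this builds "Seest... (46 characters)". -->
  <xsl:value-of select="concat(substring(.,1,5),'... (',string-length(.),' characters)')"/>
</xsl:template>
```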

Lines 2 and 3 of the stylesheet (each of which takes up more than one line of the stylesheet) each have an xsl:if instruction that uses a boolean string function: a function that evaluates a certain condition about a string or strings and returns a boolean true if the condition holds and false otherwise. The first function call, contains(.,'light'), checks whether its first argument contains the string passed as the second argument and returns a boolean true if it does. The source document's first verse element doesn't contain "light", so nothing appears after the first "2" in the result. The second verse element does contain it, so the message "light: yes!" appears in the result.

Line 3's xsl:if instruction has a similar function call in its test attribute: starts-with(.,'Seest'), which returns true only if the string in its first argument starts with the string in its second. This is true for the first verse element, so the message 'Yes, starts with "Seest"' is added to the result tree. It is not true for the second verse element, so nothing appears after its "3".
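XPath 1.0 offers contains() and starts-with() but no matching ends-with() function. One common workaround, sketched here with a hypothetical template and a hypothetical variable named suffix, combines substring() and string-length() to compare the tail of the string:

```xml
<!-- Hypothetical template for illustration; not one of the xq*.xsl files. -->
<xsl:template match="verse">
  <!-- The variable name "suffix" is an assumption made for this sketch. -->
  <xsl:variable name="suffix" select="'wild,'"/>
  <!-- Take the last string-length($suffix) characters and compare them.
       True for the first verse; false for the second, which ends with
       line breaks rather than text. -->
  <xsl:if test="substring(., string-length(.) - string-length($suffix) + 1) = $suffix">
    <xsl:text>Yes, ends with "wild,"</xsl:text>
  </xsl:if>
</xsl:template>
```

If the string is shorter than the suffix, the computed start position is zero or negative, substring() returns the whole (shorter) string, and the comparison is simply false.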

    

Line 4's normalize-space(.) function call accepts one argument, strips whitespace at its beginning and end, replaces any sequence of whitespace within the string with a single space character, and returns the resulting string. In plain English, the targeted whitespace characters are the space character, the tab character, the carriage return, and the line feed. The first verse element's text looks the same when processed by this function, but the second verse element's text is definitely different: all the leading and trailing spaces and line breaks have been removed. An XML processor does this to the spaces in most kinds of attributes, and it's handy to be able to do it to element character data as well, especially when you want to compare two strings of element character data whose only difference may be the spacing around them in their source document, as we'll see in next month's column.
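Normalizing first also makes the substring functions behave more predictably on text like the second verse element's. The following is a sketch only, with a hypothetical template, showing one way to get the first word of each verse:

```xml
<!-- Hypothetical template for illustration; not one of the xq*.xsl files. -->
<xsl:template match="verse">
  <!-- normalize-space() removes the leading line breaks and spaces, so
       the text before the first space is the first word:
       "Seest" for the first verse and "The" for the second. -->
  {<xsl:value-of select="substring-before(normalize-space(.),' ')"/>}
</xsl:template>
```

Without the call to normalize-space(), the second verse would produce an empty pair of braces, because its first space character is preceded only by line breaks.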

Line 5's translate() function gives you a way to map one set of characters to another. It goes through the string in the first argument and replaces any characters that are also in the second argument with the corresponding character in the third argument. If the third argument has no corresponding character, then the XSLT processor deletes the one found in the first string. In the example, the function call translate(.,'abcde','ABCD') maps the letters "a", "b", "c", and "d" to their upper-case equivalents. Because the letter "e" is in the second argument but not the third, it's mapped to nothing; any occurrences of it are removed from the copy of the first argument's string that the function returns.
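Two common uses of translate() follow from this description: case conversion and character deletion. The sketch below uses a hypothetical template and hypothetical variable names (lower and upper), and it only handles the unaccented ASCII letters listed in those variables:

```xml
<!-- Hypothetical template for illustration; not one of the xq*.xsl files. -->
<xsl:template match="verse">
  <!-- Variable names are assumptions made for this sketch. -->
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <!-- Every lower-case letter has a partner, so nothing is deleted:
       the first verse becomes "SEEST THOU YON DREARY PLAIN, FORLORN AND WILD," -->
  1. {<xsl:value-of select="translate(.,$lower,$upper)"/>}
  <!-- An empty third argument gives the comma no partner, so commas
       are removed. -->
  2. {<xsl:value-of select="translate(.,',','')"/>}
</xsl:template>
```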

Let's look at a more realistic example of some of these string manipulation functions. In the following, the binCode element represents a wine brand's location on the wine store shelf. The first two characters are its row, the third character its shelf, and the text after the hyphen is its product number.

<winelist>
    <wine>
      <winery>Lindeman's</winery>
      <product>Bin 65</product>
      <year>1998</year>
      <price>6.99</price>
      <binCode>15A-7</binCode>
   </wine>
   <wine>
      <winery>Benziger</winery>
      <product>Carneros</product>
      <year>1997</year>
      <price>7.55</price>
      <binCode>15C-5</binCode>
   </wine>
   <wine>
      <winery>Duckpond</winery>
      <product>Merit Selection</product>
      <year>1996</year>
      <price>14.99</price>
      <binCode>12D-1</binCode>
   </wine>
</winelist>

The following template rule separates the three components of the binCode element type into separate elements: row, shelf, and prodNum, all inside of a productLocation container element.

<!-- xq324.xsl: converts xq323.xml to xq325.xml -->
   
  <xsl:template match="binCode">
    <productLocation>
      <row><xsl:value-of select="substring(text(),1,2)"/>
    </row>
      <shelf><xsl:value-of select="substring(.,3,1)"/>
    </shelf>
      <prodNum><xsl:value-of select="substring-after(text(),'-')"/>
    </prodNum>
    </productLocation>
  </xsl:template>

The call to substring() that creates the row element has text() as its first argument. For the purposes of this stylesheet, this means the same thing as ".". (Technically, text() selects the text node children of the context node, and substring() uses the string value of the first one, while "." refers to the context node itself, whose string value is the concatenated text of everything inside it. Because each binCode element contains a single text node and no child elements, the two give the same string here.) The result XML looks like the input except that the XSLT processor has replaced each binCode element with the productLocation element and its three child elements:

<?xml version="1.0" encoding="UTF-8"?>
<winelist>
    <wine>
      <winery>Lindeman's</winery>
      <product>Bin 65</product>
      <year>1998</year>
      <price>6.99</price>
      <productLocation><row>15</row><shelf>A</shelf>
      <prodNum>7</prodNum></productLocation>
   </wine>
   <wine>
      <winery>Benziger</winery>
      <product>Carneros</product>
      <year>1997</year>
      <price>7.55</price>
      <productLocation><row>15</row><shelf>C</shelf>
  <prodNum>5</prodNum></productLocation>
   </wine>
   <wine>
      <winery>Duckpond</winery>
      <product>Merit Selection</product>
      <year>1996</year>
      <price>14.99</price>
      <productLocation><row>12</row><shelf>D</shelf>
  <prodNum>1</prodNum></productLocation>
   </wine>
</winelist>
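The binCode template rule on its own would not copy the other elements to the result; XSLT's built-in template rules would pass along only their text. One common way to get a result like the one above is to pair the rule with an identity template. The stylesheet below is a sketch under that assumption and may differ from the author's actual xq324.xsl:

```xml
<?xml version="1.0"?>
<!-- Hypothetical complete stylesheet; the author's xq324.xsl may differ. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copies every node and attribute unchanged
       unless a more specific template rule matches. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Matching by element name has a higher default priority than
       node(), so this rule wins for binCode elements. -->
  <xsl:template match="binCode">
    <productLocation>
      <row><xsl:value-of select="substring(.,1,2)"/></row>
      <shelf><xsl:value-of select="substring(.,3,1)"/></shelf>
      <prodNum><xsl:value-of select="substring-after(.,'-')"/></prodNum>
    </productLocation>
  </xsl:template>

</xsl:stylesheet>
```

The exact indentation and line breaks in the output depend on the processor and on any xsl:output settings, so they may not match the listing above character for character.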

Next month, we'll look at how to compare two elements to see if they're the same. We'll also look at a way to implement a global string replace with an XSLT stylesheet. (If you can't wait until then, see my book, XSLT Quickly, from which these columns are excerpted.)