From One String to Many

April 28, 2004

John E. Simpson

This month, I found two questions from two different people on essentially the same subject:

Q: Can I use XSLT to parse a string of characters?

Question 1: Right now I'm sending values to our XSLT stylesheet which is defining some xsl:params we have. Here's the way the command line is typically structured:

foo.xsl -param Name "'Chuck'"

And here's the XSLT to acquire the value of the Name parameter:

<xsl:param name="Name" select="''" />

Is there a way to send any number of multiple values for Name and have them define, say, the XSLT equivalent of an array? I was thinking of using something like this variation on the command line above, with multiple names delimited by semicolons:

foo.xsl -param Name "'Chuck;Steve;Sara;Jane;'"

Question 2: I have an element like this:

<para one-of="|11|22|33|44|55|"/>

I want to get the value list of its attribute one-of, but I don't know how to parse it.

A: Both of these questions ask of XSLT that it perform what might be called a sidebar task; that is, something apart from its central mission, which is the manipulation of a source tree. They both want to use XSLT to examine a text string, breaking it apart into substrings at locations defined by some delimiting character. In Question 1, the values passed in by way of the Name parameter are delimited by semicolons; in Question 2, the values contained in the one-of attribute are delimited by pipes ("vertical bars").

The choice of delimiting character is completely arbitrary, although it may be forced on you by some outside constraint such as the interface with another application. You might just as well choose hyphens, underscores, even slashes or backslashes (for parsing the path of a file in a local directory, say, or a URL). The separator might be just a blank space, for that matter, enabling you to extract words from a sentence. Or use periods to extract sentences from a paragraph. And so on.

There are some minor differences in the two questions:

  • Question 1 expects a list of four items, while Question 2 expects five. The actual number is irrelevant: the "parsing" ideally will work with a list of two items as well as a list of two hundred. That is, the stop-parsing trigger is simply "nothing else to parse," not that a particular number of tokens (the individual units of text between successive delimiters) has been extracted.
  • Possibly more problematic is the subtle difference in how the delimiters (whatever they may be) are used in the two questions. In the first case, each name in the list is followed by a delimiter; in the second, too, each list item is followed by a delimiter... but the first item is also preceded by a delimiter.

I don't think this second issue matters much. At worst, you could simply trim off the opening delimiter and then process the rest of the string exactly as when dealing with Question 1's more conventional list form. It's an atypical way to delimit a list of items, to be sure. But it does highlight the need to know your data, as always.
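To make the "trim off the opening delimiter" idea concrete, here is a minimal XSLT 1.0 sketch. It assumes a source document containing Question 2's para element; the variable name list is purely illustrative. It uses only the standard XPath 1.0 functions starts-with() and substring().

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="para">
    <!-- "list" is an illustrative name, not part of either question -->
    <xsl:variable name="list" select="@one-of"/>
    <xsl:choose>
      <!-- Drop a leading pipe if one is present -->
      <xsl:when test="starts-with($list, '|')">
        <xsl:value-of select="substring($list, 2)"/>
      </xsl:when>
      <!-- Otherwise pass the string through unchanged -->
      <xsl:otherwise>
        <xsl:value-of select="$list"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the para element from Question 2, this turns "|11|22|33|44|55|" into "11|22|33|44|55|", the same trailing-delimiter form as Question 1's list.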

Enough talk about the details. What's the answer?

XPath 2.0

When XPath finally makes the leap to its second version, dragging in its wake new versions of both XQuery and XSLT, you should be able to take advantage of a new string function called tokenize(). This function takes at least two arguments: the first is the string to be broken up, and the second is a pattern (interpreted as a regular expression) matching the delimiter character(s) that mark the boundaries between adjacent tokens. For instance, to handle the first question's Name parameter and its semicolon delimiters, a call to this function might look as follows:

tokenize($Name, ";")

You might be curious what exactly the tokenize() function returns at the point of the call. What it returns is an XML Schema sequence: that is, a series of discrete values. While you can't do much with this sequence by itself, you can use the likewise new XPath 2.0 functions item-at() and index-of() to process it, including extracting or enumerating the individual values.
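As a rough sketch of how that sequence might be consumed, the stylesheet below loops over the tokens with xsl:for-each. It assumes an XSLT 2.0 processor and the two-argument form of tokenize(); the default value given to Name is there only so the example runs without a command-line parameter.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Default value is for illustration; normally passed on the command line -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <xsl:template match="/">
    <!-- The predicate discards the empty string that follows the trailing semicolon -->
    <xsl:for-each select="tokenize($Name, ';')[. != '']">
      <xsl:value-of select="position()"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Because the second argument is treated as a regular expression, Question 2's pipe delimiter would need escaping, as in tokenize(@one-of, '\|').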

See Bob DuCharme's "Transforming XML" column of May, 2003, "XSLT 2 and Delimited Lists," for more information and some examples of these new features of XPath 2.0.

EXSLT

Right about now, you may be thinking to yourself something along these lines: having XPath 2.0 on the horizon is all well and good, but what about now? After all, there aren't many XPath/XSLT processors today capable of handling XPath 2.0 novelties, however useful (Saxon being the notable exception).

Another alternative to consider is the EXSLT extension function str:tokenize().

EXSLT is, as its home page states plainly, "a community initiative to provide extensions to XSLT." These extensions are of three kinds: named templates, extension functions, and extension elements.

Consider the EXSLT extension functions category. These work like other functions you might be familiar with from XPath/XSLT 1.0, such as name(), count(), translate(), document(), and key(). They do, however, require you to take a few extra steps: declaring EXSLT-specific namespaces and importing EXSLT stylesheets into your own. The exact steps to take depend on the function(s) you're interested in using, and the XSLT processor in your environment.
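As an illustration of the namespace step, the following sketch declares the EXSLT strings namespace and uses the standard XSLT 1.0 function-available() test to check whether the processor has str:tokenize() built in. The messages are illustrative; if the function is not built in, you would typically fall back to importing one of the EXSLT template or script implementations.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:choose>
      <!-- True only if this processor implements the extension function natively -->
      <xsl:when test="function-available('str:tokenize')">
        <xsl:text>str:tokenize() is built in to this processor.&#10;</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>No built-in str:tokenize(); import a template or script version instead.&#10;</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```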

EXSLT offers several implementations of a tokenization routine, including a JavaScript version and a named template as well as processor-specific functions. The EXSLT str:tokenize() function (note the namespace prefix), like XPath 2.0's version, takes two arguments (or parameters, if you're using a template-based solution such as Jeni Tennison's): the first is the string to be tokenized, the second is the delimiter. What it returns at the point of the function call, though, isn't anything exotic like an XML Schema sequence, which would require your XSLT processor to support not only XPath 2.0 but XML Schema as well. What it returns to your stylesheet is a simple node-set consisting of N token elements, the value of each of which is a token extracted from the first argument. (If you're using a template-based call to str:tokenize(), what you get back is a result tree fragment, or RTF, rather than a true node-set.)

For instance, to handle the first questioner's situation with the EXSLT str:tokenize function, your stylesheet would include (in addition to any requisite namespace declarations, xsl:import elements, and so on, depending on the version of the function you're using) a call like this:

str:tokenize($Name, ";")

If you're using the template-based str:tokenize, the call would look like this:

<xsl:call-template name="str:tokenize">
  <xsl:with-param name="string" select="$Name" />
  <xsl:with-param name="delimiters" select="';'" />
</xsl:call-template>

Note the nested quoting in the value of the second select attribute: the single quotes inside the double quotes ensure the XSLT processor treats the value as a literal string rather than evaluating it as an XPath expression.

What you'd get back in either case would be a node-set (or RTF) like this:

<token>Chuck</token>
<token>Steve</token>
<token>Sara</token>
<token>Jane</token>

Such a node-set/RTF, of course, can be processed by any old XSLT processor.
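Here is one way the returned tokens might be processed, sketched under the assumption that your processor implements str:tokenize() as a built-in extension function (libxslt's xsltproc is one that does). The names and name elements in the output are invented for illustration, as is the default value of Name.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">
  <xsl:output method="xml" indent="yes"/>

  <!-- Default value is for illustration; normally passed on the command line -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <xsl:template match="/">
    <names>
      <!-- Each node in the returned node-set is a token element -->
      <xsl:for-each select="str:tokenize($Name, ';')">
        <name><xsl:value-of select="."/></name>
      </xsl:for-each>
    </names>
  </xsl:template>
</xsl:stylesheet>
```

With the template-based version, an XSLT 1.0 processor would typically need the RTF converted to a node-set first, for instance with an extension such as exsl:node-set(), before xsl:for-each could walk over the token elements.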

By the way, don't be shy about appropriating EXSLT functions and named templates for your own use if they're not exactly what you need; simply download the code and modify it to your own purposes. Give credit where credit is due, though: include a reference in your code's documentation to the work of the EXSLT folks.
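If you'd rather see the general shape of such a routine before downloading anything, the following is a hand-rolled sketch in plain XSLT 1.0. It is not the EXSLT code: the template name split and its parameter names are hypothetical, and it assumes a non-empty delimiter. It recursively peels off the text before each delimiter with substring-before() and substring-after(), skipping empty tokens so that both questions' list styles work.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Default value is for illustration; normally passed on the command line -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <xsl:template match="/">
    <tokens>
      <xsl:call-template name="split">
        <xsl:with-param name="string" select="$Name"/>
        <xsl:with-param name="delimiter" select="';'"/>
      </xsl:call-template>
    </tokens>
  </xsl:template>

  <!-- Hypothetical helper: emits one token element per non-empty substring.
       Assumes $delimiter is not the empty string. -->
  <xsl:template name="split">
    <xsl:param name="string"/>
    <xsl:param name="delimiter"/>
    <xsl:choose>
      <xsl:when test="contains($string, $delimiter)">
        <xsl:variable name="head"
            select="substring-before($string, $delimiter)"/>
        <!-- Skip empty tokens, such as the one before a leading delimiter -->
        <xsl:if test="$head != ''">
          <token><xsl:value-of select="$head"/></token>
        </xsl:if>
        <!-- Recurse on whatever follows the first delimiter -->
        <xsl:call-template name="split">
          <xsl:with-param name="string"
              select="substring-after($string, $delimiter)"/>
          <xsl:with-param name="delimiter" select="$delimiter"/>
        </xsl:call-template>
      </xsl:when>
      <!-- No delimiter left: the remainder, if any, is the last token -->
      <xsl:when test="$string != ''">
        <token><xsl:value-of select="$string"/></token>
      </xsl:when>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

For the Name value shown, this yields four token elements; passing Question 2's "|11|22|33|44|55|" with a pipe delimiter yields five. Since each token costs one level of recursion, very long lists may run into processor-specific recursion limits.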