From One String to Many
April 28, 2004
This month, I found two questions from two different people on essentially the same subject:
Q: Can I use XSLT to parse a string of characters?
Question 1: Right now I'm sending values to our XSLT stylesheet, which defines the xsl:params we have. Here's the way the command line typically looks:
foo.xsl -param Name "'Chuck'"
And here's the XSLT to acquire the value of the Name parameter:
<xsl:param name="Name" select="''" />
Is there a way to send any number of values for Name and have them define, say, the XSLT equivalent of an array? I was thinking of using something like a variation on the command line above, with multiple names delimited by semicolons:
foo.xsl -param Name "'Chuck;Steve;Sara;Jane;'"
Question 2: I have an element like this:
I want to get the value list of its attribute one-of, but I don't know how to.
A: Both of these questions ask of XSLT that it perform what might be called a sidebar task: that is, something apart from its central mission, which is the manipulation of a source tree. They both want to use XSLT to examine a text string, breaking it apart into pieces at locations defined by some delimiting character. In Question 1, the values passed by way of the Name parameter are delimited by semicolons; in Question 2, the values contained in the one-of attribute are delimited by pipes ("vertical bars").
The choice of delimiting character is completely arbitrary, although it may be forced on you by some outside constraint such as the interface with another application. You might just as well choose hyphens, underscores, even slashes or backslashes; e.g., for parsing the path of a file in a local directory or a URL. The separator might be just a blank space, for that matter, enabling you to extract words from a sentence. Or use periods to extract sentences from a paragraph. And so on.
There are some minor differences in the two questions:
- Question 1 expects a list of four items, while Question 2 expects five. The actual number is irrelevant: the "parsing" ideally will work with a list of two items as well as a list of two hundred. That is, the stop-parsing trigger is simply "nothing else to parse," not that a particular number of tokens (the individual units of text between successive delimiters) has been extracted.
- Possibly more problematic is the subtle difference in how the delimiters (whatever they may be) are used in the two questions. In the first case, each name in the list is followed by a delimiter; in the second, too, each list item is followed by a delimiter... but the first item is also preceded by a delimiter.
I don't think this second issue matters much. At worst, you could simply trim off the opening delimiter and then process the rest of the string exactly as when dealing with Question 1's more conventional list form. It's an atypical way to delimit a list of items, to be sure. But it does highlight the need to know your data, as always.
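Trimming that opening delimiter takes only a couple of XPath 1.0 string functions. What follows is a minimal sketch, not a finished solution; the list parameter and its sample value are hypothetical stand-ins for the pipe-delimited attribute value in Question 2.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical sample value; the real list would come from your own data -->
  <xsl:param name="list" select="'|alpha|beta|gamma|'"/>

  <xsl:template match="/">
    <!-- Drop a leading pipe if there is one; otherwise keep the string as-is -->
    <xsl:variable name="trimmed">
      <xsl:choose>
        <xsl:when test="starts-with($list, '|')">
          <xsl:value-of select="substring-after($list, '|')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$list"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <!-- $trimmed now has the same shape as Question 1's list -->
    <xsl:value-of select="$trimmed"/>
  </xsl:template>
</xsl:stylesheet>
```

From that point on, the string can be handed to whatever tokenizing approach you settle on.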
Enough talk about the details. What's the answer?
When XPath finally makes the leap to its second version, dragging in its wake new versions of both XQuery and XSLT, you should be able to take advantage of a new string function called tokenize(). This function takes at least two arguments: the first is the string to be broken up, and the second is the delimiter character(s) which mark the boundaries between adjacent tokens. For instance, to handle the first question's Name parameter and its semicolon delimiters, a call to this function might look like tokenize($Name, ';').
You might be curious what exactly the tokenize() function returns at the point of the call. What it returns is an XML Schema sequence: that is, a series of discrete values. While you can't do much with this sequence by itself, you can use likewise-new XPath 2.0 functions such as index-of() to process it, including extracting or enumerating the individual values.
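Here is a rough sketch of how this might look in an XSLT 2.0 stylesheet, assuming a processor that supports XPath 2.0. The default value of Name is filled in only so the example stands alone. Note two details of tokenize() as it appears in the final XPath 2.0 specification: the second argument is treated as a regular expression, and a delimiter at the very end of the string produces a trailing empty token, which the predicate below filters out.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Default supplied for illustration; normally passed on the command line -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <xsl:template match="/">
    <!-- Walk the sequence of tokens, skipping any empty strings -->
    <xsl:for-each select="tokenize($Name, ';')[. != '']">
      <!-- position() is the token's place in the filtered sequence -->
      <xsl:value-of select="concat(position(), ': ', ., '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Each name comes out on its own line, numbered in the order it appeared in the parameter.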
See Bob DuCharme's "Transforming XML" column of May, 2003, "XSLT 2 and Delimited Lists," for more information and some examples of these new features of XPath 2.0.
Right about now, you may be thinking to yourself something along these lines: having XPath 2.0 on the horizon is all well and good, but what about now? After all, there aren't many XPath/XSLT processors today capable of handling XPath 2.0 novelties, however useful (Saxon being the notable exception).
Another alternative to consider is the EXSLT extension function str:tokenize().
EXSLT is, as its home page states plainly, "a community initiative to provide extensions to XSLT." These extensions are of three kinds: named templates, extension functions, and extension elements.
Consider the EXSLT extension functions category. These work like other functions you may be familiar with from XPath/XSLT 1.0, such as key(). They do, however, require you to take a few extra steps: declaring EXSLT-specific namespaces and, where needed, importing EXSLT stylesheets into your own. The exact steps to take depend on the function(s) you're interested in using, and the XSLT processor in your environment.
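For the EXSLT string functions, those extra steps might look something like the sketch below. The namespace URI http://exslt.org/strings is the one EXSLT assigns to its strings module; the imported file name is an assumption about where you've saved a local copy of the template-based implementation, so the import is left commented out.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">

  <!-- Needed only for the template-based implementation. The href is an
       assumed local path; adjust it, then uncomment the line below. -->
  <!-- <xsl:import href="str.tokenize.template.xsl"/> -->

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- function-available() reports whether the processor has the function built in -->
    <xsl:choose>
      <xsl:when test="function-available('str:tokenize')">
        <xsl:text>str:tokenize() is built in&#10;</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>no built-in str:tokenize(); use the named template&#10;</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Running this once against any input document is a quick way to learn which route your processor needs.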
EXSLT offers several implementations of str:tokenize: a named template as well as processor-specific functions. The EXSLT str:tokenize() function (note the namespace prefix), like XPath 2.0's version, takes two arguments (or parameters, if you're using a template-based solution such as Jeni Tennison's); the first is the string to be tokenized, the second is the delimiter. What it returns to you at the point of the function call, though, isn't anything exotic like an XML Schema sequence, requiring that your XSLT processor include support not only for XPath 2.0 but for XML Schema as well. What it returns to your stylesheet is a simple node-set, consisting of N token elements, the value of each of which is a token extracted from the first argument. (If you're using a template-based call to str:tokenize(), what you get back is a result tree fragment, or RTF, rather than a true node-set.)
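To see why a template-based tokenizer hands back a result tree fragment, it may help to look at a bare-bones one. The sketch below is not Jeni Tennison's code or the EXSLT implementation; the template name split and its parameters are hypothetical, it handles only a single non-empty delimiter string, and it simply recurses over the string with substring-before() and substring-after().

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Default supplied for illustration only -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <!-- Hypothetical helper: emits one token element per non-empty piece.
       Assumes $delimiter is never the empty string. -->
  <xsl:template name="split">
    <xsl:param name="string"/>
    <xsl:param name="delimiter" select="';'"/>
    <xsl:choose>
      <xsl:when test="contains($string, $delimiter)">
        <xsl:variable name="head"
            select="substring-before($string, $delimiter)"/>
        <xsl:if test="$head != ''">
          <token><xsl:value-of select="$head"/></token>
        </xsl:if>
        <!-- Recurse on whatever follows the first delimiter -->
        <xsl:call-template name="split">
          <xsl:with-param name="string"
              select="substring-after($string, $delimiter)"/>
          <xsl:with-param name="delimiter" select="$delimiter"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:when test="$string != ''">
        <!-- Last piece, with no delimiter after it -->
        <token><xsl:value-of select="$string"/></token>
      </xsl:when>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="/">
    <tokens>
      <xsl:call-template name="split">
        <xsl:with-param name="string" select="$Name"/>
      </xsl:call-template>
    </tokens>
  </xsl:template>
</xsl:stylesheet>
```

Because the token elements are built by template instructions, capturing them in an xsl:variable yields an RTF; in XSLT 1.0 you would typically need an extension such as exsl:node-set() to navigate into it.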
For instance, to handle the first questioner's situation with the EXSLT str:tokenize function, your stylesheet would include (in addition to any requisite namespace declarations, xsl:import elements, and so on, depending on the version of the function you're using) a call like this:
str:tokenize($Name, ';')
If you're using the template-based str:tokenize, the call would look like this:
<xsl:call-template name="str:tokenize">
  <xsl:with-param name="string" select="$Name" />
  <xsl:with-param name="delimiters" select="';'" />
</xsl:call-template>
Note the double quoting necessary in the value of the second select attribute: the outer double quotes delimit the attribute value, and the inner single quotes ensure the XSLT processor will treat that value as a string rather than as an XPath expression.
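What you'd get back in either case would be a node-set (or RTF) like this, with one token element per name:
<token>Chuck</token>
<token>Steve</token>
<token>Sara</token>
<token>Jane</token>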
Such a node-set/RTF, of course, can be processed by any old XSLT processor.
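As a rough illustration of that processing, here is one way a stylesheet might loop over the tokens. It assumes a processor with a built-in str:tokenize() extension function; the names and name elements in the output are made up for the example.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">
  <xsl:output method="xml" indent="yes"/>

  <!-- Default supplied for illustration; normally passed on the command line -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <xsl:template match="/">
    <!-- names/name are hypothetical output elements -->
    <names>
      <!-- Each node in the returned node-set is a token element -->
      <xsl:for-each select="str:tokenize($Name, ';')">
        <name position="{position()}">
          <xsl:value-of select="."/>
        </name>
      </xsl:for-each>
    </names>
  </xsl:template>
</xsl:stylesheet>
```

With the template-based version, you'd capture the result in a variable first and, in XSLT 1.0, convert that RTF with something like exsl:node-set() before looping over it.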
By the way, don't be shy about appropriating EXSLT functions and named templates for your own use if they're not exactly what you need; simply download the code and modify it to your own purposes. Give credit where credit is due, though: include a reference in your code's documentation to the work of the EXSLT folks.