From One String to Many
April 28, 2004
This month, I found two questions from two different people on essentially the same subject:
Q: Can I use XSLT to parse a string of characters?
Question 1: Right now I'm sending values to our XSLT stylesheet which is defining some xsl:params we have. Here's the way the command line is typically structured:
foo.xsl -param Name "'Chuck'"
And here's the XSLT to acquire the value of the Name parameter:
<xsl:param name="Name" select="''" />
Is there a way to send any number of multiple values for Name and have them define, say, the XSLT equivalent of an array? I was thinking of using something like this variation on the command line above, with multiple names delimited by semicolons:
foo.xsl -param Name "'Chuck;Steve;Sara;Jane;'"
Question 2: I have an element like this:
<para one-of="|11|22|33|44|55|"/>
I want to get the value list of its attribute one-of, but I don't know how to parse it.
A: Both of these questions ask of XSLT that it perform what might be called a sidebar task; that is, something apart from its central mission, which is the manipulation of a source tree. They both want to use XSLT to examine a text string, breaking it apart into substrings at locations defined by some delimiting character. In Question 1, the values passed in by way of the Name parameter are delimited by semicolons; in Question 2, the values contained in the one-of attribute are delimited by pipes ("vertical bars").
The choice of delimiting character is completely arbitrary, although it may be forced on you by some outside constraint such as the interface with another application. You might just as well choose hyphens, underscores, even slashes or backslashes; e.g., for parsing the path of a file in a local directory or a URL. The separator might be just a blank space, for that matter, enabling you to extract words from a sentence. Or use periods to extract sentences from a paragraph. And so on.
There are some minor differences in the two questions:
- Question 1 expects a list of four items, while Question 2 expects five. The actual number is irrelevant: the "parsing" ideally will work with a list of two items as well as a list of two hundred. That is, the stop-parsing trigger is simply "nothing else to parse," not that a particular number of tokens (the individual units of text between successive delimiters) has been extracted.
- Possibly more problematic is the subtle difference in how the delimiters (whatever they may be) are used in the two questions. In the first case, each name in the list is followed by a delimiter; in the second, too, each list item is followed by a delimiter... but the first item is also preceded by a delimiter.
I don't think this second issue matters much. At worst, you could simply trim off the opening delimiter and then process the rest of the string exactly as when dealing with Question 1's more conventional list form. It's an atypical way to delimit a list of items, to be sure. But it does highlight the need to know your data, as always.
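As a minimal sketch of that trimming step in plain XSLT 1.0, the stylesheet below strips a single leading pipe from the one-of attribute of Question 2's para element, if one is present. The variable name raw is only illustrative, and the sketch assumes the delimiter is a single pipe character.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="para">
    <!-- "raw" is an illustrative name for the untrimmed attribute value -->
    <xsl:variable name="raw" select="@one-of"/>
    <xsl:choose>
      <!-- Drop the first character only when it is the delimiter -->
      <xsl:when test="starts-with($raw, '|')">
        <xsl:value-of select="substring($raw, 2)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$raw"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the para element shown in Question 2, this leaves a list in the same trailing-delimiter form as Question 1's, with pipes in place of semicolons.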
Enough talk about the details. What's the answer?
XPath 2.0
When XPath finally makes the leap to its second version, dragging in its wake new versions of both XQuery and XSLT, you should be able to take advantage of a new string function called tokenize(). This function takes at least two arguments: the first is the string to be broken up, and the second is the delimiter (strictly speaking, a regular-expression pattern) that marks the boundaries between adjacent tokens. For instance, to handle the first question's Name parameter and its semicolon delimiters, a call to this function might look as follows:
tokenize($Name, ";")
You might be curious what exactly the tokenize() function returns at the point of the call. What it returns is an XML Schema sequence: that is, a series of discrete values. While you can't do much with this sequence by itself, you can use the likewise new XPath 2.0 functions item-at() and index-of() to process it, including extracting or enumerating the individual values.
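To make the idea concrete, here is a hedged sketch of an XSLT 2.0 stylesheet that iterates over the result of tokenize() with xsl:for-each. It assumes a processor that supports XSLT 2.0 and a Name parameter shaped like the one in Question 1; the default value is supplied only so the sketch runs on its own. Because the list ends with a delimiter, tokenize() yields an empty string as its last item, which the predicate filters out.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Default value is an assumption; normally passed on the command line -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <xsl:template match="/">
    <!-- The predicate discards the empty token after the final semicolon -->
    <xsl:for-each select="tokenize($Name, ';')[. != '']">
      <xsl:value-of select="position()"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Since the second argument is treated as a regular expression, a delimiter with special meaning in regular expressions needs escaping: for Question 2's pipes, the pattern would be '\|' rather than '|'.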
See Bob DuCharme's "Transforming XML" column of May, 2003, "XSLT 2 and Delimited Lists," for more information and some examples of these new features of XPath 2.0.
EXSLT
Right about now, you may be thinking to yourself something along these lines: having XPath 2.0 on the horizon is all well and good, but what about now? After all, there aren't many XPath/XSLT processors today capable of handling XPath 2.0 novelties, however useful (Saxon being the notable exception).
One alternative to consider is the EXSLT extension function str:tokenize().
EXSLT is, as its home page states plainly, "a community initiative to provide extensions to XSLT." These extensions are of three kinds: named templates, extension functions, and extension elements.
Consider the EXSLT extension functions category. These work like other functions you might be familiar with from XPath/XSLT 1.0, such as name(), count(), translate(), document(), and key(). They do, however, require you to take a few extra steps: declaring EXSLT-specific namespaces and importing EXSLT stylesheets into your own. The exact steps to take depend on the function(s) you're interested in using, and the XSLT processor in your environment.
EXSLT offers several implementations of a tokenization routine, including a JavaScript version and a named template as well as processor-specific functions. The EXSLT str:tokenize() function (note the namespace prefix), like XPath 2.0's version, takes two arguments (or parameters, if you're using a template-based solution such as Jeni Tennison's); the first is the string to be tokenized, the second is the delimiter.
What it returns to you at the point of the function call, though, isn't anything exotic like an XML Schema sequence, requiring that your XSLT processor include support not only for XPath 2.0 but for XML Schema as well. What it returns to your stylesheet is a simple node-set, consisting of N token elements, the value of each of which is a token extracted from the first argument. (If you're using a template-based call to str:tokenize(), what you get back is a result tree fragment, or RTF, rather than a true node-set.)
For instance, to handle the first questioner's situation with the EXSLT str:tokenize function, your stylesheet would include (in addition to any requisite namespace declarations, xsl:import elements, and so on, depending on the version of the function you're using) a call like this:
str:tokenize($Name, ";")
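Put in context, a complete stylesheet might look something like the following sketch. It assumes a processor that implements the EXSLT strings module natively, so no xsl:import is needed; the names and name output elements and the default parameter value are made up for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    extension-element-prefixes="str">

  <!-- Default value is an assumption; normally passed on the command line -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <xsl:template match="/">
    <!-- "names" and "name" are hypothetical output element names -->
    <names>
      <!-- Each item in the returned node-set is a token element;
           the predicate skips any token whose value is empty -->
      <xsl:for-each select="str:tokenize($Name, ';')[string(.)]">
        <name><xsl:value-of select="."/></name>
      </xsl:for-each>
    </names>
  </xsl:template>
</xsl:stylesheet>
```

Inside the select attribute the delimiter is written with single quotes, because the attribute value itself is already enclosed in double quotes.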
If you're using the template-based str:tokenize, the call would look like this:
<xsl:call-template name="str:tokenize">
  <xsl:with-param name="string" select="$Name" />
  <xsl:with-param name="delimiters" select="';'" />
</xsl:call-template>
Note the nested quoting in the value of the second select attribute (single quotes inside the double quotes); this ensures the XSLT processor will treat the value as a literal string, rather than trying to evaluate it as some other kind of XPath expression.
What you'd get back in either case would be a node-set (or RTF) like this:
<token>Chuck</token>
<token>Steve</token>
<token>Sara</token>
<token>Jane</token>
Such a node-set/RTF, of course, can be processed by any old XSLT processor.
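If your processor offers no EXSLT support at all, the same result can be approximated with a recursive named template written in plain XSLT 1.0. The sketch below is not the EXSLT implementation: the template name split-sketch, its parameter names, and the names wrapper element are all hypothetical. It assumes a single, non-empty delimiter string, and it skips empty tokens so that both Question 1's trailing delimiter and Question 2's leading one are handled.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Default value is an assumption; normally passed on the command line -->
  <xsl:param name="Name" select="'Chuck;Steve;Sara;Jane;'"/>

  <xsl:template match="/">
    <names>
      <xsl:call-template name="split-sketch">
        <xsl:with-param name="string" select="$Name"/>
        <xsl:with-param name="delimiter" select="';'"/>
      </xsl:call-template>
    </names>
  </xsl:template>

  <!-- Hypothetical helper: emits one token element per non-empty item -->
  <xsl:template name="split-sketch">
    <xsl:param name="string"/>
    <xsl:param name="delimiter"/>
    <xsl:choose>
      <!-- A delimiter remains: emit the text before it, then recurse on
           the text after it. The first test guards against an empty
           delimiter, which would otherwise recurse forever. -->
      <xsl:when test="$delimiter != '' and contains($string, $delimiter)">
        <xsl:variable name="head"
            select="substring-before($string, $delimiter)"/>
        <xsl:if test="string-length($head) &gt; 0">
          <token><xsl:value-of select="$head"/></token>
        </xsl:if>
        <xsl:call-template name="split-sketch">
          <xsl:with-param name="string"
              select="substring-after($string, $delimiter)"/>
          <xsl:with-param name="delimiter" select="$delimiter"/>
        </xsl:call-template>
      </xsl:when>
      <!-- No delimiter left: whatever remains is the last token -->
      <xsl:when test="string-length($string) &gt; 0">
        <token><xsl:value-of select="$string"/></token>
      </xsl:when>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

With the default parameter value, the names element ends up holding four token elements, one each for Chuck, Steve, Sara, and Jane. Very long lists may run into recursion-depth limits in some processors, which is one reason to prefer a built-in or extension function where one is available.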
By the way, don't be shy about appropriating EXSLT functions and named templates for your own use if they're not exactly what you need; simply download the code and modify it to your own purposes. Give credit where credit is due, though: include a reference in your code's documentation to the work of the EXSLT folks.