XML.com

String analysis with XSLT's analyze-string

July 22, 2024

Mukul Gandhi

Mukul Gandhi gives us a tutorial on the uses of, and differences between, XSLT’s analyze-string instruction and XPath’s analyze-string function.

XSLT 3.0 xsl:analyze-string instruction and XPath 3.1 analyze-string function

1) Introduction

XSLT (XSL Transformations) is a template-oriented language for transforming XML (Extensible Markup Language) and text documents into result formats such as XML, text, HTML and XHTML. XSLT programs (also known as XSLT stylesheets) are written in XML syntax and make use of XML namespaces. The XSLT 3.0 specification defines an XML Schema for XSLT stylesheets, and describes the syntax and semantics of the XSLT language instructions in prose.

This article assumes that the reader is familiar with XML and XML namespaces.

An XSLT transformation can essentially add, modify, delete and filter information from the input documents that a stylesheet processes; these are the primary purposes of XSLT stylesheets.

XSLT 1.0, the first version of XSLT, became a W3C Recommendation on 16 November 1999. It was followed by XSLT 2.0 (W3C Recommendation, 23 January 2007) and XSLT 3.0 (W3C Recommendation, 8 June 2017). All three versions are widely used in software applications. As one would expect, the latest version (3.0) has many language features that versions 1.0 and 2.0 lack. XSLT 3.0 is largely compatible with XSLT 2.0 in terms of the data model it uses. That data model is known as the XPath (XML Path Language) data model; it defines, for example, the kinds of nodes that are available and the data types.

This article explains in detail XSLT 3.0's xsl:analyze-string instruction and the XPath 3.1 analyze-string function, which are useful XSL language features for analyzing XML and text string information.

2) Features common between the XSLT xsl:analyze-string instruction and the XPath analyze-string function, and their similarities and differences with the XPath tokenize function

XSLT's xsl:analyze-string instruction and the XPath analyze-string function have similar objectives. Both take an input string to be analyzed and split into substrings, together with a regular expression. A regular expression (often abbreviated to regex) is a string pattern that may match zero or more strings; for example, the regex [a-z]+ matches any word made up of the lower-case characters 'a' to 'z'. Both features are conceptually similar to a string tokenizer like the XPath "tokenize" function, but with certain important differences that are explained below in this section.

It's useful to recall what the XPath "tokenize" function does when deciding whether to use the xsl:analyze-string instruction or the XPath analyze-string function. The "tokenize" function takes as input a string to be tokenized and a regular expression. Scanning the input string from left to right, it treats every substring that matches the regex as a separator and returns, as a sequence of strings, the substrings that lie between those separators. A string tokenizer is very often needed in software applications to split an input string into its words.

Both the xsl:analyze-string instruction and the XPath analyze-string function can do the same tasks as the XPath tokenize function, but that is only a subset of what they offer.

The XSLT xsl:analyze-string instruction and the XPath analyze-string function can emit both the matching and the non-matching substrings of an input string, in left-to-right order. The XPath tokenize function can emit only the parts of the input string that are not matched by its regex argument; i.e., the parts of the input string that the regex matches are not available in the tokenize function's output.
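
The difference can be sketched with a small stylesheet. The input string 'red, green,blue' and the element names "comparison" and "token" are made up for this illustration; the stylesheet does not read from its input document, so it can be run against any XML document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- Hypothetical input string, used only for this illustration -->
    <xsl:variable name="str" select="'red, green,blue'"/>

    <xsl:template match="/">
        <comparison>
            <!-- tokenize returns only the text between the separators -->
            <xsl:for-each select="tokenize($str, ',\s*')">
                <token><xsl:value-of select="."/></token>
            </xsl:for-each>
            <!-- analyze-string also reports the separators themselves -->
            <xsl:copy-of select="analyze-string($str, ',\s*')"/>
        </comparison>
    </xsl:template>

</xsl:stylesheet>
```

Here tokenize yields the three tokens red, green and blue, and the separators are lost. The analyze-string result additionally reports the separators ", " and "," as "match" elements between the "non-match" elements; its exact XML structure is discussed in section 4.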

We'll study XSLT's xsl:analyze-string instruction and XPath's analyze-string function with examples in detail.

3) XSLT 3.0 xsl:analyze-string instruction

The XSLT xsl:analyze-string instruction has the following syntax, which a stylesheet author needs to follow when using it in XSL stylesheets:

<xsl:analyze-string select="..." regex="..." flags="...">
    <xsl:matching-substring>
        ...
    </xsl:matching-substring>
    <xsl:non-matching-substring>
        ...
    </xsl:non-matching-substring>
</xsl:analyze-string>

An XSLT xsl:analyze-string instruction has the following requirements:

1) An xsl:analyze-string stylesheet element can have the following attributes: 'select', 'regex' and 'flags'. The 'select' and 'regex' attributes are mandatory, whereas the 'flags' attribute is optional.
2) An xsl:analyze-string stylesheet element must have one or both of the elements xsl:matching-substring and xsl:non-matching-substring as child elements. Each of these elements can appear only once in an xsl:analyze-string element. If both are present, the xsl:matching-substring element must be written before the xsl:non-matching-substring element.

It's useful to know that when an xsl:analyze-string instruction contains only an xsl:non-matching-substring as its child element, the xsl:analyze-string instruction functions very similarly to XPath's 'tokenize' function.
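
As a minimal sketch of this tokenizer-like usage, the following stylesheet splits a date string on hyphens. The literal input '2024-07-22' and the element names "tokens" and "token" are assumptions chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
        <tokens>
            <!-- Only the text between the '-' separators is emitted -->
            <xsl:analyze-string select="'2024-07-22'" regex="-">
                <xsl:non-matching-substring>
                    <token><xsl:value-of select="."/></token>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </tokens>
    </xsl:template>

</xsl:stylesheet>
```

This emits three "token" elements with the values 2024, 07 and 22, the same values that tokenize('2024-07-22', '-') returns. One practical point when writing such stylesheets: the 'regex' attribute is an attribute value template, so any curly braces in the regex have to be doubled (for example \d{{4}}).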

Both the XSL stylesheet elements xsl:matching-substring and xsl:non-matching-substring can produce an arbitrary stylesheet output structure (for example, XML or HTML data information), a sequence of data values or even a single atomic value. Any of these XSL stylesheet output contents may be constructed dynamically or statically by the stylesheet.

An xsl:analyze-string instruction's output can start with either matching content (produced by xsl:matching-substring) or non-matching content (produced by xsl:non-matching-substring) from the input string, which is the computed value of the xsl:analyze-string element's 'select' attribute. The output starts with matching content if the beginning of the input string matches the regex, and with non-matching content if it does not.

For a typical regex, the xsl:analyze-string instruction's output alternates between matching and non-matching substrings. This follows from how a regex partitions an input string: a non-matching part is always followed by a matching part (or by the end of the string), and a matching part is usually followed by a non-matching part. Two matching substrings can, however, be adjacent; for example, the regex [a-z] applied to the string "ab" yields the two consecutive matches "a" and "b".
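
To see an output that begins with a matching substring, consider the following sketch. The input string and the element names "parts", "number" and "text" are hypothetical; the regex \d+ matches runs of decimal digits.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
        <parts>
            <!-- The input string begins with digits, so the first
                 substring reported is a matching one -->
            <xsl:analyze-string select="'42 apples and 7 pears'" regex="\d+">
                <xsl:matching-substring>
                    <number><xsl:value-of select="."/></number>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                    <text><xsl:value-of select="."/></text>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </parts>
    </xsl:template>

</xsl:stylesheet>
```

The children of "parts" are, in order: a "number" element containing 42, a "text" element containing " apples and ", a "number" element containing 7, and a "text" element containing " pears". Changing the input to 'apples 42' would make the output start with a "text" element instead.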

Let's study the xsl:analyze-string instruction's behavior further with a few XSLT stylesheet examples, shown below in this section.

XML document [XML1]:

<?xml version="1.0" encoding="UTF-8"?>
<info>XSLT xsl:analyze-string instruction</info>

XSL stylesheet document [XSL1]:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                          version="3.0">

        <xsl:output method="xml" indent="yes"/>

        <!-- The regex used by the xsl:analyze-string instruction below matches
                a contiguous sequence of one or more whitespace characters -->

        <xsl:template match="/">
            <stringRegexAnalysis>
                <xsl:analyze-string select="info" regex="\s+">
                   <xsl:matching-substring>
                        <matchPart>
                             <xsl:value-of select="."/>
                         </matchPart>
                    </xsl:matching-substring>
                    <xsl:non-matching-substring>
                        <nonMatchPart>
                            <xsl:value-of select="."/>
                        </nonMatchPart>
                    </xsl:non-matching-substring>
               </xsl:analyze-string>
            </stringRegexAnalysis>
        </xsl:template>

</xsl:stylesheet>

When the XSL stylesheet [XSL1] transforms the XML input document [XML1], the following stylesheet output is produced:

<?xml version="1.0" encoding="UTF-8"?>
<stringRegexAnalysis>
    <nonMatchPart>XSLT</nonMatchPart>
    <matchPart> </matchPart>
    <nonMatchPart>xsl:analyze-string</nonMatchPart>
    <matchPart> </matchPart>
    <nonMatchPart>instruction</nonMatchPart>
</stringRegexAnalysis>

The XSL stylesheet's output shown above should be self-explanatory. In this example, the xsl:analyze-string instruction has produced an alternating sequence of non-matching and matching substrings. The output starts with a non-matching substring, because the beginning of the input string ("XSLT") does not match the regex \s+.

It's useful to remember that a regex always corresponds to zero or more matching substrings of an input string. By choosing a different regex for an xsl:analyze-string instruction, a stylesheet author can make the substrings that were previously reported as non-matching be reported as matching, and vice versa.

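Let's assume that we change the stylesheet XSL1's regex to [\w|:|\-]+, which specifies a contiguous sequence of word characters (letters and digits, among others; the XPath regex definition of \w is Unicode-based and broader than [a-zA-Z0-9]) together with the characters ':' and '-'. Note that '|' has no alternation meaning inside a character class, so this class also admits a literal '|'; the shorter [\w:\-]+ describes the intended set. The changed regex produces the following XSL transformation output for the XML input document XML1: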

<?xml version="1.0" encoding="UTF-8"?>
<stringRegexAnalysis>
    <matchPart>XSLT</matchPart>
    <nonMatchPart> </nonMatchPart>
    <matchPart>xsl:analyze-string</matchPart>
    <nonMatchPart> </nonMatchPart>
    <matchPart>instruction</matchPart>
</stringRegexAnalysis>

For this input, the two regexes are complementary: every substring that matched the previous regex \s+ is a non-matching substring under the new regex [\w|:|\-]+, and every substring that was non-matching under the previous regex is now a matching substring.

An XSL stylesheet using an xsl:analyze-string instruction doesn't necessarily have to output technical XML element names such as "matchPart" and "nonMatchPart". The following XSLT stylesheet illustrates producing user-friendly XML element names in the XSL stylesheet's transformation output.

Let's say we have the following XSL stylesheet document ([XSL2]):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                          version="3.0">

      <xsl:output method="xml" indent="yes"/>

      <xsl:template match="/">
          <wordsWithinAString>
               <xsl:analyze-string select="info" regex="[\w|:|\-]+">
                    <xsl:matching-substring>
                         <word>
                             <xsl:value-of select="."/>
                         </word>
                   </xsl:matching-substring> 
              </xsl:analyze-string>
          </wordsWithinAString>
      </xsl:template>

</xsl:stylesheet>

The XSL stylesheet XSL2 shown above, when transforming the XML input document XML1, produces the following output:

<?xml version="1.0" encoding="UTF-8"?>
<wordsWithinAString>
    <word>XSLT</word>
    <word>xsl:analyze-string</word>
    <word>instruction</word>
</wordsWithinAString>

The regex used in the XSL stylesheet XSL2 is the same as the one used in the previous example, but XSL2 produces a more user-friendly XSL transformation output. It also outputs only the words found in the input string; the whitespace between them is discarded, because only the xsl:matching-substring element is present as a child of the xsl:analyze-string element.

The regex [\w|:|\-]+ used in the previous XSL stylesheet example included the characters ':' and '-' for illustration. For similar requirements, a stylesheet author more often uses the simpler regex \w+, which identifies substrings formed of word characters only. With XML1, this simpler regex produces a few additional, shorter matching substrings, because ':' and '-' now act as separators. The XSLT stylesheet language has various other features with which a stylesheet author can post-process the result of the xsl:analyze-string instruction if needed.

When the regex in the stylesheet XSL2 is changed to \w+, the XSL stylesheet's output for the XML input document XML1 is the following:

<?xml version="1.0" encoding="UTF-8"?>
<wordsWithinAString>
    <word>XSLT</word>
    <word>xsl</word>
    <word>analyze</word>
    <word>string</word>
    <word>instruction</word>
</wordsWithinAString>

4) XPath 3.1 analyze-string function

The XPath analyze-string function has the same purpose as the XSLT xsl:analyze-string instruction. An obvious difference between these two XSL language features is that xsl:analyze-string is an XSLT instruction that can be used only in an XSL stylesheet, whereas analyze-string is a function in the XPath function library that can be called from any XPath expression.

An XSLT 3.0 processor makes the XPath 3.1 processing environment available to stylesheets. The fact that the string analysis feature exists both as the xsl:analyze-string instruction and as the XPath 3.1 function analyze-string doesn't mean that either is preferable to the other. When an XSLT 3.0 stylesheet needs this feature, its author can use either one.

As we'll see with the XPath function analyze-string examples in this section, it's probably somewhat simpler to use the xsl:analyze-string instruction than the XPath analyze-string function. This is my personal opinion as the author of this article, but different XSL stylesheet authors have different preferences whether to use the xsl:analyze-string instruction or the XPath analyze-string function.

Let's say that we have an XSL stylesheet document [XSL3] as follows, that uses the XPath function analyze-string:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                        version="3.0">

     <xsl:output method="xml" indent="yes"/>

     <xsl:template match="/">
         <xsl:copy-of select="analyze-string(info,'\s+')"/> 
     </xsl:template>

</xsl:stylesheet>

When the stylesheet document XSL3 transforms the XML document XML1 specified earlier in this article, the XSL transformation produces the following output:

<?xml version="1.0" encoding="UTF-8"?>
<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
    <non-match>XSLT</non-match>
    <match> </match>
    <non-match>xsl:analyze-string</non-match>
    <match> </match>
    <non-match>instruction</non-match>
</analyze-string-result>

As we can see, this output contains similar string analysis information to the result output of one of the previously specified XSL stylesheets that uses the xsl:analyze-string instruction.

The XPath and XQuery Functions and Operators 3.1 specification provides an XML Schema definition for the result of the analyze-string function (the stylesheet XSL3's output conforms to this XML Schema document). Every call to the function analyze-string returns an XML element conforming to this schema. For reference, the XML Schema document for the function's result is available at: https://www.w3.org/TR/xpath-functions-31/#schema-for-analyze-string.

To summarize, the essential semantics of the XML document structure of the result of the XPath function call analyze-string are the following:

1) The result of an analyze-string function call is an element node named analyze-string-result in the namespace http://www.w3.org/2005/xpath-functions (which is commonly bound to the XML namespace prefix "fn").

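2) In the function call analyze-string's result, the children of the XML element analyze-string-result are a sequence of "match" and "non-match" elements, in the order in which the corresponding substrings occur in the input string. Either a "match" or a "non-match" element can appear first. For a typical regex the two kinds of element alternate, for the same reasons as with the XSLT xsl:analyze-string instruction; two "non-match" elements are never adjacent, but two "match" elements can be when one match immediately follows another.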

As with other XSLT stylesheets, the result of the XPath function call analyze-string can be transformed to something other than the standard output of the analyze-string's function call (for example, to make the final result of XSLT stylesheet's output more user-friendly).

This is illustrated with the following XSLT stylesheet example ([XSL4]):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                        xmlns:fn="http://www.w3.org/2005/xpath-functions"
                        exclude-result-prefixes="fn"
                        version="3.0">

         <xsl:output method="xml" indent="yes"/>

         <xsl:template match="/">
             <wordsWithinAString> 
                <xsl:apply-templates select="analyze-string(info,'\w+')/fn:match"/>
            </wordsWithinAString>
         </xsl:template>

         <xsl:template match="fn:match">
            <word>
               <xsl:value-of select="."/>
            </word>
         </xsl:template>

</xsl:stylesheet>

When the XSL stylesheet XSL4 transforms the XML document XML1 specified earlier in this article, the stylesheet transformation produces the following result:

<?xml version="1.0" encoding="UTF-8"?>
<wordsWithinAString>
    <word>XSLT</word>
    <word>xsl</word>
    <word>analyze</word>
    <word>string</word>
    <word>instruction</word>
</wordsWithinAString>

As we can see in the above XSL stylesheet example, the stylesheet XSL4 processes only the XML element named "match" (and subsequently transforms that to an XML element named "word") from the result of the XPath function call analyze-string.
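
Because the function's result is an ordinary element tree, it can also be navigated directly within an XPath expression, without a separate template rule. The following sketch, whose element name "wordCount" is made up for illustration, counts the words in the XML document XML1:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:fn="http://www.w3.org/2005/xpath-functions"
                exclude-result-prefixes="fn"
                version="3.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
        <wordCount>
            <!-- Count the fn:match children of the function's result -->
            <xsl:value-of select="count(analyze-string(info, '\w+')/fn:match)"/>
        </wordCount>
    </xsl:template>

</xsl:stylesheet>
```

For XML1 this produces a "wordCount" element with the value 5, which agrees with the five "word" elements in the output of the stylesheet XSL4.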

The XPath analyze-string function (like the XSLT instruction xsl:analyze-string) can be combined with the rest of the XSLT language. In particular, the result of an analyze-string function call can be post-processed by other XSLT instructions, for example to group and aggregate it. Let's illustrate this with an example.

XML document [XML2]:

<?xml version="1.0" encoding="UTF-8"?>
<info>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis condimentum, orci in accumsan pulvinar, 
orci diam condimentum dolor, at tincidunt ante lacus convallis turpis. Nunc metus risus, ultrices sit 
amet pretium eu, rhoncus non nisl. Ut eu luctus magna. Sed quis lorem magna. Nunc malesuada velit volutpat, 
lacinia odio ornare, mattis augue. Sed scelerisque urna et consectetur vulputate. Vivamus porttitor laoreet 
nisl, lacinia blandit quam facilisis facilisis. Donec libero augue, facilisis eget blandit in, convallis 
sed urna. Aliquam elementum dapibus malesuada. Fusce mattis ipsum eu viverra tincidunt. In hac habitasse 
platea dictumst.
</info>

XSL stylesheet document [XSL5]:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                        xmlns:xs="http://www.w3.org/2001/XMLSchema"
                        xmlns:fn="http://www.w3.org/2005/xpath-functions"
                        exclude-result-prefixes="xs fn"
                        version="3.0">
        <xsl:output method="xml" indent="yes"/>

       <xsl:template match="/">
             <groupsOfWords> 
                   <xsl:for-each-group select="analyze-string(info,'[\s+|,|\.]+')/fn:non-match" group-by="string-length(.)">
                        <xsl:sort select="current-grouping-key()" data-type="number"/>
                        <wordsGroup strLength="{current-grouping-key()}" groupSize="{count(current-group())}">
                            <words> 
                                <xsl:value-of select="string-join(for $nonMatchElem in current-group() return xs:string($nonMatchElem),',')"/>
                            </words>
                        </wordsGroup>
                   </xsl:for-each-group>
             </groupsOfWords>
        </xsl:template>
</xsl:stylesheet>

When the XML document XML2 is transformed by the XSL stylesheet XSL5 that's shown above, the following XSL transformation output is produced:

<?xml version="1.0" encoding="UTF-8"?>
<groupsOfWords>
    <wordsGroup strLength="2" groupSize="9">
        <words>in,at,eu,Ut,eu,et,in,eu,In</words>
    </wordsGroup>
    <wordsGroup strLength="3" groupSize="7">
        <words>sit,sit,non,Sed,Sed,sed,hac</words>
    </wordsGroup>
    <wordsGroup strLength="4" groupSize="18">
        <words>amet,elit,Duis,orci,orci,diam,ante,Nunc,amet,nisl,quis,Nunc,odio,urna,nisl,quam,eget,urna</words>
    </wordsGroup>
    <wordsGroup strLength="5" groupSize="16">
        <words>Lorem,ipsum,dolor,dolor,lacus,metus,risus,magna,lorem,magna,velit,augue,Donec,augue,Fusce,ipsum</words>
    </wordsGroup>
    <wordsGroup strLength="6" groupSize="7">
        <words>turpis,luctus,ornare,mattis,libero,mattis,platea</words>
    </wordsGroup>
    <wordsGroup strLength="7" groupSize="11">
        <words>pretium,rhoncus,lacinia,Vivamus,laoreet,lacinia,blandit,blandit,Aliquam,dapibus,viverra</words>
    </wordsGroup>
    <wordsGroup strLength="8" groupSize="5">
        <words>accumsan,pulvinar,ultrices,volutpat,dictumst</words>
    </wordsGroup>
    <wordsGroup strLength="9" groupSize="13">
        <words>tincidunt,convallis,malesuada,vulputate,porttitor,facilisis,facilisis,facilisis,convallis,elementum,malesuada,tincidunt,habitasse</words>
    </wordsGroup>
    <wordsGroup strLength="10" groupSize="1">
        <words>adipiscing</words>
    </wordsGroup>
    <wordsGroup strLength="11" groupSize="5">
        <words>consectetur,condimentum,condimentum,scelerisque,consectetur</words>
    </wordsGroup>
</groupsOfWords>

The XSL stylesheet XSL5 shown above used the XPath analyze-string function, whose result has been aggregated and grouped using the xsl:for-each-group instruction to provide a different aggregate data view of the analyze-string's result.

5) XPath 3.1 tokenize function

Although XPath's tokenize function is not the main topic of this article, it is useful to look at an XSL stylesheet that uses tokenize to solve one of the use cases that was solved earlier in this article with the analyze-string features.

Let's study the following XSLT stylesheet [XSL6]:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                        xmlns:xs="http://www.w3.org/2001/XMLSchema"
                        xmlns:fn0="http://fn0"
                        exclude-result-prefixes="xs fn0"
                        version="3.0">

        <xsl:output method="xml" indent="yes"/>

        <xsl:template match="/">
            <groupsOfWords> 
                 <xsl:for-each-group select="fn0:getNonEmptyTokens(tokenize(info,'[\s+|,|\.]+'))" group-by="string-length(.)">
                      <xsl:sort select="current-grouping-key()" data-type="number"/>
                         <wordsGroup strLength="{current-grouping-key()}" groupSize="{count(current-group())}">
                             <words> 
                                   <xsl:value-of select="string-join(for $nonMatchElem in current-group() return xs:string($nonMatchElem),',')"/>
                            </words>
                        </wordsGroup>
                   </xsl:for-each-group>
             </groupsOfWords>
        </xsl:template>

        <!-- Get sequence of "token" elements, for token strings having length > 0. -->
        <xsl:function name="fn0:getNonEmptyTokens" as="element()*">
             <xsl:param name="tokens" as="xs:string*"/>
             <xsl:for-each select="$tokens[string-length(.) &gt; 0]">
                 <token><xsl:value-of select="."/></token>
             </xsl:for-each>
        </xsl:function>

</xsl:stylesheet>

The XSL stylesheet XSL6 illustrated above, when transforming the XML input document XML2 described earlier in this article, produces the same XSL transformation output as the XSL stylesheet XSL5.

In the XSL stylesheet XSL6 shown above, we've transformed the result of the XPath tokenize function to a node sequence using the stylesheet function fn0:getNonEmptyTokens, and subsequently grouped the result of the function call fn0:getNonEmptyTokens to produce the XSL stylesheet XSL6's final output.
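
The filtering on string length matters because tokenize returns a zero-length token when the regex matches at the very start or the very end of the input string, as it does for XML2, whose text begins and ends with a line break. The following sketch shows this behavior on a short literal string; the input ' a b ' and the "length" attribute are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
        <tokens>
            <!-- The input starts and ends with a separator, so the first
                 and last tokens returned by tokenize are zero-length -->
            <xsl:for-each select="tokenize(' a b ', '\s+')">
                <token length="{string-length(.)}">
                    <xsl:value-of select="."/>
                </token>
            </xsl:for-each>
        </tokens>
    </xsl:template>

</xsl:stylesheet>
```

The stylesheet emits four "token" elements whose "length" attributes are 0, 1, 1 and 0. The stylesheet XSL5 needed no such filtering, because the analyze-string function does not report zero-length non-matching substrings.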

As discussed earlier in this article, the XPath tokenize function differs from the xsl:analyze-string instruction and the XPath function analyze-string in that it cannot return the regions of the input string that the regex matches.

6) Conclusion

This article has discussed the XSLT 3.0 xsl:analyze-string instruction in detail, along with the XPath 3.1 function analyze-string, which produces output carrying similar information to that of the xsl:analyze-string instruction. Both are useful XSL language features for analyzing XML and text string information with regular expressions.

We have also discussed using an XPath 3.1 function 'tokenize' to do string information analysis using regular expressions, to solve use cases with similar objectives as the XSLT and XPath analyze-string language features.

This article hasn't explained the regex 'flags' of the xsl:analyze-string instruction and the XPath analyze-string function. Regex flags are optional with these XSL language features. They are options that, among other things, allow regex matching to work in case-insensitive mode. XPath 3.1 regex flags are explained in detail at: https://www.w3.org/TR/xpath-functions-31/#flags. The regex syntax used by all the features in XSLT 3.0 and XPath 3.1 that require a regex is described at: https://www.w3.org/TR/xpath-functions-31/#regex-syntax.
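
As a brief sketch of the 'flags' attribute, the following stylesheet uses the "i" flag for case-insensitive matching. The input string and the element names "hits" and "hit" are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="/">
        <hits>
            <!-- flags="i" makes the regex match regardless of letter case -->
            <xsl:analyze-string select="'XSLT and xslt and Xslt'"
                                regex="xslt" flags="i">
                <xsl:matching-substring>
                    <hit><xsl:value-of select="."/></hit>
                </xsl:matching-substring>
            </xsl:analyze-string>
        </hits>
    </xsl:template>

</xsl:stylesheet>
```

This emits three "hit" elements with the values XSLT, xslt and Xslt; without the flag, only the lower-case xslt would match. The XPath function accepts the same flags as an optional third argument, for example analyze-string($input, 'xslt', 'i'), where $input stands for any string value.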

Users familiar with regex in languages like XML Schema, Perl and Java should find XSLT 3.0 and XPath 3.1 regular expressions easier to learn. Here are links to a few of these other regex syntax definitions: https://www.w3.org/TR/xmlschema-2/#regexs, https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.

All the examples provided in this article have been successfully tested with the Saxon-HE XSLT 3.0 processor and Apache Xalan-J's XSLT 3.0 development build.

7) References

Following are the references to relevant W3C recommendations and XSLT processors.