Menu

EXSLT for MSXML

August 6, 2003

Dimitre Novatchev

Introduction

The EXSLT project specifies a set of standard extension functions for XSLT that, when implemented by all vendors of XSLT processors, will allow writing portable XSLT applications. Until now most of the major XSLT processors had already some form of support for EXSLT, with the notable exception of MSXML.

This article describes a third party implementation of EXSLT for MSXML4 by Dimitre Novatchev. The author had no access to internal product interfaces and had to overcome some serious difficulties, which until now had prevented the development of any such third party implementation of EXSLT for MSXML4.

How to extend an XSLT processor

There are different possible ways to implement extensions to an XSLT processor:

  • Modify the implementation of the XSLT processor, recompile, rebuild, test and redeploy it.
  • Implement one or more extension elements.
  • Provide a library of inline extension functions.
  • Provide external extension objects, whose methods will be referenced as extension functions.

The first and the second option require that either the source code of the XSLT processor (and the right to modify and extend it) is available or that documentation is provided explaining how to implement extension elements. None of these is true in the case of MSXML.

Following the third option will require the MSXML programmers to include some inline scripting code written in a language like Javascript. This is not completely convenient and differs from the way extension functions are implemented and used in other XSLT processors.

The decision to choose the last option reflects its advantages:

  • A namespace prefix is associated with an extension object.
  • The code of this object is not inlined in the transformation and in fact the XSLT programmer may not know anything about it.
  • This is the way to reference and use extension functions in most XSLT processors.

What EXSLT modules to implement

There are eight different modules of functions specified by EXSLT:

  • Common. Covers common, basic extension elements and functions.
  • Dates and Times. Covers common, basic extension elements and functions.
  • Dynamic. Covers extension elements and functions that deal with the dynamic evaluation of strings containing XPath expressions.
  • Functions. Extension elements and functions that allow users to define their own functions for use in expressions and patterns in XSLT.
  • Math. Covers extension elements and functions that provide facilities to do with maths.
  • Regular Expressions. Covers extension elements and functions that provide facilities to do with regular expressions.
  • Sets. Covers those extension elements and functions that provide facilities to do with set manipulation.
  • Strings Covers extension elements and functions that provide facilities to do with string manipulation.

Dynamic and Functions require access to the internals (code/data structures) of the XSLT processor -- something not achievable in the case of MSXML. Math is too trivial and has already an implementation in XSLT 1.0 and XSLT 2.0 (See The FXSL Functional Programming Library for XSLT1 and The FXSL Functional Programming Library for XSLT2 Functions, Dates and Times, Regular expressions and part of Strings will be covered by the standard XSLT 2.0 (see XQuery 1.0 and XPath 2.0 Functions and Operators.) There are also pure XSLT 1.0 libraries covering dates and time (A date_time XSLT 1.0 template library available as part of XSelerator)

From the standpoint of immediate usability in XSLT 1.0, the most useful EXSLT function is common:node-set(). Other necessary and useful functions, which cannot be implemented in XSLT 1.0, are those from the Sets module.

So I decided to implement the following EXSLT functions:

  • common:node-set()
    And all functions from the Sets module:
  • set:intersection()
  • set:difference()
  • set:distinct()
  • set:leading()
  • set:trailing()
  • set:has-same-node()

The Big Problem

Have you ever wondered why for more than two years there has been no attempt at a third-party implementation of EXSLT for MSXML? Try to produce such and you'll know that there is a big obstacle, which no one until now had been able to remove.

In the object model of MSXML there is one method for obtaining a node set as result of evaluating an XPath expression. This method is selectNodes() member of the IXMLDOMNode object and defined as follows:

HRESULT selectNodes( BSTR expression, IXMLDOMNodeList ** resultList);

As can be seen, in the MSXML object model a node set is represented by an IXMLDOMNodeList object. A node is represented by an IXMLDOMNode object.

selectNodes() can be issued only against a "current node" -- some IXMLDOMNode object.

The problem is that there is no documented way to create an IXMLDOMNodeList, except as returned by selectNodes(). This means that one cannot perform even such simple tasks as getting the union of two IXMLDOMNode nodes.

How then can we get a subset of a node set using only the MSXML object model? Impossible. But this is exactly what is needed in order to implement any of the six functions in the EXSLT Sets module.

I must confess that it was exactly the challenge of the impossible task that attracted me. Someone, who didn't know better, told me that implementing the Sets module was impossible without asking Microsoft to provide a more powerful interface, containing methods that can create an IXMLDOMNodeList from any collection of IXMLDOMNode objects. It was also implied that an XSLT specialist was inferior in attacking, solving and even understanding this problem.

What happened next is described below.

The Solution: Steal an IXMLDOMNodeList

As explained above, using the MSXML4 object model it is impossible to create an IXMLDOMNodeList other than one returned by the selectNodes() method and this has strong limitations making impossible the implementation of any EXSLT Sets functions.

The solution is to try to create an IXMLDOMNodeList outside of the MSXML object model. In XSLT there are no such limitations. It is straightforward to obtain the result of evaluating any XPath expression. So why not perform an XSLT transformation, which will evaluate any Xpath expression we need and produce its result to us?

A nice idea, but there is a major flaw in it -- an XSLT transformation always produces copies of the original nodes, not the nodes themselves. This is probably the moment when anybody stopped in desperation.

A transformation can evaluate any XPath expression internally and have access to the resulting node set, but it cannot "pass it back", it can only produce copies of the original nodes.

Can a transformation pass the result node set to any piece of code at all? Yes, it can pass it to another template it calls or instantiates or to an extension function.

This seems absolutely unusable in our case. We called the transformation so how it can call us? Even if this were possible, we still must make a return and will lose the valuable node set that the transformation passed to us.

The answer is simple: we store it in a property of our extension object. When the transformation returns to our code that started it, the node set will still be the value of this property.

This manipulation is reflected in the following picture:

Diagram.
Figure 1: How to create a desired, new IXMLDOMNodeList

Initial algorithms

Having solved the big problem, it's time for the complete implementation. The XSLT solutions in the "Aux. Transform" box can be really simple and compact. Thus for set:intersection() we can have:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:caller="urn:my-object">
 
  <xsl:param name="ns1" select="/.."/>
  <xsl:param name="ns2" select="/.."/>
  
  <xsl:variable name="vCnt" select="count($ns2)"/>

  <xsl:template match="intersect">
    <xsl:value-of 
     select="caller:storeXPathResult($ns1[count(. | $ns2) = $vCnt])"/>
  </xsl:template>
</xsl:stylesheet>

For set:distinct() we can have the following XSLT implementation:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:caller="urn:my-object">

  <xsl:param name="ns1" select="/.."/>
  
  <xsl:template match="makeDistinct" name="makeDistinct">
    <xsl:param name="pDistinct" select="/.."/>
    <xsl:param name="pNodes" select="$ns1"/>

    <xsl:choose>
      <xsl:when test="$pNodes">
        <xsl:variable name="pnewDistinct"
   select="$pDistinct | $pNodes[1]"/>
         
         <xsl:call-template name="makeDistinct">
           <xsl:with-param name="pDistinct" select="$pnewDistinct"/>
           <xsl:with-param name="pNodes"
select="$pNodes[position() > 1][not(. = $pnewDistinct)]"/>
         </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="caller:storeXPathResult($pDistinct)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>

Finally, for set:leading() we could have:

<xsl:stylesheet version="1.0" 
 xmlns:xsl = "http://www.w3.org/1999/XSL/Transform" 
 xmlns:caller="urn:my-object">
 
<xsl:param name="ns1" select="/.."/>
 <xsl:param name="ns2" select="/.."/>
 
  <xsl:template match="leading" name="leading">
    <xsl:param name="pNodes" select="$ns1"/>
    <xsl:param name="pANode" select="$ns2[1]"/>
    <xsl:param name="pLeading" select="/.."/>
    <xsl:choose>
      <xsl:when test="not($pNodes) or not($pANode)">
        <xsl:value-of select="caller:storeXPathResult($pNodes)"/>
      </xsl:when>
      
      <xsl:when test="count($pANode | $pNodes[1]) = 1
                or count($pANode | $pNodes) != count($pNodes)">
        <xsl:value-of select="caller:storeXPathResult($pLeading)"/>
      </xsl:when>
      
      <xsl:otherwise>
        <xsl:variable name="pnewLeading"
       select="$pLeading | $pNodes[1]"/>
         
         <xsl:call-template name="leading">
           <xsl:with-param name="pLeading"
               select="$pnewLeading"/>
           <xsl:with-param name="pNodes"
          select="$pNodes[position() > 1]"/>
           <xsl:with-param name="pANode" select="$pANode"/>
         </xsl:call-template>
        
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>

The XSLT implementation of the other functions of the Sets module is similar -- set:difference() is coded in a similar way to set:intersection(), set:trailing() is similar to set:leading(), and set:has-same-node() returns true if set:intersection() is non-empty.

Optimization

We could stop here, but the algorithms are not fast and robust, especially in some worst case scenarios. In particular, deep recursion typically can eat up significant amounts of memory, slow down the execution from linear to exponential, and eventually crash an XSLT processor. (See Use recursion effectively in XSL and Two-stage recursive algorithms in XSLT).

Therefore, in looking to improve our initial algorithms we try to limit the depth of recursion or to eliminate recursion completely.

Limiting recursion -- the Divide and Conquer (DVC) approach

One powerful way of limiting the depth of recursion is the Divide and Conquer (DVC) method (see Algorithms in C++, Part 1-4, by Robert Sedgewick, Addison-Wesley (ISBN 0-201-35088-2) and DVC Algorithms in XSLT). The idea is to split the input into two or more parts, apply the algorithm on each part and then combine the results. A DVC-style algorithm has a maximum recursion depth of Log2(N), which allows very long inputs to be processed without any stack overflow problems -- e.g. the maximum recursion depth for processing a list of 1000000 (1M) elements is only 19.

A DVC-style set:distinct() implementation looks like this:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:set="http://exslt.org/sets"
  xmlns:caller="urn:my-object">

  <xsl:param name="ns1" select="/.."/>

  <xsl:template match="distinct">
    <xsl:param name="pNodes" select="$ns1"/>
    
    <xsl:if test="$pNodes">
      <xsl:choose>
        <xsl:when test="not($pNodes[2])">
          <xsl:value-of 
           select="caller:storeXPathResult($pNodes)"/>   
        </xsl:when>
        <xsl:otherwise>
          <xsl:variable name="vHalfLng"
   select="floor(count($pNodes) div 2)"/>
          <xsl:variable name="vHalf1" 
     select="$pNodes[position() &lt;= $vHalfLng]"/>
          
          <xsl:variable name="vHalf2" 
     select="$pNodes[position() > $vHalfLng]"/>
           
           <xsl:variable name="vDist1"
    select="set:distinct($vHalf1)"/>
           <xsl:variable name="vDist2"
    select="set:distinct($vHalf2)"/>
           
           <xsl:value-of 
           select="caller:storeXPathResult($vDist1
                                   |
                                    $vDist2
                                     [not(. = $vDist1)])"/>   
         </xsl:otherwise>
      </xsl:choose>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

With longer input node sets this transformation is much faster than the previous and it scales well in processing huge inputs.

Binary search

The initial algorithm for set:leading() incrementally unifies all nodes from the start of the node set until the designated (second argument) node is reached. This has all the problems of deep recursion.

A more direct and efficient solution exists, one based on the classic binary search algorithm. Here we try to find a subset of the original node set, such that the designated (second argument) node is the last node in this subset. When such a subset is found, we simply return:

$accumulatedLeading | thisSubset[position() != last()]

The $accumulatedLeading is the union of all subsets preceding the current subset. It is much more efficient to accumulate whole subsets than to accumulate node by node. The binary search algorithm is similar to a DVC-style algorithm but is more efficient because at any time only one of the two parts is recursively processed.

To find such a subset we divide the original node set into two halves and take the one which contains the designated node. If the designated node is not the last, we do the same recursively to the chosen (half) subset. Here's a somewhat simplified version of the Binary Search set:leading() code:

<xsl:stylesheet version="1.0" 
 xmlns:xsl = "http://www.w3.org/1999/XSL/Transform" 
 xmlns:caller="urn:my-object">
 <xsl:param name="ns1" select="/.."/>
 <xsl:param name="ns2" select="/.."/>

  <xsl:template name="binLeading" match="leading">
    <xsl:param name="pNodes" select="$ns1"/>
    <xsl:param name="pANode" select="$ns2[1]"/>
    <xsl:param name="pLeading" select="/.."/>

    <xsl:choose>
      <xsl:when test="count($pANode | $pNodes[last()]) = 1">
        <xsl:value-of 
      select="caller:storeXPathResult($pLeading
                        | $pNodes[position() != last()])"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="vcntHalf"
 select="floor(count($pNodes) div 2)"/>
        
        <xsl:variable name="vHalf1" 
     select="$pNodes[position() &lt;= $vcntHalf]"/>
        
        <xsl:choose>
          <xsl:when test="count($pANode | $vHalf1) = count($vHalf1)">
            <xsl:call-template name="binLeading">
              <xsl:with-param name="pNodes" select="$vHalf1"/>
              <xsl:with-param name="pANode" select="$pANode"/>
              <xsl:with-param name="pLeading" select="$pLeading"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="binLeading">
              <xsl:with-param name="pNodes"
   select="$pNodes[position() > $vcntHalf]"/>
               <xsl:with-param name="pANode" select="$pANode"/>
               <xsl:with-param name="pLeading"
    select="$pLeading | $vHalf1"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>

Eliminating recursion

Not surprisingly, we can even further improve efficiency if we manage to eliminate recursion altogether.

Let's take as example set:distinct():

  1. We copy every node from the argument node set $ns1 into an element of a temporary tree. This element has the same string value as the node being copied. It also has an attribute "p", the value of which is the position of the original node in the argument node-set.
  2. Then using the Muenchian method for grouping (see Grouping using xsl:key and the key() function -- the Muenchian method) we find in this tree every first in document order element that has some unique value.
  3. We create a pipe-delimited string $vids of the original positions (the value of the "p" attribute) of all such first elements.
  4. Finally, the set of distinct elements is:
    $ns1[contains($vIds, concat('|', position(), '|'))]

Here's the code for set:distinct():

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:msxsl="urn:schemas-microsoft-com:xslt"
  xmlns:caller="urn:my-object">
  <xsl:param name="ns1" select="/.."/>
  
  <xsl:key name="kVal" match="e" use="."/>

  <xsl:template match="distinct">
   <xsl:variable name="vrtfDoc">
     <xsl:for-each select="$ns1">
        <e p="{position()}"><xsl:value-of select="."/></e>
     </xsl:for-each>
   </xsl:variable>
   
   <xsl:variable name="vDoc" select="msxsl:node-set($vrtfDoc)"/>
   
   <xsl:variable name="vIds">
     <xsl:for-each 
      select="$vDoc/e[generate-id()
               =
                generate-id(key('kVal', .)[1])
                ]">
                                  
       <xsl:if test="position() = 1">|</xsl:if>
       <xsl:value-of select="concat(@p,'|')"/>
     </xsl:for-each>
   </xsl:variable>
   
   <xsl:value-of 
   select="caller:storeXPathResult
            (
              $ns1[contains($vIds,
                      concat('|', position(), '|')
                            )
                   ]
             )"/>
  </xsl:template>

</xsl:stylesheet>

In a similar way we derive new algorithms for set:intersection() and set:difference() by constructing a pipe-delimited string of concatenated node ids.

Then the result (e.g. for set:difference()) is

$ns1[not(contains($vIds, concat('|', generate-id(), '|')))]

Is XSLT Inefficient?

Today the prevailing opinion is that XSLT applications are quite inefficient as compared to applications written in a conventional programming language. Therefore, it would be interesting to see how this XSLT-based implementation compares to other implementations of EXSLT.

It was of particular interest how different EXSLT implementations behave with very large inputs. One of the tests was performed with all functions of the Sets module over a node set of 10000 nodes. The second test ran these functions on a much shorter node set consisting of 2000 nodes. I tested the Sets implementations of the following eight XSLT processors: Xalan C 1.5 ( C ), Saxon 6.5.2 (Java), Msxml4 (XSLT), JD (Java), .Net (C#), 4XSLT (Python), xsltProc/xsltLib ( C ), Xalan J 2.4.1 (Java).

I don't claim that the results are 100% correct or that XSLT processor X is much faster than an XSLT processor Y. I encourage readers are encouraged to perform their own tests. However, I believe that these results would be useful both to users and developers of XSLT processors -- in choosing a suitable XSLT processor that supports EXSLT or in further improving the EXSLT implementation of such a processor.

As can be seen from the results, exMSXML4 ran considerably faster than 5 of the 8 tested implementations, which were written in Java, Python, C#, and C.

exMSXML4 was slower only compared to Xalan C 1.5 and Saxon 6.5.2 (although it was faster than Saxon on set:leading() and set:trailing() and faster than both Xalan C 1.5 and Saxon 6.5.2 on set:has-same-node()).

The actual tests are provided in the Appendix. They were run on a 1.7GHz Pentium 4 with 256MB of RAM. All results are in milliseconds.

The results are summarized in the two tables below:

Time Comparison between different implementations of EXSLT

1. With a node-set of 10000 nodes
Implementation Intersection difference leading trailing distinct Has-same-node Total Average
Xalan C 1.5 80 80 10 15 50 58 293 49
Saxon 6.5.2 170 170 140 140 160 150 930 155
exMsxml 4 1391 1524 29 23 4468 46 7481 1247
JD 3095 3635 170 140 12117 150 19307 3218
4XSLT 6840 7400 891 871 751 6930 23323 3887
.Net xslTransform 5802 5832 22 22 10696 6558 28932 4822
xsltProc 37674 40808 20549 20429 20830 44754 185044 30840
Xalan J 2.4.1 2013 2003 181220 43653 1112 1122 231123 38520


2. With a node-set of 2000 nodes
Implementation Intersection difference leading trailing distinct Has-same-node Total Average
Xalan C 1.5 15 13 1 1 10 10 50 8
Saxon 6.5.2 31 30 30 20 30 20 161 27
exMsxml 4 72 75 11 9 238 14 419 70
.Net xslTransform 115 115 12 11 368 140 761 127
JD 140 160 80 80 370 100 930 155
4XSLT 460 460 150 150 140 471 1831 305
xsltProc 430 430 210 210 250 570 2100 350
Xalan J 2.4.1 921 901 6709 2733 832 871 12967 2161

Conclusion

This article described a general way to obtain an instance of a non-creatable object by using a callback. I have shown how to build an efficient XSLT application by improving an algorithm step by step. I demonstrated several optimization techniques: binary search, DVC, using keys, and elimination of recursion. The test results show that this XSLT-based implementation of the Sets module of EXSLT is faster than five other implementations written in Python, Java, C, and C#.

Appendix -- the tests code

  1. The two source xml documents, on which the tests were performed look like this:

    nums10000.xml:
    <Nums>
      <Num>1</Num>
      <Num>2</Num>
      <Num>3</Num>
    .  .  .  .  .  .  .  .  .
      <Num>9998</Num>
      <Num>9999</Num>
      <Num>10000</Num>
    </Nums>
    

    Nums2000.xml:
    <Nums>
      <Num>1</Num>
      <Num>2</Num>
      <Num>3</Num>
    .  .  .  .  .  .  .  .  .
      <Num>1998</Num>
      <Num>1999</Num>
      <Num>2000</Num>
    </Nums>
    


  2. Testing set:intersection() -- testExsltIntersect.xsl:
    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:set="http://exslt.org/sets">
     
     <xsl:output omit-xml-declaration="yes"/>
     
      <xsl:template match="/">
        <xsl:variable name="v2" select="/*/*[. mod 2 = 0]"/>
        <xsl:variable name="v3" select="/*/*[. mod 3 = 0]"/>
        
        <xsl:variable name="vIntersect" select="set:intersection($v2, $v3)"/>
        
        <xsl:copy-of select="$vIntersect[1]"/>
    
      </xsl:template>
    </xsl:stylesheet>
    


  3. Testing set:difference() -- testExsltDifference.xsl:
    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:set="http://exslt.org/sets">
     
     <xsl:output omit-xml-declaration="yes" indent="yes"/>
     
      <xsl:template match="/">
        <xsl:variable name="v2" select="/*/*[. mod 2 = 0]"/>
        <xsl:variable name="v3" select="/*/*[. mod 3 = 0]"/>
        
        <xsl:variable name="vDifference" select="set:difference($v2, $v3)"/>
        <xsl:copy-of select="$vDifference[1]"/>
    
      </xsl:template>
    </xsl:stylesheet>
    


  4. Testing set:leading() -- testLeading.xsl:
    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:set="http://exslt.org/sets">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>
     
      <xsl:template match="/">
        <xsl:copy-of select="set:leading(/*/*, /*/*[last()])[1]"/>
      </xsl:template>
    </xsl:stylesheet>
    


  5. Testing set:trailing() -- testTrailing.xsl:

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:set="http://exslt.org/sets">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>
     
      <xsl:template match="/">
        <xsl:copy-of select="set:trailing(/*/*, /*/*[1])[1]"/>
      </xsl:template>
    </xsl:stylesheet>
    


  6. Testing set:distinct() -- testEXSLTDistinct.xsl:
    <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:set="http://exslt.org/sets">
      <xsl:output omit-xml-declaration="yes" indent="yes"/>
      
      <xsl:template match="/">
        <xsl:copy-of select="set:distinct(/*/*)[1]"/>
      </xsl:template>
    </xsl:stylesheet>
    


  7. Testing set:has-same-node() -- testExsltSameNodes.xsl:
    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:set="http://exslt.org/sets">
     
     <xsl:output omit-xml-declaration="yes" indent="yes"/>
     
      <xsl:template match="/">
        <xsl:variable name="v2" select="/*/*[. mod 2 = 0]"/>
        <xsl:variable name="v3" select="/*/*[. mod 3 = 0]"/>
        <xsl:variable name="v4" select="/*/*[7]"/>
        
        <xsl:variable name="vbSame23" select="set:has-same-node($v2, $v3)"/>
        
        set:has-same-node(/*/*[. mod 2 = 0], /*/*[. mod 3 = 0]) = <xsl:text/>
        
        <xsl:copy-of select="$vbSame23"/>
        
        <xsl:text>&#160;#160;</xsl:text>
        <xsl:text>&#160;#160;</xsl:text>
        
        <xsl:variable name="vbSame24" select="set:has-same-node($v2, $v4)"/>
        
        set:has-same-node(/*/*[. mod 2 = 0], /*/*[7]) = <xsl:text/>
        
        <xsl:copy-of select="$vbSame24"/>
      </xsl:template>
    </xsl:stylesheet>