XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

EXSLT for MSXML
by Dimitre Novatchev | Pages: 1, 2, 3

Optimization

We could stop here, but the algorithms are not fast and robust, especially in some worst case scenarios. In particular, deep recursion typically can eat up significant amounts of memory, slow down the execution from linear to exponential, and eventually crash an XSLT processor. (See Use recursion effectively in XSL and Two-stage recursive algorithms in XSLT).

Therefore, in looking to improve our initial algorithms we try to limit the depth of recursion or to eliminate recursion completely.

Limiting recursion -- the Divide and Conquer (DVC) approach

One powerful way of limiting the depth of recursion is the Divide and Conquer (DVC) method (see Algorithms in C++, Part 1-4, by Robert Sedgewick, Addison-Wesley (ISBN 0-201-35088-2) and DVC Algorithms in XSLT). The idea is to split the input into two or more parts, apply the algorithm on each part and then combine the results. A DVC-style algorithm has a maximum recursion depth of Log2(N), which allows very long inputs to be processed without any stack overflow problems -- e.g. the maximum recursion depth for processing a list of 1000000 (1M) elements is only 19.

A DVC-style set:distinct() implementation looks like this:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:set="http://exslt.org/sets"
  xmlns:caller="urn:my-object">

  <xsl:param name="ns1" select="/.."/>

  <xsl:template match="distinct">
    <xsl:param name="pNodes" select="$ns1"/>
    
    <xsl:if test="$pNodes">
      <xsl:choose>
        <xsl:when test="not($pNodes[2])">
          <xsl:value-of 
           select="caller:storeXPathResult($pNodes)"/>   
        </xsl:when>
        <xsl:otherwise>
          <xsl:variable name="vHalfLng"
   select="floor(count($pNodes) div 2)"/>
          <xsl:variable name="vHalf1" 
     select="$pNodes[position() &lt;= $vHalfLng]"/>
          
          <xsl:variable name="vHalf2" 
     select="$pNodes[position() > $vHalfLng]"/>
           
           <xsl:variable name="vDist1"
    select="set:distinct($vHalf1)"/>
           <xsl:variable name="vDist2"
    select="set:distinct($vHalf2)"/>
           
           <xsl:value-of 
           select="caller:storeXPathResult($vDist1
                                   |
                                    $vDist2
                                     [not(. = $vDist1)])"/>   
         </xsl:otherwise>
      </xsl:choose>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

With longer input node sets this transformation is much faster than the previous and it scales well in processing huge inputs.

Binary search

The initial algorithm for set:leading() incrementally unifies all nodes from the start of the node set until the designated (second argument) node is reached. This has all the problems of deep recursion.

A more direct and efficient solution exists, one based on the classic binary search algorithm. Here we try to find a subset of the original node set, such that the designated (second argument) node is the last node in this subset. When such a subset is found, we simply return:

$accumulatedLeading | thisSubset[position() != last()]

The $accumulatedLeading is the union of all subsets preceding the current subset. It is much more efficient to accumulate whole subsets than to accumulate node by node. The binary search algorithm is similar to a DVC-style algorithm but is more efficient because at any time only one of the two parts is recursively processed.

To find such a subset we divide the original node set into two halves and take the one which contains the designated node. If the designated node is not the last, we do the same recursively to the chosen (half) subset. Here's a somewhat simplified version of the Binary Search set:leading() code:

<xsl:stylesheet version="1.0" 
 xmlns:xsl = "http://www.w3.org/1999/XSL/Transform" 
 xmlns:caller="urn:my-object">
 <xsl:param name="ns1" select="/.."/>
 <xsl:param name="ns2" select="/.."/>

  <xsl:template name="binLeading" match="leading">
    <xsl:param name="pNodes" select="$ns1"/>
    <xsl:param name="pANode" select="$ns2[1]"/>
    <xsl:param name="pLeading" select="/.."/>

    <xsl:choose>
      <xsl:when test="count($pANode | $pNodes[last()]) = 1">
        <xsl:value-of 
      select="caller:storeXPathResult($pLeading
                        | $pNodes[position() != last()])"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="vcntHalf"
 select="floor(count($pNodes) div 2)"/>
        
        <xsl:variable name="vHalf1" 
     select="$pNodes[position() &lt;= $vcntHalf]"/>
        
        <xsl:choose>
          <xsl:when test="count($pANode | $vHalf1) = count($vHalf1)">
            <xsl:call-template name="binLeading">
              <xsl:with-param name="pNodes" select="$vHalf1"/>
              <xsl:with-param name="pANode" select="$pANode"/>
              <xsl:with-param name="pLeading" select="$pLeading"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="binLeading">
              <xsl:with-param name="pNodes"
   select="$pNodes[position() > $vcntHalf]"/>
               <xsl:with-param name="pANode" select="$pANode"/>
               <xsl:with-param name="pLeading"
    select="$pLeading | $vHalf1"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>

Eliminating recursion

Not surprisingly, we can even further improve efficiency if we manage to eliminate recursion altogether.

Let's take as example set:distinct():

  1. We copy every node from the argument node set $ns1 into an element of a temporary tree. This element has the same string value as the node being copied. It also has an attribute "p", the value of which is the position of the original node in the argument node-set.
  2. Then using the Muenchian method for grouping (see Grouping using xsl:key and the key() function -- the Muenchian method) we find in this tree every first in document order element that has some unique value.
  3. We create a pipe-delimited string $vids of the original positions (the value of the "p" attribute) of all such first elements.
  4. Finally, the set of distinct elements is:
    $ns1[contains($vIds, concat('|', position(), '|'))]

Here's the code for set:distinct():

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:msxsl="urn:schemas-microsoft-com:xslt"
  xmlns:caller="urn:my-object">
  <xsl:param name="ns1" select="/.."/>
  
  <xsl:key name="kVal" match="e" use="."/>

  <xsl:template match="distinct">
   <xsl:variable name="vrtfDoc">
     <xsl:for-each select="$ns1">
        <e p="{position()}"><xsl:value-of select="."/></e>
     </xsl:for-each>
   </xsl:variable>
   
   <xsl:variable name="vDoc" select="msxsl:node-set($vrtfDoc)"/>
   
   <xsl:variable name="vIds">
     <xsl:for-each 
      select="$vDoc/e[generate-id()
               =
                generate-id(key('kVal', .)[1])
                ]">
                                  
       <xsl:if test="position() = 1">|</xsl:if>
       <xsl:value-of select="concat(@p,'|')"/>
     </xsl:for-each>
   </xsl:variable>
   
   <xsl:value-of 
   select="caller:storeXPathResult
            (
              $ns1[contains($vIds,
                      concat('|', position(), '|')
                            )
                   ]
             )"/>
  </xsl:template>

</xsl:stylesheet>

In a similar way we derive new algorithms for set:intersection() and set:difference() by constructing a pipe-delimited string of concatenated node ids.

Then the result (e.g. for set:difference()) is

$ns1[not(contains($vIds, concat('|', generate-id(), '|')))]

Pages: 1, 2, 3

Next Pagearrow