EXSLT for MSXML
The EXSLT project specifies a set of standard extension functions for XSLT that, when implemented by all vendors of XSLT processors, will allow writing portable XSLT applications. Until now most of the major XSLT processors had already some form of support for EXSLT, with the notable exception of MSXML.
This article describes a third party implementation of EXSLT for MSXML4 by Dimitre Novatchev. The author had no access to internal product interfaces and had to overcome some serious difficulties, which until now had prevented the development of any such third party implementation of EXSLT for MSXML4.
There are different possible ways to implement extensions to an XSLT processor:
The first and the second option require that either the source code of the XSLT processor (and the right to modify and extend it) is available or that documentation is provided explaining how to implement extension elements. None of these is true in the case of MSXML.
Following the third option will require the MSXML programmers to include some inline scripting code written in a language like Javascript. This is not completely convenient and differs from the way extension functions are implemented and used in other XSLT processors.
The decision to choose the last option reflects its advantages:
There are eight different modules of functions specified by EXSLT:
Dynamic and Functions require access to the internals (code/data structures) of the XSLT processor -- something not achievable in the case of MSXML. Math is too trivial and has already an implementation in XSLT 1.0 and XSLT 2.0 (See The FXSL Functional Programming Library for XSLT1 and The FXSL Functional Programming Library for XSLT2 Functions, Dates and Times, Regular expressions and part of Strings will be covered by the standard XSLT 2.0 (see XQuery 1.0 and XPath 2.0 Functions and Operators.) There are also pure XSLT 1.0 libraries covering dates and time (A date_time XSLT 1.0 template library available as part of XSelerator)
From the standpoint of immediate usability in XSLT 1.0, the most useful EXSLT function is common:node-set(). Other necessary and useful functions, which cannot be implemented in XSLT 1.0, are those from the Sets module.
So I decided to implement the following EXSLT functions:
Have you ever wondered why for more than two years there has been no attempt at a third-party implementation of EXSLT for MSXML? Try to produce such and you'll know that there is a big obstacle, which no one until now had been able to remove.
In the object model of MSXML there is one method for obtaining a
node set as result of evaluating an XPath expression. This method is
selectNodes() member of the IXMLDOMNode object
and defined as follows:
HRESULT selectNodes( BSTR expression, IXMLDOMNodeList ** resultList);
As can be seen, in the MSXML object model a node set is represented by
an IXMLDOMNodeList object. A node is represented by an
IXMLDOMNode object.
selectNodes() can be issued only against a "current node" -- some IXMLDOMNode object.
The problem is that there is no documented way to create an
IXMLDOMNodeList, except as returned by
selectNodes(). This means that one cannot perform even such
simple tasks as getting the union of two IXMLDOMNode
nodes.
How then can we get a subset of a node set using only the MSXML object model? Impossible. But this is exactly what is needed in order to implement any of the six functions in the EXSLT Sets module.
I must confess that it was exactly the challenge of the impossible task
that attracted me. Someone, who didn't know better, told me that
implementing the Sets module was impossible without asking Microsoft to
provide a more powerful interface, containing methods that can create an
IXMLDOMNodeList from any collection of
IXMLDOMNode objects. It was also implied that an XSLT
specialist was inferior in attacking, solving and even understanding this
problem.
What happened next is described below.
As explained above, using the MSXML4 object model it is impossible to
create an IXMLDOMNodeList other than one returned by the
selectNodes() method and this has strong limitations making
impossible the implementation of any EXSLT Sets functions.
The solution is to try to create an IXMLDOMNodeList
outside of the MSXML object model. In XSLT there are no such
limitations. It is straightforward to obtain the result of evaluating any
XPath expression. So why not perform an XSLT transformation, which will
evaluate any Xpath expression we need and produce its result to us?
A nice idea, but there is a major flaw in it -- an XSLT transformation always produces copies of the original nodes, not the nodes themselves. This is probably the moment when anybody stopped in desperation.
A transformation can evaluate any XPath expression internally and have access to the resulting node set, but it cannot "pass it back", it can only produce copies of the original nodes.
Can a transformation pass the result node set to any piece of code at all? Yes, it can pass it to another template it calls or instantiates or to an extension function.
This seems absolutely unusable in our case. We called the transformation so how it can call us? Even if this were possible, we still must make a return and will lose the valuable node set that the transformation passed to us.
The answer is simple: we store it in a property of our extension object. When the transformation returns to our code that started it, the node set will still be the value of this property.
This manipulation is reflected in the following picture:
|
|
Having solved the big problem, it's time for the complete
implementation. The XSLT solutions in the "Aux. Transform" box can be
really simple and compact. Thus for set:intersection() we can
have:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:caller="urn:my-object"> <xsl:param name="ns1" select="/.."/> <xsl:param name="ns2" select="/.."/> <xsl:variable name="vCnt" select="count($ns2)"/> <xsl:template match="intersect"> <xsl:value-of select="caller:storeXPathResult($ns1[count(. | $ns2) = $vCnt])"/> </xsl:template> </xsl:stylesheet> |
For set:distinct() we can have the following XSLT implementation:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:caller="urn:my-object"> <xsl:param name="ns1" select="/.."/> <xsl:template match="makeDistinct" name="makeDistinct"> <xsl:param name="pDistinct" select="/.."/> <xsl:param name="pNodes" select="$ns1"/> <xsl:choose> <xsl:when test="$pNodes"> <xsl:variable name="pnewDistinct" select="$pDistinct | $pNodes[1]"/> <xsl:call-template name="makeDistinct"> <xsl:with-param name="pDistinct" select="$pnewDistinct"/> <xsl:with-param name="pNodes" select="$pNodes[position() > 1][not(. = $pnewDistinct)]"/> </xsl:call-template> </xsl:when> <xsl:otherwise> <xsl:value-of select="caller:storeXPathResult($pDistinct)"/> </xsl:otherwise> </xsl:choose> </xsl:template> </xsl:stylesheet> |
Finally, for set:leading() we could have:
<xsl:stylesheet version="1.0" xmlns:xsl = "http://www.w3.org/1999/XSL/Transform" xmlns:caller="urn:my-object"> <xsl:param name="ns1" select="/.."/> <xsl:param name="ns2" select="/.."/> <xsl:template match="leading" name="leading"> <xsl:param name="pNodes" select="$ns1"/> <xsl:param name="pANode" select="$ns2[1]"/> <xsl:param name="pLeading" select="/.."/> <xsl:choose> <xsl:when test="not($pNodes) or not($pANode)"> <xsl:value-of select="caller:storeXPathResult($pNodes)"/> </xsl:when> <xsl:when test="count($pANode | $pNodes[1]) = 1 or count($pANode | $pNodes) != count($pNodes)"> <xsl:value-of select="caller:storeXPathResult($pLeading)"/> </xsl:when> <xsl:otherwise> <xsl:variable name="pnewLeading" select="$pLeading | $pNodes[1]"/> <xsl:call-template name="leading"> <xsl:with-param name="pLeading" select="$pnewLeading"/> <xsl:with-param name="pNodes" select="$pNodes[position() > 1]"/> <xsl:with-param name="pANode" select="$pANode"/> </xsl:call-template> </xsl:otherwise> </xsl:choose> </xsl:template> </xsl:stylesheet> |
The XSLT implementation of the other functions of the Sets
module is similar -- set:difference() is coded in a similar
way to set:intersection(), set:trailing() is
similar to set:leading(), and
set:has-same-node() returns true if
set:intersection() is non-empty.
|
We could stop here, but the algorithms are not fast and robust, especially in some worst case scenarios. In particular, deep recursion typically can eat up significant amounts of memory, slow down the execution from linear to exponential, and eventually crash an XSLT processor. (See Use recursion effectively in XSL and Two-stage recursive algorithms in XSLT).
Therefore, in looking to improve our initial algorithms we try to limit the depth of recursion or to eliminate recursion completely.
One powerful way of limiting the depth of recursion is the Divide and Conquer (DVC) method (see Algorithms in C++, Part 1-4, by Robert Sedgewick, Addison-Wesley (ISBN 0-201-35088-2) and DVC Algorithms in XSLT). The idea is to split the input into two or more parts, apply the algorithm on each part and then combine the results. A DVC-style algorithm has a maximum recursion depth of Log2(N), which allows very long inputs to be processed without any stack overflow problems -- e.g. the maximum recursion depth for processing a list of 1000000 (1M) elements is only 19.
A DVC-style set:distinct() implementation looks like this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:set="http://exslt.org/sets" xmlns:caller="urn:my-object"> <xsl:param name="ns1" select="/.."/> <xsl:template match="distinct"> <xsl:param name="pNodes" select="$ns1"/> <xsl:if test="$pNodes"> <xsl:choose> <xsl:when test="not($pNodes[2])"> <xsl:value-of select="caller:storeXPathResult($pNodes)"/> </xsl:when> <xsl:otherwise> <xsl:variable name="vHalfLng" select="floor(count($pNodes) div 2)"/> <xsl:variable name="vHalf1" select="$pNodes[position() <= $vHalfLng]"/> <xsl:variable name="vHalf2" select="$pNodes[position() > $vHalfLng]"/> <xsl:variable name="vDist1" select="set:distinct($vHalf1)"/> <xsl:variable name="vDist2" select="set:distinct($vHalf2)"/> <xsl:value-of select="caller:storeXPathResult($vDist1 | $vDist2 [not(. = $vDist1)])"/> </xsl:otherwise> </xsl:choose> </xsl:if> </xsl:template> </xsl:stylesheet> |
With longer input node sets this transformation is much faster than the previous and it scales well in processing huge inputs.
The initial algorithm for set:leading() incrementally
unifies all nodes from the start of the node set until the designated
(second argument) node is reached. This has all the problems of deep
recursion.
A more direct and efficient solution exists, one based on the classic binary search algorithm. Here we try to find a subset of the original node set, such that the designated (second argument) node is the last node in this subset. When such a subset is found, we simply return:
$accumulatedLeading | thisSubset[position() != last()] |
The $accumulatedLeading is the union of all subsets
preceding the current subset. It is much more efficient to accumulate
whole subsets than to accumulate node by node. The binary search algorithm
is similar to a DVC-style algorithm but is more efficient because at any
time only one of the two parts is recursively processed.
To find such a subset we divide the original node set into two halves
and take the one which contains the designated node. If the designated
node is not the last, we do the same recursively to the chosen (half)
subset. Here's a somewhat simplified version of the Binary Search
set:leading() code:
<xsl:stylesheet version="1.0" xmlns:xsl = "http://www.w3.org/1999/XSL/Transform" xmlns:caller="urn:my-object"> <xsl:param name="ns1" select="/.."/> <xsl:param name="ns2" select="/.."/> <xsl:template name="binLeading" match="leading"> <xsl:param name="pNodes" select="$ns1"/> <xsl:param name="pANode" select="$ns2[1]"/> <xsl:param name="pLeading" select="/.."/> <xsl:choose> <xsl:when test="count($pANode | $pNodes[last()]) = 1"> <xsl:value-of select="caller:storeXPathResult($pLeading | $pNodes[position() != last()])"/> </xsl:when> <xsl:otherwise> <xsl:variable name="vcntHalf" select="floor(count($pNodes) div 2)"/> <xsl:variable name="vHalf1" select="$pNodes[position() <= $vcntHalf]"/> <xsl:choose> <xsl:when test="count($pANode | $vHalf1) = count($vHalf1)"> <xsl:call-template name="binLeading"> <xsl:with-param name="pNodes" select="$vHalf1"/> <xsl:with-param name="pANode" select="$pANode"/> <xsl:with-param name="pLeading" select="$pLeading"/> </xsl:call-template> </xsl:when> <xsl:otherwise> <xsl:call-template name="binLeading"> <xsl:with-param name="pNodes" select="$pNodes[position() > $vcntHalf]"/> <xsl:with-param name="pANode" select="$pANode"/> <xsl:with-param name="pLeading" select="$pLeading | $vHalf1"/> </xsl:call-template> </xsl:otherwise> </xsl:choose> </xsl:otherwise> </xsl:choose> </xsl:template> </xsl:stylesheet> |
Not surprisingly, we can even further improve efficiency if we manage to eliminate recursion altogether.
Let's take as example set:distinct():
$ns1 into
an element of a temporary tree. This element has the same string value as
the node being copied. It also has an attribute "p", the
value of which is the position of the original node in the argument
node-set.$vids of the original
positions (the value of the "p" attribute) of all such first
elements.$ns1[contains($vIds, concat('|', position(), '|'))]Here's the code for set:distinct():
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:caller="urn:my-object"> <xsl:param name="ns1" select="/.."/> <xsl:key name="kVal" match="e" use="."/> <xsl:template match="distinct"> <xsl:variable name="vrtfDoc"> <xsl:for-each select="$ns1"> <e p="{position()}"><xsl:value-of select="."/></e> </xsl:for-each> </xsl:variable> <xsl:variable name="vDoc" select="msxsl:node-set($vrtfDoc)"/> <xsl:variable name="vIds"> <xsl:for-each select="$vDoc/e[generate-id() = generate-id(key('kVal', .)[1]) ]"> <xsl:if test="position() = 1">|</xsl:if> <xsl:value-of select="concat(@p,'|')"/> </xsl:for-each> </xsl:variable> <xsl:value-of select="caller:storeXPathResult ( $ns1[contains($vIds, concat('|', position(), '|') ) ] )"/> </xsl:template> </xsl:stylesheet> |
In a similar way we derive new algorithms for
set:intersection() and set:difference() by
constructing a pipe-delimited string of concatenated node ids.
Then the result (e.g. for set:difference()) is
$ns1[not(contains($vIds, concat('|', generate-id(), '|')))]
|
Today the prevailing opinion is that XSLT applications are quite inefficient as compared to applications written in a conventional programming language. Therefore, it would be interesting to see how this XSLT-based implementation compares to other implementations of EXSLT.
It was of particular interest how different EXSLT implementations behave with very large inputs. One of the tests was performed with all functions of the Sets module over a node set of 10000 nodes. The second test ran these functions on a much shorter node set consisting of 2000 nodes. I tested the Sets implementations of the following eight XSLT processors: Xalan C 1.5 ( C ), Saxon 6.5.2 (Java), Msxml4 (XSLT), JD (Java), .Net (C#), 4XSLT (Python), xsltProc/xsltLib ( C ), Xalan J 2.4.1 (Java).
I don't claim that the results are 100% correct or that XSLT processor X is much faster than an XSLT processor Y. I encourage readers are encouraged to perform their own tests. However, I believe that these results would be useful both to users and developers of XSLT processors -- in choosing a suitable XSLT processor that supports EXSLT or in further improving the EXSLT implementation of such a processor.
As can be seen from the results, exMSXML4 ran considerably faster than 5 of the 8 tested implementations, which were written in Java, Python, C#, and C.
exMSXML4 was slower only compared to Xalan C 1.5 and Saxon 6.5.2
(although it was faster than Saxon on set:leading() and
set:trailing() and faster than both Xalan C 1.5 and Saxon
6.5.2 on set:has-same-node()).
The actual tests are provided in the Appendix. They were run on a 1.7GHz Pentium 4 with 256MB of RAM. All results are in milliseconds.
The results are summarized in the two tables below:
Time Comparison between different implementations of EXSLT
| 1. With a node-set of 10000 nodes | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Implementation | Intersection | difference | leading | trailing | distinct | Has-same-node | Total | Average | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Xalan C 1.5 | 80 | 80 | 10 | 15 | 50 | 58 | 293 | 49 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Saxon 6.5.2 | 170 | 170 | 140 | 140 | 160 | 150 | 930 | 155 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| exMsxml 4 | 1391 | 1524 | 29 | 23 | 4468 | 46 | 7481 | 1247 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| JD | 3095 | 3635 | 170 | 140 | 12117 | 150 | 19307 | 3218 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4XSLT | 6840 | 7400 | 891 | 871 | 751 | 6930 | 23323 | 3887 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| .Net xslTransform | 5802 | 5832 | 22 | 22 | 10696 | 6558 | 28932 | 4822 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| xsltProc | 37674 | 40808 | 20549 | 20429 | 20830 | 44754 | 185044 | 30840 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Xalan J 2.4.1 | 2013 | 2003 | 181220 | 43653 | 1112 | 1122 | 231123 | 38520 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2. With a node-set of 2000 nodes | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Implementation | Intersection | difference | leading | trailing | distinct | Has-same-node | Total | Average | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Xalan C 1.5 | 15 | 13 | 1 | 1 | 10 | 10 | 50 | 8 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Saxon 6.5.2 | 31 | 30 | 30 | 20 | 30 | 20 | 161 | 27 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| exMsxml 4 | 72 | 75 | 11 | 9 | 238 | 14 | 419 | 70 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| .Net xslTransform | 115 | 115 | 12 | 11 | 368 | 140 | 761 | 127 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| JD | 140 | 160 | 80 | 80 | 370 | 100 | 930 | 155 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4XSLT | 460 | 460 | 150 | 150 | 140 | 471 | 1831 | 305 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| xsltProc | 430 | 430 | 210 | 210 | 250 | 570 | 2100 | 350 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Xalan J 2.4.1 | 921 | 901 | 6709 | 2733 | 832 | 871 | 12967 | 2161 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This article described a general way to obtain an instance of a non-creatable object by using a callback. I have shown how to build an efficient XSLT application by improving an algorithm step by step. I demonstrated several optimization techniques: binary search, DVC, using keys, and elimination of recursion. The test results show that this XSLT-based implementation of the Sets module of EXSLT is faster than five other implementations written in Python, Java, C, and C#.
The two source xml documents, on which the tests were performed look like this:
nums10000.xml:
<Nums> <Num>1</Num> <Num>2</Num> <Num>3</Num> . . . . . . . . . <Num>9998</Num> <Num>9999</Num> <Num>10000</Num> </Nums> |
<Nums> <Num>1</Num> <Num>2</Num> <Num>3</Num> . . . . . . . . . <Num>1998</Num> <Num>1999</Num> <Num>2000</Num> </Nums> |
set:intersection() -- testExsltIntersect.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:set="http://exslt.org/sets"> <xsl:output omit-xml-declaration="yes"/> <xsl:template match="/"> <xsl:variable name="v2" select="/*/*[. mod 2 = 0]"/> <xsl:variable name="v3" select="/*/*[. mod 3 = 0]"/> <xsl:variable name="vIntersect" select="set:intersection($v2, $v3)"/> <xsl:copy-of select="$vIntersect[1]"/> </xsl:template> </xsl:stylesheet> |
set:difference() -- testExsltDifference.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:set="http://exslt.org/sets"> <xsl:output omit-xml-declaration="yes" indent="yes"/> <xsl:template match="/"> <xsl:variable name="v2" select="/*/*[. mod 2 = 0]"/> <xsl:variable name="v3" select="/*/*[. mod 3 = 0]"/> <xsl:variable name="vDifference" select="set:difference($v2, $v3)"/> <xsl:copy-of select="$vDifference[1]"/> </xsl:template> </xsl:stylesheet> |
set:leading() -- testLeading.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:set="http://exslt.org/sets"> <xsl:output omit-xml-declaration="yes" indent="yes"/> <xsl:template match="/"> <xsl:copy-of select="set:leading(/*/*, /*/*[last()])[1]"/> </xsl:template> </xsl:stylesheet> |
set:trailing() -- testTrailing.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:set="http://exslt.org/sets"> <xsl:output omit-xml-declaration="yes" indent="yes"/> <xsl:template match="/"> <xsl:copy-of select="set:trailing(/*/*, /*/*[1])[1]"/> </xsl:template> </xsl:stylesheet> |
set:distinct() -- testEXSLTDistinct.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:set="http://exslt.org/sets"> <xsl:output omit-xml-declaration="yes" indent="yes"/> <xsl:template match="/"> <xsl:copy-of select="set:distinct(/*/*)[1]"/> </xsl:template> </xsl:stylesheet> |
set:has-same-node() -- testExsltSameNodes.xsl:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:set="http://exslt.org/sets"> <xsl:output omit-xml-declaration="yes" indent="yes"/> <xsl:template match="/"> <xsl:variable name="v2" select="/*/*[. mod 2 = 0]"/> <xsl:variable name="v3" select="/*/*[. mod 3 = 0]"/> <xsl:variable name="v4" select="/*/*[7]"/> <xsl:variable name="vbSame23" select="set:has-same-node($v2, $v3)"/> set:has-same-node(/*/*[. mod 2 = 0], /*/*[. mod 3 = 0]) = <xsl:text/> <xsl:copy-of select="$vbSame23"/> <xsl:text> </xsl:text> <xsl:text> </xsl:text> <xsl:variable name="vbSame24" select="set:has-same-node($v2, $v4)"/> set:has-same-node(/*/*[. mod 2 = 0], /*/*[7]) = <xsl:text/> <xsl:copy-of select="$vbSame24"/> </xsl:template> </xsl:stylesheet> |
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.