Duplicate and Empty Elements
October 2, 2002
When we copy a source tree to a result tree, the task of deleting duplicate elements from the copy sounds simple: don't copy an element from the source tree to the result tree if an identical one has already been copied. But what do we mean by "identical"? According to the XPath specification, an equals sign comparing two node sets is true if some node in the first set and some node in the second set have equal string values. An element's string value is the concatenation of all of its text node descendants. Because text nodes store the character data contents of elements -- that is, the part between start- and end-tags -- two elements with different attributes, or with different values in the same attribute, are still considered equal: attributes aren't taken into account in the XPath version of element equality. So, if you only want to compare element content when determining which elements are duplicates, an equals sign will do, but if you want to consider attribute values, you have to say so explicitly.
Let's look at an example. While no two line elements in the following document are exactly the same -- each has a different lid ("line ID") attribute value along with possible other differences -- we'll examine several ways to avoid copying certain elements to the result tree because they have content or an attribute value in common with others. (All sample files are in this zip file.)
<sample>
  <line lid="u1">hello</line>
  <line color="red" lid="u2">hello</line>
  <line color="blue" lid="u3">hello</line>
  <line lid="u4">hello there</line>
  <line color="blue" lid="u5">hello there</line>
  <line color="blue" lid="u6">hello</line>
</sample>
The first stylesheet has a template rule for line elements that only copies one to the result tree if it's not equal to any of the line elements in the preceding axis -- that is, not equal to any of the line elements that finished before the one being processed began. (See the column titled Axis Powers: Part Two for more on this axis.) The other template copies all the other nodes verbatim.
<!-- xq495.xsl: converts xq494.xml into xq496.xml -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="line">
    <xsl:if test="not(. = preceding::line)">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
As I mentioned above, the XPath spec considers elements equal if the string values that represent their contents are equal, and the contents are the parts between the tags, so attribute values aren't considered in this kind of equality test. So, this stylesheet adds only one line element with the contents "hello" to the result tree, and no more, regardless of their attribute values. Likewise for the "hello there" elements:
<sample>
  <line lid="u1">hello</line>
  <line lid="u4">hello there</line>
</sample>
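Because an element's string value includes the text of all its descendants, the same test also treats elements with different markup inside them as equal. The following is a minimal sketch rather than one of the column's sample files: the input document shown in the comment and its lid values "v1" and "v2" are hypothetical, and the line template rule is the one from xq495.xsl.

```xml
<!-- Sketch only. Hypothetical input document:

     <sample>
       <line lid="v1">hello there</line>
       <line lid="v2">hello <b>there</b></line>
     </sample>
-->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Same test as xq495.xsl: compare string values only. -->
  <xsl:template match="line">
    <xsl:if test="not(. = preceding::line)">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The string value of the "v2" element is the concatenation of "hello " and "there", which equals the string value of the "v1" element, so only "v1" should reach the result tree even though "v2" has a b child element that "v1" lacks.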
The first variation on the line template rule above has a different condition in the test attribute of its xsl:if instruction: it won't add a line element to the result tree if any preceding line element had the same value in its color attribute that the context node line element has.
<!-- xq497.xsl: converts xq494.xml into xq498.xml -->
<xsl:template match="line">
  <xsl:if test="not(@color = preceding::line/@color)">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:if>
</xsl:template>
Once the XSLT processor adds one line element with a color attribute value of "blue", it doesn't add any more—even when the line element has different content, such as the "u5" one with "hello there" as its content.
<sample>
  <line lid="u1">hello</line>
  <line lid="u2" color="red">hello</line>
  <line lid="u3" color="blue">hello</line>
  <line lid="u4">hello there</line>
</sample>
The next version of the same template rule won't copy any line element that has the same content and the same color attribute value as any earlier line element. It's more complicated than the earlier examples. First, this template rule sets two local variables with the contents and color attribute value of the context node to use in the comparison. Then, the comparison in the predicate (that is, in the square brackets) uses a boolean and to connect the two conditions. It checks for preceding line elements that meet both conditions, and the not() function wrapped around the whole XPath expression tells the XSLT processor to copy only those line elements for which no such preceding element exists.
<!-- xq499.xsl: converts xq494.xml into xq500.xml -->
<xsl:template match="line">
  <xsl:variable name="contents" select="."/>
  <xsl:variable name="colorVal" select="@color"/>
  <xsl:if test="not(preceding::line[(. = $contents) and
                                    (@color = $colorVal)])">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:if>
</xsl:template>
The result of running the source document with this version of the template has all the line elements except the "u6" one, which is the only one with contents and a color attribute that match the contents and color attribute of an earlier line element ("u3"):
<sample>
  <line lid="u1">hello</line>
  <line lid="u2" color="red">hello</line>
  <line lid="u3" color="blue">hello</line>
  <line lid="u4">hello there</line>
  <line lid="u5" color="blue">hello there</line>
</sample>
To compare the lid attribute value along with the contents and color attribute value of the line elements would just mean declaring another local variable and adding another condition inside the square brackets. Remember, though, that the more complicated a comparison condition you have, the more work the XSLT processor must do, so the slower the stylesheet will run.
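As a rough sketch of that three-way comparison (this is not one of the column's sample files, and the variable name lidVal is just an assumption made for the illustration), the extra variable and condition might look like this:

```xml
<!-- Sketch only: extends the xq499.xsl template rule with a third condition. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="line">
    <xsl:variable name="contents" select="."/>
    <xsl:variable name="colorVal" select="@color"/>
    <!-- lidVal is a hypothetical name for the added variable. -->
    <xsl:variable name="lidVal" select="@lid"/>
    <xsl:if test="not(preceding::line[(. = $contents) and
                                      (@color = $colorVal) and
                                      (@lid = $lidVal)])">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because every line element in the sample document has a unique lid value, this version should copy all six of them. One thing to keep in mind with this style of test: a comparison against an empty node set is always false, so an element with no color attribute never counts as matching an earlier element, whatever its content.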
If you know that all potential duplicate elements are siblings, as they are in this column's examples, you can speed things up by using the preceding-sibling axis instead of the preceding axis so that the XSLT processor won't try to check as many nodes for equality. (See Axis Powers: Part One for more on this axis.) This column's examples use the preceding axis because it does a more complete check that would work in a wider variety of cases.
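A minimal sketch of the preceding-sibling variation, assuming the same source document as the first stylesheet (this listing is an illustration, not one of the column's sample files):

```xml
<!-- Sketch only: xq495.xsl with preceding-sibling in place of preceding. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Only earlier siblings of the context node are checked. -->
  <xsl:template match="line">
    <xsl:if test="not(. = preceding-sibling::line)">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With the sample document, where all the line elements share one parent, this should keep the same two elements ("u1" and "u4") as the version that uses the preceding axis. It would miss a duplicate that sits under a different parent element, which is the trade-off described above.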
Creating and Checking for Empty Elements
To an XML parser, the elements <sample/> and <sample></sample> are the same—they're both empty sample elements. When your stylesheet adds an element to the result tree, if it creates no content for that element, it's created an empty element, whether that element type was declared as being empty in a DTD or not. To demonstrate, let's look at a template rule that copies the following element and adds some empty sample elements to the copy.
<test>Dagon his Name, Sea Monster</test>
The stylesheet adds seven empty sample elements after the result tree's test start-tag. The fourth, fifth, and sixth resemble the first three except that each of these empty elements includes an attribute specification.
<!-- xq135.xsl: converts xq134.xml into xq476.xml -->
<xsl:template match="test">
  <test>
    1. <sample/>
    2. <sample></sample>
    3. <xsl:element name="sample"/>
    4. <sample color="green"/>
    5. <sample color="green"></sample>
    6. <xsl:element name="sample">
         <xsl:attribute name="color">green</xsl:attribute>
       </xsl:element>
    7. <sample> </sample>
    <xsl:apply-templates/>
  </test>
</xsl:template>
Whether elements are shown in the stylesheet as a single-tag empty element, a start- and end-tag pair with nothing between them, or as an xsl:element instruction that has no content specified, they all show up in the result tree as empty elements.
<test>
  1. <sample/>
  2. <sample/>
  3. <sample/>
  4. <sample color="green"/>
  5. <sample color="green"/>
  6. <sample color="green"/>
  7. <sample/>Dagon his Name, Sea Monster</test>
The seventh sample is a special case. If whitespace characters (tabs, carriage returns, line feeds, or spacebar spaces) and no others occur between two tags in a stylesheet, the XSLT processor strips that whitespace-only text node from the stylesheet tree before processing it (unless the text is inside an xsl:text element or an xml:space="preserve" setting applies), which is why that seventh sample element is considered empty. This does look a little confusing in the stylesheet, so it's a good idea to avoid it when possible.
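If you do want an element whose only content is whitespace, XSLT 1.0 keeps whitespace that appears inside an xsl:text element. The following sketch, which is not one of the column's sample files, puts the two forms side by side and assumes the same one-element test source document used above:

```xml
<!-- Sketch only: contrasts stripped and preserved whitespace in a stylesheet. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="test">
    <test>
      <!-- Whitespace-only text directly between the tags is stripped
           from the stylesheet, so this sample element ends up empty. -->
      <sample> </sample>
      <!-- Whitespace inside xsl:text is preserved, so this sample
           element is created with a single space as its content. -->
      <sample><xsl:text> </xsl:text></sample>
      <xsl:apply-templates/>
    </test>
  </xsl:template>

</xsl:stylesheet>
```

The first sample element should come out empty and the second should contain one space. Using xsl:text also makes the intent visible to anyone reading the stylesheet later.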
How about checking for empty elements so that your stylesheet can perform certain actions if they're empty and others if they're not? The following test document has seven sample elements, and the first four are empty for practical purposes: A and B have no content at all, and C and D contain nothing but whitespace.
<test>
  <sample eid="A"/>
  <sample eid="B"></sample>
  <sample eid="C"> </sample>
  <sample eid="D"> </sample>
  <sample eid="E">some text</sample>
  <sample eid="F"><color>blue</color></sample>
  <sample eid="G"><color>red</color>more text</sample>
</test>
How can a template rule know which ones are empty? When the node set "." (an abbreviation of "self::node()") is passed to the normalize-space() function, the function converts this node set to a string (because it only acts on strings) and then does its real job: it replaces each run of whitespace characters with a single space and removes all leading and trailing whitespace. (See Splitting and Removing Strings for more on this function.) The xsl:when element's test attribute then converts the resulting string to a boolean: an empty string is false, and any other string is true. So if there's anything left, the sample element had something in it, and the test is true.
The following template checks this boolean value before adding a message about the sample element to the result tree:
<!-- xq137.xsl: converts xq136.xml into xq138.txt -->
<xsl:template match="sample">
  <xsl:choose>
    <xsl:when test="normalize-space(.)">
      Sample element <xsl:value-of select="@eid"/> isn't empty.
    </xsl:when>
    <xsl:otherwise>
      Sample element <xsl:value-of select="@eid"/> is empty.
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
When the test value is true, the stylesheet adds a message to the result tree about that sample element not being empty. When run with the XML document above, it does this for sample elements E, F, and G.
Sample element A is empty.
Sample element B is empty.
Sample element C is empty.
Sample element D is empty.
Sample element E isn't empty.
Sample element F isn't empty.
Sample element G isn't empty.
When there's nothing left after normalize-space() deletes any unnecessary space, the xsl:choose instruction's xsl:otherwise element adds a message to the result tree about that sample element being empty. It does this for sample elements A, B, C, and D.
Some stylesheets use simpler syntax to check for empty elements, but it's safer to use the normalize-space() function to make sure that the odd cases are caught as well as more typical empty elements.
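To see where a simpler check can go wrong, here is a minimal sketch (not one of the column's sample files) that reports two tests for each sample element in the test document above: not(node()), which is one common shorthand for "has no child nodes," and the normalize-space() test. It assumes that the parser and processor pass the source document's whitespace-only text nodes through, which is the XSLT 1.0 default but not the default configuration of every toolchain.

```xml
<!-- Sketch only: compares two emptiness tests on each sample element. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text"/>

  <!-- Process only the sample children so that the whitespace between
       them isn't copied to the output by the built-in rules. -->
  <xsl:template match="test">
    <xsl:apply-templates select="sample"/>
  </xsl:template>

  <xsl:template match="sample">
    <xsl:value-of select="@eid"/>
    <xsl:text>: no child nodes = </xsl:text>
    <!-- True only when the element has no children of any kind. -->
    <xsl:value-of select="not(node())"/>
    <xsl:text>, nothing but whitespace = </xsl:text>
    <!-- True when the string value is empty after normalization. -->
    <xsl:value-of select="not(normalize-space(.))"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Under that assumption, the two tests should agree for every element except C and D: each of those has a whitespace-only text node as a child, so not(node()) is false for them while the normalize-space() test still reports them as empty. Note that normalize-space() only looks at text, so an element whose only content is an empty child element would also be reported as empty; if that matters for your documents, you would need to add a check such as not(*).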
(In addition to the past columns linked to this article, you can find more on the functions, axes, and XSLT instructions mentioned here in my book XSLT Quickly, from which these columns are excerpted.)