Seeking Equality

June 8, 2005

Bob DuCharme

In an earlier column titled Duplicate and Empty Elements, I wrote about how the notion of "duplicate" elements isn't as simple as it sounds, because XPath 1.0 (and hence your XSLT style sheets) considers two elements to be equal if their string values are the same. The string value is essentially all of the PCDATA between the element's start and end tags, even if the element has descendant elements. For example, an XSLT processor considers the w and z elements in the following to be equal, because they both have a string value of "abcdefghi":

<a>
  <w>abc<y color="blue">def</y>ghi</w>
  <z flavor="chocolate">abcdefghi</z>
</a>

The two elements have plenty of differences: they have different names; w has its "def" in a child element that z doesn't have; and they have different attributes in different places. Still, they have the same PCDATA: "abcdefghi". The following style sheet confirms that an XSLT 1.0 processor considers the w and z elements above to be equal:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

    <xsl:template match="a">

      <xsl:choose>
        <xsl:when test="w = z">w = z: true
        </xsl:when>
        <xsl:otherwise>w = z: false
        </xsl:otherwise>
      </xsl:choose>

    </xsl:template>

</xsl:stylesheet>
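
To see the string values that this comparison relies on, a minimal sketch along the following lines prints the result of the string() function for each element. It assumes the same input document as above; the square brackets are only delimiters added here to make the output easier to read.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

    <xsl:output method="text"/>

    <xsl:template match="a">
      <!-- string() concatenates all descendant text nodes,
           so the y element's tags inside w play no part -->
      <xsl:text>w: [</xsl:text>
      <xsl:value-of select="string(w)"/>
      <xsl:text>]&#10;z: [</xsl:text>
      <xsl:value-of select="string(z)"/>
      <xsl:text>]&#10;</xsl:text>
    </xsl:template>

</xsl:stylesheet>
```

Both lines of output should show abcdefghi between the brackets, which is all the = operator looks at.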

The = operator is even more interesting when comparing node sets. According to the XPath 1.0 Recommendation's Booleans section, "If both objects to be compared are node-sets, then the comparison will be true if and only if there is a node in the first node-set and a node in the second node-set such that the result of performing the comparison on the string-values of the two nodes is true." Let's compare the d children of the following document's b element with the d children of the document's c element:

<a>
  <b>
    <d>red</d>
    <d>green</d>
    <d>blue</d>
  </b>
  <c>
    <d>yellow</d>
    <d>orange</d>
    <d>green</d>
  </c>
</a>

We can use the same style sheet as above, replacing the xsl:choose statement with this:

    <xsl:choose>
      <xsl:when test="b/d = c/d">b/d = c/d: true
      </xsl:when>
      <xsl:otherwise>b/d = c/d: false
      </xsl:otherwise>
    </xsl:choose>

The result shows that the processor considers the two node sets to be "equal": a node in the first node set and a node in the second node set have the same string value ("green"), even though the other d elements are completely different:

b/d = c/d: true
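
One consequence of this "there is a node" definition deserves a quick sketch. The same rule applies to the != operator, so b/d != c/d is true whenever any pair of nodes has different string values, and both comparisons can be true for the same two node sets. The style sheet below assumes the same input document; it also shows not(b/d = c/d), which is the test to reach for when you want to know that no pair matches.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

    <xsl:output method="text"/>

    <xsl:template match="a">
      <!-- True: "green" appears in both node sets -->
      <xsl:text>b/d = c/d: </xsl:text>
      <xsl:value-of select="b/d = c/d"/>

      <!-- Also true: "red" and "yellow", for one, are different -->
      <xsl:text>&#10;b/d != c/d: </xsl:text>
      <xsl:value-of select="b/d != c/d"/>

      <!-- False: at least one pair has matching string values -->
      <xsl:text>&#10;not(b/d = c/d): </xsl:text>
      <xsl:value-of select="not(b/d = c/d)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:template>

</xsl:stylesheet>
```

The output should read true, true, and false, in that order.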

XSLT 2.0 and Element Equality

XSLT 2.0 can take advantage of several new options that XPath 2.0 provides for comparing elements. These include the value comparison operators eq, ne, lt, le, gt, and ge, which are "used for comparing single values." They work for both element and attribute values. Replacing the = operator in the previous example with eq (and setting the xsl:stylesheet element's version attribute to "2.0") causes an error (from Saxon: "A sequence of more than one item is not allowed as the first operand of 'eq'"), because the XSLT processor is trying to compare one multi-item sequence with another using an operator that's designed to compare single values.
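
The eq operator is perfectly happy with the same document once each operand is narrowed to a single item. The following sketch assumes the b/d and c/d input document shown earlier; the positional predicates pick out the two "green" elements, and the "some ... satisfies" expression spells out roughly what = was doing implicitly for whole sequences.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="2.0">

    <xsl:output method="text"/>

    <xsl:template match="a">
      <!-- Each operand selects exactly one d element, so eq is allowed -->
      <xsl:text>b/d[2] eq c/d[3]: </xsl:text>
      <xsl:value-of select="b/d[2] eq c/d[3]"/>
      <xsl:text>&#10;</xsl:text>

      <!-- Test every pairing one value at a time -->
      <xsl:text>some pair eq: </xsl:text>
      <xsl:value-of
          select="some $x in b/d, $y in c/d satisfies $x eq $y"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:template>

</xsl:stylesheet>
```

Both tests should report true, because the second d child of b and the third d child of c each contain "green".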

Replacing the = in our first example with eq and changing the xsl:stylesheet element's version attribute to make it an XSLT 2.0 style sheet wouldn't change its result, but it's worth looking more closely at how the 2.0 version gets its result. An XSLT 2.0 processor atomizes the operands before comparing them—that is, it converts them to a sequence of atomic values—which is why the y element tags around the "def" string in the middle of the w element are ignored.
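
Here is a small sketch of that atomization step, assuming the w and z input document from the beginning of this column and a processor running without a schema. The standard data() function returns the atomized value of its argument, which for an unvalidated element is built from its string value.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="2.0">

    <xsl:output method="text"/>

    <xsl:template match="a">
      <!-- data() atomizes w; the y element's markup disappears -->
      <xsl:text>data(w): </xsl:text>
      <xsl:value-of select="data(w)"/>
      <xsl:text>&#10;</xsl:text>

      <!-- eq atomizes both operands before comparing them -->
      <xsl:text>w eq z: </xsl:text>
      <xsl:value-of select="w eq z"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:template>

</xsl:stylesheet>
```

The first line should show abcdefghi and the second should show true.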

What if, in comparing two elements, you want to check that all the details—the name, the attributes and their values, and the subelements—are really the same? My earlier article on finding duplicates describes ways to be more explicit in checking whether the element name and attributes are equal, but XPath 2.0 makes it much simpler, adding an evocative new adjective to our XML vocabulary: deep-equal.

The term won't be completely new to Java developers and other users of object-oriented languages. In XPath, two deep-equal elements have the same XPath tree representing them. (The "XQuery 1.0 and XPath 2.0 Functions and Operators" working draft has a more technical definition.) When you pass two elements to the deep-equal function, it returns a Boolean true if the two elements are deeply equal and false otherwise.

Let's try out this function. The a element in the following document has some child elements to compare.

<a>

  <b color="red">a test</b>

  <c flavor="mint">a test</c>

  <d>
    <z color="blue" flavor="vanilla">a test</z>
  </d>

  <e>
    <z color="blue" flavor="vanilla">a test</z>
  </e>

  <f>
    <z flavor="vanilla" color="blue">a test</z>
  </f>

  <g>
    <z flavor="vanilla" color="bluish">a test</z>
  </g>

  <h>
    <n><j id="i1">Joe</j><k date="2005-05-22"/><l>a test</l></n>
  </h>

  <i>
    <n><j id="i1">Joe</j><k date="2005-05-22"/><l>a test</l></n>
  </i>
 
  <m>
    <n><j id="i1">Joe</j><k date="2005-05-22"/>
<l>a test</l></n>
  </m>
 
</a>

The following XSLT 2.0 style sheet has a single template rule, which compares various children and grandchildren of the a element when the XSLT processor finds it. The discussion that follows the sample output refers to each comparison using the numbered string that introduces each one:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="2.0">

    <xsl:template match="a">

      <xsl:text>1. Different element names, same PCDATA: </xsl:text>
      <xsl:choose>
        <xsl:when test="b = c">b = c: true
        </xsl:when>
        <xsl:otherwise>b = c: false
        </xsl:otherwise>
      </xsl:choose>

      <xsl:text>2. Different element names, same PCDATA: </xsl:text>
      <xsl:choose>
        <xsl:when test="deep-equal(b,c)">deep-equal(b,c): true
        </xsl:when>
        <xsl:otherwise>deep-equal(b,c): false
        </xsl:otherwise>
      </xsl:choose>

      <xsl:text>3. Elements identical: </xsl:text>
      <xsl:choose>
        <xsl:when test="deep-equal(d/z,e/z)">deep-equal(d/z,e/z): true
        </xsl:when>
        <xsl:otherwise>deep-equal(d/z,e/z): false
        </xsl:otherwise>
      </xsl:choose>

      <xsl:text>4. Attributes in different orders: </xsl:text>
      <xsl:choose>
        <xsl:when test="deep-equal(e/z,f/z)">deep-equal(e/z,f/z): true
        </xsl:when>
        <xsl:otherwise>deep-equal(e/z,f/z): false
        </xsl:otherwise>
      </xsl:choose>

      <xsl:text>5. One attribute value different: </xsl:text>
      <xsl:choose>
        <xsl:when test="deep-equal(f/z,g/z)">deep-equal(f/z,g/z): true
        </xsl:when>
        <xsl:otherwise>deep-equal(f/z,g/z): false
        </xsl:otherwise>
      </xsl:choose>

      <xsl:text>6. Comparing sequences of nodes: </xsl:text>
      <xsl:choose>
        <xsl:when test="deep-equal(h/n,i/n)">deep-equal(h/n,i/n): true
        </xsl:when>
        <xsl:otherwise>deep-equal(h/n,i/n): false
        </xsl:otherwise>
      </xsl:choose>

      <xsl:text>7. Comparing sequences of nodes with an extra carriage return: </xsl:text>
      <xsl:choose>
        <xsl:when test="deep-equal(i/n,m/n)">deep-equal(i/n,m/n): true
        </xsl:when>
        <xsl:otherwise>deep-equal(i/n,m/n): false
        </xsl:otherwise>
      </xsl:choose>

    </xsl:template>

</xsl:stylesheet>

Before we analyze what the style sheet does, let's look at the result of running it on the document shown above it (with the XML declaration and some white space removed):

1. Different element names, same PCDATA: b = c: true
2. Different element names, same PCDATA: deep-equal(b,c): false
3. Elements identical: deep-equal(d/z,e/z): true
4. Attributes in different orders: deep-equal(e/z,f/z): true
5. One attribute value different: deep-equal(f/z,g/z): false
6. Comparing sequences of nodes: deep-equal(h/n,i/n): true
7. Comparing sequences of nodes with an extra carriage return: 
   deep-equal(i/n,m/n): false

Test No. 1 in the template rule is a simpler recap of the one I showed at the beginning of this column, showing that the XSLT processor considers the two elements to be equal if their PCDATA string values are the same, despite their different element names and attribute lists. Test No. 2, which compares the same elements using XPath 2.0's pickier deep-equal function, finds that they're not deeply equal because of the different element names and attributes.

The third test in the template rule compares the z child of the d element with the z child of the e element and finds them to be identical. The f element's z child looks like the d and e elements' z children, but its attributes are in a different order; the deep-equal function in the template rule's fourth test still finds the e element's z child and the f element's z child to be equal, because attribute order is not significant in XML and deep-equal compares attributes without regard to their order.

A slight difference in a single attribute value (compare the color value "bluish" on the g element's z child with the color value "blue" on the f element's z child) is enough for the deep-equal function in the fifth test to say that the two z elements are different.

    

The deep-equal function also handles elements with more complex structures, treating them as equal only when their entire subtrees match. The template rule's sixth test shows that the h element's n child and the i element's n child are deeply equal. A single carriage return between subelements, though, is enough for deep-equal to give a thumbs-down when comparing the i element's n child with the m element's n child, as the seventh test demonstrates: the line break becomes a white-space-only text node inside the m element's n child, so that element has a child node that its counterpart lacks.
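
If that kind of insignificant white space shouldn't count as a difference, one option is to have the processor strip white-space-only text nodes from the source tree before making the comparison. The sketch below assumes the same input document and uses the standard xsl:strip-space element, limited here to n elements. Whether stripping is appropriate depends on your data, so treat this as one possible approach, not a general fix.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="2.0">

    <xsl:output method="text"/>

    <!-- Drop white-space-only text nodes that are children of
         n elements when the source tree is built -->
    <xsl:strip-space elements="n"/>

    <xsl:template match="a">
      <xsl:text>deep-equal(i/n,m/n): </xsl:text>
      <xsl:value-of select="deep-equal(i/n,m/n)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:template>

</xsl:stylesheet>
```

With the stray text node gone, the two n elements have the same children in the same order, so this comparison should now report true.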

How Much Equality Do You Need?

By adding the value comparison operators and the deep-equal function to the options for checking element equality, XPath 2.0 gives XSLT developers a wider choice when they need to evaluate the potential similarity of two elements. I mentioned, but didn't demonstrate, that the value comparison operators include equivalents of the !=, <, <=, >, and >= operators that we've had since XSLT 1.0: ne, lt, le, gt, and ge. Those give you even more options for determining the relationship between two nodes in an XML document or temporary tree.
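
As a rough sketch of a few of those remaining operators, the following style sheet assumes the deep-equal test document shown earlier. The first two tests compare attribute values as they come; the third uses the xs:date constructor function to compare a date attribute as a date instead of as a string. The xs prefix and the comparison date of 2005-01-01 are choices made for this illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     version="2.0">

    <xsl:output method="text"/>

    <xsl:template match="a">
      <!-- "blue" and "bluish" are different values -->
      <xsl:text>d/z/@color ne g/z/@color: </xsl:text>
      <xsl:value-of select="d/z/@color ne g/z/@color"/>
      <xsl:text>&#10;</xsl:text>

      <!-- The two date attributes hold the same value -->
      <xsl:text>h/n/k/@date le i/n/k/@date: </xsl:text>
      <xsl:value-of select="h/n/k/@date le i/n/k/@date"/>
      <xsl:text>&#10;</xsl:text>

      <!-- Cast to xs:date so the comparison is chronological -->
      <xsl:text>date ge 2005-01-01: </xsl:text>
      <xsl:value-of
          select="xs:date(h/n/k/@date) ge xs:date('2005-01-01')"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:template>

</xsl:stylesheet>
```

All three tests should report true. Like eq, each of these operators raises an error if either operand is a sequence of more than one item.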