Utility Stylesheets, Part Two

May 5, 2004

Last month we looked at some short utility stylesheets, each dedicated to a specific task that may be necessary with a wide variety of XML documents: stripping empty paragraphs, converting mixed content to element content, and adding ID values to elements. Stylesheets like these can serve as building blocks in the creation of a large, complex workflow composed of pipelined modular processes. This week, we'll look at several more such stylesheets.

Strip the Namespaces from a Document

XML namespaces play an important role in XML applications; they help to keep track of which elements and attributes come from where, but to be honest, they're such a pain sometimes. The following stylesheet copies all source tree nodes to the result tree, and it uses XPath 1.0's local-name() function to make sure that the elements and attributes on the result tree have no namespace prefix. (It must be useful -- when I suggested last month that readers send in their own short utility stylesheets, one sent me his own version of this stylesheet without knowing that I had planned to include one just like it.)

<!-- Copy document, stripping namespaces, i.e. for elements 
     and attributes only copy the local part of their names. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

  <!-- Comments and processing instructions have no children,
       so an empty xsl:copy is enough to reproduce them. -->
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>

In XSLT, xsl:copy elements and literal result elements are popular ways to add elements and attributes to result trees, but this stylesheet demonstrates a key advantage of using xsl:element and xsl:attribute elements instead: they offer more control over the names of those elements and attributes. The name attributes in these elements call the local-name() function to convert the original names to ones with no namespace prefixes; using other function calls (or combinations of functions) can let you be even more creative in how you name your result elements and attributes.
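As a rough sketch of that last point, the variation below combines local-name() with XPath 1.0's translate() function so that the result names are lowercased as well as stripped of their prefixes. The variable names "upper" and "lower" are made up for this illustration, and the mapping only covers the unaccented ASCII letters.

```xml
<!-- Sketch: strip namespace prefixes AND lowercase element and
     attribute names. "upper" and "lower" are illustrative names. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <xsl:template match="*">
    <!-- translate() maps each uppercase letter to its lowercase twin. -->
    <xsl:element name="{translate(local-name(), $upper, $lower)}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="@*">
    <xsl:attribute name="{translate(local-name(), $upper, $lower)}">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

Given a source element such as <Order ID="7">, this should produce an order element with an id attribute. One caveat applies to any renaming of this kind: if two attributes on the same element end up with the same result name (for example, a:id and b:id both becoming id), the later one replaces the earlier one on the result tree.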

Converting Attribute Value Qnames to URIs

The use of qualified names (names that include a namespace prefix) in attribute values is generally considered a Bad Idea in XML design. After all, a namespace prefix is only standing in for the full URI of the namespace it represents, and while XML parsers track the prefix/URI relationship for a document's element and attribute names, they don't do this for attribute values. See Kendall Clark's February 2002 XML Deviant column for a fuller discussion, which points out that XSLT 1.0 itself uses qnames in attribute values. For example, if you declare that xmlns:h="http://www.w3.org/1999/xhtml", you can then set your xsl:template element's match attribute to "h:h1" or "h:p" to define a template rule for h1 or p elements from the http://www.w3.org/1999/xhtml namespace.

When I read in a W3C IRC log that "XSLT 1.0 can't deal well with qnames," however, I took it as a challenge -- it can't deal well with qnames if you don't use the (little-used) namespace:: axis. With a bit of help from David Carlisle, I came up with a stylesheet that converts a namespace prefix in an attribute value to the corresponding URI:

<!-- qname2uri.xsl: convert namespace prefixes in attribute values to
                    their associated URIs.  -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:template match="@*[contains(.,':')]">
    <!-- For any attributes that have a colon in their value... -->

    <xsl:variable name="nsprefix">
      <xsl:value-of select="substring-before(.,':')"/>
    </xsl:variable>

    <xsl:variable name="nsURI">
      <!-- URI that the prefix maps to: namespace node of parent
           whose name() = the namespace prefix. -->
      <xsl:variable name="nsNode" select=
           "parent::*/namespace::*[name() = $nsprefix]"/>
      <xsl:choose>
        <xsl:when test="$nsNode">
          <xsl:value-of select="$nsNode"/>
        </xsl:when>
        <xsl:otherwise>
          <!-- Uncomment the following xsl:text element  
               to flag prefixes that weren't declared. -->
          <!-- <xsl:text>NO-URI-DECLARED-FOR-PREFIX:</xsl:text>-->
          <xsl:value-of select="$nsprefix"/>
          <xsl:text>:</xsl:text>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <!-- Add attribute to result tree, substituting URI for prefix. -->
    <xsl:attribute name="{name()}">
      <xsl:value-of select="$nsURI"/>
      <xsl:value-of select="substring-after(.,':')"/>
    </xsl:attribute>

  </xsl:template>

  <!-- Copy anything not covered by that first template rule. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The stylesheet has two template rules: the first handles attributes with colons in their values, and the second copies any other source tree node to the result tree unchanged. The first defines two variables to make its logic more modular: the "nsprefix" variable stores the namespace prefix, and the "nsURI" variable stores the URI that corresponds to that namespace prefix. If the source document declares no URI for that prefix in the scope of the attribute's parent element, "nsURI" just stores the prefix and its colon, so the attribute value comes out unchanged; uncommenting the xsl:text element with the value of "NO-URI-DECLARED-FOR-PREFIX:" adds that string to flag the lack of a properly declared URI for that prefix. You can easily change that to a proper URI or to any string you want.
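Because the first template rule matches any attribute value containing a colon, values that aren't qnames at all (a time such as "12:30", say) get the same treatment. If you know which attributes hold qnames, a narrower match pattern avoids that. The following is only a sketch of one way to do it: the attribute names "type" and "ref" are hypothetical stand-ins for whatever your own vocabulary uses, and the lookup logic is condensed from the stylesheet above.

```xml
<!-- Sketch: convert qname prefixes only in selected attributes.
     The names "type" and "ref" are hypothetical examples. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:template match="@type[contains(.,':')] | @ref[contains(.,':')]">
    <xsl:variable name="nsprefix" select="substring-before(.,':')"/>
    <!-- Namespace node on the parent element whose name is the prefix. -->
    <xsl:variable name="nsNode"
         select="parent::*/namespace::*[name() = $nsprefix]"/>
    <xsl:attribute name="{name()}">
      <xsl:choose>
        <xsl:when test="$nsNode">
          <!-- Prefix is declared: substitute its URI. -->
          <xsl:value-of select="concat($nsNode, substring-after(.,':'))"/>
        </xsl:when>
        <xsl:otherwise>
          <!-- Prefix is not declared: leave the value alone. -->
          <xsl:value-of select="."/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:attribute>
  </xsl:template>

  <!-- Copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

All other attributes, with or without colons in their values, fall through to the copying rule.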

To test this stylesheet, I used the following document as a source document:

<a xmlns:sn="http://www.snee.com/ns/whatever#">
<b>this is a test</b>
<b attr1="sn:blah">Second b element.</b>
<b attr1="xx:blah">Third b element.</b> <!-- No declaration for xx. -->
<c xmlns:sn="http://www.example.com/">  <!-- Redeclared prefix. -->
  <d color="red" direction="north">     <!-- No colons in these values. -->
  <x attr2="sn:foo">nested namespace</x>
  </d>
</c>
</a>

The three commented lines attempt to trip up a conversion program that doesn't handle the URI-prefix mapping properly. Although it's not a very extensive test, it shows that the stylesheet works pretty well, creating this result from it:

<?xml version="1.0" encoding="utf-8"?><a xmlns:sn="http://www.snee.com/ns/whatever#">
<b>this is a test</b>
<b attr1="http://www.snee.com/ns/whatever#blah">Second b element.</b>
<b attr1="xx:blah">Third b element.</b> <!-- No declaration for xx. -->
<c xmlns:sn="http://www.example.com/"> <!-- Redeclared prefix. -->
  <d color="red" direction="north">    <!-- No colons in these values. -->
  <x attr2="http://www.example.com/foo">nested namespace</x>
  </d>
</c>
</a>

The prefix in the second b element's attribute value was mapped to the snee.com URI, and the one in the third b element's attribute value was left alone because it had no corresponding URI. The d element's attribute values were left alone, and the prefix in the x element's attribute value, which was the same as the one on the second b element, was mapped to a different URI: the one that the "sn" prefix was mapped to in the c element that contains the d element. This shows that the scoping of the declarations was respected.
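The scoping comes for free because every element in the XPath 1.0 data model carries its own namespace node for each declaration in scope, whether the declaration appeared on that element or on an ancestor. If you want to see those nodes for yourself, a small diagnostic stylesheet along these lines (a sketch, not part of the conversion itself; the variable name "elName" is made up) lists them as plain text:

```xml
<!-- Sketch: list the in-scope namespace nodes of every element. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text"/>

  <xsl:template match="*">
    <xsl:variable name="elName" select="name()"/>
    <!-- One output line per namespace node: element, prefix, URI. -->
    <xsl:for-each select="namespace::*">
      <xsl:value-of
           select="concat($elName, ': ', name(), ' = ', ., '&#10;')"/>
    </xsl:for-each>
    <!-- Recurse into child elements only, skipping text nodes. -->
    <xsl:apply-templates select="*"/>
  </xsl:template>

</xsl:stylesheet>
```

Run against the test document, this should report the snee.com URI for "sn" on the a and b elements and the example.com URI on the c, d, and x elements. Expect an extra entry for the built-in "xml" prefix on every element, and don't rely on the order of the entries, which is up to the processor.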

Converting a Document's Encoding

There are several utilities available that can convert a file's encoding, but if you need to convert the encoding of an XML document, an XSLT processor and an eight-line stylesheet (OK, a little longer with blank lines added for readability) are all you need.

The following stylesheet has only one template rule: the same one we've seen in most of the utility stylesheets, which copies everything passed to it verbatim.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output encoding="utf-16"/>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

The stylesheet also has an xsl:output element. This element has many useful attributes, and the encoding one is particularly valuable: tell it what encoding to use when writing the result document, and your stylesheet is ready to convert some documents. If your XSLT processor can't handle the output encoding you've asked for, it should either tell you or fall back to UTF-8 or UTF-16, so check the encoding declaration of the result if you're trying an unusual one.
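Converting to a different encoding means changing only that one attribute value. As a minimal sketch, assuming your processor supports ISO-8859-1 for output, the same stylesheet retargeted at that encoding looks like this:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<!-- Write the result as ISO-8859-1 (Latin-1) instead of UTF-16. -->
<xsl:output method="xml" encoding="iso-8859-1"/>

<!-- Copy everything to the result tree unchanged. -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

With a narrower target encoding like this one, characters in text and attribute values that the encoding can't represent should be written as numeric character references, so no data is lost. Note that in XSLT 1.0 the encoding attribute is a fixed string rather than an attribute value template, so you can't compute it from a stylesheet parameter; you need one stylesheet per target encoding.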

The choice of encodings that your XSLT processor can read and write isn't entirely up to the processor. The XML parser that it uses determines which encodings it can read, and for a Java-based XSLT processor, the JVM in use may limit the number of supported output encodings. Check your processor's documentation -- for example, the "Character encodings supported" section of Saxon 6.5.3's Standards Conformance page lists four input encodings recognized by the built-in AElfred parser that it uses by default, and nine encodings that it supports for output, if your JVM supports them.