Utility Stylesheets

April 7, 2004

I have several useful little stylesheets that I've never mentioned in this column because each is so short that describing it would make for a pretty short column. I recently realized, though, that by combining them I have enough to fill up two columns, so this month we'll look at the first few.

Despite their brevity -- counting white space and comments, the longest is 25 lines long -- they're each useful in a wide variety of situations and can be used on nearly any XML document. I say "nearly" because some are focused on XHTML, but you can easily modify them to handle DocBook documents or other document types.

Most follow a similar pattern (or, to use the appropriate buzz phrase, "design pattern"): one template rule copies everything in the source document verbatim to the result tree, and another template rule, or even another instruction, takes care of the particular problem that the stylesheet addresses. As pipelining approaches to processing XML become more popular, stylesheets like these can be useful building blocks when creating larger, more complex processes.

Stripping Empty Paragraphs

Sometimes, when using something like Perl or Python to convert a text file to XML, you have to assume that a carriage return in your text file input marks the end of a paragraph, so multiple carriage returns in a row get converted to empty paragraphs. The following stylesheet supplements the "copy everything verbatim" template rule with a template rule for p elements that copies them only if they still have content after their extraneous white space is removed.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Only copy non-empty p elements. -->
  <xsl:template match="p">
    <xsl:if test="normalize-space(.)">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

You can customize the first template rule's match condition to look for other empty elements to suppress. For example, to have it check for empty p, pre, and h4 elements, the match condition would be "p|pre|h4". If you want it to look for empty para elements in a DocBook document, set the match condition to "para".
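
As a sketch of that customization, the variation below suppresses empty p, pre, and h4 elements. It makes one addition that isn't in the stylesheet above: the test also keeps an element that has child elements, on the assumption that something like a p holding only an img shouldn't be treated as empty. (normalize-space(.) looks only at text content, so the original test would drop it.)

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Copy p, pre, and h4 elements only if they have text or child elements. -->
  <xsl:template match="p|pre|h4">
    <!-- "or *" is an added assumption: keep elements such as <p><img/></p>. -->
    <xsl:if test="normalize-space(.) or *">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Remove the "or *" part if an element that contains only empty child elements should count as empty too.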

Convert Mixed Content to Element Content

Technically, any element that has character content at all is considered to have mixed content, but in more popular usage, "mixed content" describes an element that has both character data and child elements mixed together in the same element. For example,

<doc>
  <title>In the Mix</title>
  <para>Technically, this element has mixed content.</para>
  <para>This one is <keyterm>really <emph>very</emph> mixed</keyterm>,
   as you can see. </para>
</doc>

The presence of text nodes and elements as children of the same element can present problems when processing and storing XML documents. For example, if you were storing each element of the document above in its own record in a database, the first para element would be simple enough to store, but what about the second one? Would you store the keyterm element in its own record? How could its relationship to the phrases that precede and follow it be tracked? How would you map the relationship of its two text node children and one element child to database records or objects?

The following stylesheet can help in these situations. Its second template rule, like the second template rule in the stylesheet above, copies everything not addressed by the first template rule. The first template rule looks for non-whitespace text nodes that have element nodes as siblings and wraps them in a textnode element. (You might want to name them text or PCDATA elements instead. If your source documents are XHTML, you could name the text node wrapper elements span elements and give them a class attribute set to something useful for your application, thereby making the result valid XHTML.)

When a document is parsed without checking its DTD, a template rule that wrapped a textnode element around every text node with element siblings would also wrap the carriage returns at the end of each line (for example, the "text node" between the title end-tag and the first para start-tag in the example document above). You probably don't want that, so the [normalize-space(.)] predicate in the following stylesheet's first template rule restricts the wrapping to text nodes that still have something left after extraneous white space is removed.

<!-- mixed2ec.xsl: convert mixed content to element content: wrap any non-blank 
     text nodes that have element siblings in <textnode></textnode> tags. 
-->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <xsl:template match="text()[normalize-space(.)][../*]">
    <textnode><xsl:value-of select="."/></textnode>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

(Note for XSLT geeks: you could also use the preceding-sibling and following-sibling axes to check for siblings of the text node, but [../*], which checks whether the parent has any element children, is more concise.) This stylesheet converts the sample document above to the document shown here, in which no non-blank text node has an element as a sibling:

<doc>
  <title>In the Mix</title>
  <para>Technically, this element has mixed content.</para>
  <para><textnode>This one is </textnode><keyterm><textnode>really </textnode>
  <emph>very</emph><textnode> mixed</textnode></keyterm><textnode>,
   as you can see. </textnode></para>
</doc>
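
For XHTML sources, the span variation suggested earlier might look something like the following sketch. The class value "textnode" is only a placeholder, and creating the span in the parent element's namespace is an assumption made here so that the same rule works whether or not the source declares the XHTML namespace.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <!-- Wrap non-blank text nodes that have element siblings in a span. -->
  <xsl:template match="text()[normalize-space(.)][../*]">
    <!-- Create the span in the same namespace as the parent element
         (the XHTML namespace, or no namespace at all). -->
    <xsl:element name="span" namespace="{namespace-uri(..)}">
      <!-- "textnode" is a placeholder class name. -->
      <xsl:attribute name="class">textnode</xsl:attribute>
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```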

Adding ID Values to Elements

Ask devotees of object-oriented development about the value of object identity, and then just try to shut them up. When an XML element has an attribute with a value that's guaranteed to be unique within the document, it has identity, and this brings several advantages. Like a record's key value in a database, it can provide a hook for referring to it from elsewhere, which lets you associate new data with it. If the attribute's name is "id" (which is a common convention) it makes it easier to link to that element, especially if it's within an XHTML document -- just add a pound sign and the ID value to the document's URL to link to that point in the document. Adding IDs to your elements is a simple, quick way to add value to your data.

The following XSLT stylesheet copies a source document to a result tree, taking advantage of XSLT 1.0's generate-id() function along the way to create unique IDs for every element that doesn't already have an id attribute. (Technically, there's a small chance that one of the created ones will be the same as an existing one, but it's a very small chance.) The first template rule copies all elements, adding the id value if it's not already there, and the second template rule copies all the other node types.

<!-- addids1.xsl: Add ID values to all elements that don't have them. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:template match="*">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <xsl:value-of select="generate-id()"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@*|processing-instruction()|comment()">
    <xsl:copy>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
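
The format of the strings that generate-id() returns is up to the XSLT processor; the XSLT 1.0 specification only promises an ASCII alphanumeric string that starts with a letter and differs for each node within a single transformation. If the small chance of a clash with an existing id value is a concern, one possible way to reduce it is to add a prefix that the document's own IDs are unlikely to use. The "gen-" prefix in this sketch is an arbitrary choice, not part of the original stylesheet, and it lowers the odds of a collision rather than ruling one out.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Copy each element, adding a prefixed id if it has none. -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <!-- 'gen-' is a hypothetical prefix; pick one your IDs never use. -->
          <xsl:value-of select="concat('gen-', generate-id())"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy attributes, processing instructions, and comments. -->
  <xsl:template match="@*|processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

Because a processor is under no obligation to generate the same identifiers each time a document is transformed, save the result if other data will refer to the new id values.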

When describing the mixed2ec.xsl stylesheet, I mentioned how it could make loading of narrative XML into a relational database easier. Combining that stylesheet with this one would make it even easier, because assigning an identifier to every element of a document makes the pieces easier to track if you split up and rejoin the document.
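
Running the two stylesheets one after the other is the simplest way to combine them. As an alternative, here is a rough single-pass sketch that merges their template rules. Giving each textnode wrapper an id generated from the text node it wraps is an assumption made for this sketch; in a single pass the wrappers exist only in the result tree, so the rule for elements never sees them.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- From mixed2ec.xsl, plus an id generated from the text node itself. -->
  <xsl:template match="text()[normalize-space(.)][../*]">
    <textnode id="{generate-id()}"><xsl:value-of select="."/></textnode>
  </xsl:template>

  <!-- From addids1.xsl: give every element an id if it lacks one. -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <xsl:value-of select="generate-id()"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy attributes, processing instructions, and comments as they are.
       Other text nodes are copied by XSLT's built-in template rule. -->
  <xsl:template match="@*|processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```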

The following variation on the stylesheet above is one that I use often. In theory, it's nice to have ID values on every single element, but it won't add much value to inline elements other than linking elements, which can then hold their own as part of a two-way link. Instead of adding IDs to every element, this next stylesheet adds them to a specific list of elements: my most-used HTML block elements (plus the a element), which will make them all valid link destinations. Because the HTML documents I use with this stylesheet may or may not have the XHTML namespace declared as the default namespace, the first template rule's match attribute lists each element that should get an ID added twice: once in case they're in that namespace and once if they aren't in any namespace.

<!-- addids2.xsl: Add ID values to the elements listed in the
     first xsl:template element's match attribute -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                version="1.0">

  <!-- Add id attributes to these elements -->
  <xsl:template match="p|pre|li|h1|h2|h3|h4|a|blockquote|h:p|h:pre|
                         h:li|h:h1|h:h2|h:h3|h:h4|h:a|h:blockquote">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <xsl:value-of select="generate-id()"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

If you don't want to add IDs to every element in your document, you can modify the match attribute in this stylesheet's first template rule to list whatever elements you like.
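
For example, a version aimed at DocBook documents might look like the sketch below. The list of elements is an illustrative guess at useful link destinations, and it assumes DocBook elements that are in no namespace and use an id attribute; adjust both for your own documents.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Add id attributes to these elements (an illustrative selection). -->
  <xsl:template match="para|title|section|listitem">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <xsl:value-of select="generate-id()"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```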

Next month we'll look at some stylesheets for indenting, for cleaning up potential namespace headaches, and for converting document encodings. And, if you have any short general-purpose stylesheets like these that you're interested in sharing with XML.com readers, let me know; maybe this can be a three-part series.