XSLT as Pretty Printer

November 29, 2006

Hew Wolff

Introduction

Recently I was wading through some hard-to-read XML files. Art & Logic, the company I work for, was helping a client to build an Ajax-style Web interface that used XML to talk to the backend and client-side XSLT to produce the HTML. I found myself reformatting the XML by hand to make things easier and finally wondering as I hit the spacebar yet again: couldn't an XSLT style sheet do this formatting for me? I had done something similar before, so I decided to try writing that style sheet, using a test-driven approach. Some hours later I had a handy utility, and a new appreciation for some of the wrinkles of XML. Here's a cleaned-up account of what I did.

So what will the test be? Well, since XSLT is itself a dialect of XML, the style sheet (call it indent.xsl) will be an XML document. Why not just use the code for its own test? If I make sure my code looks good as I write it, then indent.xsl should transform itself to itself. So I write a shell script like this:

# Use my local XSLT processor...

~/runXslt indent.xsl indent.xsl out.xml

diff indent.xsl out.xml

First Steps

Inspired by Extreme Programming, I start with The Stupidest Thing That Could Possibly Work: an empty style sheet with a hopeful comment. I specify that the output is generic XML, and include the usual XSLT namespace so that the XSLT processor knows that xsl:... elements are XSLT instructions and not just data. I'll stick with XSLT version 1.0 since that has solid support (such as the Saxon library that I'm using).


<!--

Used for formatting XML into a reasonable style.

-->

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>

</xsl:stylesheet>

The output from the diff test is

1,6c1,2

< <!--

< Used for formatting XML into a reasonable style.

< -->

< <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>

< </xsl:stylesheet>

---

> 

>

Well darn, that didn't work. It produces no output at all. Looking at the test results, the first problem is that it ignores the comment. Inserting a simple template should take care of that: when it sees a comment, it should just copy it through.


   <xsl:template match="comment()">

      <xsl:comment>

         <xsl:value-of select="."/>

      </xsl:comment>

   </xsl:template>
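The nearly empty output from the first attempt is what XSLT 1.0's built-in template rules would predict. They apply whenever no template in the style sheet matches a node. Written out explicitly (ignoring modes), they are equivalent to this sketch; the wrapper is only there to make it a complete style sheet.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Elements and the root: produce nothing, just recurse into the children. -->
   <xsl:template match="*|/">
      <xsl:apply-templates/>
   </xsl:template>

   <!-- Text and attribute nodes: copy the string value. -->
   <xsl:template match="text()|@*">
      <xsl:value-of select="."/>
   </xsl:template>

   <!-- Comments and processing instructions: produce nothing. -->
   <xsl:template match="processing-instruction()|comment()"/>
</xsl:stylesheet>
```

With no templates of its own, a style sheet therefore passes through only text nodes. In indent.xsl those are just whitespace between the tags, which would explain the nearly blank lines in the diff.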

OK, but now it doesn't produce the style sheet element. How about another template to copy each input element to an output element with the same name?


   <xsl:template match="*">

      <xsl:element name="{name(.)}"/>

   </xsl:template>

The output looks like this:

<!--

Used for formatting XML into a reasonable style.

--><xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>

...

Getting better. But I also need to start a new line between the comment and the element.

   <xsl:template match="*">

      <xsl:text>&#xA;</xsl:text>

      <xsl:element name="{name(.)}"/>

And I want the element's attributes too, so add them to the element by hand.




      <xsl:element name="{name(.)}">

         <xsl:for-each select="@*">

            <xsl:attribute name="{name(.)}"><xsl:value-of select="."/></xsl:attribute>

         </xsl:for-each>

      </xsl:element>

Actually, as you can see above, I already got one attribute for free, namely the xmlns:xsl attribute for the XSLT namespace. But this is not a normal attribute. It's there because of the XSLT/XPath data model, in which the tree structure of an XML document contains not only the familiar hierarchy of elements, attributes, and text, but also namespace nodes. The namespace nodes attached to an element tell an XML application how to interpret the names inside that element. When you create an output element in your style sheet, XSLT basically copies the namespace nodes from the style sheet into the result, so that's where that free attribute came from.
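To see the namespace nodes directly, a small diagnostic style sheet can walk the namespace axis. This is a sketch, not part of indent.xsl. It assumes nothing beyond XPath 1.0, where name() of a namespace node returns its prefix and its string value is the namespace URI.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="text"/>

   <!-- For each element, print its name and its in-scope namespace nodes. -->
   <xsl:template match="*">
      <xsl:value-of select="name()"/>
      <xsl:text>&#xA;</xsl:text>
      <xsl:for-each select="namespace::*">
         <!-- name() of a namespace node is its prefix; its string value is the URI. -->
         <xsl:value-of select="concat('   ', name(), ' = ', ., '&#xA;')"/>
      </xsl:for-each>
      <xsl:apply-templates select="*"/>
   </xsl:template>
</xsl:stylesheet>
```

Run on a style sheet, this should list the xsl prefix under every element, not just the one that declares it, along with the implicit xml prefix that every element carries. The order of the namespace nodes is up to the processor.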

Annoyingly, the XSLT processor really wants to put the version attribute after xmlns:xsl, but I think they look nicer the other way around. I might be able to fix that, but studying the spec shows that I can't expect to preserve attribute order in general: XSLT makes no guarantees about the relative order of attributes in an element. A style sheet, in general, does not even know the order of the attributes in the input document. XML doesn't care, but diff does. So I'll just accept this as a limitation of my test, and in my code I'll follow the order preferred by my XSLT processor.


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

The next problem revealed by the diff is that the style sheet element's children are missing in the output, so I should be applying the element template recursively.

         <xsl:for-each select="@*">

            <xsl:attribute name="{name(.)}"><xsl:value-of select="."/></xsl:attribute>

         </xsl:for-each>

         <xsl:apply-templates/>

This weird testing process is beginning to work: the test shows that the matching output is creeping forward slowly, although there's a long way to go.

Indentation

The test now points out that the first child element needs indenting. Probably that means that each element has an associated depth, and the element template gets this depth as a parameter. I'll indent with three spaces.

   <xsl:template match="*">

      <xsl:param name="depth">0</xsl:param>

      <!-- New line with indenting. -->

      <xsl:if test="$depth > 0">

         <xsl:text>   </xsl:text>

      </xsl:if>

      <xsl:text>&#xA;</xsl:text>

      <xsl:element name="{name(.)}">

         <xsl:for-each select="@*">

            <xsl:attribute name="{name(.)}"><xsl:value-of select="."/></xsl:attribute>

         </xsl:for-each>



         <xsl:apply-templates>

            <xsl:with-param name="depth" select="$depth + 1"/>

         </xsl:apply-templates>

      </xsl:element>

   </xsl:template>

Hmm, indenting isn't happening. Add some debugging code.

      <!-- New line with indenting. -->

<xsl:value-of select="concat('depth: ', $depth)"/>

      <xsl:if test="$depth > 0">

Oh, right, I have to indent after the new line.

      <!-- New line with indenting. -->

      <xsl:text>&#xA;</xsl:text>

      <xsl:if test="$depth > 0">

         <xsl:text>   </xsl:text>

      </xsl:if>

OK, the first child tag is indented now, but there's a blank line separating it from its parent, which looks bad. I'll start by gaining as much control of the whitespace as possible. That means no automatic indentation by the XSLT processor, and whitespace preserved in my xsl:text elements but nowhere else.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

   <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>

   <xsl:strip-space elements="*"/>

   <xsl:preserve-space elements="xsl:text"/>
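One way to check what these two declarations do is to count the text nodes the processor still sees in the input. The following diagnostic is a sketch under that assumption and is separate from indent.xsl. The two declarations are the ones above; everything else is illustrative.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="text"/>
   <xsl:strip-space elements="*"/>
   <xsl:preserve-space elements="xsl:text"/>

   <!-- Report how many text nodes survive whitespace stripping. -->
   <xsl:template match="/">
      <xsl:value-of select="concat('text nodes: ', count(//text()), '&#xA;')"/>
      <xsl:value-of select="concat('whitespace-only: ', count(//text()[normalize-space(.) = '']), '&#xA;')"/>
   </xsl:template>
</xsl:stylesheet>
```

Run against a style sheet, the whitespace-only count should cover just the contents of xsl:text elements, since the more specific xsl:text name test takes precedence over the * wildcard. Deleting the strip-space line should make both numbers jump.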

Now I need to figure out exactly when I want a blank line above an element. This will turn out to be the trickiest part of the whole operation: trying to capture my personal intuition about what spacing looks good. A reasonable rule for now is to insert a blank line before an element that has children, whenever there's something else above it.

      <xsl:text>&#xA;</xsl:text>

      <xsl:if test="position() > 1 and count(./*) > 0">

         <xsl:value-of select="'&#xA;'"/>

      </xsl:if>

The test now says:

13,16c12,13

<       <xsl:comment>

<          <xsl:value-of select="."/>

<       </xsl:comment>

<    </xsl:template>

---

>    <xsl:comment>

>    <xsl:value-of select="."/></xsl:comment></xsl:template>

That reminds me that deeply nested elements have to be indented more. That suggests an indentation template, which would also take a depth parameter. Since XSLT 1.0 has no counting loop (xsl:for-each only walks a node-set), I use recursion instead.

      <!-- Set off a large element with a blank line. -->

      <xsl:if test="position() > 1 and count(./*) > 0">

         <xsl:text>&#xA;</xsl:text>

      </xsl:if>

      <xsl:call-template name="indent">

         <xsl:with-param name="depth" select="$depth"/>

      </xsl:call-template>

      ...



   <xsl:template name="indent">

      <xsl:param name="depth"/>



      <xsl:if test="$depth > 0">

         <xsl:text>   </xsl:text>

         <xsl:call-template name="indent">

            <xsl:with-param name="depth" select="$depth - 1"/>

         </xsl:call-template>

      </xsl:if>

   </xsl:template>
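A common non-recursive alternative in XSLT 1.0 is to cut the padding out of a fixed string of spaces with substring(). The sketch below is not what indent.xsl does. The names spaces and indentFlat are hypothetical, and the indentation silently stops growing once the depth exceeds a third of the string's length.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Hypothetical padding string; its length caps the deepest indentation. -->
   <xsl:variable name="spaces" select="'                                                            '"/>

   <!-- Emit three spaces per level without recursion. -->
   <xsl:template name="indentFlat">
      <xsl:param name="depth" select="0"/>
      <xsl:value-of select="substring($spaces, 1, $depth * 3)"/>
   </xsl:template>
</xsl:stylesheet>
```

The recursive template has no such cap, which is one reason to prefer it in a general-purpose formatter.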

Closing tags require a newline and indentation too. But only when there are child elements: a simple one-line element, maybe with some text in it, looks OK.

      <xsl:element name="{name(.)}">

         <xsl:for-each select="@*">

            <xsl:attribute name="{name(.)}"><xsl:value-of select="."/></xsl:attribute>

         </xsl:for-each>



         <xsl:apply-templates>

            <xsl:with-param name="depth" select="$depth + 1"/>

         </xsl:apply-templates>



         <xsl:if test="count(./*) > 0">

            <xsl:text>&#xA;</xsl:text>

            <xsl:call-template name="indent">

               <xsl:with-param name="depth" select="$depth"/>

            </xsl:call-template>

         </xsl:if>

      </xsl:element>

Much better.

Nailing Down the Test

I'll summarize the remaining steps more briefly. For reference, you can skip ahead to the complete code at the end.

There's a gratifying amount of refactoring to be done in the handling of elements and comments. First of all, comments need to be indented pretty much the same as elements. Since they share a lot of code, I made them separate cases in the same template, using xsl:choose. Then a colleague pointed out that, rather than explicitly instantiating the output node using xsl:element or xsl:comment, it's simpler to use xsl:copy. This feature creates a copy of the current input node, and also (unlike xsl:copy-of) lets me add whitespace children in the output. Also, it's not necessary to iterate explicitly through an element's attributes when the expression @* gives me all of them at once. This leads to the nice code below.


   <xsl:template match="*|comment()">

      <xsl:param name="depth">0</xsl:param>

      ...

      <xsl:copy>

         <xsl:if test="self::*">

            <xsl:copy-of select="@*"/>



            <xsl:apply-templates>

               <xsl:with-param name="depth" select="$depth + 1"/>

            </xsl:apply-templates>

            ...

         </xsl:if>

      </xsl:copy>

      ...

   </xsl:template>
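The difference between xsl:copy and xsl:copy-of is easy to see side by side. This standalone sketch is not part of indent.xsl. The demo wrapper element and the mode names shallow and deep are made up for the illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
   <xsl:strip-space elements="*"/>

   <!-- Hypothetical wrapper element so both variants appear in one result. -->
   <xsl:template match="/">
      <demo>
         <xsl:apply-templates select="*" mode="shallow"/>
         <xsl:apply-templates select="*" mode="deep"/>
      </demo>
   </xsl:template>

   <!-- xsl:copy is shallow: the template rebuilds the content and can add to it. -->
   <xsl:template match="*" mode="shallow">
      <xsl:text>&#xA;</xsl:text>
      <xsl:copy>
         <xsl:copy-of select="@*"/>
         <xsl:apply-templates select="*" mode="shallow"/>
      </xsl:copy>
   </xsl:template>

   <!-- xsl:copy-of is deep: the whole subtree is copied exactly as it stands. -->
   <xsl:template match="*" mode="deep">
      <xsl:text>&#xA;</xsl:text>
      <xsl:copy-of select="."/>
   </xsl:template>
</xsl:stylesheet>
```

In the shallow variant every nested element starts on its own line, because the template runs once per element. In the deep variant only the outermost element does, since the nested elements are copied without any template getting a chance to insert whitespace.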

By the way, the first time I tried this, I couldn't get it working because I left out the self:: axis prefix. There's a parallel between the template match pattern and the test expression, but the parallel is deceptive. In the second case a context node has already been established, and the default axis is child::. So * means "all children of the current node that are elements," but I want self::*, which means "the current node if it's an element."
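The difference between the two tests shows up clearly in a small diagnostic, given here as a sketch that is separate from indent.xsl. It relies only on the XPath 1.0 rule that a node-set converts to true exactly when it is non-empty.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="text"/>
   <xsl:strip-space elements="*"/>

   <xsl:template match="*|comment()">
      <xsl:choose>
         <xsl:when test="self::*">
            <xsl:value-of select="concat(name(), ': ')"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:text>(comment): </xsl:text>
         </xsl:otherwise>
      </xsl:choose>
      <!-- "*" asks about element children; "self::*" asks about the node itself. -->
      <xsl:value-of select="concat('* = ', boolean(*), ', self::* = ', boolean(self::*), '&#xA;')"/>
      <xsl:apply-templates select="*|comment()"/>
   </xsl:template>
</xsl:stylesheet>
```

An element with no child elements reports false for * but true for self::*. So with the bare * test, leaf elements would be copied without their attributes or text.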

I kept getting tripped up by further ambiguities like the attribute order mentioned above. For example, the processor keeps taking my &#xA; (the character reference for a new line character) and converting it into a literal new line. This is correct XML, but it messes up the formatting. It turns out that this is another case where the output text is not guaranteed: the processor may escape characters if it wants to. XSLT does provide a mechanism to control output escaping in some cases, so I added a template to restore the new line character references in text nodes.


   <xsl:template match="text()">

      <xsl:call-template name="escapeNewlines">

         <xsl:with-param name="text">

            <xsl:value-of select="."/>

         </xsl:with-param>

      </xsl:call-template>

   </xsl:template>

   ...

   <xsl:template name="escapeNewlines">

      ...

   </xsl:template>

Similarly, I would like to use literal > characters in my XPath expressions (a literal < is not allowed inside an attribute value in any case), but the processor prefers to escape them as &gt;, and it has the right to. Here, for the sake of the test, I just followed the processor's preference:


      <xsl:if test="$depth &gt; 0">

The XSLT processor would also be within its rights to insert extra whitespace between attributes of an element. Fortunately, the default behavior, inserting just one space, is also what I want to enforce.

Another approach to these escaping problems would be to tell the processor that I'm writing plain text instead of XML. This would allow finer control, at the cost of more complex code: I would write the output character by character, rather than describing it as a tree of XML nodes. This works, but I decided it wasn't worth the complexity.
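To give a feel for the plain-text approach, here is a deliberately incomplete sketch. It is not the code that was tried for this article. It writes the tags as ordinary characters, does no indentation, does not re-escape special characters in text or attribute values, and loses namespace declarations, because those are not attribute nodes.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="text" encoding="UTF-8"/>
   <xsl:strip-space elements="*"/>

   <!-- Write each element's markup as plain characters. -->
   <xsl:template match="*">
      <xsl:value-of select="concat('&lt;', name())"/>
      <xsl:for-each select="@*">
         <xsl:value-of select="concat(' ', name(), '=&quot;', ., '&quot;')"/>
      </xsl:for-each>
      <xsl:choose>
         <xsl:when test="node()">
            <xsl:text>&gt;</xsl:text>
            <xsl:apply-templates/>
            <xsl:value-of select="concat('&lt;/', name(), '&gt;')"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:text>/&gt;</xsl:text>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>

   <!-- Comments have to be spelled out by hand as well. -->
   <xsl:template match="comment()">
      <xsl:value-of select="concat('&lt;!--', ., '--&gt;')"/>
   </xsl:template>
</xsl:stylesheet>
```

Closing those gaps means adding escaping templates and namespace handling, which is the extra complexity referred to above.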

The test has one last nitpick: the last line in the file should be terminated with a new line.


      <xsl:variable name="isLastNode" select="count(../..) = 0 and position() = last()"/>



      <xsl:if test="$isLastNode">

         <xsl:text>&#xA;</xsl:text>

      </xsl:if>
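The expression count(../..) = 0 is terse, so here is a diagnostic sketch, separate from indent.xsl, that prints what it evaluates to for each node. The variable name isTopLevel is made up for the illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="text"/>
   <xsl:strip-space elements="*"/>

   <xsl:template match="*|comment()">
      <!-- A top-level node has the root as its parent, and the root has no parent. -->
      <xsl:variable name="isTopLevel" select="count(../..) = 0"/>
      <xsl:variable name="isLastNode" select="$isTopLevel and position() = last()"/>
      <xsl:value-of select="concat(name(), ' top-level=', $isTopLevel, ' last=', $isLastNode, '&#xA;')"/>
      <xsl:apply-templates select="*|comment()"/>
   </xsl:template>
</xsl:stylesheet>
```

Note that position() and last() are relative to the node list chosen by the xsl:apply-templates that invoked the template, so the result depends on whitespace-only text nodes having been stripped from the input.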

Tweaks

Finally the test passes! Am I done? Well, the code looks nice, and also provides a good test, but running it on some other random XML files suggests a few more tweaks. For example, I usually write a comment to describe the code that follows it, so it should have a blank line above but not below. However, people also use a comment to mark the end of a section, or to hide some content temporarily. A good compromise seems to be to insert a blank line whenever a comment is next to an element (either above or below it).

Also, I often omit the XML declaration at the top of a document, out of laziness. But when I saw a non-ASCII character showing up in the sample XML I chose, I realized that the output had no indication of the character encoding, which is impolite. So I put the XML declaration back in, and added a blank line after that as well.

Here's the blank-line policy I finally came up with, with intermediate variables added for documentation.


      <!--

      Set off from the element above if one of the two has children.

      Also, set off a comment from an element.

      And set off from the XML declaration if necessary.

      -->



      <xsl:variable name="isFirstNode" select="count(../..) = 0 and position() = 1"/>

      <xsl:variable name="previous" select="preceding-sibling::node()[1]"/>

      <xsl:variable name="adjacentComplexElement" select="count($previous/*) &gt; 0 or count(*) &gt; 0"/>

      <xsl:variable name="adjacentDifferentType" select="not(($previous/self::comment() and self::comment()) or ($previous/self::* and self::*))"/>



      <xsl:if test="$isFirstNode or ($previous and ($adjacentComplexElement or $adjacentDifferentType))">

         <xsl:text>&#xA;</xsl:text>

      </xsl:if>

Conclusion

I did a little more polishing and documentation and then decided it was a good place to stop. As the artists say, the work is never finished, only abandoned. A number of limitations are clear; the complete code follows them.

  • There are certainly more sophisticated tools for cleaning up XML, such as XMLStarlet. If you just want to view the XML you can probably open it in your browser. You should also consider a full XML editor like XMLSpy.

  • If you're reformatting the XML for later use, it's important that the application is not sensitive to white space next to tags. Most likely you're producing HTML, and a browser generally doesn't care about this.

  • The particular style produced by this style sheet may not be to your taste. But it will be more readable than an average HTML document, and that's what makes it handy.


<?xml version="1.0" encoding="UTF-8"?>



<!--

Converts XML into a nice readable format.

Tested with Saxon 6.5.3.

As a test, this stylesheet should not change when run on itself.

But note that there are no guarantees about attribute order within an

element (see http://www.w3.org/TR/xpath#dt-document-order), or about

which characters are escaped (see

http://www.w3.org/TR/xslt#disable-output-escaping).

I did not test processing instructions, CDATA sections, or

namespaces.



Hew Wolff

Senior Engineer

Art & Logic, Inc.

www.artlogic.com

-->



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

   <!-- Take control of the whitespace. -->



   <xsl:output method="xml" indent="no" encoding="UTF-8"/>

   <xsl:strip-space elements="*"/>

   <xsl:preserve-space elements="xsl:text"/>



   <!-- Copy comments, and elements recursively. -->



   <xsl:template match="*|comment()">

      <xsl:param name="depth">0</xsl:param>



      <!--

      Set off from the element above if one of the two has children.

      Also, set off a comment from an element.

      And set off from the XML declaration if necessary.

      -->



      <xsl:variable name="isFirstNode" select="count(../..) = 0 and position() = 1"/>

      <xsl:variable name="previous" select="preceding-sibling::node()[1]"/>

      <xsl:variable name="adjacentComplexElement" select="count($previous/*) &gt; 0 or count(*) &gt; 0"/>

      <xsl:variable name="adjacentDifferentType" select="not(($previous/self::comment() and self::comment()) or ($previous/self::* and self::*))"/>



      <xsl:if test="$isFirstNode or ($previous and ($adjacentComplexElement or $adjacentDifferentType))">

         <xsl:text>&#xA;</xsl:text>

      </xsl:if>



      <!-- Start a new line. -->



      <xsl:text>&#xA;</xsl:text>



      <xsl:call-template name="indent">

         <xsl:with-param name="depth" select="$depth"/>

      </xsl:call-template>



      <xsl:copy>

         <xsl:if test="self::*">

            <xsl:copy-of select="@*"/>



            <xsl:apply-templates>

               <xsl:with-param name="depth" select="$depth + 1"/>

            </xsl:apply-templates>



            <xsl:if test="count(*) &gt; 0">

               <xsl:text>&#xA;</xsl:text>



               <xsl:call-template name="indent">

                  <xsl:with-param name="depth" select="$depth"/>

               </xsl:call-template>

            </xsl:if>

         </xsl:if>

      </xsl:copy>



      <xsl:variable name="isLastNode" select="count(../..) = 0 and position() = last()"/>



      <xsl:if test="$isLastNode">

         <xsl:text>&#xA;</xsl:text>

      </xsl:if>

   </xsl:template>



   <xsl:template name="indent">

      <xsl:param name="depth"/>



      <xsl:if test="$depth &gt; 0">

         <xsl:text>   </xsl:text>



         <xsl:call-template name="indent">

            <xsl:with-param name="depth" select="$depth - 1"/>

         </xsl:call-template>

      </xsl:if>

   </xsl:template>



   <!-- Escape newlines within text nodes, for readability. -->



   <xsl:template match="text()">

      <xsl:call-template name="escapeNewlines">

         <xsl:with-param name="text">

            <xsl:value-of select="."/>

         </xsl:with-param>

      </xsl:call-template>

   </xsl:template>



   <xsl:template name="escapeNewlines">

      <xsl:param name="text"/>



      <xsl:if test="string-length($text) &gt; 0">

         <xsl:choose>

            <xsl:when test="substring($text, 1, 1) = '&#xA;'">

               <xsl:text disable-output-escaping="yes">&amp;#xA;</xsl:text>

            </xsl:when>



            <xsl:otherwise>

               <xsl:value-of select="substring($text, 1, 1)"/>

            </xsl:otherwise>

         </xsl:choose>



         <xsl:call-template name="escapeNewlines">

            <xsl:with-param name="text" select="substring($text, 2)"/>

         </xsl:call-template>

      </xsl:if>

   </xsl:template>

</xsl:stylesheet>