Controlling Whitespace, Part Three

January 2, 2002

Bob DuCharme

In the first and second parts of this three-part series, we looked at techniques for stripping, processing, and adding whitespace when creating a result document from a source document. This month we'll see how to add tab characters to a result document, and how to automate the indenting of a result document according to the nesting of its elements.

Adding Tabs to your Output

A stylesheet can add tabs to output using the character reference "&#9;". For example, let's say we want to convert this source document into a text file that uses tabs to line up the columns of information:

<employees>

  <employee hireDate="04/23/1999">
    <last>Hill</last>
    <first>Phil</first>
    <salary>100000</salary>
  </employee>

  <employee hireDate="09/01/1998">
    <last>Herbert</last>
    <first>Johnny</first>
    <salary>95000</salary>
  </employee>

  <employee hireDate="08/20/2000">
    <last>Hill</last>
    <first>Graham</first>
    <salary>89000</salary>
  </employee>

</employees>

Ample use of this character reference in this stylesheet

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="employees">
Last&#9;First&#9;Salary&#9;Hire Date
----&#9;-----&#9;------&#9;----------
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="employee">
  <xsl:apply-templates select="last"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:apply-templates select="first"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:apply-templates select="salary"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:apply-templates select="@hireDate"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>

produces this result from that source document:

Last    First   Salary  Hire Date
----    -----   ------  ----------
Hill    Phil    100000  04/23/1999
Herbert Johnny  95000   09/01/1998
Hill    Graham  89000   08/20/2000

When the stylesheet's first template sees an employees element, it adds a two-line header to the result tree before applying the appropriate templates to the children of the employees element: one line consisting of the field names separated by "&#9;" character references and another line with several groups of hyphens, with each group separated by the same character reference.

The only possible child of the employees element is the employee element, and its template rule individually applies templates (in this case, the default XSLT template that outputs an element's text content) to its children with the "&#9;" character reference between each one. This character reference doesn't always have to be inside an xsl:text instruction (note that it's not in the stylesheet's first template, where it sits next to other text), but if it had been added without this element in the second template, the XSLT processor would have ignored it -- remember, like carriage returns and the spacebar space, tab characters are considered whitespace, and an XSLT processor ignores whitespace characters between elements if they're the only characters there and not enclosed by an xsl:text instruction.
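To see that stripping rule in isolation, here is a minimal sketch of the same stylesheet with the xsl:text wrappers removed from around the tabs. It's an illustration rather than something you'd want to use, and it assumes a processor that follows the XSLT 1.0 rules for stripping whitespace-only text nodes from a stylesheet.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<!-- Illustration only: each &#9; below has nothing but whitespace around
     it, so the text node that holds it is whitespace-only and is stripped
     from the stylesheet before the template runs. -->
<xsl:template match="employee">
  <xsl:apply-templates select="last"/>&#9;
  <xsl:apply-templates select="first"/>&#9;
  <xsl:apply-templates select="salary"/>&#9;
  <xsl:apply-templates select="@hireDate"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
```

With the employees document above, the tabs never reach the result, and the first line of output runs together as HillPhil10000004/23/1999. Only the line break survives, because it is still wrapped in xsl:text.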

Tip Stylesheets are easier to read when elements are indented to show their levels of nesting, but when you're trying to control whitespace in the output, extraneous whitespace in your stylesheet can cause alignment problems. Thus, this section's examples are not always indented.

Defining a general entity for this "<xsl:text>&#9;</xsl:text>" string can make the stylesheet easier to read, especially if you call the entity "tab":

<!DOCTYPE stylesheet [
  <!ENTITY tab "<xsl:text>&#9;</xsl:text>">
  <!ENTITY cr "<xsl:text>
</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="employees">
Last&tab;First&tab;Salary&tab;Hire Date
----&tab;-----&tab;------&tab;----------
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="employee">
  <xsl:apply-templates select="last"/>&tab;
  <xsl:apply-templates select="first"/>&tab;
  <xsl:apply-templates select="salary"/>&tab;
  <xsl:apply-templates select="@hireDate"/>&cr;
</xsl:template>

</xsl:stylesheet>

This stylesheet has the same effect as the previous one, but it's easier to read. As long as I was defining a "tab" entity, I defined a "cr" one as well for "carriage return," which also makes the stylesheet easier to read.

See the earlier column Entities and XSLT (or my book, XSLT Quickly) for more on defining and referencing entities in XSLT and XML.

Indenting

Setting the xsl:output element's indent attribute to a value of "yes" tells the XSLT processor that it may add additional whitespace to the result tree. The default value is "no".

Warning An indent value of "yes" means that an XSLT processor may add whitespace to the result. It doesn't have to, so if setting this doesn't have the desired effect when you use it, try it with a different XSLT processor. Or check the processor's documentation -- the Xalan C++ XSLT processor, for example, indents elements zero spaces as a default, but this figure can be reset with the -INDENT command line parameter.
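Some processors also let you set the amount of indentation from inside the stylesheet, using an extension attribute on the xsl:output element. The following sketch assumes Xalan-Java, which recognizes an indent-amount attribute in its own namespace; the value of 3 is an arbitrary choice for illustration. Other processors should ignore an attribute from a namespace they don't know, so check your own processor's documentation before relying on this.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xalan="http://xml.apache.org/xalan"
     version="1.0">

<!-- xalan:indent-amount is a Xalan-Java extension attribute, not part of
     XSLT 1.0. It asks for three spaces per level of nesting. -->
<xsl:output method="xml" indent="yes" xalan:indent-amount="3"/>

<!-- Copy the whole source document to the result tree unchanged. -->
<xsl:template match="/">
  <xsl:copy-of select="."/>
</xsl:template>

</xsl:stylesheet>
```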

The following stylesheet is the identity stylesheet with the xsl:output element's indent value set to "yes". In other words, this stylesheet copies all the nodes of the source tree to the result tree without changing any of them, except that the XSLT processor may add more whitespace.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>


</xsl:stylesheet>

With an XSLT processor that does add whitespace, this stylesheet turns this source document

<chapter><title>My Chapter</title>
<para>This paragraph introduces the chapter's sections.</para>
<sect1><title>Section 1 of "My Chapter"</title>
<para>Here is the first section's first paragraph.</para>
<para>Here is the first section's second paragraph.</para>
</sect1>
<sect1><title>Section 2 of "My Chapter"</title>
<para>Here is the first section's first paragraph.</para>
<sect2><title>Section 2.2</title>
<para>This section has a subsection.</para>
</sect2>
</sect1>
</chapter>

into this:

<chapter>
   <title>My Chapter</title>
   <para>This paragraph introduces the chapter's sections.</para>
   <sect1>
      <title>Section 1 of "My Chapter"</title>
      <para>Here is the first section's first paragraph.</para>
      <para>Here is the first section's second paragraph.</para>
   </sect1>
   <sect1>
      <title>Section 2 of "My Chapter"</title>
      <para>Here is the first section's first paragraph.</para>
      <sect2>
         <title>Section 2.2</title>
         <para>This section has a subsection.</para>
      </sect2>
   </sect1>
</chapter>

The added indenting makes the parent-child and sibling relationships of the elements much clearer, because a child element's tags are indented further than a parent element's tags and siblings are all indented to the same level. When someone gives you an XML document with no DTD or schema, and you need to figure out its structure, a pass through this little stylesheet is a great first step. I use this stylesheet at least several times a week, even when I'm not engaged in XSLT-related work.
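One practical wrinkle, offered as an observation about how some processors behave rather than anything the Recommendation requires: when the source document already has line breaks and spaces between its elements, those whitespace-only text nodes are copied to the result, and a processor may then decline to add its own indenting around them. A variant of the identity stylesheet that strips them first can help. This sketch assumes the source has no mixed content in which a whitespace-only text node matters, such as a single space between two inline elements.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

<!-- Drop whitespace-only text nodes from the source tree so that any
     indenting in the result comes from the processor alone. -->
<xsl:strip-space elements="*"/>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```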

Section 16.1 of the XSLT Recommendation warns us that it's "usually not safe" to set indent to "yes" with documents that have elements that mix character data with child elements. For example, the first color child of the colors element in the following document has the string "red:" as character data followed by three shade elements that are children of that color element. The second color element only has character data content (the string "yellow"), and the third one has a structure similar to the first one.

<colors>
<color>red:
<shade>fire engine</shade>
<shade>candy apple</shade>
<shade>brick</shade>
</color>
<color>yellow</color>
<color>blue:
<shade>navy</shade>
<shade>robin's egg</shade>
<shade>cerulean</shade>
</color>
</colors>

The same stylesheet indents the elements of this document, but not the first shade element in the first and third color elements.

<colors>
   <color>red:
<shade>fire engine</shade>
      <shade>candy apple</shade>
      <shade>brick</shade>
   </color>
   <color>yellow</color>
   <color>blue:
<shade>navy</shade>
      <shade>robin's egg</shade>
      <shade>cerulean</shade>
   </color>
</colors>
    

It doesn't indent those two shade elements because that would add character data to the document. Adding whitespace between two elements (for example, between a </color> end-tag and a <color> start-tag in the example) doesn't affect a document's contents, but adding it within an element that has character data content adds text that an XML parser considers significant -- in other words, it changes the content of the document.

To summarize, an indent value of "yes" is useful if every element in your source document has either character data and no elements as content (like the shade elements above) or elements and no character data as content (like the colors element in the example); but it can lead to unpredictability if your source document has elements that mix child elements with character data like the color elements above. The spaces that indent the other shade elements are also inside of the "red" color element, but because this whitespace isn't being added to existing character data at those positions, the text nodes that they're in are pure whitespace, so the XML processor will ignore them. (It's a tricky concept; see the earlier column Controlling White Space with XSLT, Part 1 for more on this.)
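If you're not sure whether a document has any mixed content, a short stylesheet can check before you turn on indenting. The following is a minimal sketch, not part of the original column: it lists the name of every element that has both a child element and a text node child containing something other than whitespace.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">
<xsl:output method="text"/>

<!-- Select every element that has at least one child element and at
     least one text node child with non-whitespace characters. -->
<xsl:template match="/">
  <xsl:for-each select="//*[*][text()[normalize-space()]]">
    <xsl:value-of select="name()"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
```

Run against the colors document, this lists the color element twice, once for the "red:" element and once for the "blue:" one. Run against the chapter document, it lists nothing, which suggests that indent="yes" is safe to use there.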