Entity and Character References

June 2, 2004

XSLT stylesheet developers often ask how they can leave entity references in the source document unchanged as the stylesheet passes them to the result document. For example, they want an &nbsp; entity reference in the source document to still be &nbsp; in the result document. The usual answer is that you shouldn't need to do this, because the substitution of the entity values for the entity references shouldn't make any difference.

It shouldn't, if everyone always played by the rules, but not everyone does. I have my own schema for presentation slides, and I once wrote a stylesheet to convert the XML versions of my slides into HTML that Microsoft PowerPoint would recognize and import so that I could then save the slides as a binary PPT file. PowerPoint's import didn't treat code point 160 the same way it treated the entity reference &nbsp;, so I absolutely had to have the entity reference in the HTML created by my stylesheet.

In the March 2001 Transforming XML column, I explained why source document entity references can't be preserved in the transformation: the XML parser that an XSLT processor depends on to read in the source document (and the stylesheet itself) converts any entity references to their entity values as it reads them in, before it puts the source document in the source tree where the XSLT processor can get at it. Replacing entity references with entity values is part of the XML parser's job.

That column went on to demonstrate the use of the disable-output-escaping attribute to add an entity reference to the output when you absolutely must. The use of this attribute, like the use of CDATA sections, is usually a bit kludgy. Also, the example I gave in that column showed how to add an entity reference to the result tree, but it didn't show how to convert something from the source document into an entity reference.
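For reference, the XSLT 1.0 workaround looks roughly like the following minimal sketch. The nbsp-marker element name is hypothetical, invented here only to give the template something in the source document to match. Keep in mind that processors are not required to support disable-output-escaping, so this may not behave the same everywhere.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Hypothetical source element marking where a non-breaking space belongs. -->
  <xsl:template match="nbsp-marker">
    <!-- The parser hands &amp;nbsp; to the processor as the six characters
         "&nbsp;"; disable-output-escaping asks the serializer to write them
         as they are instead of escaping the ampersand again. -->
    <xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
  </xsl:template>

  <!-- Copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

This adds the entity reference wherever the marker element occurs. It does nothing for a literal character 160 buried in text, which is the gap described above.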

Many new features in XSLT 2.0 respond to wishes expressed by XSLT developers since 1.0 was released. The wish to leave entity references alone can't be granted completely, because redefining an XML parser's responsibilities is outside of the scope of the W3C XSL Working Group's responsibilities. They have added a great new feature, though, called character maps, that makes conversion of specific source document characters to entity references (or to any strings you like) very simple.

A character map lets you tell the XSLT processor "when this character is on its way to the result tree, put this string there instead." If I'd had this when I was trying to create PPT files from the XML of my slide presentation, I could have used this to map code point 160 to the string "&nbsp;" in the HTML that I was preparing for import into PowerPoint. The following XSLT 2.0 stylesheet does this and maps three other characters to strings.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output use-character-maps="cm1"/>

  <xsl:character-map name="cm1">
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
    <xsl:output-character character="&#233;" string="&amp;#233;"/> <!-- é -->
    <xsl:output-character character="ô" string="&amp;#244;"/>
    <xsl:output-character character="&#8212;" string="--"/>
  </xsl:character-map>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The stylesheet's single template rule does a verbatim copy of the source tree to the result tree. The new XSLT 2.0 parts are the xsl:character-map element, which defines the mappings to execute, and the use-character-maps attribute of the xsl:output element.

The xsl:output element is an old friend from XSLT 1.0 that's let us perform tasks such as the setting of an output file's encoding, the inclusion or omission of the XML declaration from the result document, and the setting of a DOCTYPE declaration for the result document. The new use-character-maps attribute lets you name one or more character maps, with their names separated by spaces, to use in converting characters bound for the result tree into alternative strings.

The ability to use more than one character map lets you group mappings into modules and mix and match them for different output media. For example, imagine a stylesheet that was nothing but xsl:character-map elements, each one being a set of character mappings for a particular purpose. Other stylesheets could use xsl:include to reference that file and then name which sets of mappings they wanted to use in the xsl:output element's use-character-maps attribute. For example:

  <xsl:output use-character-maps="dashes2ASCII GermanVowels2ASCII"/>
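A sketch of how such a module might look follows. The file name charmaps.xsl and the particular mappings (including the "ae", "oe", "ue" spellings) are assumptions made for illustration; only the two map names come from the example above.

```xml
<!-- Hypothetical module file: charmaps.xsl -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:character-map name="dashes2ASCII">
    <xsl:output-character character="&#8211;" string="-"/>  <!-- en dash -->
    <xsl:output-character character="&#8212;" string="--"/> <!-- em dash -->
  </xsl:character-map>

  <xsl:character-map name="GermanVowels2ASCII">
    <xsl:output-character character="&#228;" string="ae"/> <!-- ä -->
    <xsl:output-character character="&#246;" string="oe"/> <!-- ö -->
    <xsl:output-character character="&#252;" string="ue"/> <!-- ü -->
  </xsl:character-map>

</xsl:stylesheet>
```

A stylesheet that wants both sets of mappings could then pull the module in and name the maps it needs:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Bring in the shared character map definitions. -->
  <xsl:include href="charmaps.xsl"/>

  <!-- Apply only the maps this output medium needs. -->
  <xsl:output use-character-maps="dashes2ASCII GermanVowels2ASCII"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```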

The xsl:character-map element has the name attribute to store the name used to reference it from the xsl:output element's use-character-maps attribute and a collection of xsl:output-character child elements. Each of these children has a character attribute to identify the character to map and a string attribute storing the string to map it to. Because an XML parser that sees the string "&nbsp;" in any attribute value interprets it as a non-breaking space character and not as the entity reference string for this character, this must be written as "&amp;nbsp;" in the xsl:output-character element's string attribute value to show that we want the entity reference added to the result tree.

The first two xsl:output-character elements specify their character values with a numeric character reference. Remember, just because something begins with an ampersand and ends with a semicolon, that doesn't make it an entity reference. An entity is a name that is declared and referenced, whether it's predeclared like the lt or amp entities or declared in a DTD as one might do with nbsp or ntilde. A numeric character reference is a way to indicate a character by using its code point number; no declaration is necessary, so it's not a named unit of storage, and therefore not an entity.
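To make the distinction concrete, here is a small sketch of a document that writes the same character three ways. The element names are made up for the example, and the ntilde declaration is written out in an internal DTD subset so that the document stands on its own.

```xml
<!DOCTYPE demo [
  <!-- An entity must be declared before it can be referenced. -->
  <!ENTITY ntilde "&#241;">
]>
<demo>
  <!-- Entity reference: legal only because of the declaration above. -->
  <entity-ref>ma&ntilde;ana</entity-ref>
  <!-- Decimal numeric character reference: no declaration needed. -->
  <decimal-ref>ma&#241;ana</decimal-ref>
  <!-- Hexadecimal numeric character reference for the same code point. -->
  <hex-ref>ma&#xF1;ana</hex-ref>
</demo>
```

After parsing, all three elements contain the identical string "mañana", so an XSLT processor has no way to tell which notation the source used.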

For the third xsl:output-character element, instead of a numeric character reference, I just entered the character "ô" in there to show which character I wanted mapped. The fourth xsl:output-character element maps the character for the punctuation mark known as the "em dash"—a punctuation mark that I probably use too much—to a pair of hyphens, which is the traditional way to represent an em dash when all you have are 7-bit ASCII characters.

Of the four strings that this character map can add to the result tree, only &nbsp; is an entity reference. If the document created from the result tree is XML, a parser that reads that document will choke on "&nbsp;" if it never saw a declaration for it, so the stylesheet should include an xsl:output top-level element with a doctype-system attribute so that the result document points to a DTD with the appropriate declaration. If the result is HTML, though, this isn't necessary, because all web browsers understand the &nbsp; entity reference.
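For XML output, the xsl:output element in the stylesheet above might be extended along these lines. The DTD file name slides.dtd is a placeholder chosen for this sketch, not something from my actual setup.

```xml
<!-- Adds a DOCTYPE declaration pointing at the DTD to the result document,
     in addition to applying the cm1 character map. -->
<xsl:output method="xml"
            doctype-system="slides.dtd"
            use-character-maps="cm1"/>
```

The DTD named there would then need to declare the entity, for example:

```xml
<!-- In the hypothetical slides.dtd: give nbsp its usual meaning. -->
<!ENTITY nbsp "&#160;">
```

With the sample output shown later in this column, the serializer would write a declaration such as <!DOCTYPE doc SYSTEM "slides.dtd"> ahead of the doc element, so a parser that reads the external DTD can resolve the reference.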

The second xsl:output-character element almost looks like it's mapping "&#233;" to itself. It isn't, though; the XML parser that hands this stylesheet to the XSLT processor will turn the character attribute value into the character itself as it puts it on the source tree, and then the XSLT processor, as it reads the source tree and creates data for the result tree, will convert that character to the string specified in that xsl:output-character element's string attribute.

As a test, I ran the stylesheet on this source document:

<doc word="côté">
  <p>côté is the French word for "side."</p>
  <p>A non-breaking space (&#160;) presents special problems&#8212;and
requires special handling.</p>
</doc>

I often use the word "côté" as an example for English-speaking people who don't take accented characters seriously enough, because the presence or absence of each of these accents can turn it into three different words. Using Saxon 7, the stylesheet turns the document into this:

<?xml version="1.0" encoding="UTF-8"?><doc word="c&#244;t&#233;">
  <p>c&#244;t&#233; is the French word for "side."</p>
  <p>A non-breaking space (&nbsp;) presents special problems--and 
requires special handling.</p>
</doc>

The two vowels in "côté" have been converted to numeric character references, both in the doc element's word attribute and in the content of the first p element. The em dash was converted to two hyphens, and the non-breaking space was converted to an entity reference for it.

Why would you need XSLT 2.0's character mapping feature? The most common problem it solves is the mangling of non-ASCII characters by processes that can't handle the encoding of a file that they're reading. The PowerPoint import described above is one instance. Confusion of some processes between the UTF-8 encoding, in which characters such as French vowels with accents are represented with two bytes, and Latin-1, which represents them with one byte, has often led to two strange bytes showing up in my browser window where I expected to see one accented vowel. (One more example as I finish up the last draft of this column: I pulled up the XHTML that I wrote into Microsoft Word to see if its spell checker would catch anything that I missed, and despite the XML declaration at the top of this file indicating that it's in UTF-8, Word thinks it's Latin-1, and shows the accented characters as two garbage bytes each and the em dashes as garbage as well.) Mapping these characters to their numeric character references, in which any Unicode code point can be represented using 7-bit ASCII (an ampersand followed by a pound sign, the code point number for the character, and a semicolon), can help to maintain the integrity of the representation all the way to the delivery application.
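As a point of comparison, and not something the stylesheet above does: when the goal is for every non-ASCII character to become a numeric character reference, declaring a 7-bit output encoding is another option, because the serializer has to write any character that the encoding can't represent as a character reference. A minimal sketch:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- With a 7-bit encoding, characters outside ASCII in text and attribute
       values are written as numeric character references. -->
  <xsl:output method="xml" encoding="US-ASCII"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Whether the references come out in decimal or hexadecimal is up to the processor. This approach also can't produce a named entity reference such as &nbsp; or an arbitrary replacement string such as two hyphens, which is where a character map is still needed.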

A straight mapping of a character to a 7-bit ASCII representation, as we saw with the em dash example, can also provide a compromise between a typographically slick representation of something and the possibility that garbage character(s) will appear in its place. Ultimately, XSLT 2.0 character mapping gives us more control over how our characters look and get represented, and it will be very handy.