Entity and Character References
June 2, 2004
XSLT stylesheet developers often ask how they can leave entity references in the
source document unchanged as the stylesheet passes them to the result document. For
example, they want an &nbsp; entity reference in the source document to still be &nbsp;
in the result document. The usual answer is that you shouldn't
need to do this, because the substitution of the entity values for the entity references
shouldn't make any difference.
It shouldn't, if everyone always played by the rules, but not everyone does. I
have my own schema for presentation slides, and I once wrote a stylesheet to convert
versions of my slides into HTML that Microsoft PowerPoint would recognize and import
so that I could then save the slides as a binary PPT file. PowerPoint's import didn't treat
code point 160 the same way it treated the entity reference &nbsp;, so I
absolutely had to have the entity reference in the HTML created by my stylesheet.
In the March 2001 Transforming XML column, I explained why source document entity references can't be preserved in the transformation: the XML parser that an XSLT processor depends on to read in the source document (and the stylesheet itself) converts any entity references to their entity values as it reads them in, before it puts the source document in the source tree where the XSLT processor can get at it. Replacing entity references with entity values is part of the XML parser's job.
That column went on to demonstrate the use of the
disable-output-escaping attribute to add an entity reference to the output
when you absolutely must. The use of this attribute, like the use of CDATA sections, is
usually a bit kludgy. Also, the example I gave in that column showed how to add an entity
reference to the result tree, but it didn't show how to convert something from the source
document into an entity reference.
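For readers who haven't seen that technique, here is a minimal XSLT 1.0 sketch of it. It is not the code from the earlier column, and the spacer element is a hypothetical source element invented for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="html"/>

  <!-- "spacer" is a hypothetical element name, assumed for this sketch. -->
  <xsl:template match="spacer">
    <!-- The parser reads &amp;nbsp; as the six characters of the entity
         reference. With output escaping disabled, the serializer writes
         them out as they are instead of escaping the ampersand again. -->
    <xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
  </xsl:template>

  <!-- Copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Support for disable-output-escaping is optional for XSLT processors, and it only has an effect when the processor serializes the result itself, which is part of why the technique feels kludgy.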
Many new features in XSLT 2.0 respond to wishes expressed by XSLT developers since 1.0 was released. The wish to leave entity references alone can't be granted completely, because redefining an XML parser's responsibilities is outside of the scope of the W3C XSL Working Group's responsibilities. They have added a great new feature, though, called character maps, that makes conversion of specific source document characters to entity references (or to any strings you like) very simple.
A character map lets you tell the XSLT processor "when this character is on its way to the result tree, put this string there instead." If I'd had this when I was trying to create PPT files from the XML of my slide presentation, I could have used this to map code point 160 to the string "&nbsp;" in the HTML that I was preparing for import into PowerPoint. The following XSLT 2.0 stylesheet does this and maps three other characters to strings.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output use-character-maps="cm1"/>

  <xsl:character-map name="cm1">
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
    <xsl:output-character character="&#233;" string="&amp;#233;"/> <!-- é -->
    <xsl:output-character character="ô" string="&amp;#244;"/>
    <xsl:output-character character="—" string="--"/>
  </xsl:character-map>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
The stylesheet's single template rule does a verbatim copy of the source tree to
the result tree. The new XSLT 2.0 parts are the xsl:character-map element,
which defines the mappings to execute, and the
use-character-maps attribute of the xsl:output element. The
xsl:output element is an old friend from XSLT 1.0 that's let us
perform tasks such as the setting of an output file's encoding, the inclusion or omission
of the XML declaration from the result document, and the setting of a DOCTYPE declaration
for the result document. The new
use-character-maps attribute lets you name one or
more character maps, with their names separated by spaces, to use in converting characters
bound for the result tree into alternative strings.
The ability to use more than one character map lets you group mappings into
modules and mix and match them for different output media. For example, imagine a
stylesheet file that was nothing but
xsl:character-map elements, each one being a set of
character mappings for a particular purpose. Other stylesheets could use
xsl:include to reference that file and then name which sets of mappings they
wanted to use in the use-character-maps
attribute. For example:
<xsl:output use-character-maps="dashes2ASCII GermanVowels2ASCII"/>
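To make that arrangement concrete, here is a rough sketch of what such a pair of files might look like. The map names come from the example above; the file name charmaps.xsl and the individual mappings are assumptions made up for illustration. First, the module that holds nothing but character maps:

```xml
<!-- charmaps.xsl: a hypothetical file name. The mappings are illustrative. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:character-map name="dashes2ASCII">
    <xsl:output-character character="&#8212;" string="--"/> <!-- em dash -->
    <xsl:output-character character="&#8211;" string="-"/>  <!-- en dash -->
  </xsl:character-map>

  <xsl:character-map name="GermanVowels2ASCII">
    <xsl:output-character character="&#228;" string="ae"/> <!-- a umlaut -->
    <xsl:output-character character="&#246;" string="oe"/> <!-- o umlaut -->
    <xsl:output-character character="&#252;" string="ue"/> <!-- u umlaut -->
  </xsl:character-map>

</xsl:stylesheet>
```

Then a stylesheet that includes the module and picks the maps it wants:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Pull in the character map definitions from the module above. -->
  <xsl:include href="charmaps.xsl"/>

  <!-- Choose which of the included maps apply to this output. -->
  <xsl:output use-character-maps="dashes2ASCII GermanVowels2ASCII"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

A stylesheet aimed at a different output medium could include the same module and name only dashes2ASCII, or no maps at all.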
The xsl:character-map element has the name attribute
to store the name used to reference it from the
use-character-maps attribute and a collection of
xsl:output-character child elements. Each of these children has a
character attribute to identify the character to map and a
string attribute storing the string to map it to. Because an XML parser that
sees the string "&nbsp;" in any attribute value interprets it as a non-breaking space
character and not as the entity reference string for this character, this must be written as
"&amp;nbsp;" in the string
attribute value to show that we want the entity reference added to the result tree.
The first two
xsl:output-character elements specify their
character values with a numeric character
reference. Remember, just because something begins with an ampersand and ends with a
semicolon, that doesn't make it an entity reference. An entity is a name that is declared
and referenced, whether it's predeclared like XML's built-in &lt; and &amp;
entities or declared in a DTD as one might do with &nbsp;.
A numeric character reference is a way to indicate a character by using its code point
number; no declaration is necessary, so it's not a named unit of storage, and therefore
not an entity.
For the third
xsl:output-character element, instead of a numeric
character reference, I just entered the character "ô" in there to show which character
I wanted mapped. The fourth
xsl:output-character element maps the character for
the punctuation mark known as the "em dash"—a punctuation mark that I probably use
too much—to a pair of hyphens, which is the traditional way to represent an em dash when
all you have are 7-bit ASCII characters.
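The difference between an entity reference and a numeric character reference is easier to see side by side. The small document below is a sketch written for this purpose, not something from the stylesheet above; its element names are arbitrary.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!-- A declared entity: the name "nbsp" stands for the character at
       code point 160. Without this declaration, &nbsp; is an error. -->
  <!ENTITY nbsp "&#160;">
]>
<doc>
  <!-- Predeclared entities: no declaration needed in any XML document. -->
  <p>1 &lt; 2 &amp; 3 &gt; 2</p>

  <!-- Entity reference: only legal because of the declaration above. -->
  <p>100&nbsp;km</p>

  <!-- Numeric character references (decimal and hexadecimal) for the
       same character: no declaration needed. -->
  <p>100&#160;km and 100&#xA0;km</p>
</doc>
```

After parsing, the last two paragraphs contain exactly the same non-breaking space character, which is why an XSLT processor can't tell which form the source document used.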
Of the four strings that this character map can add to the result tree, only &nbsp;
is an entity reference. If the document created from the result tree is XML, any XML
parser that reads that document will choke on "&nbsp;" if it never saw a declaration
for it, so the stylesheet should include an
xsl:output top-level element with a
doctype-system attribute so that the result document points to a DTD with the
appropriate declaration. If the result is HTML, though, this isn't necessary, because
web browsers understand the entity reference.
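As a rough sketch of that advice, the stylesheet below adds doctype-system to the xsl:output element alongside the cm1 character map. The DTD file name slides.dtd is a placeholder invented for this example; any DTD that declares the nbsp entity would serve.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- "slides.dtd" is a hypothetical file name. It is assumed to contain
       at least this declaration:
         <!ENTITY nbsp "&#160;">
       so that a parser reading the result document can resolve &nbsp;. -->
  <xsl:output method="xml"
              doctype-system="slides.dtd"
              use-character-maps="cm1"/>

  <xsl:character-map name="cm1">
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
  </xsl:character-map>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The result document then begins with a DOCTYPE declaration whose system identifier is slides.dtd. This only helps if the parser that later reads the result actually loads the external DTD, which non-validating parsers are not required to do.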
The second
xsl:output-character element almost looks like it's
mapping "&#233;" to itself. It isn't, though; the XML parser that hands this stylesheet
to the XSLT processor will turn the
character attribute value into the
character itself as it puts it on the source tree, and then the XSLT processor, as
it reads the source tree and creates data for the result tree, will convert that character
to the string specified in that
string attribute.
As a test, I ran the stylesheet on this source document:
<doc word="côté">
  <p>côté is the French word for "side."</p>
  <p>A non-breaking space (&#160;) presents special problems—and requires special handling.</p>
</doc>
I often use the word "côté" as an example for English-speaking people who don't take accented characters seriously enough, because the presence or absence of each of these accents can turn it into three different words. Using Saxon 7, the stylesheet turns the document into this:
<?xml version="1.0" encoding="UTF-8"?><doc word="c&#244;t&#233;">
  <p>c&#244;t&#233; is the French word for "side."</p>
  <p>A non-breaking space (&nbsp;) presents special problems--and requires special handling.</p>
</doc>
The two vowels in "côté" have been converted to numeric character
references, both in the
word attribute and in the
content of the first
p element. The em dash was converted to two hyphens, and
the non-breaking space was converted to an entity reference for it.
Why would you need XSLT 2.0's character mapping feature? The most common problem it solves is the mangling of non-ASCII characters by processes that can't handle the encoding of a file that they're reading. The PowerPoint example above is one example. Confusion of some processes between the UTF-8 encoding, in which characters such as French vowels with accents are represented with two bytes, and Latin-1, which represents them with one byte, has often led to two strange bytes showing up in my browser window where I expected to see one accented vowel. (One more example as I finish up the last draft of this column: I pulled up the XHTML that I wrote into Microsoft Word to see if its spell checker would catch anything that I missed, and despite the XML declaration at the top of this file indicating that it's in UTF-8, Word thinks it's Latin 1, and shows the foreign characters and em dashes as two garbage bytes each.) Mapping these characters to their numeric character references, in which any Unicode code point can be represented using 7-bit ASCII (an ampersand followed by a pound sign, the code point number for the character, and a semicolon) can help to maintain the integrity of the representation all the way to the delivery application.
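When the only goal is 7-bit-safe output and you don't care which string replaces each character, a related approach needs no character map at all. This is a sketch of an alternative rather than the technique this column describes: if the output encoding can't represent a character, the serializer writes that character in text and attribute values as a numeric character reference.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- With an ASCII output encoding, characters outside ASCII in text and
       attribute values are serialized as numeric character references.
       Processors are only required to support UTF-8 and UTF-16, so
       US-ASCII support is an assumption about the processor in use. -->
  <xsl:output method="xml" encoding="US-ASCII"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off is control: the serializer decides whether to use decimal or hexadecimal references, and it can't produce a named reference such as &nbsp; or a substitute like two hyphens. Character maps remain the tool for those cases.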
A straight mapping of a character to a 7-bit ASCII representation, as we saw with the em dash example, can also provide a compromise between a typographically slick representation of something and the possibility that garbage character(s) will appear in its place. Ultimately, XSLT 2.0 character mapping gives us more control over how our characters look and get represented, and it will be very handy.