
Entity and Character References
XSLT stylesheet developers often ask how they can leave
entity references in the source document unchanged as the stylesheet
passes them to the result document. For example, they want
an &nbsp; entity reference in the source document to
still be in the result document. The usual answer
is that you shouldn't need to do this, because the substitution of the
entity values for the entity references shouldn't make any
difference.
It shouldn't, if everyone always played by the rules,
but not everyone does. I have my own schema for presentation slides,
and I once wrote a stylesheet to convert the XML versions of my slides
into HTML that Microsoft PowerPoint would recognize and import so that
I could then save the slides as a binary PPT file. PowerPoint's import
didn't treat code point 160 the same way it treated the entity
reference &nbsp;, so I absolutely had to have the entity
reference in the HTML created by my stylesheet.
In the March 2001 Transforming XML column, I explained why source document entity references can't be preserved in the transformation: the XML parser that an XSLT processor depends on to read in the source document (and the stylesheet itself) converts any entity references to their entity values as it reads them in, before it puts the source document in the source tree where the XSLT processor can get at it. Replacing entity references with entity values is part of the XML parser's job.
That column went on to demonstrate the use of
the disable-output-escaping attribute to add an entity
reference to the output when you absolutely must. The use of this
attribute, like the use of CDATA sections, is usually a bit
kludgy. Also, the example I gave in that column showed how to add an
entity reference to the result tree, but it didn't show how to convert
something from the source document into an entity reference.
Many new features in XSLT 2.0 respond to wishes expressed by XSLT developers since 1.0 was released. The wish to leave entity references alone can't be granted completely, because redefining an XML parser's responsibilities is outside of the scope of the W3C XSL Working Group's responsibilities. They have added a great new feature, though, called character maps, that makes conversion of specific source document characters to entity references (or to any strings you like) very simple.
A character map lets you tell the XSLT processor "when this character is on its way to the result tree, put this string there instead." If I'd had this when I was trying to create PPT files from the XML of my slide presentation, I could have used it to map code point 160 to the string "&nbsp;" in the HTML that I was preparing for import into PowerPoint. The following XSLT 2.0 stylesheet does this and maps three other characters to strings.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output use-character-maps="cm1"/>

  <xsl:character-map name="cm1">
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
    <xsl:output-character character="&#233;" string="&amp;#233;"/> <!-- é -->
    <xsl:output-character character="ô" string="&amp;#244;"/>
    <xsl:output-character character="—" string="--"/>
  </xsl:character-map>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
The stylesheet's single template rule does a verbatim
copy of the source tree to the result tree. The new XSLT 2.0 parts are
the xsl:character-map element, which defines the mappings to
execute, and the use-character-maps attribute of
the xsl:output element.
The xsl:output element is an old friend from
XSLT 1.0 that's let us perform tasks such as the setting of an output
file's encoding, the inclusion or omission of the XML declaration from
the result document, and the setting of a DOCTYPE declaration for the
result document. The new use-character-maps attribute lets
you name one or more character maps, with their names separated by
spaces, to use in converting characters bound for the result tree into
alternative strings.
The ability to use more than one character map lets you
group mappings into modules and mix and match them for different
output media. For example, imagine a stylesheet that was nothing
but xsl:character-map elements, each one being a set of
character mappings for a particular purpose. Other stylesheets could
use xsl:include to reference that file and then name which
sets of mappings they wanted to use in the xsl:output
element's use-character-maps attribute. For example:
<xsl:output use-character-maps="dashes2ASCII GermanVowels2ASCII"/>
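As a rough sketch of that modular arrangement, the shared file might look something like the following. The file name charmaps.xsl and the specific mappings inside each map are assumptions made up for illustration; only the two map names come from the example above.

```xml
<!-- charmaps.xsl (hypothetical file name): a module holding only character maps -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Typographic dashes to plain ASCII hyphens (illustrative mappings) -->
  <xsl:character-map name="dashes2ASCII">
    <xsl:output-character character="&#8212;" string="--"/> <!-- em dash -->
    <xsl:output-character character="&#8211;" string="-"/>  <!-- en dash -->
  </xsl:character-map>

  <!-- German umlauted vowels to two-letter ASCII spellings (illustrative mappings) -->
  <xsl:character-map name="GermanVowels2ASCII">
    <xsl:output-character character="&#228;" string="ae"/> <!-- ä -->
    <xsl:output-character character="&#246;" string="oe"/> <!-- ö -->
    <xsl:output-character character="&#252;" string="ue"/> <!-- ü -->
  </xsl:character-map>

</xsl:stylesheet>
```

A stylesheet that wants both sets of mappings could then include the module and name the maps it needs, assuming charmaps.xsl sits in the same directory:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Pull in the shared character map definitions -->
  <xsl:include href="charmaps.xsl"/>

  <!-- Choose which of the included maps apply to this stylesheet's output -->
  <xsl:output use-character-maps="dashes2ASCII GermanVowels2ASCII"/>

  <!-- Identity transformation: copy everything from source to result -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

A stylesheet aimed at a different output medium could include the same file and list only dashes2ASCII, leaving the German vowels untouched.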
The xsl:character-map element has
the name attribute to store the name used to reference it
from the xsl:output element's use-character-maps
attribute and a collection of xsl:output-character child
elements. Each of these children has a character attribute to
identify the character to map and a string attribute storing
the string to map it to. Because an XML parser that sees the string
"&nbsp;" in any attribute value interprets it as a non-breaking
space character and not as the entity reference string for this
character, this must be written as "&amp;nbsp;" in
the xsl:output-character element's string attribute
value to show that we want the entity reference added to the result
tree.
The first two xsl:output-character elements
specify their character values with a numeric
character reference. Remember, just because something begins with
an ampersand and ends with a semicolon, that doesn't make it an entity
reference. An entity is a name that is declared and referenced,
whether it's predeclared like the lt or amp entities
or declared in a DTD as one might do with nbsp
or ntilde. A numeric character reference is a way to indicate
a character by using its code point number; no declaration is
necessary, so it's not a named unit of storage, and therefore not an
entity.
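To make the distinction concrete, here is a small sketch of a document that uses both kinds of reference. The doc and p element names and the sample text are assumptions for illustration only. The two entities in the internal DTD subset have to be declared before they can be referenced; the numeric character references and the predeclared lt and amp entities need no declaration at all.

```xml
<!DOCTYPE doc [
  <!-- Entity declarations: each one gives a name to a replacement value -->
  <!ENTITY nbsp   "&#160;">
  <!ENTITY ntilde "&#241;">
]>
<doc>
  <!-- Entity reference, decimal character reference, and hexadecimal
       character reference: all three yield the same word after parsing -->
  <p>Se&ntilde;or, Se&#241;or, Se&#xF1;or</p>

  <!-- Declared entity versus numeric character reference for code point 160 -->
  <p>a&nbsp;b, a&#160;b</p>

  <!-- Predeclared entities: usable without any DTD -->
  <p>&lt; &amp;</p>
</doc>
```

After parsing, an XSLT processor sees only the resulting characters in each case and has no record of which notation the source document used.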
For the third xsl:output-character element,
instead of a numeric character reference, I just entered the character
"ô" in there to show which character I wanted mapped. The
fourth xsl:output-character element maps the character for
the punctuation mark known as the "em dash"—a punctuation mark that I
probably use too much—to a pair of hyphens, which is the traditional
way to represent an em dash when all you have are 7-bit ASCII
characters.
Of the four strings that this character map can add to
the result tree, only &nbsp; is an entity reference. If the
document created from the result tree is XML, a parser that reads that
document will choke on "&nbsp;" if it never saw a declaration for
it, so the stylesheet should include an xsl:output top-level
element with a doctype-system attribute so that the result
document points to a DTD with the appropriate declaration. If the
result is HTML, though, this isn't necessary, because all web browsers
understand the &nbsp; entity reference.
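A minimal sketch of that arrangement for XML output follows. The DTD file name slides.dtd and the map name nbsp-only are hypothetical, chosen only for this example.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- doctype-system adds a DOCTYPE declaration with this system identifier
       to the result document; slides.dtd is a hypothetical file name -->
  <xsl:output method="xml"
              doctype-system="slides.dtd"
              use-character-maps="nbsp-only"/>

  <!-- Write code point 160 to the output as the entity reference string -->
  <xsl:character-map name="nbsp-only">
    <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
  </xsl:character-map>

  <!-- Identity transformation -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

For the result to parse, the DTD that the DOCTYPE declaration points to would need at least a declaration along these lines:

```xml
<!-- In slides.dtd (hypothetical): declare the entity the result refers to -->
<!ENTITY nbsp "&#160;">
```

This assumes that whatever parser reads the result document actually loads the external DTD, which not every non-validating parser does.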
The second xsl:output-character element almost
looks like it's mapping "&#233;" to itself. It isn't, though; the
XML parser that hands this stylesheet to the XSLT processor will turn
the character attribute value into the character itself as it
puts it on the source tree, and then the XSLT processor, as it reads
the source tree and creates data for the result tree, will convert
that character to the string specified in that
xsl:output-character element's string
attribute.
As a test, I ran the stylesheet on this source document:
<doc word="côté">
<p>côté is the French word for "side."</p>
<p>A non-breaking space (&#160;) presents special problems—and
requires special handling.</p>
</doc>
I often use the word "côté" as an example for English-speaking people who don't take accented characters seriously enough, because the presence or absence of each of these accents can turn it into three different words. Using Saxon 7, the stylesheet turns the document into this:
<?xml version="1.0" encoding="UTF-8"?><doc word="c&#244;t&#233;">
<p>c&#244;t&#233; is the French word for "side."</p>
<p>A non-breaking space (&nbsp;) presents special problems--and
requires special handling.</p>
</doc>
The two vowels in "côté" have been converted to numeric
character references, both in the doc element's word
attribute and in the content of the first p element. The em
dash was converted to two hyphens, and the non-breaking space was
converted to an entity reference for it.
Why would you need XSLT 2.0's character mapping feature? The most common problem it solves is the mangling of non-ASCII characters by processes that can't handle the encoding of a file that they're reading. The PowerPoint case above is one example. Confusion in some processes between the UTF-8 encoding, in which characters such as French vowels with accents are represented with two bytes, and Latin-1, which represents them with one byte, has often led to two strange bytes showing up in my browser window where I expected to see one accented vowel. (One more example as I finish up the last draft of this column: I pulled up the XHTML that I wrote into Microsoft Word to see if its spell checker would catch anything that I missed, and despite the XML declaration at the top of this file indicating that it's in UTF-8, Word thinks it's Latin-1, and shows the foreign characters and em dashes as two garbage bytes each.) Mapping these characters to their numeric character references, in which any Unicode code point can be represented using 7-bit ASCII (an ampersand followed by a pound sign, the code point number for the character, and a semicolon), can help to maintain the integrity of the representation all the way to the delivery application.
A straight mapping of a character to a 7-bit ASCII representation, as we saw with the em dash example, can also provide a compromise between a typographically slick representation of something and the possibility that garbage character(s) will appear in its place. Ultimately, XSLT 2.0 character mapping gives us more control over how our characters look and get represented, and it will be very handy.