The new European currency, the euro, has a symbol € in Unicode 3.2 as character U+20AC. How can we use it with XML?
There are three ways of representing the euro in XML:
- numeric character references,
- character entity references, and
- direct characters.
This article examines these and other more arcane but important ramifications.
Numeric Character References
You can enter the euro character as data in element content or
attribute values using numeric character references in any XML document:
&#x20AC; in hexadecimal or &#8364; in decimal.
This character is allowed both in XML 1.0 and the proposed XML 1.1.
Numeric character references will not be recognized in CDATA marked sections and cannot be used in XML names, such as element names, attribute names and IDs.
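As a quick check of these rules, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The element name price and the attribute name currency are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: hex reference in an attribute value,
# decimal reference in element content.
doc = '<price currency="&#x20AC;">&#8364;100</price>'
root = ET.fromstring(doc)
attr_value = root.get("currency")   # the single character U+20AC
content = root.text                 # U+20AC followed by "100"

# Inside a CDATA marked section the reference is NOT recognized:
# the characters & # x 2 0 A C ; arrive as literal text.
cdata_root = ET.fromstring("<price><![CDATA[&#x20AC;100]]></price>")
cdata_text = cdata_root.text

print(attr_value == "\u20ac", content == "\u20ac100", cdata_text)
```

Both reference forms resolve to the same character, while the CDATA section passes the reference through as plain text.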
Character Entity References
A friendlier alternative is to use the standard entity
&euro;. This can be used in the same places that
you can use numeric character references.
An entity must have a declaration. The most failsafe approach
is to supply your own: make sure your document has a
DOCTYPE declaration with the following declaration
as part of its internal subset.
<!ENTITY euro "&#x20AC;">
The internal subset is the thing between the brackets
<!DOCTYPE ... [ <!-- internal subset --> ]>
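The effect of supplying the declaration yourself can be sketched in Python with xml.etree.ElementTree. The document element name doc is an assumption made for this example, and other parsers may report the undeclared case differently.

```python
import xml.etree.ElementTree as ET

# Document that declares the euro entity in its internal subset.
with_decl = """<!DOCTYPE doc [
  <!ENTITY euro "&#x20AC;">
]>
<doc>&euro;5</doc>"""
declared_text = ET.fromstring(with_decl).text

# The same reference with no declaration is a well-formedness error.
try:
    ET.fromstring("<doc>&euro;5</doc>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(declared_text == "\u20ac5", undeclared_ok)
```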
If you are using XHTML or HTML you are in luck: there is already a declaration provided for the euro as part of the HTML Special entity set. If you wish to use those entity declarations, include the following markup declarations:
<!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special//EN//HTML"> %HTMLspecial;
Earlier this year, ISO JTC1 SC34 decided to add the Euro to the ISONum public entity set, with the same definition as HTML's. The updated version has not been released, and this will not be dependably available for some time.
If you are using the
&euro; form in XML other than
XHTML, you should provide your own definition. It is not an error to have
an entity defined multiple times; the first declaration found is used in
preference to subsequent ones. Because there is no disagreement about which
Unicode character should be used, there should be no harm in putting the
entity declaration at the end of the internal subset.
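The first-declaration-wins rule can be observed with a small Python sketch. The second replacement text, EUR, is a made-up value chosen only so that the two declarations are distinguishable.

```python
import xml.etree.ElementTree as ET

# Two declarations of the same entity; the second value ("EUR") is a
# deliberately different, hypothetical replacement text.
doc = """<!DOCTYPE doc [
  <!ENTITY euro "&#x20AC;">
  <!ENTITY euro "EUR">
]>
<doc>&euro;</doc>"""

resolved = ET.fromstring(doc).text
print(resolved == "\u20ac")   # the first declaration is the binding one
```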
Third, if you are using UTF-8 or UTF-16, then you can enter the character directly. Your GUI may provide a mechanism, and editors aimed at publishing will also provide one.
In Adobe® FrameMaker®, for example, you hold
down the Alt key and type
0128 on the keypad. In my Topologi™
Collaborative Markup Editor, you can enter the character by hex number or
use the Keyboard>Currency menu.
The character type of modern programming languages such as Java, C#, Python and recently Perl is Unicode, typically in the UTF-16 encoding, which represents characters using 16-bit code units.
When a Unicode character greater than U+FFFF is needed, two UTF-16 code units are used, using a mechanism called surrogates, which complicates the simple expectation that one code should equal one character if you have 16-bit characters. For more information on characters and encodings, there are two excellent books: Ken Lunde's CJKV Information Processing (O'Reilly), which concentrates on East Asian encoding issues, and Tony Graham's Unicode: a Primer (IDG), which has useful information on Unicode in particular.
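To make the surrogate mechanism concrete, the following Python sketch counts the 16-bit units that UTF-16 needs for the euro and for a character beyond U+FFFF. The helper name utf16_code_units and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the sample are ours, not the article's.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit units UTF-16 uses to encode ch."""
    # Big-endian UTF-16 without a byte order mark: 2 bytes per unit.
    return len(ch.encode("utf-16-be")) // 2

euro = "\u20ac"        # inside the 16-bit range
clef = "\U0001d11e"    # beyond U+FFFF, so it needs a surrogate pair

euro_units = utf16_code_units(euro)
clef_units = utf16_code_units(clef)
clef_hex = clef.encode("utf-16-be").hex()   # high surrogate, then low

print(euro_units, clef_units, clef_hex)
```

The euro fits in a single unit, so it is not affected by surrogates; the clef is stored as the pair D834 DD1E.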
Goodbye ISO 8859-1?
However, the most common way that the Euro will be used will be as part of an XML document encoded using your system's local or regional character set. And this is where the Euro will complicate our lives in XML.
Web developers familiar with HTML will typically choose to use ISO 8859-1 (Latin 1) as the encoding for Western European documents: indeed, it is the default for HTML.
The problem? ISO 8859-1 does not have the euro character in it. Instead, the developers of the ISO 8859 series have issued ISO 8859-15 (Latin 9), which both adds the euro and replaces some unfortunate and rarely-used characters in 8859-1 with letters for better support of French and Finnish. In particular, the euro takes over the 0xA4 code point used as the generic CURRENCY SIGN in ISO 8859-1.
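The difference between the two encodings at 0xA4 can be checked with Python's built-in codecs. This is only a sketch of the behavior described above, and the variable names are illustrative.

```python
euro = "\u20ac"

# ISO 8859-15 has the euro, at the position of the old currency sign.
latin9_bytes = euro.encode("iso-8859-15")

# ISO 8859-1 has no euro at all, so encoding fails.
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# The same byte read as ISO 8859-1 is the generic CURRENCY SIGN, U+00A4.
as_latin1 = latin9_bytes.decode("iso-8859-1")

print(latin9_bytes, latin1_has_euro, hex(ord(as_latin1)))
```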
Even more confusingly, this set, which is officially Latin 9, is now being called Latin 0, especially in Linux circles; probably a fitting brand name.
Character encodings are registered with IANA. Here is the registration:
Name: ISO-8859-15
Alias: iso-ir-203
Alias: iso-8859-15 (preferred MIME name)
Alias: latin9
Alias: latin0
Alias: csISOLatin15
So if you are using Latin 0 with XML, the preferred XML header is
<?xml version="1.0" encoding="iso-8859-15"?>
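A small Python sketch shows why the label in the header matters. It assumes a parser that supports ISO 8859-15, as Python's xml.etree.ElementTree does through its codec machinery; the price element is invented for the example.

```python
import xml.etree.ElementTree as ET

# Raw bytes as they would sit in a file: 0xA4 is the euro in ISO 8859-15.
data = b'<?xml version="1.0" encoding="iso-8859-15"?><price>\xa4100</price>'
labeled_text = ET.fromstring(data).text

# The identical bytes with the header changed to ISO 8859-1 parse
# without error, but 0xA4 now means CURRENCY SIGN (U+00A4).
mislabeled = data.replace(b"iso-8859-15", b"iso-8859-1")
mislabeled_text = ET.fromstring(mislabeled).text

print(labeled_text == "\u20ac100", mislabeled_text == "\u00a4100")
```

Neither parse raises an error, so a wrong label silently changes the character rather than failing.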
When sending XML documents as text over HTTP, use
The Case of the Missing On-Screen Character
This is a trap for new and old players: you look at the document on your screen and the character is not there. The obvious conclusion: the euro has been stripped out during some import or processing.
Not so fast. While it may be that the euro was deleted during import (many transcoding systems just discard characters they don't know what to do with) the more likely explanation is that the current font does not have the Euro character.
The place in your document where the euro character is expected may be empty, or perhaps have some other glyph (picture) showing (a square box, for example).
Since Win98 and MacOS 8.5, operating systems, transcoders, fonts and applications have had time to become euro-friendly in preparation for 2002; check that your systems have been updated.
Until last month, Microsoft had euro-friendly versions of their Core Web Fonts available at http://www.microsoft.com/typography. This was a great way for older versions of Windows to keep abreast. Microsoft have removed this now, but the independent Corefonts project has been set up to redistribute the fonts under the license originally granted by Microsoft, offering support for Windows and Linux systems.
Linux systems are a little fiddly with respect to fonts. You may have
to check that you have Latin 0 fonts. Try the utility
Windows Code Pages
On the Windows side, the most common Windows code page for
Western documents is
CP1252, sometimes aliased as
ANSI. (If you are at a party and wish to avoid
standards-people, just loudly talk about the ANSI code page and watch
whose nostrils twitch.)
CP 1252 is a superset of ISO 8859-1.
The euro character has been introduced as 0x80 in most Microsoft code pages for Europe:
- CP1252 (Western Europe),
- CP1250 (Eastern Europe),
- CP1253 (Greek),
- CP1254 (Turkish),
- CP1257 (Baltic)
An exception is the Cyrillic code page CP1251, where it is code point 0x88.
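These assignments can be confirmed against the code page tables that ship with Python. The sketch below uses Python's codec names (cp1250 and so on) and includes the Baltic page as cp1257 and the Cyrillic page as cp1251; it reflects Python's tables rather than any Microsoft tool.

```python
EURO = "\u20ac"

# Windows code pages, by Python codec name.
code_pages = ["cp1250", "cp1252", "cp1253", "cp1254", "cp1257", "cp1251"]

# Map each code page to the hex value of the byte that encodes the euro.
euro_bytes = {cp: EURO.encode(cp).hex() for cp in code_pages}

for cp, byte_hex in euro_bytes.items():
    print(cp, "0x" + byte_hex.upper())
```

Only the Cyrillic page places the euro at 0x88; the others listed here use 0x80.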
In Unicode, the code point U+0080 is reserved for control characters,
to be determined by the application, but suggested as the ISO 6429 C1 set by
default. (The C1 controls are the characters in an 8-bit set between
0x80 and 0x9F, reserved for control functions.)
In any case, neither Unicode nor ISO 6429 specifies a character for
0x80, so by any criteria if you find a character
0x80 in your Unicode data there is something fishy going
on. The most likely explanation is that someone has used the new
CP1252 but has mislabeled it as ISO 8859-1 or ISO 8859-15.
Developers should note that they cannot rely on transcoding software to
catch the error where there is a 0x80 in data labeled ISO 8859-n;
even though modern APIs such as Java 1.4's transcoders will generate
exceptions when a bad encoding is detected, 0x80 is a legitimate (though
unused) code in ISO 8859-n encodings and so will probably not
generate an error. I note that at least some versions of MSXML 4 do the
right thing and complain, but in XML 1.0 the behavior has been
underspecified and largely up to the skills and expectations of the
programmers creating XML parsers.
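Because the transcoder may stay silent, an application can look for the symptom itself. The following Python sketch is one possible check, not a standard API: the helper name find_c1_controls is hypothetical, and it simply reports any character in the C1 range U+0080 to U+009F after decoding.

```python
def find_c1_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) pairs for C1 control characters."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F
    ]

# A euro typed in a CP1252 editor is stored as the byte 0x80.
raw = "10 \u20ac".encode("cp1252")

# Decoding under the wrong label succeeds without any exception...
mislabeled = raw.decode("iso-8859-1")
suspects = find_c1_controls(mislabeled)

# ...whereas the correct label recovers the euro and nothing is flagged.
correct = raw.decode("cp1252")
clean = find_c1_controls(correct)

print(suspects, clean)
```

A hit in the C1 range is a hint rather than proof, since U+0085 (NEL) also falls in that range and may be used deliberately.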
How could XML 1.1 help?
There has been some discussion recently of how to treat those control characters in XML 1.1. In order to catch as many encoding mislabeling problems as possible, it would be best to ban the C1 characters outright; but that would leave a range of characters that can appear in a DOM but not be interchanged.
The best compromise between the two conflicting requirements of
interchange and error-detection will be to say that control characters,
except for the various whitespace characters like CR, NL, SPACE, TAB and
NEL, must be serialized in XML 1.1 as numeric character references. (The
Until recently most people in Europe could get away with editing their
documents with a
CP 1252 editor: the extra characters are not
But now XML developers in the West join their East Asian colleagues in being able to recognize encoding-related problems.
If you are using
CP1252, you may find some problems:
older transcoders may not know the correct name, and transcoding software
may not understand the correct alias. The official IANA name seems to be
windows-1252 but the Java transcoder uses the name
If you are using Windows, there is usually an excellent Character Map utility available in the Accessories menu: look through each font to find one which has the euro glyph available, then use that font in the application in which you are trying to view the XML document. Other operating systems have similar utilities available.