XML.com: XML From the Inside Out

Euro-XML

September 18, 2002

The new European currency, the euro, has a symbol € in Unicode 3.2 as character U+20AC. How can we use it with XML?

There are three ways of representing the euro in XML:

  • numeric character references,
  • character entity references, and
  • direct characters.

This article examines these and other more arcane but important ramifications.

Numeric Character References

You can enter the euro character as data in element content or attribute values using numeric character references in any XML document: hexadecimal &#x20AC; or decimal &#8364;. This character is allowed in both XML 1.0 and the proposed XML 1.1.

Numeric character references will not be recognized in CDATA marked sections and cannot be used in XML names, such as element names, attribute names and IDs.
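
As a rough illustration of the rules above, here is a small Python sketch using the standard library's ElementTree parser. The element and attribute names (prices, item, cost, note) are made up for the example; the point is only that both forms of the reference resolve to U+20AC in content and attribute values, while the same text inside a CDATA section stays literal.

```python
import xml.etree.ElementTree as ET

# Hexadecimal and decimal references to the same character, U+20AC.
doc = '<prices><item cost="&#x20AC;5">&#8364;10</item></prices>'
item = ET.fromstring(doc).find("item")

attr_value = item.get("cost")   # reference used in an attribute value
text_value = item.text          # reference used in element content

# Inside a CDATA section the reference is not recognized; it stays as text.
cdata_text = ET.fromstring("<note><![CDATA[&#x20AC;]]></note>").text

print(hex(ord(attr_value[0])), hex(ord(text_value[0])), cdata_text)
```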

Character Entity References

A friendlier alternative is to use the standard entity &euro;. This can be used in the same places that you can use numeric character references.

An entity must have a declaration. The most failsafe approach is to supply your own: make sure your document has a DOCTYPE declaration with the following declaration as part of its internal subset.

	<!ENTITY euro "&#x20AC;">

The internal subset is the thing between the brackets in many DOCTYPE declarations:

 <!DOCTYPE ...  [
 
 	<!-- internal subset -->
 	
 ]>
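
To see the effect of the declaration, the following sketch parses one document that declares the entity in its internal subset and one that does not. It assumes Python's ElementTree (a non-validating parser built on expat); the price element is hypothetical.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal subset, so the reference resolves.
declared = """<!DOCTYPE price [
  <!ENTITY euro "&#x20AC;">
]>
<price>&euro;25</price>"""

# Same content, but no declaration anywhere.
undeclared = "<price>&euro;25</price>"

declared_text = ET.fromstring(declared).text

try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    # The parser rejects a reference to an entity it has never seen declared.
    undeclared_ok = False

print(hex(ord(declared_text[0])), undeclared_ok)
```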

If you are using XHTML or HTML you are in luck: there is already a declaration provided for the euro as part of the HTML Special entity set. If you wish to use those entity declarations, include the following markup declarations:

     <!ENTITY % HTMLspecial PUBLIC
       "-//W3C//ENTITIES Special//EN//HTML">
     %HTMLspecial; 

Earlier this year, ISO JTC1 SC34 decided to add the Euro to the ISONum public entity set, with the same definition as HTML's. The updated version has not been released, and this will not be dependably available for some time.

If you are using the &euro; form in XML other than XHTML, you should provide your own definition. It is not an error to define an entity multiple times; the first declaration found is used in preference to later ones. Because there is no disagreement about which Unicode character should be used, there should be no harm in putting the entity declaration at the end of the internal subset.
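
The first-declaration-wins rule can be checked with a short sketch, again assuming Python's ElementTree. The second declaration, with the replacement text "EUR", is invented purely to make the two declarations distinguishable.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE price [
  <!ENTITY euro "&#x20AC;">
  <!ENTITY euro "EUR">
]>
<price>&euro;25</price>"""

# The later declaration is silently ignored; the first binding is used.
text = ET.fromstring(doc).text
print(hex(ord(text[0])))
```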

Direct Characters

Third, if you are using UTF-8 or UTF-16, then you can enter the character directly. Your GUI may provide a mechanism, and editors aimed at publishing will also provide one.

In Adobe® FrameMaker®, for example, you hold Alt down and type 0128 on the keypad. In my Topologi™ Collaborative Markup Editor, you can enter the character by hex number or use the Keyboard>Currency menu.

The character type of modern programming languages such as Java, C#, Python and recently Perl is Unicode, typically in the UTF-16 encoding, which uses 16-bit code units to represent characters.

When a Unicode character greater than U+FFFF is needed, two UTF-16 code units are used, through a mechanism called surrogates, which complicates the simple expectation that one code should equal one character if you have 16-bit characters. For more information on characters and encodings, there are two excellent books: Ken Lunde's CJKV Information Processing (O'Reilly), which concentrates on East Asian encoding issues, and Tony Graham's Unicode: a Primer (IDG), which has useful information on Unicode in particular.
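
A minimal Python sketch of the surrogate mechanism follows. The helper name utf16_code_units is hypothetical, and U+1D11E (MUSICAL SYMBOL G CLEF) is simply a convenient example of a character above U+FFFF; the euro itself fits in a single 16-bit unit.

```python
def utf16_code_units(ch):
    """Return the 16-bit values used to encode ch in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

euro_units = utf16_code_units("\u20ac")       # one unit: 0x20AC
clef_units = utf16_code_units("\U0001d11e")   # surrogate pair: 0xD834, 0xDD1E

print([hex(u) for u in euro_units], [hex(u) for u in clef_units])
```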

Goodbye ISO 8859-1?

However, the most common way that the Euro will be used will be as part of an XML document encoded using your system's local or regional character set. And this is where the Euro will complicate our lives in XML.

Web developers familiar with HTML will typically choose to use ISO 8859-1 (Latin 1) as the encoding for Western European documents: indeed, it is the default for HTML.

The problem? ISO 8859-1 does not have the euro character in it. Instead, the developers of the ISO 8859 series have issued ISO 8859-15 (Latin 9), which both adds the euro and replaces some unfortunate and rarely-used characters in 8859-1 with letters for better support of French and Finnish. In particular, the euro takes over the 0xA4 code point used as the generic CURRENCY SIGN in ISO 8859-1.
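
The difference is easy to demonstrate with Python's built-in codecs (an assumption of this sketch; any transcoder with both tables would do). The same byte, 0xA4, decodes to two different characters, and the euro cannot be encoded in ISO 8859-1 at all.

```python
raw = b"\xa4"

as_latin1 = raw.decode("iso-8859-1")    # U+00A4 CURRENCY SIGN
as_latin9 = raw.decode("iso-8859-15")   # U+20AC EURO SIGN

try:
    "\u20ac".encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    # ISO 8859-1 has no code for the euro.
    latin1_has_euro = False

print(hex(ord(as_latin1)), hex(ord(as_latin9)), latin1_has_euro)
```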

Even more confusingly, this set, which is officially Latin 9, is now being called Latin 0, especially in Linux circles; probably a fitting brand name.

Character encodings are registered with IANA. Here is the registration:

Name: ISO-8859-15
Alias: iso-ir-203
Alias: iso-8859-15 (preferred MIME name)
Alias: latin9
Alias: latin0
Alias: csISOLatin15

So if you are using Latin 0 with XML, the preferred XML header is

<?xml version="1.0" encoding="iso-8859-15"?>

When sending XML documents as text over HTTP, use

Content-Type: text/xml;charset=iso-8859-15
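
As a sketch of what such a document looks like on the wire, the following Python fragment serializes a hypothetical price element as ISO 8859-15 and parses it back. It assumes ElementTree, which writes an XML declaration naming the encoding when a non-UTF-8 encoding is requested (it happens to use single quotes in the declaration).

```python
import xml.etree.ElementTree as ET

price = ET.Element("price")
price.text = "\u20ac10"

# Serialize with an explicit encoding; the euro becomes the single byte 0xA4.
data = ET.tostring(price, encoding="iso-8859-15")

# A parser that honours the declaration turns 0xA4 back into U+20AC.
round_trip = ET.fromstring(data).text

print(data)
```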

The Case of the Missing On-Screen Character

This is a trap for new and old players: you look at the document on your screen and the character is not there. The obvious conclusion: the euro has been stripped out during some import or processing.

Not so fast. While it may be that the euro was deleted during import (many transcoding systems just discard characters they don't know what to do with) the more likely explanation is that the current font does not have the Euro character.

The place in your document where the euro character is expected may be empty, or perhaps have some other glyph (picture) showing (a square box, for example).

Since Win98 and MacOS 8.5, operating systems, transcoders, fonts and applications have had time to become euro-friendly in preparation for 2002; check that your systems have been updated.

Until last month, Microsoft had euro-friendly versions of their Core Web Fonts available at http://www.microsoft.com/typography. This was a great way for older versions of Windows to keep abreast. Microsoft have removed this now, but the independent Corefonts project has been set up to redistribute the fonts under the license originally granted by Microsoft, offering support for Windows and Linux systems.

Linux systems are a little fiddly with respect to fonts. You may have to check that you have Latin 0 fonts. Try the utility xfontsel and look for iso8859-15 encodings. The long font names used by X Windows include the font mapping in the last position of the name: it would be iso8859-15.

Windows Code Pages

On the Windows side, the most common Windows code page for Western documents is CP1252, sometimes aliased as ANSI. (If you are at a party and wish to avoid standards-people, just loudly talk about ANSI code page and watch whose nostrils twitch.) CP 1252 is a superset of ISO 8859-1.

The euro character has been introduced as 0x80 in most Microsoft code pages for Europe:

  • CP1252 (Western Europe),
  • CP1250 (Eastern Europe),
  • CP1253 (Greek),
  • CP1254 (Turkish),
  • CP1257 (Baltic)

An exception is the Cyrillic code page CP1251, where it is at code point 0x88.
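
These positions can be checked against any transcoder's tables. The sketch below uses Python's codec names (cp1250, cp1251 and so on), which are an assumption of the example and not names used elsewhere in this article.

```python
euro = "\u20ac"

# Encode the euro in several Windows code pages and record the resulting byte.
positions = {
    codepage: euro.encode(codepage)
    for codepage in ("cp1250", "cp1252", "cp1253", "cp1254", "cp1257", "cp1251")
}

for codepage, byte in positions.items():
    print(codepage, byte.hex())
```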

In Unicode, the character at U+0080 is reserved for control characters, to be determined by the application, but suggested as ISO 6429 C1 set by default. (The C1 controls are the characters in an 8-bit set between 0x80 and 0x9F, reserved for control functions.)

In any case, neither Unicode nor ISO 6429 specifies a character for the control code 0x80, so by any criteria, if you find a character at 0x80 in your Unicode data, there is something fishy going on. The most likely explanation is that someone has used the new CP 1252 but has mislabeled it as ISO 8859-1 or ISO 8859-15. Developers should note that they cannot rely on transcoding software to catch the error where there is a 0x80 in data labeled ISO 8859-n: even though modern APIs such as Java 1.4's transcoders will generate exceptions when a bad encoding is detected, 0x80 is a legitimate (though unused) code in ISO 8859-n encodings and so will probably not generate an error. I note that at least some versions of MSXML 4 do the right thing and complain, but in XML 1.0 the behavior has been underspecified and largely up to the skills and expectations of the programmers creating XML parsers.
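
Since the transcoder will not complain, one possible safeguard is to look for C1 code points after decoding. The following is only a sketch under that assumption: the function name c1_controls and the sample text are hypothetical, and a real check might want to exempt NEL (U+0085) for data that legitimately uses it.

```python
def c1_controls(data, declared_encoding):
    """Return offsets of characters in the C1 range (U+0080 to U+009F)."""
    text = data.decode(declared_encoding)   # decoding itself raises no error
    return [i for i, ch in enumerate(text) if "\u0080" <= ch <= "\u009f"]

# A CP1252 document: the euro is stored as the byte 0x80.
sample = "Price: \u20ac10".encode("cp1252")

suspicious = c1_controls(sample, "iso-8859-1")   # mislabeled data
clean = c1_controls(sample, "cp1252")            # correctly labeled data

print(suspicious, clean)
```

A non-empty result for data labeled ISO 8859-n is a strong hint that the label is wrong and the bytes are really from a Windows code page.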

How could XML 1.1 help?

There has been some discussion recently of how to treat those control characters in XML 1.1. In order to catch as many encoding mislabeling problems as possible, it would be best to ban the C1 characters outright; but that would leave a range of characters that can appear in a DOM but not be interchanged.

The best compromise between the two conflicting requirements of interchange and error-detection will be to say that control characters, except for the various whitespace characters like CR, NL, SPACE, TAB and NEL, must be serialized in XML 1.1 as numeric character references. (The NEL character U+0085 is used by some text applications on IBM OS/390 mainframes.) This works around the layering problem that control characters in the C0 range are used for flow control in serial communications and so should never appear in their direct form in an XML document text.
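
A serializer following that compromise might look something like the sketch below. It is an illustration of the idea only, not the XML 1.1 rules themselves: the function name escape_controls and the KEEP set are assumptions, and it ignores the usual escaping of & and < as well as characters that may not appear in XML at all.

```python
# Whitespace-like controls that are written directly: TAB, NL, CR and NEL.
KEEP = {"\t", "\n", "\r", "\u0085"}

def escape_controls(text):
    """Write C0 and C1 control characters as numeric character references."""
    out = []
    for ch in text:
        code = ord(ch)
        is_control = code < 0x20 or 0x80 <= code <= 0x9F
        if is_control and ch not in KEEP:
            out.append("&#x%X;" % code)
        else:
            out.append(ch)
    return "".join(out)

escaped = escape_controls("a\x80b\tc\x01")
print(escaped.replace("\t", "<TAB>"))
```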

Until recently most people in Europe could get away with editing their documents with a CP 1252 editor: the extra characters are not that common.

But now XML developers in the West join their East Asian colleagues in being able to recognize encoding-related problems.

Hints

If you are using CP1252, you may find some problems: older transcoders may not know the correct name, and transcoding software may not understand the correct alias. The official IANA name seems to be windows-1252 but the Java transcoder uses the name cp1252.
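
Before relying on a particular label, it can be worth asking your transcoder which names it accepts. The sketch below does this with Python's codecs module, chosen only as an example; Java and other platforms have their own lookup mechanisms and alias tables. The label no-such-charset is deliberately bogus.

```python
import codecs

# Both labels resolve to the same codec in Python's alias table.
iana = codecs.lookup("windows-1252")
java_style = codecs.lookup("cp1252")
same_codec = iana.name == java_style.name

try:
    codecs.lookup("no-such-charset")
    unknown_accepted = True
except LookupError:
    # An unrecognized label fails at lookup time, before any data is read.
    unknown_accepted = False

print(iana.name, java_style.name, same_codec, unknown_accepted)
```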

If you are using Windows, there is usually an excellent Character Map utility available in the Accessories menu: look through each font to find one which has the euro glyph available, then use that font in the application in which you are trying to view the XML document. Other operating systems have similar utilities available.


