Compatibility Characters

Unicode makes a heroic effort to include all the real, useful characters from all the world's writing systems. But when it was invented, it had to deal with the fact that there was a lot of existing text in the world. As a result, it had to adopt some fairly unwanted children.

For example, in English, when a lower-case "f" is followed by a lower-case "i", it is common to print them in a combined form called a ligature. It's probably wrong to think of this thing as a "character", but nonetheless, there were a lot of typesetting systems around that had a code for it. So there is a Unicode "compatibility" character (#xfb01, decimal 64257, if you must know) for this and for quite a few similar misshapen beasts.
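If you want to poke at this one, Python's standard unicodedata module will show you its compatibility mapping. A quick illustrative sketch (nothing here is required by XML; it's just a way to look at the character data):

    import unicodedata

    lig = "\ufb01"                             # the "fi" ligature, #xfb01
    print(ord(lig))                            # 64257
    print(unicodedata.name(lig))               # LATIN SMALL LIGATURE FI
    print(unicodedata.decomposition(lig))      # <compat> 0066 0069
    print(unicodedata.normalize("NFKC", lig))  # fi  (a plain "f" plus "i")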

There are quite a few Japanese compatibility characters, in particular the dreaded "half-width katakana". These arose from a misguided attempt, in the early days of computing, to handle Japanese text by using only one of the three Japanese writing systems and storing each character in a single byte. These "half-width katakana" were also printed at the same width as a Latin character, which is, well, half as wide as is really comfortable. The problem is that although Japanese computing is now done properly with 16-bit characters, there were old systems that depended on these half-width mistakes, and thus they made it into Unicode.
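If you're stuck with a half-width katakana, the Unicode compatibility mappings will also fold it back into the ordinary full-width form. A small sketch, again with Python's unicodedata; the choice of U+FF71 is just an arbitrary example:

    import unicodedata

    half = "\uff71"                            # HALFWIDTH KATAKANA LETTER A
    full = unicodedata.normalize("NFKC", half)
    print(unicodedata.name(full))              # KATAKANA LETTER A (U+30A2)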

However, the Unicode people were smart enough to provide an index that makes it easy to spot all these compromises with history, and it would be a really good idea never, under any circumstances, to use them in a modern XML document. But if you must, this discouraging note won't actually stop you.
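If you'd like to spot them before they sneak into a document, here is one rough way to do it; a sketch only, using the tagged compatibility decompositions as the test, which is a decent proxy but not the full official definition of a "compatibility character":

    import unicodedata

    def compatibility_chars(text):
        # Characters whose decomposition carries a tag such as <compat>,
        # <narrow> or <wide> are compatibility decomposable -- a reasonable
        # stand-in for "compromise with history".
        return [ch for ch in text
                if unicodedata.decomposition(ch).startswith("<")]

    print(compatibility_chars("ef\ufb01cient \uff71"))  # ['ﬁ', 'ｱ']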

Copyright © 1998, Tim Bray. All rights reserved.