Encodings

Unicode, as we discussed above, identifies characters using numbers: 16 bits for the characters in everyday use, with the Surrogate block trick covering the rest. But that's not everything you need to know, because there are several different ways of storing those numbers in computer files, and XML can hardly get away with telling everyone they have to use the same one.

Unicode Encodings

Unicode itself defines several ways of doing this. UTF-16 is the simplest: it stores each 16-bit character in 16 bits in the obvious way, and stores larger characters as a pair of 16-bit values using the Surrogate block trick described above.
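To make that concrete, here is a minimal sketch in Python that asks the built-in `utf-16-be` codec for the 16-bit units behind a character. The helper name `utf16_units` and the two sample characters are just illustrative choices, not anything from the spec.

```python
def utf16_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A character that fits in 16 bits is stored as a single unit.
bmp = utf16_units("\u00e9")         # LATIN SMALL LETTER E WITH ACUTE

# A character beyond U+FFFF is stored as two units from the Surrogate block.
big = utf16_units("\U0001D11E")     # MUSICAL SYMBOL G CLEF

print([hex(u) for u in bmp])
print([hex(u) for u in big])
```

The first character comes back as one unit, the second as a pair of values in the range 0xD800 to 0xDFFF.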

UTF-8 is a trick, originally invented by the Unix gang at Bell Labs, which stores old-fashioned 7-bit ASCII characters as themselves, and anything else as a sequence of 2 to 4 bytes, each with a value greater than 127. UTF-8 saves disk space if your text is mostly ASCII, but costs more than UTF-16 for text in scripts such as Chinese or Japanese, where most characters take 3 bytes instead of 2.
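A quick way to see this is to encode a few characters and look at the bytes. The sketch below uses Python's built-in `utf-8` codec; the four sample characters are arbitrary picks, one each from ASCII, Latin, Chinese, and the range beyond U+FFFF.

```python
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]

# Map each sample character to its UTF-8 byte sequence.
utf8_bytes = {ch: ch.encode("utf-8") for ch in samples}

for ch, raw in utf8_bytes.items():
    print(f"U+{ord(ch):04X}: {len(raw)} byte(s), values {list(raw)}")

# Every byte belonging to a non-ASCII character has a value above 127.
non_ascii_ok = all(
    b > 127
    for ch, raw in utf8_bytes.items() if ord(ch) > 127
    for b in raw
)
print(non_ascii_ok)
```

The ASCII letter stays a single byte equal to its ASCII value, and the others grow by one byte each as the character number gets larger.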

UTF-8 also has the important virtue that pre-Unicode computer programs that process text a byte at a time can usually, if they're not doing anything too fancy, get away with processing UTF-8. On the other hand, if you're programming in Java, where all characters are 16 bits, you'd rather stay away from UTF-8.
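Here is a small sketch of why byte-at-a-time code tends to survive UTF-8. It splits a comma-separated record at the byte level, with no knowledge of Unicode, and then does the same to a UTF-16 character chosen (for illustration only) because one of its bytes happens to equal the ASCII comma.

```python
# "naive,cafe,(two Chinese characters)" with accents, written using escapes.
record = "na\u00efve,caf\u00e9,\u4e2d\u6587"

# Byte-level split on the ASCII comma (0x2C); nothing here knows about Unicode.
fields = record.encode("utf-8").split(b",")
decoded = [f.decode("utf-8") for f in fields]
print(decoded)

# The same trick is not safe for UTF-16: U+2C00 is stored as the bytes 2C 00,
# so a byte-level split on the comma cuts the character in half.
broken = "\u2c00".encode("utf-16-be").split(b",")
print(broken)
```

The UTF-8 split is safe because every byte of a multi-byte character is above 127, so none of them can be mistaken for a comma.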

Non-Unicode Encodings

Very little of the world's text is stored in Unicode encodings. Most of it is in ASCII or ISO-Latin (European) or JIS (Japanese) or one of many Chinese encoding schemes. However, since Unicode was built by combining the contents of all those schemes, all the characters in those files are in fact Unicode characters, so each of these schemes can be regarded as just another encoding of Unicode characters. And in fact, converting from any of them to real Unicode values is fairly straightforward. XML allows you to use any of these encodings, and has some tricks for communicating which one is in use.
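As a rough illustration, the sketch below uses Python's standard codecs to turn ISO-8859-1 and Shift_JIS bytes into Unicode character numbers, then parses a tiny XML document whose encoding declaration names ISO-8859-1. The sample words and the `word` element are made up for the example, and the `legacy` bytes are produced by encoding known text rather than read from real files.

```python
import xml.etree.ElementTree as ET

# Simulated legacy data: the same kind of bytes an old file would contain.
legacy = {
    "iso-8859-1": "caf\u00e9".encode("iso-8859-1"),
    "shift_jis": "\u65e5\u672c".encode("shift_jis"),
}

# Converting to Unicode is a single decode step once the encoding is known.
decoded = {name: raw.decode(name) for name, raw in legacy.items()}
for name, text in decoded.items():
    print(name, legacy[name].hex(" "), [f"U+{ord(c):04X}" for c in text])

# An XML document can say which encoding it uses in its declaration.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\xe9</word>'
root = ET.fromstring(doc)
print(root.text == "caf\u00e9")
```

The parser reads the declaration, decodes the bytes accordingly, and hands back ordinary Unicode text.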

Copyright © 1998, Tim Bray. All rights reserved.