Unicode Surrogates

Unicode characters are made up of 16 bits, which allows for 65,536 different characters. At this point, not all of them have been used, but they will be one day. There are a very large number of Japanese and Chinese characters that have not yet been included. None of these are in common use, but they include a lot of characters of serious interest to scholars and historians, so the problem won't go away.

Also, Japanese and Chinese people can (and do) invent new characters all the time.

But there are only so many characters you can store in 16 bits, so how to make the problem go away? Previous international-text standards, in particular the much-hated ISO 2022, used a trick where you would switch modes, i.e. have a magic sequence that said "shift into Korean" or "shift into Arabic"; while this was straightforward enough to render on a screen, it was pure hell for programmers. I'll skip the details to avoid boring the non-programmers, but for the geeks in the crowd, two words: pointer arithmetic.

Unicode has a clever solution to this, called the "surrogate blocks". They call the 65,536 Unicode characters the "Basic Multilingual Plane" (BMP for short), and it contains almost all of the characters most people will ever need. In the BMP, two chunks of 1,024 characters have been reserved and will never be used to represent ordinary characters. These are called the Surrogate Blocks; the first extends from #d800 to #dbff (decimal 55,296 to 56,319) inclusive, the second from #dc00 to #dfff (decimal 56,320 to 57,343). The trick is that for any new characters that get assigned outside of the BMP, they have to be made up of a 16-bit character from the first block, then another from the second. This way you get about a million (1,048,576) new characters; although to keep things simple, the Unicode people prefer to think of them as 16 new planes, each of 65,536 characters.
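For the geeks in the crowd, here is a minimal sketch of the arithmetic behind that pairing. The text above only gives the block boundaries; the offset of #10000 and the 10-bit split are the usual UTF-16 mapping, and the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for illustration.

```python
# Boundaries of the two surrogate blocks described above.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # first block
LOW_START, LOW_END = 0xDC00, 0xDFFF     # second block


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (first-block, second-block) units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000           # a 20-bit number
    high = HIGH_START + (offset >> 10)      # top 10 bits pick from block one
    low = LOW_START + (offset & 0x3FF)      # bottom 10 bits pick from block two
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a pair of 16-bit units into the code point it stands for."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError("first unit is not in the first surrogate block")
    if not LOW_START <= low <= LOW_END:
        raise ValueError("second unit is not in the second surrogate block")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)
```

Each block offers 1,024 values, so the pair covers 1,024 x 1,024 = 1,048,576 code points, which is where the "about a million" figure comes from.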

The beauty of the surrogate technique is that a program that doesn't understand it would just render such a character as two blobs on the screen. A program that understands the basic idea, but doesn't know that exact character, would render it as one blob on the screen. And a program that knows the character would be just fine.

And programmers love it, because by looking at one 16-bit quantity they can always tell whether they're looking at an ordinary character, or at the top or bottom half of a non-BMP character.
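As a rough sketch of that one-glance test, the check comes down to two range comparisons. The function name `classify_code_unit` and the labels it returns are invented here for illustration.

```python
def classify_code_unit(unit):
    """Say what a single 16-bit quantity is, without looking at its neighbours."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit quantity: %r" % (unit,))
    if 0xD800 <= unit <= 0xDBFF:
        return "first half"    # from the first surrogate block
    if 0xDC00 <= unit <= 0xDFFF:
        return "second half"   # from the second surrogate block
    return "ordinary"          # a plain BMP character


# Usage: the letter A, then the two halves of a non-BMP character.
for unit in (0x0041, 0xD83D, 0xDE00):
    print(hex(unit), classify_code_unit(unit))
```

Because the two blocks don't overlap each other or any ordinary character, no context or mode is needed, which is exactly what the mode-switching schemes lacked.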

The practical effect is that if you have a character from one of the surrogate blocks that's not part of a two-character combo, with one from the first block followed immediately by one from the second, then you have a fatal error.
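A minimal sketch of such a check, assuming the text has already been read into a list of 16-bit integers. The function name `count_characters` and the choice to signal the fatal error with `ValueError` are assumptions made for this example.

```python
def count_characters(units):
    """Walk a list of 16-bit units, counting characters.

    Raises ValueError on any surrogate that is not part of a proper pair.
    """
    count = 0
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A first-block unit must be followed at once by a second-block unit.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired surrogate %#x at position %d" % (unit, i))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A second-block unit with no first-block unit in front of it.
            raise ValueError("unpaired surrogate %#x at position %d" % (unit, i))
        else:
            i += 1
        count += 1
    return count
```

For example, `count_characters([0x41, 0xD83D, 0xDE00])` sees three 16-bit units but counts two characters, while `[0xDE00, 0xD83D]` (the pair in the wrong order) is rejected.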

By the way, the surrogates are going to be used not only for lots of Chinese-style characters, but for the characters of Egyptian hieroglyphics, Mayan, other dead languages, and Klingon.


Copyright © 1998, Tim Bray. All rights reserved.