Legal Characters

XML is thoroughly internationalized, compared to most other document formats in common use, because it signs up to support each and every Unicode character. So the problem's solved, right? Well, not really, especially if you're a mathematician or textbook publisher. It turns out that there are lots of useful characters that are in occasional use (particularly in math textbooks, almost never in your weekly report to your boss) but just aren't there in Unicode.

There are a few solutions to this problem. One would be to use SGML, which has a trick called SDATA entities that can be used to talk about any old character you might want to dream up, whether or not it actually exists. Another is Unicode's "Private Use Area", a block of 6,400 characters (#xE000 to #xF8FF, decimal 57,344 to 63,743) set aside for precisely this purpose; in your own application, you can use these characters to mean anything you want. Of course, if you want to interchange your XML documents with anyone else, you'd better have agreed in advance on what you're up to in this area.
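If you do go the Private Use Area route, it can help to know exactly where such characters turn up before a document leaves your hands. The following is only a minimal sketch in Python, using the range quoted above; the names `PUA_START`, `PUA_END`, `is_private_use`, and `find_private_use` are made up for illustration and are not part of XML, Unicode, or any library.

```python
from typing import List, Tuple

# Bounds of the Private Use Area as quoted in the text:
# hex E000 to F8FF, decimal 57,344 to 63,743 (6,400 code points).
PUA_START = 0xE000
PUA_END = 0xF8FF


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch lies in the Private Use Area."""
    return PUA_START <= ord(ch) <= PUA_END


def find_private_use(text: str) -> List[Tuple[int, str]]:
    """List (offset, code point label) for every private-use character in text."""
    return [
        (offset, "U+%04X" % ord(ch))
        for offset, ch in enumerate(text)
        if is_private_use(ch)
    ]


if __name__ == "__main__":
    # A hypothetical document fragment containing one private-use character.
    sample = "x \ue123 y"
    for offset, label in find_private_use(sample):
        print("private-use character", label, "at offset", offset)
```

A check like this only tells you that a private-use character is present; what it means still depends entirely on the agreement between you and whoever receives the document.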


Copyright © 1998, Tim Bray. All rights reserved.