String Matching

Most people think string matching should be easy, but that doesn't include people who work with Unicode (of which much more elsewhere). The problem is that in Unicode, a character such as "é" has two representations: one as the single character whose number is hexadecimal #xe9 (decimal 233), the other as the ordinary character "e" (#x65, decimal 101) followed by the combining acute accent (#x301, decimal 769). On the screen or on paper, though, there's no way to tell these apart. The Unicode Standard has all sorts of good advice as to how to deal with these situations.
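To make the two representations concrete, here is a small sketch in Python (a choice made for this illustration, not anything the XML spec prescribes). It builds both forms of "é" and shows that a plain code-point comparison treats them as different; the call to `unicodedata.normalize` with the "NFC" form is one way, assumed here, of folding the two-character sequence into the single character.

```python
import unicodedata

# The single precomposed character, hexadecimal e9 (decimal 233).
precomposed = "\u00e9"

# Ordinary "e" (hex 65) followed by the combining acute accent (hex 301).
decomposed = "e\u0301"

# Both render as "é", but they are different code-point sequences.
same_without_normalizing = (precomposed == decomposed)

# Folding the decomposed form into its composed equivalent (NFC)
# makes the two strings compare equal.
same_after_normalizing = (
    unicodedata.normalize("NFC", decomposed) == precomposed
)

print([hex(ord(c)) for c in precomposed])
print([hex(ord(c)) for c in decomposed])
print(same_without_normalizing, same_after_normalizing)
```

The point of the sketch is only that the comparison result depends on whether the processor chooses to do this extra work.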

To keep things simple, XML doesn't require a processor to try any of these combining-character tricks; that is to say, it is free to regard the single character #xe9 as different from the two-character sequence #x65, #x301. A processor is allowed to try, which might be a desirable feature in a commercial product; but if this behavior is causing problems for users, they can (note the "at user option" phrase) turn it off.
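One way to picture the "at user option" behavior is a comparison routine with a switch. The sketch below is hypothetical: the function name `names_match` and its `fold_combining` flag are invented for illustration, and the use of NFC normalization is an assumption about how a processor might implement the optional matching, not something the spec requires.

```python
import unicodedata


def names_match(a: str, b: str, fold_combining: bool = False) -> bool:
    """Compare two strings, optionally treating composed and
    decomposed forms of the same character as equal.

    fold_combining is a hypothetical user option; with it off,
    the comparison is a plain code-point-by-code-point match.
    """
    if fold_combining:
        # Assumed technique: normalize both sides to NFC first.
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b


composed = "caf\u00e9"     # ends with the single character, hex e9
sequence = "cafe\u0301"    # ends with "e" plus combining accent, hex 301

strict_result = names_match(composed, sequence)
folded_result = names_match(composed, sequence, fold_combining=True)
print(strict_result, folded_result)
```

With the option off, the processor does the minimum the text above allows; with it on, the two spellings match, and a user who finds that troublesome can switch it back off.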


Copyright © 1998, Tim Bray. All rights reserved.