Models with Character

March 9, 2005

"And this our life, exempt from public haunt, / Finds tongues in trees, books in the running brooks" -- Shakespeare

The xml-dev list is famously filled with disagreements, often in the form of long-running, occasionally contentious permathreads. A casual glance might reveal someone cursing out namespaces, debating strong versus loose coupling, or passionately arguing for changes to the core of XML. And that's just in the last three weeks. Yet one point is sacrosanct: one of the smartest design decisions underlying XML was to define it on a foundation of characters, specifically the Universal Character Set and Unicode.

As such, a working knowledge of Unicode is not optional. Practitioners of XML need to be, at a minimum, conversant in the basics of Unicode as described in the first few sections of Mike J. Brown's excellent write-up. The number and quality of such resources on the Web are impressive. As the years have passed, XML developers have become much better at understanding Unicode and related issues, though a few hold-outs remain. We all benefit from occasional reminders, so in that spirit, this week I'll review the W3C approach to internationalization, localization, and Unicode.

Meeting the Challenge

Many web reference works try to gloss over complicated details of international issues, under the theory that getting developers at least pointed in the right direction is better than nothing. True enough, though it can lead to problems down the road. Simply put, the world is an amazing place, filled with a panorama of scripts, alphabets, writing styles, conventions, rules, and exceptions. It's worth getting to know. What we need is a guide through these issues, assembled by a broad group of individuals who have researched all the angles so that we don't have to. What we need is a new W3C Recommendation: Character Model for the World Wide Web 1.0: Fundamentals, or charmod for short.

Charmod is targeted mainly at other W3C specification authors and editors. It contains conformance criteria which can be directly referenced from other specifications. Content developers and application developers who work with XML shouldn't feel left out, though. Throughout charmod, numbered conformance criteria are marked with letter codes to indicate applicability: C for content authors, I for software implementers, and S for specification developers. This can be easy to miss if you jump straight into the middle of the document instead of reading sequentially.

Someone used to only one language, or perhaps a cluster of related languages, tends to hold an intuitive grasp of what a "character" is. Section 3, "Perceptions of Characters", sets the tone for the document with a mind-expanding vignette of possibilities. For instance, Korean Hangul groups an entire syllable's worth of sounds into a rectangular block. Depending on the context, either the whole syllable or its components can be considered a character.
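The Hangul case can be made concrete with a minimal sketch in Python, assuming only the standard unicodedata module. The syllable chosen here (U+D55C) is my own illustration, not an example taken from charmod. Unicode normalization converts between the single precomposed syllable and its component jamo, so both views of "a character" are available to a program.

```python
import unicodedata

# One precomposed Hangul syllable block (U+D55C), chosen for illustration.
syllable = "\ud55c"

# NFD (canonical decomposition) splits the block into its component jamo.
jamo = unicodedata.normalize("NFD", syllable)

# NFC (canonical composition) reassembles the jamo into one block.
recomposed = unicodedata.normalize("NFC", jamo)

print(len(syllable), [f"U+{ord(c):04X}" for c in syllable])
print(len(jamo), [f"U+{ord(c):04X}" for c in jamo])
print(recomposed == syllable)
```

The same text is one code point in the first form and three in the second, yet a reader sees an identical block either way.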

There's more to consider. Characters can have a one-to-many or many-to-one relationship with aural presentation (phonemes), visual presentation (glyphs), units of storage (bytes or octets), and keyboard input. Some languages are written right-to-left, though numbers and other interleaved languages may still be left-to-right, creating interesting situations for selecting text. In summary, the word "character" gets thrown about, often carelessly. Next time that temptation arises, consider whether a more specific term would work better, and if not, at least spell out exactly what you mean.
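As a rough illustration of the mismatch between characters and units of storage, the sketch below counts the same text three ways. The function name storage_profile and both sample strings are hypothetical, picked only to show the effect.

```python
def storage_profile(text):
    """Count one string as code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # utf-16-le omits the byte order mark, so bytes / 2 = code units.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }

# 'e' followed by COMBINING ACUTE ACCENT: one perceived character.
accented = "e\u0301"
# MUSICAL SYMBOL G CLEF, a code point outside the Basic Multilingual Plane.
clef = "\U0001D11E"

print(storage_profile(accented))
print(storage_profile(clef))
```

Neither string has a single answer to "how long is it?", which is a good reason to say code point, byte, or code unit rather than "character".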

Section 4, the longest in the document, gets into the digital aspects of characters. For Unicode, discussions on xml-dev usually boil down to a choice of either UTF-8 or UTF-16, though there are others. This section mentions issues of encoding choices, with special emphasis on clearly identifying which variant is in use. The meatiest part is the "Reference Processing Model", containing broad, result-oriented instruction on how applications, specifications, and content should deal with Unicode. These rules are spelled out in considerable detail, but for anyone working in XML they should sound familiar because "All specifications which define applications of the XML 1.0 specification [XML 1.0] automatically inherit this Reference Processing Model."
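To see why identifying the encoding matters, here is a small sketch, not drawn from charmod itself, in which the same bytes are decoded once with the correct label and once with a wrong one. The sample word is arbitrary.

```python
text = "caf\u00e9"  # "café"

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")  # Python prepends a byte order mark here

# Correctly identified, both byte sequences yield the same characters.
print(as_utf8.decode("utf-8") == as_utf16.decode("utf-16"))

# Mislabeled as ISO-8859-1, the UTF-8 bytes still decode without any error,
# but the result is silently corrupted text.
mislabeled = as_utf8.decode("iso-8859-1")
print(mislabeled)
```

The wrong label raises no exception, so the damage goes unnoticed unless the encoding is stated unambiguously up front.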

Section 6 talks about "strings", another term that gets used a bit too freely, at different levels of abstraction. One nice aspect of this section is that it lists good examples of specifications defining properly high-level character strings. Another issue that requires some care is pointing within strings, which charmod refers to as indexing. Suffice it to say, there are many strategies to do this. Some, like byte offsets, are strongly discouraged. On the other hand, non-numeric techniques, like substrings or using markup, are encouraged.
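The trouble with byte offsets can be sketched in a few lines. The sample text and variable names are illustrative assumptions, and the substring lookup at the end is just one simple instance of the non-numeric style of indexing.

```python
text = "na\u00efve caf\u00e9"   # "naïve café"
target = "caf\u00e9"
data = text.encode("utf-8")

# The same position has different numbers depending on the unit counted.
char_index = text.index(target)
byte_index = data.index(target.encode("utf-8"))
print(char_index, byte_index)

# A byte offset can land in the middle of a multi-byte sequence.
try:
    data[:3].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)

# Non-numeric alternative: locate the passage by the substring itself.
before, found, after = text.partition(target)
print(repr(before), repr(found), repr(after))
```

A byte offset is only meaningful for one particular encoding, while a substring anchor survives re-encoding unchanged.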

Finally, section 7 talks about how to formally reference the Unicode spec, including whether to reference a specific version, or the generic 'latest version', or the preferred approach of doing both.

More Resources

One recently released document is an updated Working Draft with the imposing title Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content 1.0, aimed squarely at content authors. This document is great because it not only lists specific techniques, but also names names: each of Windows IE 6, Firefox 1.0, Mozilla 1.4, Opera 7.0, Navigator 7.0, Safari 1.03, and Mac IE 5.2 is graded on how well it supports each technique.

The W3C has a group focused on internationalization issues, both within the W3C and as an outreach program. This group has broad influence, reviewing essentially every W3C technical report, which is challenging work that is seldom visible to the outside world. Of course, they also produce their own documents, as we have seen. A list of other useful resources can be found at the international area of the W3C site.

When Homographs Attack

A related topic has appeared in the news lately: Internationalized Domain Name (IDN) spoofing. The Domain Name System (DNS), which translates every alphabetical domain name into a numerical IP address, can't directly cope with anything but a subset of US-ASCII. However, a clever encoding technique called Punycode allows Unicode to appear in domain names, a topic which intersects with the XML world through the specification and common practice around XML namespaces.

It's already possible to create visually confusing domain names, for instance by trading a lower-case l for a digit 1. Evil people phishing for passwords and bank account information love this. With Unicode in the picture, there are more homographs, or glyphs with nearly identical appearance depending on exact details like choice of font, and thus more opportunities to create misleading names. This is no new issue; the 2003 RFC has a "Security Considerations" section that warns against just such an attack.
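A minimal sketch of what a homograph looks like to software follows, assuming the spoof swaps a Latin a for a Cyrillic one (U+0430). The helper names differing and rough_scripts are hypothetical, and taking the first word of a character's Unicode name is only a crude stand-in for the real Script property, which Python's standard library does not expose.

```python
import unicodedata

genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

def differing(a, b):
    """List positions where two equal-length strings differ, with names."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]

def rough_scripts(label):
    """Crude heuristic: treat the first word of each character name as its script."""
    return {unicodedata.name(c).split()[0] for c in label}

print(genuine == spoofed)
print(differing(genuine, spoofed))
print(sorted(rough_scripts(spoofed)))
```

The two strings may render identically in many fonts, but they compare unequal, and a mixed-script label is one signal a client could use to raise suspicion.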

If someone registers a domain name that looks exactly like paypal.com [warning: homograph spoofed link], it's hard to view that as a browser vulnerability. (In a cruel twist of irony, browsers that are lagging in aspects of standards support, including IE, don't recognize IDNs and are currently immune to this sort of spoofing.) Domain name registrars have an inherent conflict of interest because they try to sell as many domains as possible with little motivation to check for or prevent homographic abuses. The browser (or email client) may be the last line of defense for users.

As an immediate workaround, the Mozilla family of browsers has disabled the display of Unicode in IDNs. In the troublesome PayPal spoof above, you'll see raw Punycode, like xn--pypal-4ve.com. In the future, a more nuanced UI will allow fully international domain names to be displayed, without risk of homographic confusion; perhaps a tidied-up version of what Paul Hoffman outlined.
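The raw form shown by the workaround can be reproduced with Python's built-in "idna" codec, which implements the 2003-era IDNA rules. This is a sketch that assumes the spoofed name substitutes a Cyrillic a (U+0430) for the first Latin a. It illustrates the encoding only, not how any particular browser is implemented.

```python
# Assumed spoof: Cyrillic a (U+0430) in place of the first Latin a.
spoofed = "p\u0430ypal.com"

# Each non-ASCII label is converted to Punycode and prefixed with "xn--".
ascii_form = spoofed.encode("idna")
print(ascii_form)

# Decoding recovers the original Unicode name.
print(ascii_form.decode("idna") == spoofed)

# An all-ASCII name passes through unchanged, so the raw forms differ visibly.
print("paypal.com".encode("idna"))
```

Displaying the ASCII form makes the difference between the two names obvious, at the cost of readability for legitimate international names.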

If ten birds are sitting on a power line and four decide to fly away, how many are left? Still ten; they only decided, but didn't do anything about it. What's the moral of this story? International issues affect us all, whether or not we're aware of it. With so many resources available to help with education about Unicode and related topics, it's easy to decide to read up on the subject. Here's a toast to all readers who go beyond just deciding, and take action.