Internationalizing the URI

May 7, 2003

In a recent column, "An XML Hero Reconsiders?", I suggested that enduring topics of XML developer conversation (which members of XML-DEV call "permathreads") endure because, in the final analysis, there isn't a final analysis.

That's all well and good; but, as some recent conversations in the XML developer community demonstrate, sometimes we have to get on with life, even in the face of uncertainty, ambiguity, and complexity. In this XML-Deviant column I examine two unrelated issues: the transition to IRIs and a slew of new W3C drafts.

URIs Aren't So Universal After All

As Paul Grosso said at the end of April, the progress of the XML 1.1 and Namespaces 1.1 recommendations may be slowed, if not stopped altogether, because of issues raised by the future of URIs. That is to say, because the future, in the form of IRIs, isn't here yet. The W3C's Technical Architecture Group has been unable to reach consensus on its Issue 27, which asks whether, when, and how to integrate IRIs into the core recommendations of the Web. One of the problems is that IRIs aren't finished yet, and it's notoriously tricky to rely on a formal concept or standard which, in some strict sense, doesn't yet exist. It's perfectly reasonable for the TAG and for other W3C Working Groups to point at the eventual IRI RFC and say, "do it like that". But until that RFC is finished, pointing blindly may cause more trouble than simply waiting till it is.

As you may or may not know by now, an IRI is an Internationalized (hereafter, I18N) Resource Identifier, which is like a URI, only different. According to RFC 2396 ("Uniform Resource Identifiers (URI): Generic Syntax"), a URI "is a compact string of characters for identifying an abstract or physical resource", and by "string of characters", it means "string of characters formed by a subset of US-ASCII"; or, as the latest IRI Internet Draft puts it, "a sequence of characters chosen from a limited subset of the repertoire of US-ASCII characters".
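
To make that "limited subset" concrete, here is a minimal sketch that scans a string for the first character outside the repertoire RFC 2396 allows. The function name `first_non_uri_character` is hypothetical, and the check is deliberately loose: it looks only at individual characters (the reserved and unreserved sets, plus "%" for escapes and "#" for fragments in URI references) and does not validate URI syntax or confirm that each "%" is followed by two hex digits.

```python
import string

# Character classes as given in RFC 2396, sections 2.2 and 2.3.
RESERVED = ";/?:@&=+$,"
MARK = "-_.!~*'()"
UNRESERVED = string.ascii_letters + string.digits + MARK

# "%" introduces an escape; "#" separates a fragment in a URI reference.
ALLOWED = set(RESERVED + UNRESERVED + "%#")


def first_non_uri_character(text):
    """Return the first character outside the RFC 2396 repertoire, or None.

    Illustrative only: this inspects characters one at a time and says
    nothing about whether the string is a well-formed URI.
    """
    for ch in text:
        if ch not in ALLOWED:
            return ch
    return None


# An all-ASCII reference passes; one containing U+00E9 does not.
print(first_non_uri_character("http://www.example.com/a?b=c#d"))
print(first_non_uri_character("http://www.example.com/r\u00e9sum\u00e9"))
```

Anything this function flags is exactly the kind of character an IRI is meant to admit.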

At one point in the history of the Web, "URI" stood for "Universal Resource Identifier" (it now stands for "Uniform"); but, as it turned out, URIs, no matter what the "U" stands for, aren't so universal after all. At least, URIs aren't universal if "universal" means that any person can create one using the words or at least some of the characters of her own natural language. And that limitation of the Web is a problem, particularly a social and political one. It may only be a coincidence that we in the West have shortened "World Wide Web" to just "the Web", but it does suggest that we have been laboring under a pretty characteristic blindness about people who don't care to communicate using US-ASCII. If the Web is going to be truly "world wide", everyone who has access to it (notwithstanding even more pressing issues, which we XML technologists cannot really address by ourselves) should have access in terms of their own preferred natural language.

An IRI, then, is meant as a "complement to the URI"; it is a "sequence of characters from the Universal Character Set". The IRI draft "defines a new protocol element, called IRI ... by extending the syntax of URIs to a much wider repertoire of characters. It also defines 'internationalized' versions corresponding to other constructs from [RFC2396], such as URI references."

This draft suggests that there are three necessary conditions for the widespread substitution of IRIs or IRI references for URIs or URI references.

1. In contexts (that is, in any "protocol or format element") where one might wish to substitute an IRI (or reference) for a URI (or reference), the permissibility of that substitution "should be explicitly designated". In other words, the idea is to substitute IRIs only in contexts where the permissibility of that substitution has been made explicit.
2. Any context in which one might wish to use an IRI must have a "mechanism to represent the wide range of characters ... either natively or by some protocol- or format-specific escaping mechanism".
3. In some contexts, the "encoding of US-ASCII characters should be based on UTF-8".
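
The role UTF-8 plays here can be illustrated with a rough sketch of mapping an IRI onto a plain URI: each character outside US-ASCII is encoded as UTF-8 octets, and each octet is written as a "%" escape. The function name `iri_to_uri_sketch` and the `SAFE` set are assumptions made for illustration, not the draft's own algorithm. In particular, this sketch percent-escapes every part of the string the same way, so it does not give a non-ASCII host name any special treatment.

```python
from urllib.parse import quote

# Characters left untouched: the RFC 2396 reserved and mark characters,
# plus "%" (existing escapes) and "#" (fragment separator). ASCII letters
# and digits are never escaped by quote().
SAFE = ";/?:@&=+$,-_.!~*'()%#"


def iri_to_uri_sketch(iri):
    """Percent-escape characters outside SAFE using their UTF-8 octets.

    A simplification for illustration; it is not a full implementation
    of the IRI-to-URI mapping.
    """
    return quote(iri, safe=SAFE, encoding="utf-8")


# U+00E9 is the two UTF-8 octets C3 A9, so it becomes "%C3%A9".
print(iri_to_uri_sketch("http://www.example.com/r\u00e9sum\u00e9"))
# prints http://www.example.com/r%C3%A9sum%C3%A9
```

A string that is already a plain ASCII URI comes back unchanged, which is consistent with describing the IRI as a complement to the URI and not a replacement for it.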

Whatever the eventual outcome of the IRI design and standardization process, there's no perfectly obvious strategy for dealing with the time delay. Given the relative lateness of the internationalization of URIs, one could argue with equal force that the W3C and the IETF should hurry or that they should take their time.

Rick Jelliffe suggests that the IETF in particular is playing catch up: "IETF has had a problem coming to grips with non-ASCII characters in protocols. Internationalized domain names are at least five years too late. Ultimately, IETF has to go UTF-8 throughout..."

Mike Champion voiced what may be a growing sentiment of frustration among XML developers with the W3C's seeming inability to solve these issues. Norm Walsh suggested that the problem is the expectation of users, who don't ordinarily distinguish between URIs qua addresses and URIs qua names. Users, Walsh claimed, "don't think that they're getting some resource with a funny name, they think they're going to the www.example.com address and getting a document".

The Future of XML Querying and Transforming

Various W3C Working Groups have been pressing hard on new drafts in the XSLT, XPath, and XQuery neighborhood. Last Friday, in time for XML Europe, they released 10 new draft specifications.

Norm Walsh provided a helpful reading guide for busy programmers. He suggests reading these drafts in this order:

1. XQuery 1.0 and XPath 2.0 Data Model
2. XQuery 1.0 and XPath 2.0 Functions and Operators
3. XML Path Language (XPath) 2.0
4. Then either or both of XSL Transformations (XSLT) 2.0 and XQuery 1.0: An XML Query Language

Joe English, objecting to the enshrinement of typed data via the PSVI in these data models, said that he "won't be using these technologies at all". Several contributors urged folks to send comments about these drafts to the relevant Working Groups, especially for those drafts which are in the Last Call stage. In addition, the issue of specification conformance levels was raised repeatedly, suggesting that those who object to aspects of these specifications in their final form may be able to opt in or out of them by choosing implementations that support different conformance levels.

XML developers like to complain about "permathreads", but they like to participate in them even more. And that will continue to be the case for as long as, each time these enduring issues are reraised, there's a new wrinkle or angle to consider.