Concluding, Unscientific Postscript: Web Resource Identification
January 14, 2004
In the preceding three columns I discussed the W3C Technical Architecture Group's Architecture of the World Wide Web (AWWW). In last week's column I examined in some detail the AWWW's discussion of the first of three key architectural principles, namely, resource identification. In this column I will conclude that discussion by considering the issues of URI ambiguity, opacity, and fragment identifiers.
I remain puzzled by some of the AWWW's claims about URIs. Section 2.3, URI Ambiguity, contains some of those puzzling claims. For example, the AWWW says "just as a shared vocabulary has tangible value, the ambiguous use of terms imposes a cost in communication." Well, that depends on who you ask, actually. First, and this kind of irony should surprise no one, "the ambiguous use of terms" is itself pretty ambiguous. Does it mean ambiguous term sense or ambiguous term meaning? There are many kinds of ambiguity, but let's consider briefly only two: homonymy and polysemy. Homonyms, as all school children know, are unrelated words which happen to share, as a matter of linguistic and historical contingency, the "same orthographic and phonological form" (I'm quoting here a cognitive science paper called "The Advantages and Disadvantages of Semantic Ambiguity"; email me if you're interested in the bibliographic details). Polysemous words are ones which have multiple, irreducible, and perhaps systematically related senses. Which sort of ambiguous terms does the AWWW have in mind here?
Perhaps you're thinking that it doesn't matter since ambiguous terms of every sort impose some "cost in communication". But that's not necessarily true, at least not for natural language terms and speakers. There are, in fact, some situations in which ambiguity (of some kind) is an advantage. "...there does seem to be a consensus in the literature," according to a British team of cognitive scientists, "that lexical ambiguity can produce faster lexical decision times". In other words, insofar as the AWWW is arguing by way of analogy (though it's not clear whether it is so arguing), the analogy is suspect. URI ambiguity may impose a cost, but that cost might be outweighed by advantages gained elsewhere, such as representational flexibility. Or it might be that URI ambiguity is practically unavoidable, such that the cost imposed is simply one which has to be borne, no matter what.
Further, given what else the AWWW, as well as REST, says about dereferencing URIs, it's not clear that there can even be the kind of URI ambiguity which the AWWW warns against: "URI ambiguity refers to the use of the same URI to refer to more than one distinct resource." Is that really possible? Or, to put it a bit less pointedly, under what conditions is this possible?
I suggest that there can only be URI ambiguity over time, but never at the same time. If true, this means that URI ambiguity is significantly different from natural language term ambiguity, since natural language terms can be, and often are, ambiguous in both ways. A URI (at least, an http one, anyway) cannot be ambiguous at any one time if it is canonically dereferenced: at any given time, canonically dereferencing it yields a representation of one and only one resource. At any one time a URI identifies one and only one resource.
But if at any time the owner of some URI arranges for the representation of a different resource to be retrieved as a result of canonical dereferencing, then that URI is ambiguous over time. Whether it is always and everywhere a good practice to avoid URI ambiguity over time is a question I cannot answer, given that the costs imposed thereby would seem to be finitely bounded, and it just may be the case that there is some benefit which outweighs those costs.
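The distinction between ambiguity at one time and ambiguity over time can be made concrete with a toy model. What follows is only a sketch under stated assumptions: the BindingHistory class, the resource names, and the dates are all hypothetical, and "canonical dereferencing" is reduced to a lookup in a dated list of bindings kept by the URI's owner.

```python
from bisect import bisect_right
from datetime import date


class BindingHistory:
    """Hypothetical record of which resource a URI's owner has bound it to, by date."""

    def __init__(self):
        self._dates = []      # effective dates, kept in ascending order
        self._resources = []  # resource bound from the matching date onward

    def bind(self, since, resource):
        # Assumption for this sketch: bindings are added in chronological order.
        if self._dates and since <= self._dates[-1]:
            raise ValueError("bindings must be added in chronological order")
        self._dates.append(since)
        self._resources.append(resource)

    def dereference(self, at):
        """Return the single resource identified on a given day, or None."""
        i = bisect_right(self._dates, at)
        return self._resources[i - 1] if i else None

    def ambiguous_over_time(self):
        """True if the URI has identified more than one distinct resource."""
        return len(set(self._resources)) > 1


history = BindingHistory()
history.bind(date(2003, 1, 1), "weather-report")
history.bind(date(2004, 1, 1), "stock-ticker")

print(history.dereference(date(2003, 6, 1)))  # weather-report
print(history.dereference(date(2004, 6, 1)))  # stock-ticker
print(history.ambiguous_over_time())          # True
```

On any given day the lookup returns at most one resource, so the model has no way to express ambiguity at a single time. The only ambiguity it can express is a change of binding between two dates.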
I commend to -- but will not summarize for -- you the discussion in AWWW Section 2.4, URI Schemes. But I must say a few words about AWWW Sections 2.5 and 2.6, URI Opacity and Fragment Identifiers respectively. The primary point to be made regarding URI opacity is, in truth, a very old point: some kinds of thing are such that their unique identifier is not necessarily an indication of, nor a pointer to, the nature of that thing.
Thus one must not rely upon the fact, for example, that an http URI contains some fragment which conventionally identifies a particular kind of representation format. In other words, the string ".html" at the end of an http URI does not mean that the format of the representation of the resource identified by it is necessarily HTML. It may be, or it may not be. What matters, at least for http URIs, is the Internet Media Type specified in the Content-type: header. There is simply no formal constraint whatever -- even if it's a fairly common practice -- which requires any part of an http URI to identify or even suggest the Internet Media Type of the retrieved representation.
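A short Python sketch may make the point about media types concrete. It contrasts what a client would conclude from the URI's suffix with what the Content-type header actually says. The URI and the header value are invented for illustration, and the two function names are hypothetical; only the standard-library calls (mimetypes.guess_type and email.message.Message.get_content_type) are real.

```python
import mimetypes
from email.message import Message


def media_type_from_header(content_type):
    """Parse a Content-Type header value and return its bare media type."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_type() drops parameters such as charset and lowercases the type.
    return msg.get_content_type()


def media_type_guessed_from_uri(uri):
    """What a client would conclude if it (wrongly) trusted the URI's suffix."""
    guessed, _encoding = mimetypes.guess_type(uri)
    return guessed


# Hypothetical exchange: the URI ends in ".html" but the server says otherwise.
uri = "http://example.org/reports/latest.html"
header_value = "application/pdf"

print("suffix suggests:", media_type_guessed_from_uri(uri))
print("header states:  ", media_type_from_header(header_value))
```

A client that treats the URI as opaque consults only the second function. The first is shown solely to illustrate the habit the AWWW warns against.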
While I have no real disagreement with this part of the AWWW, I think it does not go as far as it might. The opacity of URIs isn't simply a matter of decoupling identifiers from aspects of the resources they identify. The opacity virtue runs, I think, much deeper and is as much an implication of the split between resources and retrieved representations of resource-states as it is of anything else. Not only is there no way to know, by inspecting an http URI, whether its representation is one Internet Media Type or another, there's also no way to know whether its representation is statically or dynamically generated, whether it relies on computational resources local or remote to the origin server, or whether its representation is hosted on a particular operating system or HTTP server, or in a particular geographic locale.
Why is this opacity a practical, that is to say, an implementational virtue of the Web? Because it means that you are free to change all of the following details -- and legions more which I won't think of -- about a collection of web resources (that is, a web site) and break no URIs: the Internet Media Types of representations; a content-negotiation schema, including the relative weightings given to alternative representations; the implementational framework, toolkit, programming language, and strategy used to create the representations of resources; the allocation of computational resources both local and remote to the origin server, including disk layouts, CPU configurations, secondary and tertiary media storage schemes; the operating system, HTTP server, and networking stack of the origin server; the information system for which, in a particular case, the Web is serving as a proxy (so, for example, any variety of directory services).
The opacity of http URIs is a robust and, in my view, deeply significant key to the success of the Web as an information system that encompasses other information systems. I suspect that the AWWW's authors would agree with this, and I think the AWWW would be a stronger document with a more robust statement.
Fragment identifiers, which would have been better named "part identifiers", are curious beasts. As I wrote in a column ("Identity Crisis") about a very early draft of the AWWW, when it had a different name, "a URI identifies one and only one resource, unambiguously, and the (optional) fragment identifier part of an absolute URI reference identifies some part of the representation of that resource." This implies, of course, that "URIs have two namespaces: one which points to entities within the shared information space of the Web, another which points inside the representational space of the state of a web resource." This is the case because, while the semantics of a fragment identifier are context-relative, the relevant context is neither the resource nor the URI's scheme, but, rather, the Internet Media Type of the representational format.
Now this has all changed. Rather than talking about resources and parts of the representations thereof, the AWWW talks about primary and secondary resources. A URI without a fragment identifier points to a primary resource; a URI with a fragment identifier points to a secondary resource, even though what that fragment identifier means -- and, hence, what the secondary resource could possibly be or mean -- is still entirely dependent on the type of primary-resource representation retrieved by dereferencing the URI.
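The dependence of fragment semantics on the retrieved representation's media type can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a general resolver: the function locate_fragment, the helper class, and the example URI are hypothetical, and only the text/html case is handled, where the fragment is taken to name the element carrying that id. Other media types define their own rules, or none at all, which is precisely the point.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _IdFinder(HTMLParser):
    """Records the tag name of the first element whose id matches."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found_tag = None

    def handle_starttag(self, tag, attrs):
        if self.found_tag is None and dict(attrs).get("id") == self.wanted_id:
            self.found_tag = tag


def locate_fragment(uri_reference, media_type, body):
    """Split off the fragment, then interpret it according to the media type."""
    uri, fragment = urldefrag(uri_reference)
    if not fragment:
        return uri, None
    if media_type == "text/html":
        finder = _IdFinder(fragment)
        finder.feed(body)
        return uri, finder.found_tag
    # Hypothetical policy for this sketch: refuse to guess for other types.
    raise ValueError(f"no fragment semantics known here for {media_type}")


page = '<html><body><h1 id="intro">Hello</h1></body></html>'
print(locate_fragment("http://example.org/doc#intro", "text/html", page))
# ('http://example.org/doc', 'h1')
```

Note that the same string, "intro", cannot be interpreted at all until a representation has been retrieved and its media type is known. Whatever the "secondary resource" is, it is only reachable by way of the primary one.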
This seems to be a fundamental shift in how we understand and conceptualize the Web. In what sense is "secondary" being used here? Is it merely a second resource which is somehow related to the first? If so, why not three, four, or five nested resources instead of merely two? Or is "secondary" meant here in the sense of conceptual priority? Of course in some sense these questions don't matter since the Web keeps working the same after these curious changes as it did before them. But what are the implications for, say, the Semantic Web or for web services?
Unfortunately the AWWW locates the detailed discussion of fragment identifier semantics in part 3, Interaction, to which I shall finally turn my attention in next week's column.