Reviewing the Architecture of the World Wide Web
January 19, 2005
The most significant networked application development yet is the World Wide Web, which has made the personal computer a must-have item, and a web address as crucial as a phone number for a successful business. This is only the beginning; from web services to the Semantic Web, the web is changing fast. Yet no matter how fast things change, some things remain the same; this holds true for the principles of web architecture. With the publication of the "Architecture of the World Wide Web," the W3C hopes to codify these principles. These issues were covered in more depth on XML.com by Kendall Clark's seven-part series, but a brief overview is given here.
Often in the past, it has been said that some new proposal for a specification violated "principles of Web architecture," yet with no normative reference document there was significant disagreement about exactly what composed those principles. These disagreements, such as the treatment of relative URIs in XML namespaces, have led to annoyance and even, at times, broken specifications. To solve these issues, W3C embarked upon the ambitious task of creating a Technical Architecture Group (TAG) to "document and build consensus" upon "the underlying principles that should be adhered to by all Web components, whether developed inside or outside W3C," as stated in its charter. Just as the W3C finished celebrating its tenth anniversary, the first recommendation of the TAG, the Architecture of the World Wide Web (AWWW) passed through Last Call. It succeeds in summarizing web architecture in about fifty pages, and it's a blend of common-sense and surprising conclusions.
And the Rest
Before the AWWW, the primary available analysis of the web was Fielding's REST (Representational State Transfer) architectural style, which relies upon keeping communication between client and server stateless and assigning a distinct URI to each server state. REST has proved popular and powerful, but some of the web violates or disagrees with REST, as the continual REST versus SOAP debates show. Web architecture affects everyone, including ordinary users. Cookies violate REST by associating state with the pairing of a server and a client, not with the current representation returned by the URI the client is visiting. As everyone knows, chaos ensues when the "Back button" is used in cookie-dependent sites. Are cookies a bug or a feature of the web, and how could a neutral party tell without a normative document? REST highly influences the AWWW, no doubt because Fielding and others are on the TAG.
The AWWW attempts to unify diverse web technologies with a set of core design components, constraints, and good practices. In a one-sentence summary, the web is composed of a set of resources that are identified by URIs, which agents can interact with using standardized protocols, usually retrieving representations of the resource via standardized formats. In more concrete terms, if you want to learn about the resource known as the Eiffel Tower in Paris, you can access its representation using its URI "http://www.tour-eiffel.fr/" and retrieve a web page in the HTML format using the HTTP protocol. In the rest of this article, I'll walk through the AWWW, identifying key quotes and conclusions.
A URI is a bit of syntax for identifying resources. URIs serve as a single global identification system, so that "any party can share information with any other party." This leads to the first principle of the web: "Global naming leads to global network effects." If people can mention and use your data by writing its URI, linking to it, and sharing it, the value of the web itself increases.
What exactly is a resource? Tim Berners-Lee once stated that the great thing about resources is that he went on for years without having to define them. A resource is not just, as Pat Hayes put it once, an entity which emits representations like web pages. A resource is literally anything: "We do not limit the scope of what might be a resource...it is used in a general sense for whatever might be identified by a URI." To continue, "our use of the term resource is intentionally more broad. Other things, such as cars and dogs ... are resources too." This metaphysical trickery must be clearly articulated for the Semantic Web to work. The Semantic Web needs to make statements not just about web pages, but about all sorts of things in the world that aren't normally thought of as being plugged into the web. So not only is the "Official Web Page of the Eiffel Tower" a resource, but so is the Eiffel Tower itself. While a resource can have multiple URIs, a URI identifies only one resource.
There is a difference between resources that are primarily used to denote collections of web pages and those like actual salivating dogs and whole monuments in France. An information resource is a "resource which has the property that all of its essential characteristics can be conveyed in a message." These resources "can be encoded, with varying degrees of fidelity, into a sequence of bits." The TAG appears to be using Shannon's theory of information, although this is not stated or explored. Also, "essential characteristics" is not really an objective standard. However, the common-sense distinction between two types of resources does clarify differences between the hypertext web and the Semantic Web. Still, a resource is a pretty slippery concept.
A representation is defined as "data that encodes information about resource state." One resource can have multiple representations, especially due to content negotiation (which is used by less than one percent of the web). Content negotiation could serve up differing representations for differing languages and browsing devices, which is particularly useful for the always-frustrating experience of web surfing on a cellphone. Yet the following sentence confuses the situation: "Representations do not necessarily describe the resource, or portray a likeness of the resource, or represent the resource in other senses of the word `represent'." One good practice the TAG should add would state that the owner of a resource should serve representations that, within reason, make it clear what the resource is. After all, what if the resource is "the moon" and I return a picture of green cheese?
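To make the idea of one resource with several representations concrete, here is a minimal sketch in Python of language-based content negotiation. It is an illustration, not the algorithm any particular server uses: the function names, the REPRESENTATIONS table, and its contents are all hypothetical, and the parsing is simplified (for example, the "*" wildcard is not handled).

```python
def parse_accept(header):
    """Parse an Accept-style header into (value, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        value, _, params = part.partition(";")
        q = 1.0  # a missing q parameter means full preference
        for param in params.split(";"):
            name, _, raw = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(raw)
                except ValueError:
                    q = 0.0  # treat an unreadable q as "not acceptable"
        prefs.append((value.strip().lower(), q))
    # sorted() is stable, so ties keep the order the client listed them in.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_representation(accept_language, available, default):
    """Pick the key in `available` that best matches the client's preferences."""
    for lang, q in parse_accept(accept_language):
        if q <= 0:
            continue
        if lang in available:
            return lang
        primary = lang.split("-")[0]  # fall back from "fr-ca" to "fr"
        if primary in available:
            return primary
    return default


# Hypothetical representations of a single resource, keyed by language.
REPRESENTATIONS = {
    "fr": "<html lang='fr'>La tour Eiffel</html>",
    "en": "<html lang='en'>The Eiffel Tower</html>",
}

chosen = choose_representation("fr-CA, en;q=0.8", REPRESENTATIONS, "en")
print(chosen, REPRESENTATIONS[chosen])
```

The URI stays the same in every case; only the representation sent back differs, which is the distinction the AWWW is drawing.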
One long-standing issue which has been surprisingly given a resolution is that URIs without representations are errors, since "a URI owner SHOULD provide representations of the resource it identifies." Importantly for XML developers, this means that documents such as XML namespaces that use URIs should provide a namespace document for their namespace, such as a RDDL document, providing information about the namespace. Likewise, every representation should be properly typed, since "new protocols created for the Web SHOULD transmit representations as octet streams typed by Internet media types." Other principles such as URI opacity are upheld as well.
One question, then, is who determines what resource a URI identifies, and who is responsible for giving these representations? The TAG comes out on the side of the URI owner. The TAG endorses as good practice "cool URIs don't change" in terms of "URI persistence," which is "the desirable property that, once associated with a resource, a URI should continue indefinitely to refer to that resource." While this may be a somewhat idealistic picture given the fact that owning a URI is much more like renting, most people would agree that ever-changing URIs are irritating and that "A URI owner SHOULD provide representations of the identified resource consistently and predictably." Of course, this is crucial for the Semantic Web to make any sense at all, since if URIs changed their resources, then the RDF triples referring to the old URIs would simply not make any sense.
Remember that "URIs identify a single resource." If a URI refers to two or more resources, it's a URI collision. Note that this is different from two URIs dereferencing to the same representation; if one URI denotes "the weather today" and another URI denotes "the weather on Christmas," and "today is Christmas," one would hope we'd get the same representation from both URIs, at least if the same weather company is dealing with it. URI collisions lead to an identity crisis, especially with the Semantic Web, when one uses a URI, such as my homepage, to represent both myself the person and the web page I created. I was created (born) on a different date than my web page. The AWWW rules that different resources should have distinct URIs. I should have one URI for myself and another for my web page, even if they return the same representation.
This situation becomes a bit murkier when the AWWW turns around and states that "Indirect identification" is just fine. They state that "to say that the URI 'mailto:firstname.lastname@example.org' identifies both an Internet mailbox and Nadia, the person, introduces a URI collision." Then they go on to state, "suppose that email@example.com is Nadia's email address. The organizers of a conference Nadia attends might use 'mailto:firstname.lastname@example.org' to refer indirectly to her....this does not introduce a URI collision." This seems contradictory, and somehow "local policy" allows this to be done. The inverse problem, URI aliases, is frowned upon, since "a URI owner SHOULD NOT associate arbitrarily different URIs with the same resource."
A fragment identifier is normally used to identify a portion of a web page. It has been overloaded in its use on the Semantic Web in order to solve the URI collision and resource identity problem; www.ibiblio.org/hhalpin# is me with all my essential characteristics included, while www.ibiblio.org/hhalpin is just my web page. The AWWW states correctly that the fragment identifier's semantics should be dealt with by the client, not the server: "The fragment's format and resolution are therefore dependent on the type of a potentially retrieved representation, even though such a retrieval is only performed if the URI is dereferenced." Yet it lets the Semantic Web overload through the back door, since "if no such representation exists, then the semantics of the fragment are considered unknown and, effectively, unconstrained." This story is mapped into the world of resources by defining "primary" and "secondary" resources, although the story gets messy quickly.
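The client-side handling of fragments can be seen with Python's standard urllib.parse.urldefrag, which separates a URI into the part a client would request and the fragment it keeps for itself. This is only a sketch: the helper names are made up for illustration, and the "#me" fragment is a hypothetical stand-in for a person URI, not one taken from the AWWW.

```python
from urllib.parse import urldefrag


def split_for_request(uri):
    """Return (uri_to_request, fragment) as a client would split them."""
    base, fragment = urldefrag(uri)
    return base, fragment


def same_request(uri_a, uri_b):
    """True if both URIs lead the client to request the same thing."""
    return split_for_request(uri_a)[0] == split_for_request(uri_b)[0]


page = "http://www.ibiblio.org/hhalpin"
person = "http://www.ibiblio.org/hhalpin#me"  # hypothetical fragment

print(split_for_request(person))
print(same_request(page, person))
```

Because the fragment is stripped before the request is made, the two URIs differ as identifiers while leading to the same retrieval, which is what makes the fragment available for overloading in the first place.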
Exiting the tricky world of resource and representations, a number of pragmatic decisions are made that clear up long-standing issues in the XML and wider web community. One original motivating principle behind XML is upheld: "a specification SHOULD allow authors to separate content from both presentation and interaction concerns." Second, this principle of allowing components to be separable is generalized to the entire web, since "identification, interaction, and representation are orthogonal concepts, meaning that technologies used for identification, interaction, and representation may evolve independently."
The keys to evolution are orthogonality and extensibility. The AWWW mandates that, if at all possible, specifications should themselves be orthogonal to allow easy decoupling from each other. The AWWW defines extensibility as "the property of a technology that promotes evolution without sacrificing interoperability". Extensibility can be a cheap excuse to actually violate standards, so "extensibility must not interfere with conformance to the original specification". The user should be able to detect non-compliance with web standards, and agents have a moral duty in this regard: "Agents that recover from error by making a choice without the user's consent are not acting on the user's behalf." I myself may not be too worried if my browser renders ill-formed HTML, yet it would be nice to know. Lastly, specifications should state how agents are to deal with extensions they do not know, either by failing or by ignoring them. The AWWW document also points out that, in the case of XML, causing agents to fail on incorrect data forces the web community to use standards correctly.
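The two ways of handling unknown extensions, and XML's fail-on-error behavior, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the KNOWN_ELEMENTS vocabulary, the policy names, and the function names are hypothetical and do not come from the AWWW. Only xml.etree.ElementTree is a real standard-library module.

```python
import xml.etree.ElementTree as ET

KNOWN_ELEMENTS = {"title", "author"}  # hypothetical vocabulary


class UnknownExtensionError(ValueError):
    """Raised under the "fail" policy when an unknown element appears."""


def process(fields, policy="ignore"):
    """Split fields into known ones and unknown extensions.

    Under "ignore" the unknown names are skipped but still reported, so the
    agent can tell the user rather than recovering silently. Under "fail"
    any unknown name is an error.
    """
    if policy not in ("ignore", "fail"):
        raise ValueError("policy must be 'ignore' or 'fail'")
    known = {name: value for name, value in fields.items() if name in KNOWN_ELEMENTS}
    unknown = sorted(name for name in fields if name not in KNOWN_ELEMENTS)
    if unknown and policy == "fail":
        raise UnknownExtensionError("unknown extensions: " + ", ".join(unknown))
    return known, unknown


def check_well_formed(text):
    """Return None if text is well-formed XML, else the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


print(process({"title": "AWWW", "rating": "5"}))
print(check_well_formed("<a><b></a>"))
```

Returning the list of ignored names alongside the result is one way to keep the user informed, in the spirit of the AWWW's remark about agents acting without the user's consent.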
Extensibility is the "X" in XML and the AWWW deals with XML-specific concerns. The most surprising results regard the feature of XML everyone loves to hate, namespaces. First, namespaces qualify everything: "A specification that establishes an XML vocabulary SHOULD place all element names and global attribute names in a namespace." For extending and combining XML documents, namespaces can be absolutely crucial. The AWWW mandates that version information should be included in every format. In order to let namespaces be extensible, it also goes into depth about how XML namespaces should include change policies that allow a namespace to "evolve gracefully" while minimizing impact on existing software.
The AWWW also admits that XML is not perfect. XML is just a syntax for hierarchically structured textual data. It has no formal semantics, no approved ID semantics, and no universal way of specifying how namespaces interact within a document. The AWWW does clear up the thorn of QName usage: "Do not allow both QNames and URIs in attribute values or element content where they are indistinguishable." Where QNames are used as identifiers, they must provide a mapping to URIs.
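As a rough illustration of what a QName-to-URI mapping can look like, the Python sketch below expands a prefixed name by concatenating the namespace URI and the local name. That concatenation rule is an assumption (it is the convention RDF uses, not something this article or XML itself mandates), and the qname_to_uri helper and the sample document are hypothetical. The Dublin Core namespace is used only as a familiar example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # example namespace URI


def qname_to_uri(qname, prefixes):
    """Map a prefixed QName such as "dc:title" to a URI by concatenation."""
    prefix, sep, local = qname.partition(":")
    if not sep or not local:
        raise ValueError("not a prefixed QName: %r" % qname)
    if prefix not in prefixes:
        raise ValueError("undeclared prefix: %r" % prefix)
    return prefixes[prefix] + local


# A parser already resolves element QNames against in-scope declarations.
DOC = '<doc xmlns:dc="%s"><dc:title>Eiffel Tower</dc:title></doc>' % DC
root = ET.fromstring(DOC)
title = root[0]

print(title.tag)                              # ElementTree's {uri}local form
print(qname_to_uri("dc:title", {"dc": DC}))   # concatenated URI form
```

A QName appearing in an attribute value or in element content gets no such automatic treatment from the parser, which is why the AWWW warns against letting QNames and URIs appear where they cannot be told apart.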
Web Architecture Evolves
The AWWW is a tremendously ambitious document, and the authors should be applauded for creating a readable normative document that summarizes the web and takes a hard and principled stance on several outstanding issues. The AWWW is a "philosophy of the web" that avoids philosophical terminology, and this allows it to avoid many long-standing problems about the nature of representation. It is clear from several not-so-subtle moves, such as its references to OWL and its emphasis on URI persistence, that the AWWW is paving the way for the Architecture of the Semantic Web. Yet it does not lose sight of the original power of the web, and wisdom such as "a specification SHOULD allow Web-wide linking" is preserved. Depending on URIs and an underspecified idea of what a resource is has gotten the web quite far, so it is to all our benefit that these notions are articulated. Every person involved in the web should read this document, and its contributions to web architecture will doubtless further the development of the web. Remember, this is only Volume One of the Architecture of the Web. Volume Two, and the future of the web, is unwritten.