TAG: Fragment Identifiers, Subsets, and Metadata
April 16, 2003
Periodically, it falls to me to investigate the state of the W3C's Technical Architecture Group (TAG). For a geeky journalist, or for anyone who cares about the infrastructure of the Web broadly conceived, watching the TAG can be an incredibly efficient use of one's time. Some of the most engaging, vital technical issues regularly fly over the TAG's transom -- often in volumes which, or so I have suggested in the past, threaten to swamp TAG members. In short, if you want to take the technical pulse of the Web, surveying the lines and directions of its future development, watching the TAG at work is ideal.
In what follows, I review the TAG's issues list, discussing the new and especially noteworthy additions.
By my rough reckoning, the TAG has accepted 10 new issues since the last time I wrote about it. The issues range over XML and other web infrastructural concerns, but cluster around XML.
What do fragment identifiers identify? HTTP URIs, for example, can identify things in (at least) two distinct namespaces: first, the global namespace of resources (for example, http://monkeyfist.com/); second, the local namespace of the specific resource. But what are the constraints, if any, on a resource's local namespace, as named by a fragment identifier? In the semantics of HTTP and HTML, fragment identifiers name parts of an HTML document. But what about generally? As the issue is formulated in the TAG's issues list, "Do fragment identifiers refer to a syntactice [sic] element (at least for XML content), or can they refer to abstractions?"
The trade-offs seem to be arranged in this way: as XML is used more and more to represent resources that don't look very much like documents (SVG images and protocol exchanges, for example), it increases the representational complexity and power of XML to allow fragment identifiers to name domain-specific abstractions, but at the cost of overall complexity. Forbidding fragment identifiers from naming domain-specific abstractions increases overall simplicity at the cost of representational power.
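The mechanics of the global/local split are easy to see in code. A minimal sketch using Python's standard library (the `#section-2` fragment is a hypothetical example, not from the TAG's discussion):

```python
from urllib.parse import urldefrag

# Split a URI into the resource URI (the global namespace of resources)
# and the fragment identifier (a name in that resource's local namespace).
# The fragment "#section-2" is purely illustrative.
uri, fragment = urldefrag("http://monkeyfist.com/page#section-2")

print(uri)       # the resource itself: http://monkeyfist.com/page
print(fragment)  # the local name: section-2
```

Note that the fragment never travels to the server in an HTTP request; its interpretation is left entirely to the client and the media type of the retrieved representation, which is exactly why the question of what it may name remains open.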
In accepting this issue, the TAG identified several specific areas upon which it impinges: WSDL component designators, the identity of resources and URIs (a big RDF and Semantic Web issue, which I discussed in Identity Crisis), XInclude, SVG.
This issue has not yet been assigned to a TAG member for resolution.
As I suggested recently (in " The Pace of Innovation"), ubiquity of use is a key indicator of technology success. XML enjoys a very wide range of use. The disadvantage of ubiquity, of course, is the tension and stress caused precisely by a wide, and not necessarily coherent, range of use cases. In the case of XML, this tension is compounded by one of XML's original design goals -- "to have as few ["ideally zero", as the XML 1.0 specification says] optional features as possible", as Paul Grosso pointed out recently.
These tensions and stressors could be alleviated in one of two ways: by defining a single subset once and for all, or by developing a rational, standardized way of creating XML subsets according to domain-specific constraints and requirements. In fact, there is a certain measure of a posteriori standardization here, given that many XML vocabularies -- SOAP being only the most prominent -- already rely on a subset of XML proper.
After discussing this issue over the course of several TAG meetings, the TAG has reached a recommended resolution. The resolution begins by mentioning all the reasons why profiles are a bad idea:
Profiling XML, providing more implementation options, will necessarily increase the possibility of interoperability problems and it would be best to avoid doing so. Profiles are a bad idea on general principles and are in direct conflict with one of the original goals of XML: "the number of optional features in XML is to be kept to the absolute minimum, ideally zero."
Such strong, unambiguous language should be a tip-off to the careful reader that a profile is exactly what will, however reluctantly, be suggested:
Unfortunately, a number of user communities have expressed a need to work with only a subset of XML. The TAG is concerned that if these needs are not addressed quickly (and centrally), a number of slightly different XML subsets will arise and if this trend continues, the stability of XML as the basis of a whole range of technologies could be jeopardized.
Rather than create a mechanism whereby subsets of XML can be created in an ad hoc fashion, the TAG suggests creating one subset of XML for use in many contexts -- call it the utility infielder of the XML set. As the TAG's resolution put it,
However, precisely how the subset is defined requires careful consideration as this is an exercise that should be conducted only once. The subset selected must be small enough so that no further subset will be required but also complete enough to be useful for a wide range of applications (emphasis added).
If you've been reading the XML-Deviant column since it was Leigh Dodds's baby, you know that subsetting XML is one of those perennial XML-DEV permathreads that never seem to die. In fact, subsetting XML is a permathread that has actually led to useful work being done in a variety of ways. With this resolution of the XML Profiling issue, the long story of subsetting has entered a new, perhaps definitive chapter.
Robin Berjon asked the TAG to clarify its views on the issue of creating a binary version of XML (more to the point, a binary infoset or PSVI). Thus far TAG member Chris Lilley has written up a summary of the issues surrounding binary XML. I refer eager, curious readers to Lilley's summary. If I were forced to speculate on the outcome of this issue, I would adopt Dan Connolly's position: for exchanges between maximally trusted, internally- or institutionally-related parties, the TAG takes no position on whether a binary XML should be used. However, for exchanges in every other situation, gzip-compressed XML is the best choice.
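Connolly's gzip alternative is worth demonstrating, since it requires no new format at all. A minimal sketch with Python's standard library (the document contents are illustrative):

```python
import gzip

# A small, illustrative XML document, as bytes.
doc = b'<?xml version="1.0"?><order><item sku="x-42" qty="3"/></order>'

# Compress for transfer, then restore on the receiving end.
compressed = gzip.compress(doc)
restored = gzip.decompress(compressed)

assert restored == doc  # the round trip is lossless
```

The point of the position is that generic compression recovers most of the wire-size benefit of a binary encoding while keeping the exchanged format plain, textual XML that any existing parser can consume.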
Should there be a standardized way to encode metadata, particularly resource version information, in URIs, and should there be a standard URI for the resource that contains (or is) the metadata of another resource? The TAG has not resolved this issue formally. Its preliminary finding is that there should not be such standards.
While I think that the W3C's habit of encoding resource versioning information in resource URIs is a candidate for best practice in URI design, it's hard to see what could be gained by standardizing such a practice, just as it's not entirely clear that this very thin version information really counts as metadata, at least not metadata in any rich or highly ramified way. The second part of this issue, whether there should be a standard way of discovering the metadata resource of a resource for which you know the URI, is related conceptually to the next issue, the one about embedding RDF in XHTML.
Whatever the Semantic Web turns out to be (or fails to be), there is going to be a period of transition between the present Web and one in which there is a lot more machine-readable information, mainly RDF. There is, then, a need for a generalized way, perhaps more than one, of relating machine-readable resources with regular, human-readable ones.
For example, if I want to create an RDF vocabulary for describing the sources, notes, and research that go into writing XML-Deviant columns (so that my readers can follow the writerly trail for themselves), it would be useful if I had a way for your machine to discover that metadata about an XML-Deviant column in a predictable, routine way. One way of creating this association that is already being deployed is to embed RDF into (X)HTML comments, a practice I analyzed critically in a recent XML.com article, "Creative Comments: On the Uses and Abuses of Markup".
There are other, less problematic, existing ways of making this association, including <link>; but this really is the sort of issue that the TAG should consider, especially since it is likely that among the ways of making associations between resources and machine-readable RDF is to embed the RDF within representations of resources. I don't favor the embedding strategy as the sole means of creating these associations, but it would be a useful option to have from among a plurality of others. The trick with embedding is to preserve the ability to validate the resulting XHTML, so the issues are not trivial.
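A linking strategy of this sort might look like the following sketch. The rel="meta" convention and the metadata filename are illustrative assumptions, not anything the TAG has endorsed:

```html
<!-- Hypothetical sketch: an XHTML document pointing at an external
     RDF description of itself from its head. The rel value and the
     href are illustrative only. -->
<head>
  <title>An XML-Deviant Column</title>
  <link rel="meta" type="application/rdf+xml"
        href="column-metadata.rdf" />
</head>
```

Because the RDF lives in a separate resource, the XHTML document remains straightforwardly validatable, which is precisely the property that the comment-embedding practice puts at risk.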
In my view, the low-hanging fruit in this area is to standardize on a (preferably lo-tech) linking strategy of some sort, which buys time for the harder work to be done on an embedding strategy. While there has been no public resolution of this issue so far, and it's only been discussed at one TAG meeting, Dan Connolly has been assigned to work on it.
The last new TAG issue I want to address concerns the jumble of different ways to establish site-wide metadata, including conventional URIs like http://a.site.org/robots.txt, http://a.site.org/w3c/p3p, and http://a.site.org/favicon.png.
One might also add to this mix the panoply of site-wide metadata uses to which RSS files are now routinely being put. In Tim Berners-Lee's formulation, the task is to
find a solution, or put in place steps for a solution to be found, which allows the metadata about a site, including that for later applications, to be found with the minimum overhead and no use of reserved URIs within the server space.
Tim Bray suggested a straw man proposal, which not only proposes a new HTTP header, but also takes a step toward formalizing web sites as a kind of distinct entity. As Tim says,
let's introduce a formal notion of a "Web Site", which is a collection of Resources, each identified by URI. A resource can be in more than one site -- not an obvious choice, but it seems it would be hard to enforce a rule to the contrary.
Since a Web Site is an interesting and important thing, it ought to be a resource and ought to have a URI. There is no point trying to write any rules about whether all the resources on a site ought to be on the same host or whether the site's URI should look like those of the resources.
The second part of the proposal is to add a new HTTP header ("Site", for example), which an origin server could optionally include in response to an HTTP HEAD request. The content of that header would (presumably) be a URI to a machine-readable resource, most probably represented in RDF, that unifies all the site-wide metadata about the site: robot exclusion rules, graphical wingdings, RSS URIs, privacy and self-ratings information, other legal or statutory requirements, alternative representations of the site, and so on. It's a clever idea and one which has arisen in response to evolving, existing practice, rather than being imposed from above.
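In wire terms, the straw-man proposal might play out like this sketch. Everything here is hypothetical: the "Site" header exists only in Bray's proposal, and the metadata URI is invented for illustration:

```http
HEAD /some/page HTTP/1.1
Host: a.site.org

HTTP/1.1 200 OK
Content-Type: text/html
Site: http://a.site.org/site-metadata.rdf
```

A client that cared about site-wide metadata would follow the Site URI once, cache the RDF it found there, and thereafter avoid probing a grab bag of reserved paths like /robots.txt and /favicon.png.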
I have in the past thought that the TAG was overwhelmed by its role as a kind of court of last appeal for the architecture of the Web. My dominant reaction to this latest round of catching up with the TAG is different. Looking at the kinds of issues that are being raised, accepted, and addressed, I think it's clear that the Web is maturing nicely and that there are discernible paths, both of thought and action, for how it will reinvent itself in the future. We live in interesting times.