Creative Comments: On the Uses and Abuses of Markup

January 15, 2003

Whether you think of the Semantic Web as a new and exciting promise or as a fantastic and impractical threat, it will not be a separate web; rather, it will overlay the existing one. The Semantic Web isn't a replacement; it's a supplement. Both the existing web (the "Human Web") and the Semantic Web (for my purposes here, the "Machine Web") will inhabit the same conceptual space and share considerable infrastructure, including Unicode, URIs, HTTP, and XML.

How, then, do we distinguish the Human from the Machine Web? The easiest way is by distinguishing the identity and nature of each web's dominant agent. The dominant agent of the Human Web is the natural person. The Human Web is made for humans; the information and knowledge it contains are intended for human consumption. The Human Web's primary language, HTML, is best suited to presenting information to human agents.

Of course there are some machine agents at play in the fields of the Human Web, but they are massively outnumbered, don't know the game well, and their play is generally hampered by the strange environs. As every programmer who's tried to screen-scrape someone else's web site knows, HTML isn't a very good way to express information to a machine. When screen scraping works, it does so only because of considerable, daily care and feeding.

The Machine Web's dominant agent is a computer process, a machine. The information contained in the Machine Web is intended for machine consumption. RDF, the Machine Web's primary language, at least for now, is best suited to describing information for machine agents.

Thus far, I've made no new claims, having merely laid out the conventional picture. In the remainder of this article I want to draw your attention to the transitional period -- the period during which the Machine and Human Webs will begin to inhabit the same conceptual space and technical infrastructure. We are now living in the early days of this transitional period and there are some issues specific to it which may be worth considering.

Machine Content and Human Comments

The issue I want to raise here is the increasingly widespread practice of embedding information -- mainly using, but not limited to, RDF -- intended for machine consumption in a format, HTML comments, which is intended for human consumption.

When I realized people were embedding RDF in HTML comments, claiming that the resulting document is part of the Semantic Web, I was confused. Surely, I wondered, they know that putting RDF into HTML comments is an inelegant way of relating human and machine-consumable resources? Creative Commons, which has taken on the laudable task of creating RDF descriptions of common licensing terms for intellectual property, suggests its users associate machine-consumable licensing terms such as this:


<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<License rdf:about="http://creativecommons.org/licenses/by-nc-sa/1.0">
   <requires rdf:resource="http://web.resource.org/cc/Attribution" />
   <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
   <permits rdf:resource="http://web.resource.org/cc/Distribution" />
   <permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" />
   <requires rdf:resource="http://web.resource.org/cc/ShareAlike" />
   <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
   <requires rdf:resource="http://web.resource.org/cc/Notice" />
</License>
</rdf:RDF>

with the web resources to which they apply by embedding RDF directly in HTML comments, like this:


<!-- <rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<License rdf:about="http://creativecommons.org/licenses/by-nc-sa/1.0">
   <requires rdf:resource="http://web.resource.org/cc/Attribution" />
   <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
   <permits rdf:resource="http://web.resource.org/cc/Distribution" />
   <permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" />
   <requires rdf:resource="http://web.resource.org/cc/ShareAlike" />
   <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
   <requires rdf:resource="http://web.resource.org/cc/Notice" />
</License>
</rdf:RDF> -->

(For what it's worth, Movable Type's TrackBack system also works by embedding RDF descriptions of web resources into (X)HTML comments; most of what I say about the Creative Commons case applies to TrackBack, too.)

From a conceptual point of view, setting aside the exigencies of the actual world for the moment, this is not a sound or even coherent strategy. The point of describing the licensing terms of a web resource in RDF is to enable a machine to consume those licensing terms and, based on choices a programmer has already made, take appropriate action with regard to that web resource. In other words, licensing terms constitute a constraint on what a machine may legally do with the resource they address; for example, to distribute copies of the resource or to refrain from distributing copies. The point of HTML comments is to allow humans to include in a resource information which is intended solely for human consumption. In short, markup language comments are for communicating with humans, not with machines. The problem with incoherent strategies is that it's not always possible to predict all the ways in which they will fail or go bad.

From a practical standpoint, embedding RDF in XML or (X)HTML comments works, but only in a limited range of contexts and conditions. Consider what you have to do, in the general case, to consume Creative Commons RDF licensing terms in an XHTML comment of a web resource. You have to decide whether to consume the XHTML web resource as XHTML -- in other words, to pass it to an XML parser and then to interact with it by means of some API -- or as an opaque string of characters. If you've decided to treat XHTML as XHTML, your XML processing framework has to preserve XML comments and then make them available programmatically. Some XML parsing frameworks do not preserve comments, and it's hard to see how they can be said to be doing the wrong thing by not preserving them; if yours is one of them, you don't have a choice: you must treat the resource as a string of characters. And even when comments are preserved, you still have to sort through all the comments contained in the parsed representation of an XML resource, trying to figure out whether any of them contains something that looks like or is RDF, which you can do either by using a regular expression on the contents of each comment or by trying to parse the contents of each comment as XML.
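To make the parser route concrete, here is a minimal sketch in Python. It assumes Python 3.8 or later, where the standard library's ElementTree can be asked to keep comments (it drops them by default), and it assumes the page really is well-formed XHTML with no undeclared entities. The function name rdf_from_comments is hypothetical, invented for this illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_from_comments(xhtml_text):
    """Return parsed rdf:RDF elements found inside comments of an XHTML page.

    Hypothetical helper: assumes well-formed XHTML and Python 3.8+.
    """
    # ElementTree discards comments unless the tree builder is told to keep them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xhtml_text, parser=parser)

    found = []
    for node in root.iter():
        # Comment nodes carry the ET.Comment factory as their tag.
        if node.tag is not ET.Comment:
            continue
        try:
            # Try to parse the comment body as XML in its own right.
            candidate = ET.fromstring((node.text or "").strip())
        except ET.ParseError:
            continue  # an ordinary, human-readable comment
        if candidate.tag == f"{{{RDF_NS}}}RDF":
            found.append(candidate)
    return found
```

Even in this small form, the work splits into two parses: one for the page, and a second, speculative one for every comment in it.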

If, on the other hand, you've decided to treat an XHTML resource as an opaque string of characters -- refusing to take advantage of all the value offered by XHTML resources in the first place -- you're stuck with the task of using, say, regular expressions to comb through the string, looking for bits of text which look like a particular kind of RDF -- a brittle operation at best. Once you've identified some bits of text which may be RDF, and which may be RDF descriptions of licensing terms, you still have to consume them, either by writing an ad hoc parser or by parsing them as RDF.
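The string route can be sketched the same way, again in Python with only the standard library. The pattern below assumes that the RDF sits alone in a comment and uses the literal rdf: prefix, which is exactly the kind of assumption that makes the approach brittle. The names COMMENTED_RDF, rdf_blocks_from_string, and license_terms are hypothetical, and license_terms is closer to the "ad hoc parser" option than to real RDF parsing: it only reads the permits, requires, and prohibits children of a License element.

```python
import re
import xml.etree.ElementTree as ET

CC_NS = "http://web.resource.org/cc/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumption: the RDF is the only thing in the comment and uses the "rdf:" prefix.
COMMENTED_RDF = re.compile(r"<!--\s*(<rdf:RDF\b.*?</rdf:RDF>)\s*-->", re.DOTALL)


def rdf_blocks_from_string(page_text):
    """Find comment-wrapped RDF in a page treated as an opaque string."""
    blocks = []
    for match in COMMENTED_RDF.finditer(page_text):
        try:
            blocks.append(ET.fromstring(match.group(1)))
        except ET.ParseError:
            continue  # looked like RDF, but is not well-formed XML
    return blocks


def license_terms(rdf_root):
    """Collect the resource URIs a License permits, requires, and prohibits."""
    terms = {"permits": set(), "requires": set(), "prohibits": set()}
    for license_el in rdf_root.iter(f"{{{CC_NS}}}License"):
        for kind, resources in terms.items():
            for el in license_el.findall(f"{{{CC_NS}}}{kind}"):
                resource = el.get(f"{{{RDF_NS}}}resource")
                if resource:
                    resources.add(resource)
    return terms
```

A program could then check, for instance, whether http://web.resource.org/cc/CommercialUse appears in the prohibits set before redistributing a page. A change as small as a different namespace prefix in the embedded RDF would make the pattern miss it entirely.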

The sole advantage of embedding RDF into markup language comments is that it's simple. It doesn't require the person doing the embedding to understand much about the web beyond cut-and-paste. That is a real advantage, but it's not clear how much it's worth, especially when there are alternatives. The main alternatives to embedding RDF in (X)HTML comments -- and I don't see any good reasons to think these alternatives cannot coexist -- are to turn the machine-consumable licensing terms into a first-class web resource or to put them into (or associate them with) an RSS file. I am agnostic as to which of these solutions is best, in large part because "best" is context-dependent and interest-relative. What is best in this case depends almost entirely on what you need to do and where you need to do it.

RSS and Linking

The RSS 1.0 and 2.0 communities have both managed, in their own distinct ways, to accommodate the Creative Commons project. There's an RSS 1.0 Creative Commons module, mod_cc, and there's also an RSS 2.0 creativeCommons RSS Module. Despite the various divergences and differences of opinion between these two communities, their Creative Commons solutions are similar. In each case, the approach is to associate the license terms of a web resource with an RSS file, which is itself a machine-consumable, alternate version of a web resource or a collection of resources. What's shaking out in the transitional period, during which the Human and Machine Webs are learning to cohabitate, is that RSS, of whatever variety, is becoming, as a matter of convention and social agreement, the place to put machine-consumable metadata about a resource or a collection of resources (i.e., a "site").

The key aspect of RSS's success is convention and social agreement. The part of this story which is yet to be told is whether there will be any widespread convention and social agreement about the third way of dealing with RDF and (X)HTML. In this third approach, the various permutations of RDF licensing terms become first-class web resources of their own, which means giving them a URI (as, for example, Creative Commons has done). Once the licensing terms you prefer -- which may be a mixture of Creative Commons RDF vocabularies and other RDF predicates and terms; there is no reason you cannot include, say, Dublin Core predicates and terms in your licensing resource -- are web resources, you associate them with the web resources you wish to license by linking to them. One way to do that, and the way RSS communities have used to foster automatic discovery of RSS resources, is by placing a link element inside the head of the resource in question. For example,


<link rel="license-terms" href="/License.rdf" type="application/rdf+xml" />

The content of the rel attribute is, in my view, the conventional yet crucial bit. If widespread convention and social agreement evolve about the content of the rel attribute, then people will be able to program machines to look for link children in head elements that have the conventional rel-attribute value, with some justified confidence that the resource which the link points to is one which contains a machine-consumable description of licensing terms for the resource in question. That's not ideal in every circumstance, but it's sane, elegant, and clean enough to become a viable alternative. The problem, so far, is that, unlike RSS, the link solution has no natural constituency or community to push it, which means that the requisite convention, based on social agreement, has been slow to coalesce.
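For comparison, the link-based lookup described above might be sketched as follows, using Python's standard HTML parser. The rel value "license-terms" is simply the one used in the example link element, not an established convention, and the class name LicenseLinkFinder, the function find_license_links, and the example.org URL are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LicenseLinkFinder(HTMLParser):
    """Collect hrefs of <link> elements in <head> carrying a given rel value."""

    def __init__(self, base_url, rel="license-terms"):
        super().__init__()
        self.base_url = base_url
        self.rel = rel  # assumed convention; no such value is standardized here
        self.in_head = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attributes = dict(attrs)
            # rel may hold several space-separated values.
            rels = (attributes.get("rel") or "").lower().split()
            if self.rel in rels and attributes.get("href"):
                # Resolve relative hrefs against the page's own URI.
                self.links.append(urljoin(self.base_url, attributes["href"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_license_links(page_text, base_url, rel="license-terms"):
    """Return absolute URIs of licensing resources linked from the page head."""
    finder = LicenseLinkFinder(base_url, rel)
    finder.feed(page_text)
    return finder.links
```

The machine never has to read a comment: it follows the link and parses whatever RDF it finds at the other end, as a resource in its own right.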

Transitional periods are exciting, interesting times. But they are also dangerous because it's never quite clear which temporary, transitional solutions -- ones which everyone agrees are ugly, inelegant hacks -- are going to outlast the transition. The history of technology is full of examples of transitional strategies, which wouldn't die and couldn't be or simply weren't killed, turning into the very problems which the next big solution is designed to solve. Among the dangers of the present transitional period, as we move from Human Web to a Human Web with a Machine supplement, embedding machine-consumable information within human-consumable comments may well turn out to be one we end up living with for far longer than anyone intended or imagined.