Language Instincts

September 17, 2003

Back in April I made the case for writing weblog entries in XHTML, using CSS for a dual purpose: to control presentation and as hooks for structured search. I then started to accumulate well-formed content, writing CSS class attributes with an eye toward data mining, and flowing XHTML content through my RSS feed. Here's a recap of the basic elements of the plan sketched out in my June column:

(X)HTML source	<span class="minireview">SpiderPhone</span> I'm always...
CSS directives	.minireview { font-weight: bold } .minireview:before { content: "MINI-REVIEW: " }
Rendering	MINI-REVIEW: SpiderPhone (in bold) I'm always...
XPath query	Find items containing minireviews: //*[@class = "minireview"]/ancestor::item/link
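
To make the query row concrete, here is a minimal sketch of the same search in Python, assuming a tiny hand-made feed (the URLs and the second item are hypothetical). The standard library's ElementTree understands the [@class = "..."] predicate but not the ancestor axis, so the sketch visits each item and tests it for a matching descendant instead. It illustrates the idea and is not the search kit described later in this column.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; only the first item contains a minireview.
FEED = """<rss version="2.0">
  <channel>
    <title>Sample weblog</title>
    <item>
      <link>http://example.com/1</link>
      <description><p><span class="minireview">SpiderPhone</span> I'm always...</p></description>
    </item>
    <item>
      <link>http://example.com/2</link>
      <description><p>No review here.</p></description>
    </item>
  </channel>
</rss>"""


def links_with_class(feed_xml, class_name):
    """Return the <link> text of every item containing an element of the given class."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        # ElementTree has no ancestor axis, so check each item
        # for a descendant whose class attribute matches exactly.
        if item.find(f".//*[@class='{class_name}']") is not None:
            links.append(item.findtext("link"))
    return links


print(links_with_class(FEED, "minireview"))
```

Like the XPath expression, this matches the class attribute exactly, so an element carrying several space-separated class names would be missed.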

The backstory is as follows. I'd noticed that other bloggers had begun to develop an informal convention -- they were using the term "minireview" to identify items that were (or that contained) brief reviews of products. A minireview might be an entire weblog item or just a paragraph within an item. In the monkey-see, monkey-do tradition of the Web, I decided to imitate this behavior. I also wanted to expand its scope. The long-term goal would be to enable me (or anyone) to identify these kinds of elements in a way that would facilitate intelligent search and recombination. But that would require writers to categorize their material, and we all know that's a non-starter. The absence of a universal taxonomy is the least of our problems. Even if such a thing existed (or could exist), we'd be loath to apply it because we are lazy creatures of habit. We invest effort expecting immediate return, not some distant future reward.

What would motivate somebody to tag a chunk of content? It struck me that people care intensely about appearances, self-presentation, and social conformity. Look at the carefully handcrafted arrangements of links on blogrolls -- some ordered by ascending width, some undulating like candlesticks. We do these things despite our inherent laziness because we have seen others do them, because we want to express solidarity with the tribe, and because we hope to be trend-setters, not just trend-followers. Maybe we can leverage the machinery of meme propagation to achieve some semantic enrichment of the Web. Start with visual effects that people can easily create and that other people will want to copy. Tie those effects to tags that can also provide structural hooks. Then exploit the hooks.

RSS, XHTML, and XML databases

In the original plan, RSS was the conduit through which the enhanced content would flow. If a meme did propagate, search services that compiled the XHTML content of blog items into their databases could aggregate along this new axis, thus amplifying the effect. I still envision that scenario, but I'm as much a seeker of instant gratification as the next person, and I wanted immediate use of my own enhanced content. So I extracted the XHTML content I'd been accumulating in my Radio UserLand database, stuck it in a file, and put together a JavaScript/XSLT kit for searching it (1, 2, 3). And then a funny thing happened: the XML file took on a life of its own.

For no particularly good reason, I'd decided to tag quotations like so:

<p class="quotation" source="...">

Over on the Bitflux blog, Roger Fischer noted correctly that this was kind of silly. It unnecessarily invents a 'source' attribute that doesn't exist in XHTML, and that should therefore appear in another namespace. But in any case it's overkill because XHTML affords a natural solution:

<blockquote cite="...">

I agreed with Roger, so I made the change in the XML file (it was just a simple XSLT transform), and made a corresponding change to the canned XPath query that finds quotations in my blog. My next instinct was to republish the affected items. But on second thought, why? In the HTML rendering of my blog, the two styles look the same. And the items had already fallen off the RSS event horizon. Republishing wouldn't cause them to appear in the feed. Even if it did, the purely structural changes would be invisible and thus puzzling to readers.
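
The retagging step is small enough to sketch. The change described above was made with an XSLT transform; what follows is a Python stand-in rather than that transform, and it assumes a hypothetical archive layout (an items root holding item elements) and a placeholder URL. It renames each tagged paragraph instead of wrapping it, which is one of several reasonable choices.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of the archive; the URL is a placeholder.
ARCHIVE = """<items>
  <item>
    <p class="quotation" source="http://example.com/essay">Quoted text.</p>
    <p>Ordinary paragraph.</p>
  </item>
</items>"""


def retag_quotations(root):
    """Rewrite <p class="quotation" source="..."> as <blockquote cite="...">, in place.

    Returns the number of elements changed.
    """
    changed = 0
    # Materialize the matches first so renaming does not disturb iteration.
    for el in list(root.iter("p")):
        if el.get("class") == "quotation":
            el.tag = "blockquote"
            source = el.attrib.pop("source", None)
            if source is not None:
                el.set("cite", source)
            del el.attrib["class"]
            changed += 1
    return changed


root = ET.fromstring(ARCHIVE)
print(retag_quotations(root))
print(ET.tostring(root, encoding="unicode"))

# The canned query changes to match: quotations are now blockquotes with a cite.
quotations = root.findall(".//blockquote[@cite]")
```

Because the rewrite touches only element and attribute names, the text of each item is left alone, which is why the rendered pages look the same before and after.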

This creates a slightly odd situation. The canonical version of my weblog is no longer the published one. Rather, it's an XML document-database whose structure (but not content) is evolving and whose API is XPath search. At some point I'll probably want to resynchronize the two, but for now I'm just interested to see where the experiment leads.

From pidgin to creole

After I posted the blog entries describing this approach, a number of people asked me to specify the tagging conventions I'm using or intend to use. There is no plan or specification. I'd be satisfied for now if people could routinely and easily create styled elements, associate those elements with CSS attributes, embed the CSS in well-formed content, usefully navigate and search the stuff, and easily adjust the tagging across their own content repositories. Meme propagation could and arguably should drive collective decisions about which kinds of elements to name and what to name them.

In The Language Instinct, Steven Pinker describes the transition from pidgin to creole. A pidgin language, which arises when speakers with no native language in common are thrown together and must communicate, lacks a complete grammar. Amazingly, the children of pidgin speakers spontaneously create creole languages that are grammatically complete. It is perhaps a stretch to relate these processes to the evolution of modes of written communication on the Web. But even if you don't buy the whole analogy, it's worth thinking about how human communities can and do converge on naming conventions and then on a grammar. The process is intensely interactive. People imitate other people's ways of communicating, introducing variations that sometimes catch on and sometimes don't.

I don't think the Semantic Web will come from a specification that tells us how to name and categorize everything. But it could arise, I suspect, from our linguistic instincts and from the social contexts that nurture them. If that's true, then we need to be able to:

Speak easily and naturally.

The structural symbols we embed in our writing, when we write for the Web, have to be easy to understand and use. Style attributes strike me as the likely approach because, while limited in scope, they're available and can be manipulated in familiar ways.

Hear what we are saying.

At first I was deaf to the structural language I was trying to speak. I'd invent a use for a CSS class attribute and apply it, in what I thought was a consistent way, but it was really just a promise to the future. Some day I'd get around to harvesting what grew from the seeds I was planting. But when I finally did, I found that my tagging conventions had drifted over time. When I closed the feedback loop on my own weblog's content, by making it available to structured search, I could finally hear -- and thus correct -- that drift.

Imitate and be imitated.

My search mechanism has some interesting properties. For example, the canned XPath queries on the form not only make XPath usable for those who don't grok it intuitively, they also advertise the structural hooks that are available. I think of this as an invitation to imitators. Of course I'm an imitator too. When I see a good idea -- for example, Roger Fischer's suggestion -- I want to copy it. Having the searchable content in one place, available to XSLT or even just find-and-replace, makes quick work of that.
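
One way to hear the drift, and to see which hooks are on offer, is to tally every class name used across the archive. The sketch below assumes the same hypothetical archive layout as before, and the "mini-review" variant is an invented example of drift, not one reported here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical archive in which one class name has drifted.
ARCHIVE = """<items>
  <item><p><span class="minireview">A</span> ...</p></item>
  <item><p><span class="mini-review">B</span> ...</p></item>
  <item><p><span class="minireview">C</span> ...</p></item>
</items>"""


def class_inventory(xml_text):
    """Count how often each CSS class name appears anywhere in the document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        # A class attribute may hold several space-separated names.
        for name in el.get("class", "").split():
            counts[name] += 1
    return counts


for name, n in class_inventory(ARCHIVE).most_common():
    print(name, n)
```

A rarely used near-duplicate of a common name is a likely candidate for cleanup, and the full list doubles as a catalog of the structural hooks a canned query could target.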

The dictionary of the Semantic Web may one day be written. But not until we've done a lot of yammering, a lot of listening, and a lot of imitating. We need to find ways to help these behaviors flourish.