
Structured Writing, Structured Search
The theme of my talk last month at the Open Source Content Management conference (OSCOM) was: "Everything you need to know about content management, you (should have) learned in grade school." I spent a lot of time talking about why and how to use URIs and HTML document titles in principled ways. These two namespaces are metadata stores that we typically fail to manage, but that can deliver powerful benefits.
I should have moved more quickly through that material, though, because what I really wanted to highlight was the same idea as applied to the XHTML namespace. In The Semantic Blog I suggested that we could achieve much more with simple XHTML content than we currently do. Two months down the road, the picture's a bit clearer than it was.
In that column, I said that I was going to start including an <xhtml:body> element in my RSS feed. Now that it's been in place for two months, I admit it hasn't been entirely smooth sailing. I did work out how to use HTML Tidy to clean up the stuff I post through Radio UserLand. But in the end, that's not quite satisfactory. If the goal is to produce clean XHTML, you want more interaction than Tidy affords. Currently I wind up checking my work in an XML-savvy browser: IE or Mozilla. I'd like to be able to toggle between XML and HTML modes, but haven't sorted that out yet.
We are still in desperate need of lightweight WYSIWYG editing components that make XHTML useful to non-emacs-using (i.e. normal) people. I keep hearing about new efforts -- most recently, Mozile -- but so far, I've seen nothing that delivers the user experience I enjoyed back in 1996 in the Netscape mail/news client. It's fascinating to look back on this 1999 article, a condensation of my book. On the one hand, blogs have utterly reshaped collaborative knowledge management, which I then envisioned in terms of NNTP. On the other hand, the authoring capabilities I enjoyed then somehow elude us today. In a section of that article, subtitled "Not Your Grandfather's Newsreader," I wrote:
Communicator's Messenger and Internet Explorer's Outlook Express can both render and compose HTML. The mail/news composer in either product is a good-enough HTML authoring tool -- not for snazzy production Web pages, but for the kinds of office documents that people typically create. You can, in a WYSIWYG manner, compose HTML that uses tables, inline images (which can be dragged and dropped into place), hyperlinks, fonts, and color. You can't do these things in Usenet postings, because many people don't run HTML-aware mail/news clients. But rich messaging is entirely appropriate on a LAN or intranet where Communicator or IE is universally deployed.
Putting a modest proposal into practice
Of course rich content is the standard in blogspace, and yet here we are in 2003 hammering raw HTML into our grandmother's TEXTAREA widget. It seems crazy to do things this way, but the popularity of blogging proves that the reward-to-effort ratio is greater than one. Let's assume that we'll get our WYSIWYG XHTML editor someday, but maybe not soon. How can we achieve a bigger payoff now? In The Semantic Blog, I proposed that the availability of structured search can motivate some very simple but useful kinds of structured writing. I gave a bunch of XPath search examples that were based on RSS metadata, but not on structured content. The idea, which I've now begun putting into practice, was to also use inline CSS class and/or id attributes and to do so in a dual manner, as both stylistic and descriptive markup. Here's a real example of what I had in mind (see Computer/telephone integration: Why don't we expect more?):
| Rendering | MINI-REVIEW: SpiderPhone I'm always... (the span, with its generated prefix, renders bold) |
| (X)HTML source | <span class="minireview">SpiderPhone</span> I'm always... |
| CSS directives | .minireview { font-weight: bold } .minireview:before { content: "MINI-REVIEW: " } |
| XPath query | Find URLs of items containing minireviews: //*[@class = "minireview"]/ancestor::channel/item/link |
In other words, having CSS-tagged this blog item as a "minireview" I can brand its appearance on my blog and, at the same time, expose it to XPath search. Any subscriber to my RSS feed (an individual or, more likely, a service) can collect my XHTML items, merge them with others, and offer this kind of search. My notion is that if we close the gap between effort and reward, useful naming conventions can evolve from the grassroots.
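To make the reward side concrete, here's a minimal sketch of how a subscriber's page could run that query in a modern browser, using the standard DOMParser and document.evaluate APIs. The function name and feed handling are illustrative only, not part of any actual service:

// A sketch of client-side structured search over an RSS feed that
// embeds XHTML content. The feed URL is hypothetical.
async function findMinireviewLinks(feedUrl) {
  // Fetch the RSS feed and parse it as XML.
  const text = await (await fetch(feedUrl)).text();
  const doc = new DOMParser().parseFromString(text, 'application/xml');
  // The query from the table above: URLs of items in channels
  // that contain minireviews.
  const q = '//*[@class = "minireview"]/ancestor::channel/item/link';
  const result = doc.evaluate(q, doc, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  const links = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    links.push(result.snapshotItem(i).textContent);
  }
  return links;
}

Point it at a feed (or a merged collection of feeds) whose items carry xhtml:body content, and it returns the matching link URLs.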
I must admit, though, that I've yet to begin collecting my own XHTML blog items in an XML-savvy database. There are a bunch of these now and more all the time. I've written about Virtuoso, about Sleepycat's Berkeley DB XML and Xindice, and about PostgreSQL. Mark Wilcox just pointed me to eXist. Still, for most people, and even for me, the activation threshold's a bit steep to get going with one of these.
Serverless Structured Search
With all the capability packed into modern browsers, it struck me that we ought to be able to use XPath much more simply and interactively. So I took another look at my OSCOM slideshow and added an XPath search to it.
I got this working in IE first, and I wasn't sure I could achieve the same effect in Mozilla, but Brendan Eich set me straight, and a screenshot of the Mozilla version proved the point.
Here's how the whole setup works. I start with an HTML template:
<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css"/>
<script src="script.js"></script>
</head>
<body onKeyPress="next()"> <!-- keystroke navigation: MSIE only? -->
<script>document.write(header)</script>
<!-- XHTML goes here -->
</body>
</html>
I clone it to files whose names and titles I manage in a JavaScript array. Given this setup, the slideshow is always live. The script sourced into each page takes care of both sequential and random-access navigation.
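Here's a minimal sketch of what script.js might contain. The entry format and the s() separator helper are guesses, inferred from the regex in the converter shown below, and next() is wired to the template's onKeyPress; none of this is the original code:

// Hypothetical sketch of script.js, not the original. Each entry pairs
// a file name with a title, joined by the separator that s() returns.
function s() { return '|'; }
var slides = [
  'one.html' +s()+ "First slide",
  'two.html' +s()+ "Second slide"
];
// Sequential navigation: find the current page in the array, then
// load the next slide (wrapping around at the end).
function next() {
  var here = location.pathname.split('/').pop();
  for (var i = 0; i < slides.length; i++) {
    if (slides[i].split(s())[0] == here) {
      location.href = slides[(i + 1) % slides.length].split(s())[0];
      return;
    }
  }
}

The OSCOM folks wanted slides in SlideML format, though, so I produced that using a simple Python script: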
#! /usr/bin/python
import re
from xml.dom.minidom import parse

# Harvest (file, title) pairs from the JavaScript array in script.js.
s = open('script.js').read()
l = re.findall(r'\'([^\']+)\'\s+\+s\(\)\+\s+"([^"]+)"', s)

# Emit the SlideML prologue, kept in a separate boilerplate file.
print open('slideml.txt').read()

for (file, title) in l:
    # Parse the slide's XHTML -- possible only because each slide is
    # well-formed -- and extract the contents of its body element.
    dom = parse(file)
    body = dom.getElementsByTagName('body')[0].toxml()
    body = re.sub('^<body[^>]+>', '', body)
    body = re.sub('</body>$', '', body)
    print '''
<s:slide s:id="%s">
<s:title>%s</s:title>
<s:content>
%s
</s:content>
</s:slide>
''' % (file, title, body)

print '</s:slideset>'
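Run from the directory that holds the slides (it opens script.js, slideml.txt, and each slide file by relative name), the script writes the finished SlideML document to standard output.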
Using a regex keyed to the format of the JavaScript array of names and titles, this script reads in each slide's XHTML and builds a single file wrapped with SlideML's prescribed metadata. Which seems backwards, and I guess it is, but my notion of the value of this kind of wrapper format differs from the prevailing view. If I can develop, use, and deploy a slideshow made of simple parts -- XHTML pages and a controlling script, usable directly -- then why would I want to write a complex package that has to be exploded and transformed into the simple parts that people will actually consume?
The all-in-one package of XML data is useful, but it's useful in a very different way. The XPath search feature is implemented as an XSLT stylesheet that queries the whole package. Here's how that works.
Parameterizing XPath Queries in XSLT
I naively thought that it would be possible to pass an XPath query into a stylesheet as an XSLT parameter. No such luck. The match attribute of the <xsl:template> element is a fixed quantity; you don't get to call it $query and swap in an incoming parameter. Solving this took a bit of head-scratching. I wound up with this stylesheet:
<?xml version="1.0" encoding="us-ascii"?>
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
  xmlns:s="http://www.oscom.org/2003/SlideML/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xml:lang="en">

<xsl:output method="html" indent="yes" encoding="us-ascii"/>

<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="query">
  <p><b>
    <xsl:value-of select="ancestor::s:slide/s:title"/>,
    <a>
      <xsl:attribute name="href">
        <xsl:value-of select="ancestor::s:slide/@s:id"/>
      </xsl:attribute>
      <xsl:value-of select="ancestor::s:slide/@s:id"/>
    </a>
  </b>
  <div>
    <xsl:copy-of select="."/>
  </div>
  <hr align="left" width="20%"/>
  </p>
</xsl:template>

<xsl:template match="text()"/>

</xsl:stylesheet>
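Nothing in SlideML emits a query element, of course. The trick, presumably, is that "query" is a placeholder: before the transform runs, the driver substitutes the user's XPath expression into the stylesheet text, since the match attribute can't be parameterized at runtime. Here's a sketch of that substitution in a modern browser, using Mozilla's XSLTProcessor; the file names, element id, and search() function are mine, not the column's actual driver code:

// A sketch, not the actual driver: splice the user's XPath into the
// stylesheet text, compile it, and render the results in the page.
async function search(xpathQuery) {
  const [xmlText, xslText] = await Promise.all(
    ['slides.xml', 'search.xsl'].map(u => fetch(u).then(r => r.text())));
  const parser = new DOMParser();
  const xml = parser.parseFromString(xmlText, 'application/xml');
  // Swap the placeholder for the query. Single-quote the attribute so
  // that double quotes inside the XPath expression survive.
  const xsl = parser.parseFromString(
    xslText.replace('match="query"', "match='" + xpathQuery + "'"),
    'application/xml');
  const proc = new XSLTProcessor();
  proc.importStylesheet(xsl);
  const fragment = proc.transformToFragment(xml, document);
  document.getElementById('results').replaceChildren(fragment);
}

Calling search('//*[@class = "minireview"]') would then list, for each match, the enclosing slide's title and id, plus a copy of the matched element, just as the templates above prescribe.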
I'll note in passing that the most complex thing here is the list of namespace declarations. Five of the six are required by SlideML, a heavy burden for a format that wraps a fairly thin layer of metadata around the XHTML content. I'm sure there were good arguments in favor of each namespace, but every time you propose adding another one, it's worth thinking about the downstream effects.