Structured Writing, Structured Search

June 10, 2003

Jon Udell

The theme of my talk last month at the Open Source Content Management conference (OSCOM) was: "Everything you need to know about content management, you (should have) learned in grade school." I spent a lot of time talking about why and how to use URIs and HTML document titles in principled ways. These two namespaces are metadata stores that we typically fail to manage, but that can deliver powerful benefits.

I should have moved more quickly through that material, though, because what I really wanted to highlight was the same idea as applied to the XHTML namespace. In The Semantic Blog I suggested that we could achieve much more with simple XHTML content than we currently do. Two months down the road, the picture's a bit clearer than it was.

In that column, I said that I was going to start including an <xhtml:body> element in my RSS feed. Now that it's been in place for two months, I admit it hasn't been entirely smooth sailing. I did work out how to use HTML Tidy to clean up the stuff I post through Radio UserLand. But in the end, that's not quite satisfactory. If the goal is to produce clean XHTML, you want more interaction than Tidy affords. Currently I wind up checking my work in an XML-savvy browser: IE or Mozilla. I'd like to be able to toggle between XML and HTML modes, but haven't sorted that out yet.

We are still in desperate need of lightweight WYSIWYG editing components that make XHTML useful to non-emacs-using (i.e. normal) people. I keep hearing about new efforts -- most recently, Mozile -- but so far, I've seen nothing that delivers the user experience I enjoyed back in 1996 in the Netscape mail/news client. It's fascinating to look back on this 1999 article, a condensation of my book. On the one hand, blogs have utterly reshaped collaborative knowledge management, which I then envisioned in terms of NNTP. On the other hand, the authoring capabilities I enjoyed then somehow elude us today. In a section of that article, subtitled "Not Your Grandfather's Newsreader," I wrote:

Communicator's Messenger and Internet Explorer's Outlook Express can both render and compose HTML. The mail/news composer in either product is a good-enough HTML authoring tool -- not for snazzy production Web pages, but for the kinds of office documents that people typically create. You can, in a WYSIWYG manner, compose HTML that uses tables, inline images (which can be dragged and dropped into place), hyperlinks, fonts, and color. You can't do these things in Usenet postings, because many people don't run HTML-aware mail/news clients. But rich messaging is entirely appropriate on a LAN or intranet where Communicator or IE is universally deployed.

Putting a modest proposal into practice

Of course rich content is the standard in blogspace, and yet here we are in 2003 hammering raw HTML into our grandmother's TEXTAREA widget. It seems crazy to do things this way, but the popularity of blogging proves that the reward-to-effort ratio is greater than one. Let's assume that we'll get our WYSIWYG XHTML editor someday, but maybe not soon. How can we achieve a bigger payoff now? In The Semantic Blog, I proposed that the availability of structured search can motivate some very simple but useful kinds of structured writing. I gave a bunch of XPath search examples that were based on RSS metadata, but not on structured content. The idea, which I've now begun putting into practice, was to also use inline CSS class and/or id attributes and to do so in a dual manner, as both stylistic and descriptive markup. Here's a real example of what I had in mind (see Computer/telephone integration: Why don't we expect more?):

Rendering         (image)
(X)HTML source    <span class="minireview">SpiderPhone</span> I'm always...
CSS directives    .minireview { font-weight: bold }
                  .minireview:before { content: "MINI-REVIEW: " }
XPath query       Find URLs of items containing minireviews:
                  //*[@class = "minireview"]/ancestor::item/link

In other words, having CSS-tagged this blog item as a "minireview" I can brand its appearance on my blog and, at the same time, expose it to XPath search. Any subscriber to my RSS feed (an individual or, more likely, a service) can collect my XHTML items, merge them with others, and offer this kind of search. My notion is that if we close the gap between effort and reward, useful naming conventions can evolve from the grassroots.
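As a rough illustration of what such a subscriber-side search could look like, here is a minimal Python sketch using only the standard library. The feed content, the example.com URLs, and the function name `minireview_links` are all hypothetical. ElementTree supports only a subset of XPath and has no ancestor axis, so the sketch tests each item for a matching descendant instead of walking upward from the tagged element.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; only the first item carries a minireview span.
FEED = """<rss version="2.0" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <channel>
    <item>
      <link>http://example.com/cti</link>
      <xhtml:body><xhtml:p><xhtml:span class="minireview">SpiderPhone</xhtml:span> I'm always...</xhtml:p></xhtml:body>
    </item>
    <item>
      <link>http://example.com/other</link>
      <xhtml:body><xhtml:p>No review here.</xhtml:p></xhtml:body>
    </item>
  </channel>
</rss>"""


def minireview_links(feed_xml):
    """Return the <link> text of every item containing a class="minireview" element."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        # No ancestor axis in ElementTree: look downward from each item instead.
        if item.find(".//*[@class='minireview']") is not None:
            links.append(item.findtext("link"))
    return links


print(minireview_links(FEED))
```

A service that merges many feeds could run the same check over every item it collects, whichever channel the item came from.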

I must admit, though, that I've yet to begin collecting my own XHTML blog items in an XML-savvy database. There are a bunch of these now and more all the time. I've written about Virtuoso, SleepyCat XML and Xindice, and PostgreSQL. Mark Wilcox just pointed me to eXist. Still, for most people, and even for me, the activation threshold's a bit steep to get going with one of these.

Serverless Structured Search

With all the capability packed into modern browsers, it struck me that we ought to be able to use XPath much more simply and interactively. So I took another look at my OSCOM slideshow and added an XPath search to it:

I got this working in IE first, and I wasn't sure I could achieve the same effect in Mozilla, but Brendan Eich set me straight and this screenshot is the proof.

Here's how the whole setup works. I start with an HTML template:

<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css"/>
<script src="script.js"></script>
</head>
<body onKeyPress="next()"> <!-- MSIE only? -->
<script>document.write(header)</script>

<!-- XHTML goes here -->

</body>
</html>

I clone it to files whose names and titles I manage in a JavaScript array, as seen in the screenshot. Given this setup, the slideshow is always live. The script sourced into each page takes care of both sequential and random-access navigation. The OSCOM folks wanted slides in SlideML format, though, so I produced that using a simple Python script:

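#!/usr/bin/env python3
import re
from xml.dom.minidom import parse

# Read the JavaScript file that holds the array of slide names and titles.
with open('script.js') as f:
    s = f.read()

# Each array entry has the form: 'file' +s()+ "title"
slides = re.findall(r"'([^']+)'\s+\+s\(\)\+\s+\"([^\"]+)\"", s)

# slideml.txt presumably holds the opening SlideML tags and metadata.
print(open('slideml.txt').read())

for file, title in slides:
    dom = parse(file)
    body = dom.getElementsByTagName('body')[0].toxml()
    # Strip the enclosing <body ...> and </body> tags, keeping the content.
    body = re.sub(r'^<body[^>]+>', '', body)
    body = re.sub(r'</body>$', '', body)
    print('''
        <s:slide s:id="%s">
        <s:title>%s</s:title>
        <s:content>
        %s
        </s:content>
        </s:slide>
   ''' % (file, title, body))

print('</s:slideset>')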

Using a regex keyed to the format of the JavaScript array of names and titles, this script reads in each slide's XHTML and builds a single file wrapped with SlideML's prescribed metadata. Which seems backwards, and I guess it is, but my notion of the value of this kind of wrapper format differs from the prevailing view. If I can develop, use, and deploy a slideshow made of simple parts -- XHTML pages and a controlling script, usable directly -- then why would I want to write a complex package that has to be exploded and transformed into the simple parts that people will actually consume?
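To make the regex step concrete, here is a small sketch that runs the same pattern over a made-up excerpt of script.js. The article does not show the array's exact layout, so the excerpt, the slide file names, and the titles are assumptions inferred from the pattern itself.

```python
import re

# Hypothetical excerpt of script.js; the real array layout is not shown in the article.
SCRIPT_JS = """
var slides = new Array(
  'slide1.html' +s()+ "First slide",
  'slide2.html' +s()+ "Second slide"
);
"""

# Same pattern as the conversion script: 'file' +s()+ "title"
PATTERN = re.compile(r"'([^']+)'\s+\+s\(\)\+\s+\"([^\"]+)\"")


def slide_entries(js_source):
    """Return (filename, title) pairs in the order they appear in the source."""
    return PATTERN.findall(js_source)


for filename, title in slide_entries(SCRIPT_JS):
    print(filename, "->", title)
```

Because the pattern is tied to the array's textual layout, any change to how the entries are written in script.js would need a matching change here.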

The all-in-one package of XML data is useful, but it's useful in a very different way. The XPath search feature is implemented as an XSLT stylesheet that queries the whole package. Here's how that works.

Parameterizing XPath Queries in XSLT

I naively thought that it would be possible to pass an XPath query into a stylesheet as an XSLT parameter. No such luck. The match attribute of the <xsl:template> element is a fixed quantity; you don't get to call it $query and swap it for an incoming parameter. Solving this took a bit of head-scratching. I wound up with this stylesheet:

<?xml version="1.0" encoding="us-ascii"?>
<xsl:stylesheet version='1.0' 
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:s="http://www.oscom.org/2003/SlideML/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en" >

<xsl:output method="html" indent="yes" encoding="us-ascii"/>

<xsl:template match="/">
<xsl:apply-templates />
</xsl:template>

<xsl:template match="query" >

<p><b>
<xsl:value-of select="ancestor::s:slide/s:title" />, 
<a>
<xsl:attribute name="href">
<xsl:value-of select="ancestor::s:slide/@s:id" />
</xsl:attribute>
<xsl:value-of select="ancestor::s:slide/@s:id" />
</a>
</b>
<div>
<xsl:copy-of select="."/>
</div>
<hr align="left" width="20%" />
</p>
</xsl:template>

<xsl:template match="text()">
</xsl:template>

</xsl:stylesheet>

I'll note in passing that the most complex thing here is the list of namespace declarations. Five of the six are required by SlideML, a heavy burden for a format that wraps a fairly thin layer of metadata around the XHTML content. I'm sure there were good arguments in favor of each namespace, but every time you propose adding another one, it's worth thinking about the downstream effects.

Since I couldn't parameterize the query string, I left it as a placeholder (match="query") and looked to DOM manipulation as a way to reach in and change it. Here's what I came up with:

<html>
<head>
<link rel="stylesheet" type="text/css" href="style.css"/>
<script>
var xslurl = 't.xsl';
var xmlurl = 'ml.xml';

function transform(queryText)
{
var appName = navigator.appName;
var appVersion = navigator.appVersion;
if (appName == 'Netscape')
    {
    MOZtransform(queryText);
    return;
    };
if (appName == 'Microsoft Internet Explorer')
    {
    IEtransform(queryText);
    return;
    }
alert('unsupported: ' + appName + ', ' + appVersion);
}

function MOZtransform(queryText)
{
var xsl;
var xml;

try 
    {
    xsl = document.implementation.createDocument("", "xslt", null);
    xsl.async = false;
    xsl.load (xslurl);
    var queryTemplate = xsl.getElementsByTagName('template')[1];
    queryTemplate.setAttribute('match', queryText);
    }
catch(e)
    { 
    alert('error: modify xsl: ' + e.message); 
    }

try
    {
    xml = document.implementation.createDocument("", "xml", null);
    xml.async = false;
    xml.load (xmlurl);
    }
catch(e)
    { 
    alert('error: load xml: ' + e.message); 
    }

try 
    { 
    var xslp = new XSLTProcessor();
    xslp.importStylesheet ( xsl );
    var results = xslp.transformToFragment(xml,document);
    var resultDiv = document.getElementsByTagName('div')[0];
    resultDiv.innerHTML = '';
    resultDiv.appendChild(results);
    document.queryBox.q.value = queryText;
    }
catch(e)
    {
    alert('error: do xslt: ' + e.message); 
    }
}   

function IEtransform(queryText)
{
var xsl;
var xml;
try 
    {
    xsl = new ActiveXObject("MSXML2.FreeThreadedDOMDocument");
    xsl.async = false;
    xsl.load(xslurl);
    var xsldoc = xsl.documentElement;
    var nodelist = xsldoc.selectNodes('//*[@match="query"]');
    var queryTemplate = nodelist.item(0);
    queryTemplate.setAttribute('match', queryText);
    }
catch(e)
    { 
    alert('error: modify xsl: ' + e.description); 
    }

try
    {
    xml = new ActiveXObject("MSXML.DOMDocument");
    xml.async = false;
    xml.load(xmlurl);
    }
catch(e)
    { 
    alert('error: load xml: ' + e.description); 
    }

try { 
    var templ = new ActiveXObject("MSXML2.XSLTemplate");
    templ.stylesheet = xsl; 
    var xslp = templ.createProcessor();
    xslp.input = xml;
    xslp.transform();
    var results = xslp.output;
    var resultDiv = document.getElementsByTagName('div')[0];
    resultDiv.innerHTML = results;
    document.queryBox.q.value = queryText;
    }
catch(e)
    {
    alert('error: do xslt: ' + e.description); 
    }
}
</script>
</head>

<body>
<table>

<tr>
<td>choose xpath query from list</td>
<td>enter or modify xpath query</td>
</tr>

<tr><td>

<form name="queryList" method="post">
<select name="q" 
  onChange="javascript:transform(document.queryList.q.value)">
<option value="/">choose your query</option>

<option value="//s:title[contains( . , 'SlideML')]">
  slide titles containing 'SlideML'</option>

<option value="//img">
  image references</option>

<option value="//img[contains(@src, 'zope')]">
  image references containing 'zope'</option>

<option value="//p[contains(. , 'OpenOffice')]">
  paragraphs containing 'OpenOffice'</option>

<option value="//*[@class='code']">
  elements with class='code'</option>

<option value="//*[@class='code' and contains(@id, 'python')]">
  //class='code' and id contains 'python'</option>

<option value="//a[contains(@href , 'bray')]">
  links with URL containing 'bray'</option>

<option value="//a[contains(./text() , 'bray')]">
  links with text containing 'bray'</option>

<option value="//a[contains(  translate ( 
   text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 
   'abcdefghijklmnopqrstuvwxyz'), 'bray')]">
   links with text containing 'bray', case-insensitive</option>

</select>
</form>

</td><td>

<form name="queryBox" method="post" 
  action="javascript:transform(document.queryBox.q.value)">
<input name="q" size="60">
</form>

</td></tr></table>

<div class="results">
</div>

</body>
</html>

In slightly different ways, MSIE and Mozilla are following the same recipe:

  1. Load the stylesheet into an XML DOM.

  2. Find the <xsl:template match="query"> element.

  3. Reset the value of its match attribute to the XPath string obtained from one or the other of the UI widgets.

  4. Load the package of SlideML data into another XML DOM.

  5. Create an XSLT processor.

  6. Apply the modified XSLT to the SlideML data.

  7. Replace a DIV element with the search results.
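The stylesheet-rewriting part of this recipe (steps 1 to 3) can be sketched outside the browser as well. The following Python fragment is an illustration under assumptions, not the code the slideshow uses: the stylesheet is a cut-down stand-in for the one above, and `set_query` is a hypothetical helper name. Python's standard library has no XSLT processor, so steps 5 to 7 are left out.

```python
from xml.dom.minidom import parseString

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Cut-down stand-in for the stylesheet above, keeping the placeholder template.
STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/"><xsl:apply-templates/></xsl:template>
  <xsl:template match="query"><xsl:copy-of select="."/></xsl:template>
  <xsl:template match="text()"/>
</xsl:stylesheet>"""


def set_query(stylesheet_xml, query_text):
    """Swap the match="query" placeholder for query_text; return the new stylesheet."""
    dom = parseString(stylesheet_xml)                    # step 1: load into a DOM
    for template in dom.getElementsByTagNameNS(XSL_NS, "template"):
        if template.getAttribute("match") == "query":    # step 2: find the placeholder
            template.setAttribute("match", query_text)   # step 3: reset the match attribute
            return dom.toxml()
    raise ValueError("no placeholder template found")


modified = set_query(STYLESHEET, "//img[contains(@src, 'zope')]")
print(modified)
```

Looking the template up by its placeholder value, as the IE branch does, avoids depending on the template's position in the stylesheet.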

Using XPath Search

From a user's point of view, XPath query strings are pretty darned geeky. I'm hopeless with them myself unless I have examples in front of me. I find that having a list of examples available in the context of my own live data, and synchronizing it to an input box in which examples can be modified, leads me to discover and record more useful patterns. A subtler thing happens too. As you're writing the XHTML, the search possibilities begin to guide your choices.

For example, I chose a very simple markup strategy for the slideshow. Rather than go with complex outlining, I decided that I really only needed two levels of indentation. I attached those levels to <p> and <div>. For purposes of indentation, it didn't matter whether I wrote like this:

<p>...</p>
<div>...</div>
<div>...</div>

Or like this:

<p>
<div>...</div>
<div>...</div>
</p>

I chose the latter style because I sensed that I wanted a <p> to enclose a complete thought. That was a somewhat abstract notion, but it suddenly became crystal clear when I made a simple change to the XSLT stylesheet. The change was from

<xsl:value-of select="."/>

to

<xsl:copy-of select="."/>
    

In other words, instead of simply dumping the text of the found element -- which is what search engines almost universally do, since they can't rely on the markup in the text they find -- this engine returns well-formed fragments. Images display as images, links as proper links, tables as tables, and when the query says "find a paragraph that contains" the result is the complete XHTML paragraph element, rendered as it is in its original context.
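The difference between the two instructions can be approximated in a few lines of Python. This is only an analogy built on ElementTree, not an XSLT run: the fragment and the function names `text_only` and `full_copy` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical slide fragment: a paragraph holding a link and an image.
FRAGMENT = '<p>See <a href="http://example.com/">the demo</a> and <img src="zope.gif"/></p>'


def text_only(element):
    """Roughly what xsl:value-of yields: the concatenated text, markup dropped."""
    return "".join(element.itertext())


def full_copy(element):
    """Roughly what xsl:copy-of yields: the element with its markup intact."""
    return ET.tostring(element, encoding="unicode")


p = ET.fromstring(FRAGMENT)
print(text_only(p))   # plain text; the link target and the image are lost
print(full_copy(p))   # well-formed fragment that a browser can render
```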

Sooner or later, I'll be using a real XML database to enjoy this level of control over the XHTML content I post to my weblog and that others post to theirs. With a little luck, I won't have to provide that service myself. Somebody will build one that latches onto my XHTML feed and others. Meanwhile, being lazy and having some RAM to spare, I'll probably see how far I can push this serverless approach.