The Semantic Blog

April 15, 2003

Jon Udell

When the mainstream trade press first started writing about XML, one of the key benefits invariably cited was precise search. You don't hear much about that any more. It wasn't, and still isn't, the wrong idea, but XML-savvy search requires an investment in data preparation that virtually nobody was or is willing to make. There are isolated examples, of course. One of my favorites is the ability of Safari (the electronic reference library, not the browser) to search within code fragments. Here, for example, is a query that finds sections of books containing code fragments that illustrate the use of Perl's Net::LDAP module:

http://safari.oreilly.com/JVXSL.asp?x=1&srchText=%28CODE+NET::ldap%29

We'd love it if what we write ourselves -- in email, on weblogs -- could behave this way. But we'd hate to be saddled with the rigorous data preparation that the Safari production teams slog through. That's the Semantic Web dilemma in a nutshell. Where's the sweet spot? How can we marry spontaneity and structure? Recent trends in blogspace, plus emerging XML-savvy databases, suggest a way forward.

When I reviewed OpenLink Software's Virtuoso for InfoWorld, I used my collection of inbound RSS feeds as sample data. After fetching each of the feeds into an indexed column of the database, I was able to define the following view:

create view rss as
  select feedname, '1.0' as version from feeds
    where xpath_contains(feeddata,'//RDF')
union
  select feedname,version from feeds
    where xpath_contains(feeddata,'//rss/@version', version);

I could then query that view:

select version, count(version) as ct from rss
  group by version
  order by ct desc
  for xml auto;

and feed the resulting XML data into Excel 11 for analysis:

Excel Screenshot

Mark Pilgrim's inside joke is hilarious. But the real message here is that you can concisely extract an analytical result from the chaotic stew of RSS flavors. A couple of XPath queries suffice to characterize the two dominant flavors of RSS that are relevant to the distribution of versions in a sample of feeds. It is remarkably powerful to be able to join those XPath queries in the context of a SQL query, and then further manipulate the results in SQL. In this example, Virtuoso has indexed the XML with the statement:

CREATE TEXT XML INDEX on feeds (feeddata)

Otherwise, it would have to scan through all the XML documents loaded into the feeddata column. Here's a similar example in Oracle, from Steve Muench's Building Oracle XML Applications:

/*
** Performance is excellent with interMedia Text
** CONTAINS( ) to narrow down the millions of
** documents to the few matching ones. Then,
** xpath.valueOf can be used in the SELECT
** statement to operate on the few matching documents.
*/
SELECT claimid, xpath.valueOf(damagereport,'//Cause') AS Cause
  FROM ins_claim
 WHERE CONTAINS(damagereport,'brakes WITHIN Cause') > 0

Here Oracle's interMedia Text handles the indexing, and its XPath engine picks out the data. All the big commercial databases either already support this hybrid model or soon will. On the open source front, PostgreSQL is heading in this direction, and I expect MySQL eventually will too.
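For readers without one of these databases at hand, the version tally from the Virtuoso example can be approximated in memory. What follows is only a minimal sketch using the XPath support in the standard Java library, not a reconstruction of what Virtuoso does internally. The class name, the method names, and the four tiny sample feeds are all made up for illustration, and real feeds that carry a DOCTYPE may need extra parser configuration.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class RssVersionTally {

    // Hypothetical sample data: one RDF-rooted feed, one 0.91 feed, two 2.0 feeds.
    static final List<String> SAMPLE_FEEDS = Arrays.asList(
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>",
        "<rss version=\"0.91\"><channel/></rss>",
        "<rss version=\"2.0\"><channel/></rss>",
        "<rss version=\"2.0\"><channel/></rss>");

    // Mirrors the two branches of the view: an RDF root element counts as
    // "1.0"; otherwise the value of rss/@version is used, or "unknown".
    static String versionOf(String feedXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(feedXml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() avoids having to bind the rdf: prefix.
        Boolean isRdf = (Boolean) xpath.evaluate(
            "boolean(/*[local-name()='RDF'])", doc, XPathConstants.BOOLEAN);
        if (isRdf) {
            return "1.0";
        }
        String version = xpath.evaluate("/rss/@version", doc);
        return version.isEmpty() ? "unknown" : version;
    }

    // Equivalent of "group by version": counts feeds per detected version.
    static Map<String, Integer> tally(List<String> feeds) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        for (String feed : feeds) {
            counts.merge(versionOf(feed), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tally(SAMPLE_FEEDS)); // prints {0.91=1, 1.0=1, 2.0=2}
    }
}
```

Unlike the indexed database column, this parses every document on every run, which is exactly the full scan that a text index avoids.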

Taking it to the desktop

As we consume more of our information by way of RSS feeds, the inability to store, index, and precisely search those feeds becomes more painful. I'd like to be able to work with my RSS data locally, even while offline, in much more powerful ways. One emerging option is the XML layer being added to Sleepycat's Berkeley DB, the database that's embedded in the Mozilla mail/news client, in Movable Type, and in a slew of other programs. Given a Perl array @feeds containing the URLs of my RSS feeds, here's how you can load up those feeds into a Berkeley DB XML container:

use strict;
use warnings;

use Sleepycat::DbXml 'simple';
use LWP::Simple;
use XML::Parser;

# enumerateFeeds() is not shown; it is assumed to return the list of feed URLs
my @feeds = enumerateFeeds();

eval
  {
  my $container = new XmlContainer("test");
  $container->open(Db::DB_CREATE);
  $container->declareIndex("","title","node-element-equality-string");
  my $p = new XML::Parser;
  foreach my $feed (@feeds)
    {
    my $content = get $feed;
    # parse first, so that only well-formed feeds are stored
    eval
      { $p->parsestring($content); };
    if ($@)
      {  warn $@; }
    else
      {
      print "$feed: ok\n";
      my $document = new XmlDocument;
      $document->setContent($content);
      $container->putDocument($document);
      }
    }
  $container->close();
  };

my $e;

if ($e = catch XmlException)
  {  warn $e->what(), "\n"; }
elsif ($e = catch std::exception)
  {  warn $e->what(), "\n"; }
elsif ($@)
  {  warn $@; }

Note that the index on the title element will apply to all such elements -- in this case, to //channel/title and to //channel/item/title. The "node-element-equality-string" lingo is part of an elaborate indexing system that enables, but also forces, you to specify the path type (edge or node), the node type (element or attribute), the key type (equality, presence, or substring), and the data type (number, string). For this reason, and also because you can't dynamically drop or add indexes to a container once you've loaded documents into it, Berkeley DB XML -- like Berkeley DB itself, as its FAQ explains -- is probably not the best foundation for a database-backed RSS reader, especially since the XML structures fed to that reader are likely to continue to morph. Nevertheless, since we've loaded up the data, let's look at some queries. Here's a program, written using Berkeley DB XML's Java binding, that will issue XPath queries and give back results:

package com.sleepycat.examples;

import com.sleepycat.db.*;
import com.sleepycat.dbxml.*;

public class XpathSearch
  {
  public static void main(String[] args) throws Exception
    {
    System.out.println(args[0]);
    XmlContainer container = new XmlContainer(null, "test", 0);
    container.open(null, Db.DB_CREATE, 0);
    XmlQueryContext context =
          new XmlQueryContext (
          XmlQueryContext.ResultValues, XmlQueryContext.Eager);
    context.setNamespace("dc","http://purl.org/dc/elements/1.1/");
    XmlResults results = container.queryWithXPath(null, args[0], context, 0);
    for (XmlValue value; (value = results.next(null)) != null; )
      {
      System.out.println(value.asString(context));
      }
    container.close(0);
    }
  }

And here are some questions and answers:

Which channel titles contain 'Jon'?

//channel/title[contains(text(),'Jon')]/text()

Jon Schull's Weblog

Jon's Radio

What are the URLs of channel titles that contain 'Jon'?

//channel/title[contains(text(),'Jon')]/ancestor::channel/link/text()

http://radio.weblogs.com/0104369/

http://weblog.infoworld.com/udell/

What are the titles of channels that use dc:date?

//channel/dc:date/ancestor::channel/title/text()

algorhythm

dive into mark

What are the titles of channels with more than 25 items?

//*[count(item)>25]/title/text()

New York Times: Business

New York Times: Technology

Mozquito XForms

Mono Project News

Web Services Articles from The Stencil Group

What are the titles of channels with descriptions longer than 5000 characters?

//*[string-length(description)>5000]/ancestor::channel/title/text()

ScottGu's Blog

Jon Schull's Weblog

Clemens Vasters: Enterprise Development & Alien Abductions

Mark O'Neill's Radio Weblog

Jeremy Allaire's Radio 

Jamie Lewis

Better Living Through Software

DJ's Weblog

jbond's blog at voidstar.com

What are the titles of items with descriptions containing XPath or xpath?

//channel/item/description[contains(text(),'XPath') or contains(text(),'xpath')]/ancestor::item/title/text()

Just how RESTful is TV?

xhtml in rss 2.0

Beyond being saved, DENG Featured On Flashguru.co.uk

Degrees of freedom

xhtml in rss 2.0

Native XML Scripting

What are the titles of channels with items whose descriptions contain XPath or xpath?

//channel/item/description[contains(text(),'XPath') or contains(text(),'xpath')]/ancestor::channel/title/text()

Clemens Vasters: Enterprise Development & Alien Abductions

Sjoerd Visscher's weblog

Mozquito XForms

Jon's Radio

Sam Ruby

TheArchitect.co.uk - Jorgen Thelin's weblog
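To try queries of this shape without installing Berkeley DB XML, something like the following sketch may help. It runs three of the expressions above against a single in-memory document using the standard javax.xml.xpath package. The class name, the query helper, and the sample feed (including its item descriptions) are invented for illustration; the results shown earlier came from my real collection of feeds, not from this toy document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FeedXPathDemo {

    // Hypothetical two-item feed, made up for this example.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel>"
      + "<title>Jon's Radio</title>"
      + "<link>http://weblog.infoworld.com/udell/</link>"
      + "<item><title>Degrees of freedom</title>"
      + "<description>Notes on XPath and SQL</description></item>"
      + "<item><title>Another item</title>"
      + "<description>Nothing relevant here</description></item>"
      + "</channel></rss>";

    // Evaluates an XPath 1.0 expression and returns the text of each
    // node in the resulting node-set, in document order.
    static List<String> query(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Which channel titles contain 'Jon'?
        System.out.println(query(SAMPLE,
            "//channel/title[contains(text(),'Jon')]/text()"));
        // What are the URLs of channels whose titles contain 'Jon'?
        System.out.println(query(SAMPLE,
            "//channel/title[contains(text(),'Jon')]/ancestor::channel/link/text()"));
        // Which item titles have descriptions mentioning XPath or xpath?
        System.out.println(query(SAMPLE,
            "//channel/item/description[contains(text(),'XPath') or "
          + "contains(text(),'xpath')]/ancestor::item/title/text()"));
    }
}
```

The trade-off is the obvious one: there is no index and no persistence here, so every query re-parses and walks the whole document.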

You can do the same kind of thing with Apache Xindice, which is probably a better fit for the purpose. Xindice lacks the robust, high-performance characteristics of Sleepycat's Berkeley DB XML, which supports all of Berkeley DB's transactional features. But it's a more flexible ad hoc indexer and searcher. Xindice enables you to use wildcards to index all elements or all attributes and to freely add and drop indexes.

Dropping the other shoe

The kinds of searches shown here are fun, up to a point. But the novelty quickly wears off because the only XML available for searching is metadata (channel titles, item titles, dates), not content. Here's where the other shoe drops. I've long dreamed of using RSS to produce and consume XML content. We're so close. RSS content is HTML, which is almost XHTML, a gap that HTML Tidy can close. In current practice, the meat of an RSS item appears in the <description> tag, either as an HTML-escaped (aka entity-encoded) string or as a CDATA section. As has often been observed, it'd be really cool to have the option to use XHTML as well. Then I could write blog items in which the <pre> tag, or perhaps a class="codeFragment" attribute, marks regions for precise search. You or I could aggregate those items into personal XPath-aware databases in order to do those searches locally (perhaps even offline), and public aggregators could offer the same capability over the Web.

So I was delighted to see the recent ping-pong match between Sam Ruby and Don Box, each of whom has now demonstrated a valid RSS 2.0 feed (Sam, Don) that includes a <body> element, properly namespaced as XHTML, which carries XPath-friendly content. Excellent!
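Here is a rough sketch of the kind of precise search such a feed would allow. It assumes a convention that does not exist yet: code fragments marked as <pre class="codeFragment"> inside an XHTML-namespaced <body> element. The class name, the x prefix, the itemsWithCode helper, and the two sample items are all hypothetical, and the sketch uses the standard Java XPath API rather than any particular XML database.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CodeFragmentSearch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical feed: both items mention Net::LDAP, but only the
    // first mentions it inside a marked code fragment.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel><title>Example blog</title>"
      + "<item><title>LDAP from Perl</title>"
      + "<description>A short teaser.</description>"
      + "<body xmlns=\"" + XHTML + "\"><p>Loading the module:</p>"
      + "<pre class=\"codeFragment\">use Net::LDAP;</pre></body></item>"
      + "<item><title>Directory musings</title>"
      + "<description>Prose only.</description>"
      + "<body xmlns=\"" + XHTML + "\"><p>I read about Net::LDAP today.</p></body></item>"
      + "</channel></rss>";

    // Returns titles of items whose XHTML body has a code fragment
    // containing the given text.
    static List<String> itemsWithCode(String xml, String needle) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for the x: prefix to match
        Document doc = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the (arbitrary) prefix "x" to the XHTML namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "x".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });
        // Pass the search text as an XPath variable to avoid quoting problems.
        xpath.setXPathVariableResolver(name -> needle);

        NodeList titles = (NodeList) xpath.evaluate(
            "//item[x:body//x:pre[@class='codeFragment'][contains(., $needle)]]/title",
            doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < titles.getLength(); i++) {
            result.add(titles.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(itemsWithCode(SAMPLE, "Net::LDAP")); // prints [LDAP from Perl]
    }
}
```

A plain full-text search for Net::LDAP would return both items. Restricting the match to marked-up code fragments is what makes the search precise, and it only works because the body is real XML rather than an escaped string.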

If this idea takes hold, the <description> tag could in principle revert to its original purpose, which (at least in my view) was to advertise, rather than fully convey, an item. In practice, I doubt that will happen. Too much infrastructure expects entire items to be encoded in the <description>. That's okay; there are lots of options. For example, a blog can easily offer variant feeds. One could carry the full item encoded in the <description>, without an <xhtml:body>, for use by humans wanting to read items in RSS readers. Another could carry a brief description in <description> and the full item in <xhtml:body>, for use by programs that gather, index, and search feeds.

Nothing will break, and nobody will be forced to adopt a new tool or learn a new behavior. But those who want to start sprinkling semantic cues into their blogs will be free to do so. For me, frankly, it'll be a selfish activity first and foremost. I refer to my own stuff more than you do. And I already write in XHTML. Why shouldn't I enjoy precise search of my own archive? Extending that same capability to you, by way of RSS, is a nice bonus.

I haven't pulled the trigger on this yet, partly because every stage in the evolution of RSS provokes such tumultuous debate that I wanted to float a trial balloon here and gauge the reaction. My hunch, though, is that before long I'll be producing, consuming, and storing a small but growing number of XHTML-enhanced blogs. And then, I hope, we'll start to see conventions bubbling up from the grass roots. We don't need a grand ontology in order to be able to mark up things like code fragments. My guess is that these kinds of enhancements will come easily and naturally, the way RSS autodiscovery was born. A friend of mine likes to say that the semantic Web isn't a destination, it's a journey. I'm starting to see how to take the first few baby steps.