Lightweight XML Search Servers, Part 2

February 18, 2004

Jon Udell

In last month's installment I showed a simple search service that uses libxslt to reduce a file of XML content (my weblog writing) to just the elements matching an XPath expression. This month's challenge was to scale up to a database-backed implementation using Berkeley DB XML. Since my own XML corpus doesn't present much of a challenge to DB XML, I also decided to include all of the 200+ blogs I read. This proved interesting because only a handful of them provide well-formed XHTML content. So before I dive into the details of the DB XML implementation, here's the rundown on how I built an RSS aggregator that renders feeds as well-formed XHTML.

The short answer is that a pair of modules did all the heavy lifting. The first is HTML Tidy, originally by Dave Raggett and now maintained by "a quorum of developers" on SourceForge. The second is Mark Pilgrim's universal feed parser which, as it turns out, neatly encapsulates HTML Tidy with the help of Marc-Andre Lemburg's mxTidy.
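To see what that buys you in practice, here's a minimal sketch: fetch one feed and print its item content, which -- with mxTidy installed -- feedparser hands back as tidied markup. The URL is a placeholder, and the key names ('items' vs. 'entries', 'description' vs. 'summary') vary across feedparser releases, so treat the details as illustrative:

import feedparser

# Sketch: with mxTidy available, feedparser returns tidied (well-formed) markup.
result = feedparser.parse('http://example.org/index.xml')
for item in result['items']:
  print item['title']
  print item.get('description', '')[:80]   # should now be well-formed XHTML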

Databasing the feeds

The complete loader script is here. First, the setup:

import string, sys, urllib, re, traceback, pickle, os, \
  feedparser, mx.DateTime, libxml2

from dbxml import *

Of the non-standard modules, feedparser is trivial to install: just drop it in the current directory. Installers for mx.DateTime and mx.Tidy (used by feedparser) are available at egenix.com. Creating the Python bindings for libxml2 was, as mentioned last time, not quite a no-brainer. Likewise the Python bindings for DB XML on Windows and Mac OS X, my two everyday environments. For OS X, see Kimbro Staken's recipe. For Windows, I realized that Chandler contains the bits I needed, so I cheated and used those. Mea culpa.

More laziness: my aggregator, like everyone's first aggregator, initially ignored the two popular ways to check if a feed has changed since last fetched: ETag and "conditional GET". But feedparser makes it easy to use both methods. You just need to remember the values of the ETag and Last-Modified headers sent from the server, and hand them to feedparser's parse() method when you ask it to fetch a feed. I could have stored the headers in DB XML, but it seemed easier to just pickle a Python dictionary:

# Create the ETag/Last-Modified dictionary on first run...
try:
  os.stat(dictEtagFile)
except:
  print "CreatingDictEtag"
  f = open(dictEtagFile,'wb')
  dictEtag = {}
  pickle.dump(dictEtag,f)
  f.close()

# ...then load it.
try:
  f = open(dictEtagFile, 'rb')
  dictEtag = pickle.load(f)
  f.close()
except:
  raise "CannotLoadDictEtag"

The list of feeds to fetch is available from an OPML file that's rewritten whenever I add or remove an RSS subscription using my Radio UserLand feed reader. The combination of Python and libxml2 makes quick work of pulling the URLs for those feeds into a Python list:

opmldata = urllib.urlopen(opml).read()
opmlxml = libxml2.parseDoc(opmldata)
urls = opmlxml.xpathEval('//@xmlUrl')
urls = map(libxml2.xmlNode.getContent, urls)

Using the ETag and Last-Modified headers stored in the pickled dictionary, you can avoid fetching feeds that haven't changed. But when a feed has changed, which of its items should be added to the database? RSS 2.0 provides a "guid" that signals item-level changes, but RSS .9x and 1.0 feeds don't. Yet more laziness: for now I'm just hashing the item content and keeping track that way. Items are stored in the database like so:

<item channel="Jon's Radio" hash="-125385335">
  <title>...</title>
  <date>...</date>
  <link>...</link>
  <content>...</content>
</item>
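The hash attribute is just a fingerprint of the item's text. A value like -125385335 is what Python's built-in hash() returns for a string, so a minimal version looks something like this (exactly which fields go into the hash is a detail of the loader script, not shown here):

def itemHash(title, link, content):
  # One simple way to fingerprint an item; the loader may hash different fields.
  return str(hash(title + link + content))   # e.g. "-125385335"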

To fetch these into a Python dictionary, the script opens the database and runs an XPath query:

container = XmlContainer(None, './dbxml/blog.dbxml')
container.open(None, DB_CREATE)
context = XmlQueryContext(1, 0)
result = container.queryWithXPath(None, '/item/@hash', context)
hashValues = {}

try:
  for i in range(result.size()):
    val = result.next().asString(None)
    hashValues[val] = val
except:
  print "CannotCreateHashDict"

container.close()

DB XML supports Berkeley DB transactions, but I'm not using them here, so the first argument to container.open() is None. Queries, by default, return whole matching documents, and we'll use that mode later in the search engine. But here, we just want to extract hash values, so the first argument to XmlQueryContext() -- the returnType -- is 1. (The second argument -- evaluationType -- defaults to "eager" but can be switched to "lazy," though I haven't found a reason to do that yet.)
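To make the two modes concrete, here's how they line up with the calls in this column. This is a sketch reusing the numeric arguments from the script rather than the symbolic constants, whose names vary by DB XML release:

# Values mode, as in the hash-extraction query above: returnType=1, eager.
context = XmlQueryContext(1, 0)
hashes = container.queryWithXPath(None, '/item/@hash', context)

# Document mode, as in the search server below: omit the context and
# queryWithXPath() hands back whole matching documents.
docs = container.queryWithXPath(None, '/item[contains(., "XPath")]')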

Now here's the main loop. For each URL, it looks up the ETag and Last-Modified headers, calls the feedparser, calls a helper method to unpack the Python data structure returned by the feedparser, and -- if any new items showed up -- it plugs them into the database.

for url in urls:
  print "%s " % url
  try:
    # Look up any saved ETag and Last-Modified values for this feed.
    try:
      dictEtag[url]
    except KeyError:
      dictEtag[url] = {}
    try:
      etag = dictEtag[url]['etag']
    except:
      etag = None
    try:
      mod = dictEtag[url]['mod']
    except:
      mod = None

    # Fetch the feed, passing the cached headers for a conditional GET.
    result = feedparser.parse(url, etag, mod)

    # Remember whatever the server sent back this time.
    try:
      etag = result['etag']
      dictEtag[url]['etag'] = etag
    except:
      etag = None
    try:
      mod = result['modified']
      dictEtag[url]['mod'] = mod
    except:
      mod = None

    if ( etag == None and mod == None ):
      print "%s: no etag or mod" % url
    if ( result['status'] == 304 ):
      continue   # unchanged since last fetch

    # Unpack any new items and add them to the database.
    items = unpack ( result )
    if len ( items ):
      container.open(None, DB_CREATE)
      for item in ( items ):
        doc = XmlDocument()
        doc.setContent(item)
        print container.putDocument(None, doc)
      container.close()

  except:
    print formatExceptionInfo()
    continue

The unpack() method skips items whose hashes already appear in the dictionary of item hashes, and normalizes the data in a couple of ways. For example, content can show up under different keys in the Python dictionary returned by feedparser.parse(). Likewise, there are a few different date formats. For now I'm converting all content to this style:

<content><body>...</body></content>

And all dates to this style:

<date>2004/02/04</date>
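I won't reproduce unpack() in full, but a stripped-down sketch conveys the idea. The real method differs in its signature, in how it picks content fields, and in how it parses feed dates, so treat the key names and the extra arguments here as illustrative:

import time

def unpack(result, channel, hashValues):
  # Sketch of the normalization step; a real version would also escape
  # markup-significant characters in titles.
  items = []
  for entry in result['items']:
    title = entry.get('title', '')
    link = entry.get('link', '')
    content = entry.get('description', '')
    h = str(hash(title + link + content))
    if h in hashValues:
      continue                        # already databased; skip it
    date = time.strftime('%Y/%m/%d')  # fallback; real code parses feed dates
    items.append(
      '<item channel="%s" hash="%s"><title>%s</title>'
      '<date>%s</date><link>%s</link>'
      '<content><body>%s</body></content></item>'
      % (channel, h, title, date, link, content))
  return items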

Finally, the script pickles the dictionary of item hashes for next time.
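That last step is just the mirror image of the load shown earlier, roughly like this (the file names are placeholders):

# Persist the hash and header dictionaries for the next run.
f = open(hashValuesFile, 'wb')
pickle.dump(hashValues, f)
f.close()

f = open(dictEtagFile, 'wb')
pickle.dump(dictEtag, f)
f.close()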

Searching the feeds

As before, the server is based on Python's BaseHTTPServer. Last month, the server's query() method used libxslt's XPath engine. This month it uses DB XML's queryWithXPath(). The complete search script is here.
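The plumbing around query() isn't shown below, but a minimal shell of the kind it plugs into looks roughly like this. The class name, port, and query-string handling here are illustrative, not the script's own:

import BaseHTTPServer, urllib

class SearchHandler(BaseHTTPServer.BaseHTTPRequestHandler):

  def do_GET(self):
    # Sketch: treat everything after '?' as the XPath expression.
    q = urllib.unquote(self.path.split('?', 1)[-1])
    self.send_response(200)
    self.send_header('Content-type', 'text/html')
    self.end_headers()
    self.wfile.write(self.query(q))

  def query(self, q):
    return ''   # the real method, shown below

if __name__ == '__main__':
  BaseHTTPServer.HTTPServer(('', 8080), SearchHandler).serve_forever()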

The default DB XML search context, which returns documents rather than just matching nodes, is the right strategy here. Each found document provides the channel, title, date, and link used to contextualize one or more hits within the document.

The hits still need to be isolated from their containing documents, though. So libxml2's xpathEval() does subqueries, within each found document, to yield one or more contextualized fragments.

The HTML output is accumulated as a Python list of date-content pairs, and then sorted by date. The reason for this is that XPath itself can't sort results.

Here's the query() method:

  def query(self,q):

    css = getCss()
    script = getScript()
    preamble = getPreamble(q)

    # Run the XPath query against the whole container; by default this
    # returns complete matching documents.
    try:
      container = XmlContainer(None, db)
      container.open(None)
      xmlResults = container.queryWithXPath(None, q)
    except:
      print "QueryFailure"
      container.close()
      return "bad query: %s" % q

    if ( xmlResults.size() == 0 ):
      return """<html><head><title>XPath query of Jon's feeds</title>
<style>%s</style>
<script>%s</script></head>
<body>%s <b>no results</b></body></html> """ % (css, script, preamble)

    l = []

    for i in range(xmlResults.size()):
      result = xmlResults.next().asString(None)
      try:
        xmlFragment = libxml2.parseDoc( result )
      except:
        print "NotWellFormed"
        container.close()
        return "bad query: %s" % q

      # Metadata used to contextualize the hits within this document.
      xpathChannel = "//item/@channel"
      xpathTitle = "//title"
      xpathDate = "//date"
      xpathLink = "//link"

      # Rerun the original query within the found document to isolate the hits.
      xpathHits = q

      try:
        channel = xmlFragment.xpathEval(xpathChannel)[0].content
        title = xmlFragment.xpathEval(xpathTitle)[0].content
        date = xmlFragment.xpathEval(xpathDate)[0].content
        link = urllib.unquote(xmlFragment.xpathEval(xpathLink)[0].content)
        hits = xmlFragment.xpathEval(xpathHits)
      except:
        print "CannotSearchWithinFoundDocument"
        xmlFragment.freeDoc()
        container.close()
        return "bad query: %s" % q

      try:
        frags = ''
        for j in range(len(hits)):
          frags = frags + '<div>' + hits[j].serialize() + \
            '</div><hr align="left" width="20%"/>'
      except:
        print "CannotSerializeHits"
        xmlFragment.freeDoc()
        container.close()
        return "bad query: %s" % q

      xmlFragment.freeDoc()

      xhtml = '<p><a href="%s">%s</a> (<b>%s</b>, <b>%s</b>)</p> %s' % (
        link, title, channel, date, frags )

      # Accumulate (date, html) pairs so the results can be date-sorted later.
      l.append ( ( date, xhtml ) )

    container.close()

    # Newest first; XPath itself can't sort, so Python does.
    l.sort( lambda x, y: cmp ( y[0], x[0] ) )

    xhtml = ''
    for i in range(len(l)):
      xhtml = xhtml + l[i][1]

    page = """
<html><head><title>XPath query of Jon's feeds</title><style>%s</style>
<script>%s</script></head><body>%s %s</body></html>
""" % (css, script, preamble, xhtml)

    return page
    


Not shown in this example is DB XML's indexing. I've tinkered with it, but the database of blog items I've built so far -- about 4000 documents, and 7MB -- doesn't really require indexes yet. As the database grows, I plan to learn more about DB XML's (somewhat bewildering) set of indexing strategies.
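If you want to experiment, declaring an index in the DB XML 1.x line looks something like the following, if memory serves. The method name and the index-specification syntax have shifted between releases, so check this against the documentation for your version:

# Assumed DB XML 1.x declareIndex() API: index the hash attribute for
# string-equality lookups. Verify the spec string for your release.
container.open(None, DB_CREATE)
container.declareIndex(None, "", "hash", "node-attribute-equality-string")
container.close()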

I'm also looking forward to using this growing repository to try out other XML-aware databases, and to explore XQuery as well as XPath implementations. It's been a bit of a revelation to see how easily random HTML scooped up from RSS feeds can be XHTML-ized and exposed to structured search. Can this approach scale to the level required to create network effects? I'd love to see Technorati, Feedster, and eventually Google give it a try.