
Lightweight XML Search Servers

January 21, 2004

Jon Udell

In earlier installments of this column, I made the case for exploiting the combination of XHTML and CSS, and I demonstrated a browser-based technique for searching XHTML/CSS content using XPath. I've been using a variation of this technique on my weblog. It works, and it's been a revelation to see what's possible using nothing but JavaScript, the DOM, and the XML and XSLT processors embedded in both MSIE and Mozilla. But as my corpus of well-formed content grew it became impractical to load it into a browser in order to perform structured searches. In the spirit of the lightweight browser-based solution, I decided to create an equally lightweight server-based version based on Python and libxml2/libxslt. (I'm also working on a slightly heftier, but more powerful variation based on Berkeley DB XML; we'll explore that one next time.) The minimal search server is packaged into a single Python script which contains:

  • A mini-httpd that extends Python's BaseHTTPServer class

  • An XSLT stylesheet, with markers for a replaceable XPath query

  • Various search-page elements: HTML forms, CSS stylesheet, JavaScript helper

Python bindings for libxml2 and libxslt

To run the script you need two modules not included with the standard Python kit: libxml2 and libxslt. These, in turn, depend on corresponding Gnome C libraries. Because I wanted to colocate my search server with an instance of Radio UserLand, I started the project on a Windows box. From a standing start, with Python not yet installed, I was up and running with Python and libxml2/libxslt in a matter of minutes. First I installed ActiveState's binary distribution of Python. Next I installed Stephane Bidoul's binary distribution of the libxml2/libxslt bindings for Python, which bundles private copies of the required Gnome libraries.

When I later replicated the same setup on Mac OS X, things went much less smoothly. Even though Panther comes with the latest version 2.3 of Python, and includes libxml2/libxslt binaries, it's not clear how to materialize the Python bindings to libxml2/libxslt. For libxml2, I found the answer on Kimbro Staken's weblog. The trick, Kimbro discovered, was to configure libxml2 (I used version 2.6.4) like so:

./configure --with-python=/System/Library/Frameworks/Python.framework/Versions/2.3/

Then rebuild and reinstall libxml2. The procedure for libxslt (using version 1.1.2) is similar, and I did succeed in building the library with its associated Python bindings. There were a few twists along the way, though, which I'm embarrassed to say I did not document and cannot now reproduce. Perhaps a reader will attach the canonical procedure as a comment to this article. And perhaps a benefactor like Stephane Bidoul will package up the results so that the incredibly useful Python/libxml2/libxslt combination is as easy to materialize on Mac OS X as it is on Windows. I confess that I don't enjoy sorting out build scenarios, and I cherish that level of convenience.

With the infrastructure in place, I started with the same XSLT stylesheet that I use in the client-side solution. It contains two instances of a placeholder, __QUERY__, which is replaced by a user-supplied XPath expression. The first instance occurs in an XSLT template that counts matching elements. The second instance occurs in another XSLT template that packages each matching element as a search result, along with a link to the blog entry containing it. The strategy of the stylesheet, as a whole, is to reduce a single file of concatenated XHTML blog entries to the subset of elements matching the query. Here's the stylesheet:

<?xml version="1.0"?>

<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>



<xsl:output method="html" indent="yes"/>



<xsl:template match="/">

<div>Results:

<xsl:value-of select="count(__QUERY__)"/>

</div>

<xsl:apply-templates />

<br clear="all"/>

<p>Entries searched: <xsl:value-of select="count(//item)" /></p>

<p>Date of oldest entry searched: 

   <xsl:value-of select="//item[position()=last()]/date" /></p>

<p>Date of newest entry searched:

   <xsl:value-of select="//item[position()=1]/date" /></p>

</xsl:template>



<xsl:template match="__QUERY__" >

<p><b>

<a>

<xsl:attribute name="href">

http://weblog.infoworld.com/udell/<xsl:value-of select="ancestor::item/date" />.html#<xsl:value-of select="ancestor::item/@num"/>

</xsl:attribute>

<xsl:value-of select="ancestor::item/title" />

</a> (<xsl:value-of select="ancestor::item/date" />)

</b>

<div>

<xsl:copy-of select="."/>

<xsl:if test="local-name(.)='blockquote' and @cite != ''">

Source: <xsl:value-of select="@cite"/>

</xsl:if>

</div>

<hr align="left" width="20%" />

</p>

</xsl:template>



<xsl:template match="text()"/>



</xsl:stylesheet>

A sample blog entry in the XML file that is transformed by the search looks like this:

<item num="a883">

<title>Server-based XPath search</title>

<date>2004/01/10</date>

<body>

<p>

...arbitrary XHTML content...

</p>

</body>

</item>
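To make the reduction concrete, here is a minimal sketch of the same idea using only Python 3's standard library instead of XSLT. It assumes the items are wrapped in a hypothetical `<blog>` root element, and it relies on `xml.etree.ElementTree`, whose path language is only a small subset of XPath (there is no `contains()`, for instance). It illustrates the shape of the result rather than replacing the stylesheet. The function name `search` and the sample data are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Sample data in the item format shown above; the <blog> wrapper is an assumption.
BLOG = """
<blog>
  <item num="a883">
    <title>Server-based XPath search</title>
    <date>2004/01/10</date>
    <body><p>First paragraph.</p><blockquote cite="http://example.org/">Quoted.</blockquote></body>
  </item>
  <item num="a882">
    <title>An older entry</title>
    <date>2004/01/09</date>
    <body><p>Another paragraph.</p></body>
  </item>
</blog>
"""

def search(root, path):
    """Return (title, date, link, markup) for every element matching path."""
    results = []
    for item in root.iter("item"):
        # ElementTree has no ancestor axis, so search item by item instead.
        for match in item.findall(path):
            date = item.findtext("date")
            link = "http://weblog.infoworld.com/udell/%s.html#%s" % (date, item.get("num"))
            markup = ET.tostring(match, encoding="unicode").strip()
            results.append((item.findtext("title"), date, link, markup))
    return results

root = ET.fromstring(BLOG)
hits = search(root, ".//blockquote[@cite]")
print("Results:", len(hits))
for title, date, link, markup in hits:
    print(title, date, link)
    print(markup)
print("Entries searched:", len(root.findall("item")))
```

As in the stylesheet, each hit carries the title, date, and permalink of the entry that contains it, plus a copy of the matching element itself.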

Preparing the XSLT stylesheet

A simple search-and-replace against the text of the stylesheet would be one way to replace __QUERY__ with the user-supplied XPath expression. And in fact, it would probably be the simplest way. But since the client-side solution used DOM scripting to do that job, I took that route for the server-side solution too. In the process, I learned something about using namespaces with XPath expressions: you can't. The author of libxml2, Daniel Veillard, explains:

XPath was not designed to be used in isolation, as a result there is no way to provide namespace bindings within the XPath syntax. There are APIs to provide these bindings in the libxml2 XPath module. [gnome.org mail archives]

The question has come up repeatedly. Elsewhere on the list, Veillard writes:

You cannot define a default namespace for XPath, period, don't try you can't, the XPath spec does not allow it. This can't work and trying to add it to libxml2 would simply make it non-conformant to the spec. In a nutshell forget about using default namespace within XPath expressions, this will never work, you can't! [gnome.org mail archives]

For our purposes here, this means that if you want to find the XSLT templates containing __QUERY__ using an XPath expression like this:

//xsl:template//xsl:value-of[@select='count(__QUERY__)']

then you must first create a context and register the xsl namespace with that context, as shown in this method:

import libxml2

import libxslt



def createStylesheet(q):

    styledoc = libxml2.parseDoc( getXsltTemplate() )

    ctxt = styledoc.xpathNewContext()

    ctxt.xpathRegisterNs('xsl','http://www.w3.org/1999/XSL/Transform')



    xpath = "//xsl:template//xsl:value-of[@select='count(__QUERY__)']"

    nodelist = ctxt.xpathEval(xpath)

    nodelist[0].setProp('select', 'count(%s)' % q)



    xpath = "//xsl:template[@match='__QUERY__']"

    nodelist = ctxt.xpathEval(xpath)

    nodelist[0].setProp('match', q)



    style = libxslt.parseStylesheetDoc(styledoc)



    ctxt.xpathRegisteredNsCleanup()

    ctxt.xpathFreeContext()



    return style
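The same idea, binding the prefix outside the path expression, shows up in other XPath-like APIs. As a hedged illustration using Python 3's standard library rather than libxml2, `xml.etree.ElementTree` accepts a prefix-to-URI dictionary alongside each path. The cut-down stylesheet and the function name `fill_placeholders` are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
NS = {"xsl": XSL_NS}  # prefix-to-URI bindings, supplied outside the path itself

# A cut-down stand-in for the article's stylesheet (an assumption for this sketch).
TEMPLATE = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <div>Results: <xsl:value-of select="count(__QUERY__)"/></div>
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="__QUERY__">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>"""

def fill_placeholders(template_text, query):
    """Parse the stylesheet and write the query into both placeholder attributes."""
    root = ET.fromstring(template_text)
    # The "xsl" prefix in each path is resolved through the NS dictionary.
    counter = root.find(".//xsl:value-of[@select='count(__QUERY__)']", NS)
    counter.set("select", "count(%s)" % query)
    matcher = root.find(".//xsl:template[@match='__QUERY__']", NS)
    matcher.set("match", query)
    return root

filled = fill_placeholders(TEMPLATE, "//p[@class='quote']")
for template in filled.findall("xsl:template", NS):
    print(template.get("match"))
```

Without the `NS` argument, the prefixed paths cannot be resolved, which is the ElementTree counterpart of skipping `xpathRegisterNs` in the libxml2 version.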

Alternatively you could just use Python and libxml2 to locate nodes using XPath search, then operate on them -- in this case, for example, to update the select and match attributes of templates in the XSLT stylesheet. As Kimbro Staken, Sam Ruby, Simon Willison, and others have recently pointed out, this is a wildly convenient technique for XML processing.
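For comparison, the plain search-and-replace route mentioned at the start of this section might look like the sketch below. It is written for Python 3 and is not taken from the downloadable script. It assumes the placeholder always sits inside a double-quoted attribute, as it does in the stylesheet above, so the query is escaped for that context before substitution. The name `substitute_query` and the shortened stylesheet are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened stand-in for the stylesheet text (an assumption for this sketch).
TEMPLATE = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/"><xsl:value-of select="count(__QUERY__)"/></xsl:template>
<xsl:template match="__QUERY__"><xsl:copy-of select="."/></xsl:template>
</xsl:stylesheet>'''

def substitute_query(template_text, query):
    """Replace every __QUERY__ marker with an attribute-safe copy of the query."""
    # escape() handles &, < and >; the extra mapping covers double quotes.
    safe = escape(query, {'"': "&quot;"})
    return template_text.replace("__QUERY__", safe)

query = '//p[contains(., "R&D")]'
stylesheet_text = substitute_query(TEMPLATE, query)

# Parsing the result confirms that it is still well-formed XML.
parsed = ET.fromstring(stylesheet_text)
print(parsed[0][0].get("select"))
```

Skipping the escaping step would let a query containing a double quote or an ampersand produce a stylesheet that no longer parses.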

Subclassing BaseHTTPServer

I've had a long love affair with tiny Web servers and Python's BaseHTTPServer appeals to my sense of minimalism. So my XML search server extends that class like so:

import BaseHTTPServer

import re

import urllib



class myHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):



    def query(self,q):



        self.style  = createStylesheet(q)

        self.doc = libxml2.parseFile( blogfile )



        try:

            self.result = self.style.applyStylesheet(self.doc, None)

        except:

            self.cleanup()

            return "bad query: %s" % q



        strResult = self.style.saveResultToString(self.result)



        css = getCss()

        script = getScript()

        preamble = getPreamble(q)



        page = """

<html><head><title>XPath query of Jon's Radio</title><style>%s</style>

<script>%s</script></head><body>%s %s</body></html>

""" % (css, script, preamble, strResult)



        self.cleanup()

        return page



    def cleanup(self):

        try:

            self.doc.freeDoc()

            self.style.freeStylesheet()

            self.result.freeDoc()

        except:

            pass



    def do_GET(self):

        xhtml = self.send_head()

        self.wfile.write(xhtml)



    def send_head(self):

        q = self.requestline.split()[1]

        q = re.sub('^/\?','',q)

        q = urllib.unquote(q)

        xhtml = self.query(q)



        # Apply the size cap first so Content-Length matches the body sent.

        if len(xhtml) >= maxchars:

            xhtml = "query returned more than %d characters" % maxchars



        self.send_response(200)

        self.send_header("Content-type", "text/html")

        self.send_header("Content-Length", str(len(xhtml)))

        self.end_headers()

        return xhtml

The query() method begins by passing the user-supplied XPath expression to the createStylesheet() method we've already seen. Then it parses the file of XML content and transforms it using the modified stylesheet. The results of the transformation are serialized and interpolated into the delivered Web page, along with HTML, CSS, and JavaScript elements.
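Before query() runs, send_head() has to recover the XPath expression from the request line. The sketch below isolates that step in Python 3, where `urllib.unquote` has moved to `urllib.parse.unquote`. The function name `extract_query` is an assumption, and the sample request line is built here only for illustration.

```python
import re
from urllib.parse import quote, unquote

def extract_query(request_line):
    """Pull the XPath expression out of an HTTP request line."""
    path = request_line.split()[1]      # "GET /?... HTTP/1.0" -> "/?..."
    path = re.sub(r"^/\?", "", path)    # drop the leading "/?"
    return unquote(path)                # undo percent-encoding

# Build a request line the way a browser would encode it, then decode it again.
xpath = "//p[contains(.//a/@href,'.mp3')]"
line = "GET /?%s HTTP/1.0" % quote(xpath, safe="")
print(line)
print(extract_query(line))
```

Because the request line is split on whitespace, this approach depends on the client percent-encoding any spaces in the query, which browsers normally do.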

In a long-running process such as a Web server, it's particularly important to free up the memory allocated by the parsed XML documents and the stylesheet. That isn't very Pythonic, but since the Python bindings to libxml2 and libxslt map closely to the supporting C libraries, you'll quickly chew up all available memory if you don't free those resources. And as a matter of fact, eagle-eyed readers will have noted that the createStylesheet() method defined above does not free the styledoc object it creates. That's because I haven't figured out which dependent object has to be freed first -- perhaps a reader will post the solution in a comment.
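One way to make this kind of cleanup harder to forget is to register each release call as soon as the resource exists. The following Python 3 sketch uses `contextlib.ExitStack` for that. `FakeResource` is a stand-in invented for the example, not a libxml2 class; with the real bindings the registered callbacks would be the `freeDoc()` and `freeStylesheet()` methods used in cleanup() above. It does not settle the question of which libxml2 object must be freed first; it only guarantees that whatever was registered is released, in reverse order, even when the query fails.

```python
from contextlib import ExitStack

class FakeResource:
    """Stand-in for a parsed document or stylesheet (not a libxml2 class)."""
    def __init__(self, name, log):
        self.name, self.log = name, log

    def free(self):
        self.log.append(self.name)

def run_query(log, fail=False):
    """Acquire three resources and release them all, whatever happens."""
    with ExitStack() as stack:
        style = FakeResource("style", log)
        stack.callback(style.free)      # registered first, released last
        doc = FakeResource("doc", log)
        stack.callback(doc.free)
        if fail:
            raise ValueError("bad query")   # simulates a failed transformation
        result = FakeResource("result", log)
        stack.callback(result.free)
        return "page"

released = []
print(run_query(released), released)
```

A resource that was never created is never registered, so there is no need for a catch-all `except` around the release calls.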

Starting up the server

The startup code is typical for a BaseHTTPServer-derived service:

import sys

from socket import gethostbyname, gethostname



def run(port, HandlerClass = myHTTPRequestHandler,

        ServerClass = BaseHTTPServer.HTTPServer, protocol="HTTP/1.0"):

    server_address = ('', port)

    HandlerClass.protocol_version = protocol

    httpd = ServerClass(server_address, HandlerClass)

    sa = httpd.socket.getsockname()

    print "Serving HTTP on", sa[0], "port", sa[1], "..."

    httpd.serve_forever()



if __name__ == '__main__':

    maxchars = 250000

    if sys.argv[1:]:

        port = int(sys.argv[1])

    else:

        port = 8000

    if sys.argv[2:]:

        externalhost = sys.argv[2]   # a host name, not a number

    else:

        externalhost = gethostbyname(gethostname())

    if sys.argv[3:]:

        blogfile = sys.argv[3]

    else:

        blogfile = 'blog.xml'

    run(port)

I also experimented with a threaded listener, which involved making these changes:

import SocketServer

class myHTTPServer (SocketServer.ThreadingMixIn, 

    BaseHTTPServer.HTTPServer): pass



def run(port, HandlerClass = myHTTPRequestHandler, 

        ServerClass = myHTTPServer, protocol="HTTP/1.0"):

This works for me on Mac OS X, but crashes and burns on Windows Server 2003 for reasons I haven't figured out. Although the basic nonthreaded service works reasonably well, I'm seeing some timeouts that suggest threading might be appropriate.
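For readers trying the threaded variant today, here is a minimal sketch in Python 3, where `BaseHTTPServer` and `SocketServer` have become `http.server` and `socketserver`. The mix-in goes in front of `HTTPServer`, the server class, rather than the request handler. `EchoHandler` and `make_server` are hypothetical names, and the handler only echoes the request path rather than running a query.

```python
import http.server
import socketserver

class ThreadedHTTPServer(socketserver.ThreadingMixIn, http.server.HTTPServer):
    """Handle each request in its own thread."""
    daemon_threads = True  # do not let worker threads block interpreter exit

class EchoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that just echoes the request path."""

    def do_GET(self):
        body = ("you asked for %s" % self.path).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet

def make_server(port=8000):
    """Create (but do not start) a threaded server on the loopback interface."""
    return ThreadedHTTPServer(("127.0.0.1", port), EchoHandler)
```

Calling `make_server().serve_forever()` starts the listener; passing port 0 asks the operating system for any free port.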

Of course there are a million ways to skin this cat, with or without Python, using any kind of HTTP server and XML infrastructure. The point of this article is really to motivate you to experiment with structured search. To that end, the Python/libxslt combo (build issues notwithstanding) makes for a convenient playground.

If you want to try this yourself, the various pieces of the solution -- HTML, CSS, JavaScript, XSLT -- are included within the downloadable script. For sample content, you can use the XML file containing my last several hundred blog entries, available here; by default the server looks for a file called blog.xml in its current directory.

Against that file, which is nearing a megabyte in size, the server exhibits subsecond response time when XPath expressions test for attribute equality. Queries that use only contains() clauses take several seconds. That wouldn't surprise me were it not for the fact that the client-side solution, with either MSIE (using the MSXML processor) or Mozilla (using the Transformiix processor), delivers instantaneous results even for contains() queries.

In its first day of use, the service provoked some blog discussion [1, 2] on the question of whether it's reasonable to expose any Web-connected database to arbitrary query. My own little service, although it won't return a result file larger than a quarter-megabyte, makes no effort to limit the resources burned at the server in order to satisfy a query. If we can't find a way to enforce such limits, then there isn't a bright future for query on the URL-line. OpenLink Software's Kingsley Idehen believes that we can enforce such limits, and I hope he's right because this is powerful stuff.

    


Consider, for example, the notion of categorizing blog entries. A while back I abandoned the practice of tagging my entries with category labels, because it felt too static and I had a hunch a more dynamic method would become available. Just today, I posted an entry that nicely illustrates that dynamic approach. It contains a reference to a book, and it also contains a reference to an MP3 clip. Were I still categorizing entries, I'd have been tempted to assign this one to a books category, and also an AV category. Instead, I wrote a couple of queries that parse out entries along those dimensions:

books: //p[contains(.//a/@href,'amazon.com') or contains(.//a/@href,'allconsuming')]

AV clips: //p[contains(.//a/@href,'.mp3') or contains(.//a/@href,'.wav') or contains(.//a/@href,'.mov') or contains(.//a/@href,'.ram')]

Interestingly, although the item that prompted me to write these two queries is found by both of them, they return different paragraphs from the item -- one which embeds the MP3 link, another which embeds the book link. And since the results are merely a reduction of the original XHTML content, they are contextualized by the surrounding paragraph elements and retain their look and feel.
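One XPath 1.0 detail is worth keeping in mind with queries of this shape: when contains() is handed a node-set such as `.//a/@href`, the node-set is converted to a string using only its first node in document order, so only the first link in each paragraph is tested. A form that tests every link would be `//p[.//a[contains(@href,'amazon.com')]]`. The Python 3 sketch below mimics both behaviors with the standard library; the sample markup, the placeholder URLs, and the function name `paragraphs_linking_to` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder content: three paragraphs with zero, one, or two links.
DOC = ET.fromstring("""
<body>
  <p>Listen to <a href="http://example.org/clip.mp3">the clip</a>.</p>
  <p>See <a href="http://example.org/">this page</a> and
     <a href="http://www.amazon.com/exec/obidos/ASIN/0000000000">the book</a>.</p>
  <p>No links here.</p>
</body>
""")

def paragraphs_linking_to(root, needles, first_link_only=False):
    """Return <p> elements whose link hrefs contain any of the needles."""
    found = []
    for p in root.iter("p"):
        hrefs = [a.get("href") for a in p.iter("a") if a.get("href") is not None]
        if first_link_only:
            hrefs = hrefs[:1]   # mimic XPath 1.0 node-set-to-string conversion
        if any(needle in href for href in hrefs for needle in needles):
            found.append(p)
    return found

books = paragraphs_linking_to(DOC, ["amazon.com", "allconsuming"])
books_first = paragraphs_linking_to(DOC, ["amazon.com", "allconsuming"], first_link_only=True)
av_clips = paragraphs_linking_to(DOC, [".mp3", ".wav", ".mov", ".ram"])
print(len(books), len(books_first), len(av_clips))
```

In the sample, the book link is the second link in its paragraph, so the first-link-only variant misses it while the any-link variant finds it.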

Not bad for one little script backed by an XML file. Of course the all-in-one-XML-file approach is bound to run out of gas sooner or later. So next time we'll look at a database-backed alternative based on Sleepycat's Berkeley DB XML.