Lightweight XML Search Servers
January 21, 2004
- A mini-httpd built on Python's BaseHTTPServer module
- An XSLT stylesheet, with markers for a replaceable XPath query
- Python bindings for libxml2 and libxslt
To run the script you need two modules not included with the standard Python kit: libxml2 and libxslt. These, in turn, depend on corresponding Gnome C libraries. Because I wanted to colocate my search server with an instance of Radio UserLand, I started the project on a Windows box. From a standing start, with Python not yet installed, I was up and running with Python and libxml2/libxslt in a matter of minutes. First I installed ActiveState's binary distribution of Python. Next I installed Stephane Bidoul's binary distribution of the libxml2/libxslt bindings for Python, which bundles private copies of the required Gnome libraries.
When I later replicated the same setup on Mac OS X, things went much less smoothly. Even though Panther comes with the latest version 2.3 of Python, and includes libxml2/libxslt binaries, it's not clear how to materialize the Python bindings to libxml2/libxslt. For libxml2, I found the answer on Kimbro Staken's weblog. The trick, Kimbro discovered, was to configure libxml2 (I used version 2.6.4) like so:
Then, rebuild and reinstall libxml2. The procedure for libxslt (using version 1.1.2) is similar, and I did succeed in building the library with its Python bindings, but there were a few twists along the way which, I'm embarrassed to say, I did not document and cannot now reproduce. Perhaps a reader will attach the canonical procedure as a comment to this article. And perhaps a benefactor like Stephane Bidoul will package up the results so that the incredibly useful Python/libxml2/libxslt combination is as easy to materialize on Mac OS X as it is on Windows. I confess that I don't enjoy sorting out build scenarios, and I cherish that level of convenience.
With the infrastructure in place, I started with the same XSLT stylesheet that I use in the client-side solution. It contains two instances of a placeholder, __QUERY__, which is replaced by a user-supplied XPath expression. The first instance occurs in an XSLT template that counts matching elements. The second occurs in another template that packages each matching element as a search result, along with a link to the blog entry containing it. The strategy of the stylesheet, as a whole, is to reduce a single file of concatenated XHTML blog entries to the subset of elements matching the query. Here's the stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

<xsl:output method="html" indent="yes"/>

<xsl:template match="/">
  <div>Results: <xsl:value-of select="count(__QUERY__)"/></div>
  <xsl:apply-templates/>
  <br clear="all"/>
  <p>Entries searched: <xsl:value-of select="count(//item)"/></p>
  <p>Date of oldest entry searched:
     <xsl:value-of select="//item[position()=last()]/date"/></p>
  <p>Date of newest entry searched:
     <xsl:value-of select="//item[position()=1]/date"/></p>
</xsl:template>

<xsl:template match="__QUERY__">
  <p><b>
  <a>
    <xsl:attribute name="href">http://weblog.infoworld.com/udell/<xsl:value-of select="ancestor::item/date"/>.html#<xsl:value-of select="ancestor::item/@num"/></xsl:attribute>
    <xsl:value-of select="ancestor::item/title"/>
  </a>
  (<xsl:value-of select="ancestor::item/date"/>)
  </b>
  <div>
    <xsl:copy-of select="."/>
    <xsl:if test="local-name(.)='blockquote' and @cite != ''">
      Source: <xsl:value-of select="@cite"/>
    </xsl:if>
  </div>
  <hr align="left" width="20%"/>
  </p>
</xsl:template>

<xsl:template match="text()"/>

</xsl:stylesheet>
A sample blog entry in the XML file that is transformed by the search looks like this:
<item num="a883">
  <title>Server-based XPath search</title>
  <date>2004/01/10</date>
  <body>
    <p> ...arbitrary XHTML content... </p>
  </body>
</item>
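The searchable file is just many such items concatenated under a single root element. As a hypothetical illustration (the element names match the sample above, but the builder function and its input shape are my own invention, not part of the original tooling), assembling such a file with the standard library's ElementTree might look like this:

```python
import xml.etree.ElementTree as ET

def build_blogfile(entries):
    """entries: iterable of (num, title, date, body_xml) tuples (assumed shape)."""
    root = ET.Element('blog')
    for num, title, date, body_xml in entries:
        item = ET.SubElement(root, 'item', num=num)
        ET.SubElement(item, 'title').text = title
        ET.SubElement(item, 'date').text = date
        # the body arrives as an XHTML string; parse it into the tree
        item.append(ET.fromstring(body_xml))
    return ET.tostring(root, encoding='unicode')

doc = build_blogfile([('a883', 'Server-based XPath search',
                       '2004/01/10', '<body><p>...</p></body>')])
```

The point is only that the aggregate file needs a single root element so that XPath queries like //item can range over every entry at once.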
Preparing the XSLT stylesheet
A simple search-and-replace against the text of the stylesheet would be one way to replace __QUERY__ with the user-supplied XPath expression. And in fact, it would probably be the simplest way. But since the client-side solution used DOM scripting to do that job, I took that route for the server-side solution too. In the process, I learned something about using default namespaces with XPath expressions: you can't. The author of libxml2, Daniel Veillard, explains:
XPath was not designed to be used in isolation, as a result there is no way to provide namespace bindings within the XPath syntax. There are APIs to provide these bindings in the libxml2 XPath module. [gnome.org mail archives]
The question has come up repeatedly. Elsewhere on the list, Veillard writes:
You cannot define a default namespace for XPath, period, don't try you can't, the XPath spec does not allow it. This can't work and trying to add it to libxml2 would simply make it non-conformant to the spec. In a nutshell forget about using default namespace within XPath expressions, this will never work, you can't! [gnome.org mail archives]
For our purposes here, this means that if you want to find the XSLT templates containing __QUERY__ using an XPath expression like this:

//xsl:template[@match='__QUERY__']

then you must first create a context and register the xsl prefix with that context, as shown in this function:
def createStylesheet(q):
    styledoc = libxml2.parseDoc( getXsltTemplate() )
    ctxt = styledoc.xpathNewContext()
    ctxt.xpathRegisterNs('xsl', 'http://www.w3.org/1999/XSL/Transform')
    # rewrite the select attribute that counts matching elements;
    # xpathEval returns a list of nodes, so operate on the first match
    xpath = "//xsl:template//xsl:value-of[@select='count(__QUERY__)']"
    nodelist = ctxt.xpathEval(xpath)
    nodelist[0].setProp('select', 'count(%s)' % q)
    # rewrite the match attribute of the result-packaging template
    xpath = "//xsl:template[@match='__QUERY__']"
    nodelist = ctxt.xpathEval(xpath)
    nodelist[0].setProp('match', q)
    style = libxslt.parseStylesheetDoc(styledoc)
    ctxt.xpathRegisteredNsCleanup()
    ctxt.xpathFreeContext()
    return style
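Incidentally, the same constraint shows up in Python's standard library: ElementTree's findall() takes the prefix-to-URI bindings as a separate argument, because the XPath syntax itself has no place to declare them. A small self-contained sketch:

```python
import xml.etree.ElementTree as ET

stylesheet = """<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:template match='__QUERY__'/>
</xsl:stylesheet>"""

doc = ET.fromstring(stylesheet)

# The prefix binding rides alongside the expression, not inside it
ns = {'xsl': 'http://www.w3.org/1999/XSL/Transform'}
templates = doc.findall('xsl:template', ns)
print(templates[0].get('match'))
```

Whether the API is libxml2's xpathRegisterNs or ElementTree's namespaces dictionary, the binding has to be supplied out of band.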
Alternatively, you could just use Python and libxml2 to locate nodes using XPath search, then operate on them -- in this case, for example, to update the select and match attributes of templates in the XSLT stylesheet. As Kimbro Staken, Sam Ruby, Simon Willison, and others have recently pointed out, this is a wildly convenient technique for XML processing.
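For comparison, the plain search-and-replace approach mentioned earlier needs no XML machinery at all. A minimal sketch (the function name is mine, not part of the original script):

```python
def prepare_stylesheet(template_text, query):
    # Splice the user's XPath expression into both marker positions
    return template_text.replace('__QUERY__', query)

fragment = "<xsl:template match=\"__QUERY__\"/>"
print(prepare_stylesheet(fragment, "//p[contains(., 'XPath')]"))
```

The catch is quoting: a query containing a double quote would corrupt the attribute value, which is one argument for doing the substitution through the DOM, where the library handles escaping.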
I've had a long love affair with tiny Web servers and Python's BaseHTTPServer appeals to my sense of minimalism. So my XML search server extends that class like so:
class myHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):

    def query(self, q):
        self.style = createStylesheet(q)
        self.doc = libxml2.parseFile( blogfile )
        try:
            self.result = self.style.applyStylesheet(self.doc, None)
        except:
            self.cleanup()
            return "bad query: %s" % q
        strResult = self.style.saveResultToString(self.result)
        css = getCss()
        script = getScript()
        preamble = getPreamble(q)
        page = """
<html><head><title>XPath query of Jon's Radio</title><style>%s</style>
<script>%s</script></head><body>%s %s</body></html>
""" % (css, script, preamble, strResult)
        self.cleanup()
        return page

    def cleanup(self):
        try:
            self.doc.freeDoc()
            self.style.freeStylesheet()
            self.result.freeDoc()
        except:
            pass

    def do_GET(self):
        xhtml = self.send_head()
        self.wfile.write(xhtml)

    def send_head(self):
        # requestline looks like: GET /?QUERY HTTP/1.0
        q = self.requestline.split()[1]
        q = re.sub(r'^/\?', '', q)
        q = urllib.unquote(q)
        xhtml = self.query(q)
        # substitute the error message before computing Content-Length,
        # so the header always matches the body actually sent
        if len(xhtml) >= maxchars:
            xhtml = "query returned more than %d characters" % maxchars
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.send_header("Content-Length", len(xhtml))
        self.end_headers()
        return xhtml
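The query-extraction step in send_head() can also be spelled with the standard library's URL parser rather than a regular expression. A sketch using the Python 3 module names (the Python 2 equivalents live in urlparse and urllib):

```python
from urllib.parse import urlsplit, unquote

def extract_query(requestline):
    # requestline looks like: "GET /?<url-encoded-xpath> HTTP/1.0"
    path = requestline.split()[1]          # take the URL field
    return unquote(urlsplit(path).query)   # everything after the '?'

print(extract_query("GET /?//item%5B@num%3D'a883'%5D HTTP/1.0"))
```

Either way, the entire query string after the ? is treated as one XPath expression, which is what lets users compose searches directly on the URL-line.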
In a long-running process such as a Web server, it's particularly important to free up the memory allocated by the parsed XML documents and the stylesheet. That isn't very Pythonic, but since the Python bindings to libxml2 and libxslt map closely to the supporting C libraries, you'll quickly chew up all available memory if you don't free those resources. And as a matter of fact, eagle-eyed readers will have noted that the createStylesheet() method defined above does not free the styledoc object it creates. That's because I haven't figured out which dependent object has to be freed first -- perhaps a reader will post the solution in a comment.
Starting up the server
The startup code is typical for a BaseHTTPServer-derived service:
# requires: from socket import gethostbyname, gethostname

def run(port, HandlerClass=myHTTPRequestHandler,
        ServerClass=BaseHTTPServer.HTTPServer, protocol="HTTP/1.0"):
    server_address = ('', port)
    HandlerClass.protocol_version = protocol
    httpd = ServerClass(server_address, HandlerClass)
    sa = httpd.socket.getsockname()
    print "Serving HTTP on", sa[0], "port", sa[1], "..."
    httpd.serve_forever()

if __name__ == '__main__':
    maxchars = 250000
    if sys.argv[1:]:
        port = int(sys.argv[1])
    else:
        port = 8000
    if sys.argv[2:]:
        externalhost = sys.argv[2]
    else:
        externalhost = gethostbyname(gethostname())
    if sys.argv[3:]:
        blogfile = sys.argv[3]
    else:
        blogfile = 'blog.xml'
    run(port)
I also experimented with a threaded listener, which involved making these changes:
class myHTTPServer(SocketServer.ThreadingMixIn, BaseHTTPServer.HTTPServer):
    pass

def run(port, HandlerClass=myHTTPRequestHandler,
        ServerClass=myHTTPServer, protocol="HTTP/1.0"):
This works for me on Mac OS X, but crashes and burns on Windows Server 2003 for reasons I haven't figured out. Although the basic nonthreaded service works reasonably well, I'm seeing some timeouts that suggest threading might be appropriate.
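For readers on a current Python, the same mixin pattern survives under renamed modules (http.server and socketserver replaced BaseHTTPServer and SocketServer). A self-contained sketch with a stand-in handler, not the search handler above:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from socketserver import ThreadingMixIn

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """Handle each request in its own thread."""
    daemon_threads = True  # don't block interpreter exit on live requests

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet

# To serve: ThreadedHTTPServer(('', 8000), HelloHandler).serve_forever()
```

Note that the mixin is combined with the server class, not the request handler; mixing it into the handler is a category error that fails at bind time.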
Of course there are a million ways to skin this cat, with or without Python, using any kind of HTTP server and XML infrastructure. The point of this article is really to motivate you to experiment with structured search. To that end, the Python/libxslt combo (build issues notwithstanding) makes for a convenient playground.
Against the blog file, which is nearing a megabyte in size, the server exhibits subsecond response time when XPath expressions test for attribute equality. Queries that rely only on contains() clauses take several seconds. That wouldn't surprise me were it not for the fact that the client-side solution, whether in MSIE (using the MSXML processor) or Mozilla (using the Transformiix processor), delivers instantaneous results even for contains() queries.
In its first day of use, the service provoked some blog discussion [1, 2] on the question of whether it's reasonable to expose any Web-connected database to arbitrary query. My own little service, although it won't return a result file larger than a quarter-megabyte, makes no effort to limit the resources burned at the server in order to satisfy a query. If we can't find a way to enforce such limits, then there isn't a bright future for query on the URL-line. OpenLink Software's Kingsley Idehen believes that we can enforce such limits, and I hope he's right, because this is powerful stuff.
Consider, for example, the notion of categorizing blog entries. A while back I abandoned the practice of tagging my entries with category labels, because it felt too static and I had a hunch a more dynamic method would become available. Just today, I posted an entry that nicely illustrates that dynamic approach. It contains a reference to a book, and it also contains a reference to an MP3 clip. Were I still categorizing entries, I'd have been tempted to assign this one to a books category, and also an AV category. Instead, I wrote a couple of queries that parse out entries along those dimensions:
books: //p[contains(.//a/@href,'amazon.com') or
AV clips: //p[contains(.//a/@href,'.mp3') or contains(.//a/@href,'.wav') or
contains(.//a/@href,'.mov') or contains(.//a/@href,'.ram')]
Interestingly, although the item that prompted me to write these two queries is found by both of them, they return different paragraphs from the item -- one which embeds the MP3 link, another which embeds the book link. And since the results are merely a reduction of the original XHTML content, they are contextualized by the surrounding paragraph elements and retain their look and feel.
Not bad for one little script backed by an XML file. Of course the all-in-one-XML-file approach is bound to run out of gas sooner or later. So next time we'll look at a database-backed alternative based on Sleepycat's Berkeley DB XML.