
Lightweight XML Search Servers
In earlier installments of this column, I made the case for exploiting the combination of XHTML and CSS, and I demonstrated a browser-based technique for searching XHTML/CSS content using XPath. I've been using a variation of this technique on my weblog. It works, and it's been a revelation to see what's possible using nothing but JavaScript, the DOM, and the XML and XSLT processors embedded in both MSIE and Mozilla. But as my corpus of well-formed content grew, it became impractical to load it into a browser in order to perform structured searches. In the spirit of the lightweight browser-based solution, I decided to create an equally lightweight server-based version based on Python and libxml2/libxslt. (I'm also working on a slightly heftier but more powerful variation based on Berkeley DB XML; we'll explore that one next time.) The minimal search server is packaged into a single Python script which contains:
A mini-httpd that extends Python's BaseHTTPServer class
An XSLT stylesheet, with markers for a replaceable XPath query
Various search-page elements: HTML forms, CSS stylesheet, JavaScript helper
Python bindings for libxml2 and libxslt
To run the script you need two modules not included with the standard Python kit: libxml2 and libxslt. These, in turn, depend on corresponding Gnome C libraries. Because I wanted to colocate my search server with an instance of Radio UserLand, I started the project on a Windows box. From a standing start, with Python not yet installed, I was up and running with Python and libxml2/libxslt in a matter of minutes. First I installed ActiveState's binary distribution of Python. Next I installed Stephane Bidoul's binary distribution of the libxml2/libxslt bindings for Python, which bundles private copies of the required Gnome libraries.
When I later replicated the same setup on Mac OS X, things went much less smoothly. Even though Panther comes with the latest version 2.3 of Python, and includes libxml2/libxslt binaries, it's not clear how to materialize the Python bindings to libxml2/libxslt. For libxml2, I found the answer on Kimbro Staken's weblog. The trick, Kimbro discovered, was to configure libxml2 (I used version 2.6.4) like so:
./configure --with-python=/System/Library/Frameworks/Python.framework/Versions/2.3/
Then, rebuild and reinstall libxml2. The procedure for libxslt (using version 1.1.2) is similar, and I did succeed in building the library with associated Python bindings, but there were a few twists along the way which, I'm embarrassed to say, I did not document and cannot now reproduce. Perhaps a reader of this article will attach the canonical procedure as a comment. And perhaps a benefactor like Stephane Bidoul will package up the results so that the incredibly useful Python/libxml2/libxslt combination is as easy to materialize on Mac OS X as it is on Windows. I confess that I don't enjoy sorting out build scenarios, and I cherish that level of convenience.
With the infrastructure in place, I started with the same XSLT stylesheet that I use in the client-side solution. It contains two instances of a placeholder, __QUERY__, which is replaced by a user-supplied XPath expression. The first instance occurs in an XSLT template that counts matching elements. The second instance occurs in another XSLT template that packages the element as a search result, along with a link to the blog entry containing the matching element. The strategy of the stylesheet, as a whole, is to reduce a single file of concatenated XHTML blog entries to the subset of elements matching the query. Here's the stylesheet:
<?xml version="1.0"?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/">
    <div>Results:
      <xsl:value-of select="count(__QUERY__)"/>
    </div>
    <xsl:apply-templates />
    <br clear="all"/>
    <p>Entries searched: <xsl:value-of select="count(//item)" /></p>
    <p>Date of oldest entry searched:
      <xsl:value-of select="//item[position()=last()]/date" /></p>
    <p>Date of newest entry searched:
      <xsl:value-of select="//item[position()=1]/date" /></p>
  </xsl:template>

  <xsl:template match="__QUERY__" >
    <p><b>
      <a>
        <xsl:attribute name="href">http://weblog.infoworld.com/udell/<xsl:value-of select="ancestor::item/date" />.html#<xsl:value-of select="ancestor::item/@num"/></xsl:attribute>
        <xsl:value-of select="ancestor::item/title" />
      </a> (<xsl:value-of select="ancestor::item/date" />)
      </b>
      <div>
        <xsl:copy-of select="."/>
        <xsl:if test="local-name(.)='blockquote' and @cite != ''">
          Source: <xsl:value-of select="@cite"/>
        </xsl:if>
      </div>
      <hr align="left" width="20%" />
    </p>
  </xsl:template>

  <xsl:template match="text()"/>
</xsl:stylesheet>
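To make the placeholder mechanism concrete, here is a minimal sketch that fills both __QUERY__ markers by plain text substitution. It is an illustration under stated assumptions, not the script's own method: the stylesheet is trimmed to the two marker sites, the helper name fill_query is hypothetical, and the code targets the Python 3 standard library rather than the libxml2 bindings.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Trimmed stand-in for the stylesheet above: only the two marker sites remain.
TEMPLATE = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<div>Results: <xsl:value-of select="count(__QUERY__)"/></div>
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="__QUERY__">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>
"""


def fill_query(template, query):
    """Return the stylesheet text with every __QUERY__ marker replaced.

    Both markers sit inside double-quoted attributes, so the query is
    escaped for that context (&, <, > and ") before it is spliced in.
    """
    safe = escape(query, {'"': "&quot;"})
    return template.replace("__QUERY__", safe)


# A query that mixes both quote styles still yields well-formed XML.
filled = fill_query(TEMPLATE, "//p[contains(., 'XPath') and @class=\"x\"]")
root = ET.fromstring(filled)  # raises ParseError if the result is malformed
```

The escaping step is the one subtlety: a raw double quote or ampersand in the user's expression would otherwise break the attribute it lands in.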
A sample blog entry in the XML file that is transformed by the search looks like this:
<item num="a883">
  <title>Server-based XPath search</title>
  <date>2004/01/10</date>
  <body>
    <p>
      ...arbitrary XHTML content...
    </p>
  </body>
</item>
Preparing the XSLT stylesheet
A simple search-and-replace against the text of the stylesheet would be one way to replace __QUERY__ with the user-supplied XPath expression. And in fact, it would probably be the simplest way. But since the client-side solution used DOM scripting to do that job, I took that route for the server-side solution too. In the process, I learned something about declaring namespaces inside XPath expressions: you can't. The author of libxml2, Daniel Veillard, explains:
XPath was not designed to be used in isolation, as a result there is no way to provide namespace bindings within the XPath syntax. There are APIs to provide these bindings in the libxml2 XPath module. [gnome.org mail archives]
The question has come up repeatedly. Elsewhere on the list, Veillard writes:
You cannot define a default namespace for XPath, period, don't try you can't, the XPath spec does not allow it. This can't work and trying to add it to libxml2 would simply make it non-conformant to the spec. In a nutshell forget about using default namespace within XPath expressions, this will never work, you can't! [gnome.org mail archives]
For our purposes here, this means that if you want to find the XSLT templates containing __QUERY__ using an XPath expression like this:
//xsl:template//xsl:value-of[@select='count(__QUERY__)']
then you must first create a context and register the xsl namespace with that context, as shown in this method:
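import libxml2
import libxslt

def createStylesheet(q):
    # Parse the stylesheet text, which still contains the __QUERY__ markers
    styledoc = libxml2.parseDoc( getXsltTemplate() )
    # Bind the xsl prefix on the XPath context, not in the expression
    ctxt = styledoc.xpathNewContext()
    ctxt.xpathRegisterNs('xsl','http://www.w3.org/1999/XSL/Transform')
    # Put the query into the template that counts matches
    xpath = "//xsl:template//xsl:value-of[@select='count(__QUERY__)']"
    nodelist = ctxt.xpathEval(xpath)
    nodelist[0].setProp('select', 'count(%s)' % q)
    # Put the query into the template that formats each match
    xpath = "//xsl:template[@match='__QUERY__']"
    nodelist = ctxt.xpathEval(xpath)
    nodelist[0].setProp('match', q)
    # Compile the modified document, then release the XPath context
    style = libxslt.parseStylesheetDoc(styledoc)
    ctxt.xpathRegisteredNsCleanup()
    ctxt.xpathFreeContext()
    return style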
More generally, you can use Python and libxml2 to locate nodes with an XPath search and then operate on them, as createStylesheet() does when it updates the select and match attributes of templates in the XSLT stylesheet. As Kimbro Staken, Sam Ruby, Simon Willison, and others have recently pointed out, this is a wildly convenient technique for XML processing.
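The same locate-then-modify pattern can be sketched with the standard library alone. The version below is a stand-in rather than the article's script: it assumes Python 3 and xml.etree.ElementTree in place of libxml2, a trimmed stylesheet, and a hypothetical helper name, set_query. ElementTree supports only a subset of XPath, but it follows the rule discussed above: the xsl prefix is bound through a mapping passed alongside the expression, not inside it.

```python
import xml.etree.ElementTree as ET

XSL_URI = "http://www.w3.org/1999/XSL/Transform"
XSL_NS = {"xsl": XSL_URI}        # prefix binding supplied outside the path
ET.register_namespace("xsl", XSL_URI)  # keeps the xsl: prefix when serializing

# Trimmed stand-in for the article's stylesheet.
STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <div>Results: <xsl:value-of select="count(__QUERY__)"/></div>
  </xsl:template>
  <xsl:template match="__QUERY__">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>"""


def set_query(stylesheet_text, query):
    """Parse the stylesheet, swap the query into both marker attributes,
    and return the modified root element."""
    root = ET.fromstring(stylesheet_text)
    counter = root.find(
        ".//xsl:template//xsl:value-of[@select='count(__QUERY__)']", XSL_NS)
    counter.set("select", "count(%s)" % query)
    matcher = root.find("xsl:template[@match='__QUERY__']", XSL_NS)
    matcher.set("match", query)
    return root


root = set_query(STYLESHEET, "//p[@class='x']")
print(ET.tostring(root, encoding="unicode"))
```

Because the attribute values are set through the tree API, the serializer takes care of any escaping the query needs.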
Subclassing BaseHTTPServer
I've had a long love affair with tiny Web servers and Python's BaseHTTPServer appeals to my sense of minimalism. So my XML search server extends that class like so:
import BaseHTTPServer
import re
import urllib

# libxml2, createStylesheet(), getCss() and friends are defined elsewhere in the script

class myHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):

    def query(self,q):
        # Build a stylesheet for this query, then transform the blog file
        self.style = createStylesheet(q)
        self.doc = libxml2.parseFile( blogfile )
        try:
            self.result = self.style.applyStylesheet(self.doc, None)
        except:
            self.cleanup()
            return "bad query: %s" % q
        strResult = self.style.saveResultToString(self.result)
        css = getCss()
        script = getScript()
        preamble = getPreamble(q)
        page = """
<html><head><title>XPath query of Jon's Radio</title><style>%s</style>
<script>%s</script></head><body>%s %s</body></html>
""" % (css, script, preamble, strResult)
        self.cleanup()
        return page

    def cleanup(self):
        # Free the C-level objects; stop quietly at the first one that is missing
        try:
            self.doc.freeDoc()
            self.style.freeStylesheet()
            self.result.freeDoc()
        except:
            pass

    def do_GET(self):
        xhtml = self.send_head()
        self.wfile.write(xhtml)

    def send_head(self):
        # The query is everything after "/?" on the request line, URL-decoded
        q = self.requestline.split()[1]
        q = re.sub('^/\?','',q)
        q = urllib.unquote(q)
        xhtml = self.query(q)
        # Apply the size cap before sending Content-Length, so the header
        # always describes the body that is actually returned
        if ( len (xhtml) >= maxchars ):
            xhtml = "query returned more than %d characters" % maxchars
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.send_header("Content-Length", str(len(xhtml)))
        self.end_headers()
        return xhtml
The query() method begins by passing the user-supplied XPath expression to the createStylesheet() method we've already seen. Then it parses the file of XML content and transforms it using the modified stylesheet. The results of the transformation are serialized and interpolated into the delivered Web page, along with HTML, CSS, and JavaScript elements.
In a long-running process such as a Web server, it's particularly important to free up the memory allocated by the parsed XML documents and the stylesheet. That isn't very Pythonic, but since the Python bindings to libxml2 and libxslt map closely to the supporting C libraries, you'll quickly chew up all available memory if you don't free those resources. And as a matter of fact, eagle-eyed readers will have noted that the createStylesheet() method defined above does not free the styledoc object it creates. That's because I haven't figured out which dependent object has to be freed first -- perhaps a reader will post the solution in a comment.
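One way to make the cleanup harder to forget is to register each free call at the moment the object is created. The sketch below is only an illustration of that pattern, assuming Python 3 and its contextlib.ExitStack; the Handle class is a hypothetical stand-in for libxml2/libxslt objects (whose real free methods are the freeDoc() and freeStylesheet() calls shown above), and it does not settle the question of how styledoc should be freed.

```python
from contextlib import ExitStack


class Handle:
    """Hypothetical stand-in for a C-backed object that must be freed by hand."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def free(self):
        self.log.append("free " + self.name)


def run_query(log, fail=False):
    """Create three resources; whatever was created is freed, in reverse
    order of creation, whether the function returns or raises."""
    with ExitStack() as stack:
        doc = Handle("doc", log)
        stack.callback(doc.free)
        style = Handle("style", log)
        stack.callback(style.free)
        if fail:
            raise ValueError("bad query")  # result is never created
        result = Handle("result", log)
        stack.callback(result.free)
        return "page"


log = []
run_query(log)
print(log)
```

Compared with a cleanup() method wrapped in a bare except, this frees exactly the objects that exist and lets real errors propagate.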
Starting up the server
The startup code is typical for a BaseHTTPServer-derived service:
import sys
from socket import gethostname, gethostbyname

def run(port, HandlerClass = myHTTPRequestHandler,
        ServerClass = BaseHTTPServer.HTTPServer, protocol="HTTP/1.0"):
    server_address = ('', port)
    HandlerClass.protocol_version = protocol
    httpd = ServerClass(server_address, HandlerClass)
    sa = httpd.socket.getsockname()
    print "Serving HTTP on", sa[0], "port", sa[1], "..."
    httpd.serve_forever()

if __name__ == '__main__':
    # Upper bound on the size of a returned page
    maxchars = 250000
    # Optional arguments: port, external host name, XML file to search
    if sys.argv[1:]:
        port = int(sys.argv[1])
    else:
        port = 8000
    if sys.argv[2:]:
        externalhost = sys.argv[2]
    else:
        externalhost = gethostbyname(gethostname())
    if sys.argv[3:]:
        blogfile = sys.argv[3]
    else:
        blogfile = 'blog.xml'
    run(port)
I also experimented with a threaded listener, which involved making these changes:
import SocketServer

# The mixin is combined with the server class, not the request handler
class myHTTPServer (SocketServer.ThreadingMixIn,
                    BaseHTTPServer.HTTPServer): pass

def run(port, HandlerClass = myHTTPRequestHandler,
        ServerClass = myHTTPServer, protocol="HTTP/1.0"):
    # ...body unchanged
This works for me on Mac OS X, but crashes and burns on Windows Server 2003 for reasons I haven't figured out. Although the basic nonthreaded service works reasonably well, I'm seeing some timeouts that suggest threading might be appropriate.
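For readers on a current interpreter, the threaded arrangement might look like the following. This is a sketch under assumptions, not a port of the script: it uses Python 3's http.server and socketserver modules, and the names ThreadedHTTPServer, extract_query, EchoHandler, and make_server are hypothetical. The handler simply echoes the decoded query where the real server would run the XSLT transformation.

```python
import http.server
import socketserver
import urllib.parse


class ThreadedHTTPServer(socketserver.ThreadingMixIn, http.server.HTTPServer):
    """HTTP server that handles each request in its own thread.

    The mixin is listed first so that its process_request() overrides
    the single-threaded one inherited from HTTPServer.
    """
    daemon_threads = True  # do not let worker threads block shutdown


def extract_query(path):
    """Strip a leading '/?' from the request path and URL-decode the rest."""
    query = path[2:] if path.startswith("/?") else path
    return urllib.parse.unquote(query)


class EchoHandler(http.server.BaseHTTPRequestHandler):
    """Placeholder handler: returns the decoded query as plain text."""

    def do_GET(self):
        body = ("query: %s" % extract_query(self.path)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Build the server; call serve_forever() on the result to start it."""
    return ThreadedHTTPServer(("", port), EchoHandler)
```

Calling make_server(8000).serve_forever() would start it; a request for /?%2F%2Fp would then be decoded to the XPath expression //p.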
Of course there are a million ways to skin this cat, with or without Python, using any kind of HTTP server and XML infrastructure. The point of this article is really to motivate you to experiment with structured search. To that end, the Python/libxslt combo (build issues notwithstanding) makes for a convenient playground.
If you want to try this yourself, the various pieces of the solution -- HTML, CSS, JavaScript, XSLT -- are included within the downloadable script. For sample content, you can use the XML file containing my last several hundred blog entries, available here; by default the server looks for a file called blog.xml in its current directory.
Against that file, which is nearing a megabyte in size, the server exhibits subsecond response time when XPath expressions test for attribute equality. Queries that only use contains() clauses take several seconds. That wouldn't surprise me were it not for the fact that the client-side solution, either with MSIE (using the MSXML processor) or Mozilla (using the Transformiix processor), delivers instantaneous results even for contains() queries.
In its first day of use, the service provoked some blog discussion [1, 2] on the question of whether it's reasonable to expose any Web-connected database to arbitrary query. My own little service, although it won't return a result file larger than a quarter-megabyte, makes no effort to limit the resources burned at the server in order to satisfy a query. If we can't find a way to enforce such limits, then there isn't a bright future for query on the URL-line. OpenLink Software's Kingsley Idehen believes that we can enforce such limits, and I hope he's right because this is powerful stuff.
Consider, for example, the notion of categorizing blog entries. A while back I abandoned the practice of tagging my entries with category labels, because it felt too static and I had a hunch a more dynamic method would become available. Just today, I posted an entry that nicely illustrates that dynamic approach. It contains a reference to a book, and it also contains a reference to an MP3 clip. Were I still categorizing entries, I'd have been tempted to assign this one to a books category, and also an AV category. Instead, I wrote a couple of queries that parse out entries along those dimensions:
books: //p[contains(.//a/@href,'amazon.com') or contains(.//a/@href,'allconsuming')]
AV clips: //p[contains(.//a/@href,'.mp3') or contains(.//a/@href,'.wav') or contains(.//a/@href,'.mov') or contains(.//a/@href,'.ram')]
Interestingly, although the item that prompted me to write these two queries is found by both of them, they return different paragraphs from the item -- one which embeds the MP3 link, another which embeds the book link. And since the results are merely a reduction of the original XHTML content, they are contextualized by the surrounding paragraph elements and retain their look and feel.
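The effect is easy to reproduce on a toy entry. The sketch below is an approximation under assumptions: the sample item and its URLs are invented, the helper name paragraphs_linking_to is hypothetical, and because ElementTree's XPath subset has no contains(), the substring test is done in Python. It also checks every link in a paragraph, whereas contains() applied to a node-set in XPath 1.0 examines only the first node.

```python
import xml.etree.ElementTree as ET

# Invented sample entry in the article's <item> format.
SAMPLE = """<item num="a1"><title>Book and clip</title><date>2004/01/20</date><body>
<p>Reading <a href="http://www.amazon.com/exec/obidos/ASIN/0000000000">a book</a>.</p>
<p>Listen to <a href="http://example.org/clip.mp3">the clip</a>.</p>
<p>No links here.</p>
</body></item>"""


def paragraphs_linking_to(root, needles):
    """Return the <p> elements that contain a link whose href includes
    at least one of the given substrings."""
    hits = []
    for p in root.iter("p"):
        hrefs = [a.get("href", "") for a in p.iter("a")]
        if any(needle in href for href in hrefs for needle in needles):
            hits.append(p)
    return hits


root = ET.fromstring(SAMPLE)
books = paragraphs_linking_to(root, ["amazon.com", "allconsuming"])
av = paragraphs_linking_to(root, [".mp3", ".wav", ".mov", ".ram"])
# Same item, but each "category" selects a different paragraph from it.
print([p.text for p in books], [p.text for p in av])
```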
Not bad for one little script backed by an XML file. Of course the all-in-one-XML-file approach is bound to run out of gas sooner or later. So next time we'll look at a database-backed alternative based on Sleepycat's Berkeley DB XML.
- Full instructions to build libxslt on Mac OS 10.3
2004-02-02 18:50:50 patrick chanezon [Reply]
Hi Jon, thanks for this very interesting experimentation, as usual.
While installing Syncato on my Mac, I was obliged to figure out the detailed steps to build libxslt, so I figured out I'd post it here as a comment, for future readers (thanks, your link to kimbro's solution for libxml saved me a good deal of time).
The full post is at http://www.chanezon.com/pat/weblog/archives/000131.html
What's missing from your article is mainly the configure options to tell libxslt where to find the libxml2 you just built. I built libxml with no particular option, so it installs in /usr/local. In order to configure libxslt you need to do:
./configure --with-python=/System/Library/Frameworks/Python.framework/Versions/2.3/ --prefix=/usr/local --with-libxml-prefix=/usr/local --with-libxml-include-prefix=/usr/local/include --with-libxml-libs-prefix=/usr/local/lib
This worked very well for me.
- Building libxslt/Python on Mac OS X
2004-01-28 03:30:15 Matt Patterson [Reply]
When I built it, I made sure that I'd built and installed libxml2 first, and then ran configure with
./configure --with-python=/System/Library/Frameworks/Python.framework/Versions/2.3/
as with libxml2, and I had no problems (I'm running 10.3.2 with dev tools).
It might be worth mentioning that you don't need to do any setup.py stuff in the build/python directory: that is handled during the install. It took me a while to figure that out - I'd assumed an extra step was needed.
- One of the million ways to skin this cat...
2004-01-25 03:22:48 Evan Lenz [Reply]
Hi Jon,
Your article inspired me to dust off an old prototype I created when working for XYZFind. Perhaps you'll find it interesting. It's once again accessible here:
http://xmlportfolio.com/transquery/demo
With details on my blog:
http://evan.pcseattle.org/archives/000122.html#000122
Evan
- namespace binding
2004-01-22 06:10:55 Oleg Tkachenko [Reply]
In fact, XPath allows the //xsl:template expression to be written in a namespace-binding-free way:
//*[local-name()='template' and namespace-uri()="http://www.w3.org/1999/XSL/Transform"]
It's long, but means exactly the same.
