Full XML Indexes with Gnosis

December 8, 2004

I covered the data binding feature of David Mertz's Gnosis Utilities in my earlier article, "XML Data Bindings in Python, Part 2". As I mentioned, Gnosis Utilities is a Python package with a variety of utility classes for data management and especially for XML processing. Another useful module in Gnosis is the indexer, which creates full-text XPath indices of XML documents.

This time, I used version 1.1.1 of Gnosis Utilities, which I downloaded and installed similarly to the procedure described in the earlier article:

$ python setup.py build   #Build step still required before install
$ python setup.py install

I'm running Python 2.3.4.

Yet More Location

In my previous article, "Location, Location, Location," I demonstrated some techniques for tracking the position within a document of a given DOM node or SAX event, expressed in XPath. The functionality in the Gnosis indexer follows this theme, except that it executes an overall analysis of the document, looking for text patterns and indexing them in a persistent store so that you can make a query for text and get back a list of matching XPaths.
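To make the idea concrete, here is a minimal sketch of a word-to-XPath index. It is not the Gnosis implementation: it assumes Python 3 and the standard library's xml.etree.ElementTree, it keeps the index in memory rather than in a persistent store, and the name build_index and the sample document are made up for illustration. It also ignores tail text and attributes.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(xml_text):
    """Map each lowercased word to the XPaths of elements whose text holds it."""
    index = defaultdict(list)

    def walk(elem, xpath):
        # Record every distinct word found directly in this element's text
        for word in set(re.findall(r"\w+", (elem.text or "").lower())):
            index[word].append(xpath)
        # Number same-named siblings only when there are several of them
        totals = defaultdict(int)
        for child in elem:
            totals[child.tag] += 1
        seen = defaultdict(int)
        for child in elem:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:
                step += "[%d]" % seen[child.tag]
            walk(child, xpath + "/" + step)

    root = ET.fromstring(xml_text)
    walk(root, "/" + root.tag)
    return index

SAMPLE = """<folder>
  <bookmark><desc>XML news</desc></bookmark>
  <bookmark><desc>Python and XML</desc></bookmark>
  <bookmark><desc>Gardening</desc></bookmark>
</folder>"""

index = build_index(SAMPLE)
print(index["xml"])
# prints ['/folder/bookmark[1]/desc', '/folder/bookmark[2]/desc']
```

A query is then just a dictionary lookup, which is the trade the indexer makes: more work and storage up front in exchange for cheap searches later.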

To get a quick idea of how the indexer works, I first planned to write a brief run-through of its command line features. The command line is the only documented usage of the Gnosis indexer I could find (in a couple of older articles by Mertz himself), and yet I could not get it to work. In fact, after copying files from the installed code base to make the command line scripts available and then examining the code, I couldn't figure out how it might ever have worked. My problems with the command line aren't all that important, though. As always, the crux of the topic in this column is "how do I use the Python API?" As I combed the source code of the Gnosis indexer to figure out how to use it, it became clear to me that the code is an excellent contribution and worth the effort. On to the Python API, then.

Getting Past First Base

Unfortunately, I ran into a few other problems as I worked through the Python API. It feels as if there is some code rust setting in as the xml.indexer module doesn't work properly out of the box with the xml.objectify module it depends on. On indexing a file using XML namespaces (XHTML in the example I tried), I got very strange errors that seemed to stem from the Objectify module. I switched to a simpler file that did not use namespaces: one of the files from Norm Walsh's XML Bookmark Exchange Language (XBEL) file collection. I downloaded whatsnew.xml, but any attempt to index it would result in tracebacks such as the following:

Traceback (most recent call last):
  File "indexer.py", line 145, in ?
    ndx.add_file(file)
  File "indexer.py", line 75, in add_file
    self.recurse_nodes(py_obj)
  File "indexer.py", line 93, in recurse_nodes
    self.recurse_nodes(member[i], xpath.encode('UTF-8'))
  File "indexer.py", line 83, in recurse_nodes
    for membname in currnode.__dict__.keys():

After a lot of debugging, I had a patch (Listing 1) that seemed to get things working properly.

Listing 1 (gnosis-fixes.pat): Patch to fix show-stopper bugs in Gnosis Indexer

--- indexer.py.orig	2004-12-05 20:15:53.555985025 -0700
+++ indexer.py	2004-12-05 20:17:44.531804353 -0700
@@ -79,9 +79,9 @@
         if hasattr(currnode, '_XML'):   # maybe present literal XML of object
             text = currnode._XML.encode('UTF-8')
             self.add_nodetext(text, xpath_suffix)
-        else:
+        elif not isinstance(currnode, unicode):
             for membname in currnode.__dict__.keys():
-                if membname == "__parent__":
+                if membname in ["__parent__", "_seq"]:
                    continue             # ExpatFactory uses bookeeping attribute
                 member = getattr(currnode, membname)
                 if type(member) is InstanceType:
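The two changes in the patch are both guards on the recursion: don't look for members on a plain string, and don't follow bookkeeping attributes. The following is a rough sketch of that pattern only, assuming Python 3 (where str plays the role of unicode). The Node class, the collect_text function, and the PCDATA attribute name are stand-ins for illustration, not the real gnosis.xml.objectify objects or the indexer's own method.

```python
class Node(object):
    """Stand-in for an objectified XML element (hypothetical, for illustration)."""
    def __init__(self, **members):
        self.__dict__.update(members)

SKIP = ("__parent__", "_seq")   # bookkeeping attributes, as named in the patch

def collect_text(node, path=""):
    """Return (path, text) pairs, recursing only into objects that have members."""
    if isinstance(node, str):   # the patch tests for unicode under Python 2
        return [(path, node)]
    found = []
    for name in sorted(vars(node)):
        if name in SKIP:
            continue            # would loop back to the parent or hit a list
        found.extend(collect_text(getattr(node, name), path + "/" + name))
    return found

# A tiny tree with the kind of back-references the patch steps around
desc = Node(PCDATA="What is new in XML")
bookmark = Node(desc=desc, _seq=[desc])
desc.__parent__ = bookmark

print(collect_text(bookmark, "/bookmark"))
# prints [('/bookmark/desc/PCDATA', 'What is new in XML')]
```

Without the string test, vars() fails on the text value because strings carry no __dict__; without the skip list, the walk follows __parent__ back up the tree and never terminates.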

And Around to Home Plate

With this patch in place and further API sleuthing, I was able to work up the basic example code in Listing 2.

Listing 2: Example code for using the Gnosis Indexer

import os
import tempfile
from gnosis.xml import indexer

INDEXDB_NAME = 'SAMPLE_NDX'
INDEXDB = os.path.join(tempfile.gettempdir(), INDEXDB_NAME)
FILES_TO_INDEX = ['whatsnew.xml']

def substring_after(outer, inner):
    "Convenience function similar to XPath substring-after()"
    return inner.join(outer.split(inner)[1:])

#Create the persistent index
ndx = indexer.XML_Indexer(INDEXDB=INDEXDB)
ndx.load_index()
for fname in FILES_TO_INDEX:
    ndx.add_file(fname)
ndx.save_index()

#Use the index to find elements with occurrences of search terms
WORDS_TO_FIND = ['xml']
result = ndx.find(WORDS_TO_FIND)
#Extract the XPath from each index locator (part after the first '::')
xpaths = [ substring_after(loc, '::') for loc in result.values() ]
print 'Search words', WORDS_TO_FIND, 'found at XPaths:'
for xp in xpaths:
    print '\t', xp

You will want to pay special attention to how you construct your instance of indexer.XML_Indexer. The first question is how to store the index data. The meat of Gnosis Indexer is in the module gnosis.indexer, which gets imported into gnosis.xml.indexer. Gnosis Indexer supports several persistence mechanisms, including flat file, Python shelve, Python pickle, XML pickle, and a couple of home-grown pickle formats compressed with the zlib module. The module gnosis.indexer sets up a preferred method (based on speed and space benchmarks) in the following line:

PreferredIndexer = SlicedZPickleIndexer

If you would prefer to use a different persistence mode, you would want to modify PreferredIndexer before it is imported into gnosis.xml.indexer, probably with code such as the following snippet:

from gnosis import indexer
indexer.PreferredIndexer = indexer.FlatIndexer
#Will replace the earlier imported symbol "indexer"
#Remember that import is much like assignment in Python
from gnosis.xml import indexer
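If the remark about import behaving like assignment seems odd, here is a small standard-library illustration of the same name being rebound by a second import. It assumes Python 3 and has nothing to do with Gnosis itself; the variable names first and second are only there to show what the name pointed to at each step.

```python
from os import path
first = path            # at this point "path" is the os.path module

from sys import path    # same name, new binding
second = path           # now "path" is the sys.path list

print(type(first).__name__, type(second).__name__)
# prints: module list
```

The earlier object is not destroyed by the second import; the name simply stops referring to it, which is why the snippet above can safely reuse the name indexer.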

The initializer of indexer.XML_Indexer is where you can set some other important parameters. In Listing 2, I set the name of the index database. See the code for the method gnosis.indexer.GenericIndexer.configure to learn of other parameters you might wish to tweak. For example, you can pass CASESENSITIVE=True in order to support case-sensitive searches (at the cost of a dramatic ballooning of index database size).

The resulting indexer.XML_Indexer instance provides methods to create and search indices. The only built-in way to index data is by passing in an XML filename, but it shouldn't be too hard to add code to index based on given URLs, strings, or file-like objects. The find method returns a dictionary, result, whose values seem to be of the most interest: each is a locator made up of the file name, the separator '::', and the XPath of a matching element. Running Listing 2 against whatsnew.xml, the value of result is {8: 'whatsnew.xml::/folder/bookmark[1]/desc', 12: 'whatsnew.xml::/folder/bookmark[2]/desc'}. The console output is as follows:

$ python listing2.py
Search words ['xml'] found at XPaths:
        /folder/bookmark[1]/desc
        /folder/bookmark[2]/desc
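Once more than one file is indexed, you will probably want the file name as well as the XPath from each locator. The sketch below works on the result dictionary shown above, copied in as a literal so that it runs without Gnosis installed. It assumes Python 3 (str.partition arrived in Python 2.5, so it is not available in the Python 2.3.4 used for this article), and the by_file grouping is my own illustration, not part of the indexer API.

```python
from collections import defaultdict

def substring_after(outer, inner):
    "Convenience function similar to XPath substring-after()"
    return inner.join(outer.split(inner)[1:])

# The result dictionary reported for whatsnew.xml
result = {8: 'whatsnew.xml::/folder/bookmark[1]/desc',
          12: 'whatsnew.xml::/folder/bookmark[2]/desc'}

# Group the XPaths by the file they were found in
by_file = defaultdict(list)
for key in sorted(result):
    # partition() splits at the first '::' only, as substring_after() does
    fname, sep, xpath = result[key].partition('::')
    by_file[fname].append(xpath)

for fname in sorted(by_file):
    print(fname)
    for xpath in by_file[fname]:
        print('\t' + xpath)
```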

The Gnosis Indexer is very useful stuff. It does generate huge indices, but this is a fair trade-off for its Pythonic simplicity and convenience. The more pressing reservation is that the code is clearly a bit rusty and needs work on bug fixes and usability features. It may make sense to use some of the XPath location techniques presented in my last article to update the XML index generation. Interfaces for indexing from URL, string, or file-like sources would be nice, as well as more encapsulation of features for managing index files (most index methods produce an explosion of files). But this is all par for the open source course. Once you apply my patch for the show-stopper bugs, and assuming problems with namespaces are swiftly fixed, you'll find the module a very useful complement to persistent XML applications such as XML-driven websites.
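Until such interfaces exist, one workaround for string sources is to route the text through a temporary file and hand the file name to add_file, the entry point used in Listing 2. The following is only a sketch, assuming Python 3. The helper add_string is hypothetical, and the RecordingIndexer class is a stand-in that lets the example run without Gnosis. I have not checked how the real indexer behaves when the indexed file later disappears, and the locators would of course carry the temporary file name.

```python
import os
import tempfile

def add_string(ndx, xml_text, suffix=".xml"):
    """Index XML held in a string by routing it through a temporary file."""
    fd, fname = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(xml_text)
        ndx.add_file(fname)     # same call as in Listing 2
    finally:
        os.remove(fname)        # clean up even if indexing fails
    return fname

class RecordingIndexer(object):
    """Stand-in with an add_file() method (not the Gnosis class)."""
    def __init__(self):
        self.seen = []
    def add_file(self, fname):
        with open(fname) as f:
            self.seen.append(f.read())

ndx = RecordingIndexer()
tmp_name = add_string(ndx, "<folder><desc>XML</desc></folder>")
print(ndx.seen)
```

The same approach extends to URLs or file-like objects by reading their content first and passing the text to the helper.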

News in Big Packages

This month, our landscape was dominated by a couple of big announcements: Python 2.4 and PyXML 0.8.4. These announcements are closely tied together. Python 2.4 will refuse to work with any version of PyXML earlier than 0.8.4. The biggest crop of changes in PyXML is in Expat, which has been bumped up to 1.95.8. The pyexpat wrapper API has also been expanded to expose more expat features, particularly the data members CurrentLineNumber, CurrentColumnNumber, and CurrentByteIndex on xml.parsers.expat.XMLParser instances. There are also a few SAX and pyexpat bug fixes. Python 2.4 incorporates similar pyexpat features and fixes some minor SAX bugs. I've only had time for brief review and experimentation with the new packages, but as far as I can tell, the software dependency decision tree I presented in "Practical SAX Notes" is still valid if you also take note of the PyXML 0.8.4 restriction in Python 2.4. Overall, if you're interested in what Python 2.4 brings to the table, don't miss Andrew Kuchling's "What's New In Python 2.4?".

The timing is pretty nifty to illustrate the usage of the new additions to pyexpat in Python 2.4. Listing 3 is a (partial) translation of last article's SAX code for regex search of element content (also Listing 3 in that article).

Listing 3: pyexpat code for regex search of element content

import sys
import re
import xml.parsers.expat

file_to_search = sys.argv[1]

#These values will be used in the nested scope of characters()
parser = xml.parsers.expat.ParserCreate()
search_str = sys.argv[2]
search_pat = re.compile(search_str)

def characters(text):
    line = parser.CurrentLineNumber
    col = parser.CurrentColumnNumber
    results = search_pat.finditer(text)
    for match in results:
        #Display information for each match
        #match.start() is the offset of the match within this chunk of text
        #(match.pos is only the index at which the search began)
        print 'match "' + match.group() + '" at offset', match.start(),
        print 'from line', line,', column', col
    return

parser.CharacterDataHandler = characters

parser.ParseFile(open(file_to_search))

The code is simpler than the earlier SAX code, which is to be expected; SAX trades off simplicity and some speed for a layer that provides interoperability between parsers. The pyexpat version does require basic understanding of nested scopes. Using labels.xml from the last article:

$ python2.4 listing3.py labels.xml "CT"
match "CT" at offset 0 from line 11 , column 13
$ python2.4 listing3.py labels.xml "[0-9]+"
match "3" at offset 0 from line 9 , column 14
match "45" at offset 0 from line 17 , column 14
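For readers on current interpreters, here is a rough Python 3 rendering of the same technique, wrapped in a function so that the nested scope is explicit: the handler reads parser and search_pat from the enclosing function and appends to a list there. The function name find_matches and the inline sample document are my own for illustration; the sample is not labels.xml.

```python
import re
import xml.parsers.expat

def find_matches(xml_bytes, pattern):
    """Return (text, offset, line, column) for each regex match in character data."""
    parser = xml.parsers.expat.ParserCreate()
    search_pat = re.compile(pattern)
    found = []

    def characters(text):
        # parser, search_pat and found come from the enclosing scope
        line = parser.CurrentLineNumber
        col = parser.CurrentColumnNumber
        for match in search_pat.finditer(text):
            found.append((match.group(), match.start(), line, col))

    parser.CharacterDataHandler = characters
    parser.Parse(xml_bytes, True)   # True marks this as the final chunk
    return found

SAMPLE = b"<address>\n  <state>CT</state>\n</address>"

for text, offset, line, col in find_matches(SAMPLE, "CT"):
    print('match "%s" at offset %d from line %d, column %d'
          % (text, offset, line, col))
```

Keep in mind that expat may deliver one run of character data in several chunks, so a pattern that straddles a chunk boundary would be missed by this approach.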

News in Smaller Bites

It doesn't end with the big announcements this month. After far too long a delay between packaged releases (despite heavy activity in CVS code), I announced 4Suite 1.0a4; 4Suite is a comprehensive library for XML processing in Python. Python 2.2.1 is now the minimum required version (we did test with the Python 2.4 betas). Domlette got feature enhancements, including parsing from general entities (in short, XML that is well-formed besides having multiple root elements) and an XPath convenience method for nodes. There are also improvements to packaging and installation code, numerous performance enhancements, and bug fixes. See the announcement. Adam Souzis announced Rx4RDF 0.4.2 mainly as a 4Suite 1.0a4 compatibility update.

Fredrik Lundh released ElementTree 1.2.2, providing "a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory." The news is basically an improved HTML to ElementTree parser. See the announcement.

Paul Boddie mentioned his own entry in the libxml2 wrapper sweepstakes: libxml2dom 0.1.1: "The libxml2dom package provides a traditional DOM wrapper around the Python bindings for libxml2. In contrast to the libxml2 bindings, libxml2dom provides an API reminiscent of minidom, pxdom and other Python-based and Python-related XML toolkits." This, in itself, is a nice step forward, but I wonder whether integration with the newly merged lxml/vlibxml2 efforts would finally be the key to a Pythonic binding to the rich C library.

Finally, the release of Python 2.4 seems a good time to mention a very handy resource I found: Awaretek's link list of Python tutorials. I always knew there were very many Python tutorials, but one hundred? Pythoneers love to teach, it seems. There are topic-specific sections, including one on Python tutorials for HTML and XML processing.