Menu

Building Dictionaries With SAX

January 14, 2004

Uche Ogbuji

People have always liked to complain that Python is slow. People have always liked to complain that XML processing is slow. I've always thought both complaints are typically misguided. Neither Python nor XML emerged out of the need for ultrasonic speed at runtime. Python emerged as an expressive language for programming applications, and XML as an expressive system for building data representation into applications. In both cases the goal for the resulting applications is ease of maintenance. This ease primarily comes from the expressiveness that Python and XML allow. One should expect that such gain in productivity would be matched by a corresponding loss in some area, and sometimes the deficit comes from performance. Complaints about the downsides of this tradeoff often stem from premature optimization, when developers insist on basing key decisions on meaningless benchmarks rather than considering the entire profile under which an application operates.

All that having been said, there are certainly real world cases where one has chosen Python and XML tools with an open mind, and then decided that a performance boost is in order upon careful profiling. There are many ways to squeeze performance out of Python XML applications, some of which take advantage of Python or XML's expressive power. Each application will need its own mix of practical optimizations, but in this article I shall present a modest technique that might be handy to have in your war chest.

The dictionary express

Dictionaries are perhaps the core data structure most dear to the heart of Python. The basic object model of Python is largely constructed upon dictionaries. They have been optimized to the utmost by some of the brightest minds in the computer industry. Conventional wisdom suggests that if you need to optimize a bit of code in Python, you should take advantage of the excellent Python/C interface. A better bit of advice is that you might first want to check whether you are taking full advantage of dictionaries. One detail to keep in mind is that dictionaries are more suited to fast lookup operations than memory savings: again there is a tradeoff, space for speed in this case.

My pet description of XML's fundamental data model is "labeled strings in nested packages". The labeling and nesting are what differentiate XML from good old comma or tab-delimited value and tabular ("square") data models such as spreadsheets and classic SQL databases. This same labeling and nesting makes for a natural accommodation of data from XML in Python dictionaries. Sometimes writing tortured SAX code for complex processing just doesn't cut it as an alternative to loading documents into heavyweight DOM or the like for XPath access. You may be better off writing SAX code that is just sophisticated enough to shape the key bits from XML into a structure of nested dictionaries.

Take for example the address labels format I've been using as an example in some earlier installments, reproduced in listing 1.

Listing 1: Sample XML file (labels.xml) containing address labels
<?xml version="1.0" encoding="iso-8859-1"?>

<labels>

  <label added="2003-06-20">

    <quote>

      <!-- Mixed content -->

      <emph>Midwinter Spring</emph> is its own season&#8230;

    </quote>

    <name>Thomas Eliot</name>

    <address>

      <street>3 Prufrock Lane</street>

      <city>Stamford</city>

      <state>CT</state>

    </address>

  </label>

  <label added="2003-06-10">

    <name>Ezra Pound</name>

    <address>

      <street>45 Usura Place</street>

      <city>Hailey</city>

      <state>ID</state>

    </address>

  </label>

</labels>  

Well, it's not an exact reproduction. In earlier articles I used the character &#133;, mistakenly thinking that it was the ellipsis in Unicode. In fact, it is NEL, an end of line character used most often in IBM mainframe data. I erroneously looked up a table based on Windows code pages rather than Unicode, which illustrates all too well the danger I cautioned against earlier of confusing character sets. Starting from this article I'm using the correct character &#8230;. As an aside, there has been a lot of controversy in the XML community because NEL, the character I happened to mistake for ellipsis, has been added as a standard white space character in a revision that will probably become XML 1.1. Some claim this breaks backward compatibility and is not worth the convenience of a (relatively) few mainframe users. The fact that &#133 is whitespace from the Unicode point of view explains what I thought was confusing treatment of the character by Python string processing APIs.

But back to the main topic of the article. Listing 2 contains some SAX code to create a handy dictionary of dictionaries from labels.xml. Python 2.x is the only requirement: I used Python 2.3.3.

Listing 2: a program (listing2.py) to create Python dictionaries from the labels XML format
import sys

from xml import sax

from textnormalize import text_normalize_filter



#Subclass from ContentHandler in order to gain default behaviors

class label_dict_handler(sax.ContentHandler):

    #Define constants for important states

    CAPTURE_KEY = 1

    CAPTURE_LABEL_ITEM = 2

    CAPTURE_ADDRESS_ITEM = 3



    def __init__(self):

        self.label_dict = {}

        #Track the item being constructed in the current dictionary

        self._item_to_create = None

        self._state = None

        return



    def startElement(self, name, attributes):

        if name == u"label":

            self._curr_label = {}

        if name == u"address":

            self._address = {}

        if name == u"name":

            self._state = self.CAPTURE_KEY

        if name == u"quote":

            self._item_to_create = name

            self._state = self.CAPTURE_LABEL_ITEM

        if name in [u"street", u"city", u"state"]:

            self._item_to_create = name

            self._state = self.CAPTURE_ADDRESS_ITEM

        return



    def endElement(self, name):

        if name == u"address":

            self._curr_label["address"] = self._address

        if name in [u"quote", u"name", u"street", u"city", u"state"]:

            self._state = None

        return



    def characters(self, text):

        if self._state == self.CAPTURE_KEY:

            self.label_dict[text] = self._curr_label

        curr_dict = None

        if self._state == self.CAPTURE_ADDRESS_ITEM:

            curr_dict = self._address

        if self._state == self.CAPTURE_LABEL_ITEM:

            curr_dict = self._curr_label

        print repr(text), curr_dict

        if curr_dict is not None:

            if curr_dict.has_key(self._item_to_create):

                curr_dict[self._item_to_create] += text

            else:

                curr_dict[self._item_to_create] = text

        return



if __name__ == "__main__":

    parser = sax.make_parser()

    downstream_handler = label_dict_handler()

    #upstream: the parser; downstream: the next handler in the chain

    filter_handler = text_normalize_filter(parser, downstream_handler)

    #XMLFilterBase is designed so that the filter takes on much of the

    #interface of the parser itself, including the "parse" method

    filter_handler.parse(sys.argv[1])

    label_dict = downstream_handler.label_dict  

textnormalize is a SAX filter that I contributed to The Python Cookbook. A SAX parser can report contiguous text using multiple characters events. In other words, given the following XML document:

<spam>abc</spam>  

The text "abc" could technically be reported as three characters events: one for the "a", one for the "b" and a third for the "c". Such an extreme case is unlikely in real life, but in any case textnormalize ensures that all such broken-up events are reported to downstream SAX handlers in the manner most developers would expect. In the above case, the filter would consolidate the three characters events into a single one for the entire text node "abc". This frees downstream handlers from implementing code to deal with broken-up text nodes, which can be rather cumbersome when mixed into typical SAX-style state machine logic. I've reproduced the filter class as listing 3 for sake of convenience.

Listing 3: a SAX filter (textnormalize.py) to normalize characters SAX events
import sys

from xml import sax

from textnormalize import text_normalize_filter



#Subclass from ContentHandler in order to gain default behaviors

class label_dict_handler(sax.ContentHandler):

    #Define constants for important states

    CAPTURE_KEY = 1

    CAPTURE_LABEL_ITEM = 2

    CAPTURE_ADDRESS_ITEM = 3



    def __init__(self):

        self.label_dict = {}

        #Track the item being constructed in the current dictionary

        self._item_to_create = None

        self._state = None

        return



    def startElement(self, name, attributes):

        if name == u"label":

            self._curr_label = {}

        if name == u"address":

            self._address = {}

        if name == u"name":

            self._state = self.CAPTURE_KEY

        if name == u"quote":

            self._item_to_create = name

            self._state = self.CAPTURE_LABEL_ITEM

        if name in [u"street", u"city", u"state"]:

            self._item_to_create = name

            self._state = self.CAPTURE_ADDRESS_ITEM

        return



    def endElement(self, name):

        if name == u"address":

            self._curr_label["address"] = self._address

        if name in [u"quote", u"name", u"street", u"city", u"state"]:

            self._state = None

        return



    def characters(self, text):

        if self._state == self.CAPTURE_KEY:

            self.label_dict[text] = self._curr_label

        curr_dict = None

        if self._state == self.CAPTURE_ADDRESS_ITEM:

            curr_dict = self._address

        if self._state == self.CAPTURE_LABEL_ITEM:

            curr_dict = self._curr_label

        print repr(text), curr_dict

        if curr_dict is not None:

            if curr_dict.has_key(self._item_to_create):

                curr_dict[self._item_to_create] += text

            else:

                curr_dict[self._item_to_create] = text

        return



if __name__ == "__main__":

    parser = sax.make_parser()

    downstream_handler = label_dict_handler()

    #upstream: the parser; downstream: the next handler in the chain

    filter_handler = text_normalize_filter(parser, downstream_handler)

    #XMLFilterBase is designed so that the filter takes on much of the

    #interface of the parser itself, including the "parse" method

    filter_handler.parse(sys.argv[1])

    label_dict = downstream_handler.label_dict  

Working with the resulting structure

In listing 2 label_dict_handler is my special SAX handler for the labels format. It's a somewhat rough case and does not go into the detail required to handle some edge cases, but it does accomplish the required task. It constructs a dictionary of labels, each of which is a dictionary of label details. The address details are given as yet another nested set of dictionaries. The main module code constructs the chain from the parser through the textnormalize filter to the label_dict_handler handler. I ran this code and prepared to experiment with the resulting dictionary as follows:

$ python -i listing2.py labels.xml  

Python's standard pprint module is a friendly way to inspect the contents of dictionaries:

>>> import pprint

>>> pprint.pprint(label_dict)

{u'Ezra Pound': {'address': {u'city': u'Hailey',

                             u'state': u'ID',

                             u'street': u'45 Usura Place'}},

 u'Thomas Eliot': {'address': {u'city': u'Stamford',

                               u'state': u'CT',

                               u'street': u'3 Prufrock Lane'},

                   u'quote':

u'\n      \n      Midwinter Spring is its own season\u2026\n    '}}

>>>  

Notice the proper maintenance of the Unicode from the XML document, including the tricky ellipsis character. In this form, the labels data is easy to query:

>>> #Get the Eliot quote

...

>>> print repr(label_dict[u"Thomas Eliot"].get(u"quote"))

u'\n      \n      Midwinter Spring is its own season\u2026\n    '

>>> #How about a Pound quote?

...

>>> print repr(label_dict[u"Ezra Pound"].get("quote"))

None

>>> #Tell me which labels do have quotes

...

>>> print [ k for k, v in label_dict.iteritems() if v.has_key(u"quote") ]

[u'Thomas Eliot']  

The equivalent XPath to the last sample query is:

/labels/label[quote]/name  

The Python dictionary query is certainly a bit more wordy, and it uses Python trickery such as list comprehensions; but it's more efficient than most XPath implementations, and if you're using Python to a fair extent you're likely to be familiar with all the neat idioms anyway. Notice that I use the method iteritems to spool out the contents of the dictionary rather than the more common items. iteritems requires Python 2.2 and saves memory by creating the key/value pairs for each item in the dictionary on demand rather than all at once. On older Python versions just replace with items.

The dictionary data is just as easily manipulated although it would take some additional work to save any such updates back into the XML format.

Limitations and all that

    

Also in Python and XML

Processing Atom 1.0

Should Python and XML Coexist?

EaseXML: A Python Data-Binding Tool

More Unicode Secrets

Unicode Secrets

This article's fare is not a technique for all seasons. For one thing, even a trimmed down SAX handler that does nothing but create dictionaries can become rather complicated and tricky to debug. The code to generate equivalent data bindings and Python structures in any of the many tools I've examined in this column is almost trivial compared to the SAX presented in this article. But using plain SAX to construct dictionaries is likely to offer more raw speed and less memory overhead. It also has the advantage of requiring no third party modules. One possible variation is to use SAX to construct specialized objects rather than dictionaries, which makes the query code a bit neater, but does reintroduce some overhead.

It has been a slow couple of months in the Python/XML community. One interesting development has been the announcement of Rx4Rdf. As its web page says:

Rx4RDF is a specification and reference implementation for querying, transforming and updating W3C's RDF by specifying a deterministic mapping of the RDF model to the XML data model defined by XPath. Rx4RDF shields developers from the complexity of RDF by enabling you to use familiar XML technologies like XPath, XSLT and XUpdate. We call their RDF equivalents RxPath, RxSLT, and RxUpdate respectively.

Browsing through the code, Rx4RDF appears to start with various modules from 4Suite, but combines and enhances these in very interesting ways for an RDF-centric approach to data processing. I especially found intriguing the description of Racoon, which sounds like a variation on the idea behind the popular Cocoon server, but using RDF and Python rather than XML/XSLT and Java.