Building Dictionaries With SAX
People have always liked to complain that Python is slow. People have always liked to complain that XML processing is slow. I've always thought both complaints are typically misguided. Neither Python nor XML emerged out of the need for ultrasonic speed at runtime. Python emerged as an expressive language for programming applications, and XML as an expressive system for building data representation into applications. In both cases the goal for the resulting applications is ease of maintenance. This ease primarily comes from the expressiveness that Python and XML allow. One should expect that such gain in productivity would be matched by a corresponding loss in some area, and sometimes the deficit comes from performance. Complaints about the downsides of this tradeoff often stem from premature optimization, when developers insist on basing key decisions on meaningless benchmarks rather than considering the entire profile under which an application operates.
All that having been said, there are certainly real world cases where one has chosen Python and XML tools with an open mind, and then decided that a performance boost is in order upon careful profiling. There are many ways to squeeze performance out of Python XML applications, some of which take advantage of Python or XML's expressive power. Each application will need its own mix of practical optimizations, but in this article I shall present a modest technique that might be handy to have in your war chest.
Dictionaries are perhaps the core data structure most dear to the heart of Python. The basic object model of Python is largely constructed upon dictionaries. They have been optimized to the utmost by some of the brightest minds in the computer industry. Conventional wisdom suggests that if you need to optimize a bit of code in Python, you should take advantage of the excellent Python/C interface. A better bit of advice is that you might first want to check whether you are taking full advantage of dictionaries. One detail to keep in mind is that dictionaries are more suited to fast lookup operations than memory savings: again there is a tradeoff, space for speed in this case.
My pet description of XML's fundamental data model is "labeled strings in nested packages". The labeling and nesting are what differentiate XML from good old comma or tab-delimited value and tabular ("square") data models such as spreadsheets and classic SQL databases. This same labeling and nesting makes for a natural accommodation of data from XML in Python dictionaries. Sometimes writing tortured SAX code for complex processing just doesn't cut it as an alternative to loading documents into heavyweight DOM or the like for XPath access. You may be better off writing SAX code that is just sophisticated enough to shape the key bits from XML into a structure of nested dictionaries.
Take for example the address labels format I've been using as an example in some earlier installments, reproduced in listing 1.
Listing 1: Sample XML file (labels.xml) containing address labels<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
<label added="2003-06-20">
<quote>
<!-- Mixed content -->
<emph>Midwinter Spring</emph> is its own season…
</quote>
<name>Thomas Eliot</name>
<address>
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<state>CT</state>
</address>
</label>
<label added="2003-06-10">
<name>Ezra Pound</name>
<address>
<street>45 Usura Place</street>
<city>Hailey</city>
<state>ID</state>
</address>
</label>
</labels>
Well, it's not an exact reproduction. In earlier articles I used
the character …, mistakenly thinking that it was
the ellipsis in Unicode. In fact, it is NEL, an end of line
character used most often in IBM mainframe data. I erroneously
looked up a table based on Windows code pages rather than Unicode,
which illustrates all too well the danger I cautioned against
earlier of confusing character sets. Starting from this article
I'm using the correct character …. As an
aside, there has been a lot of controversy in the XML community
because NEL, the character I happened to mistake for ellipsis, has
been added as a standard white space character in a revision that
will probably become XML 1.1. Some claim this breaks backward
compatibility and is not worth the convenience of a (relatively)
few mainframe users. The fact that … is
whitespace from the Unicode point of view explains what I thought
was confusing treatment of the character by Python string
processing APIs.
But back to the main topic of the article. Listing 2 contains some SAX code to create a handy dictionary of dictionaries from labels.xml. Python 2.x is the only requirement: I used Python 2.3.3.
Listing 2: a program (listing2.py) to create Python dictionaries from the labels XML formatimport sys
from xml import sax
from textnormalize import text_normalize_filter
#Subclass from ContentHandler in order to gain default behaviors
class label_dict_handler(sax.ContentHandler):
#Define constants for important states
CAPTURE_KEY = 1
CAPTURE_LABEL_ITEM = 2
CAPTURE_ADDRESS_ITEM = 3
def __init__(self):
self.label_dict = {}
#Track the item being constructed in the current dictionary
self._item_to_create = None
self._state = None
return
def startElement(self, name, attributes):
if name == u"label":
self._curr_label = {}
if name == u"address":
self._address = {}
if name == u"name":
self._state = self.CAPTURE_KEY
if name == u"quote":
self._item_to_create = name
self._state = self.CAPTURE_LABEL_ITEM
if name in [u"street", u"city", u"state"]:
self._item_to_create = name
self._state = self.CAPTURE_ADDRESS_ITEM
return
def endElement(self, name):
if name == u"address":
self._curr_label["address"] = self._address
if name in [u"quote", u"name", u"street", u"city", u"state"]:
self._state = None
return
def characters(self, text):
if self._state == self.CAPTURE_KEY:
self.label_dict[text] = self._curr_label
curr_dict = None
if self._state == self.CAPTURE_ADDRESS_ITEM:
curr_dict = self._address
if self._state == self.CAPTURE_LABEL_ITEM:
curr_dict = self._curr_label
print repr(text), curr_dict
if curr_dict is not None:
if curr_dict.has_key(self._item_to_create):
curr_dict[self._item_to_create] += text
else:
curr_dict[self._item_to_create] = text
return
if __name__ == "__main__":
parser = sax.make_parser()
downstream_handler = label_dict_handler()
#upstream: the parser; downstream: the next handler in the chain
filter_handler = text_normalize_filter(parser, downstream_handler)
#XMLFilterBase is designed so that the filter takes on much of the
#interface of the parser itself, including the "parse" method
filter_handler.parse(sys.argv[1])
label_dict = downstream_handler.label_dict
textnormalize is a SAX filter that I contributed to
The Python Cookbook. A SAX parser can
report contiguous text using multiple characters
events. In other words, given the following XML document:
<spam>abc</spam>
The text "abc" could technically be reported as three
characters events: one for the "a", one for the "b" and
a third for the "c". Such an extreme case is unlikely in real life,
but in any case textnormalize ensures that all such
broken-up events are reported to downstream SAX handlers in the
manner most developers would expect. In the above case, the filter
would consolidate the three characters events into a
single one for the entire text node "abc". This frees downstream
handlers from implementing code to deal with broken-up text nodes,
which can be rather cumbersome when mixed into typical SAX-style
state machine logic. I've reproduced the filter class as listing 3
for sake of convenience.
import sys
from xml import sax
from textnormalize import text_normalize_filter
#Subclass from ContentHandler in order to gain default behaviors
class label_dict_handler(sax.ContentHandler):
#Define constants for important states
CAPTURE_KEY = 1
CAPTURE_LABEL_ITEM = 2
CAPTURE_ADDRESS_ITEM = 3
def __init__(self):
self.label_dict = {}
#Track the item being constructed in the current dictionary
self._item_to_create = None
self._state = None
return
def startElement(self, name, attributes):
if name == u"label":
self._curr_label = {}
if name == u"address":
self._address = {}
if name == u"name":
self._state = self.CAPTURE_KEY
if name == u"quote":
self._item_to_create = name
self._state = self.CAPTURE_LABEL_ITEM
if name in [u"street", u"city", u"state"]:
self._item_to_create = name
self._state = self.CAPTURE_ADDRESS_ITEM
return
def endElement(self, name):
if name == u"address":
self._curr_label["address"] = self._address
if name in [u"quote", u"name", u"street", u"city", u"state"]:
self._state = None
return
def characters(self, text):
if self._state == self.CAPTURE_KEY:
self.label_dict[text] = self._curr_label
curr_dict = None
if self._state == self.CAPTURE_ADDRESS_ITEM:
curr_dict = self._address
if self._state == self.CAPTURE_LABEL_ITEM:
curr_dict = self._curr_label
print repr(text), curr_dict
if curr_dict is not None:
if curr_dict.has_key(self._item_to_create):
curr_dict[self._item_to_create] += text
else:
curr_dict[self._item_to_create] = text
return
if __name__ == "__main__":
parser = sax.make_parser()
downstream_handler = label_dict_handler()
#upstream: the parser; downstream: the next handler in the chain
filter_handler = text_normalize_filter(parser, downstream_handler)
#XMLFilterBase is designed so that the filter takes on much of the
#interface of the parser itself, including the "parse" method
filter_handler.parse(sys.argv[1])
label_dict = downstream_handler.label_dict
In listing 2 label_dict_handler is my special SAX
handler for the labels format. It's a somewhat rough case and
does not go into the detail required to handle some edge cases,
but it does accomplish the required task. It constructs a
dictionary of labels, each of which is a dictionary of label
details. The address details are given as yet another nested set
of dictionaries. The main module code constructs the chain from
the parser through the textnormalize filter to the
label_dict_handler handler. I ran this code and prepared to
experiment with the resulting dictionary as follows:
$ python -i listing2.py labels.xml
Python's standard pprint module is a friendly way to
inspect the contents of dictionaries:
>>> import pprint
>>> pprint.pprint(label_dict)
{u'Ezra Pound': {'address': {u'city': u'Hailey',
u'state': u'ID',
u'street': u'45 Usura Place'}},
u'Thomas Eliot': {'address': {u'city': u'Stamford',
u'state': u'CT',
u'street': u'3 Prufrock Lane'},
u'quote':
u'\n \n Midwinter Spring is its own season\u2026\n '}}
>>>
Notice the proper maintenance of the Unicode from the XML document, including the tricky ellipsis character. In this form, the labels data is easy to query:
>>> #Get the Eliot quote
...
>>> print repr(label_dict[u"Thomas Eliot"].get(u"quote"))
u'\n \n Midwinter Spring is its own season\u2026\n '
>>> #How about a Pound quote?
...
>>> print repr(label_dict[u"Ezra Pound"].get("quote"))
None
>>> #Tell me which labels do have quotes
...
>>> print [ k for k, v in label_dict.iteritems() if v.has_key(u"quote") ]
[u'Thomas Eliot']
The equivalent XPath to the last sample query is:
/labels/label[quote]/name
The Python dictionary query is certainly a bit more wordy, and it uses
Python trickery such as list comprehensions; but it's more efficient
than most XPath implementations, and if you're using Python to a fair
extent you're likely to be familiar with all the neat idioms anyway.
Notice that I use the method iteritems to spool out the
contents of the dictionary rather than the more common
items. iteritems requires Python 2.2 and
saves memory by creating the key/value pairs for each item in the
dictionary on demand rather than all at once. On older Python
versions just replace with items.
The dictionary data is just as easily manipulated although it would take some additional work to save any such updates back into the XML format.
Also in Python and XML | |
Should Python and XML Coexist? | |
This article's fare is not a technique for all seasons. For one thing, even a trimmed down SAX handler that does nothing but create dictionaries can become rather complicated and tricky to debug. The code to generate equivalent data bindings and Python structures in any of the many tools I've examined in this column is almost trivial compared to the SAX presented in this article. But using plain SAX to construct dictionaries is likely to offer more raw speed and less memory overhead. It also has the advantage of requiring no third party modules. One possible variation is to use SAX to construct specialized objects rather than dictionaries, which makes the query code a bit neater, but does reintroduce some overhead.
It has been a slow couple of months in the Python/XML community. One interesting development has been the announcement of Rx4Rdf. As its web page says:
Rx4RDF is a specification and reference implementation for querying, transforming and updating W3C's RDF by specifying a deterministic mapping of the RDF model to the XML data model defined by XPath. Rx4RDF shields developers from the complexity of RDF by enabling you to use familiar XML technologies like XPath, XSLT and XUpdate. We call their RDF equivalents RxPath, RxSLT, and RxUpdate respectively.
Browsing through the code, Rx4RDF appears to start with various modules from 4Suite, but combines and enhances these in very interesting ways for an RDF-centric approach to data processing. I especially found intriguing the description of Racoon, which sounds like a variation on the idea behind the popular Cocoon server, but using RDF and Python rather than XML/XSLT and Java.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.