
Introducing the Amara XML Toolkit
As part of my roundup of Python data bindings, I introduced my own Anobind project. Over the column's history, I've also developed other code to meet some need emphasized in one of the previous articles. I recently collected all of these various little projects together into one open source package of XML processing add-ons, Amara XML Toolkit. Amara is meant to complement 4Suite in that 4Suite works towards fidelity to XML technical ideals, while Amara works towards fidelity to Python conventions, taking maximum advantage of Python's strengths. The main components of Amara XML Toolkit are the following:
- Bindery: data binding tool. This is the code that was formerly available standalone as "Anobind", but with extensive improvements and additions, including a move of the fundamental framework from DOM to SAX.
- Scimitar: an implementation of the ISO Schematron schema language for XML. It also used to be a standalone project, which I've announced here in the past. It converts Schematron files to standalone Python scripts.
- domtools: helper routines for working with Python DOMs, many of which first made their appearance in previous articles such as "Generating DOM Magic" and "Location, Location, Location."
- saxtools: helper frameworks and routines for easier use of Python's SAX implementation, many of which first made their appearance in previous articles such as "Decomposition, Process, Recomposition".
- Flextyper: implementation of Jeni Tennison's Data Type Library Language (DTLL), which is on track to become part 5 of ISO Document Schema Definition Languages (DSDL). You can use Flextyper to generate Python modules containing data type classes that can be used with 4Suite's RELAX NG library, although it won't come into its full usefulness until the next release of 4Suite.
In this article I introduce parts of Amara, focusing on several
little, common tasks it's supposed to help with. Some of these are
tasks you will recognize from earlier articles in this column. Amara
requires Python 2.3 or later and 4Suite 1.0a4 or later. I used Python
2.3.4 to run all listings presented, working with Amara 0.9.2. With
the prerequisites in place, installation is the usual matter of
python setup.py install.
Best of SAX and DOM
The very first sample task needs very little preamble. See listing 1, a form of the address label example I so often use.
Listing 1: Sample XML file (labels.xml) containing Address Labels
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label id="tse" added="2003-06-20">
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
    <quote>
      <emph>Midwinter Spring</emph> is its own season…
    </quote>
  </label>
  <label id="ep" added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
    <quote>
      What thou lovest well remains, the rest is dross…
    </quote>
  </label>
  <!-- Throw in 10,000 more records just like this -->
  <label id="lh" added="2004-11-01">
    <name>Langston Hughes</name>
    <address>
      <street>10 Bridge Tunnel</street>
      <city>Harlem</city>
      <state>NY</state>
    </address>
  </label>
</labels>
Listing 2 is code to print out all people and their street addresses.
Listing 2 (listing2.py): Amara Pushdom code to print out all people and their street addresses
from amara import domtools
for docfrag in domtools.pushdom('/labels/label', source='labels.xml'):
    label = docfrag.firstChild
    name = label.xpath('string(name)')
    city = label.xpath('string(address/city)')
    print name, 'of', city
The code is extremely simple, and it prints just what a quick glance would lead you to expect:
$ python listing2.py
Thomas Eliot of Stamford
Ezra Pound of Hailey
Langston Hughes of Harlem
The trick is how it does this. domtools.pushdom is a generator
which yields a DOM document fragment at a time, such that the entire document
is broken down into a series of subtrees given by the pattern passed in: /labels/label.
The full document is never in memory (in fact, the code never takes up much
more memory than it takes to maintain a DOM node for a single label element).
If, as the comment in listing 1 suggests, there were 10,000 more label elements,
the memory usage wouldn't be much greater; although, if your loop iterates
faster than Python can reclaim each discarded node, you might want to add
an explicit gc.collect() at the end of the loop. Each node yielded
by the generator is a basic Domlette node, with all the usual properties
and methods this makes available, including the useful xpath() method.
Compare listing 2 above to listing 4 of "Decomposition, Process, Recomposition" and you'll get a sense of how wrapping up the ideas from that article simplifies things.
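If you don't have Amara installed, you can get a feel for the subtree-at-a-time idea with nothing but the standard library. What follows is a rough sketch in Python 3 using xml.dom.pulldom. It is not how domtools.pushdom is implemented, it matches on a bare tag name rather than a pattern such as /labels/label, and the names iter_subtrees, text_of, people_and_cities and SAMPLE are my own inventions for illustration.

```python
from xml.dom import pulldom

# Trimmed-down stand-in for labels.xml (hypothetical sample data)
SAMPLE = """<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="ep"><name>Ezra Pound</name>
    <address><city>Hailey</city></address></label>
</labels>"""


def iter_subtrees(xml_text, tag):
    """Yield one fully built DOM element per matching start tag."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Build only this element's subtree, then hand it over
            events.expandNode(node)
            yield node


def text_of(element, tag):
    """Concatenate the text children of the first descendant named tag."""
    found = element.getElementsByTagName(tag)
    if not found:
        return ''
    return ''.join(child.data for child in found[0].childNodes
                   if child.nodeType == child.TEXT_NODE)


def people_and_cities(xml_text):
    """Collect (name, city) pairs, one label subtree at a time."""
    return [(text_of(label, 'name'), text_of(label, 'city'))
            for label in iter_subtrees(xml_text, 'label')]


for name, city in people_and_cities(SAMPLE):
    print(name, 'of', city)
```

As with pushdom, only the subtree currently being processed needs to be fully expanded, so the cost per iteration should stay roughly constant however many label elements the file holds.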
If DOM Is Too Lame for You
Pythonic APIs are meant to make life easier for the many users who find DOM too arcane and alien for use in Python. Almost all of the earlier article on Anobind is still valid in Amara. The biggest change is in the imports. I also added some concessions to people who really don't want to worry about URL and file details and the like; the eight lines of listing 1 from the earlier article can now be reduced to two lines (the top two of listing 3). Listing 3 is an example of how I could use Amara Bindery to display names and cities from listing 1, the functional equivalent of listing 2.
Listing 3: Amara Bindery code to print out all people and their street addresses
from amara import binderytools
container = binderytools.bind_file('labels.xml')
for l in container.labels.label:
    print l.name, 'of', l.address.city
binderytools.bind_file takes a file name, parses the file,
and returns a data binding, rooted at the object container,
which represents the XML root node. Each element is a specialized object
that permits easy access to the data using Python idioms, with object property
names based on the names of XML tags and attributes. In a typical expression
of the prevalent attitude in the Python community, one blogger called it "turning
XML into something useful."
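To make the idea of a data binding concrete, here is a toy sketch in Python 3 built on xml.etree.ElementTree. It is emphatically not Amara Bindery's implementation, which does far more; the Bound class and bind_string function are hypothetical names of my own, and the sketch only shows how tag and attribute names can be surfaced as Python attributes.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for labels.xml (hypothetical sample data)
SAMPLE = """<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="ep"><name>Ezra Pound</name>
    <address><city>Hailey</city></address></label>
</labels>"""


class Bound:
    """Toy binding: child elements and XML attributes become Python attributes."""

    def __init__(self, element, siblings=None):
        self._element = element
        # Remember same-named siblings so the object can be iterated over
        self._siblings = siblings if siblings is not None else [element]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails
        if name.startswith('_'):
            raise AttributeError(name)
        children = self._element.findall(name)
        if children:
            return Bound(children[0], children)
        if name in self._element.attrib:
            return self._element.attrib[name]
        raise AttributeError(name)

    def __iter__(self):
        return iter([Bound(e) for e in self._siblings])

    def __str__(self):
        # The text content of the element, with surrounding whitespace trimmed
        return ''.join(self._element.itertext()).strip()


def bind_string(xml_text):
    """Return an object standing in for the document, above the root element."""
    holder = ET.Element('container')
    holder.append(ET.fromstring(xml_text))
    return Bound(holder)


container = bind_string(SAMPLE)
for l in container.labels.label:
    print(l.name, 'of', l.address.city)
```

The loop at the bottom deliberately mirrors listing 3, so you can see how little machinery is needed for the basic effect, and by contrast how much a real binding tool has to add (namespaces, mixed content, mutation, and so on).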
The Natural Next Step: Push Binding
One possible problem with listing 3 is that the entire XML document is converted to Python objects, which could mean a lot of memory usage for large documents, for example, if labels.xml were expanded to have 10,000 entries in label elements. Amara Bindery does mitigate this a little bit by using SAX to create data bindings, but this may not be good enough. What would be great is some way to use the pushdom approach from listing 2 while still having the ease-of-use advantage of Amara Bindery. This option is available as the Push binding, illustrated in listing 4.
Listing 4: Amara Push binding code to print out all people and their street addresses
from amara import binderytools
for subtree in binderytools.pushbind('/labels/label', source='labels.xml'):
    print subtree.label.name, 'of', subtree.label.address.city
You use patterns just as in listing 2 to break up the document, and just
as in listing 2, binderytools.pushbind is a generator that instantiates
part of the document at a time, thus never using up the memory needed to
represent the entire document. This time, however, the values yielded by
the generator are subtrees of an Amara binding rather than DOM nodes, so you can use the more natural Python idioms to access the data, if you
prefer.
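The closest standard-library analogue I know of for this combination of incremental parsing and friendlier objects is ElementTree's iterparse. The sketch below is Python 3, is not Amara's pushbind, and again matches on a bare tag name rather than a pattern; iter_labels and SAMPLE are hypothetical names used only for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for labels.xml (hypothetical sample data)
SAMPLE = b"""<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="ep"><name>Ezra Pound</name>
    <address><city>Hailey</city></address></label>
</labels>"""


def iter_labels(stream, tag='label'):
    """Yield each completed element named tag, then discard its contents."""
    for event, elem in ET.iterparse(stream, events=('end',)):
        if elem.tag == tag:
            yield elem
            # Runs when the consumer asks for the next item:
            # drop the children and text we no longer need
            elem.clear()


for label in iter_labels(io.BytesIO(SAMPLE)):
    print(label.findtext('name'), 'of', label.findtext('address/city'))
```

Note that each element must be used inside the loop body, because it is emptied as soon as the generator resumes. The emptied label elements themselves remain attached to the root, so memory still grows slightly with document size in this simple form.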
Modification
Amara Bindery makes it pretty easy to modify XML objects in place and reserialize them back to XML. As an example, listing 5 makes some changes to one of the label elements and then prints the result back out.
Listing 5: Amara Bindery code to update an address label entry
from amara import binderytools
container = binderytools.bind_file('labels.xml')
#Add a quote to the Langston Hughes entry
#The quote text to be added
new_quote_text = \
    u'\u2026if dreams die, life is a broken winged bird that cannot fly.'
#The ID of Hughes's entry
target_id = u'lh'
#Cull to a list of entries with the desired ID
lh_label = [ label for label in container.labels.label
             if label.id == target_id ]
#We know there's only one, so get it
lh_label = lh_label[0]
#Now we have an element object. Add a child element to the end
#xml_element is a factory method for elements.
#Specify no namespace, 'quote' local name
#Append the result to the label element
lh_label.xml_append(container.xml_element(None, u'quote'))
#Now set the child text on the new quote element
#Notice how easily the new quote element can be accessed
lh_label.quote.xml_children.append(new_quote_text)
#Change the added attribute
#Even easier than adding an element
lh_label.added = u'2005-01-10'
#Print the updated label element back out
print lh_label.xml()
#If you want to print the entire, updated document back out, use
#print container.xml()
Again, the code's comments should provide all the needed explanation.
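For comparison, the same sequence of steps (find the entry by ID, append a quote child, change the added attribute, reserialize) can be sketched with the standard library's ElementTree in Python 3. This is only an analogy to help you follow listing 5, not Amara code; the add_quote function and the SAMPLE data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for labels.xml (hypothetical sample data)
SAMPLE = """<labels>
  <label id="lh" added="2004-11-01"><name>Langston Hughes</name></label>
</labels>"""


def add_quote(xml_text, label_id, quote_text, added):
    """Append a quote to the label with the given ID and update its date."""
    root = ET.fromstring(xml_text)
    # Cull to the entries with the desired ID; assume exactly one match
    matches = [label for label in root.findall('label')
               if label.get('id') == label_id]
    label = matches[0]
    # Add a new child element at the end and set its text
    quote = ET.SubElement(label, 'quote')
    quote.text = quote_text
    # Change the added attribute
    label.set('added', added)
    # Serialize just the updated label element
    return ET.tostring(label, encoding='unicode')


print(add_quote(
    SAMPLE, 'lh',
    '\u2026if dreams die, life is a broken winged bird that cannot fly.',
    '2005-01-10'))
```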
Taming SAX
Sometimes, though perhaps rarely, you may need to process huge files that
cannot easily be broken into simple patterns. You may need to write SAX code,
but of course as discussed often in this column, SAX isn't always an easy
tool to use. Amara provides several tools to help make SAX easier to use,
including a module saxtools.xpattern_sax_state_machine which
can write SAX state machines for you, given patterns. In fact, this module
is used in domtools.pushdom and binderytools.pushbind.
There is also a framework, Tenorsax, to help effectively linearize SAX logic.
With Tenorsax, you register callback generators rather than callback functions,
and, using the magic of Python generators, each callback actually receives
multiple SAX events within its logic, so you can use local variables
and manage state more easily than in most SAX code. Listing 6 is an example
using Tenorsax to also go through the labels XML file and print names and
addresses. Tenorsax is overkill for such a purpose, and you've already seen
how to accomplish it much more easily with Amara, but it should illustrate
the workings of Tenorsax.
Listing 6: Tenorsax code to print out all people and their street addresses
import sys
from xml import sax
from amara import saxtools
class label_handler:
    def __init__(self):
        self.event = None
        self.top_dispatcher = {
            (saxtools.START_ELEMENT, None, u'labels'):
            self.handle_labels,
            }
        return
    def handle_labels(self, end_condition):
        dispatcher = {
            (saxtools.START_ELEMENT, None, u'label'):
            self.handle_label,
            }
        #First round through the generator corresponds to the
        #start element event
        yield None
        #delegate is a generator that handles all the events "within"
        #this element
        delegate = None
        while not self.event == end_condition:
            delegate = saxtools.tenorsax.event_loop_body(
                dispatcher, delegate, self.event)
            yield None
        #Element closed. Wrap up
        return
    def handle_label(self, end_condition):
        dispatcher = {
            (saxtools.START_ELEMENT, None, 'name'):
            self.handle_leaf,
            (saxtools.START_ELEMENT, None, 'city'):
            self.handle_leaf,
            }
        delegate = None
        yield None
        while not self.event == end_condition:
            delegate = saxtools.tenorsax.event_loop_body(
                dispatcher, delegate, self.event)
            yield None
        return
    def handle_leaf(self, end_condition):
        element_name = self.event[2]
        yield None
        name = u''
        while not self.event == end_condition:
            if self.event[0] == saxtools.CHARACTER_DATA:
                name += self.params
            yield None
        #Element closed. Wrap up
        print name,
        if element_name == u'name':
            print 'of',
        else:
            print
        return
if __name__ == "__main__":
    parser = sax.make_parser()
    #The "consumer" is our own handler
    consumer = label_handler()
    #Initialize Tenorsax with handler
    handler = saxtools.tenorsax(consumer)
    #Resulting tenorsax instance is the SAX handler
    parser.setContentHandler(handler)
    parser.setFeature(sax.handler.feature_namespaces, 1)
    parser.parse('labels.xml')
Tenorsax allows you to define a hierarchy of generators which handle subtrees
of the document. Each generator gets multiple SAX events. Tenorsax takes
advantage of the fact that Python generators can be suspended and resumed.
Each time a Tenorsax handler generator yields, it is suspended, and when
the next SAX event comes along, it's resumed. The current event information
is always available as self.event. Tenorsax allows you to define
dispatcher dictionaries which map SAX event details to subsidiary generators.
The current subsidiary generator is called delegate in listing
6, because the relationship between a generator and its subsidiaries basically
forms a delegation pattern.
Tenorsax automatically creates and runs the delegates within the main event
loop, while not self.event == end_condition. The body of this
loop is usually a call back to the Tenorsax framework, although you can also
add specialized logic for the events that you want each generator to handle
itself. end_condition is provided by Tenorsax so that generators
know when to quit. For a start element, the end condition is set up to be
the event that marks the corresponding end element.
handle_leaf is an example of linear logic across SAX events.
It aggregates text from multiple character events into one string, either
the contents of the name element or the city element.
It builds this using a local variable, which is not possible with regular
SAX. Usually, you'd have to use an instance variable that is governed by a state
machine (so that it is not grabbing text from the wrong events).
Listing 6 is certainly much more ponderous than all the other sample code so far.
Again, you would not usually bring out the heavy artillery of Tenorsax unless
you had logic that was very hard to force into one of the other facilities
in Amara.
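The suspend-and-resume trick is easier to appreciate in isolation. The following is a minimal sketch in Python 3 of the general idea, using only xml.sax and a generator that receives events through send(). It is not Tenorsax and has none of its dispatching or delegation; GeneratorDriver, collect_names and names_in are hypothetical names, and the event tuples are a simplification of my own.

```python
import xml.sax

# Trimmed-down stand-in for labels.xml (hypothetical sample data)
SAMPLE = """<labels>
  <label id="tse"><name>Thomas Eliot</name></label>
  <label id="ep"><name>Ezra Pound</name></label>
</labels>"""


class GeneratorDriver(xml.sax.ContentHandler):
    """Resume a consumer generator once for every SAX event."""

    def __init__(self, consumer):
        super().__init__()
        self._consumer = consumer
        next(self._consumer)  # Advance to the first yield

    def startElement(self, name, attrs):
        self._consumer.send(('start', name))

    def endElement(self, name):
        self._consumer.send(('end', name))

    def characters(self, content):
        self._consumer.send(('text', content))

    def endDocument(self):
        self._consumer.close()


def collect_names(results):
    """Consumer: gather each name element's text in a plain local variable."""
    while True:
        event = yield
        if event == ('start', 'name'):
            text = ''
            event = yield
            # Linear logic spanning several SAX events
            while event != ('end', 'name'):
                if event[0] == 'text':
                    text += event[1]
                event = yield
            results.append(text)


def names_in(xml_text):
    results = []
    handler = GeneratorDriver(collect_names(results))
    xml.sax.parseString(xml_text.encode('utf-8'), handler)
    return results


print(names_in(SAMPLE))
```

The inner while loop in collect_names plays the same role as the loop in handle_leaf: the text accumulates in a local variable across several character events, with no state machine in sight.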
Wrapping Up
There is a lot more to Amara XML Toolkit than I can cover in this article. The aim of the project is versatility—giving the developer many flexible ways of processing XML using idioms and native advantages of Python. Because of the popularity of languages such as Java, many XML standards have evolved in directions that don't match up with Python's strengths. Amara looks to bridge that gap. If you're curious about the project name, see this posting.
As often happens in the holiday season, activity has been a bit slow. Holiday revels are also a good excuse for an announcement entitled "xsdb does XML, SQL is dead as disco." Seems Aaron Watters's xsdb project, "a framework for distributing querying and combining tabular data over the Internet," has been renamed "xsdbXML." The announcement is a bit sketchy on the role of XML, but looking at the use cases, it seems xsdbXML is based on pure XML expressions of relational tables, meaning it effectively short-circuits SQL (which is, after all, but one realization of the relational calculus, and one that many relational purists consider flawed). The queries are also expressed in XML. This is a very interesting project, and coming from the brains behind Gadfly, you can expect the highest technical standards. Perhaps less whimsical announcements will help it gain the notice it deserves.
Walter Dörwald announced XIST 2.8. "XIST is an extensible HTML/XML generator written in Python. XIST is also a DOM parser (built on top of SAX2) with a very simple and Pythonesque tree API." This release requires Python 2.4, and there have been some API changes. See the announcement.
Dave Kuhlman announced generateDS 8a. generateDS is a data binding tool that generates Python data structures from a W3C XML Schema. I covered generateDS in an earlier article. This release adds support for mixed content, structured type extensions (limited support), attribute groups, and substitution groups (limited support). See the announcement.