Introducing the Amara XML Toolkit

January 19, 2005

Uche Ogbuji

As part of my roundup of Python data bindings, I introduced my own Anobind project. Over the column's history, I've also developed other code to meet some need emphasized in one of the previous articles. I recently collected all of these various little projects together into one open source package of XML processing add-ons, Amara XML Toolkit. Amara is meant to complement 4Suite in that 4Suite works towards fidelity to XML technical ideals, while Amara works towards fidelity to Python conventions, taking maximum advantage of Python's strengths. The main components of Amara XML Toolkit are the following:

  • Bindery: data binding tool. The code that was formerly available standalone as "Anobind" but with extensive improvements and additions, including a move of the fundamental framework from DOM to SAX.
  • Scimitar: an implementation of the ISO Schematron schema language for XML. It also used to be a standalone project, which I've announced here in the past. It converts Schematron files to standalone Python scripts.
  • domtools: helper routines for working with Python DOMs, many of which first made their appearance in previous articles such as "Generating DOM Magic" and "Location, Location, Location."
  • saxtools: helper frameworks and routines for easier use of Python's SAX implementation, many of which first made their appearance in previous articles such as "Decomposition, Process, Recomposition."
  • Flextyper: implementation of Jeni Tennison's Data Type Library Language (DTLL), which is on track to become part 5 of ISO Document Schema Definition Languages (DSDL). You can use Flextyper to generate Python modules containing data type classes that can be used with 4Suite's RELAX NG library, although it won't come into its full usefulness until the next release of 4Suite.

In this article I introduce parts of Amara, focusing on several little, common tasks it's supposed to help with. Some of these are tasks you will recognize from earlier articles in this column. Amara requires Python 2.3 or later and 4Suite 1.0a4 or later. I used Python 2.3.4 to run all listings presented, working with Amara 0.9.2. With the prerequisites in place, installation is the usual matter of python setup.py install.

Best of SAX and DOM

The very first sample task needs very little preamble. See listing 1, a form of the address label example I so often use.

Listing 1: Sample XML file (labels.xml) containing Address Labels

<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label id="tse" added="2003-06-20">
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
    <quote>
      <emph>Midwinter Spring</emph> is its own season&#8230;
    </quote>
  </label>
  <label id="ep" added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
    <quote>
      What thou lovest well remains, the rest is dross&#8230;
    </quote>
  </label>
  <!-- Throw in 10,000 more records just like this -->
  <label id="lh" added="2004-11-01">
    <name>Langston Hughes</name>
    <address>
      <street>10 Bridge Tunnel</street>
      <city>Harlem</city>
      <state>NY</state>
    </address>
  </label>
</labels>

Listing 2 is code to print out all people and their street addresses.

Listing 2 (listing2.py): Amara Pushdom code to print out all people and their street addresses

from amara import domtools

for docfrag in domtools.pushdom('/labels/label', source='labels.xml'):
    label = docfrag.firstChild
    name = label.xpath('string(name)')
    city = label.xpath('string(address/city)')
    print name, 'of', city

The code is extremely simple, and it prints just what a quick glance would lead you to expect:

$ python listing2.py
Thomas Eliot of Stamford
Ezra Pound of Hailey
Langston Hughes of Harlem

The trick is how it does this. domtools.pushdom is a generator that yields one DOM document fragment at a time, so that the entire document is broken down into a series of subtrees given by the pattern passed in: /labels/label. The full document is never in memory; in fact, the code never takes up much more memory than it takes to maintain a DOM node for a single label element. If, as the comment in listing 1 suggests, there were 10,000 more label elements, the memory usage wouldn't be much greater, although if your loop iterates faster than Python can reclaim each discarded node, you might want to add an explicit gc.collect() at the end of the loop. Each node yielded by the generator is a basic Domlette node, with all the usual properties and methods this makes available, including the useful xpath() method.
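To get a feel for the subtree-at-a-time idea without installing anything, here is a minimal sketch using only the standard library's xml.dom.pulldom, written for Python 3. It is not how Amara implements pushdom. The helper names iter_subtrees and child_text are made up for this example, and the sketch matches on element name alone rather than on a full pattern such as /labels/label.

```python
from io import StringIO
from xml.dom import pulldom

# A cut-down version of labels.xml, inlined so the sketch is self-contained
SAMPLE = """<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="ep"><name>Ezra Pound</name>
    <address><city>Hailey</city></address></label>
</labels>"""


def iter_subtrees(stream, element_name):
    """Yield one fully built DOM element at a time (hypothetical helper)."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == element_name:
            # Build the DOM for this element's subtree only
            events.expandNode(node)
            yield node


def child_text(node, tag):
    """Return the text of the first descendant element called tag, or ''."""
    matches = node.getElementsByTagName(tag)
    if not matches:
        return ''
    return ''.join(child.data for child in matches[0].childNodes
                   if child.nodeType == child.TEXT_NODE)


pairs = [(child_text(label, 'name'), child_text(label, 'city'))
         for label in iter_subtrees(StringIO(SAMPLE), 'label')]
for name, city in pairs:
    print(name, 'of', city)
```

Each yielded node is an ordinary minidom element, so the usual DOM methods apply, but there is no xpath() method as there is on a Domlette node.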

Compare listing 2 above to listing 4 of "Decomposition, Process, Recomposition" and you'll get a sense of how wrapping up the ideas from that article simplifies things.

If DOM Is Too Lame for You

Pythonic APIs are meant to make life easier for the many users who find DOM too arcane and alien for use in Python. Almost all of the earlier article on Anobind is still valid in Amara. The biggest change is in the imports. I also added some concessions to people who really don't want to worry about URL and file details and the like; the eight lines of listing 1 from the earlier article can now be reduced to two lines (the top two of listing 3). Listing 3 is an example of how I could use Amara Bindery to display names and cities from listing 1, the functional equivalent of listing 2.

Listing 3: Amara Bindery code to print out all people and their street addresses

from amara import binderytools

container = binderytools.bind_file('labels.xml')
for l in container.labels.label:
    print l.name, 'of', l.address.city

binderytools.bind_file takes a file name, parses the file, and returns a data binding, rooted at the object container, which represents the XML root node. Each element is a specialized object that permits easy access to the data using Python idioms, with object property names based on the names of XML tags and attributes. In a typical expression of the prevalent attitude in the Python community, one blogger called it "turning XML into something useful."
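The core idea of a data binding, property names derived from XML tags and attributes, can be sketched in a few lines of Python 3 on top of the standard library's ElementTree. This is only an illustration of the concept and makes no claim about how Bindery works internally; the Bound class and the bind_string function are hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="ep"><name>Ezra Pound</name>
    <address><city>Hailey</city></address></label>
</labels>"""


class Bound:
    """Hypothetical wrapper: child elements and attributes become properties."""

    def __init__(self, element, same_name=None):
        self._element = element
        # All sibling elements sharing this tag, so iteration can visit them
        self._same_name = same_name or [element]

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails
        if name.startswith('_'):
            raise AttributeError(name)
        children = self._element.findall(name)
        if children:
            return Bound(children[0], children)
        if name in self._element.attrib:
            return self._element.attrib[name]
        raise AttributeError(name)

    def __iter__(self):
        return (Bound(element) for element in self._same_name)

    def __str__(self):
        return ''.join(self._element.itertext()).strip()


def bind_string(xml_text):
    """Parse xml_text and return a Bound object standing in for the document."""
    document = ET.Element('document')
    document.append(ET.fromstring(xml_text))
    return Bound(document)


container = bind_string(SAMPLE)
for l in container.labels.label:
    print(l.name, 'of', l.address.city)
```

In this sketch a child element wins over an attribute of the same name, which is an arbitrary choice made to keep the example short.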

The Natural Next Step: Push Binding

One possible problem with listing 3 is that the entire XML document is converted to Python objects, which could mean a lot of memory usage for large documents, for example, if labels.xml were expanded to have 10,000 entries in label elements. Amara Bindery does mitigate this a little bit by using SAX to create data bindings, but this may not be good enough. What would be great is some way to use the pushdom approach from listing 2 while still having the ease-of-use advantage of Amara Bindery. This option is available as the Push binding, illustrated in listing 4.

Listing 4: Amara Push binding code to print out all people and their street addresses

from amara import binderytools

for subtree in binderytools.pushbind('/labels/label', source='labels.xml'):
    print subtree.label.name, 'of', subtree.label.address.city

You use patterns just as in listing 2 to break up the document, and just as in listing 2, binderytools.pushbind is a generator that instantiates part of the document at a time, thus never using up the memory needed to represent the entire document. This time, however, the values yielded by the generator are subtrees of an Amara binding rather than DOM nodes, so you can use the more natural Python idioms to access the data, if you prefer.
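For readers who want to experiment with the same combination of incremental parsing and attribute-style access using only the standard library, the following Python 3 sketch is one way to approximate it with ElementTree's iterparse. It is an assumption-laden stand-in, not Amara's pushbind: push_bind and to_namespace are hypothetical names, matching is by tag name only, repeated child tags overwrite one another, and mixed content is not handled.

```python
import io
import xml.etree.ElementTree as ET
from types import SimpleNamespace

SAMPLE = """<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="ep"><name>Ezra Pound</name>
    <address><city>Hailey</city></address></label>
</labels>"""


def to_namespace(element):
    """Convert an element to a plain object (hypothetical helper).

    Leaf elements become their stripped text; other elements become a
    SimpleNamespace with one property per attribute and child tag.
    """
    if len(element) == 0:
        return (element.text or '').strip()
    result = SimpleNamespace(**element.attrib)
    for child in element:
        setattr(result, child.tag, to_namespace(child))
    return result


def push_bind(source, tag):
    """Yield one converted subtree at a time for each element called tag."""
    for event, element in ET.iterparse(source, events=('end',)):
        if element.tag == tag:
            yield to_namespace(element)
            # Discard the element's content once the consumer is done with it
            element.clear()


people = list(push_bind(io.BytesIO(SAMPLE.encode('utf-8')), 'label'))
for person in people:
    print(person.name, 'of', person.address.city)
```

Collecting the results into a list, as done here for display, gives up the memory advantage; in real use you would loop over push_bind directly.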

Modification

Amara Bindery makes it pretty easy to modify XML objects in place and reserialize them back to XML. As an example, listing 5 makes some changes to one of the label elements and then prints the result back out.

Listing 5: Amara Bindery code to update an address label entry

from amara import binderytools

container = binderytools.bind_file('labels.xml')

#Add a quote to the Langston Hughes entry

#The quote text to be added
new_quote_text = \
u'\u2026if dreams die, life is a broken winged bird that cannot fly.'

#The ID of Hughes's entry
label_id = u'lh'

#Cull to a list of entries with the desired ID
lh_label = [ label for label in container.labels.label
                   if label.id == label_id ]
#We know there's only one, so get it
lh_label = lh_label[0]

#Now we have an element object.  Add a child element to the end

#xml_element is a factory method for elements.
#Specify no namespace, 'quote' local name
#Append the result to the label element
lh_label.xml_append(container.xml_element(None, u'quote'))

#Now set the child text on the new quote element
#Notice how easily the new quote element can be accessed
lh_label.quote.xml_children.append(new_quote_text)

#Change the added attribute
#Even easier than adding an element
lh_label.added = u'2005-01-10'

#Print the updated label element back out
print lh_label.xml()
#If you want to print the entire, updated document back out, use
#print container.xml()

The code's comments should provide all the needed explanation.
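For comparison, here is a rough Python 3 sketch of the same update (find the entry by ID, append a quote element, change the added attribute, reserialize) using the standard library's ElementTree rather than Amara. The inlined sample is a trimmed copy of the Langston Hughes entry, and nothing here reflects Bindery's own API.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<labels>
  <label id="ep" added="2003-06-10"><name>Ezra Pound</name></label>
  <label id="lh" added="2004-11-01"><name>Langston Hughes</name></label>
</labels>"""

root = ET.fromstring(SAMPLE)

# Cull to the entries with the desired ID; we expect exactly one
matches = [label for label in root.findall('label')
           if label.get('id') == 'lh']
lh_label = matches[0]

# Append a new quote child element and set its text
quote = ET.SubElement(lh_label, 'quote')
quote.text = '\u2026if dreams die, life is a broken winged bird that cannot fly.'

# Change the added attribute
lh_label.set('added', '2005-01-10')

# Serialize just the updated label element back to a string
updated = ET.tostring(lh_label, encoding='unicode')
print(updated)
```

Passing root instead of lh_label to ET.tostring serializes the entire updated document.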

Taming SAX

Sometimes, though perhaps rarely, you may need to process huge files that cannot easily be broken into simple patterns. You may need to write SAX code, but of course as discussed often in this column, SAX isn't always an easy tool to use. Amara provides several tools to help make SAX easier to use, including a module saxtools.xpattern_sax_state_machine which can write SAX state machines for you, given patterns. In fact, this module is used in domtools.pushdom and binderytools.pushbind. There is also a framework, Tenorsax, to help effectively linearize SAX logic. With Tenorsax, you register callback generators rather than callback functions, and, using the magic of Python generators, each callback actually receives multiple SAX events within its logic, so you can use local variables and manage state more easily than in most SAX code. Listing 6 is an example using Tenorsax to also go through the labels XML file and print names and addresses. Tenorsax is overkill for such a purpose, and you've already seen how to accomplish it much more easily with Amara, but it should illustrate the workings of Tenorsax.

Listing 6: Tenorsax code to print out all people and their street address

import sys
from xml import sax
from amara import saxtools

class label_handler:
    def __init__(self):
        self.event = None
        self.top_dispatcher = {
            (saxtools.START_ELEMENT, None, u'labels'):
            self.handle_labels,
            }
        return

    def handle_labels(self, end_condition):
        dispatcher = {
            (saxtools.START_ELEMENT, None, u'label'):
            self.handle_label,
            }
        #First round through the generator corresponds to the
        #start element event
        yield None
        #delegate is a generator that handles all the events "within"
        #this element
        delegate = None
        while not self.event == end_condition:
            delegate = saxtools.tenorsax.event_loop_body(
                dispatcher, delegate, self.event)
            yield None
        #Element closed.  Wrap up
        return

    def handle_label(self, end_condition):
        dispatcher = {
            (saxtools.START_ELEMENT, None, 'name'):
            self.handle_leaf,
            (saxtools.START_ELEMENT, None, 'city'):
            self.handle_leaf,
            }
        delegate = None
        yield None
        while not self.event == end_condition:
            delegate = saxtools.tenorsax.event_loop_body(
                dispatcher, delegate, self.event)
            yield None
        return

    def handle_leaf(self, end_condition):
        element_name = self.event[2]
        yield None
        name = u''
        while not self.event == end_condition:
            if self.event[0] == saxtools.CHARACTER_DATA:
                name += self.params
            yield None
        #Element closed.  Wrap up
        print name,
        if element_name == u'name':
            print 'of',
        else:
            print
        return


if __name__ == "__main__":
    parser = sax.make_parser()
    #The "consumer" is our own handler
    consumer = label_handler()
    #Initialize Tenorsax with handler
    handler = saxtools.tenorsax(consumer)
    #Resulting tenorsax instance is the SAX handler
    parser.setContentHandler(handler)
    parser.setFeature(sax.handler.feature_namespaces, 1)
    parser.parse('labels.xml')

Tenorsax allows you to define a hierarchy of generators which handle subtrees of the document. Each generator gets multiple SAX events. Tenorsax takes advantage of the fact that Python generators can be suspended and resumed. Each time a Tenorsax handler generator yields, it is suspended, and when the next SAX event comes along, it's resumed. The current event information is always available as self.event. Tenorsax allows you to define dispatcher dictionaries which map SAX event details to subsidiary generators. The current subsidiary generator is called delegate in listing 6, because the relationship between a generator and its subsidiaries basically forms a delegation pattern.

Tenorsax automatically creates and runs the delegates within the main event loop, while not self.event == end_condition. The body of this loop is usually a call back to the Tenorsax framework, although you can also add specialized logic for the events that you want each generator to handle itself. end_condition is provided by Tenorsax so that generators know when to quit. For a start element, the end condition is set up to be the event that marks the corresponding end element.

handle_leaf is an example of linear logic across SAX events. It aggregates text from multiple character events into one string, either the contents of the name element or the city element. It builds this using a local variable, which is not possible with regular SAX. Usually, you'd have to use an instance attribute on the handler, governed by a state machine (so that it is not grabbing text from the wrong events). Listing 6 is certainly much more ponderous than all the other sample code so far. Again, you would not usually bring out the heavy artillery of Tenorsax unless you had logic that was very hard to force into one of the other facilities in Amara.
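The general trick of suspending a generator between SAX events can be tried out with nothing but the standard library. The following Python 3 sketch is not Tenorsax and does not use its API: GeneratorHandler, collect_text, and label_consumer are hypothetical names, and it relies on generator send() and yield from, neither of which existed in Python 2.3. It shows the same two ideas, a local variable that survives across several events and delegation to a subsidiary generator that runs until an end condition arrives.

```python
import xml.sax

SAMPLE = """<labels>
  <label id="tse"><name>Thomas Eliot</name>
    <address><city>Stamford</city></address></label>
  <label id="ep"><name>Ezra Pound</name>
    <address><city>Hailey</city></address></label>
</labels>"""

START, END, CHARS = 'start', 'end', 'chars'


class GeneratorHandler(xml.sax.ContentHandler):
    """Forward every SAX event to a single generator-based consumer."""

    def __init__(self, consumer):
        super().__init__()
        self._consumer = consumer
        next(self._consumer)  # run the generator up to its first yield

    def startElement(self, name, attrs):
        self._consumer.send((START, name))

    def endElement(self, name):
        self._consumer.send((END, name))

    def characters(self, content):
        self._consumer.send((CHARS, content))


def collect_text(end_condition):
    """Delegate generator: gather text until the end condition event."""
    text = ''  # a plain local variable that persists across events
    event = yield
    while event != end_condition:
        if event[0] == CHARS:
            text += event[1]
        event = yield
    return text


def label_consumer(results):
    """Top-level generator: pair each name with the city that follows it."""
    name = None
    while True:
        event = yield
        if event == (START, 'name'):
            name = yield from collect_text((END, 'name'))
        elif event == (START, 'city'):
            city = yield from collect_text((END, 'city'))
            results.append((name, city))


results = []
xml.sax.parseString(SAMPLE.encode('utf-8'),
                    GeneratorHandler(label_consumer(results)))
for name, city in results:
    print(name, 'of', city)
```

Because the text accumulates in a local variable inside collect_text, no state flag is needed on the handler to say which element is currently open.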

Wrapping Up

There is a lot more to Amara XML Toolkit than I can cover in this article. The aim of the project is versatility—giving the developer many flexible ways of processing XML using idioms and native advantages of Python. Because of the popularity of languages such as Java, many XML standards have evolved in directions that don't match up with Python's strengths. Amara looks to bridge that gap. If you're curious about the project name, see this posting.

As often happens in the holiday season, activity has been a bit slow. Holiday revels are also a good excuse for an announcement entitled "xsdb does XML, SQL is dead as disco." Seems Aaron Watters's xsdb project, "a framework for distributing querying and combining tabular data over the Internet," has been renamed "xsdbXML." The announcement is a bit sketchy on the role of XML, but looking at the use cases, it seems xsdbXML is based on pure XML expressions of relational tables, meaning it effectively short-circuits SQL (which is, after all, but one realization of the relational calculus, and one that many relational purists consider flawed). The queries are also expressed in XML. This is a very interesting project, and coming from the brains behind Gadfly, you can expect the highest technical standards. Perhaps less whimsical announcements will help it gain the notice it deserves.

Walter Dörwald announced XIST 2.8. "XIST is an extensible HTML/XML generator written in Python. XIST is also a DOM parser (built on top of SAX2) with a very simple and Pythonesque tree API." This release now requires Python 2.4 and there have been some API changes. See the announcement.

Dave Kuhlman announced generateDS 8a. generateDS is a data binding tool that generates Python data structures from a W3C XML Schema. I covered generateDS in an earlier article. This release adds support for mixed content, structured type extensions (limited support), attribute groups, and substitution groups (limited support). See the announcement.