My recent interest in Python-XML data bindings was sparked not only by discussion in the XML community of effective approaches to XML processing, but also by personal experience with large projects where data binding approaches might have been particularly suitable. These projects included processing both data and document-style XML instances, complex systems of processing rules connected to the XML format, and other characteristics requiring flexibility from a data binding system. As a result of these considerations, and of my study of existing Python-XML data binding systems, I decided to write a new Python-XML data binding, which I call Anobind.
I designed Anobind with several properties in mind, some of which I have admired in other data binding systems, and some that I have thought were, unfortunately, lacking in other systems:
- A natural default binding (i.e. when given an XML file with no hints or customization)
- Well-defined mapping from XML to Python identifiers
- Declarative, rules-based system for fine-tuning the binding
- XPattern support for rules definition
- Strong support for document-style XML (especially with regard to mixed content)
- Reasonable support for unbinding back to XML
- Some flexibility in trading off between efficiency and features in the resulting binding
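To make the idea of a well-defined mapping from XML names to Python identifiers concrete, here is a minimal sketch in Python 3. The rules in it (dropping the namespace prefix, replacing illegal characters with underscores, suffixing reserved words) are assumptions chosen for illustration, and the function name xml_name_to_identifier is hypothetical; the article does not spell out Anobind's actual mapping.

```python
import keyword
import re


def xml_name_to_identifier(name):
    """Map an XML name to a valid Python identifier (illustrative rules only)."""
    # Drop any namespace prefix, e.g. "xhtml:div" -> "div"
    local = name.split(":")[-1]
    # Replace characters that are legal in XML names but not in identifiers
    ident = re.sub(r"\W", "_", local)
    # Identifiers cannot start with a digit
    if ident[0].isdigit():
        ident = "_" + ident
    # Avoid clashing with reserved words such as "class" or "from"
    if keyword.iskeyword(ident):
        ident += "_"
    return ident


examples = {n: xml_name_to_identifier(n) for n in ["first-name", "xhtml:div", "class"]}
```

The point of fixing such rules in one function is that the same XML name always yields the same attribute name, so code written against a binding stays predictable.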
Getting started with Anobind
Anobind 0.5 is the version I cover in this article. You can download it at the home page. Python 2.2 and 4Suite 1.0a3 are required. I'm using Python 2.2.2 and 4Suite 1.0a3. Installing Anobind is a simple matter of untarring and then good old python setup.py install.
The XML file I use to exercise the binding is the same as the one I used for gnosis.xml.objectify, listing 1 in the last article. I've named that file labels.xml, and Listing 1 in this article shows basic steps for loading it into a data binding.
Listing 1: Basic steps for creating a binding from an XML file
    import anobind
    from Ft.Xml import InputSource
    from Ft.Lib import Uri

    #Create an input source for the XML
    isrc_factory = InputSource.DefaultFactory
    #Create a URI from a filename the right way
    file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
    isrc = isrc_factory.fromUri(file_uri)

    #Now bind from the XML given in the input source
    binder = anobind.binder()
    binding = binder.read_xml(isrc)
The first thing you'll notice is that the Anobind example requires eight lines to perform a similar binding process which requires three lines in gnosis.xml.objectify. Anobind uses the 4Suite input source architecture for dealing with XML files. Although this API is a bit more verbose than some others, I'm a strong proponent of it because it is explicit and minimizes the sort of unpleasant surprises that developers run into when they try to interchange URIs and file system paths or try to perform XML operations that require URI resolution. At any rate, if you already have a 4Suite DOM node handy, creating a binding is much more terse:
    >>> binder = anobind.binder()
    >>> binding = binder.read_domlette(node)
By default Anobind tries to minimize memory usage by trimming the source DOM tree as it goes along.
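The distinction between file system paths and URIs that motivates the input source API can be illustrated without 4Suite. The following is a sketch using only the Python 3 standard library; the helper name os_path_to_uri and the file name "my labels.xml" are hypothetical, and this is an analogue of the Uri.OsPathToUri call in Listing 1, not a replacement for it.

```python
from pathlib import Path


def os_path_to_uri(path):
    """Turn a (possibly relative) file system path into an absolute file: URI."""
    # resolve() makes the path absolute; as_uri() adds the scheme and
    # percent-encodes characters such as spaces
    return Path(path).resolve().as_uri()


# The file does not need to exist for the conversion to work
uri = os_path_to_uri("my labels.xml")
```

A path with a space in it shows why the two cannot be interchanged blindly: the path contains a literal space, while the URI must carry it as %20.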
Peeking and Poking
The resulting Python data structure, as with gnosis.xml.objectify, consists of a set of classes defined on-the-fly, based on the XML structure. One big difference is that in Anobind the document itself is represented by a binding object, which is returned as binding in the above code examples:
    $ python -i listing1.py
    >>> print binding
    <anobind.document_base object at 0x8176a4c>
You then access child elements naturally, as regular data members:
    >>> print binding.labels
    <anobind.labels object at 0x84643cc>
    >>> print binding.labels.label
    <anobind.label object at 0x8465a3c>
Look closely at the second statement above. There are two label elements, but it seems I access the label data member as a simple scalar value. Most Python data bindings offer tricks so that sets of child elements can be accessed in some natural idiom, whether there is one element or more. Anobind is no exception. If you access the data member in the simple manner, as above, you get the first (or only) corresponding object from the element. You can also use list-like access to grab a particular element or even loop over all the elements:
    >>> print binding.labels.label[0]
    <anobind.label object at 0x8465a3c>
    >>> print binding.labels.label[1]
    <anobind.label object at 0x846842c>
    >>> for label in binding.labels.label:
    ...     print label
    ...
    <anobind.label object at 0x8465a3c>
    <anobind.label object at 0x846842c>
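One way to get this "scalar or list" behavior is a small wrapper that holds all same-named children, supports indexing and iteration, and forwards any other attribute access to the first child. The sketch below is in Python 3 and is only an assumption about how such an idiom could be built; the class names ChildGroup and Label are hypothetical and the article does not describe how Anobind implements it.

```python
class Label:
    """Stand-in for a bound element object."""

    def __init__(self, name):
        self.name = name


class ChildGroup:
    """Holds every same-named child; behaves like the first one by default."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        # List-like access to a particular child
        return self._items[index]

    def __iter__(self):
        # Loop over all children in document order
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def __getattr__(self, attr):
        # Only called when normal lookup fails; guard private names so that
        # a missing _items can never cause infinite recursion
        if attr.startswith("_"):
            raise AttributeError(attr)
        return getattr(self._items[0], attr)


labels = ChildGroup([Label("Thomas Eliot"), Label("Ezra Pound")])
first_name = labels.name          # falls through to the first child
second_name = labels[1].name      # explicit index
all_names = [label.name for label in labels]
```

The trade-off of this design is that plain attribute access silently ignores every child after the first, which is convenient for one-off elements but worth remembering when an element can repeat.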
Anobind tries to keep track of the order of things in the source document. Thus objects from XML elements and documents have a children list which maintains references to child markup and Unicode instances for child text, all maintained in the order read from the source document:

    >>> print binding.children[0].children[1]
    <anobind.label object at 0x8465a3c>

This is equivalent to the earlier access to the first label element through binding.labels.label. I use binding.children[0].children[1] rather than binding.children[0].children[0] because the white space preceding the first child element counts as a child (in the form of a simple Unicode object):

    >>> print repr(binding.children[0].children[0])
    u'\n '
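The same effect, where leading white space occupies the first slot in an ordered child list, can be observed with the standard library DOM. This sketch uses xml.dom.minidom under Python 3 and a small made-up sample (SAMPLE is an assumption standing in for labels.xml, which is not reproduced in this article); it demonstrates the general DOM behavior rather than Anobind itself.

```python
from xml.dom import minidom

# Hypothetical sample in the spirit of labels.xml
SAMPLE = """<labels>
  <label added="2003-06-20"><name>Thomas Eliot</name></label>
  <label><name>Ezra Pound</name></label>
</labels>"""

doc = minidom.parseString(SAMPLE)
root = doc.documentElement

# Child nodes are kept in document order, text and markup interleaved
children = root.childNodes
first = children[0]    # the white space before the first <label>
second = children[1]   # the first <label> element itself

kinds = [child.nodeType for child in children]
```

Here kinds alternates between text and element node types, which is exactly why index 1, not index 0, reaches the first label.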
As with gnosis.xml.objectify, attributes are also accessed as data members given ordinary Python identifiers and non-ASCII characters are handled without problem. The text content of elements is accessible using the text_content method:

    >>> print repr(binding.labels.label.added)
    u'2003-06-20'
    >>> print repr(binding.labels.label.quote.text_content())
    u'\n \n is its own season\x85\n '
    >>> print repr(binding.labels.label.quote.emph.text_content())
    u'Midwinter Spring'
Comments are represented by special objects in the children list:

    >>> print binding.labels.label.quote.children[1]
    <anobind.comment_base object at 0x8177d6c>
    >>> print repr(binding.labels.label.quote.children[1].body)
    u' Mixed content '
Anobind also tries to support roundtrip back to XML. The default binding described above provides the same level of roundtripping as the XSLT identity transform. That is, it's good enough for most uses. To generate XML from a binding use the unbind method. The following snippet writes the XML input to the binding back to the console:

    >>> import sys
    >>> binding.unbind(sys.stdout)
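For comparison, a roundtrip of the same general kind can be sketched with the Python 3 standard library: parse a document, then write it back out to a stream. The sample string SOURCE is an assumption made for illustration, and minidom's writexml stands in here for the unbind method; it says nothing about how Anobind serializes internally.

```python
import io
from xml.dom import minidom

# Hypothetical input; attribute quoting and white space are already in the
# form that minidom writes, so the roundtrip is exact for this sample
SOURCE = '<labels>\n  <label added="2003-06-20"><name>Thomas Eliot</name></label>\n</labels>'

doc = minidom.parseString(SOURCE)

# Write the document element to a stream, much as one would write to sys.stdout
out = io.StringIO()
doc.documentElement.writexml(out)
roundtripped = out.getvalue()
```

As with the XSLT identity transform, details outside the data model (single versus double attribute quotes, for instance) would not survive such a roundtrip, which is why the sample avoids them.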
Customizing the binding
As in gnosis.xml.objectify, you can substitute your classes for the ones generated by Anobind. Copying the example from the last article, Listing 2 demonstrates custom classes by adding the ability to compute initials from names in label entries:
Listing 2: Using a customized element class
    import anobind
    from Ft.Xml import InputSource
    from Ft.Lib import Uri
    from xml.dom import Node

    #Create an input source for the XML
    isrc_factory = InputSource.DefaultFactory
    #Create a URI from a filename the right way
    file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
    isrc = isrc_factory.fromUri(file_uri)

    class specialized_name(anobind.element_base):
        def get_initials(self):
            #Yes this could be done in more cute fashion with reduce()
            #Going for clearer steps here
            parts = self.text_content().split()
            #Take the first letter of each part of the name
            initial_letters = [ part[0].upper() for part in parts ]
            initials = ". ".join(initial_letters)
            if initials: initials += "."
            return initials

    #Exercise the binding by yielding the initials in the sample document
    binder = anobind.binder()
    #associate specialized_name class with elements with GI of "name"
    binder.binding_classes[(Node.ELEMENT_NODE, "name")] = specialized_name

    #Then bind
    binding = binder.read_xml(isrc)

    #Show the specialized instance
    print binding.labels.label.name

    #Test the specialized method
    for label in binding.labels.label:
        print "name:", label.name.text_content()
        print "initials:", label.name.get_initials()
The binder maintains a mapping from node type and name to binding class. If it doesn't find an entry in this mapping, it generates a class for the binding of the node. Running listing 2, I get the following:
    $ python listing2.py
    <__main__.specialized_name object at 0x8458d5c>
    name: Thomas Eliot
    initials: T. E.
    name: Ezra Pound
    initials: E. P.
The customized class is represented as __main__.specialized_name; the __main__ prefix indicates that the class is defined at the top level of the module invoked from the command line.
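The lookup described above, a mapping from node type and name to class with a generated class as the fallback, can be sketched in a few lines of Python 3. Everything here (Binder, ElementBase, class_for, SpecializedName) is a hypothetical stand-in written for illustration; it mirrors the shape of binder.binding_classes in Listing 2 but is not Anobind's code.

```python
from xml.dom import Node


class ElementBase:
    """Minimal stand-in for a bound element."""

    def __init__(self, text=""):
        self.text = text

    def text_content(self):
        return self.text


class SpecializedName(ElementBase):
    def get_initials(self):
        # First letter of each part of the name, each followed by a period
        parts = self.text_content().split()
        return " ".join(part[0].upper() + "." for part in parts)


class Binder:
    def __init__(self):
        # Maps (node type, name) to the class used for that node
        self.binding_classes = {}

    def class_for(self, node_type, name):
        key = (node_type, name)
        if key not in self.binding_classes:
            # No registered class: generate one on the fly and remember it
            self.binding_classes[key] = type(name, (ElementBase,), {})
        return self.binding_classes[key]


binder = Binder()
binder.binding_classes[(Node.ELEMENT_NODE, "name")] = SpecializedName

name_obj = binder.class_for(Node.ELEMENT_NODE, "name")("Thomas Eliot")
street_cls = binder.class_for(Node.ELEMENT_NODE, "street")
```

Registering a class before binding is all it takes to override the default, and any element without an entry still gets a usable class named after it.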
Anobind is rules-driven, and you can perform more complex customizations pretty easily by defining your own rules. Anobind also comes with some rules to handle common deviations from the default binding. For example, it's a bit wasteful that elements such as name, street, and state are rendered as full objects when they are always simple text content. Listing 3 demonstrates a variation on the binding that treats them as one would simple attributes, saving resources and simplifying access.
Listing 3: Turning certain elements into simple data members
    import anobind
    from Ft.Xml import InputSource
    from Ft.Lib import Uri

    #Create an input source for the XML
    isrc_factory = InputSource.DefaultFactory
    #Create a URI from a filename the right way
    file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
    isrc = isrc_factory.fromUri(file_uri)

    #Now bind from the XML given in the input source
    binder = anobind.binder()

    #Specify (using XPatterns) elements to be treated similarly to attributes
    stringify = ["label/name", "label/address/*"]
    custom_rule = anobind.simple_string_element_rule(stringify)
    binder.add_rule(custom_rule)

    #Bind
    binding = binder.read_xml(isrc)

    print binding.labels.label.name
    print binding.labels.label.address.street
Which results in the following:
    $ python listing3.py
    Thomas Eliot
    3 Prufrock Lane
This system of rules declared and registered as Python objects is, I think, the distinguishing feature of Anobind. You can use XPatterns as the trigger for rules, as in listing 3, or you can use the full power of Python. For an earlier discussion of XPattern (which is a standard specialization of XPath designed for just such rules triggering), see A Tour of 4Suite.
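To show what "rules declared and registered as Python objects" can look like in miniature, here is a sketch in Python 3 using xml.etree.ElementTree. It is an assumption-laden toy: SimpleStringRule, Bound, and bind are hypothetical names, the sample document is made up, and the patterns are a simplified two-step "parent/child" form with "*" as a wildcard rather than real XPatterns, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

# Hypothetical sample in the spirit of labels.xml
SAMPLE = """<labels>
  <label>
    <name>Thomas Eliot</name>
    <address><street>3 Prufrock Lane</street></address>
  </label>
</labels>"""


class SimpleStringRule:
    """Bind matching elements as plain strings instead of objects."""

    def __init__(self, patterns):
        # Each pattern must have exactly two steps: "parent/child"
        self.patterns = [pattern.split("/") for pattern in patterns]

    def matches(self, parent_tag, tag):
        return any(
            fnmatchcase(parent_tag, parent) and fnmatchcase(tag, child)
            for parent, child in self.patterns
        )

    def apply(self, element):
        return element.text or ""


class Bound:
    """Generic object used when no rule claims an element."""


def bind(element, rules, parent_tag=""):
    # The first rule whose trigger matches decides how the element is bound
    for rule in rules:
        if rule.matches(parent_tag, element.tag):
            return rule.apply(element)
    # Default binding: one attribute per child (a repeated name would
    # overwrite the earlier one in this toy version)
    obj = Bound()
    for child in element:
        setattr(obj, child.tag, bind(child, rules, element.tag))
    return obj


rules = [SimpleStringRule(["label/name", "address/*"])]
binding = bind(ET.fromstring(SAMPLE), rules)

name = binding.label.name
street = binding.label.address.street
```

Because each rule is an ordinary object with a matching test and an action, the trigger could just as well be arbitrary Python code as a pattern string, which is the flexibility the paragraph above describes.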
Anobind does offer a few other features I haven't mentioned, including support for a subset of XPath expressions over binding objects.
Anobind is really just easing out of the gates. I have several near-term plans for it, including a tool that reads RELAX NG files and generates corresponding, customized binding rules. I also have longer-term plans such as a SAX module for generating bindings without having to build a DOM.
As for the name, "anobind" is really just a way of writing "4bind" that is friendly to computer identifiers. "Ano" is "four" in Igbo. "Anobind" also makes me think of chemistry; in particular, a possible name for the result when a negative ion reaches the positive electrode in a wet cell.
Meanwhile, gnosis.xml.objectify has not been standing still. David Mertz has released version 1.1.0. The biggest change is that Mertz acted on his thoughts about how to maintain sequence information, which were prompted by my last article. As long as you use the Expat parser (which is now the default), you can use new functions to obtain from any binding object child content lists that reflect the original order in the source document.
Mertz also addressed my comment in the last article that gnosis.xml.objectify seemed to save only the last PCDATA block in mixed content:
I'm pretty sure what was actually going on was instead the fact that gnosis.xml.objectify has a habit of stripping surrounding whitespace. In my own uses, I tend to want to pull off some text content, but don't care about the way linefeeds and indents are used to prettify it.
XML data bindings are still a developing science. This fact, combined with Python's endless flexibility, means that there probably will and should be a proliferation of data binding systems that scratch the itches of various users. Anobind is yet another entry, and I know of one other so-far-unreleased project by one of my colleagues. Perhaps at some point consolidation will make sense, but for now the variety and ease of use of the Python data binding tools are good news for users.
In Python-XML news, Andrew Clover put an admirable effort into documenting bugs and inconsistencies in the various Python DOM implementations.
Jerome Alet announced version 3.01 of jaxml, a bug-fix release of the Python module for creating XML documents.
xmlSiteMakerPy 0.2 is a new tool by Ilya Nemihin, a Python-based XML and XSLT framework for offline (i.e. static) site generation. It used to be a PHP tool but was migrated to Python. It uses 4Suite for XSLT processing.
SOAPpy seems to have found its way to version 0.10.2 with very few release announcements. Web services development in Python has been quite stagnant for a while, so I hope that the folks behind the recent spate of activity become more vocal soon.
And the big news: Python 2.3 is out. The new boolean and set types should prove very useful to developers of Python-XML tools. 2.3 is also a good deal faster and fixes some Unicode bugs which are unlikely to be fixed in the 2.2 series. I encourage folks to upgrade right away. For more detail on what's new in the language, see the terse, official description or Andrew Kuchling's usual gift of a highly readable update.