August 13, 2003
My recent interest in Python-XML data bindings was sparked not only by discussion in the XML community of effective approaches to XML processing, but also by personal experience with large projects where data binding approaches might have been particularly suitable. These projects included processing both data and document-style XML instances, complex systems of processing rules connected to the XML format, and other characteristics requiring flexibility from a data binding system. As a result of these considerations, and of my study of existing Python-XML data binding systems, I decided to write a new Python-XML data binding, which I call Anobind.
I designed Anobind with several properties in mind, some of which I have admired in other data binding systems, and some that I have thought were, unfortunately, lacking in other systems:
- A natural default binding (i.e. when given an XML file with no hints or customization)
- Well-defined mapping from XML to Python identifiers
- Declarative, rules-based system for fine-tuning the binding
- XPattern support for rules definition
- Strong support for document-style XML (especially with regard to mixed content)
- Reasonable support for unbinding back to XML
- Some flexibility in trading off between efficiency and features in the resulting binding
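To make the "well-defined mapping from XML to Python identifiers" goal concrete, here is a minimal sketch of what such a mapping might look like. It is written for current Python 3 with only the standard library, and the function name `to_python_identifier` and its rules are assumptions for illustration; they are not Anobind's actual algorithm.

```python
import keyword
import re


def to_python_identifier(xml_name):
    """Map an XML name to a legal Python identifier (hypothetical rules)."""
    # Drop any namespace prefix, e.g. "xsl:template" -> "template"
    local = xml_name.split(":")[-1]
    # Replace characters that are legal in XML names but not in identifiers
    # (in Python 3, \W leaves Unicode letters and digits untouched)
    ident = re.sub(r"\W", "_", local)
    # Identifiers cannot start with a digit
    if ident[0].isdigit():
        ident = "_" + ident
    # Avoid clashing with reserved words such as "class"
    if keyword.iskeyword(ident):
        ident += "_"
    return ident


for name in ["label", "street-address", "xsl:template", "class"]:
    print(name, "->", to_python_identifier(name))
```

The point of fixing such rules up front is that a user can predict the attribute name for any element or attribute without inspecting the generated objects.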
Getting started with Anobind
Anobind 0.5 is the version I cover in this article. You can download it at the home page. Python 2.2 and 4Suite 1.0a3 are required; I'm using Python 2.2.2 and 4Suite 1.0a3. Installing Anobind is a simple matter of untarring and then running the usual python setup.py install.
The XML file I use to exercise the binding is the same as the one I used for gnosis.xml.objectify, listing 1 in the last article. I've named that file labels.xml, and Listing 1 in this article shows basic steps for loading it into a data binding.
Listing 1: Basic steps for creating a binding from an XML file
    import anobind
    from Ft.Xml import InputSource
    from Ft.Lib import Uri

    #Create an input source for the XML
    isrc_factory = InputSource.DefaultFactory
    #Create a URI from a filename the right way
    file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
    isrc = isrc_factory.fromUri(file_uri)

    #Now bind from the XML given in the input source
    binder = anobind.binder()
    binding = binder.read_xml(isrc)
The first thing you'll notice is that the Anobind example requires eight lines to perform a similar binding process which requires three lines in gnosis.xml.objectify. Anobind uses the 4Suite input source architecture for dealing with XML files. Although this API is a bit more verbose than some others, I'm a strong proponent of it because it is explicit and minimizes the sort of unpleasant surprises that developers run into when they try to interchange URIs and file system paths or try to perform XML operations that require URI resolution. At any rate, if you already have a 4Suite DOM node handy, creating a binding is much more terse:
    >>> binder = anobind.binder()
    >>> binding = binder.read_domlette(node)
By default Anobind tries to minimize memory usage by trimming the source DOM tree as it goes along.
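The distinction between file system paths and URIs that motivates the input source API can be illustrated without 4Suite. The following is a sketch in current Python 3 using only the standard library; the file name "address labels.xml" is a made-up example chosen because it contains a space, and none of these calls are part of 4Suite or Anobind.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

# A path with a space in it is not a valid URI as written
path = Path("address labels.xml").resolve()

# Convert the absolute path to a properly escaped file: URI
file_uri = path.as_uri()
print(file_uri)

# Going back requires undoing the escaping, not just stripping "file://"
round_trip = Path(url2pathname(urlparse(file_uri).path))
print(round_trip == path)
```

Treating the two forms as interchangeable strings is exactly the kind of shortcut that breaks once a name needs escaping or a relative reference has to be resolved against a base.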
Peeking and Poking
The resulting Python data structure, as with gnosis.xml.objectify, consists of a set of
classes defined on-the-fly, based on the XML structure. One big difference is that in
Anobind the document itself is represented by a binding object, which is returned as
binding in the above code examples:
    $ python -i listing1.py
    >>> print binding
    <anobind.document_base object at 0x8176a4c>
You then access child elements naturally, as regular data members:
    >>> print binding.labels
    <anobind.labels object at 0x84643cc>
    >>> print binding.labels.label
    <anobind.label object at 0x8465a3c>
Look closely at the second statement above. There are two label elements, but it seems
that I access the label data member as a simple scalar value. Most Python data
bindings offer tricks so that sets of child elements can be accessed in some natural
way whether there is one element or more. Anobind is no exception. If you access the data
member in the simple manner, as above, you get the first (or only) corresponding object from
the element. You can also use list-like access to grab a particular element or even loop
over all the elements:
    >>> print binding.labels.label[0]
    <anobind.label object at 0x8465a3c>
    >>> print binding.labels.label[1]
    <anobind.label object at 0x846842c>
    >>> for label in binding.labels.label:
    ...     print label
    ...
    <anobind.label object at 0x8465a3c>
    <anobind.label object at 0x846842c>
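One way a binding can offer both scalar-style and list-style access is sketched below in current Python 3. The `ElementGroup` and `Label` classes and the sample values are hypothetical, and this is only one possible technique (a list that forwards unknown attributes to its first item); it is not a description of how Anobind implements the behavior.

```python
class ElementGroup(list):
    """A list of same-named child objects that also acts like its first item."""

    def __getattr__(self, name):
        # Called only when normal lookup on the list itself fails, so list
        # methods keep working; everything else is forwarded to item 0.
        if not self:
            raise AttributeError(name)
        return getattr(self[0], name)


class Label:
    """Hypothetical stand-in for a generated binding class."""

    def __init__(self, note):
        self.note = note


labels = ElementGroup([Label("first"), Label("second")])
print(labels.note)                       # scalar-style access reaches the first label
print(labels[1].note)                    # list-style access picks a specific one
print([label.note for label in labels])  # iteration visits them all
```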
Anobind tries to keep track of the order of things in the source document. Thus objects
created from XML elements and documents have a children list which maintains references
to child markup and Unicode instances for child text, all maintained in document order:

    >>> print binding.children[0].children[1]
    <anobind.label object at 0x8465a3c>

This is equivalent to the earlier binding.labels.label[0]. The expression is
binding.children[0].children[1] rather than binding.children[0].children[0] because
the white space preceding the first child element counts as a child (in the form of
a simple Unicode object):

    >>> print repr(binding.children[0].children[0])
    u'\n '
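The same effect, where white space between elements counts as a child, can be seen in any ordered tree API. The sketch below uses the standard library's xml.dom.minidom in current Python 3 rather than Anobind, with a small made-up document, purely to show why the first element ends up at index 1.

```python
from xml.dom import minidom

source = "<labels>\n  <label>first</label>\n  <label>second</label>\n</labels>"
root = minidom.parseString(source).documentElement

# Walk the children in document order, showing text and element nodes alike
for index, child in enumerate(root.childNodes):
    if child.nodeType == child.TEXT_NODE:
        print(index, repr(child.data))
    else:
        print(index, "<%s>" % child.tagName)
```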
As with gnosis.xml.objectify, attributes are also accessed as data members given ordinary
Python identifiers, and non-ASCII characters are handled without problem. The text content
of elements is accessible using the text_content() method:
    >>> print repr(binding.labels.label.added)
    u'2003-06-20'
    >>> print repr(binding.labels.label.quote.text_content())
    u'\n \n is its own season\x85\n '
    >>> print repr(binding.labels.label.quote.emph.text_content())
    u'Midwinter Spring'
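The output above suggests that the text content of the quote element covers only the text that is its direct child, since "Midwinter Spring" appears only under emph. That is an inference from the session, not a documented guarantee. The difference between the two notions of text content can be sketched with the standard library's minidom in current Python 3; the helper names `child_text` and `descendant_text` are made up for this example.

```python
from xml.dom import minidom


def child_text(element):
    """Concatenate only the text nodes that are direct children."""
    return "".join(
        node.data for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


def descendant_text(element):
    """Concatenate all text beneath the element, in document order."""
    parts = []
    for node in element.childNodes:
        if node.nodeType == node.TEXT_NODE:
            parts.append(node.data)
        elif node.nodeType == node.ELEMENT_NODE:
            parts.append(descendant_text(node))
    return "".join(parts)


source = ("<quote><!-- Mixed content -->"
          "<emph>Midwinter Spring</emph> is its own season</quote>")
quote = minidom.parseString(source).documentElement
print(repr(child_text(quote)))
print(repr(descendant_text(quote)))
```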
Comments are represented by special objects in the children list:
    >>> print binding.labels.label.quote.children[1]
    <anobind.comment_base object at 0x8177d6c>
    >>> print repr(binding.labels.label.quote.children[1].body)
    u' Mixed content '
Anobind also tries to support roundtrip back to XML. The default binding described
above provides the same level of roundtripping as the XSLT identity transform. That is,
it is good enough for most uses. To generate XML from a binding use the unbind method. The
following snippet writes the XML input to the binding back to the console:
    >>> import sys
    >>> binding.unbind(sys.stdout)
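To give a feel for what roundtripping at roughly this level means, the sketch below parses and re-serializes two small fragments with the standard library's minidom in current Python 3. It does not use Anobind's unbind; it only illustrates the general idea that elements, text and comments survive a round trip while purely lexical details, such as the attribute quote style, do not.

```python
from xml.dom import minidom

# Elements, mixed content and comments come back unchanged
mixed = ("<quote><!-- Mixed content -->"
         "<emph>Midwinter Spring</emph> is its own season</quote>")
mixed_out = minidom.parseString(mixed).documentElement.toxml()
print(mixed_out == mixed)

# Lexical details are normalized: single quotes become double quotes
quoted = "<label added='2003-06-20'/>"
quoted_out = minidom.parseString(quoted).documentElement.toxml()
print(quoted_out)
```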
Customizing the binding
As in gnosis.xml.objectify, you can substitute your classes for the ones generated by Anobind. Copying the example from the last article, Listing 2 demonstrates custom classes by adding the ability to compute initials from names in label entries:
Listing 2: Using a customized element class
    import anobind
    from Ft.Xml import InputSource
    from Ft.Lib import Uri
    from xml.dom import Node

    #Create an input source for the XML
    isrc_factory = InputSource.DefaultFactory
    #Create a URI from a filename the right way
    file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
    isrc = isrc_factory.fromUri(file_uri)

    class specialized_name(anobind.element_base):
        def get_initials(self):
            #Yes this could be done in more cute fashion with reduce()
            #Going for clearer steps here
            parts = self.text_content().split()
            #Take the first letter of each part of the name
            initial_letters = [ part[0].upper() for part in parts ]
            initials = ". ".join(initial_letters)
            if initials:
                initials += "."
            return initials

    #Exercise the binding by yielding the initials in the sample document
    binder = anobind.binder()
    #associate specialized_name class with elements with GI of "name"
    binder.binding_classes[(Node.ELEMENT_NODE, "name")] = specialized_name

    #Then bind
    binding = binder.read_xml(isrc)

    #Show the specialized instance
    print binding.labels.label.name

    #Test the specialized method
    for label in binding.labels.label:
        print "name:", label.name.text_content()
        print "initials:", label.name.get_initials()
The binder maintains a mapping from node type and name to binding class. If it doesn't find an entry in this mapping, it generates a class for the binding of the node. Running listing 2, I get the following:
    $ python listing2.py
    <__main__.specialized_name object at 0x8458d5c>
    name: Thomas Eliot
    initials: T. E.
    name: Ezra Pound
    initials: E. P.
The customized class is represented as __main__.specialized_name. The
__main__ indicates that the class is defined at the top level of the module
invoked from the command line.
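The lookup-with-fallback behavior described above, where the binder consults a mapping keyed by node type and name and generates a class when nothing is registered, can be sketched as follows in current Python 3. The `Binder`, `ElementBase` and `SpecializedName` classes and the `class_for` method are hypothetical stand-ins written for this illustration; only the idea of a `binding_classes` mapping comes from Listing 2.

```python
from xml.dom import Node


class ElementBase:
    """Hypothetical stand-in for a binding base class."""

    def __init__(self, text=""):
        self.text = text


class SpecializedName(ElementBase):
    def get_initials(self):
        # First letter of each whitespace-separated part, followed by a period
        return " ".join(part[0].upper() + "." for part in self.text.split())


class Binder:
    def __init__(self):
        # Maps (node type, element name) to the class used for the binding
        self.binding_classes = {}

    def class_for(self, node_type, name):
        key = (node_type, name)
        if key not in self.binding_classes:
            # No registered class: generate one on the fly and remember it
            self.binding_classes[key] = type(name, (ElementBase,), {})
        return self.binding_classes[key]


binder = Binder()
binder.binding_classes[(Node.ELEMENT_NODE, "name")] = SpecializedName

name_cls = binder.class_for(Node.ELEMENT_NODE, "name")
street_cls = binder.class_for(Node.ELEMENT_NODE, "street")
print(name_cls("Thomas Eliot").get_initials())
print(street_cls.__name__, issubclass(street_cls, ElementBase))
```

Caching the generated class in the same mapping means every later element with that name is bound to the same class, which keeps isinstance checks meaningful.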
Anobind is rules-driven, and you can perform more complex customizations pretty easily
by defining your own rules. Anobind also comes with some rules to handle common deviations
from the default binding. For example, it's a bit wasteful that elements such as name,
street and state are rendered as full objects when they are always simple text content.
Listing 3 demonstrates a variation on the binding that treats them as one would simple
attributes, saving resources and simplifying access.
Listing 3: Turning certain elements into simple data members
    import anobind
    from Ft.Xml import InputSource
    from Ft.Lib import Uri

    #Create an input source for the XML
    isrc_factory = InputSource.DefaultFactory
    #Create a URI from a filename the right way
    file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
    isrc = isrc_factory.fromUri(file_uri)

    #Now bind from the XML given in the input source
    binder = anobind.binder()

    #Specify (using XPatterns) elements to be treated similarly to attributes
    stringify = ["label/name", "label/address/*"]
    custom_rule = anobind.simple_string_element_rule(stringify)
    binder.add_rule(custom_rule)

    #Bind
    binding = binder.read_xml(isrc)

    print binding.labels.label.name
    print binding.labels.label.address.street
Which results in the following:
    $ python listing3.py
    Thomas Eliot
    3 Prufrock Lane
This system of rules declared and registered as Python objects is, I think, the distinctive feature of Anobind. You can use XPatterns as the trigger for rules, as in listing 3, or you can use the full power of Python. For an earlier discussion of XPattern (which is a standard specialization of XPath designed for just such rules triggering), see A Tour of 4Suite.
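The idea of rules as Python objects triggered by patterns can be sketched in a few lines of current Python 3 with the standard library's ElementTree. Everything here is an assumption made for illustration: the `SimpleStringRule` class, the `bind` function, and especially the `matches` helper, which only compares trailing path steps with a `*` wildcard and is far simpler than real XPattern matching. The default binding is reduced to nested dictionaries, so repeated sibling names would overwrite one another.

```python
import xml.etree.ElementTree as ET


def matches(pattern, path):
    """True if the trailing steps of path match pattern ('*' matches any name)."""
    steps = pattern.split("/")
    if len(steps) > len(path):
        return False
    tail = path[-len(steps):]
    return all(step == "*" or step == name for step, name in zip(steps, tail))


class SimpleStringRule:
    """Hypothetical rule: bind matching elements as plain strings."""

    def __init__(self, patterns):
        self.patterns = patterns

    def applies(self, path):
        return any(matches(pattern, path) for pattern in self.patterns)


def bind(elem, rules, path=()):
    path = path + (elem.tag,)
    if any(rule.applies(path) for rule in rules):
        # A rule fired: collapse the element to its text
        return (elem.text or "").strip()
    # Default binding: a dict of child name -> bound child
    return {child.tag: bind(child, rules, path) for child in elem}


source = ("<labels><label><name>Thomas Eliot</name>"
          "<address><street>3 Prufrock Lane</street></address>"
          "</label></labels>")
rules = [SimpleStringRule(["label/name", "label/address/*"])]
bound = bind(ET.fromstring(source), rules)
print(bound["label"]["name"])
print(bound["label"]["address"]["street"])
```

Because each rule is an ordinary object with an `applies` method, a rule could just as easily decide with arbitrary Python logic instead of a pattern string.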
Anobind does offer a few other features I haven't mentioned, including support for a subset of XPath expressions over binding objects.
Anobind is really just easing out of the gates. I have several near-term plans for it, including a tool that reads RELAX NG files and generates corresponding, customized binding rules. I also have longer-term plans such as a SAX module for generating bindings without having to build a DOM.
As for the name, "anobind" is really just a way of writing "4bind" that is friendly to computer identifiers. "Ano" is "four" in Igbo. "Anobind" also makes me think of chemistry; in particular, a possible name for the result when a negative ion reaches the positive electrode in a wet cell.
Meanwhile, gnosis.xml.objectify has not been standing still. David Mertz announced
release 1.1.0. The biggest
change is that Mertz acted on his thoughts about how to maintain sequence information,
which were prompted by my last article.
As long as you use the Expat parser (which is now the default), you can use new functions
such as children to obtain, from any binding object, child
content lists that reflect the original order in the source document.
Mertz also addressed my comment in the last article that gnosis.xml.objectify seemed to save only the last PCDATA block in mixed content:
I'm pretty sure what was actually going on was instead the fact that gnosis.xml.objectify has a habit of stripping surrounding whitespace. In my own uses, I tend to want to pull off some text content, but don't care about the way linefeeds and indents are used to prettify it.
XML data bindings are still a developing science. This fact, combined with Python's endless flexibility, means that there probably will and should be a proliferation of data binding systems that scratch the itches of various users. Anobind is yet another entry, and I know of one other so far unreleased project by one of my colleagues. Perhaps at some point consolidation will make sense, but for now the variety and ease of use of the Python data binding tools is good news for users.
In Python-XML news, Andrew Clover put an admirable effort into documenting bugs and inconsistencies in the various Python DOM implementations.
Jerome Alet announced version 3.01 of jaxml, a bug-fix release of the Python module for creating XML documents.
xmlSiteMakerPy 0.2 is a new tool by Ilya Nemihin, a Python-based XML and XSLT framework for offline (i.e. static) site generation. It used to be a PHP tool but was migrated to Python. It uses 4Suite for XSLT processing.
SOAPpy seems to have found its way to version 0.10.2 with very few release announcements. Web services development in Python has been quite stagnant for a while, so I hope that the folks behind the recent spate of activity become more vocal soon.
And the big news: Python 2.3 is out. The new boolean and set types should prove very useful to developers of Python-XML tools. 2.3 is also a good deal faster and fixes some Unicode bugs which are unlikely to be fixed in the 2.2 series. I encourage folks to upgrade right away. For more detail on what's new in the language, see the terse, official description or Andrew Kuchling's usual gift of a highly readable update.