XML.com: XML From the Inside Out

Introducing Anobind

August 13, 2003

My recent interest in Python-XML data bindings was sparked not only by discussion in the XML community of effective approaches to XML processing, but also by personal experience with large projects where data binding approaches might have been particularly suitable. These projects involved processing both data-style and document-style XML instances, complex systems of processing rules connected to the XML format, and other characteristics requiring flexibility from a data binding system. As a result of these considerations, and of my study of existing Python-XML data binding systems, I decided to write a new Python-XML data binding, which I call Anobind.

I designed Anobind with several properties in mind, some of which I have admired in other data binding systems, and some that I have thought were, unfortunately, lacking in other systems:

  • A natural default binding (i.e. when given an XML file with no hints or customization)
  • Well-defined mapping from XML to Python identifiers
  • Declarative, rules-based system for fine-tuning the binding
  • XPattern support for rules definition
  • Strong support for document-style XML (especially with regard to mixed content)
  • Reasonable support for unbinding back to XML
  • Some flexibility in trading off between efficiency and features in the resulting binding
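
To make the second item concrete, here is a minimal sketch of what a well-defined mapping from XML names to Python identifiers could look like. It is not Anobind's actual rule: the function name xml_name_to_identifier and the specific choices (dropping the prefix, substituting underscores, suffixing reserved words) are assumptions made purely for illustration, and the code targets Python 3 rather than the Python 2.2 used elsewhere in this article.

```python
import keyword
import re


def xml_name_to_identifier(name):
    """Map an XML name to a legal Python identifier (hypothetical rule)."""
    # Drop any namespace prefix: "xsl:template" -> "template"
    local = name.split(":")[-1]
    # Hyphens and periods are legal in XML names but not in identifiers
    ident = re.sub(r"\W", "_", local)
    # Avoid clashing with reserved words such as "class" or "for"
    if keyword.iskeyword(ident):
        ident += "_"
    return ident


for xml_name in ("street-name", "class", "xsl:template"):
    print(xml_name, "->", xml_name_to_identifier(xml_name))
```

The point of writing such a rule down is predictability: given only the XML name, a developer can work out which attribute to type without inspecting the binding.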
    

In this article I introduce Anobind, paying attention to the same considerations that guided my earlier introduction of generateDS.py and gnosis.xml.objectify.

Getting started with Anobind

Anobind 0.5 is the version I cover in this article. You can download it at the home page. Python 2.2 and 4Suite 1.0a3 are required; I'm using Python 2.2.2 and 4Suite 1.0a3. Installing Anobind is a simple matter of untarring the archive and then running good old python setup.py install.

The XML file I use to exercise the binding is the same as the one I used for gnosis.xml.objectify, listing 1 in the last article. I've named that file labels.xml, and Listing 1 in this article shows basic steps for loading it into a data binding.

Listing 1: Basic steps for creating a binding from an XML file

import anobind
from Ft.Xml import InputSource
from Ft.Lib import Uri

#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
#Create a URI from a filename the right way
file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
isrc = isrc_factory.fromUri(file_uri)

#Now bind from the XML given in the input source
binder = anobind.binder()
binding = binder.read_xml(isrc)

The first thing you'll notice is that the Anobind example takes eight lines to do what gnosis.xml.objectify does in three. Anobind uses the 4Suite input source architecture for dealing with XML files. Although this API is a bit more verbose than some others, I'm a strong proponent of it because it is explicit and minimizes the sort of unpleasant surprises that developers run into when they try to interchange URIs and file system paths or try to perform XML operations that require URI resolution. At any rate, if you already have a 4Suite DOM node handy, creating a binding is much more terse:

>>> binder = anobind.binder()
>>> binding = binder.read_domlette(node)
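
The distinction between file system paths and URIs mentioned above can be illustrated without 4Suite. The following sketch uses only the Python 3 standard library as a stand-in for Uri.OsPathToUri; the path and the relative reference are made up for the example.

```python
from urllib.parse import quote, urljoin

# Hypothetical file system path containing a space
os_path = "/home/user/my labels.xml"

# A path is not a URI: characters such as spaces must be percent-escaped
file_uri = "file://" + quote(os_path)

# Relative references (say, from an include or an external entity) are
# resolved against the base URI of the document that contains them
schema_uri = urljoin(file_uri, "schema/labels.rng")

print(file_uri)    # file:///home/user/my%20labels.xml
print(schema_uri)  # file:///home/user/schema/labels.rng
```

Treating the raw path string as if it were the URI would get both steps wrong, which is the kind of surprise an explicit input source is meant to prevent.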

By default Anobind tries to minimize memory usage by trimming the source DOM tree as it goes along.

Peeking and Poking

The resulting Python data structure, as with gnosis.xml.objectify, consists of a set of classes defined on-the-fly, based on the XML structure. One big difference is that in Anobind the document itself is represented by a binding object, which is returned as binding in the above code examples:

$ python -i listing1.py
>>> print binding
<anobind.document_base object at 0x8176a4c>

You then access child elements naturally, as regular data members:

>>> print binding.labels
<anobind.labels object at 0x84643cc>
>>> print binding.labels.label
<anobind.label object at 0x8465a3c>

Look closely at the second statement above. There are two label elements, but it seems I access the label data member as a simple scalar value. Most Python data bindings offer tricks so that sets of child elements can be accessed in some natural idiom, whether there is one element or more. Anobind is no exception. If you access the data member in the simple manner, as above, you get the first (or only) corresponding object from the element. You can also use list-like access to grab a particular element or even loop over all the elements:

>>> print binding.labels.label[0]
<anobind.label object at 0x8465a3c>
>>> print binding.labels.label[1]
<anobind.label object at 0x846842c>
>>> for label in binding.labels.label:
...     print label
...
<anobind.label object at 0x8465a3c>
<anobind.label object at 0x846842c>
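
One way such dual scalar and list-like access can be arranged is sketched below in Python 3. This is an assumption about the general technique, not a description of Anobind's internals: the Element class, the add_child helper and the _same_name_siblings attribute are all hypothetical names.

```python
class Element:
    """Hypothetical bound element; not Anobind's actual class."""

    def __init__(self, tag):
        self._tag = tag
        # Every element starts out as the only one with its name
        self._same_name_siblings = [self]

    def __getitem__(self, index):
        # parent.label[1] looks up the shared sibling list
        return self._same_name_siblings[index]

    def __iter__(self):
        # "for label in parent.label" walks the same list
        return iter(self._same_name_siblings)

    def __len__(self):
        return len(self._same_name_siblings)


def add_child(parent, child):
    """Attach child so that parent.<tag> is always the first such child."""
    first = getattr(parent, child._tag, None)
    if first is None:
        setattr(parent, child._tag, child)
    else:
        first._same_name_siblings.append(child)
        child._same_name_siblings = first._same_name_siblings


labels = Element("labels")
first_label, second_label = Element("label"), Element("label")
add_child(labels, first_label)
add_child(labels, second_label)

print(labels.label is first_label)      # scalar access yields the first
print(labels.label[1] is second_label)  # list-like access still works
print(len(list(labels.label)))          # iteration covers both elements
```

The design choice here is that the attribute always holds a real element object, so code written for the single-element case keeps working when a second element shows up.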

Anobind tries to keep track of the order of things in the source document. Thus objects from XML elements and documents have a children list which maintains references to child markup and Unicode instances for child text, all maintained in the order read from the document.

>>> print binding.children[0].children[1]
<anobind.label object at 0x8465a3c>

This is equivalent to the earlier binding.labels.label[0]. It's binding.children[0].children[1] rather than binding.children[0].children[0] because the white space preceding the first child element counts as a child (in the form of a simple Unicode object):

>>> print repr(binding.children[0].children[0])
u'\n  '
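
The same effect is easy to observe in any tree API that preserves document order. The sketch below uses Python 3's xml.dom.minidom rather than Anobind, with a small stand-in document instead of the full labels.xml.

```python
from xml.dom import minidom

# Stand-in document: two label elements, indented with two spaces
DOC = "<labels>\n  <label added='2003-06-20'/>\n  <label/>\n</labels>"

root = minidom.parseString(DOC).documentElement
children = root.childNodes

# The indentation before the first element is itself a child (a text node)
print(repr(children[0].data))
# The first label element is therefore the second child
print(children[1].tagName)
# text, label, text, label, text
print(len(children))
```

Keeping those text nodes in place is what makes order-sensitive processing and faithful serialization possible, at the cost of less convenient positional indexing.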

As with gnosis.xml.objectify, attributes are also accessed as data members given ordinary Python identifiers and non-ASCII characters are handled without problem. The text content of elements is accessible using the text_content method.

>>> print repr(binding.labels.label.added)
u'2003-06-20'
>>> print repr(binding.labels.label.quote.text_content())
u'\n      \n       is its own season\x85\n    '
>>> print repr(binding.labels.label.quote.emph.text_content())
u'Midwinter Spring'

Comments are represented by special objects in the children list:

>>> print binding.labels.label.quote.children[1]
<anobind.comment_base object at 0x8177d6c>
>>> print repr(binding.labels.label.quote.children[1].body)
u' Mixed content '
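
Judging from the output above, text_content on the quote element gathers only the text that is a direct child of that element: the words inside emph and the body of the comment are left out. The following Python 3 sketch reproduces that behavior with xml.dom.minidom. The direct_text helper is hypothetical, and the fragment is a stand-in for the quote element in labels.xml with the trailing non-ASCII character omitted.

```python
from xml.dom import minidom

# Stand-in for the mixed-content quote element
QUOTE = (
    "<quote>\n"
    "      <!-- Mixed content -->\n"
    "      <emph>Midwinter Spring</emph> is its own season\n"
    "    </quote>"
)


def direct_text(node):
    """Join the text nodes that are immediate children of node."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


quote = minidom.parseString(QUOTE).documentElement
emph = quote.getElementsByTagName("emph")[0]
comment = quote.childNodes[1]

print(repr(direct_text(quote)))  # child text only, whitespace included
print(repr(direct_text(emph)))   # the emphasized words live one level down
print(repr(comment.data))        # the comment keeps its own body
```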

Anobind also tries to support roundtrip back to XML. The default binding described above provides the same level of roundtripping as the XSLT identity transform. That is, it's good enough for most uses. To generate XML from a binding use the unbind method. The following snippet writes the XML input to the binding back to the console:

>>> import sys
>>> binding.unbind(sys.stdout)
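
To show what this level of roundtripping means in practice, the sketch below parses and reserializes a small stand-in document with Python 3's xml.dom.minidom instead of Anobind. Elements, attributes, comments and text survive, while purely lexical details such as the quote character around attribute values do not.

```python
from xml.dom import minidom

# Stand-in document; note the single-quoted attribute value
SOURCE = (
    "<labels><label added='2003-06-20'>"
    "<!-- Mixed content -->text</label></labels>"
)

doc = minidom.parseString(SOURCE)
serialized = doc.documentElement.toxml()

# Parsing the output again yields the same serialization, so nothing
# that the tree records is lost on a second trip
reparsed = minidom.parseString(serialized)

print(serialized)
print(serialized == SOURCE)
print(reparsed.documentElement.toxml() == serialized)
```

The output differs from the input only in the attribute quoting, a difference no XML processor downstream should care about.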

Customizing the binding

As in gnosis.xml.objectify, you can substitute your classes for the ones generated by Anobind. Copying the example from the last article, Listing 2 demonstrates custom classes by adding the ability to compute initials from names in label entries:

Listing 2: Using a customized element class

import anobind
from Ft.Xml import InputSource
from Ft.Lib import Uri
from xml.dom import Node
 
#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
#Create a URI from a filename the right way
file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
isrc = isrc_factory.fromUri(file_uri)
 
class specialized_name(anobind.element_base):
    def get_initials(self):
        #Yes this could be done in more cute fashion with reduce()
        #Going for clearer steps here
        parts = self.text_content().split()
        initial_letters = [ part[0].upper() for part in parts ]
        initials = ". ".join(initial_letters)
        if initials:
            initials += "."
        return initials

#Exercise the binding by yielding the initials in the sample document
binder = anobind.binder()

#associate specialized_name class with elements with GI of "name"
binder.binding_classes[(Node.ELEMENT_NODE, "name")] = specialized_name

#Then bind
binding = binder.read_xml(isrc)

#Show the specialized instance
print binding.labels.label.name

#Test the specialized method
for label in binding.labels.label:
    print "name:", label.name.text_content()
    print "initials:", label.name.get_initials()

The binder maintains a mapping from node type and name to binding class. If it doesn't find an entry in this mapping, it generates a class for the binding of the node. Running Listing 2, I get the following:

$ python listing2.py
<__main__.specialized_name object at 0x8458d5c>
name: Thomas Eliot
initials: T. E.
name: Ezra Pound
initials: E. P.

The customized class is represented as __main__.specialized_name, where __main__ indicates that the class is defined at the top level of the module invoked from the command line.
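
The lookup-or-generate behavior described above can be sketched in a few lines of Python 3. This is only an illustration of the technique: the Binder, ElementBase and SpecializedName classes and the class_for method are hypothetical, and whether Anobind caches generated classes in the same mapping is an assumption.

```python
from xml.dom import Node


class ElementBase:
    """Hypothetical base class for bound elements."""

    def __init__(self, text=""):
        self.text = text


class SpecializedName(ElementBase):
    def get_initials(self):
        # "Thomas Eliot" -> "T. E."
        return " ".join(part[0].upper() + "." for part in self.text.split())


class Binder:
    def __init__(self):
        # Maps (node type, name) to the class used for binding such nodes
        self.binding_classes = {}

    def class_for(self, node_type, name):
        key = (node_type, name)
        if key not in self.binding_classes:
            # No registered class: generate one on the fly and remember it
            self.binding_classes[key] = type(name, (ElementBase,), {})
        return self.binding_classes[key]


binder = Binder()
binder.binding_classes[(Node.ELEMENT_NODE, "name")] = SpecializedName

name_class = binder.class_for(Node.ELEMENT_NODE, "name")
street_class = binder.class_for(Node.ELEMENT_NODE, "street")

print(name_class("Thomas Eliot").get_initials())
print(street_class.__name__)
```

The three-argument form of type() is what makes the generated class carry the element's name, much like the anobind.label objects shown earlier.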

Anobind is rules-driven, and you can perform more complex customizations pretty easily by defining your own rules. Anobind also comes with some rules to handle common deviations from the default binding. For example, it's a bit wasteful that the name, street, city and state elements are rendered as full objects when they are always simple text content. Listing 3 demonstrates a variation on the binding that treats them as one would simple attributes, saving resources and simplifying access.

Listing 3: Turning certain elements into simple data members

import anobind
from Ft.Xml import InputSource
from Ft.Lib import Uri

#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
#Create a URI from a filename the right way
file_uri = Uri.OsPathToUri("labels.xml", attemptAbsolute=1)
isrc = isrc_factory.fromUri(file_uri)

#Now bind from the XML given in the input source
binder = anobind.binder()

#Specify (using XPatterns) elements to be treated similarly to attributes
stringify = ["label/name", "label/address/*"]
custom_rule = anobind.simple_string_element_rule(stringify)
binder.add_rule(custom_rule)

#Bind
binding = binder.read_xml(isrc)

print binding.labels.label.name
print binding.labels.label.address.street

Which results in the following:

$ python listing3.py
Thomas Eliot
3 Prufrock Lane

This system of rules declared and registered as Python objects is, I think, the distinctive feature of Anobind. You can use XPatterns as the trigger for rules, as in Listing 3, or you can use the full power of Python. For an earlier discussion of XPattern (which is a standard specialization of XPath designed for just such rules triggering), see A Tour of 4Suite.
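
The general shape of a rules-driven binder can be sketched as follows in Python 3. Everything here is hypothetical and much simpler than Anobind: the matches function is a crude stand-in for XPattern that only compares trailing path steps (with * as a wildcard), and SimpleStringRule, Bound and bind are invented names.

```python
import xml.etree.ElementTree as ET


def matches(pattern, path):
    """True if the trailing steps of path match pattern ('*' is a wildcard)."""
    steps = pattern.split("/")
    tail = path[-len(steps):]
    return len(tail) == len(steps) and all(
        step == "*" or step == name for step, name in zip(steps, tail)
    )


class SimpleStringRule:
    """Bind matching elements as plain strings rather than objects."""

    def __init__(self, patterns):
        self.patterns = patterns

    def applies(self, path):
        return any(matches(pattern, path) for pattern in self.patterns)


class Bound:
    """Plain holder object for elements no rule claims."""


def bind(elem, rules, path=()):
    path = path + (elem.tag,)
    if any(rule.applies(path) for rule in rules):
        return elem.text or ""
    obj = Bound()
    for child in elem:
        # Repeated child names would overwrite each other in this sketch
        setattr(obj, child.tag, bind(child, rules, path))
    return obj


DOC = (
    "<labels><label><name>Thomas Eliot</name>"
    "<address><street>3 Prufrock Lane</street></address>"
    "</label></labels>"
)
rule = SimpleStringRule(["label/name", "label/address/*"])
labels = bind(ET.fromstring(DOC), [rule])

print(labels.label.name)
print(labels.label.address.street)
```

Because a rule is an ordinary object with an applies method, a trigger written in plain Python can be dropped in next to the pattern-based one without changing the binder.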

Anobind does offer a few other features I haven't mentioned, including support for a subset of XPath expressions over binding objects.

Further Developments

Anobind is really just easing out of the gates. I have several near-term plans for it, including a tool that reads RELAX NG files and generates corresponding, customized binding rules. I also have longer-term plans such as a SAX module for generating bindings without having to build a DOM.

As for the name, "anobind" is really just a way of writing "4bind" that is friendly to computer identifiers. "Ano" is "four" in Igbo. "Anobind" also makes me think of chemistry; in particular, a possible name for the result when a negative ion reaches the positive electrode in a wet cell.

Meanwhile, gnosis.xml.objectify has not been standing still. David Mertz announced release 1.1.0. The biggest change is that Mertz acted on his thoughts about how to maintain sequence information, which were prompted by my last article. As long as you use the Expat parser (which is now the default), you can use new functions content and children to obtain from any binding object child content lists that reflect the original order in the source document.

Mertz also addressed my comment in the last article that gnosis.xml.objectify seemed to save only the last PCDATA block in mixed content:

"I'm pretty sure what was actually going on was instead the fact that gnosis.xml.objectify has a habit of stripping surrounding whitespace. In my own uses, I tend to want to pull off some text content, but don't care about the way linefeeds and indents are used to prettify it."

Options, options

XML data bindings are still a developing science. This fact, combined with Python's endless flexibility, means that there probably will and should be a proliferation of data binding systems that scratch the itches of various users. Anobind is yet another entry, and I know of one other so far unreleased project by one of my colleagues. Perhaps at some point consolidation will make sense, but for now the variety and ease of use of the Python data binding tools are good news for users.

In Python-XML news, Andrew Clover put an admirable effort into documenting bugs and inconsistencies in the various Python DOM implementations.

I announced the release of 4Suite 1.0a3, which is mostly a bug-fix release.

Martin v. Löwis announced the release of PyXML 0.8.3, which fixes quite a few bugs and addresses build problems for Mac OS X users.

Jerome Alet announced version 3.01 of jaxml, a bug-fix release of the Python module for creating XML documents.

Ilya Nemihin released xmlSiteMakerPy 0.2, a Python-based XML and XSLT framework for offline (i.e. static) site generation. It used to be a PHP tool but was migrated to Python. It uses 4Suite for XSLT processing.

Fredrik Lundh's ElementTree (see previous article) is now up to its second 1.2 alpha release, which mainly features bug fixes from the first.

SOAPpy seems to have found its way to version 0.10.2 with very few release announcements. Web services development in Python has been quite stagnant for a while, so I hope that the folks behind the recent spate of activity become more vocal soon.

And the big news: Python 2.3 is out. The new boolean and set types should prove very useful to developers of Python-XML tools. 2.3 is also a good deal faster and fixes some Unicode bugs which are unlikely to be fixed in the 2.2 series. I encourage folks to upgrade right away. For more detail on what's new in the language, see the terse, official description or Andrew Kuchling's usual gift of a highly readable update.


