XML Data Bindings in Python, Part 2

July 2, 2003

In my last article I started a discussion of data bindings for Python with a close look at generateDS.py. This time I'll look at another package, gnosis.xml.objectify from David Mertz's Gnosis Utilities. Dave Kuhlman, developer of generateDS, has also written up a comparison of his package and gnosis.xml.objectify.

Gnosis XML Utilities is a Python package with a variety of utility classes for data management, especially utility classes for XML processing. Mertz writes separate columns covering Python (Charming Python) and XML (XML Matters) on IBM developerWorks. The Gnosis tools are very handy and complementary to PyXML and 4Suite, which I have introduced in other recent articles in this column.

gnosis.xml.objectify

I'm using Gnosis_Utils-1.0.6.tar.gz, Python 2.2.2, and PyXML 0.8.2. The Gnosis installer uses distutils, but, unusually, requires the build and install steps to be executed separately: $ python setup_gnosis.py build and $ python setup_gnosis.py install.

The module gnosis.xml.objectify allows you to convert arbitrary XML documents to Python objects. At its most basic, it does ordinary marshaling and unmarshaling, but it's also a sophisticated data binding tool. Let's begin our examination by unmarshaling the sample document from the last article, reproduced as listing 1.

Listing 1: Example file for Python data binding comparison




<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season&#133;
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>

I did make one change from the last article's version: I added an attribute to each label element, because attributes are an important consideration for any binding. Running gnosis.xml.objectify on this file is a very simple matter:

>>> import gnosis.xml.objectify
>>> xml_obj = gnosis.xml.objectify.XML_Objectify('listing1.xml')
>>> py_obj = xml_obj.make_instance()

There are two steps to creating the Python representation of the XML document. gnosis.xml.objectify.XML_Objectify sets up a preparatory object with a DOM tree from which the Python structure is created. The make_instance method does the actual work of generating the Python structure. Considerations of memory usage, or any other performance measures, are not part of this comparison; but as I mentioned in the last article, it would be nice if Python data bindings were able to minimize memory usage. I think that it's best to process XML in small chunks, but I despair of convincing others of this. It seems that since people are used to treating traditional database instances as monolithic resources, they have a natural tendency to want to do the same with XML, stuffing all their data into huge documents that are very unwieldy to process. As a first step one can at least make sure the DOM used by make_instance is cleaned up right away by using the following variation:

py_obj = gnosis.xml.objectify.XML_Objectify('listing1.xml').make_instance()

As soon as the instance is created, and the interpreter leaves that line, the temporary XML DOM is reclaimed. So the DOM is temporarily in memory at the same time as the Python structure, but this is par for Python data bindings and certainly not unreasonable. If this is too heavyweight for you, gnosis.xml.objectify can generate the binding from the streaming pyexpat interface rather than DOM. That approach is much faster and uses less memory, although you do lose some features if you choose it.

The resulting Python data structure consists of a set of classes that are defined on the fly based on the XML structure. The root (document) element, labels, is represented by py_obj itself:

>>> print py_obj
<gnosis.xml.objectify._objectify._XO_labels instance at 0x824492c>

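Child elements are represented by data members with names based on the XML element generic identifiers (GIs). Each such data member holds the objects representing the child elements with that GI; as the session below shows, the repeated label element comes back as a list of objects, while the single name element comes back as one object. For example,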

>>> print py_obj.label
[<gnosis.xml.objectify._objectify._XO_label instance at 0x8208f64>,
<gnosis.xml.objectify._objectify._XO_label instance at 0x824355c>]
>>> print py_obj.label[0].name
<gnosis.xml.objectify._objectify._XO_name instance at 0x8136344>

Attributes are also accessed as data members given ordinary Python identifiers:

>>> print repr(py_obj.label[0].added)
u'2003-06-20'

For each element, content can be accessed using the PCDATA member:

>>> print py_obj.label[0].name.PCDATA
Thomas Eliot
>>> print repr(py_obj.label[0].name.PCDATA)
u'Thomas Eliot'
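To make the idea of classes defined on the fly more concrete, here is a minimal sketch of how a binding of this general shape could be built on the standard library's minidom. It is an illustration only, not the way gnosis.xml.objectify is implemented: the names class_for and bind are hypothetical, the code is written for current Python 3 rather than the Python 2.2 used in the sessions here, and for simplicity it always stores child elements in lists and joins all text into a single PCDATA string.

```python
from xml.dom import minidom

XML = """<labels>
  <label added="2003-06-20">
    <name>Thomas Eliot</name>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
  </label>
</labels>"""

_classes = {}  # hypothetical cache mapping each GI to its generated class


def class_for(gi):
    """Return the class named _XO_<GI>, creating it on first use."""
    if gi not in _classes:
        _classes[gi] = type("_XO_" + gi, (object,), {})
    return _classes[gi]


def bind(element):
    """Recursively turn a DOM element into an instance of its generated class."""
    obj = class_for(element.tagName)()
    # XML attributes become plain data members.
    for attr_name, attr_value in element.attributes.items():
        setattr(obj, attr_name, attr_value)
    text_chunks = []
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            # Child elements are grouped into a list named after their GI.
            obj.__dict__.setdefault(child.tagName, []).append(bind(child))
        elif child.nodeType == child.TEXT_NODE:
            text_chunks.append(child.data)
    # Simplification: all text is joined and stripped of outer white space.
    obj.PCDATA = "".join(text_chunks).strip()
    return obj


py_obj = bind(minidom.parseString(XML).documentElement)
first = py_obj.label[0]
print(type(py_obj).__name__)  # _XO_labels
print(first.added)            # 2003-06-20
print(first.name[0].PCDATA)   # Thomas Eliot
```

A sketch like this breaks down as soon as a GI is not a legal Python identifier or collides with a member such as PCDATA, which is exactly the kind of corner case a real binding tool has to decide how to handle.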

Also notice that gnosis.xml.objectify does the right thing with content: it represents it as Python Unicode objects. (I did not check how it would handle elements that use Unicode -- or dashes for that matter -- in GIs, given Python's identifier name limitations.) This bodes well for the high character test; indeed, it handles the ellipsis character just fine:

>>> print repr(py_obj.label[0].quote.PCDATA)
u' is its own season\x85\n    '

The above quote element, however, is mixed content. It appears that gnosis.xml.objectify keeps only the last chunk of text in the mix by default. In particular, the text before the emph element, even though it's only white space, doesn't seem directly accessible. The emph element is handled conventionally:

>>> print repr(py_obj.label[0].quote.emph)
<gnosis.xml.objectify._objectify._XO_emph instance at 0x81f6fec>
>>> print repr(py_obj.label[0].quote.emph.PCDATA)
u'Midwinter Spring'
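To see which chunks of text are at stake in the mixed content above, it helps to look at the same quote element through a plain DOM. The following sketch uses the standard library's minidom under current Python 3 and has nothing to do with the internals of gnosis.xml.objectify; it simply lists every text node that is a direct child of quote, in document order.

```python
from xml.dom import minidom

XML = """<quote>
  <!-- Mixed content -->
  <emph>Midwinter Spring</emph> is its own season&#133;
</quote>"""

quote = minidom.parseString(XML).documentElement

# Collect every text node that is a direct child of quote, in document order.
# The comment and the emph element separate the text into distinct nodes.
chunks = [node.data for node in quote.childNodes
          if node.nodeType == node.TEXT_NODE]

for chunk in chunks:
    print(repr(chunk))
```

The DOM holds three text chunks here: white space before the comment, white space before emph, and the trailing text. Only the last of these corresponds to the PCDATA value shown earlier.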

The quote element I'm exploring also has a comment, and gnosis.xml.objectify seems to offer experimental support for comments. I say "experimental" because digging into the relevant structure demonstrates very odd results:

>>> print py_obj.label[0].quote._comment
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print py_obj.label[0].quote._comment[0]
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print py_obj.label[0].quote._comment[0][0]
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print dir(py_obj.label[0].quote._comment)
['__doc__', '__getitem__', '__len__', '__module__']

Of course, the documentation says that comments are ignored, so I'd guess support is in development. The final thing to note about this default behavior of gnosis.xml.objectify is that the accumulation of various elements into Python lists means that the actual order of child elements in an XML document is lost. For example, the document in listing 2 would result in a root object with a spam data member which is a list of two elements and an eggs data member which is a list of one element, with no record of the fact that eggs occurred between the spam elements.

Listing 2: XML file demonstrating loss of ordering

<monty>
  <spam/>
  <eggs/>
  <spam/>
</monty>
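The loss of ordering is easy to reproduce without any binding library at all. This small sketch, written for current Python 3 with the standard library's minidom, groups the children of listing 2 by GI in roughly the way a name-keyed binding does; the grouping code is my own illustration, not taken from gnosis.xml.objectify.

```python
from xml.dom import minidom

XML = "<monty><spam/><eggs/><spam/></monty>"
root = minidom.parseString(XML).documentElement

# Document order, as the parser reports it.
document_order = [child.tagName for child in root.childNodes
                  if child.nodeType == child.ELEMENT_NODE]

# Group the same children by GI, as a name-keyed binding would.
grouped = {}
for gi in document_order:
    grouped.setdefault(gi, []).append(gi)

print(document_order)  # ['spam', 'eggs', 'spam']
print(grouped)         # {'spam': ['spam', 'spam'], 'eggs': ['eggs']}
```

Once the children are grouped, nothing in the result records that eggs sat between the two spam elements.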

gnosis.xml.objectify does have a very nice feature that allows you to recover a lot of the elided information. It keeps around the raw markup of any object with mixed content in a special data member, _XML.

>>> print repr(py_obj.label[0].quote._XML)
u'\n      <!-- Mixed content -->\n      <emph>Midwinter Spring</emph> is its own season\x85\n    '

You can also tune gnosis.xml.objectify to not maintain this raw information or to maintain it for all elements. And there is much more to the flexibility of the package than such simple tuning.
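For comparison, a similar string can be approximated from a plain DOM by serializing each child of the element in turn. The sketch below does this with the standard library's minidom under current Python 3. Note the assumption: this is a re-serialization of parsed nodes, so it is not guaranteed to match the source markup character for character, and it says nothing about how gnosis.xml.objectify fills in _XML.

```python
from xml.dom import minidom

XML = ("<quote><!-- Mixed content -->"
       "<emph>Midwinter Spring</emph> is its own season</quote>")

quote = minidom.parseString(XML).documentElement

# Serialize each child node and join the pieces, preserving document order.
inner_markup = "".join(child.toxml() for child in quote.childNodes)

print(inner_markup)
```

Unlike the grouped data members, a string like this keeps comments, text chunks, and child elements in their original sequence.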

Customizing the binding

One of the key features of gnosis.xml.objectify is the ability to customize data bindings by substituting your classes for the autogenerated ones. For example, if I know that I will need the ability to compute initials from names in label entries, I might write a program such as listing 3:

Listing 3: Using a customized element class

import gnosis.xml.objectify

class specialized_name(gnosis.xml.objectify._XO_):
    def get_initials(self):
        #Yes this could be done in a cuter fashion with reduce()
        #Going for clearer steps here
        parts = self.PCDATA.split()
        initial_letters = [ part[0].upper() for part in parts ]
        initials = ". ".join(initial_letters)
        if initials:
            initials += "."
        return initials

#Exercise the binding by yielding the initials in the sample document
#associate specialized_name class with elements with GI of "name"
gnosis.xml.objectify._XO_name = specialized_name

#Now "objectify" as before
xml_obj = gnosis.xml.objectify.XML_Objectify('listing1.xml')
py_obj = xml_obj.make_instance()

#Test the specialized method
for label in py_obj.label:
    print "name:", label.name.PCDATA
    print "initials:", label.name.get_initials()

Setting gnosis.xml.objectify._XO_<GI> for a particular GI establishes a class object to be used in binding corresponding elements, rather than the generated default. Running listing 3, I get

$ python listing3.py
name: Thomas Eliot
initials: T. E.
name: Ezra Pound
initials: E. P.

There is a lot you can do with customizable bindings. You can add routines for reserializing to XML, more sophisticated transforms and queries, or even specialized persistence modules.
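The substitution mechanism described above amounts to a lookup by name: use a class called _XO_<GI> if one has been supplied, and generate a default otherwise. Here is a minimal, self-contained sketch of that idea in current Python 3. The registry object, the class_for function, and the SpecializedName class are all hypothetical stand-ins of my own, so treat this as an illustration of the pattern rather than a description of the gnosis.xml.objectify source.

```python
import types

# Hypothetical stand-in for the binding module's namespace.
registry = types.SimpleNamespace()


class _XO_(object):
    """Base class for every bound element in this sketch."""
    PCDATA = ""


def class_for(gi):
    """Prefer a user-supplied registry._XO_<GI>; otherwise generate one."""
    name = "_XO_" + gi
    cls = getattr(registry, name, None)
    if cls is None:
        cls = type(name, (_XO_,), {})
        setattr(registry, name, cls)
    return cls


class SpecializedName(_XO_):
    def get_initials(self):
        # One initial per white-space separated part of the name.
        return " ".join(part[0].upper() + "." for part in self.PCDATA.split())


# Register the custom class before any binding takes place.
registry._XO_name = SpecializedName

name = class_for("name")()
name.PCDATA = "Thomas Eliot"
street = class_for("street")()

print(name.get_initials())    # T. E.
print(type(street).__name__)  # _XO_street
```

Because the lookup happens at binding time, the custom class has to be registered before the document is processed, which is why listing 3 assigns _XO_name before calling make_instance.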

Scratching my itch

So far I have looked at two Python data binding tools which represent the current state of the art. I listed some other tools in the last article, but I won't cover them just yet. In particular, XBind and Skyron look interesting, but they use specialized languages to drive the binding process. This is a reasonable approach, one which offers some potential advantages, including support for multiple programming languages. But I'm focusing on systems that are completely built around Python's dynamism.

Part of the reason why I still use DOM rather than Python bindings is that I'm accustomed to a lot of the other XML-processing tools that work closely with DOM right now: XPath, XPatterns, etc. And a lot of my XML usage has to do with the document flavor of XML, which doesn't really suit a lot of the current data bindings. I have long incubated ideas for a Python data binding library that would tend to suit my needs better. Setting the stage for this library has been one of my motives for taking a close look at the state of the art. In the next article I shall offer a preliminary examination of my effort, as well as a general discussion of what one might like in the ultimate Python data binding tool.

Since the last article, Mike Olson and I released 0.6 of wsdl4py, our simplistic library for WSDL document manipulation. The release is mainly based on Mark Bucciarelli's patches to support recent DOM libraries.