XML Data Bindings in Python, Part 2
July 2, 2003
In my last article I started a discussion of data bindings for Python with a close look at generateDS.py. This time I'll look at another package, gnosis.xml.objectify from David Mertz's Gnosis Utilities. Dave Kuhlman, developer of generateDS, has also written up a comparison of his package and gnosis.xml.objectify.
Gnosis XML Utilities is a Python package with a variety of utility classes for data management, especially utility classes for XML processing. Mertz writes separate columns covering Python (Charming Python) and XML (XML Matters) on IBM developerWorks. The Gnosis tools are very handy and complementary to PyXML and 4Suite, which I have introduced in other recent articles in this column.
I'm using Gnosis_Utils-1.0.6.tar.gz, Python 2.2.2, and PyXML 0.8.2. The Gnosis installer uses distutils, but, unusually, requires build and install steps to be executed separately:
$ python setup_gnosis.py build
$ python setup_gnosis.py install
gnosis.xml.objectify allows you to convert arbitrary XML documents to Python objects. At its most basic, it does ordinary marshaling and unmarshaling, but it's also a sophisticated data binding tool. Let's begin our examination by unmarshaling the sample document from the last article, reproduced as listing 1.
Listing 1: Example file for Python data binding comparison
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season…
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
I did make one change from the last article's version. I added attributes, which are an important consideration for any binding. Running gnosis.xml.objectify on this file is a very simple matter:
>>> import gnosis.xml.objectify
>>> xml_obj = gnosis.xml.objectify.XML_Objectify('listing1.xml')
>>> py_obj = xml_obj.make_instance()
There are two steps to creating the Python representation of the XML document. gnosis.xml.objectify.XML_Objectify sets up a preparatory object with a DOM tree from which the Python structure is created. The make_instance method does the actual work of generating the Python structure. Memory usage and other performance measures are not part of this comparison, but as I mentioned in the last article, it would be nice if Python data bindings were able to minimize memory usage. I think that it's best to process XML in small chunks, but I despair of convincing others of this. It seems that since people are used to treating traditional database instances as monolithic resources, they have a natural tendency to want to do the same with XML, putting all their data into huge documents that are very unwieldy to process. As a first step, you can at least make sure the DOM used by make_instance is cleaned up right away by using the following variation:
py_obj = XML_Objectify('listing1.xml').make_instance()
As soon as the instance is created and the interpreter leaves that line, the temporary XML DOM is reclaimed. So the DOM is briefly in memory at the same time as the Python structure, but this is par for Python data bindings and certainly not unreasonable. If this is too heavyweight for you, gnosis.xml.objectify allows you to generate the binding from the streaming pyexpat interface rather than DOM. That approach is much faster and uses less memory, although you do lose some features if you choose it.
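To give a feel for what streaming means here, the following is a minimal sketch that uses the standard library's xml.parsers.expat module directly. It is not how gnosis.xml.objectify is implemented, and it is written for a current Python 3 interpreter rather than the Python 2.2 used elsewhere in this article. The collect_names function and the trimmed SAMPLE document are my own illustrative assumptions.

```python
import xml.parsers.expat

# A trimmed version of listing 1 (assumption: only the parts needed here)
SAMPLE = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20"><name>Thomas Eliot</name></label>
  <label added="2003-06-10"><name>Ezra Pound</name></label>
</labels>"""


def collect_names(xml_bytes):
    """Collect the text of every name element without building a tree."""
    names = []
    state = {"in_name": False, "chunks": []}

    def start(tag, attrs):
        if tag == "name":
            state["in_name"] = True
            state["chunks"] = []

    def chars(data):
        # expat may deliver a single text node in several pieces
        if state["in_name"]:
            state["chunks"].append(data)

    def end(tag):
        if tag == "name":
            names.append("".join(state["chunks"]))
            state["in_name"] = False

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_bytes, True)  # True marks the final chunk of input
    return names


print(collect_names(SAMPLE))
```

Only the handler state lives in memory at any moment, which is where the savings come from. The price is that the code has to track context by hand, which hints at why a binding built this way might give up some features.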
The resulting Python data structure consists of a set of classes that are defined on the fly based on the XML structure. The root (document) element is represented by the object that make_instance returns:
>>> print py_obj
<gnosis.xml.objectify._objectify._XO_labels instance at 0x824492c>
Child elements are represented by data members with names based on the XML element generic identifiers (GIs). Each such data member is a list of objects representing child elements. For example,
>>> print py_obj.label
[<gnosis.xml.objectify._objectify._XO_label instance at 0x8208f64>, <gnosis.xml.objectify._objectify._XO_label instance at 0x824355c>]
>>> print py_obj.label[0].name
<gnosis.xml.objectify._objectify._XO_name instance at 0x8136344>
Attributes are also accessed as data members given ordinary Python identifiers:
>>> print repr(py_obj.label[0].added)
u'2003-06-20'
For each element, character content can be accessed using the PCDATA member:
>>> print py_obj.label[0].name.PCDATA
Thomas Eliot
>>> print repr(py_obj.label[0].name.PCDATA)
u'Thomas Eliot'
Also notice that gnosis.xml.objectify does the right thing with content: it represents it as Python Unicode objects. (I did not check how it would handle elements that use Unicode -- or dashes for that matter -- in GIs, given Python's identifier name limitations.) This bodes well for the high character test; indeed, it handles the ellipsis character just fine:
>>> print repr(py_obj.label[0].quote.PCDATA)
u' is its own season\x85\n '
The quote element, however, is mixed content. It appears that gnosis.xml.objectify keeps only the last chunk of content in the mix by default, and not the rest. In particular, the text before the emph element, even though it's only white space, doesn't seem directly accessible. The emph element is handled like any other child element:
>>> print repr(py_obj.label[0].quote.emph)
<gnosis.xml.objectify._objectify._XO_emph instance at 0x81f6fec>
>>> print repr(py_obj.label[0].quote.emph.PCDATA)
u'Midwinter Spring'
The quote element I'm exploring also has a comment, and gnosis.xml.objectify seems to offer experimental support for comments. I say "experimental" because digging into the relevant structure demonstrates very odd results:
>>> print py_obj.label[0].quote._comment
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print py_obj.label[0].quote._comment
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print py_obj.label[0].quote._comment
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print dir(py_obj.label[0].quote._comment)
['__doc__', '__getitem__', '__len__', '__module__']
Of course, the documentation says that comments are ignored, so I'd guess support is still in development. The final thing to note about this default behavior of gnosis.xml.objectify is that the accumulation of various elements into Python lists means that the actual order of child elements in an XML document is lost. For example, the document in listing 2 would result in a root object with a spam data member which is a list of two elements and an eggs data member which is a list of one element, with no record of the fact that eggs occurred between the two spam elements.
<monty>
  <spam/>
  <eggs/>
  <spam/>
</monty>
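The loss of ordering is easy to reproduce outside gnosis.xml.objectify. The following sketch uses the standard library's xml.etree.ElementTree as a stand-in parser, written for Python 3, and groups children by element name the way a list-per-name binding would. The grouped dictionary is my own illustration, not a structure the package exposes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

root = ET.fromstring("<monty><spam/><eggs/><spam/></monty>")

# Document order, as the parser delivers it
document_order = [child.tag for child in root]

# Group children by element name, one list per name
grouped = defaultdict(list)
for child in root:
    grouped[child.tag].append(child)

counts = {tag: len(items) for tag, items in grouped.items()}

print(document_order)  # ['spam', 'eggs', 'spam']
print(counts)          # {'spam': 2, 'eggs': 1}
```

Once the children are grouped, nothing in the result says that eggs sat between the two spam elements. Recovering that would take an extra record of positions or the raw markup.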
gnosis.xml.objectify does have a very nice feature that allows you to recover a lot of the elided information. It keeps around the raw markup of any object with mixed content in a special data member, _XML:
>>> print repr(py_obj.label[0].quote._XML)
u'\n <!-- Mixed content -->\n <emph>Midwinter Spring</emph> is its own season\x85\n '
You can also tune gnosis.xml.objectify to not maintain this raw information or to maintain it for all elements. And there is much more to the flexibility of the package than such simple tuning.
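For comparison, here is a rough sketch of how the individual chunks of mixed content can be walked in document order with the standard library's xml.etree.ElementTree, written for Python 3. This is not a gnosis.xml.objectify feature. The mixed_chunks helper is hypothetical, and the comment is left out of the sample because ElementTree discards comments by default.

```python
import xml.etree.ElementTree as ET

# The quote element from listing 1, minus the comment
quote = ET.fromstring(
    "<quote>\n  <emph>Midwinter Spring</emph> is its own season\u2026\n</quote>"
)


def mixed_chunks(element):
    """Return text chunks and child elements in document order."""
    chunks = []
    if element.text:
        chunks.append(element.text)  # text before the first child
    for child in element:
        chunks.append(child)
        if child.tail:
            chunks.append(child.tail)  # text following this child
    return chunks


chunks = mixed_chunks(quote)
for chunk in chunks:
    if isinstance(chunk, str):
        print("text:", repr(chunk))
    else:
        print("element:", chunk.tag, repr(chunk.text))
```

The leading white space before emph, which the default binding does not expose, shows up here as the first chunk.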
Customizing the binding
One of the key features of gnosis.xml.objectify is the ability to customize data bindings by substituting your classes for the autogenerated ones. For example, if I know that I will need the ability to compute initials from names in label entries, I might write a program such as listing 3:
Listing 3: Using a customized element class
import gnosis.xml.objectify

class specialized_name(gnosis.xml.objectify._XO_):
    def get_initials(self):
        #Yes this could be done in more cute fashion with reduce()
        #Going for clearer steps here
        parts = self.PCDATA.split()
        #Take the first letter of each part of the name
        initial_letters = [ part[0].upper() for part in parts ]
        initials = ". ".join(initial_letters)
        if initials: initials += "."
        return initials

#Exercise the binding by yielding the initials in the sample document

#Associate specialized_name class with elements with GI of "name"
gnosis.xml.objectify._XO_name = specialized_name

#Now "objectify" as before
xml_obj = gnosis.xml.objectify.XML_Objectify('listing1.xml')
py_obj = xml_obj.make_instance()

#Test the specialized method
for label in py_obj.label:
    print "name:", label.name.PCDATA
    print "initials:", label.name.get_initials()
Setting gnosis.xml.objectify._XO_<GI> for a particular GI establishes a class object to be used in binding corresponding elements, rather than the generated default. Running listing 3, I get
$ python listing3.py
name: Thomas Eliot
initials: T. E.
name: Ezra Pound
initials: E. P.
There is a lot you can do with customizable bindings. You can add routines for reserializing to XML, more sophisticated transforms and queries, or even specialized persistence modules.
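To make the substitution idea concrete, here is a toy binder of my own, built on the standard library's xml.etree.ElementTree and written for Python 3. It is a sketch of the general technique only: the Node and NameNode classes, the CLASS_FOR_TAG registry and the bind function are all hypothetical names, and none of this reflects how gnosis.xml.objectify is implemented internally.

```python
import xml.etree.ElementTree as ET


class Node:
    """Default binding class: attributes and children become data members."""
    def __init__(self):
        self.PCDATA = ""


class NameNode(Node):
    """Hypothetical custom class for name elements."""
    def get_initials(self):
        parts = self.PCDATA.split()
        return " ".join(part[0].upper() + "." for part in parts)


# Hypothetical registry: element name -> class to instantiate
CLASS_FOR_TAG = {"name": NameNode}


def bind(element):
    """Recursively build a Python object from an ElementTree element."""
    obj = CLASS_FOR_TAG.get(element.tag, Node)()
    obj.PCDATA = (element.text or "").strip()
    for key, value in element.attrib.items():
        setattr(obj, key, value)
    for child in element:
        bound = bind(child)
        existing = getattr(obj, child.tag, None)
        if existing is None:
            setattr(obj, child.tag, bound)  # first child with this name
        elif isinstance(existing, list):
            existing.append(bound)          # third or later child
        else:
            setattr(obj, child.tag, [existing, bound])  # second child
    return obj


doc = ET.fromstring(
    "<labels>"
    "<label added='2003-06-20'><name>Thomas Eliot</name></label>"
    "<label added='2003-06-10'><name>Ezra Pound</name></label>"
    "</labels>"
)
labels = bind(doc)
for label in labels.label:
    print(label.name.PCDATA, "->", label.name.get_initials())
```

The sketch ignores name clashes between attributes, child elements and methods, which a real binding has to deal with. The point is only that looking up a class per element name is enough to let user code attach behavior to the bound objects.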
Scratching my itch
So far I have looked at two Python data binding tools which represent the current state of the art. I listed some other tools in the last article, but I won't cover them just yet. In particular, XBind and Skyron look interesting, but they use specialized languages to drive the binding process. This is a reasonable approach, one which offers some potential advantages, including support for multiple programming languages. But I'm focusing on systems that are completely built around Python's dynamism.
Part of the reason why I still use DOM rather than Python bindings is that I'm accustomed to a lot of the other XML-processing tools that work closely with DOM right now: XPath, XPatterns, etc. And a lot of my XML usage has to do with the document flavor of XML, which doesn't really suit a lot of the current data bindings. I have long incubated ideas for a Python data binding library that would tend to suit my needs better. Setting the stage for this library has been one of my motives for taking a close look at the state of the art. In the next article I shall offer a preliminary examination of my effort, as well as a general discussion of what one might like in the ultimate Python data binding tool.
Since the last article, Mike Olson and I released 0.6 of wsdl4py, our simplistic library for WSDL document manipulation. The release is mainly based on Mark Bucciarelli's patches to support recent DOM libraries.