XML Data Bindings in Python, Part 2
In my last article I started a discussion of data bindings for Python with a close look at generateDS.py. This time I'll look at another package, gnosis.xml.objectify from David Mertz's Gnosis Utilities. Dave Kuhlman, developer of generateDS, has also written up a comparison of his package and gnosis.xml.objectify.
Gnosis XML Utilities is a Python package with a variety of utility classes for data management, especially utility classes for XML processing. Mertz writes separate columns covering Python (Charming Python) and XML (XML Matters) on IBM developerWorks. The Gnosis tools are very handy and complementary to PyXML and 4Suite, which I have introduced in other recent articles in this column.
I'm using Gnosis_Utils-1.0.6.tar.gz,
Python 2.2.2, and PyXML 0.8.2. The Gnosis installer uses
distutils, but, unusually, requires the build and install steps to
be executed separately:
$ python setup_gnosis.py build and
$ python setup_gnosis.py install.
The gnosis.xml.objectify module allows you to
convert arbitrary XML documents to Python objects. At its most
basic, it does ordinary marshaling and unmarshaling, but it's also a
sophisticated data binding tool. Let's begin our examination by
unmarshaling the sample document from the last article, reproduced
as listing 1.
Listing 1: Example file for Python data binding comparison
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season…
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
I did make one change from the last article's version. I added attributes, which are an important consideration for any binding. Running gnosis.xml.objectify on this file is a very simple matter:
>>> import gnosis.xml.objectify
>>> xml_obj = gnosis.xml.objectify.XML_Objectify('listing1.xml')
>>> py_obj = xml_obj.make_instance()
There are two steps to creating the Python representation of the
XML document. The XML_Objectify constructor
sets up a preparatory object with a DOM tree from which the Python
structure is created. The
make_instance method does
the actual work of generating the Python structure.
Considerations of memory usage, or any other performance measures,
are not part of this comparison; but as I mentioned in the last
article, it would be nice if Python data bindings were able to
minimize memory usage. I think that it's best to process XML in
small chunks, but I despair of convincing others of this. It
seems that since people are used to treating traditional database
instances as monolithic resources, they have a natural tendency to
want to do the same with XML, stuffing all their data into huge
documents that are very unwieldy to process. As a first step one
can at least make sure the DOM used by XML_Objectify
is cleaned up right away by using the following variation:
py_obj = gnosis.xml.objectify.XML_Objectify('listing1.xml').make_instance()
As soon as the instance is created, and the interpreter leaves that line, the temporary XML DOM is reclaimed. So the DOM is temporarily in memory at the same time as the Python structure, but this is par for Python data bindings and certainly not unreasonable. If this is too heavyweight for you, gnosis.xml.objectify allows you to generate the binding from the streaming pyexpat interface rather than DOM. That approach is much faster and uses less memory, although you do lose some features if you choose it.
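To give a feel for what a streaming approach involves, here is a minimal sketch that pulls the name elements out of a cut-down version of listing 1 using the standard library's expat bindings, without ever building a DOM. It is written for Python 3 and is not the code gnosis.xml.objectify uses internally; the function name collect_names and the SAMPLE document are my own illustrations.

```python
import xml.parsers.expat

# Cut-down version of listing 1 (illustrative only).
SAMPLE = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20"><name>Thomas Eliot</name></label>
  <label added="2003-06-10"><name>Ezra Pound</name></label>
</labels>"""


def collect_names(xml_bytes):
    """Return the text of every name element without building a DOM."""
    names = []
    state = {"inside_name": False, "chunks": []}

    def start(tag, attrs):
        if tag == "name":
            state["inside_name"] = True
            state["chunks"] = []

    def chars(data):
        # expat may deliver a single run of text in several pieces
        if state["inside_name"]:
            state["chunks"].append(data)

    def end(tag):
        if tag == "name":
            names.append("".join(state["chunks"]))
            state["inside_name"] = False

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return names


print(collect_names(SAMPLE))
```

Only the text currently being accumulated is held in memory, which is where the savings come from. The price is that the handlers have to track their own context, which hints at why a binding built this way may have to give up some features.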
The resulting Python data structure consists of a set of classes
that are defined on the fly based on the XML structure. The root
element, labels, is represented by py_obj:
>>> print py_obj
<gnosis.xml.objectify._objectify._XO_labels instance at 0x824492c>
Child elements are represented by data members with names based on the XML element generic identifiers (GIs). Each such data member is a list of objects representing child elements. For example,
>>> print py_obj.label
[<gnosis.xml.objectify._objectify._XO_label instance at 0x8208f64>, <gnosis.xml.objectify._objectify._XO_label instance at 0x824355c>]
>>> print py_obj.label[0].name
<gnosis.xml.objectify._objectify._XO_name instance at 0x8136344>
Attributes are also accessed as data members given ordinary Python identifiers:
>>> print repr(py_obj.label[0].added)
u'2003-06-20'
For each element, content can be accessed using the
PCDATA member:
>>> print py_obj.label[0].name.PCDATA
Thomas Eliot
>>> print repr(py_obj.label[0].name.PCDATA)
u'Thomas Eliot'
Also notice that gnosis.xml.objectify does the right thing with content: it represents it as Python Unicode objects. (I did not check how it would handle elements that use Unicode -- or dashes for that matter -- in GIs, given Python's identifier name limitations.) This bodes well for the high character test; indeed, it handles the ellipsis character just fine:
>>> print repr(py_obj.label[0].quote.PCDATA)
u' is its own season\x85\n '
The quote element, however, has mixed content.
It appears that gnosis.xml.objectify only keeps the last chunk of
content in the mix by default, but not the rest. In particular,
the text before the
emph element, even though it's
only white space, doesn't seem directly accessible. The
emph element is handled conventionally:
>>> print repr(py_obj.label[0].quote.emph)
<gnosis.xml.objectify._objectify._XO_emph instance at 0x81f6fec>
>>> print repr(py_obj.label[0].quote.emph.PCDATA)
u'Midwinter Spring'
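The access patterns shown so far (child elements as data members named after their GIs, attributes as plain members, and text in a PCDATA member) can be imitated in a few lines. What follows is a rough Python 3 sketch built on the standard library's ElementTree, not the gnosis.xml.objectify implementation. The names Node and bind are hypothetical, and it departs from the observed behavior in two ways that I have marked in the comments: every child member is always a list, and PCDATA keeps all the direct text chunks rather than only the last.

```python
import xml.etree.ElementTree as ET

# Cut-down version of listing 1 (illustrative only).
SAMPLE = """<labels>
  <label added="2003-06-20">
    <quote><emph>Midwinter Spring</emph> is its own season</quote>
    <name>Thomas Eliot</name>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
  </label>
</labels>"""


class Node(object):
    """Hypothetical stand-in for a class generated on the fly."""


def bind(element):
    """Recursively turn an ElementTree element into a Node."""
    node = Node()
    # Attributes become ordinary data members.
    for key, value in element.attrib.items():
        setattr(node, key, value)
    # Assumption: keep every direct text chunk, in document order.
    chunks = [element.text or ""]
    for child in element:
        # Assumption: a child member is always a list, even for one child.
        siblings = getattr(node, child.tag, None)
        if siblings is None:
            siblings = []
            setattr(node, child.tag, siblings)
        siblings.append(bind(child))
        chunks.append(child.tail or "")
    node.PCDATA = "".join(chunks)
    return node


root = bind(ET.fromstring(SAMPLE))
first = root.label[0]
print(first.added)
print(first.name[0].PCDATA)
print(repr(first.quote[0].PCDATA))
print(first.quote[0].emph[0].PCDATA)
```

The sketch ignores name clashes between attributes and child elements, and GIs that are not valid Python identifiers can only be reached through getattr.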
The quote element I'm exploring also has a comment,
and gnosis.xml.objectify seems to offer experimental support for
comments. I say "experimental" because digging into the relevant
structure demonstrates very odd results:
>>> print py_obj.label[0].quote._comment
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print py_obj.label[0].quote._comment
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print py_obj.label[0].quote._comment
<gnosis.xml.objectify._objectify._XO__comment instance at 0x8242ddc>
>>> print dir(py_obj.label[0].quote._comment)
['__doc__', '__getitem__', '__len__', '__module__']
Of course, the documentation says that comments are ignored, so
I'd guess support is in development. The final thing to note
about this default behavior of gnosis.xml.objectify is that the
accumulation of various elements into Python lists means that the
actual order of child elements in an XML document is lost. For
example, the document in listing 2 would result in a root object
with a spam data member which is a list of two
elements and an
eggs data member which is a list of
one element, with no record of the fact that the eggs element
occurred between the two spam elements.
<monty>
  <spam/>
  <eggs/>
  <spam/>
</monty>
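The loss is easy to reproduce outside of gnosis.xml.objectify. This small Python 3 sketch uses the standard library's ElementTree to group the children of the monty element by GI, the same kind of accumulation described above, and compares the result with the original sequence. The variable names are my own.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<monty><spam/><eggs/><spam/></monty>")

# The sequence of children as it appears in the document.
document_order = [child.tag for child in root]

# Accumulate children into one list per GI, as a list-based binding does.
grouped = {}
for child in root:
    grouped.setdefault(child.tag, []).append(child)
grouped_counts = {tag: len(items) for tag, items in grouped.items()}

print(document_order)   # ['spam', 'eggs', 'spam']
print(grouped_counts)   # {'spam': 2, 'eggs': 1}
```

Nothing in the grouped structure says where eggs sat relative to either spam, so the interleaving cannot be rebuilt from it alone.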
gnosis.xml.objectify does have a very nice feature that allows
you to recover a lot of the elided information. It keeps around
the raw markup of any object with mixed content in a special data
member, _XML:
>>> print repr(py_obj.label[0].quote._XML)
u'\n <!-- Mixed content -->\n <emph>Midwinter Spring</emph> is its own season\x85\n '
You can also tune gnosis.xml.objectify to not maintain this raw information or to maintain it for all elements. And there is much more to the flexibility of the package than such simple tuning.
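The idea of keeping raw markup only where it is needed can be sketched with the standard library's minidom, which preserves comments. This is Python 3 and only an approximation of the behavior described above, not Gnosis code. The helper names has_mixed_content and inner_markup are hypothetical, and the test for mixed content (child elements plus non-blank text) is my assumption about a reasonable definition.

```python
from xml.dom import minidom

# The quote element from listing 1, without its whitespace (illustrative).
SAMPLE = ("<quote><!-- Mixed content -->"
          "<emph>Midwinter Spring</emph> is its own season</quote>")


def has_mixed_content(element):
    """True if the element has both child elements and non-blank text."""
    has_elements = any(
        child.nodeType == child.ELEMENT_NODE for child in element.childNodes
    )
    has_text = any(
        child.nodeType == child.TEXT_NODE and child.data.strip()
        for child in element.childNodes
    )
    return has_elements and has_text


def inner_markup(element):
    """Serialize everything between the element's start and end tags."""
    return "".join(child.toxml() for child in element.childNodes)


quote = minidom.parseString(SAMPLE).documentElement
# Keep the raw markup only when the content is mixed.
raw = inner_markup(quote) if has_mixed_content(quote) else None
print(raw)
```

Because the serialized string keeps comments, text, and child elements in document order, it retains exactly the information that the list-based members drop.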
Customizing the binding
One of the key features of gnosis.xml.objectify is the ability to customize data bindings by substituting your classes for the autogenerated ones. For example, if I know that I will need the ability to compute initials from names in label entries, I might write a program such as listing 3:
Listing 3: Using a customized element class
import gnosis.xml.objectify

class specialized_name(gnosis.xml.objectify._XO_):
    def get_initials(self):
        #Yes this could be done in cuter fashion with reduce()
        #Going for clearer steps here
        parts = self.PCDATA.split()
        #Take the first letter of each part of the name
        initial_letters = [ part[0].upper() for part in parts ]
        initials = ". ".join(initial_letters)
        if initials: initials += "."
        return initials

#Exercise the binding by yielding the initials in the sample document

#associate specialized_name class with elements with GI of "name"
gnosis.xml.objectify._XO_name = specialized_name

#Now "objectify" as before
xml_obj = gnosis.xml.objectify.XML_Objectify('listing1.xml')
py_obj = xml_obj.make_instance()

#Test the specialized method
for label in py_obj.label:
    print "name:", label.name.PCDATA
    print "initials:", label.name.get_initials()
Assigning a class to the gnosis.xml.objectify attribute named _XO_ plus
a particular GI establishes a class object to be used in binding
corresponding elements, rather than the generated default.
Running listing 3, I get
$ python listing3.py
name: Thomas Eliot
initials: T. E.
name: Ezra Pound
initials: E. P.
There is a lot you can do with customizable bindings. You can add routines for reserializing to XML, more sophisticated transforms and queries, or even specialized persistence modules.
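The substitution mechanism boils down to looking up a class by GI and falling back to a default when nothing is registered. Here is a minimal Python 3 sketch of that idea using the standard library's ElementTree. It is not how gnosis.xml.objectify is implemented, and the names Bound, SpecializedName, CLASS_FOR_GI, and bind are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Cut-down version of listing 1 (illustrative only).
SAMPLE = """<labels>
  <label><name>Thomas Eliot</name></label>
  <label><name>Ezra Pound</name></label>
</labels>"""


class Bound(object):
    """Default class for elements with no registered specialization."""

    def __init__(self, element):
        self.PCDATA = element.text or ""


class SpecializedName(Bound):
    """Adds a convenience method to objects bound from name elements."""

    def get_initials(self):
        parts = self.PCDATA.split()
        return " ".join(part[0].upper() + "." for part in parts)


# Registry mapping a GI to the class used to bind such elements.
CLASS_FOR_GI = {"name": SpecializedName}


def bind(element):
    """Bind one element with its registered class, or the default."""
    cls = CLASS_FOR_GI.get(element.tag, Bound)
    return cls(element)


root = ET.fromstring(SAMPLE)
names = [bind(element) for element in root.iter("name")]
initials = [name.get_initials() for name in names]
print(initials)   # ['T. E.', 'E. P.']
```

An explicit dictionary makes the set of overrides easy to inspect, whereas module-level names of the _XO_ form keep the registration step down to a single assignment.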
Scratching my itch
So far I have looked at two Python data binding tools which represent the current state of the art. I listed some other tools in the last article, but I won't cover them just yet. In particular, XBind and Skyron look interesting, but they use specialized languages to drive the binding process. This is a reasonable approach, one which offers some potential advantages, including support for multiple programming languages. But I'm focusing on systems that are completely built around Python's dynamism.
Part of the reason why I still use DOM rather than Python bindings is that I'm accustomed to a lot of the other XML-processing tools that work closely with DOM right now: XPath, XPatterns, etc. And a lot of my XML usage has to do with the document flavor of XML, which doesn't really suit a lot of the current data bindings. I have long incubated ideas for a Python data binding library that would tend to suit my needs better. Setting the stage for this library has been one of my motives for taking a close look at the state of the art. In the next article I shall offer a preliminary examination of my effort, as well as a general discussion of what one might like in the ultimate Python data binding tool.
Since the last article, Mike Olson and I released 0.6 of wsdl4py, our simplistic library for WSDL document manipulation. The release is mainly based on Mark Bucciarelli's patches to support recent DOM libraries.