XML Data Bindings in Python
In a recent interview, "What's Wrong with XML APIs", Elliotte Rusty Harold offers a familiar classification of XML APIs:
In the XML community of late there has been a lot of talk that there are no really easy and efficient approaches to general XML programming. Push processing has the usual rap of being too difficult. It is easy to dismiss this as a problem for amateur programmers who have not properly learned how to code state machines; but let's face it, state machines are hard to code by hand, and the community has been slow to develop more declarative and friendly tools for generating SAX processing stubs, comparable to what LEX and YACC do for parser state machines. As frequent Python-XML contributor Tom Passin puts it in a recent XML-DEV posting, with push processing, the more context one has to keep track of between callbacks, the harder the code is to write and maintain.
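To make the point about context concrete, here is a minimal sketch of a SAX handler written for a modern Python 3 interpreter (an assumption; the tools discussed in this article target Python 2.2). It pairs each name with its city from a small labels document like the sample used later in this article. The class name CityByNameHandler and its attributes are hypothetical, chosen only for illustration. Notice how much of the code exists purely to remember where the parser is.

```python
import xml.sax
from xml.sax.handler import ContentHandler

LABELS_XML = b"""<labels>
  <label>
    <name>Thomas Eliot</name>
    <address><street>3 Prufrock Lane</street><city>Stamford</city><state>CT</state></address>
  </label>
  <label>
    <name>Ezra Pound</name>
    <address><street>45 Usura Place</street><city>Hailey</city><state>ID</state></address>
  </label>
</labels>"""


class CityByNameHandler(ContentHandler):
    """Collect (name, city) pairs; all context lives in instance state."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._stack = []           # names of open elements, innermost last
        self._text = []            # character chunks for the current leaf element
        self._current_name = None  # name seen so far in the current label

    def startElement(self, tag, attrs):
        self._stack.append(tag)
        self._text = []
        if tag == "label":
            self._current_name = None  # reset per-record state

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        self._text.append(content)

    def endElement(self, tag):
        text = "".join(self._text).strip()
        if tag == "name":
            self._current_name = text
        elif tag == "city" and self._stack[-2:] == ["address", "city"]:
            self.pairs.append((self._current_name, text))
        self._stack.pop()
        self._text = []


handler = CityByNameHandler()
xml.sax.parseString(LABELS_XML, handler)
print(handler.pairs)
```

Three of the four instance attributes exist only to carry state from one callback to the next, and every additional relationship between elements would add more of them.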
Pull processing has strong adherents, but there are also many, including me, who don't see that it really buys all that much simplicity. Tree APIs are easier to code, but less efficient as documents become larger because they generally require the entire document to be in memory. Query APIs take a step toward bridging XML and programming languages, which is a step toward making life easier for developers. Data bindings are a further step toward this goal and the focus of this article and others to come.
A data binding is any system for viewing XML documents as databases or programming language data structures, and vice versa. There are several aspects, including:
All of these aspects are available to some extent in Python, but unfortunately, the coverage is spotty. In the following list, the numbers refer to which aspects of data binding from the preceding list are offered by each tool.
I'll survey all these packages over the coming articles, starting in
this one with generateDS.py. I downloaded generateDS-1.2a.tar.gz,
unpacked it, and installed it using python setup.py install. The sample
file for exercising the binding is in listing 1.
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
<label>
<quote>
<!-- Mixed content -->
<emph>Midwinter Spring</emph> is its own season…
</quote>
<name>Thomas Eliot</name>
<address>
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<state>CT</state>
</address>
</label>
<label>
<name>Ezra Pound</name>
<address>
<street>45 Usura Place</street>
<city>Hailey</city>
<state>ID</state>
</address>
</label>
</labels>
This example demonstrates a few things: an XML character entity
outside the ASCII range (to test proper character support), a bit of the
data flavor of XML with repeated, structured records, and a bit of the
document flavor with mixed content in the quote element.
The document flavor can be reinforced a bit if one treats the order of
labels as important; likewise, the data flavor is reinforced if the
order is considered unimportant. See this
excellent discussion by Python-XML stalwart Paul Prescod for a nice
contrast between data and document nuances of XML usage. Namespaces are
another area of consideration, but to save space I do not cover them in
this discussion of data bindings. generateDS.py operates on a W3C XML
Schema (WXS) definition for the XML format. See listing 2 for the WXS
description of the format used in listing 1.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
>
<xs:element name="labels">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="label">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" ref="quote"/>
<xs:element ref="name"/>
<xs:element ref="address"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="quote">
<xs:complexType mixed="true">
<xs:sequence>
<xs:element ref="emph"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="emph" type="xs:string"/>
<xs:element name="name" type="xs:string"/>
<xs:element name="address">
<xs:complexType>
<xs:sequence>
<xs:element ref="street"/>
<xs:element ref="city"/>
<xs:element ref="state"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="street" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="state" type="xs:string"/>
</xs:schema>
generateDS.py requires PyXML, and I used the most recent CVS version. It seems to require Python 2.2 or later, as it uses static methods. I used Python 2.2.2 and ran it against the WXS as follows:
python generateDS.py -o labels.py listing2.xsd
generateDS.py generates Python files with the data binding derived
from the schema. The -o option gives the location of the
file containing data structures derived from the schema. This is the
heart of the data binding. The output file labels.py is
too large to paste in its entirety, but listing 3 is a snippet to give
you a feel for the output:
class label:
    subclass = None
    def __init__(self, quote=None, name=None, address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        if self.name:
            self.name.export(outfile, level)
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                obj = name.factory()
                obj.build(child)
                self.setName(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label
# SNIP
class name:
    subclass = None
    def __init__(self):
        pass
    def factory(*args):
        if name.subclass:
            return apply(name.subclass, args)
        else:
            return apply(name, args)
    factory = staticmethod(factory)
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<name>\n')
        level += 1
        level -= 1
        showIndent(outfile, level)
        outfile.write('</name>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            pass
# end class name
The label class has, among other things, facilities for
marshalling and unmarshalling. The build method allows
instances of the class to be built from a DOM, and this appears to be the
only supplied method of binding from instances. This is what one might
expect, since it's the easiest and most convenient way to write a data
binding. It does mean that memory footprint could become a problem as the
DOM contents are duplicated in the resulting data structures. Given that
the DOM might become unnecessary once the data structures are complete,
there seems to be some room for optimization. The export
method marshals the object back to XML.
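The build-from-DOM and export pattern can be sketched by hand in a few lines. This is not generateDS.py output: it is a simplified stand-in written for Python 3 with the standard library's minidom, and the names Label and text_of are hypothetical. It also shows the optimization hinted at above, releasing the DOM once the binding objects have been built.

```python
import io
from xml.dom import Node, minidom
from xml.sax.saxutils import escape


def text_of(node):
    """Concatenate the text children of a DOM element."""
    return "".join(
        child.nodeValue
        for child in node.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


class Label:
    """Hand-written stand-in for a generated binding class (hypothetical)."""

    def __init__(self, name="", city=""):
        self.name = name
        self.city = city

    def build(self, node):
        # Unmarshal: copy what we need out of the DOM subtree.
        for child in node.childNodes:
            if child.nodeType != Node.ELEMENT_NODE:
                continue
            if child.nodeName == "name":
                self.name = text_of(child)
            elif child.nodeName == "address":
                for city in child.getElementsByTagName("city"):
                    self.city = text_of(city)
        return self

    def export(self, out, level=0):
        # Marshal: write the object back out as XML text.
        indent = "    " * level
        out.write("%s<label>\n" % indent)
        out.write("%s    <name>%s</name>\n" % (indent, escape(self.name)))
        out.write("%s    <address><city>%s</city></address>\n"
                  % (indent, escape(self.city)))
        out.write("%s</label>\n" % indent)


doc = minidom.parseString(
    "<labels><label><name>Ezra Pound</name>"
    "<address><city>Hailey</city></address></label></labels>"
)
labels = [Label().build(n) for n in doc.getElementsByTagName("label")]
doc.unlink()  # the DOM is no longer needed once the objects exist

buffer = io.StringIO()
labels[0].export(buffer)
print(buffer.getvalue())
```

Until unlink is called, both the DOM and the binding objects hold copies of the same data, which is the memory duplication described above.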
There is a class like label for each element defined in
the schema. As you can see, this even extends to the name
element, and therein lies a problem. name is a simple
element with only string content. But in the generated binding it is
given its own class, rather than being made a simple data member of
label. Even worse, if you follow the
build method carefully, you'll see that it throws away the
text content of the element upon unmarshalling. It turns out
generateDS.py is rather picky in its interpretation of WXS. The relevant
snippet from listing 2 is
<xs:element name="label">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" ref="quote"/>
<xs:element ref="name"/>
<xs:element ref="address"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="name" type="xs:string"/>
This is a common practice in WXS: using a separate
xs:element declaration for each element, even if it is of
simple type. But this usage throws off generateDS.py, and in order to
have name treated as a simple data member of the binding class you have
to rewrite the schema:
<xs:element name="label">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" ref="quote"/>
<xs:element name="name" type="xs:string"/>
<xs:element ref="address"/>
</xs:sequence>
</xs:complexType>
</xs:element>
According to WXS rules, this form is strictly equivalent to the original. Listing 4 is a new version of the WXS to satisfy this preference of generateDS.py.
Listing 4: Adjusted WXS for data binding generation by generateDS.py
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
>
<xs:element name="labels">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="label">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" ref="quote"/>
<xs:element name="name" type="xs:string"/>
<xs:element ref="address"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="quote">
<xs:complexType mixed="true">
<xs:sequence>
<xs:element name="emph" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="address">
<xs:complexType>
<xs:sequence>
<xs:element name="street" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="state" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Listing 5 is a snippet from the new data binding. Notice the update
to the handling of the name element.
class label:
    subclass = None
    def __init__(self, quote=None, name='', address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        showIndent(outfile, level)
        outfile.write('<name>%s</name>\n' % quote_xml(self.getName()))
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                name = ''
                for text_ in child.childNodes:
                    name += text_.nodeValue
                self.name = name
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label
class quote:
    subclass = None
    def __init__(self, emph=''):
        self.emph = emph
    def factory(*args):
        if quote.subclass:
            return apply(quote.subclass, args)
        else:
            return apply(quote, args)
    factory = staticmethod(factory)
    def getEmph(self): return self.emph
    def setEmph(self, emph): self.emph = emph
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<quote>\n')
        level += 1
        showIndent(outfile, level)
        outfile.write('<emph>%s</emph>\n' % quote_xml(self.getEmph()))
        level -= 1
        showIndent(outfile, level)
        outfile.write('</quote>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'emph':
                emph = ''
                for text_ in child.childNodes:
                    emph += text_.nodeValue
                self.emph = emph
# end class quote
Now this is a pretty straightforward data binding result that, for
example, wouldn't surprise a Java developer. Each complex type in the
schema becomes a class, and simple types become simple properties with
get/set methods (like JavaBeans). This might feel a bit unpythonic until
you reflect that these binding classes are designed to be subclassed (note
the factory convenience functions), and the use of accessor
functions allows classic method polymorphism. Of course, one could still
argue that since the binding already uses Python 2.2, it could have taken
advantage of the more Pythonic approaches to such polymorphism available
with new style classes in Python 2.2. (For more on new style classes, see
Unifying types and
classes in Python 2.2 by Guido van Rossum and What's New
in Python 2.2 by A.M. Kuchling.)
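As a rough illustration of that more Pythonic alternative, the sketch below uses property, which arrived with new style classes in Python 2.2 and still works in current Python. The classes Label and TidyLabel are hypothetical and are not what generateDS.py emits. Callers use plain attribute syntax, yet a subclass can still intercept assignment.

```python
class Label(object):
    """Hypothetical binding class using a property instead of get/set pairs."""

    def __init__(self, name=""):
        self._name = name

    def _get_name(self):
        return self._name

    def _set_name(self, value):
        self._name = value

    # property() is the Python 2.2 spelling; it still works today.
    name = property(_get_name, _set_name)


class TidyLabel(Label):
    """Subclass that normalizes whitespace whenever name is assigned."""

    def _set_name(self, value):
        self._name = " ".join(value.split())

    # Rebind the property so that it picks up the overriding setter.
    name = property(Label._get_name, _set_name)


label = TidyLabel()
label.name = "  Ezra   Loomis Pound "
print(label.name)
```

The trade-off is that a property binds to specific functions when the class is defined, so a subclass has to rebind it, whereas explicit getName and setName methods are overridden in the usual way.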
Look at the quote.build method. Again, careful
examination will show that generateDS.py does not seem to handle mixed
content. In particular it discards text that is not within the
emph element: "is its own season...".
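For comparison, one way a binding could retain mixed content is to keep an ordered list of text runs and child objects instead of a single emph member. The sketch below is an assumption about how that might look, not a description of generateDS.py; the classes Quote and Emph and the content attribute are hypothetical. It is written for Python 3 and uses minidom.

```python
from xml.dom import Node, minidom


class Emph:
    """Hypothetical binding for the emph element."""

    def __init__(self, text=""):
        self.text = text


class Quote:
    """Hypothetical binding that keeps text and child elements in order."""

    def __init__(self):
        self.content = []  # str for text runs, Emph for <emph> children

    def build(self, node):
        for child in node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                if child.nodeValue.strip():  # skip whitespace-only runs
                    self.content.append(child.nodeValue)
            elif (child.nodeType == Node.ELEMENT_NODE
                  and child.nodeName == "emph"):
                text = "".join(
                    t.nodeValue for t in child.childNodes
                    if t.nodeType == Node.TEXT_NODE
                )
                self.content.append(Emph(text))
            # Comments and other node types are ignored.
        return self


doc = minidom.parseString(
    "<quote><!-- Mixed content -->"
    "<emph>Midwinter Spring</emph> is its own season\u2026</quote>"
)
quote = Quote().build(doc.documentElement)
loose_text = "".join(part for part in quote.content if isinstance(part, str))
print(repr(loose_text))
```

With this layout the text outside emph survives unmarshalling, at the cost of a less convenient interface than a single emph attribute.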
Listing 6 demonstrates usage of the data binding, a pretty straightforward matter.
Listing 6:
import sys
import labels
rootObject = labels.parse('listing1.xml')
print dir(rootObject)
eliot = rootObject.label[0]
name = eliot.name
street = eliot.address.street
print street
emphasized = eliot.quote.emph
print emphasized
pound = rootObject.label[1]
# Modify the XML through the data binding
pound.name = 'Ezra Loomis Pound'
# Marshal back a portion of the XML, as modified
pound.export(sys.stdout, 0)
I also wanted to check the handling of non-ASCII characters, but the
ellipsis character I'd placed in the quote element was
discarded by the binding generation. I moved it into the
emph element and this time when I tried parsing the
instance I ended up with the infamous "UnicodeError: ASCII encoding
error: ordinal not in range(128)". Examining the binding code, I think
this might be more a problem with the marshalling and unmarshalling than
with the binding implementation, so perhaps it would be easy to fix.
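A fix along those lines might look like the following sketch: escape the text, then encode it explicitly before writing, instead of relying on an implicit ASCII conversion. This is written for Python 3 and is not taken from generateDS.py; the function export_emph is hypothetical, and escape from xml.sax.saxutils stands in for the quote_xml helper that the generated code calls.

```python
import io
from xml.sax.saxutils import escape


def export_emph(outfile, text, encoding="utf-8"):
    """Write an <emph> element to a binary stream in the given encoding."""
    markup = "<emph>%s</emph>\n" % escape(text)
    # xmlcharrefreplace turns unencodable characters into &#NNNN; references,
    # so even an ASCII target cannot raise UnicodeEncodeError.
    outfile.write(markup.encode(encoding, "xmlcharrefreplace"))


utf8_out = io.BytesIO()
export_emph(utf8_out, "Midwinter Spring\u2026")

ascii_out = io.BytesIO()
export_emph(ascii_out, "Midwinter Spring\u2026", encoding="ascii")
print(ascii_out.getvalue())
```

Either output parses back to the same character, so the choice of encoding becomes a matter for the caller and not a source of runtime errors.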
generateDS.py is a very nifty program and offers many of the hallmarks of a data binding. I did point out a few shortcomings, not to knock the project, but because I think that rich bindings may be an area where Python can leapfrog the field in XML processing because of its dynamic qualities. In this column I shall continue to explore the issue, covering the remaining data binding projects and discussing future directions.
Meanwhile, here's the usual brief on activity in the Python-XML landscape.
Dave Kuhlman, the developer behind generateDS.py, announced code for Python support for the REST (XML-over-HTTP) mode of Amazon Web Services. The package provides Python code for parsing and processing the Amazon Web Services XML documents. It also includes code for generating WXS from an XML instance document (not unlike the concept in Eric van der Vlist's Examplotron). Kuhlman has been very busy working with XML, REST, and SWIG (a tool for binding Python and other languages to C code). Another nice resource is Kuhlman's unofficial SWIG-based Python binding of the libxml tree API (see my last article for a discussion of the official Python binding).
Fredrik Lundh has been busy working on ElementTree, which I covered recently. He announced 1.1 and 1.2 alpha 1. Changes include a new XML literal factory, a self-contained ElementTree module, use of ASCII as the default encoding, optimizations, and limited XPath support.
John Merrells pointed me to the Python API for Berkeley DB XML, part of Berkeley DB. In Merrells' words: "The Python API is basically the same as the C++ and Java APIs, in that they expose the functionality of the product."
See this post and thread for discussion of Tim Bray's comment: "The Python people also piped to say 'everything's just fine here' but then they always do, I really must learn that language". I suspect that Tim Bray might have been referring to comments by me, Paul Prescod, and others on the XML-DEV mailing list. I think our point is that Python's dynamic nature makes the horrors of DOM and SAX easier to bear, not that Python has anything radical with which to leapfrog them. I'm rather hoping this series on data bindings helps produce such a leap, though.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.