XML Data Bindings in Python
In a recent interview, "What's Wrong with XML APIs", Elliotte Rusty Harold offers a familiar classification of XML APIs:
- Push APIs (e.g. SAX)
- Pull APIs (e.g. Python's pulldom)
- Tree-based APIs (e.g. DOM)
- Data binding APIs (e.g. PyXML marshalling tools)
- Query APIs (e.g. using 4XPath directly from Python)
Of late there has been a lot of talk in the XML community that there are no really easy and efficient ways of doing general XML programming. Push processing has the usual rap of being too difficult. It is easy to dismiss this as a problem for amateur programmers who have not properly learned how to code state machines; but let's face it, state machines are hard to code by hand, and the community has been slow to develop more declarative and friendly tools for generating SAX processing stubs, analogous to the LEX and YACC tools for generating parser state machines. As frequent Python-XML contributor Tom Passim puts it in a recent XML-DEV posting, with push processing, the more context one has to keep track of between callbacks, the harder the code is to write and maintain.
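To make the point about context concrete, here is a minimal sketch of a SAX handler that pulls name and city pairs out of a trimmed version of the labels format used later in this article. The names LabelHandler and read_labels are illustrative only, and the code assumes a current Python 3 standard library rather than the Python 2.2 setup described below. Note how the element stack, the text buffer, and the record under construction all have to be managed by hand between callbacks.

```python
import xml.sax

DOC = b"""<labels>
  <label>
    <name>Thomas Eliot</name>
    <address><street>3 Prufrock Lane</street><city>Stamford</city><state>CT</state></address>
  </label>
  <label>
    <name>Ezra Pound</name>
    <address><street>45 Usura Place</street><city>Hailey</city><state>ID</state></address>
  </label>
</labels>"""


class LabelHandler(xml.sax.ContentHandler):
    """Collects (name, city) pairs; every bit of context is tracked by hand."""

    def __init__(self):
        super().__init__()
        self.path = []       # stack of currently open element names
        self.text = []       # character data seen since the last start tag
        self.current = None  # the label record being assembled
        self.records = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if name == "label":
            self.current = {}

    def characters(self, content):
        # SAX may deliver a single run of text in several chunks
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path[-2:] == ["label", "name"]:
            self.current["name"] = value
        elif self.path[-3:] == ["label", "address", "city"]:
            self.current["city"] = value
        elif name == "label":
            self.records.append((self.current["name"], self.current["city"]))
            self.current = None
        self.path.pop()
        self.text = []


def read_labels(data):
    """Parse the document and return the collected (name, city) pairs."""
    handler = LabelHandler()
    xml.sax.parseString(data, handler)
    return handler.records


if __name__ == "__main__":
    print(read_labels(DOC))
```

Even for two fields, the handler needs three pieces of state; each additional field or nesting level adds more branches to endElement.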
Pull processing has strong adherents, but there are also many, including me, who don't see that it really buys all that much simplicity. Tree APIs are easier to code, but less efficient as documents become larger because they generally require the entire document to be in memory. Query APIs take a step toward bridging XML and programming languages, which is a step toward making life easier for developers. Data bindings are a further step toward this goal and the focus of this article and others to come.
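For comparison, the following sketch reads the same kind of data with pulldom, mixing the pull and tree styles: the event stream is pulled until a label element starts, and only that one record is expanded into a small DOM. The function names city_by_name and text_of are hypothetical, the sample document is a trimmed version of the labels format used later in this article, and the code assumes a current Python 3 standard library.

```python
from xml.dom import pulldom

DOC = """<labels>
  <label>
    <name>Thomas Eliot</name>
    <address><street>3 Prufrock Lane</street><city>Stamford</city><state>CT</state></address>
  </label>
  <label>
    <name>Ezra Pound</name>
    <address><street>45 Usura Place</street><city>Hailey</city><state>ID</state></address>
  </label>
</labels>"""


def text_of(node):
    """Concatenate the text nodes that are direct children of node."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def city_by_name(data):
    """Map each label's name to its city, one record at a time."""
    result = {}
    events = pulldom.parseString(data)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "label":
            # Build a DOM subtree for this record only, not the whole document
            events.expandNode(node)
            name = node.getElementsByTagName("name")[0]
            city = node.getElementsByTagName("city")[0]
            result[text_of(name)] = text_of(city)
    return result


if __name__ == "__main__":
    print(city_by_name(DOC))
```

Only one label subtree is built at a time, which is the usual argument for this hybrid; whether it is really simpler than the alternatives is a matter of taste.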
The State of Python Data Bindings
A data binding is any system for viewing XML documents as databases or programming language data structures, and vice versa. There are several aspects, including:
1. marshalling -- serializing program data constructs to XML
2. unmarshalling -- creating program data constructs from XML
3. schema-directed binding -- using XML schema languages (DTD, WXS, RELAX NG, etc.) to provide hints and intended data constructs to marshalling and unmarshalling systems
4. query-directed binding -- using XML-specific query languages such as XPath to provide hints to marshalling and unmarshalling systems
5. process bindings -- mapping program or DBMS actions designed to process particular data structure patterns covered by marshalling and unmarshalling
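The first two aspects can be illustrated with a very small sketch. The functions marshal and unmarshal below are hypothetical helpers written for this illustration, not the API of PyXML or any other tool discussed here; they assume a current Python 3 with ElementTree and handle only strings and nested dictionaries with unique keys.

```python
import xml.etree.ElementTree as ET


def marshal(tag, value):
    """Serialize a string, or a dict of strings and dicts, to an Element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(marshal(key, child))
    else:
        elem.text = str(value)
    return elem


def unmarshal(elem):
    """Rebuild the string/dict structure from an Element."""
    if len(elem):  # has child elements, so treat it as a record
        return {child.tag: unmarshal(child) for child in elem}
    return elem.text or ""


record = {
    "name": "Ezra Pound",
    "address": {"street": "45 Usura Place", "city": "Hailey", "state": "ID"},
}

# Marshalling: program data to XML text
xml_text = ET.tostring(marshal("label", record), encoding="unicode")

# Unmarshalling: XML text back to program data
round_trip = unmarshal(ET.fromstring(xml_text))

if __name__ == "__main__":
    print(xml_text)
    print(round_trip == record)
```

A real binding has to decide what to do with repeated elements, attributes, mixed content, and data types, which is where schema-directed and query-directed hints come in.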
All of these aspects are available to some extent in Python, but unfortunately, the coverage is spotty. In the following list, the numbers refer to which aspects of data binding from the preceding list are offered by each tool.
- Generic and WDDX marshalling in PyXML (1)(2)
  - I covered these marshalling/unmarshalling tools in the earlier article Introducing PyXML.
- generateDS.py (1)(2)(3)
  - A tool for generating Python data structures from XML Schema.
- xml_pickle and xml_objectify.py from the Gnosis XML Utilities (1)(2)
  - Tools for generic and specialized marshalling and unmarshalling.
- XBind (1)(2)
  - An XML vocabulary for specifying language-independent data bindings; includes a prototype Python implementation.
- Skyron (1)(2)(5)
  - Uses recipes encoded in XML to bind XML data to handler code in Python. Typical usage is to create a specialized Python data structure from particular XML data patterns.
In future articles I'll survey all these packages. I start in this article with generateDS.py, which I downloaded (generateDS-1.2a.tar.gz), unpacked, and installed using setup.py install. The sample file for exercising the binding is in listing 1.
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label>
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season…
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label>
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
This example demonstrates a few things: an XML character entity outside the ASCII range (to test proper character support), a bit of the data flavor of XML with repeated, structured records, and a bit of the document flavor with mixed content in the quote element. The document flavor can be reinforced a bit if one treats the order of labels as important; likewise, the data flavor is reinforced if the order is considered unimportant. See this excellent discussion by Python-XML stalwart Paul Prescod for a nice contrast between data and document nuances of XML usage. Namespaces are another area of consideration, but to save space I do not cover them in this discussion of data bindings.
generateDS.py operates on a WXS definition for the XML format. See listing 2 for the WXS description of the format used in listing 1.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  <xs:element name="labels">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element ref="name"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="quote">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="emph"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="emph" type="xs:string"/>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="street"/>
        <xs:element ref="city"/>
        <xs:element ref="state"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="street" type="xs:string"/>
  <xs:element name="city" type="xs:string"/>
  <xs:element name="state" type="xs:string"/>
</xs:schema>
generateDS.py requires PyXML, and I used the most recent CVS version. It seems to require Python 2.2 or later, as it uses static methods. I used Python 2.2.2 and ran it against the WXS as follows:
python generateDS.py -o labels.py listing2.xsd
generateDS.py generates Python files with the data binding derived from the schema. The -o option gives the location of the file containing data structures derived from the schema. This is the heart of the data binding. The output file, labels.py, is too large to reproduce in its entirety, but listing 3 is a snippet to give you a feel for the output:
class label:
    subclass = None
    def __init__(self, quote=None, name=None, address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        if self.name:
            self.name.export(outfile, level)
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                obj = name.factory()
                obj.build(child)
                self.setName(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label

# SNIP

class name:
    subclass = None
    def __init__(self):
        pass
    def factory(*args):
        if name.subclass:
            return apply(name.subclass, args)
        else:
            return apply(name, args)
    factory = staticmethod(factory)
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<name>\n')
        level += 1
        level -= 1
        showIndent(outfile, level)
        outfile.write('</name>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            pass
# end class name
The label class has, among other things, facilities for marshalling and unmarshalling. The build method allows instances of the class to be built from a DOM, and this appears to be the only supplied means of binding from XML instance documents. This is what one might expect, since building from a DOM is the easiest and most convenient way to write a data binding. It does mean that memory footprint could become a problem, as the DOM contents are duplicated in the resulting data structures. Given that the DOM might become unnecessary once the data structures are complete, there seems to be some room for optimization. The export method marshals the object back to XML.
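The build-then-discard idea can be sketched with a small hand-written class in the style of listing 3. This is not the code generateDS.py produces: the Address class and the load_address function are hypothetical, the class stores element text directly as strings, and the code assumes a current Python 3 with minidom. The point of interest is that the DOM is released with unlink() as soon as the object has been built, so only the bound data structure remains in memory.

```python
import io
from xml.dom import minidom
from xml.dom.minidom import Node
from xml.sax.saxutils import escape


class Address:
    """Hand-written stand-in for a generated binding class (illustrative)."""

    FIELDS = ("street", "city", "state")

    def __init__(self, street=None, city=None, state=None):
        self.street = street
        self.city = city
        self.state = state

    def build(self, node_):
        """Unmarshal: copy the text of known child elements out of the DOM."""
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and child.nodeName in self.FIELDS:
                text = "".join(
                    t.data for t in child.childNodes
                    if t.nodeType == Node.TEXT_NODE
                )
                setattr(self, child.nodeName, text)
        return self

    def export(self, outfile, level=0):
        """Marshal: write the object back out as indented XML."""
        indent = "    " * level
        outfile.write(f"{indent}<address>\n")
        for field in self.FIELDS:
            value = getattr(self, field)
            if value is not None:
                outfile.write(f"{indent}    <{field}>{escape(value)}</{field}>\n")
        outfile.write(f"{indent}</address>\n")


def load_address(xml_text):
    """Build an Address from XML text, then release the intermediate DOM."""
    doc = minidom.parseString(xml_text)
    try:
        return Address().build(doc.documentElement)
    finally:
        doc.unlink()  # the DOM is no longer needed once the object exists


if __name__ == "__main__":
    addr = load_address(
        "<address><street>45 Usura Place</street>"
        "<city>Hailey</city><state>ID</state></address>"
    )
    out = io.StringIO()
    addr.export(out)
    print(out.getvalue(), end="")
```

This still requires the whole DOM to exist at one point, so peak memory use is not reduced; building the objects from a SAX or pull event stream would be the next step if that matters.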