XML Data Bindings in Python

June 11, 2003

Uche Ogbuji

In a recent interview, "What's Wrong with XML APIs", Elliotte Rusty Harold offers a familiar classification of XML APIs:

  1. Push APIs (e.g. SAX)
  2. Pull APIs (e.g. Python's pulldom)
  3. Tree-based APIs (e.g. DOM)
  4. Data binding APIs (e.g. PyXML marshalling tools)
  5. Query APIs (e.g. using 4XPath directly from Python)

In the XML community of late there has been a lot of talk that there are no really easy and efficient ways of doing general XML programming. Push processing has the usual rap of being too difficult. It is easy to dismiss this as a problem for amateur programmers who have not properly learned how to code state machines; but let's face it, state machines are hard to code by hand, and the community has been slow to develop more declarative and friendly tools for developing SAX processing stubs, along the lines of the LEX and YACC tools for generating parser state machines. As frequent Python-XML contributor Tom Passin puts it in a recent XML-DEV posting, with push processing, the more context one has to keep track of between callbacks, the harder the code is to write and maintain.
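To make the point about context concrete, here is a minimal sketch of a SAX handler written against the xml.sax module in today's Python standard library. The sample document and the CityByNameHandler class are hypothetical, invented for illustration; the element names are borrowed from the label format used later in this article. Even for this tiny task the handler has to carry an element stack, a text buffer, and a remembered name from one callback to the next.

```python
import xml.sax

# Hypothetical sample input, modeled on the label format used later on.
SAMPLE = b"""<labels>
  <label><name>Thomas Eliot</name><address><city>Stamford</city></address></label>
  <label><name>Ezra Pound</name><address><city>Hailey</city></address></label>
</labels>"""


class CityByNameHandler(xml.sax.ContentHandler):
    """Collect a name -> city mapping, tracking all context by hand."""

    def __init__(self):
        super().__init__()
        self.path = []            # stack of currently open element names
        self.buffer = []          # text chunks seen since the last start tag
        self.current_name = None  # name remembered until its city arrives
        self.cities = {}

    def startElement(self, tag, attrs):
        self.path.append(tag)
        self.buffer = []

    def characters(self, content):
        # SAX may deliver one run of text in several chunks, so accumulate.
        self.buffer.append(content)

    def endElement(self, tag):
        text = "".join(self.buffer).strip()
        if self.path[-2:] == ["label", "name"]:
            self.current_name = text
        elif self.path[-3:] == ["label", "address", "city"]:
            self.cities[self.current_name] = text
        self.path.pop()
        self.buffer = []


handler = CityByNameHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.cities)
```

Every additional piece of information the application needs tends to add another instance variable and another branch in endElement, which is the maintenance burden the posting describes.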

Pull processing has strong adherents, but there are also many, including me, who don't see that it really buys all that much simplicity. Tree APIs are easier to code, but less efficient as documents become larger because they generally require the entire document to be in memory. Query APIs take a step toward bridging XML and programming languages, which is a step toward making life easier for developers. Data bindings are a further step toward this goal and the focus of this article and others to come.
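As a rough illustration of the pull and tree styles side by side, the following sketch uses xml.dom.pulldom from the current Python standard library. The sample document is hypothetical. The loop pulls events one at a time, then calls expandNode to turn just the current label into a small DOM subtree, so that only one record needs to be in memory as a tree at any moment.

```python
from xml.dom import pulldom

# Hypothetical sample input, modeled on the label format used later on.
SAMPLE = """<labels>
  <label><name>Thomas Eliot</name></label>
  <label><name>Ezra Pound</name></label>
</labels>"""

names = []
events = pulldom.parseString(SAMPLE)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "label":
        # Pull style up to here; tree style for this one subtree only.
        events.expandNode(node)
        name_element = node.getElementsByTagName("name")[0]
        names.append(name_element.firstChild.data)

print(names)
```

Whether this is really simpler than the equivalent SAX handler is a matter of taste, which is more or less the disagreement described above.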

The State of Python Data Bindings

A data binding is any system for viewing XML documents as databases or programming language data structures, and vice versa. There are several aspects, including:

  1. marshalling -- serializing program data constructs to XML
  2. unmarshalling -- creating program data constructs from XML
  3. schema-directed binding -- using XML schema languages (DTD, WXS, RELAX NG, etc.) to provide hints and intended data constructs to marshalling and unmarshalling systems
  4. query-directed binding -- using XML-specific query languages such as XPath to provide hints to marshalling and unmarshalling systems
  5. process bindings -- mapping program or DBMS actions designed to process particular data structure patterns covered by marshalling and unmarshalling
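The first two aspects can be sketched in a few lines. This is a deliberately naive illustration and not the approach of any tool discussed here: it assumes a flat dictionary of strings, and it uses xml.etree.ElementTree as shipped in the current Python standard library. The marshal and unmarshal function names are made up for the example.

```python
import xml.etree.ElementTree as ET


def marshal(record, tag="address"):
    """Serialize a flat dict of strings to an XML string (aspect 1)."""
    root = ET.Element(tag)
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def unmarshal(xml_text):
    """Rebuild the flat dict from the XML string (aspect 2)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text or "" for child in root}


address = {"street": "3 Prufrock Lane", "city": "Stamford", "state": "CT"}
xml_text = marshal(address)
round_tripped = unmarshal(xml_text)
print(xml_text)
print(round_tripped == address)
```

The remaining aspects are about steering this mapping: a schema or a set of queries tells the binding which elements become nested objects, lists, or simple members, instead of leaving every choice hard-coded as it is here.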

All of these aspects are available to some extent in Python, but unfortunately, the coverage is spotty. In the following list, the numbers refer to which aspects of data binding from the preceding list are offered by each tool.

Generic and WDDX marshalling in PyXML (1)(2)
I covered these marshalling/unmarshalling tools in the earlier article Introducing PyXML.
generateDS.py (1)(2)(3)
A tool for generating Python data structures from XML Schema.
xml_pickle and xml_objectify.py from the Gnosis XML Utilities (1)(2)
Tools for generic and specialized marshalling and unmarshalling.
XBind (1)(2)
An XML vocabulary for specifying language-independent data bindings; includes a prototype Python implementation.
Skyron (1)(2)(5)
Uses recipes encoded in XML to bind XML data to handler code in Python. Typical usage is to create a specialized Python data structure from particular XML data patterns.

generateDS.py

In future articles I'll survey all these packages, starting in this article with generateDS.py, which I downloaded (generateDS-1.2a.tar.gz), unpacked and installed using python setup.py install. The sample file for exercising the binding is in listing 1.

Listing 1: Example file for Python data binding comparison
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label>
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season&#133;
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label>
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>

This example demonstrates a few things: an XML character reference outside the ASCII range (to test proper character support), a bit of the data flavor of XML with repeated, structured records, and a bit of the document flavor with mixed content in the quote element. The document flavor can be reinforced a bit if one treats the order of labels as important; likewise, the data flavor is reinforced if the order is considered unimportant. See this excellent discussion by Python-XML stalwart Paul Prescod for a nice contrast between data and document nuances of XML usage. Namespaces are another area of consideration, but to save space I do not cover them in this discussion of data bindings.

generateDS.py operates on a WXS definition for the XML format. See listing 2 for the WXS description of the format used in listing 1.

Listing 2: WXS schema for XML format in listing 1
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
>
  <xs:element name="labels">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element ref="name"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="quote">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="emph"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="emph" type="xs:string"/>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="street"/>
        <xs:element ref="city"/>
        <xs:element ref="state"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="street" type="xs:string"/>
  <xs:element name="city" type="xs:string"/>
  <xs:element name="state" type="xs:string"/>
</xs:schema>

generateDS.py requires PyXML, and I used the most recent CVS version. It seems to require Python 2.2, as it uses static methods. I used Python 2.2.2 and ran it against the WXS as follows:

python generateDS.py -o labels.py listing2.xsd

generateDS.py generates Python files with the data binding derived from the schema. The -o option gives the location of the file containing data structures derived from the schema. This is the heart of the data binding. The output file labels.py is too large to paste in its entirety, but listing 3 is a snippet to give you a feel for the output:

Listing 3: A snippet from the data binding generated by generateDS.py.
class label:
    subclass = None
    def __init__(self, quote=None, name=None, address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        if self.name:
            self.name.export(outfile, level)
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                obj = name.factory()
                obj.build(child)
                self.setName(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label

# SNIP

class name:
    subclass = None
    def __init__(self):
        pass
    def factory(*args):
        if name.subclass:
            return apply(name.subclass, args)
        else:
            return apply(name, args)
    factory = staticmethod(factory)
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<name>\n')
        level += 1
        level -= 1
        showIndent(outfile, level)
        outfile.write('</name>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            pass
# end class name

The label class has, among other things, facilities for marshalling and unmarshalling. The build method allows instances of the class to be built from a DOM, and this appears to be the only supplied method of binding from instances. This is what one might expect, since it's the easiest and most convenient way to write a data binding. It does mean that memory footprint could become a problem as the DOM contents are duplicated in the resulting data structures. Given that the DOM might become unnecessary once the data structures are complete, there seems to be some room for optimization. The export method marshals the object back to XML.
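The build-then-discard pattern hinted at above can be sketched as follows. The Address class here is a hand-written stand-in that only imitates the build/export shape of the generated code; it is not output from generateDS.py, and it is written for the current Python standard library (xml.dom.minidom) instead of the PyXML DOM used in this article. Once build has copied the strings out of the DOM, the sketch calls unlink so that the tree can be reclaimed.

```python
import io
from xml.dom import minidom, Node
from xml.sax.saxutils import escape

FIELDS = ("street", "city", "state")


class Address:
    """Hand-written stand-in imitating the generated build/export shape."""

    def __init__(self, street="", city="", state=""):
        self.street = street
        self.city = city
        self.state = state

    def build(self, node_):
        # Unmarshal: copy the text of each known child element into a member.
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and child.nodeName in FIELDS:
                text = "".join(
                    t.nodeValue for t in child.childNodes
                    if t.nodeType == Node.TEXT_NODE
                )
                setattr(self, child.nodeName, text)

    def export(self, outfile, level):
        # Marshal: write the members back out as indented XML.
        indent = "    " * level
        outfile.write("%s<address>\n" % indent)
        for field in FIELDS:
            value = escape(getattr(self, field))
            outfile.write("%s    <%s>%s</%s>\n" % (indent, field, value, field))
        outfile.write("%s</address>\n" % indent)


doc = minidom.parseString(
    "<address><street>45 Usura Place</street>"
    "<city>Hailey</city><state>ID</state></address>"
)
address = Address()
address.build(doc.documentElement)
doc.unlink()  # the DOM is no longer needed once the object is built

out = io.StringIO()
address.export(out, 0)
print(out.getvalue())
```

This does not remove the peak memory cost, since the whole DOM still exists while build runs; it only avoids keeping both copies alive afterwards.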

Special Schema Needs

There is a class like label for each element defined in the schema. As you can see, this even extends to the name element, and therein lies a problem. name is a simple element with only string content, but in the generated binding it is given its own class rather than being made a simple data member of label. Even worse, if you follow the build method carefully, you'll see that it throws away the text content of the element upon unmarshalling. It turns out generateDS.py is rather picky in its interpretation of WXS. The relevant snippet from listing 2 is:

  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element ref="name"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="name" type="xs:string"/>

This is a common practice in WXS: using a separate xs:element declaration for each element, even if it is of simple type. But this usage throws off generateDS.py, and in order to have name treated as a simple data member of the binding class you have to rewrite the schema:

  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element name="name" type="xs:string"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Which, according to WXS rules, is strictly equivalent to the original form. Listing 4 is a new version of the WXS to satisfy this preference of generateDS.py.

Listing 4: Adjusted WXS for data binding generation by generateDS.py
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
>
  <xs:element name="labels">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element name="name" type="xs:string"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="quote">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="emph" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="street" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="state" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Listing 5 is a snippet from the new data binding. Notice the update to the handling of the name element.

Listing 5: A snippet from the updated data binding
class label:
    subclass = None
    def __init__(self, quote=None, name='', address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        showIndent(outfile, level)
        outfile.write('<name>%s</name>\n' % quote_xml(self.getName()))
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                name = ''
                for text_ in child.childNodes:
                    name += text_.nodeValue
                self.name = name
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label

class quote:
    subclass = None
    def __init__(self, emph=''):
        self.emph = emph
    def factory(*args):
        if quote.subclass:
            return apply(quote.subclass, args)
        else:
            return apply(quote, args)
    factory = staticmethod(factory)
    def getEmph(self): return self.emph
    def setEmph(self, emph): self.emph = emph
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<quote>\n')
        level += 1
        showIndent(outfile, level)
        outfile.write('<emph>%s</emph>\n' % quote_xml(self.getEmph()))
        level -= 1
        showIndent(outfile, level)
        outfile.write('</quote>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'emph':
                emph = ''
                for text_ in child.childNodes:
                    emph += text_.nodeValue
                self.emph = emph
# end class quote

Now this is a pretty straightforward data binding result that, for example, wouldn't surprise a Java developer. Each complex type in the schema becomes a class, and simple types become simple properties with get/set methods (like JavaBeans). This might feel a bit unpythonic until you reflect that these binding classes are designed to be subclassed (note the factory convenience functions), and the use of accessor functions allows classic method polymorphism. Of course, one could still argue that since the binding already uses Python 2.2, it could have taken advantage of the more Pythonic approaches to such polymorphism available with new style classes in Python 2.2. (For more on new style classes, see Unifying types and classes in Python 2.2 by Guido van Rossum and What's New in Python 2.2 by A.M. Kuchling.)
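Here is a small sketch of both ideas: the subclass hook behind factory, and a property in place of the get/set pair. It is hand-written for illustration and is not generateDS.py output. The Label and UpperCaseLabel classes are hypothetical, and the decorator syntax is that of present-day Python; Python 2.2 offered the property() built-in but not this spelling.

```python
class Label:
    subclass = None  # hook: assign a subclass to change what factory() builds

    def __init__(self, name=""):
        self.name = name  # goes through the property setter below

    @staticmethod
    def factory(*args):
        # Same idea as the generated factory, without the obsolete apply().
        cls = Label.subclass or Label
        return cls(*args)

    # A property replaces the explicit getName/setName pair.
    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        self._name = value


class UpperCaseLabel(Label):
    """Hypothetical subclass that normalizes names on assignment."""

    @Label.name.setter
    def name(self, value):
        self._name = value.upper()


# Code that builds objects through the factory now gets the subclass.
Label.subclass = UpperCaseLabel
obj = Label.factory("Ezra Pound")
print(type(obj).__name__, obj.name)
```

Callers keep writing plain attribute access (obj.name), and the polymorphism that the accessor methods were meant to provide is still available to subclasses.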

Look at the quote.build method. Again, careful examination will show that generateDS.py does not seem to handle mixed content. In particular it discards text that is not within the emph element: "is its own season...".
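The effect is easy to reproduce outside the binding. The sketch below uses xml.dom.minidom from the current Python standard library on a cut-down quote element (the character reference is left out to keep the example about mixed content only). The first walk looks only at element children, as the generated quote.build does; the second is one possible way, assumed here for illustration, to keep the text nodes in document order.

```python
from xml.dom import minidom, Node

QUOTE = "<quote><emph>Midwinter Spring</emph> is its own season</quote>"
quote_node = minidom.parseString(QUOTE).documentElement

# Element-only walk: the trailing text node is never visited.
element_only = [
    child.firstChild.nodeValue
    for child in quote_node.childNodes
    if child.nodeType == Node.ELEMENT_NODE
]

# Mixed-content walk: keep text nodes and elements, in document order.
mixed = []
for child in quote_node.childNodes:
    if child.nodeType == Node.TEXT_NODE:
        mixed.append(("text", child.nodeValue))
    elif child.nodeType == Node.ELEMENT_NODE:
        mixed.append((child.nodeName, child.firstChild.nodeValue))

print(element_only)
print(mixed)
```

A binding that wanted to round-trip mixed content would need some ordered structure like the second list, since a single emph member cannot say where the surrounding text belongs.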

Listing 6 demonstrates usage of the data binding, a pretty straightforward matter.

Listing 6: Using the generated data binding
import sys
import labels

rootObject = labels.parse('listing1.xml')
print dir(rootObject)

eliot = rootObject.label[0]
name = eliot.name
street = eliot.address.street
print street

emphasized = eliot.quote.emph
print emphasized

pound = rootObject.label[1]

#Modify the XML through the data binding
pound.name = 'Ezra Loomis Pound'

#Marshal back a portion of the XML, as modified
pound.export(sys.stdout, 0)

I also wanted to check the handling of non-ASCII characters, but the ellipsis character I'd placed in the quote element was discarded by the binding generation. I moved it into the emph element and this time when I tried parsing the instance I ended up with the infamous "UnicodeError: ASCII encoding error: ordinal not in range(128)". Examining the binding code, I think this might be more a problem with the marshalling and unmarshalling than with the binding implementation, so perhaps it would be easy to fix.
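I have not traced the exact cause in the generated code, so the following is only a sketch of the general class of problem, written for present-day Python 3, where the failure has to be provoked explicitly. It assumes the ellipsis has been decoded to the Unicode character U+2026. Forcing such text through the ASCII codec fails, while choosing an encoding that can represent it, or falling back to character references, does not.

```python
text = "Midwinter Spring\u2026"  # assumption: ellipsis decoded as U+2026

# Forcing non-ASCII text through the ASCII codec raises an error.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Option 1: pick an output encoding that can represent the character.
utf8_bytes = text.encode("utf-8")

# Option 2: stay in ASCII, but emit an XML character reference.
as_char_ref = text.encode("ascii", "xmlcharrefreplace")

print(ascii_failed)
print(utf8_bytes)
print(as_char_ref)
```

In Python 2.2 the same conversion could happen implicitly, for example when a Unicode string was written to a byte-oriented file, which is why the error tended to surface far from the code responsible for it.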

Just the beginning

generateDS.py is a very nifty program and offers many of the hallmarks of a data binding. I did point out a few shortcomings, not to knock the project, but because I think that rich bindings may be an area where Python can leapfrog the field in XML processing because of its dynamic qualities. In this column I shall continue to explore the issue, covering the remaining data binding projects and discussing future directions.

Meanwhile, here's the usual brief on activity in the Python-XML landscape.

Dave Kuhlman, the developer behind generateDS.py, announced code for Python support for the REST (XML-over-HTTP) mode of Amazon Web services. The package provides Python code for parsing and processing the Amazon Web Services XML documents. It also includes code for generating WXS from an XML instance document (not unlike the concept in Eric Van der Vlist's Examplotron). Kuhlman has been very busy working with XML, REST, and SWIG (a tool for binding Python and other languages to C code). Another nice resource is Kuhlman's unofficial SWIG-based Python binding of the libxml tree API (see my last article for a discussion of the official Python binding).

Fredrik Lundh has been busy working on ElementTree, which I covered recently. He announced 1.1 and 1.2 alpha 1. Changes include a new XML literal factory, a self-contained ElementTree module, use of ASCII as the default encoding, optimizations, and limited XPath support.

    

John Merrells pointed me to the Python API for Berkeley DB XML, part of Berkeley DB. In Merrells' words: "The Python API is basically the same as the C++ and Java APIs, in that they expose the functionality of the product."

See this post and thread for discussion of Tim Bray's comment: "The Python people also piped to say 'everything's just fine here' but then they always do, I really must learn that language". I suspect that Tim Bray might have been referring to comments by me, Paul Prescod and others on the XML-DEV mailing list. I think our point is that Python's dynamic nature makes the horrors of DOM and SAX easier to bear, and not that Python has anything radical to leapfrog them. I'm rather hoping this series on data bindings helps produce such a leap, though.