XML Data Bindings in Python
by Uche Ogbuji

Special Schema Needs

There is a class like label for each element defined in the schema. As you can see, this even extends to the name element, and therein lies a problem. name is a simple element with only string content, but the generated binding gives it its own class rather than making it a simple data member of label. Worse, if you follow the build method carefully, you'll see that it throws away the text content of the element upon unmarshalling. It turns out that generateDS.py is rather picky in its interpretation of WXS. The relevant snippet from Listing 2 is:

  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element ref="name"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="name" type="xs:string"/>  

This is a common practice in WXS: using a separate xs:element declaration for each element, even if it is of simple type. But this usage throws off generateDS.py, and in order to have name treated as a simple data member of the binding class you have to rewrite the schema:

  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element name="name" type="xs:string"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>  

This form is, according to WXS rules, strictly equivalent to the original. Listing 4 is a new version of the WXS that satisfies this preference of generateDS.py.

Listing 4: Adjusted WXS for data binding generation by generateDS.py
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
>
  <xs:element name="labels">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element name="name" type="xs:string"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="quote">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="emph" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="street" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="state" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>  
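
The rewrite that produces a schema like Listing 4 is mechanical, so it can be scripted. What follows is only a sketch of that idea in modern Python 3 using the standard library's ElementTree. It is not part of generateDS.py, the function name inline_simple_elements is hypothetical, and it assumes the schema binds the prefix xs to the WXS namespace and that only elements of built-in simple types should be inlined.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ELEMENT = "{%s}element" % XS


def inline_simple_elements(schema_text):
    """Turn references to global simple-type elements into local declarations.

    Assumes the schema binds the prefix "xs" to the WXS namespace, so that
    attribute values such as type="xs:string" stay valid after serialization.
    """
    ET.register_namespace("xs", XS)
    root = ET.fromstring(schema_text)

    # Collect and remove global declarations of the form
    # <xs:element name="..." type="xs:..."/> (built-in type, no children).
    simple = {}
    for decl in list(root):
        type_ = decl.get("type", "")
        if decl.tag == ELEMENT and type_.startswith("xs:") and len(decl) == 0:
            simple[decl.get("name")] = type_
            root.remove(decl)

    # Rewrite each ref="..." that pointed at one of those declarations.
    for elem in root.iter(ELEMENT):
        ref = elem.get("ref")
        if ref in simple:
            del elem.attrib["ref"]
            elem.set("name", ref)
            elem.set("type", simple[ref])

    return ET.tostring(root, encoding="unicode")
```

One caveat about this design: removing a global declaration means the element can no longer serve as a document root or be referenced from another schema, so a script like this suits only schemas where the simple elements are used purely as children.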

Listing 5 is a snippet from the new data binding. Notice the update to the handling of the name element.

Listing 5: A snippet from the updated data binding
class label:
    subclass = None
    def __init__(self, quote=None, name='', address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        showIndent(outfile, level)
        outfile.write('<name>%s</name>\n' % quote_xml(self.getName()))
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                name = ''
                for text_ in child.childNodes:
                    name += text_.nodeValue
                self.name = name
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label

class quote:
    subclass = None
    def __init__(self, emph=''):
        self.emph = emph
    def factory(*args):
        if quote.subclass:
            return apply(quote.subclass, args)
        else:
            return apply(quote, args)
    factory = staticmethod(factory)
    def getEmph(self): return self.emph
    def setEmph(self, emph): self.emph = emph
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<quote>\n')
        level += 1
        showIndent(outfile, level)
        outfile.write('<emph>%s</emph>\n' % quote_xml(self.getEmph()))
        level -= 1
        showIndent(outfile, level)
        outfile.write('</quote>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'emph':
                emph = ''
                for text_ in child.childNodes:
                    emph += text_.nodeValue
                self.emph = emph
# end class quote  

This is a pretty straightforward data binding result that wouldn't surprise, for example, a Java developer. Each complex type in the schema becomes a class, and simple types become simple properties with get/set methods (like JavaBeans). This might feel a bit unpythonic until you reflect that these binding classes are designed to be subclassed (note the factory convenience functions), and that the use of accessor functions allows classic method polymorphism. Of course, one could still argue that since the binding already uses Python 2.2, it could have taken advantage of the more Pythonic approaches to such polymorphism that new-style classes make available in that release. (For more on new-style classes, see Unifying types and classes in Python 2.2 by Guido van Rossum and What's New in Python 2.2 by A.M. Kuchling.)
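
To make the new-style alternative concrete, here is a minimal hand-written sketch. It is not what generateDS.py emits, and the class names Label and UpperCaseLabel are hypothetical. It uses the property built-in so that callers get plain attribute syntax while subclasses can still override the accessors.

```python
class Label(object):
    """Hand-written sketch of a binding class that exposes a property."""

    def __init__(self, name=''):
        # Assign through the property so subclass setters take effect.
        self.name = name

    def _get_name(self):
        return self._name

    def _set_name(self, value):
        self._name = value

    # The lambdas look the accessors up on the instance at call time,
    # so a subclass that overrides _get_name or _set_name is honored.
    name = property(lambda self: self._get_name(),
                    lambda self, value: self._set_name(value))


class UpperCaseLabel(Label):
    """Hypothetical subclass that normalizes names on assignment."""

    def _set_name(self, value):
        self._name = value.upper()


plain = Label('T. S. Eliot')
shouting = UpperCaseLabel('Ezra Pound')
```

With this arrangement, plain.name reads and writes like an ordinary attribute, while shouting.name goes through the overridden setter, which is the polymorphism the get/set methods were buying in the generated code.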

Look at the quote.build method. Again, careful examination will show that generateDS.py does not seem to handle mixed content. In particular it discards text that is not within the emph element: "is its own season...".
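
For comparison, here is one way a build step could retain mixed content, sketched in modern Python 3 with the same minidom API the generated code relies on. This is an illustration only, not a patch to generateDS.py: the function build_mixed is hypothetical, and the sample fragment is an assumption modeled on the text quoted above, with ASCII dots standing in for the ellipsis character.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def build_mixed(node):
    """Return node's content as an ordered list.

    Text is kept as plain strings; each child element becomes an
    (element name, text content) tuple.
    """
    content = []
    for child in node.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            content.append(child.nodeValue)
        elif child.nodeType == Node.ELEMENT_NODE:
            text = ''.join(t.nodeValue for t in child.childNodes
                           if t.nodeType == Node.TEXT_NODE)
            content.append((child.nodeName, text))
    return content


doc = parseString(
    '<quote><emph>Midwinter Spring</emph> is its own season...</quote>')
content = build_mixed(doc.documentElement)
```

Keeping an ordered list rather than one attribute per child element is what preserves the interleaving of text and markup. It also shows why the fix is not trivial for the generated classes, which store each child in its own named member.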

Listing 6 demonstrates usage of the data binding, a pretty straightforward matter.

Listing 6: Using the data binding
import sys
import labels

rootObject = labels.parse('listing1.xml')
print dir(rootObject)

eliot = rootObject.label[0]
name = eliot.name
street = eliot.address.street
print street

emphasized = eliot.quote.emph
print emphasized

pound = rootObject.label[1]

#Modify the XML through the data binding
pound.name = 'Ezra Loomis Pound'

#Marshal back a portion of the XML, as modified
pound.export(sys.stdout, 0)  

I also wanted to check the handling of non-ASCII characters, but the ellipsis character I'd placed in the quote element was discarded by the binding generation. I moved it into the emph element and this time when I tried parsing the instance I ended up with the infamous "UnicodeError: ASCII encoding error: ordinal not in range(128)". Examining the binding code, I think this might be more a problem with the marshalling and unmarshalling than with the binding implementation, so perhaps it would be easy to fix.
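
That error message is what Python 2 produces when a Unicode string containing non-ASCII characters is implicitly converted using the ASCII codec. Whether that happens in the generated build or export code is not established here. As a general remedy at the output boundary, the following sketch (modern Python 3, with a hypothetical helper export_emph that imitates the generated export style) escapes the text and encodes it explicitly as UTF-8 rather than relying on a default encoding.

```python
import codecs
import io
from xml.sax.saxutils import escape


def export_emph(outfile, text):
    """Write an emph element to a text stream, escaping markup characters."""
    outfile.write('<emph>%s</emph>\n' % escape(text))


# Wrap a byte stream so that every write is encoded as UTF-8 explicitly.
raw = io.BytesIO()
writer = codecs.getwriter('utf-8')(raw)
export_emph(writer, 'is its own season\u2026')
data = raw.getvalue()
```

The same wrapping idea applies to any byte-oriented output stream: the binding keeps working with Unicode text internally, and the choice of encoding is made once, where the bytes leave the program.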

Just the beginning

generateDS.py is a very nifty program and offers many of the hallmarks of a data binding. I pointed out a few shortcomings, not to knock the project, but because I think that rich bindings may be an area where Python, thanks to its dynamic qualities, can leapfrog the field in XML processing. In this column I shall continue to explore the issue, covering the remaining data binding projects and offering discussion on future directions.

Meanwhile, here's the usual brief on activity in the Python-XML landscape.

Dave Kuhlman, the developer behind generateDS.py, announced Python support for the REST (XML-over-HTTP) mode of Amazon Web Services. The package provides Python code for parsing and processing the Amazon Web Services XML documents. It also includes code for generating WXS from an XML instance document (not unlike the concept in Eric Van der Vlist's Examplotron). Kuhlman has been very busy working with XML, REST, and SWIG (a tool for binding Python and other languages to C code). Another nice resource is Kuhlman's unofficial SWIG-based Python binding of the libxml tree API (see my last article for a discussion of the official Python binding).

Fredrik Lundh has been busy working on ElementTree, which I covered recently. He announced 1.1 and 1.2 alpha 1. Changes include a new XML literal factory, a self-contained ElementTree module, use of ASCII as the default encoding, optimizations, and limited XPath support.

    

John Merrells pointed me to the Python API for Berkeley DB XML, part of Berkeley DB. In Merrells' words: "The Python API is basically the same as the C++ and Java APIs, in that they expose the functionality of the product."

See this post and thread for discussion of Tim Bray's comment: "The Python people also piped to say 'everything's just fine here' but then they always do, I really must learn that language". I suspect that Tim Bray might have been referring to comments by me, Paul Prescod and others on the XML-DEV mailing list. I think our point is that Python's dynamic nature makes the horrors of DOM and SAX easier to bear, and not that Python has anything radical to leapfrog them. I'm rather hoping this series on data bindings helps produce such a leap, though.


Comments on this article
  • Dave responds
    2003-09-10 08:56:25 Dave Kuhlman

    Thanks to Uche for his comments and suggestions on generateDS.py.


    Looks to me like Uche had the following comments:


    1. generateDS.py does not handle element declarations that are defined as simple types.


    Dave replies: I've fixed this one. A new version that handles
    element definitions that are simple types is at my Web site:


    http://www.rexx.com/~dkuhlman/generateDS-1.3b.tar.gz


    2. generateDS.py generates parsers that use minidom, whereas, for large XML input documents at least, it would be preferable if a SAX parser were used.


    Dave replies: I'm working on this one. You are right. It needs to be done. However, since all of my own use of generateDS.py has been with small documents, I probably won't be fixing it too quickly.


    3. generateDS.py generates code that does not handle mixed content.


    Dave replies: I really should fix this one, but probably will not. I actually do not see how generateDS.py could handle mixed content, at least not without radical revision. I admit, the generated data structures were not designed well enough to handle mixed content. And, it now seems bizarre to me that I've used generateDS.py as much as I have without discovering this problem. I suppose that's because I've always used it for XML documents that represent data structures rather than for documents that contain text that was (later) marked up.


    By the way, Uche did not mention that there is an option "-s" that can be used to instruct generateDS.py to generate a file containing subclasses of the data representation classes. (That he did not see this may be my fault. I noticed that the link from my main page refers to an older version. Now fixed.) These subclasses enable the user to inherit from the data representation classes and to add application specific behaviors. Using a sub-class file seems cleaner to me and also means that you can use a single super-class file (the data representation classes) along with multiple sub-class files to implement multiple tasks on the same XML document type.


    Thanks again, Uche, for your interesting comments. They have been helpful to me. And, I appreciate all the work you've done on PyXML.


    Dave


    Dave Kuhlman
    dkuhlman@rexx.com
    http://www.rexx.com/~dkuhlman


    • Dave responds
      2003-09-17 14:34:22 Dave Kuhlman


      > 2. generateDS.py generates parsers that use minidom, whereas, for large XML input documents at least, it would be preferable if a SAX parser were used.
      >
      > Dave replies: I'm working on this one. You are right. It needs to be done. However, since all of my own use of generateDS.py has been with small documents, I probably won't be fixing it too quickly.


      It's done. Now, generateDS.py also generates SAX parsers.


      A qualification: The generated SAX parser still builds a tree; it just does not build a DOM tree. So using generateDS.py parsers on *huge* documents is still a questionable thing to do.


      You can find the new version at:


      http://www.rexx.com/~dkuhlman/generateDS.html
      http://www.rexx.com/~dkuhlman/generateDS-1.4a.tar.gz


      Thanks to Uche for motivating me to fix this.


      Dave


  • Gnosis Utilities Work Great
    2003-06-15 18:58:42 Doug Tillman

    I recently used David Mertz's Gnosis utilities to generate XML from a Python object holding some maps. 1 line of code was all it took to generate the XML and the library also uses a deep copy flag to provide optimizations to cross reference repeating values rather than explicitly recreating them each time. I didn't put these libraries through any sort of rigorous testing but for ease of use they score big with me.

  • XML data binding only half done
    2003-06-12 11:58:08 Peter Herndon

    I've been investigating this area recently, as I need some sort of data binding tool for my current project. I've looked at coding DOM manipulation, and would like to avoid it as too inefficient and not as easy as it could be. I've considered SAX, and am thinking of it as my fall-back position if nothing else works out, as (relatively) painful as it might be.


    On top of that I've looked in David Mertz' xml_objectify, and while it looks like a great candidate for deserialization, I have noticed that I will be forced to code my own serializer. Again, not a problem in and of itself, but if someone else has already done the work and performed some real world testing, so much the better.


    XML in and of itself is not bad as a serialization format, but the tools, so far, seem to be lacking. I would much rather have a serialization format that is easily read by humans, yet most tools seem to want to submerge the meaningful data in a morass of type declarations and placement/order information.


    I'm currently looking into YAML (http://www.yaml.org) as an alternative, as the type information is embedded in the rather minimalist markup and structuring of the document. The resulting document is more easily understood by humans. The trade-off seems to be that the tools are not yet mature (Python's module seems to be in transition from the original developers to a new team), and of course, YAML is so unknown as yet that it is almost an anti-buzzword (PHB: "YAML, *what*?!?!?"), while XML is universally known and accepted at this point.


    Uche, thank you for pointing out Skyron, as I will look into their offering. I would consider using generateDS.py, but I'd rather not have to create a schema for my serialization format if I can avoid it, and I'd *much* rather use RELAX NG than WXS, if I cannot (hint, hint, Mr. Kuhlman!!).

    • XML data binding only half done, Pt II
      2003-06-12 16:33:42 Peter Herndon

      I've looked briefly at Skyron, and it doesn't really "fit my head". More specifically, I'd rather not have to learn some esoteric XML format for creating transformation recipes. In Python at least, it's just as easy to code it yourself.


      YAML is not without its own set of problems, but they look eminently solvable. For the project on which I'm currently working, I need an all-Python solution. My target platform is Windows, and I haven't the luxury of Visual Studio or Cygwin. So, of my two candidates, Syck and PyYAML, I must choose PyYAML. And not unwillingly, since it seems ridiculously easy to use so far. I have run into a limitation, though, in its limited support for Python types. Specifically, PyYAML can handle most Python 2.1 or earlier built-in types, and handles classic classes quite well, but it cannot currently handle new-style classes by default. Thankfully, my current project does not require my serializable classes be new-style, so I am safe.


      PyYAML does have a means of getting around this limitation, though I have not yet fully explored its usage. If your class defines a to_yaml() method, serializing calls this method instead of the underlying type-munging methods in yaml.Dumper(). to_yaml() should return a tuple consisting of the data you wish to serialize and a string defining the type of your class. Thus if your new-style class simply inherits from object, you can return (self.__dict__, " !!classname\n") and everything should be fine. N.B.: I have not thoroughly tested this solution, as I have not yet begun to deserialize these results, nor have I made sure that the output is valid YAML. Still, this solution provides a first step.