XML Data Bindings in Python
by Uche Ogbuji
Special Schema Needs
There is a class like label for each element defined in
the schema. As you can see, this extends even to the name
element, and therein lies a problem. name is a simple
element with only string content, but in the generated binding it is
given its own class rather than being made a simple data member of
label. Worse, if you follow the
build method carefully, you'll see that it throws away the
text content of the element upon unmarshalling. It turns out that
generateDS.py is rather picky in its interpretation of WXS. The relevant
snippet from listing 2 is:
<xs:element name="label">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" ref="quote"/>
      <xs:element ref="name"/>
      <xs:element ref="address"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
<xs:element name="name" type="xs:string"/>
This is a common practice in WXS: using a separate
xs:element declaration for each element, even if it is of
simple type. But this usage throws off generateDS.py, and in order to
have name treated as a simple data member of the binding class you have
to rewrite the schema:
<xs:element name="label">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" ref="quote"/>
      <xs:element ref="name" type="xs:string"/>
      <xs:element ref="address"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
According to WXS rules, this is strictly equivalent to the original form. Listing 4 is a new version of the WXS that satisfies this preference of generateDS.py.
Listing 4: Adjusted WXS for data binding generation by generateDS.py
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
>
  <xs:element name="labels">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="label"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="0" ref="quote"/>
        <xs:element ref="name" type="xs:string"/>
        <xs:element ref="address"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="quote">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="emph" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="street" type="xs:string"/>
        <xs:element ref="city" type="xs:string"/>
        <xs:element ref="state" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
Listing 5 is a snippet from the new data binding. Notice the update
to the handling of the name element.
class label:
    subclass = None
    def __init__(self, quote=None, name='', address=None):
        self.quote = quote
        self.name = name
        self.address = address
    def factory(*args):
        if label.subclass:
            return apply(label.subclass, args)
        else:
            return apply(label, args)
    factory = staticmethod(factory)
    def getQuote(self): return self.quote
    def setQuote(self, quote): self.quote = quote
    def getName(self): return self.name
    def setName(self, name): self.name = name
    def getAddress(self): return self.address
    def setAddress(self, address): self.address = address
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<label>\n')
        level += 1
        if self.quote:
            self.quote.export(outfile, level)
        showIndent(outfile, level)
        outfile.write('<name>%s</name>\n' % quote_xml(self.getName()))
        if self.address:
            self.address.export(outfile, level)
        level -= 1
        showIndent(outfile, level)
        outfile.write('</label>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'quote':
                obj = quote.factory()
                obj.build(child)
                self.setQuote(obj)
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'name':
                name = ''
                for text_ in child.childNodes:
                    name += text_.nodeValue
                self.name = name
            elif child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'address':
                obj = address.factory()
                obj.build(child)
                self.setAddress(obj)
# end class label

class quote:
    subclass = None
    def __init__(self, emph=''):
        self.emph = emph
    def factory(*args):
        if quote.subclass:
            return apply(quote.subclass, args)
        else:
            return apply(quote, args)
    factory = staticmethod(factory)
    def getEmph(self): return self.emph
    def setEmph(self, emph): self.emph = emph
    def export(self, outfile, level):
        showIndent(outfile, level)
        outfile.write('<quote>\n')
        level += 1
        showIndent(outfile, level)
        outfile.write('<emph>%s</emph>\n' % quote_xml(self.getEmph()))
        level -= 1
        showIndent(outfile, level)
        outfile.write('</quote>\n')
    def build(self, node_):
        attrs = node_.attributes
        for child in node_.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and \
                child.nodeName == 'emph':
                emph = ''
                for text_ in child.childNodes:
                    emph += text_.nodeValue
                self.emph = emph
# end class quote
Now this is a pretty straightforward data binding result that, for
example, wouldn't surprise a Java developer. Each complex type in the
schema becomes a class, and simple types become simple properties with
get/set methods (like JavaBeans). This might feel a bit unpythonic until
you reflect that these binding classes are designed to be subclassed (note
the factory convenience functions), and the use of accessor
functions allows classic method polymorphism. Of course, one could still
argue that since the binding already uses Python 2.2, it could have taken
advantage of the more Pythonic approaches to such polymorphism available
with new-style classes in Python 2.2. (For more on new-style classes, see
"Unifying types and classes in Python 2.2" by Guido van Rossum and
"What's New in Python 2.2" by A.M. Kuchling.)
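To make the trade-off concrete, here is a minimal sketch of both styles. It is not generateDS.py output, and it is written for Python 3 rather than the Python 2.2 of the listings (where the property would be spelled name = property(get, set), since decorator syntax came later). The class names Label, ShoutingLabel, and PropertyLabel, and the upper-casing and whitespace-stripping behaviors, are all illustrative assumptions.

```python
class Label:
    """Accessor style, mirroring the shape of the generated binding."""
    subclass = None  # hook: set to a subclass to change what factory() builds

    def __init__(self, name=''):
        self.name = name

    @staticmethod
    def factory(*args):
        # Instantiate the registered subclass if there is one.
        cls = Label.subclass or Label
        return cls(*args)

    def getName(self):
        return self.name

    def setName(self, name):
        self.name = name


class ShoutingLabel(Label):
    """Hypothetical application subclass overriding one accessor."""
    def getName(self):
        return self.name.upper()


class PropertyLabel:
    """Property style: plain attribute syntax with overridable behavior."""
    def __init__(self, name=''):
        self._name = name

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # Conversion or validation can live here without changing callers.
        self._name = value.strip()


# Register the subclass, as an application using the binding might.
Label.subclass = ShoutingLabel
accessor_style = Label.factory('Thomas Eliot')

property_style = PropertyLabel()
property_style.name = '  Ezra Pound  '

print(accessor_style.getName())
print(property_style.name)
```

In the accessor style, callers must go through getName() for the override to take effect. In the property style, ordinary attribute access picks up the customized behavior.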
Look at the quote.build method. Again, careful
examination will show that generateDS.py does not seem to handle mixed
content. In particular it discards text that is not within the
emph element: "is its own season...".
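One way a binding could keep mixed content is to store an ordered list of text runs and child values rather than a single emph member. The sketch below is an assumption about how that might look, not a patch to generateDS.py. It uses xml.dom.minidom, as the generated code does, but targets Python 3. The MixedQuote class and the sample instance text are hypothetical.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


class MixedQuote:
    """Hypothetical binding for <quote> that preserves mixed content."""

    def __init__(self):
        # Ordered items, each either ('text', str) or ('emph', str).
        self.content = []

    def build(self, node_):
        for child in node_.childNodes:
            if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
                # Text directly inside <quote> is kept, not discarded.
                self.content.append(('text', child.nodeValue))
            elif (child.nodeType == Node.ELEMENT_NODE
                  and child.nodeName == 'emph'):
                text = ''.join(t.nodeValue for t in child.childNodes
                               if t.nodeType == Node.TEXT_NODE)
                self.content.append(('emph', text))

    def text(self):
        """All character data in document order, with markup removed."""
        return ''.join(value for _, value in self.content)


# Sample instance text is made up for this illustration.
doc = minidom.parseString(
    '<quote><emph>Midwinter Spring</emph> is its own season</quote>')
quote = MixedQuote()
quote.build(doc.documentElement)
print(quote.content)
print(quote.text())
```

The cost of this design is that a quote no longer has a simple emph attribute. Callers have to walk the content list, which is part of why mixed content sits awkwardly in a data binding.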
Listing 6 demonstrates usage of the data binding, a pretty straightforward matter.
Listing 6:
import sys
import labels

rootObject = labels.parse('listing1.xml')
print dir(rootObject)
eliot = rootObject.label[0]
name = eliot.name
street = eliot.address.street
print street
emphasized = eliot.quote.emph
print emphasized
pound = rootObject.label[1]
# Modify the XML through the data binding
pound.name = 'Ezra Loomis Pound'
# Marshall back a portion of the XML, as modified
pound.export(sys.stdout, 0)
I also wanted to check the handling of non-ASCII characters, but the
ellipsis character I'd placed in the quote element was
discarded by the binding generation. I moved it into the
emph element and this time when I tried parsing the
instance I ended up with the infamous "UnicodeError: ASCII encoding
error: ordinal not in range(128)". Examining the binding code, I think
this might be more a problem with the marshalling and unmarshalling than
with the binding implementation, so perhaps it would be easy to fix.
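This class of error is easy to reproduce outside the binding. The sketch below, in Python 3 with only the standard library, shows the ellipsis character failing a strict ASCII encode. It then shows two ways an export routine might avoid that: writing UTF-8 bytes, or falling back to numeric character references. The export_name helper is hypothetical and not part of generateDS.py. It addresses only the output side, so the parse-time error reported above may well have a different cause.

```python
import io
from xml.sax.saxutils import escape

ELLIPSIS = '\u2026'  # the horizontal ellipsis character


def export_name(outfile, name, encoding='utf-8'):
    """Write a <name> element as bytes in the requested encoding.

    Characters the encoding cannot represent become numeric character
    references instead of raising an error.
    """
    markup = '<name>%s</name>\n' % escape(name)
    outfile.write(markup.encode(encoding, errors='xmlcharrefreplace'))


# A strict ASCII encode fails, much like the error quoted above.
try:
    ELLIPSIS.encode('ascii')
except UnicodeEncodeError as exc:
    print('strict ASCII failed:', exc.reason)

utf8_out = io.BytesIO()
export_name(utf8_out, 'season' + ELLIPSIS)

ascii_out = io.BytesIO()
export_name(ascii_out, 'season' + ELLIPSIS, encoding='ascii')

print(utf8_out.getvalue())
print(ascii_out.getvalue())
```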
Just the beginning
generateDS.py is a very nifty program and offers many of the hallmarks of a data binding. I did point out a few shortcomings, not to knock the project, but because I think that rich bindings may be an area where Python can leapfrog the field in XML processing because of its dynamic qualities. In this column I shall continue to explore the issue, covering the remaining data binding projects and discussing future directions.
Meanwhile, here's the usual brief on activity in the Python-XML landscape.
Dave Kuhlman, the developer behind generateDS.py, announced code for Python support for the REST (XML-over-HTTP) mode of Amazon Web services. The package provides Python code for parsing and processing the Amazon Web Services XML documents. It also includes code for generating WXS from an XML instance document (not unlike the concept in Eric Van der Vlist's Examplotron). Kuhlman has been very busy working with XML, REST, and SWIG (a tool for binding Python and other languages to C code). Another nice resource is Kuhlman's unofficial SWIG-based Python binding of the libxml tree API (see my last article for a discussion of the official Python binding).
Fredrik Lundh has been busy working on ElementTree, which I covered recently. He announced 1.1 and 1.2 alpha 1. Changes include a new XML literal factory, a self-contained ElementTree module, use of ASCII as the default encoding, optimizations, and limited XPath support.
John Merrells pointed me to the Python API for Berkeley DB XML, part of Berkeley DB. In Merrells' words: "The Python API is basically the same as the C++ and Java APIs, in that they expose the functionality of the product."
See this post and thread for discussion of Tim Bray's comment: "The Python people also piped to say 'everything's just fine here' but then they always do, I really must learn that language". I suspect that Tim Bray might have been referring to comments by me, Paul Prescod and others on the XML-DEV mailing list. I think our point is that Python's dynamic nature makes the horrors of DOM and SAX easier to bear, and not that Python has anything radical to leapfrog them. I'm rather hoping this series on data bindings helps produce such a leap, though.
Are you using data binding techniques in Python? Share your experience in our forum.
- Dave responds
2003-09-10 08:56:25 Dave Kuhlman
Thanks to Uche for his comments and suggestions on generateDS.py.
Looks to me like Uche had the following comments:
1. generateDS.py does not handle element declarations that are defined as simple types.
Dave replies: I've fixed this one. A new version that handles
element definitions that are simple types is at my Web site:
http://www.rexx.com/~dkuhlman/generateDS-1.3b.tar.gz
2. generateDS.py generates parsers that use minidom, whereas, for large XML input documents at least, it would be preferable if a SAX parser were used.
Dave replies: I'm working on this one. You are right. It needs to be done. However, since all of my own use of generateDS.py has been with small documents, I probably won't be fixing it too quickly.
3. generateDS.py generates code that does not handle mixed content.
Dave replies: I really should fix this one, but probably will not. I actually do not see how generateDS.py could handle mixed content, at least not without radical revision. I admit, the generated data structures were not designed well enough to handle mixed content. And, it now seems bizarre to me that I've used generateDS.py as much as I have without discovering this problem. I suppose that's because I've always used it for XML documents that represent data structures rather than for documents that contain text that was (later) marked up.
By the way, Uche did not mention that there is an option "-s" that can be used to instruct generateDS.py to generate a file containing subclasses of the data representation classes. (That he did not see this may be my fault. I noticed that the link from my main page refers to an older version. Now fixed.) These subclasses enable the user to inherit from the data representation classes and to add application specific behaviors. Using a sub-class file seems cleaner to me and also means that you can use a single super-class file (the data representation classes) along with multiple sub-class files to implement multiple tasks on the same XML document type.
Thanks again, Uche, for your interesting comments. They have been helpful to me. And, I appreciate all the work you've done on PyXML.
Dave
Dave Kuhlman
dkuhlman@rexx.com
http://www.rexx.com/~dkuhlman
- Dave responds
2003-09-17 14:34:22 Dave Kuhlman
> 2. generateDS.py generates parsers that use minidom, whereas, for large XML input documents at least, it would be preferable if a SAX parser were used.
>
> Dave replies: I'm working on this one. You are right. It needs to be done. However, since all of my own use of generateDS.py has been with small documents, I probably won't be fixing it too quickly.
It's done. Now, generateDS.py also generates SAX parsers.
A qualification: The generated SAX parser still builds a tree; it just does not build a DOM tree. So using generateDS.py parsers on *huge* documents is still a questionable thing to do.
You can find the new version at:
http://www.rexx.com/~dkuhlman/generateDS.html
http://www.rexx.com/~dkuhlman/generateDS-1.4a.tar.gz
Thanks to Uche for motivating me to fix this.
Dave
- Gnosis Utilities Work Great
2003-06-15 18:58:42 Doug Tillman
I recently used David Mertz's Gnosis utilities to generate XML from a Python object holding some maps. 1 line of code was all it took to generate the XML and the library also uses a deep copy flag to provide optimizations to cross reference repeating values rather than explicitly recreating them each time. I didn't put these libraries through any sort of rigorous testing but for ease of use they score big with me.
- XML data binding only half done
2003-06-12 11:58:08 Peter Herndon
I've been investigating this area recently, as I need some sort of data binding tool for my current project. I've looked at coding DOM manipulation, and would like to avoid it as too inefficient and not as easy as it could be. I've considered SAX, and am thinking of it as my fall-back position if nothing else works out, as (relatively) painful as it might be.
On top of that I've looked in David Mertz' xml_objectify, and while it looks like a great candidate for deserialization, I have noticed that I will be forced to code my own serializer. Again, not a problem in and of itself, but if someone else has already done the work and performed some real world testing, so much the better.
XML in and of itself is not bad as a serialization format, but the tools, so far, seem to be lacking. I would much rather have a serialization format that is easily read by humans, yet most tools seem to want to submerge the meaningful data in a morass of type declarations and placement/order information.
I'm currently looking into YAML (http://www.yaml.org) as an alternative, as the type information is embedded in the rather minimalist markup and structuring of the document. The resulting document is more easily understood by humans. The trade-off seems to be that the tools are not yet mature (Python's module seems to be in transition from the original developers to a new team), and of course, YAML is so unknown as yet that it is almost an anti-buzzword (PHB: "YAML, *what*?!?!?"), while XML is universally known and accepted at this point.
Uche, thank you for pointing out Skyron, as I will look into their offering. I would consider using generateDS.py, but I'd rather not have to create a schema for my serialization format if I can avoid it, and I'd *much* rather use RELAX NG than WXS, if I cannot (hint, hint, Mr. Kuhlman!!).
- XML data binding only half done, Pt II
2003-06-12 16:33:42 Peter Herndon
I've looked briefly at Skyron, and it doesn't really "fit my head". More specifically, I'd rather not have to learn some esoteric XML format for creating transformation recipes. In Python at least, it's just as easy to code it yourself.
YAML is not without its own set of problems, but they look eminently solvable. For the project on which I'm currently working, I need an all-Python solution. My target platform is Windows, and I haven't the luxury of Visual Studio or Cygwin. So, of my two candidates, Syck and PyYAML, I must choose PyYAML. And not unwillingly, since it seems ridiculously easy to use so far. I have run into a limitation, though, in its limited support for Python types. Specifically, PyYAML can handle most Python 2.1 or earlier built-in types, and handles classic classes quite well, but it cannot currently handle new-style classes by default. Thankfully, my current project does not require my serializable classes be new-style, so I am safe.
PyYAML does have a means of getting around this limitation, though I have not yet fully explored its usage. If your class defines a to_yaml() method, serializing calls this method instead of the underlying type-munging methods in yaml.Dumper(). to_yaml() should return a tuple consisting of the data you wish to serialize and a string defining the type of your class. Thus if your new-style class simply inherits from object, you can return (self.__dict__, " !!classname\n") and everything should be fine. N.B.: I have not thoroughly tested this solution, as I have not yet begun to deserialize these results, nor have I made sure that the output is valid YAML. Still, this solution provides a first step.
