xmltramp and pxdom
In this article I cover two XML processing libraries with very different goals. xmltramp, developed by Aaron Swartz, is a tool for parsing XML documents into a data structure very friendly to Python. Recently many of the tools I've been covering with this primary goal of Python-friendliness have been data binding tools. xmltramp doesn't meet the definition of a data binding tool I've been using; that is, it isn't a system that represents elements and attributes from the XML document as custom objects that use the vocabulary from the XML document for naming and reference. xmltramp is more like ElementTree, which I covered earlier, defining a set of lightweight objects that make information in an XML document accessible through familiar Python idioms. The stated goal of xmltramp is simplicity rather than exhaustive coverage of XML features.
pxdom, on the other hand, has the goal of strict DOM Level 3 compliance. It is developed by Andrew Clover, who contributed to the XML-SIG the document "DOM Standards compliance", a very thorough matrix of feature and defect comparisons between Python DOM implementations. DOM has generally not been the favorite API of Python users -- or, for that matter, of Java users -- but it certainly has an important place because of its cross-language support.
I downloaded xmltramp 2.0, which is a single Python module. A required Python version is not given, but according to Python features I noticed in the implementation, at least 2.1 is required. I used Python 2.3.2, and installation was a simple matter of copying xmltramp.py to a directory in the Python path. To kick off my exercising of xmltramp I used the same sample file as I've been using in the data binding coverage (see listing 1).
Listing 1: Sample XML file for exercising xmltramp
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
<label added="2003-06-20">
<quote>
<!-- Mixed content -->
<emph>Midwinter Spring</emph> is its own season&#133;
</quote>
<name>Thomas Eliot</name>
<address>
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<state>CT</state>
</address>
</label>
<label added="2003-06-10">
<name>Ezra Pound</name>
<address>
<street>45 Usura Place</street>
<city>Hailey</city>
<state>ID</state>
</address>
</label>
</labels>
The following snippet shows how simple it is to parse in a file using xmltramp:
>>> import xmltramp
>>> xml_file = open('labels.xml')
>>> doc = xmltramp.seed(xml_file)
xmltramp uses SAX behind the scenes for parsing, so it should
generally be efficient in building up the in-memory structure. The
seed function takes a file-like object but you can use
the parse function instead if you have a string object.
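To make the idea of building an in-memory structure from SAX events concrete, here is a minimal sketch using only the standard library. It is not xmltramp's implementation: the TreeBuilder class and parse_string function are hypothetical names, plain dictionaries stand in for xmltramp's element objects, and the code is written for Python 3 rather than the Python 2.3 used elsewhere in this article.

```python
import xml.sax


class TreeBuilder(xml.sax.ContentHandler):
    """Hypothetical handler that turns SAX events into nested dicts."""

    def __init__(self):
        super().__init__()
        self.root = None   # top-level element, set on the first start tag
        self.stack = []    # chain of currently open elements

    def startElement(self, name, attrs):
        node = {"name": name, "attrs": dict(attrs.items()), "children": []}
        if self.stack:
            self.stack[-1]["children"].append(node)
        else:
            self.root = node
        self.stack.append(node)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # SAX may deliver one run of text in several chunks;
        # this sketch simply stores each chunk as its own child.
        if self.stack:
            self.stack[-1]["children"].append(content)


def parse_string(text):
    """Parse an XML string and return the dict for the top-level element."""
    handler = TreeBuilder()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.root


tree = parse_string(
    '<labels><label added="2003-06-20"><name>Thomas Eliot</name></label></labels>'
)
label = tree["children"][0]
print(tree["name"], label["attrs"])
```

Because the handler only keeps a stack of open elements, the tree is built in a single pass over the input, which is the reason a SAX-based builder tends to be efficient.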
Like ElementTree, xmltramp defines specialized objects
(xmltramp.Element) representing each element in the XML
document. The top-level object (assigned to doc)
represents the top-level element, rather than the document itself.
You can see its element children by peeking into an internal
structure:
>>> doc._dir
[<label added="2003-06-20">...</label>, <label added="2003-06-10">...</label>]
In this list each entry is a representation of a child element object. The whitespace text nodes between elements are omitted, which might be conventional stripping of such nodes, but it did make me wonder about the way xmltramp handles mixed content, about which more later. Of course in normal use you would access the xmltramp structures using the public API, which in part adopts Python's list idioms:
>>> for label in doc: print repr(label)
...
<label added="2003-06-20">...</label>
<label added="2003-06-10">...</label>
>>> print repr(doc[0])
<label added="2003-06-20">...</label>
I use repr because the str function (used by print to coerce non-string parameters), when applied to Element objects, returns a concatenation of descendant text nodes, excluding pure whitespace text nodes:
>>> print doc[1]
Ezra Pound45 Usura PlaceHaileyID
You can also use the element node name to navigate the XML structure:
>>> print repr(doc.label)
<label added="2003-06-20">...</label>
There are, of course, multiple label children. The
first one is returned. And, as if that weren't enough, you can also use
a dictionary access (mapping) idiom:
>>> print repr(doc['label'])
<label added="2003-06-20">...</label>
You read attributes using the function invocation idiom:
>>> print doc.label('added')
2003-06-20
To navigate further into the tree, you can combine and cascade the access methods I described above:
>>> print repr(doc.label.name)
<name>...</name>
>>> print repr(doc['label']['name'])
<name>...</name>
>>> print repr(doc[0][1])
<name>...</name>
Unfortunately it seems that the only way to access any element except for the first child element with a certain name is to use list access methods.
>>> doc[1] #Second label element
<label added="2003-06-10">...</label>
You can't access this element using either the reference name "label" or using "label" as a key string for mapping access.
You have to use the list idiom to access child text nodes:
>>> print repr(doc.label.name[0])
u'Thomas Eliot'
You can see that text nodes are maintained as Unicode objects, which
is the right thing to do. I thought that coercing
Element objects to Unicode would be another good way to
access their child content, but I found an odd quirk:
>>> print repr(unicode(doc.label.name)) #so far so good
u'Thomas Eliot'
>>> print repr(unicode(doc.label.quote))
u'Midwinter Spring is its own season'
There should be a trailing ellipsis character (the
&#133; character entity) in the quote
element, but it has gone missing. I looked through the xmltramp code for
an obvious cause of this defect, but it turned out to be rather subtle.
If you look closely you will see that the whitespace after the
ellipsis character is missing as well. xmltramp coerces to Unicode by
taking all text nodes descending from the given object and, using the
split and join string methods, collapsing runs of whitespace
into single space characters. Python's Unicode methods treat
&#133; as whitespace, which surprised me. I know
that some other Unicode characters are treated as whitespace,
including &#160;, popularly known in its HTML entity
form, &nbsp;, but ellipsis seems a strange character
to treat as whitespace. At any rate, this quick and dirty
normalization by xmltramp means that coercion to Unicode does not
reliably return the precise content of descendant text nodes, and I
recommend sticking to list access. The following snippet gets all
text content that is the immediate child of an element, excepting pure
whitespace nodes, which xmltramp seems to strip:
>>> ''.join([t for t in doc.label.quote if isinstance(t, unicode)])
u' is its own season\x85\n '
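The whitespace behavior described above can be checked with plain string methods, without xmltramp. A likely explanation for the surprise: the character shown as \x85 in the output is code point U+0085, which Unicode defines as the NEXT LINE control character. The ellipsis occupies position 0x85 only in the Windows-1252 encoding; its Unicode code point is U+2026. The sketch below is written for Python 3, where every string is Unicode, and the variable names are my own.

```python
NEL = "\x85"         # U+0085, what the character reference 133 denotes
ELLIPSIS = "\u2026"  # U+2026 HORIZONTAL ELLIPSIS
NBSP = "\xa0"        # U+00A0 NO-BREAK SPACE

text = "Midwinter Spring is its own season" + NEL + "\n  "

# The split/join idiom for collapsing runs of whitespace
collapsed = " ".join(text.split())
print(repr(collapsed))

# Which of these characters count as whitespace for string methods?
print(NEL.isspace(), ELLIPSIS.isspace(), NBSP.isspace())
```

A true ellipsis (U+2026) is not classed as whitespace, so it would survive the same split and join treatment.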
Within these constraints, xmltramp maintains mixed content so that you can access it using the patterns I've described.
>>> print list(doc.label.quote)
[<emph>...</emph>, u' is its own season\x85\n ']
>>> print repr(doc.label.quote.emph)
<emph>...</emph>
>>> print repr(unicode(doc.label.quote.emph))
u'Midwinter Spring'
xmltramp allows for limited mutation. The easiest thing to do is add or modify an attribute:
>>> doc.label('added')
u'2003-06-20'
>>> doc.label(added=u'2003-11-20') #returns attrs as a dict
{u'added': u'2003-11-20'}
>>> doc.label('added')
u'2003-11-20'
>>> doc.label('added', u'2003-12-20')
>>> doc.label('added')
u'2003-12-20'
>>> doc.label(new_attr=u'1')
{u'added': u'2003-12-20', 'new_attr': u'1'}
To add an element with simple text content you can use the mapping update idiom:
>>> doc[1]['quote'] = u"Make it new"
This code adds a quote element as the last child of the
second label element with the simple text content
Make it new. In order to see the result of this
operation I wanted to reserialize the element back to XML. xmltramp
provides for additional parameters to the __repr__ magic
method which can be used for such reserialization. The first is a
boolean parameter which you just set to True to trigger
full reserialization:
>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address><quote>Make it new</quote></label>'
The above output actually appears all on one line, but I've added in breaks for formatting reasons.
Again you can see the effect of the stripped whitespace. The second
parameter is also a boolean, and True turns on
pretty-printing (using tabs for indentation). You cannot use the
repr built-in function in this way on xmltramp elements
because it only accepts one argument.
To delete an element, you must use the sequence idiom for deletion, in contrast to the mapping idiom used for adding elements:
>>> del doc[1][2] #Remove newly added quote element
>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address></label>'
The above output actually appears all on one line, but I've added in breaks for formatting reasons.
You can add more complex elements by parsing well-formed XML documents and adding the results as new elements:
>>> new_elem = xmltramp.parse("<emph>Make it new</emph>")
>>> doc[1]['quote'] = new_elem
>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address><quote><emph>Make it new</emph></quote></label>'
The above output actually appears all on one line, but I've added in breaks for formatting reasons.
But you cannot add mixed content so easily, because you can't parse a document that isn't well-formed XML.
>>> new_elem = xmltramp.parse("Make it <emph>new</emph>")
[... Raises a SAX parse exception ...]
You would have to combine other operations to add such mixed content.
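One general way to get around the well-formedness requirement is to wrap the fragment in a throwaway element, parse that, and then move the wrapper's children to their real parent. The sketch below illustrates the technique with the standard library's minidom rather than xmltramp, whose API for this I have not verified. The parse_fragment helper and the wrapper element name are hypothetical, and the code is written for Python 3.

```python
from xml.dom import minidom


def parse_fragment(fragment, owner_doc):
    """Parse mixed content by wrapping it in a throwaway root element.

    Returns copies of the parsed nodes that belong to owner_doc.
    """
    wrapped = "<wrapper>%s</wrapper>" % fragment
    wrapper = minidom.parseString(wrapped).documentElement
    return [owner_doc.importNode(child, True) for child in wrapper.childNodes]


doc = minidom.parseString("<label><quote/></label>")
quote = doc.documentElement.firstChild

# The fragment is text followed by an element, so it is not a document itself
for node in parse_fragment("Make it <emph>new</emph>", doc):
    quote.appendChild(node)

result = doc.documentElement.toxml()
print(result)
```

The fragment still has to be well-formed once wrapped, so unbalanced tags will raise a parse error just as before.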
pxdom 0.6, like xmltramp, comes as a single Python module. Again I simply copied pxdom.py to a directory in my Python 2.3.2 library path (pxdom supports Python versions from 1.5.2 on). pxdom scrupulously implements DOM Level 3's Load/Save specification, which standardizes serialization and deserialization between XML text and DOM. To read XML from a file, use a pattern such as that in listing 2.
Listing 2: Basic loading of an XML file
import pxdom
dom= pxdom.getDOMImplementation('')
parser= dom.createDOMParser(dom.MODE_SYNCHRONOUS, None)
doc= parser.parseURI('file:labels.xml')
pxdom also provides some convenience functions
parseString and parse (accepts a file-like
object or an OS-specific pathname) which are not provided for in DOM but
are added in minidom. Listing 3 demonstrates some DOM operations using
pxdom.
import pxdom
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okibgo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
#Create a pxdom document node parsed from XML in a string
dom= pxdom.getDOMImplementation('')
parser= dom.createDOMParser(dom.MODE_SYNCHRONOUS, None)
doc_node = pxdom.parseString(DOC)
print doc_node
#You can execute regular DOM operations on the document node
verse_element = doc_node.documentElement
print verse_element
#As with other Python DOMs you can use "Pythonic" shortcuts for
#things like Node lists and named node maps
#The first child of the verse element is a white space text node
#The second is the attribution element
attribution_element = verse_element.childNodes[1]
#attribution_string becomes "Christopher Okibgo"
attribution_string = attribution_element.firstChild.data
print repr(attribution_string)
I was a bit concerned to see that the output from the last line of the listing is a plain text string rather than a Unicode object. I experimented a bit and found that if any text node has a non-ASCII character, pxdom appears to be representing it as a Unicode object rather than a plain string. This at least reassured me of pxdom's Unicode support, but I wonder whether such a mix of text and Unicode objects adds unnecessary complications.
Listing 4 shows how to use pxdom to build a DOM tree from scratch,
node by node, and then print the corresponding XML. Rather than the
toxml method of minidom and the Print and
PrettyPrint functions of Domlette and 4DOM respectively,
pxdom implements the DOM standard saveXML method.
import pxdom
from xml.dom import EMPTY_NAMESPACE, XML_NAMESPACE
impl = pxdom.getDOMImplementation('')
#Create a document type node using the doctype name "message"
#A blank system ID and blank public ID (i.e. no DTD information)
doctype = impl.createDocumentType(u"message", None, None)
#Create a document node, which also creates a document element node
#For the element, use a blank namespace URI and local name "message"
doc = impl.createDocument(EMPTY_NAMESPACE, u"message", doctype)
#Get the document element
msg_elem = doc.documentElement
#Create an xml:lang attribute on the new element
msg_elem.setAttributeNS(XML_NAMESPACE, u"xml:lang", u"en")
#Create a text node with some data in it
new_text = doc.createTextNode(u"You need Python")
#Add the new text node to the document element
msg_elem.appendChild(new_text)
#Print out the result
print doc.saveXML()
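For comparison, the same tree can be built with the standard library's minidom, which serializes with its own toxml method rather than the DOM Level 3 saveXML. This is only a sketch for readers without pxdom installed: it is written for Python 3, and minidom makes no claim to DOM Level 3 Load/Save conformance.

```python
from xml.dom import minidom, EMPTY_NAMESPACE, XML_NAMESPACE

impl = minidom.getDOMImplementation()

# Document type named "message" with no public or system ID
doctype = impl.createDocumentType("message", None, None)

# Document node plus its document element, in no namespace
doc = impl.createDocument(EMPTY_NAMESPACE, "message", doctype)
msg_elem = doc.documentElement

# xml:lang attribute and a text node child, as in the pxdom listing
msg_elem.setAttributeNS(XML_NAMESPACE, "xml:lang", "en")
msg_elem.appendChild(doc.createTextNode("You need Python"))

# minidom's serialization method, in place of pxdom's saveXML
xml_text = doc.toxml()
print(xml_text)
```

The node-building calls are identical in both libraries because they come from the DOM specification itself; only the final serialization call differs.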
There is much more to pxdom than I can cover here. After all, it is a complete DOM implementation. The pxdom project puts a premium on conformance, and the module does extremely well running the DOM Level 1/2 Test Suite.
The choices available to Python developers for processing XML continue to multiply, which is a blessing as well as a curse -- there is plenty of variety and choice, but there is also a lot to keep track of. xmltramp and pxdom demonstrate the variety especially well, providing contrasting styles for XML processing. If you need a quick and dirty excavation of an XML document to extract key data, xmltramp is a nice tool to have on hand. If you want to stick to the standard DOM idiom, or need to be able to control all the advanced aspects of XML documents, pxdom is a trusty companion. There are more choices that I have not been able to cover yet, notably PyRXP. I have also not provided much coverage of XML namespaces in articles on individual tools, but I shall be looking at namespace processing across libraries. Watch for such topics in future columns and don't hesitate to post your own ideas for useful coverage in the comments section of this article.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.