XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

xmltramp and pxdom

xmltramp and pxdom

December 17, 2003

In this article I cover two XML processing libraries with very disjoint goals. xmltramp, developed by Aaron Swartz, is a tool for parsing XML documents into a data structure very friendly to Python. Recently many of the tools I've been covering with this primary goal of Python-friendliness have been data binding tools. xmltramp doesn't meet the definition of a data binding tool I've been using; that is, it isn't a system that represents elements and attributes from the XML document as custom objects that use the vocabulary from the XML document for naming and reference. xmltramp is more like ElementTree, which I covered earlier, defining a set of lightweight objects that make information in XML document accessible through familiar Python idioms. The stated goal of xmltramp is simplicity rather than exhaustive coverage of XML features.

pxdom, on the other hand, has the goal of strict DOM Level 3 compliance. It is developed by Andrew Clover, who contributed to the XML-SIG the document "DOM Standards compliance", a very thorough matrix of feature and defect comparisons between Python DOM implementatons. DOM has generally not been the favorite API of Python users -- or, for that matter, of Java users -- but it certainly has an important place because of its cross-language support.

xmltramp

I downloaded xmltramp 2.0, which is a single Python module. A required Python version is not given, but according to Python features I noticed in the implementation, at least 2.1 is required. I used Python 2.3.2, and installation was a simple matter of copying xmltramp.py to a directory in the Python path. To kick off my exercising of xmltramp I used the same sample file as I've been using in the data binding coverage (see listing 1).

Listing 1: Sample XML file for exercising xmltramp
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season&#133;
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>

  

The following snippet shows how simple it is to parse in a file using xmltramp:



>>> import xmltramp
>>> xml_file = open('labels.xml')
>>> doc = xmltramp.seed(xml_file)

  

xmltramp uses SAX behind the scenes for parsing, so it should generally be efficient in building up the in-memory structure. The seed function takes a file-like object but you can use the parse function instead if you have a string object. Like Elementree, xmltramp defines specialized objects (xmltramp.Element) representing each element in the XML document. The top-level object (assigned to doc) represents the top-level element, rather than the document itself. You can see its element children by peeking into an internal structure:



doc._dir
[<label added="2003-06-20">...</label>, <label added="2003-06-10">...</label>]

  

In this list each entry is a representation of a child element object. The whitespace text nodes between elements are omitted, which might be conventional stripping of such nodes, but it did make me wonder about the way xmltramp handles mixed content, about which more later. Of course in normal use you would access the xmltramp structures using the public API, which in part adopts Python's list idioms:



>>> for label in doc: print repr(label)
...
<label added="2003-06-20">...</label>
<label added="2003-06-10">...</label>
>>> print repr(doc[0])
<label added="2003-06-20">...</label>

  

I use repr because the str function (used by print to coerce non-string parameters) applied to Element objects returns a concatenation of child text nodes, excluding pure white text nodes:



>>> print doc[1]
Ezra Pound45 Usura PlaceHaileyID

  

You can also use the element node name to navigate the XML structure:



>>> print repr(doc.label)
<label added="2003-06-20">...</label>

  

There are, of course, multiple label children. The first one is returned. And, as if that weren't enough, you can also use a dictionary access (mapping) idiom:



>>> print repr(doc['label'])
<label added="2003-06-20">...</label>

  

You read attributes using the function invocation idiom:



>>> print doc.label('added')
2003-06-20

  

To navigating further into the tree, you can combine and cascade the access methods I described above:



>>> print repr(doc.label.name)
<name>...</name>
>>> print repr(doc['label']['name'])
<name>...</name>
>>> print repr(doc[0][1])
<name>...</name>

  

Unfortunately it seems that the only way to access any element except for the first child element with a certain name is to use list access methods.



>>> doc[1] #Second label element
<label added="2003-06-10">...</label>


  

You can't access this element using either the reference name "label" or using "label" as a key string for mapping access.

Text nodes, whitespace and mixed content

You have to use the list idiom to access child text nodes:



>>> print repr(doc.label.name[0])
u'Thomas Eliot'

  

You can see that text nodes are maintained as Unicode objects, which is the right thing to do. I thought that coercing Element objects to Unicode would be another good way to access their child content, but I found an odd quirk:



>>> print repr(unicode(doc.label.name)) #so far so good
u'Thomas Eliot'
>>> print repr(unicode(doc.label.quote))
u'Midwinter Spring is its own season'

  

There should be a trailing ellipsis character (the &#133; character entity) in the quote element, but it has gone missing. I looked though the xmltramp code for an obvious cause of this defect, but it turned out to be rather subtle. If you look closely you will see that the whitespace after the ellipsis character is missing as well. xmltramp coerces to Unicode by taking all text nodes descending from the given object and, using split and join string methods, collapses runs of whitespace into single space characters. Python's Unicode methods treat &#133; as whitespace, which surprised me. I know that some other Unicode characters are treated as whitespace, including &#160;, popularly known in its HTML entity form, &nbsp;, but ellipsis seems a strange character to treat as whitespace. At any rate, this quick and dirty normalization by xmltramp means that coercion to Unicode does not reliably return the precise content of descendant text nodes, and I recommend sticking to list access. The following snippet gets all text content that is the immediate child of an element, excepting pure whitespace nodes, which xmltramp seems to strip:



>>> ''.join([t for t in doc.label.quote if isinstance(t, unicode)])
u' is its own season\x85\n    '

  

Within these constraints, xmltramp maintains mixed content so that you can access it using the patterns I've described.



>>> print list(doc.label.quote)
[<emph>...</emph>, u' is its own season\x85\n    ']
>>> print repr(doc.label.quote.emph)
<emph>...</emph>
>>> print repr(unicode(doc.label.quote.emph))
u'Midwinter Spring'

  

Mutations and re-serialization

xmltramp allows for limited mutation. The easiest thing to do is add or modify an attribute:



>>> doc.label('added')
u'2003-06-20'
>>> doc.label(added=u'2003-11-20') #returns attrs as a dict
{u'added': u'2003-11-20'}
>>> doc.label('added')
u'2003-11-20'
>>> doc.label('added', u'2003-12-20')
>>> doc.label('added')
u'2003-12-20'
>>> doc.label(new_attr=u'1')
{u'added': u'2003-12-20', 'new_attr': u'1'}

  

To add an element with simple text content you can use mapping update idiom:



>>> doc[1]['quote'] = u"Make it new"

  

This code adds a quote element as the last child of the second label element with the simple text content Make it new. In order to see the result of this operation I wanted to reserialize the element back to XML. xmltramp provides for additional parameters to the __repr__ magic method which can be used for such reserialization. The first is a boolean parameter which you just set to True to trigger full reserialization:



>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address><quote>Make it new</quote></label>'

  

The above output actually appears all on one line, but I've added in breaks for formatting reasons.

Again you can see the effect of the stripped whitespace. The second parameter is also a boolean, and True turns on pretty-printing (using tabs for indentation). You cannot use the repr built-in function in this way on xmltramp elements because it only accepts one argument.

To delete an element, you must use the sequence idiom for deletion, in contrast to the use of mapping idiom for addition of elements:



>>> del doc[1][2] #Remove newly added quote element
>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address></label>'

  

The above output actually appears all on one line, but I've added in breaks for formatting reasons.

You can add more complex elements, by passing in well-formed XML documents and adding them as new elements:



>>> new_elem = xmltramp.parse("<emph>Make it new</emph>")
>>> doc[1]['quote'] = new_elem
>>> doc[1].__repr__(True)
u'<label added="2003-06-10"><name>Ezra Pound</name><address>
<street>45 Usura Place</street><city>Hailey</city><state>ID</state>
</address><quote><emph>Make it new</emph></quote></label>'

  

The above output actually appears all on one line, but I've added in breaks for formatting reasons.

But you cannot add mixed content so easily because you can't parse a a document which isn't well-formed XML.



>>> new_elem = xmltramp.parse("Make it <emph>new</emph>")
[... Raises a SAX parse exception ...]

  

You would have to combine other operations to add such mixed content.

pxdom

pxdom 0.6 like xmltramp comes as a single Python module. Again I simply copied pxdom.py to a directory in my Python 2.3.2 library path (pxdom supports Python versions from 1.5.2 on). pxdom scrupulously implements DOM Level 3's Load/Save specification which standardizes serialization and deserialization between XML text and DOM. To read XML from a file, use a pattern such as that in listing 2.

Listing 2: Basic loading of an XML file
import pxdom
dom= pxdom.getDOMImplementation('')
parser= dom.createDOMParser(dom.MODE_SYNCHRONOUS, None)
doc= parser.parseURI('file:labels.xml')

  

pxdom also provides some convenience functions parseString and parse (accepts a file-like object or an OS-specific pathname) which are not provided for in DOM but are added in minidom. Listing 3 demonstrates some DOM operations using pxdom.

Listing 3: Using pxdom to access parts of an XML document
import pxdom

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""
 
#Create a pxdom document node parsed from XML in a string
dom= pxdom.getDOMImplementation('')
parser= dom.createDOMParser(dom.MODE_SYNCHRONOUS, None)
doc_node = pxdom.parseString(DOC)
print doc_node

#You can execute regular DOM operations on the document node
verse_element = doc_node.documentElement
print verse_element

#As with other Python DOMs you can use "Pythonic" shortcuts for
#things like Node lists and named node maps
#The first child of the verse element is a white space text node
#The second is the attribution element
attribution_element = verse_element.childNodes[1]

#attribution_string becomes "Christopher Okibgo"
attribution_string = attribution_element.firstChild.data
print repr(attribution_string)


  

I was a bit concerned to see that the output from the last line of the listing is a plain text string rather than a Unicode object. I experimented a bit and found that if any text node has a non-ASCII character, pxdom appears to be representing it as a Unicode object rather than a plain string. This at least reassured me of pxdom's Unicode support, but I wonder whether such a mix of text and Unicode objects adds unnecessary complications.

Listing 4 shows how to use pxdom to build a DOM tree from scratch, node by node, and then print the corresponding XML. Rather than the toxml method of minidom and the Print and PrettyPrint functions of Domlette and 4DOM respectively, pxdom implements the DOM standard saveXML method.

Listing 4: Using pxdom to build a document and serialize to XML
import pxdom
from xml.dom import EMPTY_NAMESPACE, XML_NAMESPACE

impl = pxdom.getDOMImplementation('')

#Create a document type node using the doctype name "message"
#A blank system ID and blank public ID (i.e. no DTD information)
doctype = impl.createDocumentType(u"message", None, None)

#Create a document node, which also creates a document element node
#For the element, use a blank namespace URI and local name "message"
doc = impl.createDocument(EMPTY_NAMESPACE, u"message", doctype)

#Get the document element
msg_elem = doc.documentElement

#Create an xml:lang attribute on the new element
msg_elem.setAttributeNS(XML_NAMESPACE, u"xml:lang", u"en")

#Create a text node with some data in it
new_text = doc.createTextNode(u"You need Python")

#Add the new text node to the document element
msg_elem.appendChild(new_text)

#Print out the result
print doc.saveXML()


  
    

Also in Python and XML

Processing Atom 1.0

Should Python and XML Coexist?

EaseXML: A Python Data-Binding Tool

More Unicode Secrets

Unicode Secrets

There is much more to pxdom than I can cover here. After all, it is a complete DOM implementation. The pxdom project puts a premium on conformance, and the module does extremely well running the DOM Level 1/2 Test Suite.

Wrap up

The choices available to Python developers for processing XML continue to multiply, which is a blessing as well as a curse -- there is plenty of variety and choice, but there is also a lot to keep track of. xmltramp and pxdom demonstrate the variety especially well, providing contrasting styles for XML processing. If you need a quick and dirty excavation of an XML document to extract key data, xmltramp is a nice tool to have on hand. If you want to stick to the standard DOM idiom, or need to be able to control all the advanced aspects of XML documents, pxdom is a trusty companion. There are more choices that I have not been able to cover yet, notably PyRXP. I have also not provided much coverage of XML namespaces in articles on individual tools, but I shall be looking at namespace processing across libraries. Watch for such topics in future columns and don't hesitate to post your own ideas for useful coverage in the comments section of this article.



1 to 2 of 2
  1. Unicode mappings
    2004-01-07 09:39:17 Thomas Hickey
  2. The ellipsis problem is on your end
    2003-12-22 09:33:00 John Cowan
1 to 2 of 2