Simple XML Processing With elementtree

February 12, 2003

Uche Ogbuji

Fredrik Lundh, well known in Python circles as "the effbot", has been an important contributor to Python and to PyXML. He has also developed a variety of useful tools, many of which involve Python and XML. One of these is elementtree, a collection of lightweight utilities for XML processing. elementtree is centered around a data structure for representing XML. As its name implies, this data structure is a hierarchy of objects, each of which represents an XML element. The focus is squarely on elements: there is no zoo of node types. Element objects themselves act as Python dictionaries of the XML attributes and Python lists of the element children. Text content is represented as simple data members on element instances. elementtree is about as pythonic as it gets, offering a fresh perspective on Python-XML processing, especially after the DOM explorations of my previous columns.

elementtree is very easy to set up. I downloaded version 1.1b3 (you can always find the latest version on the effbot download page). You need Python 2.1 (or newer); I used 2.2.1. Installation was a simple matter of unzipping the package and invoking distutils:

python setup.py install

You must have pyexpat in order to use elementtree, either as part of your Python installation itself or by installing PyXML.

XML made even easier

Listing 1 is a sample document (memo.xml) that I will use in this article.

Listing 1 (memo.xml): a sample XML file

<?xml version='1.0' encoding='utf-8'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">2003-02-01</date>
<to>The Art World</to>
<body>
It appears that with the unfortunate recent United States
Supreme Court ruling in <cite>Eldred vs. Ashcroft</cite>, the
basis for creative expression, and the general gain of society
in such expression is <strong>forfeit</strong> to crude commercial
interest.
</body>
</memo>

The primary benefit of elementtree is simplicity. Listing 2 reads the XML document in Listing 1 into the elementtree data structure and then writes it back out as XML.

Listing 2 (listing2.py): XML round trip using elementtree
import sys

#Most common APIs are available on the ElementTree class
from elementtree.ElementTree import ElementTree

#create an ElementTree instance from an XML file
doc = ElementTree(file="memo.xml")

#write out XML from the ElementTree instance
doc.write(sys.stdout)
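
For readers on a current Python, where the elementtree package lives on in the standard library as xml.etree.ElementTree, the same round trip might look roughly like the sketch below. The inline MEMO string (a shortened memo) and the io.StringIO buffers are assumptions made here so the example runs without memo.xml on disk; they are not part of the original listing.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for memo.xml (assumption: kept inline for convenience)
MEMO = """<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">2003-02-01</date>
<to>The Art World</to>
</memo>"""

# Create an ElementTree instance from a file-like object
doc = ET.ElementTree(file=io.StringIO(MEMO))

# Write the tree back out; encoding="unicode" produces text rather than bytes
out = io.StringIO()
doc.write(out, encoding="unicode")
round_tripped = out.getvalue()
print(round_tripped)
```

Passing a real file name to ElementTree(file=...) and sys.stdout to write() should behave much like Listing 2.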

elementtree is fast and lightweight. I tested it with Elliotte Rusty Harold's 1998 baseball stats, Hamlet, and the Old Testament from Jon Bosak's Revised XML Document Collections. These are the same large files I used in the last article to explore the performance of various iteration techniques. In very crude benchmarks of parsing alone, elementtree was about 30% slower than cDomlette but used about 30% less memory, which is very impressive for an XML data structure written in pure Python (the parser is a different matter: it uses pyexpat, which is written in C).
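
A crude benchmark of this kind can be sketched in a few lines. The version below is only an illustration, not the measurement behind the figures above: it assumes a current Python with xml.etree.ElementTree, uses minidom as a stand-in for cDomlette, and parses a synthetic document (SAMPLE, hypothetical data) rather than the files named in the text. The measure() helper is likewise a name invented for this sketch.

```python
import time
import tracemalloc
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Synthetic stand-in for a large document (hypothetical data)
SAMPLE = "<stats>" + "".join(
    '<player id="%d"><name>Player %d</name><hits>%d</hits></player>'
    % (i, i, i % 200)
    for i in range(5000)
) + "</stats>"

def measure(parse):
    """Return (seconds, peak_bytes) for a single call of parse(SAMPLE)."""
    tracemalloc.start()
    start = time.perf_counter()
    tree = parse(SAMPLE)          # keep a reference so the tree stays alive
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del tree
    return elapsed, peak

results = {
    "ElementTree": measure(ET.fromstring),
    "minidom": measure(xml.dom.minidom.parseString),
}
for name, (seconds, peak) in results.items():
    print("%-12s %.3f s  %d KiB peak" % (name, seconds, peak // 1024))
```

Single runs like this are noisy, and tracemalloc itself slows parsing, so the numbers are only good for rough comparisons.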

elementtree also offers access to nodes of the XML tree using specialized Python objects, which are not based on DOM. elementtree uses its freedom from DOM to adopt the most pythonic idioms available. Iterators, in particular, are the core mechanism for navigating ElementTree instances. As an example, Listing 3 displays information about all elements in the example document.

Listing 3 (listing3.py): displaying the content of the XML document
import sys
from elementtree.ElementTree import ElementTree

doc = ElementTree(file=sys.argv[1])

#Create an iterator over every element in the document
elements = doc.getiterator()

#Iterate
for element in elements:
    #First the element tag name
    print "Element:", element.tag
    #Next the attributes (available on the instance itself using
    #the Python dictionary protocol)
    if element.keys():
        print "\tAttributes:"
        for name, value in element.items():
            print "\t\tName: '%s', Value: '%s'"%(name, value)
    #Next the child elements and text
    print "\tChildren:"
    #Text that precedes all child elements (may be None)
    if element.text:
        text = element.text
        #Truncate long text to 40 characters for display
        text = len(text) > 40 and text[:40] + "..." or text
        print "\t\tText:", repr(text)
    if element.getchildren():
        #Can also use: "for child in element.getchildren():"
        for child in element:
            #Child element tag name
            print "\t\tElement", child.tag
            #The "tail" on each child element consists of the text
            #that comes after it in the parent element content, but
            #before its next sibling.
            if child.tail:
                text = child.tail
                text = len(text) > 40 and text[:40] + "..." or text
                print "\t\tText:", repr(text)

This gives you a quick look at the very pythonic read API for elementtree objects. Each element object supports the Python dictionary protocol for access to its attributes and the sequence protocol for access to its children. The main quirk in this API is how mixed content is handled. Each element directly stores only the portion of its text content that precedes any child elements (in its text attribute). It leaves the storage of all its other text to its children: each child element stores, in its tail attribute, any text that follows it within its parent element. The comments in the elementtree code are actually misleading on this point; I suspect they are out of date. There are a few other points of confusion in the comments, so do be careful. Running the script in Listing 3 against the document in Listing 1, I get:

$ python listing3.py memo.xml
Element: memo
        Children:
                Text: '\n'
                Element title
                Text: '\n'
                Element date
                Text: '\n'
                Element to
                Text: '\n'
                Element body
                Text: '\n'
Element: title
        Children:
                Text: 'With Usura Hath no Man a House of Good S...'
Element: date
        Attributes:
                Name: 'form', Value: 'ISO-8601'
        Children:
                Text: '2003-02-01'
Element: to
        Children:
                Text: 'The Art World'
Element: body
        Children:
                Text: '\nIt appears that with the unfortunate re...'
                Element cite
                Text: ', the\nbasis for creative expression, and...'
                Element strong
                Text: ' to crude commercial\ninterest.\n'
Element: cite
        Children:
                Text: 'Eldred vs. Ashcroft'
Element: strong
        Children:
                Text: 'forfeit'
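
The text/tail convention described above is easiest to see by stitching the pieces back together. The following is a minimal sketch, assuming a current Python with xml.etree.ElementTree; the full_text() helper and the shortened BODY string are illustrative names introduced here, not part of elementtree.

```python
import xml.etree.ElementTree as ET

# Shortened version of the memo body (assumption: trimmed for readability)
BODY = (
    "<body>\nIt appears that with the ruling in "
    "<cite>Eldred vs. Ashcroft</cite>, expression is "
    "<strong>forfeit</strong> to crude commercial\ninterest.\n</body>"
)

def full_text(element):
    """Rebuild an element's complete text content from text and tail."""
    parts = [element.text or ""]          # text before the first child
    for child in element:
        parts.append(full_text(child))    # the child's own content
        parts.append(child.tail or "")    # parent text following the child
    return "".join(parts)

body = ET.fromstring(BODY)
print(repr(body.text))       # text stored on the body element itself
print(repr(body[0].tail))    # text stored on the cite child
print(repr(full_text(body)))
```

In current versions of the standard library, "".join(body.itertext()) should give the same result without a hand-written helper.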

Namespaces

elementtree supports XML namespaces using a rather different mechanism from most XML processing APIs. The namespace is not maintained as a separate instance variable but is built into the element or attribute name using James Clark's notation. If you are not familiar with James Clark's article "XML Namespaces", you probably should be. Even if you are very familiar with XML namespaces, you may need to explain them to others, and this article is considered one of the best explanations of XML namespaces available. It also introduces a notation for expressing a namespace-qualified name. For example, the name of an element with namespace http://www.w3.org/1999/XSL/Transform and local name template is written {http://www.w3.org/1999/XSL/Transform}template. The sample document in Listing 4 (nsexample.xml) uses namespaces.

Listing 4 (nsexample.xml): a sample document with namespaces
<ClientInfo xmlns="http://fourthought.com/timelog"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:Description>
    Fourthought, Inc
  </dc:Description>
  <dc:Title>
    Management Subcontracting
  </dc:Title>
  <MinIncrement>0.25</MinIncrement>
  <InvoiceNumber>7777</InvoiceNumber>
</ClientInfo>

The following is the result of running Listing 3 against this file.

$ python listing3.py nsexample.xml
Element: {http://fourthought.com/timelog}ClientInfo
        Children:
                Text: '\n  '
                Element {http://purl.org/dc/elements/1.1/}Description
                Text: '\n  '
                Element {http://purl.org/dc/elements/1.1/}Title
                Text: '\n  '
                Element {http://fourthought.com/timelog}MinIncrement
                Text: '\n  '
                Element {http://fourthought.com/timelog}InvoiceNumber
                Text: '\n'
Element: {http://purl.org/dc/elements/1.1/}Description
        Children:
                Text: '\n    Fourthought, Inc\n  '
Element: {http://purl.org/dc/elements/1.1/}Title
        Children:
                Text: '\n    Management Subcontracting\n  '
Element: {http://fourthought.com/timelog}MinIncrement
        Children:
                Text: '0.25'
Element: {http://fourthought.com/timelog}InvoiceNumber
        Children:
                Text: '7777'
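
Because the namespace travels inside the tag string, code that needs the two parts separately has to split them itself. Here is one way that could be sketched, assuming a current Python with xml.etree.ElementTree; split_clark() is a hypothetical helper written for this example, and DOC is a shortened version of Listing 4.

```python
import xml.etree.ElementTree as ET

# Shortened version of nsexample.xml (assumption: trimmed for readability)
DOC = """<ClientInfo xmlns="http://fourthought.com/timelog"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:Title>Management Subcontracting</dc:Title>
  <MinIncrement>0.25</MinIncrement>
</ClientInfo>"""

def split_clark(name):
    """Split '{uri}local' into (uri, local); uri is None if unqualified."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return uri, local
    return None, name

root = ET.fromstring(DOC)
for element in root.iter():
    print(split_clark(element.tag))

# Clark notation can also be used directly when searching by tag
DC = "http://purl.org/dc/elements/1.1/"
title = root.find("{%s}Title" % DC)
print(title.text)
```

Searching for the bare name "Title" would find nothing here, since the element's full name includes its namespace.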

You can see how the namespace is built into the tag name variables. One problem with elementtree's handling of namespaces is that the prefixes used in the original XML document are not preserved, as they are with DOM and the like. This is mostly an inconvenience: prefixes are strictly inconsequential in XML namespaces. But it can be enough of an annoyance that you should be aware of it. For example, I took the document in Listing 4 and ran it through the round-trip script (which just uses elementtree to read in a document and print it right back out again). I got the following result:

<ns0:ClientInfo xmlns:ns0="http://fourthought.com/timelog">
  <ns1:Description xmlns:ns1="http://purl.org/dc/elements/1.1/">
    Fourthought, Inc
  </ns1:Description>
  <ns2:Title xmlns:ns2="http://purl.org/dc/elements/1.1/">
    Management Subcontracting
  </ns2:Title>
  <ns0:MinIncrement>0.25</ns0:MinIncrement>
  <ns0:InvoiceNumber>7777</ns0:InvoiceNumber>
</ns0:ClientInfo>

This is identical to the original document according to the rules of XML namespaces, but you can see the lexical differences, including the generic prefixes and the change in location of the namespace declarations.
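
Later descendants of this code offer some control over the generated prefixes. The sketch below assumes the standard library's xml.etree.ElementTree and its register_namespace() function, which is not part of the 1.1b3 release discussed here; the "tl" prefix is an arbitrary choice made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened version of nsexample.xml (assumption: trimmed for readability)
DOC = """<ClientInfo xmlns="http://fourthought.com/timelog"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:Title>Management Subcontracting</dc:Title>
  <MinIncrement>0.25</MinIncrement>
</ClientInfo>"""

root = ET.fromstring(DOC)

# Namespaces the serializer does not know get generated prefixes like ns0
generic = ET.tostring(root, encoding="unicode")

# register_namespace() updates a module-wide prefix table, so it
# affects every later serialization in the same process
ET.register_namespace("tl", "http://fourthought.com/timelog")
ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/")
named = ET.tostring(root, encoding="unicode")

print(generic)
print(named)
```

The original prefixes are still not remembered from the source document; the caller simply supplies preferred ones before writing.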

Mutation

elementtree includes APIs for mutating documents. Suppose that I decide to change the body of the memo. Listing 5 is a script that does so.

Listing 5 (listing5.py): an example of mutation using elementtree
import sys
from elementtree.ElementTree import ElementTree, SubElement

doc = ElementTree(file="memo.xml")

#find the "body" element by tag name
body = doc.getroot().findall("body")[0]

#Remove all child elements, text (and attributes)
body.clear()

#Insert new lead text
body.text = "This is a new memo.  Send responses to \n"
new_element = SubElement(body, 'a', {'href': 'mailto:memos@spam.com'})
new_element.text = "memos@spam.com"
new_element.tail = "\nThanks.\n"

#write out the modified XML
doc.write(sys.stdout)

I use getroot() to get the document (top-level) element and then the findall() method to find the body element, which I'll be manipulating. This latter method is similar to the get_elements_by_tag_name() functions I introduced in the last article. The method clear() eliminates any attributes, text, and child elements from an element. In effect it leaves me with a blank body element, which I can then repopulate by setting initial content. In this example I add content that includes an element, which I can do using the SubElement() factory function, which automatically appends the resulting element to a parent element. The tag name is a and I add attributes by passing in a dictionary. I complete the mutation by adding content to the new a element (as new_element.text) and to its parent, the body element (as new_element.tail). Finally, I write out the result, which looks like this:

<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">2003-02-01</date>
<to>The Art World</to>
<body>This is a new memo.  Send responses to
<a href="mailto:memos@spam.com">memos@spam.com</a>
Thanks.
</body></memo>
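
One detail of this output is worth a closer look: the closing body and memo tags run together because clear() also discards the element's own tail, which held the newline after the body element. A small sketch of the same mutation, assuming a current Python with xml.etree.ElementTree and a shortened inline memo (MEMO is a stand-in, not the original file), makes this visible. It uses find(), which returns the first match directly, in place of findall(...)[0].

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for memo.xml (assumption: kept inline for convenience)
MEMO = """<memo>
<to>The Art World</to>
<body>
Old text with a <strong>child</strong> element.
</body>
</memo>"""

root = ET.fromstring(MEMO)

# find() returns the first matching child, or None if there is none
body = root.find("body")
tail_before = body.tail      # the newline between </body> and </memo>

# clear() drops attributes, text, child elements and the tail
body.clear()

body.text = "This is a new memo.  Send responses to \n"
link = ET.SubElement(body, "a", {"href": "mailto:memos@spam.com"})
link.text = "memos@spam.com"
link.tail = "\nThanks.\n"

result = ET.tostring(root, encoding="unicode")
print(repr(tail_before), repr(body.tail))
print(result)
```

Restoring the saved value with body.tail = tail_before before writing would put the closing memo tag back on its own line.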

You can gain finer control over what is removed and added by using append(), insert() and remove(). You can set and remove attributes using the dictionary-like API for element objects. You can create comments by using the elementtree.ElementTree.Comment() factory function (although comments are not preserved when parsed from source documents). elementtree doesn't appear to offer any support for processing instructions. You can apply namespaces by using tags with Clark notation or by passing in an instance of the elementtree.ElementTree.QName class rather than a string for the tag.
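
These finer-grained operations can be sketched together as follows. The example assumes a current Python with xml.etree.ElementTree rather than the 1.1b3 release, and the element names, attribute names and values are made up for illustration.

```python
import xml.etree.ElementTree as ET

memo = ET.Element("memo")

to = ET.Element("to")
to.text = "The Art World"
memo.append(to)                          # add as the last child

title = ET.Element("title")
title.text = "Untitled"
memo.insert(0, title)                    # add at a given position

memo.insert(0, ET.Comment(" draft "))    # comment from the factory function

# Attributes: set() and get() for access, the attrib dict for deletion
to.set("form", "plain")
to.set("obsolete", "yes")
del to.attrib["obsolete"]

# QName builds a namespace-qualified tag from a URI and a local name
DC = "http://purl.org/dc/elements/1.1/"
desc = ET.SubElement(memo, ET.QName(DC, "Description"))
desc.text = "Example"

memo.remove(title)                       # remove one specific child

output = ET.tostring(memo, encoding="unicode")
print(output)
```

Note that remove() takes the child element object itself, not a tag name or an index.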

Yet another tool in the box

elementtree is fast, pythonic and very simple to use. It is very handy when all you want to do is get in, do some rapid and simple XML processing, and get out. It also includes some handy tools for HTML processing. The module elementtree.TidyTools provides a wrapper for the popular HTML Tidy utility, which, among other things, can take all sorts of poorly structured HTML and convert it into valid XHTML. This makes possible the elementtree.TidyXMLTreeBuilder module, which can parse HTML and return an elementtree instance of the resulting XHTML. If you do find elementtree useful, you may want to offer a donation to the effbot PayPal account linked from his downloads page.

Python-XML Happenings

It has been a busy month in the world of Python-XML development:

JAXML is a Python module to assist with generating XML, XHTML or HTML documents. It's maintained as part of Debian, but freely available on its own.

Daniel Veillard announced improvements to Python support in libxml (specifically, libxml2-2.5.0), including Python support for XmlTextReader, an API inspired by C# which combines the efficiency of SAX and the relative ease of DOM.

Robin Becker announced ReportLab Toolkit 1.17, a suite of tools for generating PDF reports, based on a series of XML technologies. See the ReportLab SourceForge page for more details.

PyXML 0.8.2 has been released. It now comes with Expat 1.95.6, which deals with many memory problems and other bugs in recent Expat releases. PyXML also supports more DOM Level 3 features in minidom (isWhitespaceInElementContent, schemaType, isId, and DOMImplementationSource), and adds various bugfixes. I advise all users of PyXML to upgrade as soon as possible.

Python Object Model for XML (POM) is part of PyNMS, a Python library for network management applications. POM is a Pythonic variation on the DOM which, interestingly, includes integrated validation based on DTD. PyNMS also includes other, smaller XML tools.

    

XElf 0.1 is a set of modules dedicated to XML processing for Python. It currently features a Python XOM implementation, including support for Namespaces and XMLBase. XOM is Elliotte Rusty Harold's XML object model for Java, intended to improve upon DOM and JDOM.

Remi Delon announced the release of the 0.8 version of CherryPy, a Python-based tool for developing dynamic websites. It includes hooks for XML-RPC and XSLT.

Pete Ohler announced a small validating XML parser for Python called xmlite but neglected to make the module available. He seems willing to share the module, so contact him if you are interested.