XML.com: XML From the Inside Out

Simple XML Processing With elementtree

February 12, 2003

Fredrik Lundh, well known in Python circles as "the effbot", has been an important contributor to Python and to PyXML. He has also developed a variety of useful tools, many of which involve Python and XML. One of these is elementtree, a collection of lightweight utilities for XML processing. elementtree is centered around a data structure for representing XML. As its name implies, this data structure is a hierarchy of objects, each of which represents an XML element. The focus is squarely on elements: there is no zoo of node types. Element objects themselves act as Python dictionaries of the XML attributes and Python lists of the element children. Text content is represented as simple data members on element instances. elementtree is about as pythonic as it gets, offering a fresh perspective on Python-XML processing, especially after the DOM explorations of my previous columns.
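As a rough illustration of that data model, here is a minimal sketch. It assumes the `xml.etree.ElementTree` module that later Python releases ship in the standard library, not the standalone elementtree package discussed in this article, and the `memo` and `date` elements are made-up sample values.

```python
# Sketch only: xml.etree.ElementTree from the standard library is an
# assumption here; the article itself uses the standalone elementtree package.
import xml.etree.ElementTree as ET

# Build a tiny tree by hand (sample values, not taken from a real file).
date = ET.Element("date", {"form": "ISO-8601"})
date.text = "2003-02-01"          # text content is a plain data member

memo = ET.Element("memo")
memo.append(date)                 # children are managed like list items

# Attributes: dictionary-style access on the element itself.
form = date.get("form")
attribute_names = list(date.keys())

# Children: sequence-style access on the parent element.
first_child = memo[0]
child_count = len(memo)

print(form, attribute_names, first_child.tag, child_count, first_child.text)
```

There is no separate attribute node or text node to navigate: the attribute lookup, the child indexing, and the `text` member all hang off the same element object.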

elementtree is very easy to set up. I downloaded version 1.1b3 (you can always find the latest version on the effbot download page). You need Python 2.1 (or newer); I used 2.2.1. Installation was a simple matter of unzipping the package and invoking distutils:

python setup.py install

You must have pyexpat in order to use elementtree, either as part of your Python installation itself or by installing PyXML.
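One quick way to confirm that the expat bindings are present is to try importing them. The sketch below is written for current Python releases (an assumption, since the article targets Python 2.1 and later) and relies only on the `xml.parsers.expat` module and its `EXPAT_VERSION` string.

```python
# Check whether the expat bindings can be imported.
try:
    from xml.parsers import expat
except ImportError:
    expat = None

if expat is None:
    print("pyexpat is not available in this Python installation")
else:
    # EXPAT_VERSION is a string naming the underlying expat library release.
    print("expat version:", expat.EXPAT_VERSION)
```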

XML made even easier

Listing 1 is a sample document (memo.xml) that I will use in this article.

Listing 1 (memo.xml): a sample XML file

<?xml version='1.0' encoding='utf-8'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">2003-02-01</date>
<to>The Art World</to>
<body>
It appears that with the unfortunate recent United States
Supreme Court ruling in <cite>Eldred vs. Ashcroft</cite>, the
basis for creative expression, and the general gain of society
in such expression is <strong>forfeit</strong> to crude commercial
interest.
</body>
</memo>  

The primary benefit of elementtree is simplicity. Listing 2 reads the XML document in Listing 1 into the elementtree data structure and then writes it back out as XML.

Listing 2 (listing2.py): XML round trip using elementtree
import sys
#Most common APIs are available on the ElementTree class
from elementtree.ElementTree import ElementTree
#create an ElementTree instance from an XML file
doc = ElementTree(file="memo.xml")
#write out XML from the ElementTree instance
doc.write(sys.stdout)  
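The same round trip also works without touching the file system. The following sketch assumes the standard-library `xml.etree.ElementTree` module of later Python releases in place of the standalone package, and the `SAMPLE` string is a cut-down stand-in for memo.xml, not the real file.

```python
# Sketch only: standard-library ElementTree and an in-memory sample document.
import io
import xml.etree.ElementTree as ET

SAMPLE = '<memo><to>The Art World</to><date form="ISO-8601">2003-02-01</date></memo>'

# Parse from a file-like object, much as Listing 2 parses from a file name.
doc = ET.ElementTree(file=io.StringIO(SAMPLE))

# Serialize back to text; encoding="unicode" produces str rather than bytes.
buffer = io.StringIO()
doc.write(buffer, encoding="unicode")
round_tripped = buffer.getvalue()

print(round_tripped)
```

Re-parsing `round_tripped` should give back the same elements and attributes, which is a cheap sanity check that nothing was lost on the way through the tree.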

elementtree is fast and lightweight. I tested it with Elliotte Rusty Harold's 1998 baseball stats, Hamlet, and the Old Testament from John Bosak's Revised XML Document Collections. These are the same large files I used in the last article to explore the performance of various iteration techniques. In very crude benchmarks that did nothing but parse the files, elementtree was about 30% slower than cDomlette, but it also used about 30% less memory. That is very impressive for an XML data structure written in pure Python (the parser is a different matter: it uses pyexpat, which is written in C).
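For readers who want to try a similarly crude measurement, here is one possible sketch. It is not the procedure behind the numbers above: it assumes the standard-library `xml.etree.ElementTree`, `time.perf_counter`, and `tracemalloc` modules of current Python releases, and it parses a synthetic document, with the hypothetical helper `crude_benchmark` and the `synthetic` string invented for illustration.

```python
# Sketch only: a crude parse-time and peak-memory measurement.
import io
import time
import tracemalloc
import xml.etree.ElementTree as ET

def crude_benchmark(xml_text):
    """Return (seconds, peak_bytes, element_count) for one parse of xml_text."""
    tracemalloc.start()
    start = time.perf_counter()
    tree = ET.parse(io.StringIO(xml_text))
    elapsed = time.perf_counter() - start
    # get_traced_memory() returns (current, peak) in bytes.
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    count = sum(1 for _ in tree.iter())
    return elapsed, peak, count

# Synthetic stand-in for a large file: 10,000 records of three elements each.
record = "<player><name>x</name><hits>1</hits></player>"
synthetic = "<stats>" + record * 10000 + "</stats>"

seconds, peak_bytes, elements = crude_benchmark(synthetic)
print("parsed %d elements in %.3fs, peak %d bytes" % (elements, seconds, peak_bytes))
```

Memory tracing itself slows the parse, so timing and memory are better measured in separate runs when the figures matter.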

elementtree also offers access to the nodes of the XML tree through specialized Python objects, which are not based on DOM. elementtree uses its freedom from DOM to adopt the most pythonic idioms available. Iterators, in particular, are the core mechanism for navigating ElementTree instances. As an example, Listing 3 displays information about all elements in the example document.

Listing 3 (listing3.py): displaying the content of the XML document
import sys
from elementtree.ElementTree import ElementTree
tree = ElementTree(file=sys.argv[1])
#Create an iterator over every element in the tree
elements = tree.getiterator()
#Iterate
for element in elements:
    #First the element tag name
    print "Element:", element.tag
    #Next the attributes (available on the instance itself using
    #the Python dictionary protocol)
    if element.keys():
        print "\tAttributes:"
        for name, value in element.items():
            print "\t\tName: '%s', Value: '%s'"%(name, value)
    #Next the child elements and text
    print "\tChildren:"
    #Text that precedes all child elements (may be None)
    if element.text:
        text = element.text
        text = len(text) > 40 and text[:40] + "..." or text
        print "\t\tText:", repr(text)
    if element.getchildren():
        #Can also use: "for child in element.getchildren():"
        for child in element:
            #Child element tag name
            print "\t\tElement", child.tag
            #The "tail" on each child element consists of the text
            #that comes after it in the parent element content, but
            #before its next sibling.
            if child.tail:
                text = child.tail
                text = len(text) > 40 and text[:40] + "..." or text
                print "\t\tText:", repr(text)  

This gives you a quick look at the very pythonic read API for elementtree objects. Each element object supports the Python dictionary protocol for access to its attributes and the sequence protocol for access to its children. The main quirk in this API is how mixed content is handled. Each element directly stores only the portion of its text content that precedes any child elements (text). It leaves the storage of all its other text to its children: each child element stores any text that follows it within its parent (tail). The comments in the elementtree code are actually misleading on this point; I suspect they are out of date. There are a few other points of confusion in the comments as well, so do be careful. Running the script in Listing 3 against the document in Listing 1, I get:

$ python listing3.py memo.xml
Element: memo
        Children:
                Text: '\n'
                Element title
                Text: '\n'
                Element date
                Text: '\n'
                Element to
                Text: '\n'
                Element body
                Text: '\n'
Element: title
        Children:
                Text: 'With Usura Hath no Man a House of Good S...'
Element: date
        Attributes:
                Name: 'form', Value: 'ISO-8601'
        Children:
                Text: '2003-02-01'
Element: to
        Children:
                Text: 'The Art World'
Element: body
        Children:
                Text: '\nIt appears that with the unfortunate re...'
                Element cite
                Text: ', the\nbasis for creative expression, and...'
                Element strong
                Text: ' to crude commercial\ninterest.\n'
Element: cite
        Children:
                Text: 'Eldred vs. Ashcroft'
Element: strong
        Children:
                Text: 'forfeit'  
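Because mixed content is split between text and tail in this way, recovering the complete character data of an element means stitching the pieces back together in document order. The sketch below shows one way to do that. It assumes the standard-library `xml.etree.ElementTree` module of later Python releases, the helper name `full_text` is hypothetical, and the `body` fragment is a shortened stand-in for the one in Listing 1.

```python
# Sketch only: rebuild an element's full text from its text and tail pieces.
import xml.etree.ElementTree as ET

def full_text(element):
    """Concatenate all character data inside element, in document order."""
    parts = [element.text or ""]          # text before the first child
    for child in element:
        parts.append(full_text(child))    # everything inside the child
        parts.append(child.tail or "")    # text after the child, held by the child
    return "".join(parts)

body = ET.fromstring(
    "<body>ruling in <cite>Eldred vs. Ashcroft</cite>, expression is "
    "<strong>forfeit</strong> to crude commercial interest.</body>"
)

print(repr(body.text))       # text that precedes the first child
print(repr(body[0].tail))    # text that follows <cite>, stored on <cite>
print(full_text(body))
```

The recursion mirrors the storage scheme: an element contributes its own text, then each child contributes its content followed by its tail.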
