Simple XML Processing With elementtree
by Uche Ogbuji
Namespaces
elementtree supports XML namespaces using a rather different mechanism
from most XML processing APIs. The namespace is not maintained as a
separate instance variable but is built into the element or attribute name
using James Clark's notation. If you are not familiar with James Clark's
article "XML
Namespaces", you probably should be. Even if you are very familiar
with XML namespaces, you may need to explain them to others, and this
article is considered one of the best explanations of XML namespaces
available. It also introduces a notation for expressing a
namespace-qualified name. For example, the name of an element with
namespace http://www.w3.org/1999/XSL/Transform and local name
template is written
{http://www.w3.org/1999/XSL/Transform}template. The sample
document in Listing 4 (nsexample.xml) uses namespaces.
<ClientInfo xmlns="http://fourthought.com/timelog"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:Description>
Fourthought, Inc
</dc:Description>
<dc:Title>
Management Subcontracting
</dc:Title>
<MinIncrement>0.25</MinIncrement>
<InvoiceNumber>7777</InvoiceNumber>
</ClientInfo>
The following is the result of running Listing 3 against this file.
$ python listing3.py nsexample.xml
Element: {http://fourthought.com/timelog}ClientInfo
Children:
Text: '\n '
Element {http://purl.org/dc/elements/1.1/}Description
Text: '\n '
Element {http://purl.org/dc/elements/1.1/}Title
Text: '\n '
Element {http://fourthought.com/timelog}MinIncrement
Text: '\n '
Element {http://fourthought.com/timelog}InvoiceNumber
Text: '\n'
Element: {http://purl.org/dc/elements/1.1/}Description
Children:
Text: '\n Fourthought, Inc\n '
Element: {http://purl.org/dc/elements/1.1/}Title
Children:
Text: '\n Management Subcontracting\n '
Element: {http://fourthought.com/timelog}MinIncrement
Children:
Text: '0.25'
Element: {http://fourthought.com/timelog}InvoiceNumber
Children:
Text: '7777'
You can see how the namespace is built into the tag name variables. One problem with elementtree's handling of namespaces is that the prefixes used in the original XML document are not preserved, as they are with DOM and the like. This is mostly an inconvenience: prefixes are strictly inconsequential in XML namespaces. But it can be enough of an annoyance that you should be aware of it. For example, I took the document in Listing 4 and ran it through the round-trip script (which just uses elementtree to read in a document and print it right back out again). I got the following result:
<ns0:ClientInfo xmlns:ns0="http://fourthought.com/timelog">
<ns1:Description xmlns:ns1="http://purl.org/dc/elements/1.1/">
Fourthought, Inc
</ns1:Description>
<ns2:Title xmlns:ns2="http://purl.org/dc/elements/1.1/">
Management Subcontracting
</ns2:Title>
<ns0:MinIncrement>0.25</ns0:MinIncrement>
<ns0:InvoiceNumber>7777</ns0:InvoiceNumber>
</ns0:ClientInfo>
This is equivalent to the original document according to the rules of XML namespaces, but you can see the lexical differences, including the generic prefixes and the change in location of the namespace declarations.
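As a minimal sketch of both points, the following assumes xml.etree.ElementTree, the standard-library descendant of elementtree in later Python versions, rather than the standalone elementtree package used in this article's listings. It looks up elements by their Clark-notation names and checks that a round trip leaves the namespace-qualified names unchanged, whatever happens to the prefixes. The clark() helper is hypothetical, written only for this illustration.

```python
import xml.etree.ElementTree as ET

TIMELOG_NS = "http://fourthought.com/timelog"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The document from Listing 4, inlined so the sketch is self-contained.
NSEXAMPLE = """<ClientInfo xmlns="http://fourthought.com/timelog"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:Description>
    Fourthought, Inc
  </dc:Description>
  <dc:Title>
    Management Subcontracting
  </dc:Title>
  <MinIncrement>0.25</MinIncrement>
  <InvoiceNumber>7777</InvoiceNumber>
</ClientInfo>"""


def clark(namespace, local_name):
    """Hypothetical helper: build a name in Clark notation."""
    return "{%s}%s" % (namespace, local_name)


root = ET.fromstring(NSEXAMPLE)

# Lookups use the full Clark-notation name, not the prefix from the source.
title = root.find(clark(DC_NS, "Title"))
increment = root.find(clark(TIMELOG_NS, "MinIncrement"))

# Round trip: serialize and reparse. Prefixes may differ in the output,
# but the namespace-qualified tag names should be the same.
reparsed = ET.fromstring(ET.tostring(root))
original_tags = [el.tag for el in root.iter()]
round_trip_tags = [el.tag for el in reparsed.iter()]

print(title.text.strip())
print(increment.text)
print(original_tags == round_trip_tags)
```

Comparing the tag lists rather than the serialized text is deliberate: the serialized form is free to pick different prefixes, so only the qualified names are a reliable basis for comparison.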
Mutation
elementtree includes APIs for mutating documents. Suppose that I decide to change the body of the memo. Listing 5 is a script that does so.
Listing 5 (listing5.py): an example of mutation using elementtree
import sys
from elementtree.ElementTree import ElementTree, SubElement
doc = ElementTree(file="memo.xml")
#find the "body" element by tag name
body = doc.getroot().findall("body")[0]
#Remove all child elements, text (and attributes)
body.clear()
#Insert new lead text
body.text = "This is a new memo. Send responses to \n"
new_element = SubElement(body, 'a', {'href': 'mailto:memos@spam.com'})
new_element.text = "memos@spam.com"
new_element.tail = "\nThanks.\n"
#write out the modified XML
doc.write(sys.stdout)
I use getroot() to get the document (top-level) element
and then the findall() method to find the body
element, which I'll be manipulating. This latter method is similar to the
get_elements_by_tag_name() functions I introduced in the last
article. The method clear() eliminates any attributes, text,
and child elements from an element. In effect it leaves me with a blank
body element, which I can then repopulate by setting initial
content. In this example I add content that includes an element, which I
can do using the SubElement() factory function, which
automatically appends the resulting element to a parent element. The
tag name is a and I add attributes by passing in a
dictionary. I complete the mutation by adding content to the new a element
(as new_element.text) and to its parent, the body
element (as new_element.tail). Finally, I write out the
result, which looks like this:
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">2003-02-01</date>
<to>The Art World</to>
<body>This is a new memo. Send responses to
<a href="mailto:memos@spam.com">memos@spam.com</a>
Thanks.
</body></memo>
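The split between text and tail is the part of this model that most often surprises newcomers, so here is a minimal sketch that isolates it. It assumes xml.etree.ElementTree from the standard library of later Python versions, not the standalone elementtree package used in Listing 5, and it builds the body element from scratch instead of reading memo.xml.

```python
import xml.etree.ElementTree as ET

body = ET.Element("body")
# text: the character data between <body> and its first child element
body.text = "This is a new memo. Send responses to \n"

link = ET.SubElement(body, "a", {"href": "mailto:memos@spam.com"})
# text: the character data inside <a>...</a>
link.text = "memos@spam.com"
# tail: the character data after </a>, which still belongs to <body>
link.tail = "\nThanks.\n"

serialized = ET.tostring(body, encoding="unicode")
print(serialized)

# itertext() walks text and tail values in document order.
print(repr("".join(body.itertext())))
```

Note that the trailing "Thanks." is stored on the a element even though it is logically content of body; there is no separate text node object as there would be in DOM.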
You can gain finer control over what is removed and added by using
append(), insert() and remove().
You can set and remove attributes using the dictionary-like API for
element objects. You can create comments by using the
elementtree.ElementTree.Comment() factory function (although
comments are not preserved when parsed from source documents).
elementtree doesn't appear to offer any support for processing
instructions. You can apply namespaces by using tags with Clark notation
or by passing in an instance of the
elementtree.ElementTree.QName class rather than a string for
the tag.
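The operations in the previous paragraph can be sketched briefly. This example assumes xml.etree.ElementTree, the standard-library descendant of elementtree, so details may differ from the standalone package discussed here; the element names and the comment text are made up for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

memo = ET.Element("memo")
title = ET.SubElement(memo, "title")
title.text = "Draft"

# append() adds a child at the end; insert() places one at a given index.
body = ET.Element("body")
memo.append(body)
date = ET.Element("date")
memo.insert(1, date)

# Attributes: set() and get() on the element, plus the attrib dictionary.
date.set("form", "ISO-8601")
date.set("scratch", "temporary")
del date.attrib["scratch"]

# remove() deletes one specific child element.
memo.remove(title)

# Comment() is a factory function; its result is appended like an element.
memo.append(ET.Comment("generated for illustration"))

# A QName instance can stand in for a Clark-notation string as the tag.
creator = ET.SubElement(memo, ET.QName(DC_NS, "creator"))
creator.text = "Uche Ogbuji"

print(ET.tostring(memo, encoding="unicode"))
```

Reparsing the serialized result with the default parser drops the comment, which matches the caveat above about comments not surviving a parse.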
Yet another tool in the box
elementtree is fast, Pythonic and very simple to use. It is very handy
when all you want to do is get in, do some rapid and simple XML
processing, and get out. It also includes some handy tools for HTML
processing. The module elementtree.TidyTools provides a
wrapper for the popular HTML Tidy utility,
which, among other things, can take all sorts of poorly structured HTML
and convert it into valid XHTML. This makes possible the
elementtree.TidyXMLTreeBuilder module, which can parse HTML
and return an elementtree instance of the resulting XHTML. If you do find
elementtree useful, you may want to offer a donation to the effbot PayPal
account linked from his downloads
page.
Python-XML Happenings
It has been a busy month in the world of Python-XML development:
JAXML is a Python module to assist with generating XML, XHTML or HTML documents. It's maintained as part of Debian, but freely available on its own.
Daniel Veillard announced improvements to Python support in libxml (specifically, libxml2-2.5.0), including Python support for XmlTextReader, an API inspired by C# which combines the efficiency of SAX and the relative ease of DOM.
Robin Becker announced ReportLab Toolkit 1.17, a suite of tools for generating PDF reports, based on a series of XML technologies. See the ReportLab SourceForge page for more details.
PyXML 0.8.2 has been
released. It now comes with Expat 1.95.6, which deals with many memory
problems and other bugs in recent Expat releases. PyXML also supports
more DOM Level 3 features in minidom (isWhitespaceInElementContent, schemaType,
isId, DOMImplementationSource), and adds various bugfixes. I advise
all users of PyXML to upgrade as soon as possible.
Python Object Model for XML (POM) is part of PyNMS, a Python library for network management applications. POM is a Pythonic variation on the DOM which, interestingly, includes integrated validation based on DTD. PyNMS also includes other, smaller XML tools.
XElf 0.1 is a set of modules dedicated to XML processing for Python. It currently features a Python XOM implementation, including support for Namespaces and XMLBase. XOM is Elliotte Rusty Harold's XML object module for Java intended to improve upon DOM and JDOM.
Remi Delon announced the release of the 0.8 version of CherryPy, a Python-based tool for developing dynamic websites. It includes hooks for XML-RPC and XSLT.
Pete Ohler announced a small validating XML parser for Python called xmlite but neglected to make the module available. He seems willing to share the module, so contact him if you are interested.