A Python & XML Companion

December 11, 2002

Python & XML, written by Christopher Jones and Fred Drake, Jr. (O'Reilly and Associates, 2002), introduces Python programmers to XML processing. Drake is a core developer of PyXML, and Jones is an experienced developer in Python and XML, as well as an experienced author. As you would expect from such a team, this book is detailed and handy; however, I have a few notes, amplifications, and updates (the book was released in December of 2001) to offer -- all of which are distinct from the errata that the authors maintain. In this article I will provide updates, additional suggestions, and other material to serve as a companion to the book. You don't have to have the book in order to follow along.

Chapter 2 offers a brief introduction to XML, but if you are not very familiar with XML, I would recommend you find a general XML book such as Learning XML by Eric Ray (O'Reilly and Associates, 2001). The book offers no introduction to Python, so if you are not very familiar with Python, you will want to consult other books. Python & XML doesn't do anything too esoteric with Python, so a basic text such as Learning Python by Mark Lutz and David Ascher (O'Reilly and Associates, 1999) will serve you well.

More on DOM

Chapter 4 covers DOM APIs for Python, focusing on 4DOM and minidom, but it says little about performance. One important warning to add is that 4DOM is very slow. It goes out of its way to be DOM compliant, and this strict conformance, which is probably not important to most applications, comes at the cost of performance. I was one of the original authors of 4DOM, but I have long since come to the view that DOM is a specification to be taken in moderation. In trying to be platform and language neutral, it fails to take advantage of the particular strengths of host languages such as Python. I now prefer DOM-like systems that keep the familiar DOM method names and idioms in general, but are specialized for Python. Thus I do most of my work in Domlette, which comes with 4Suite and, because it is written in C, is very fast. Domlette falls back to a Python implementation on platforms where the C library cannot be compiled. I introduced Domlette in my recent article on 4Suite. PyRXP and Python/LTXML are faster than Domlette, but they deviate enough from DOM that I shall save discussion of them for future articles.

I understand that the authors' choices were largely motivated by the desire to minimize the amount of software readers need to install. If the book's examples used Domlette, PyRXP, LTXML, or the like, the reader would have to install more than just Python and PyXML. Nevertheless, many examples in the book use the old, deprecated system for creating 4DOM instances. For example, on page 86 I find the following code:

from xml.dom.ext.reader.Sax2 import FromXmlStream
doc = FromXmlStream(sys.stdin)

But the FromXmlStream function is deprecated (as are others like it). The preferred way to write this code is

import sys
from xml.dom.ext.reader import Sax2

# create the reader once; it can be reused for further documents
reader = Sax2.Reader()
doc = reader.fromStream(sys.stdin)

This reader object has the advantage of being reusable, with some savings in overhead.

As an illustration of the performance advantage of Domlette, I ported the example on page 94. An earlier example demonstrated the construction of a simple XML format for file system metadata, such as the following snippet:

<IndexedFiles>
<file name="/home/fdrake/memos/oreilly/pythonxml/ch07-edits-2001-09-01.txt">
        <userID>4104</userID>
        <groupID>4104</groupID>
        <size>1887</size>
        <lastAccessed>Sat Sep  1 03:09:07 2001</lastAccessed>
        <lastModified>Sat Sep  1 03:09:07 2001</lastModified>
        <created>Sat Sep  1 03:09:07 2001</created>
        <extension>.txt</extension>
        <contents>
          /home/fdrake/memos/oreilly/pythonxml/ch07-edits-2001-09-01.txt:
              ASCII English text
        </contents>
</file>
<!-- ... snip more entries like this ... -->
</IndexedFiles>

The example DOM code on page 94 removes elements named contents, extension, userID and groupID from the file elements. The script on page 94 is as follows:

#!/usr/bin/env python
import sys
from xml.dom.ext.reader.Sax2 import FromXmlStream
from xml.dom.ext import PrettyPrint

# get DOM object
doc = FromXmlStream(sys.stdin)

# remove unwanted nodes by traversing Node tree
for node1 in doc.childNodes:
  for node2 in node1.childNodes:
    node2.normalize()
    node3 = node2.firstChild
    while node3 is not None:
      next = node3.nextSibling
      name = node3.nodeName
      if name in ("contents", "extension", "userID", "groupID"):
        # remove unwanted nodes here via the parent
        node2.removeChild(node3)
      node3 = next

PrettyPrint(doc)

In order to do the same thing with Domlette, I created the doc node with the following code instead:

from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint
doc = NonvalidatingReader.parseStream(sys.stdin)

This code will issue a warning because I do not provide a URI for the input source I'm creating. See my 4Suite article for more details. You can safely ignore this warning for now.

This change also uses the Domlette version of PrettyPrint, which is, by default, the C-coded version. The original code using 4DOM takes 8.4 seconds to process a 24KB file on my 1.13GHz Pentium laptop running Red Hat 8.0 with 512MB RAM. The cDomlette version takes 1.4 seconds. The Domlette version is not only faster in this case, but scales better, too. When I quadrupled the size of the XML file to be processed, the 4DOM script took 23.7 seconds on the enlarged file (about 2.8 times as long), while the Domlette script took just 2.6 seconds (about 1.9 times as long). minidom is also much faster than 4DOM, though not as fast as Domlette. Using the minidom parser and the PrettyPrint from 4DOM, the times were 1.5 seconds for the smaller file and 4.7 seconds for the larger file.
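If you want to try the minidom variant without installing anything beyond Python itself, the following is a minimal sketch of the same pruning logic, not the script I timed. It assumes a current Python interpreter and substitutes minidom's own toxml method for the 4DOM PrettyPrint. The prune function, the UNWANTED tuple, and the trimmed sample input (including the file name example.txt) are names made up for illustration.

```python
from xml.dom import minidom

# Element names to strip from each file element (same list as the book's script).
UNWANTED = ("contents", "extension", "userID", "groupID")

# Hypothetical, shortened input in the IndexedFiles format shown earlier.
SAMPLE = """<IndexedFiles>
<file name="example.txt">
  <userID>4104</userID>
  <groupID>4104</groupID>
  <size>1887</size>
  <extension>.txt</extension>
  <contents>ASCII English text</contents>
</file>
</IndexedFiles>"""


def prune(doc):
    """Remove the unwanted children of every file element, in place."""
    for file_elem in doc.documentElement.childNodes:
        # Iterate over a copy, because removeChild() mutates childNodes.
        for child in list(file_elem.childNodes):
            if child.nodeName in UNWANTED:
                file_elem.removeChild(child)
    return doc


doc = prune(minidom.parseString(SAMPLE))
print(doc.documentElement.toxml())
```

Copying the child list before removing nodes plays the same role as saving nextSibling in the original script: the traversal never depends on a node that has just been detached.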

What's new in XPath

Chapter 5 covers XPath and also uses 4DOM. For XPath, Domlette has advantages beyond DOM-processing speed. For example, it supports document index properties, which speed up some XPath-specific operations; Domlette was in fact designed with XPath in mind. Beyond these advantages, you may have no choice but to use 4Suite for XPath and XSLT processing: PyXML's versions of 4XPath and 4XSLT are currently broken and are unlikely to be fixed until a future alignment of the 4Suite and PyXML code bases. Again, this is not the authors' fault. At the time they wrote the book, their examples worked fine with PyXML.

The example on page 115 is a typical example of the book's XPath usage:

import sys

from xml.dom.ext.reader import PyExpat
from xml.xpath import Evaluate

path0 = "ship/captain"  # all captain elements

reader = PyExpat.Reader()
dom = reader.fromStream(sys.stdin)

captain_elements = Evaluate(path0, dom.documentElement)
for element in captain_elements:
  print "Element: ", element

Notice that this time the proper 4DOM creation interface is used, although I suggest using Domlette instead. To get this working with current code, you need to install 4Suite and then import the XPath module from Ft.Xml.XPath rather than xml.xpath. Finally, to use Domlette, use the idiom I demonstrated earlier. After these adjustments, the example looks as follows:

import sys

from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Xml.XPath import Evaluate

path0 = "ship/captain"  # all captain elements

dom = NonvalidatingReader.parseStream(sys.stdin)

captain_elements = Evaluate(path0, dom.documentElement)
for element in captain_elements:
  print "Element: ", element
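If you do not have 4Suite at hand, you can still check what the relative path ship/captain selects when it is evaluated from the document element. The following is only a sketch. It assumes a current Python whose standard library includes xml.etree.ElementTree, whose findall method handles a small subset of XPath, and the sample document is invented for illustration rather than taken from the book.

```python
import xml.etree.ElementTree as ET

# Invented sample input; only the element names ship and captain
# come from the XPath expression in the example above.
SAMPLE = """<fleet>
  <ship><captain>A</captain></ship>
  <ship><captain>B</captain></ship>
  <captain>C</captain>
</fleet>"""

path0 = "ship/captain"  # captain children of ship children of the context node

root = ET.fromstring(SAMPLE)          # plays the role of dom.documentElement
captain_elements = root.findall(path0)

for element in captain_elements:
    print("Element:", element.text)
```

The captain element that sits directly under the root is not selected, because the path requires a ship step between the context node and each captain.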

What's new in XSLT

My comments on the Python/XSLT API examples in chapter 6 are similar to my comments in the last section. Using the XSLT implementation in 4Suite requires a change in imports from xml.xslt to Ft.Xml.Xslt. There are also a few minor changes to the API of the 4XSLT processor object. The following example comes from page 146:

from xml.xslt.Processor import Processor

xsltproc = Processor()

xsltproc.appendStylesheetUri("story.xsl")
html = xsltproc.runUri("story.xml")

After making the suggested adjustments, the example looks like this:

from Ft.Xml.Xslt.Processor import Processor
from Ft.Xml.InputSource import DefaultFactory

xsltproc = Processor()

xsltproc.appendStylesheet(DefaultFactory.fromUri("story.xsl"))
html = xsltproc.run(DefaultFactory.fromUri("story.xml"))

The main change, in addition to the imports, is that I specify the stylesheet and source file to the processor using input sources I create on the fly, rather than specifying the URIs directly. If any of this is unfamiliar to you, review my recent article on 4Suite.

Conclusion

There are other code examples in the book to which my suggestions might well apply. A few other things have changed since the book came out. For example, in chapter 9 the authors cover SOAP.py when discussing SOAP for Python, hinting that it is the only cross-platform option for Python. ZSI has since emerged as a strong alternative SOAP library in Python. There is also at least one warning to attach to my suggestion to use Domlette. The book makes frequent use of the DOM method getElementsByTagName, which Domlette, for the sake of simplicity, does not support. An equivalent function is very easy to write for yourself. I would suggest an implementation that uses Python generators, and I will present just such an implementation in a forthcoming article.
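Until then, a minimal sketch of the idea might look like the following. It is an illustration only: the function name get_elements_by_tag_name is my own, it ignores namespaces, and it is exercised here against minidom from the standard library. It assumes only that nodes expose the standard DOM attributes nodeType, nodeName, and childNodes, so whether it runs unchanged against Domlette is something to verify rather than take for granted.

```python
from xml.dom import Node, minidom


def get_elements_by_tag_name(node, name):
    """Yield descendant elements of node called name, in document order.

    As in DOM, the special name "*" matches every element.
    """
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            if name == "*" or child.nodeName == name:
                yield child
            # Descend into the element's own children before moving on,
            # which gives the preorder traversal that DOM specifies.
            for descendant in get_elements_by_tag_name(child, name):
                yield descendant


# Small made-up document to exercise the generator.
doc = minidom.parseString("<a><b id='1'/><c><b id='2'/></c><b id='3'/></a>")
found = [elem.getAttribute("id") for elem in get_elements_by_tag_name(doc, "b")]
print(found)
```

Because the function is a generator, a caller that needs only the first match can stop early without the rest of the tree being walked.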

Python & XML is a very handy book. The examples are especially clear, and in the latter part of the book the authors develop a sample application which uses much of the book's contents very practically. My main complaint is that it covers XML namespaces so sparsely. Namespaces are very hard to avoid these days in XML processing, regardless of what you may think of them. More examples and coverage of where namespaces intersect DOM, XPath, XSLT, and so on would help a lot of readers. I plan to write an article focusing on XML namespaces in Python processing.

Python & XML is the victim of recent flux in the state of Python-XML. I believe most of this flux has been progress, but it may confuse users nonetheless. I hope this article helps people use the book more effectively with current software releases.

Python-XML Happenings

Fredrik Lundh announced release 1.2 of ElementTree, a library for managing any sort of hierarchical Python objects in specialized data structures based on XML elements.

David Mertz announced the 1.0.5 release of gnosis XML tools, which includes bug fixes and updates to work with PyXML 0.8.x.

It's not an XML tool, but Cliff Wells's Python-DSV is worth mentioning. It's a tool for importing and exporting comma-separated values (CSV) files. I have come across many projects for interchanging XML with CSV. There is also Dave Cole's csv module, which is specialized for Microsoft tool exports.