A Python & XML Companion
December 11, 2002
Python & XML, written by Christopher Jones and Fred Drake, Jr. (O'Reilly and Associates, 2002), introduces Python programmers to XML processing. Drake is a core developer of PyXML, and Jones is an experienced developer in Python and XML, as well as an experienced author. As you would expect from such a team, this book is detailed and handy; however, I have a few notes, amplifications, and updates (the book was released in December of 2001) to offer -- all of which are distinct from the errata that the authors maintain. In this article I will provide updates, additional suggestions, and other material to serve as a companion to the book. You don't have to have the book in order to follow along.
Chapter 2 offers a brief introduction to XML, but if you are not very familiar with XML, I would recommend you find a general XML book such as Learning XML by Eric Ray (O'Reilly and Associates, 2001). The book offers no introduction to Python, so if you are not very familiar with Python, you will want to consult other books. Python & XML doesn't do anything too esoteric with Python, so a basic text such as Learning Python by Mark Lutz and David Ascher (O'Reilly and Associates, 1999) will serve you well.
More on DOM
Chapter 4 covers DOM APIs for Python, focusing on 4DOM and minidom. It says little about performance. In particular, one important warning to add about 4DOM is that it is very slow. It goes out of its way to be DOM compliant, and this strict conformance compromises performance, though strict compliance is probably not important to most applications. I was one of the original authors of 4DOM, but I have long since come to the view that DOM is a specification to be taken in moderation. In trying to be platform and language neutral, it fails to take advantage of particular strengths of host languages like Python. I now prefer DOM-like systems that use familiar DOM method names and idioms in general, but which are specialized for Python. Thus I do most of my work in Domlette, which comes with 4Suite and is very fast because it's written in C. Domlette falls back to a Python implementation on platforms where the C library cannot be compiled. I introduced Domlette in my recent article on 4Suite. PyRXP and Python/LTXML are faster than Domlette, but they deviate enough from DOM that I shall save discussion of them for future articles.
I understand that the authors' choices were largely motivated by the desire to minimize the amount of software readers need to install. If the book's examples used Domlette, PyRXP, LTXML, or the like, the reader would have to install more than just Python and PyXML. Nevertheless, many examples in the book use the old, deprecated system for creating 4DOM instances. For example, on page 86 I find the following code:
    from xml.dom.ext.reader.Sax2 import FromXmlStream
    doc = FromXmlStream(sys.stdin)
The FromXmlStream function is deprecated (as are others like it). The preferred way to write this code is:
    from xml.dom.ext.reader import Sax2
    reader = Sax2.Reader()
    doc = reader.fromStream(sys.stdin)
The reader object has the advantage of being reusable, with some resulting savings.
As an illustration of the performance advantage of Domlette, I ported the example on page 94. An earlier example demonstrated the construction of a simple XML format for file system metadata, such as the following snippet:
    <IndexedFiles>
      <file name="/home/fdrake/memos/oreilly/pythonxml/ch07-edits-2001-09-01.txt">
        <userID>4104</userID>
        <groupID>4104</groupID>
        <size>1887</size>
        <lastAccessed>Sat Sep 1 03:09:07 2001</lastAccessed>
        <lastModified>Sat Sep 1 03:09:07 2001</lastModified>
        <created>Sat Sep 1 03:09:07 2001</created>
        <extension>.txt</extension>
        <contents>
        /home/fdrake/memos/oreilly/pythonxml/ch07-edits-2001-09-01.txt: ASCII English text
        </contents>
      </file>
      <!-- ... snip more entries like this ... -->
    </IndexedFiles>
The example DOM code on page 94 removes the elements named contents, extension, userID, and groupID from the file elements. The script on page 94 is as follows:
    #!/usr/bin/env python
    import sys
    from xml.dom.ext.reader.Sax2 import FromXmlStream
    from xml.dom.ext import PrettyPrint

    # get DOM object
    doc = FromXmlStream(sys.stdin)

    # remove unwanted nodes by traversing Node tree
    for node1 in doc.childNodes:
        for node2 in node1.childNodes:
            node2.normalize()
            node3 = node2.firstChild
            while node3 is not None:
                next = node3.nextSibling
                name = node3.nodeName
                if name in ("contents", "extension", "userID", "groupID"):
                    # remove unwanted nodes here via the parent
                    node2.removeChild(node3)
                node3 = next

    PrettyPrint(doc)
In order to do the same thing with Domlette, I created the doc node with the following code instead:
    from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint
    doc = NonvalidatingReader.parseStream(sys.stdin)
This code will issue a warning because I do not provide a URI for the input source I'm creating. See my 4Suite article for more details. You can safely ignore these warnings for now.
This change also uses the Domlette version of PrettyPrint, which is, by default, the C-coded version. The original code using 4DOM takes 8.4 seconds to process a 24KB file on my 1.13GHz Pentium laptop running Red Hat 8.0 with 512MB RAM. It takes [...] seconds to run the cDomlette version. The Domlette version is not only faster in this case, but scales better, too. I quadrupled the size of the XML file to be processed. The 4DOM script took 23.7 seconds to run on the enlarged file; the Domlette script took just [...] seconds. minidom is also much faster than 4DOM, though not as much so as Domlette. Using the minidom parser and the PrettyPrint from 4DOM, the times were 1.5 seconds for the smaller file and 4.7 seconds for the larger file.
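For readers who want to experiment with nothing but the standard library, the following is a minimal sketch of the same element-stripping logic using minidom alone. It is not the script that was timed above: it parses a small inline sample rather than standard input, and it serializes with minidom's own toxml method rather than the PrettyPrint from 4DOM. The names SAMPLE, UNWANTED, and strip_unwanted, and the shortened memo.txt file name, are made up for this illustration.

```python
from xml.dom import minidom

# Element names to drop, taken from the script on page 94.
UNWANTED = ("contents", "extension", "userID", "groupID")

def strip_unwanted(doc, unwanted=UNWANTED):
    """Remove the unwanted child elements of each file element, in place."""
    for file_node in doc.documentElement.childNodes:
        # Copy the live childNodes list so that removal is safe while looping.
        # Text nodes between file elements simply have no children.
        for child in list(file_node.childNodes):
            if child.nodeName in unwanted:
                file_node.removeChild(child)
    return doc

# Hypothetical, shortened sample in the IndexedFiles format shown earlier.
SAMPLE = """<IndexedFiles>
  <file name="memo.txt">
    <userID>4104</userID>
    <groupID>4104</groupID>
    <size>1887</size>
    <extension>.txt</extension>
    <contents>memo.txt: ASCII English text</contents>
  </file>
</IndexedFiles>"""

doc = strip_unwanted(minidom.parseString(SAMPLE))
result = doc.toxml()
```

To process a real file, pass a file name or an open file object to minidom.parse in place of the parseString call.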
What's new in XPath
Chapter 5 covers XPath and also uses 4DOM. In the case of XPath, Domlette has advantages in addition to DOM-processing speed. For example, it supports document index properties which speed up some XPath-specific operations. Domlette was actually designed with XPath in mind. But besides this, you may have no choice but to use 4Suite for XPath and XSLT processing: PyXML's versions of 4XPath and 4XSLT are currently broken and are unlikely to be fixed until a future alignment of the 4Suite and PyXML code bases. Again, this is not the authors' fault. At the time they wrote the book, their examples worked fine with PyXML.
The example on page 115 is a typical example of the book's XPath usage:
    import sys
    from xml.dom.ext.reader import PyExpat
    from xml.xpath import Evaluate

    path0 = "ship/captain"  # all captain elements

    reader = PyExpat.Reader()
    dom = reader.fromStream(sys.stdin)

    captain_elements = Evaluate(path0, dom.documentElement)
    for element in captain_elements:
        print "Element: ", element
Notice that this time the proper 4DOM creation interface is used, although I suggest Domlette instead. In order to get this working with current code you need to install 4Suite and then change the XPath module for import from xml.xpath to Ft.Xml.XPath. Finally, to use Domlette, use the idiom I demonstrated earlier. After these adjustments, the example looks as follows:
    import sys
    from Ft.Xml.Domlette import NonvalidatingReader
    from Ft.Xml.XPath import Evaluate

    path0 = "ship/captain"  # all captain elements

    dom = NonvalidatingReader.parseStream(sys.stdin)

    captain_elements = Evaluate(path0, dom.documentElement)
    for element in captain_elements:
        print "Element: ", element
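One detail worth spelling out is that the path "ship/captain" is evaluated relative to the context node passed to Evaluate, here the document element. It therefore selects captain children of ship children of the document element. The sketch below imitates that behavior for paths made only of child element names, using minidom from the standard library. It is not an XPath implementation and has nothing to do with 4XPath internals; the select_children helper and the sample document (including the fleet element and the captain names) are made up for this illustration.

```python
from xml.dom import minidom

def select_children(context, path):
    """Follow a path made only of child element names, e.g. "ship/captain"."""
    nodes = [context]
    for name in path.split("/"):
        # Each step keeps the child elements with the given name.
        nodes = [child
                 for node in nodes
                 for child in node.childNodes
                 if child.nodeType == child.ELEMENT_NODE
                 and child.nodeName == name]
    return nodes

# Hypothetical sample document for the illustration.
SAMPLE = """<fleet>
  <ship><captain>Ahab</captain></ship>
  <ship><captain>Nemo</captain></ship>
</fleet>"""

doc = minidom.parseString(SAMPLE)
captains = select_children(doc.documentElement, "ship/captain")
names = [captain.firstChild.data for captain in captains]
```

With the document element as context, the path "captain" on its own selects nothing in this sample, because the captain elements are grandchildren rather than children of the context node.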
What's new in XSLT
My comments on the Python/XSLT API examples in chapter 6 are similar to my comments in the last section. Using the XSLT implementation in 4Suite requires a change in imports from xml.xslt to Ft.Xml.Xslt. There are also a few minor changes to the API of the 4XSLT processor object. The following example comes from page 146:
    from xml.xslt.Processor import Processor

    xsltproc = Processor()
    xsltproc.appendStylesheetUri("story.xsl")
    html = xsltproc.runUri("story.xml")

After making the suggested adjustments, the example looks like this:

    from Ft.Xml.Xslt.Processor import Processor
    from Ft.Xml.InputSource import DefaultFactory

    xsltproc = Processor()
    xsltproc.appendStylesheet(DefaultFactory.fromUri("story.xsl"))
    html = xsltproc.run(DefaultFactory.fromUri("story.xml"))
The main change, in addition to the imports, is that I specify the stylesheet and source file to the processor using input sources I create on the fly, rather than specifying the URIs directly. If any of this is unfamiliar to you, review my recent article on 4Suite.
Also in Python and XML
There are other code examples in the book to which the suggestions I have made might apply. A few other things have changed since the book came out. For example, in chapter [...] the authors cover SOAP.py when discussing SOAP for Python, hinting that it is the cross-platform option for Python. ZSI has since emerged as a strong alternative for SOAP in Python. There is also at least one warning to attach to my suggestions to use Domlette. The book makes frequent use of the DOM method getElementsByTagName, which Domlette omits for simplicity. An equivalent function is very easy to write yourself. I would suggest an implementation that uses Python generators, and I will present just such an implementation in a forthcoming article.
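In the meantime, here is a rough sketch of what a generator-based equivalent might look like. It is not the implementation promised for the forthcoming article. It assumes only that nodes expose the standard DOM childNodes, nodeType, and nodeName properties, and it is exercised here against minidom rather than Domlette. The function name get_elements_by_tag_name and the sample document are made up for this illustration.

```python
from xml.dom import Node, minidom

def get_elements_by_tag_name(node, name):
    """Yield descendant elements of node named name, in document order.

    As with the DOM method, the name "*" matches every element.
    """
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            if name == "*" or child.nodeName == name:
                yield child
            # Recurse into the subtree, re-yielding in a plain loop so the
            # code does not depend on newer generator syntax.
            for match in get_elements_by_tag_name(child, name):
                yield match

# Hypothetical sample document for the illustration.
doc = minidom.parseString("<a><b/><c><b/><d/></c></a>")
found = list(get_elements_by_tag_name(doc, "b"))
```

Because the function is a generator, a caller that only needs the first match can stop early without walking the rest of the tree.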
Python & XML is a very handy book. The examples are especially clear, and in the latter part of the book the authors develop a sample application which uses much of the book's contents very practically. My main complaint is that it covers XML namespaces so sparsely. Namespaces are very hard to avoid these days in XML processing, regardless of what you may think of them. More examples and coverage of where namespaces intersect DOM, XPath, XSLT, and so on would help a lot of readers. I plan to write an article focusing on XML namespaces in Python processing.
Python & XML is the victim of recent flux in the state of Python-XML. I believe most of this flux has been progress, but it may confuse users nonetheless. I hope this article helps people use the book more effectively with current software releases.
It's not an XML tool, but Cliff Wells's Python-DSV is worth mentioning. It's a tool for importing and exporting comma-separated values (CSV) files. I have come across many projects for interchanging XML with CSV. There is also Dave Cole's csv module, which is specialized for Microsoft tool exports.