XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.


Introducing PyXML

Introducing PyXML

September 25, 2002

As you probably noticed in the table of Python software in the previous column, many of the Python tools for XML processing are available in the PyXML package. There are also XML libraries built into Python, but these are pretty well covered in the official documentation.


One of the things I'm going to do in these columns is provide brief information on significant new happenings relevant to Python-XML development, including significant software releases.

PyXML 0.8.1 has been released. Major changes include updated DOM support and the disabling of the bundled XSLT library from the default install. More on PyXML below.

The Fredrik "effbot" Lundh has kicked off a project to develop a graphical RSS newsreader in Python. Also, Mark Nottingham developed RSS.py, a Python module for processing RSS 0.9.x and 1.0 feeds. Mark is the author of an excellent RSS tutorial.

Eric Freese has released SemanText, a program for developing and managing semantic networks using XML Topic Maps. It can also build such topic maps contextually from general XML formats. It comes with a GUI for knowledge-base management.

There has been some talk on the XML-SIG about developing a general type library architecture in Python useful for plugging data types into schema, Web services and other projects.

Finally, in the previous column, I mentioned that the cDomlette XML parser supports XInclude and XML Base, but I neglected to mention that libxml's parser does as well. libxml also supports XML Catalogs. xmlproc also supports XML catalogs. Recent builds of 4Suite add XML and OASIS catalog support to cDomlette.

Setting up PyXML

PyXML is available as source, Windows installer, and RPM. Python itself is the only requirement. Any version later than 2.0 will do, I recommend 2.2.1 or later. If your operating environment has a C compiler, it's easy to build from source. Download the archive, unpack it to a suitable location and run

python setup.py install

Which is the usual way to set up Python programs. It is automatically installed to the package directory of the python executable that is used to run the install command. You will see a lot of output as it compiles and copies files. If you'd rather suppress this, use "python setup.py --quiet install".

PyXML overlays the xml module that comes with Python. It replaces all the contents of the Python module with its own versions. This is usually OK because the two code bases are the same, except that PyXML has usually incorporated more features and bug fixes. It's also possible that PyXML has introduced new bugs as well. If you ever need to uninstall PyXML, just remove the directory "_xmlplus" in the "site-packages" directory of your Python installation. The original Python xml module will be left untouched.


The SAX support in PyXML is basically the same code as the built-in SAX library in Python. The main added feature is a really clever module, xml.sax.writers, which provides facilities for re-serializing SAX events to XML and even SGML. This is especially useful as the end point of a SAX filter chain, and I'll cover it in more detail in a future article on Python SAX filters. PyXML also adds the xmlproc validating XML parser which can be selected using the make_parser function. Base Python does not support validation.


PyXML builds a good deal of machinery over the skeleton of the DOM Node class that comes with Python. First of all, there is 4DOM, a big DOM implementation which tries to be more faithful to the DOM spec than to be naturally Pythonic. It supports DOM Level 2, both XML (core) and HTML modules, including events, ranges, and traversal. PyXML also provides more frequently updated versions of minidom and pulldom. Because there are many DOM implementations available for Python, PyXML also adds a system in xml.dom.domreg for choosing between DOM implementations according to desired features. Only minidom and 4DOM are currently registered, so it is not yet generally useful, but this should change. There is also a lot of material in the xml.dom.ext module, which is set aside, generally, for DOM extensions specific to PyXML. These extensions include mechanisms for reading and writing serialized XML that predate the DOM Level 3 facilities for this.

Creating 4DOM nodes is a matter of using the modules in xml.dom.ext.readers. The general pattern is to create a reader object, which can then be used to parse from multiple sources. For parsing XML, one would usually use xml.dom.ext.readers.Sax2, and for HTML, xml.dom.ext.readers.HtmlLib. Listing 1, demonstrates reading XML.

Listing 1: Creating a 4DOM node by reading XML from a string

from xml.dom.ext.reader import Sax2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>

#Create an XML reader object
reader = Sax2.Reader()
#Create a 4DOM document node parsed from XML in a string
doc_node = reader.fromString(DOC)

#You can execute regular DOM operations on the document node
verse_element = doc_node.documentElement
#And you can even use "Pythonic" shortcuts for things like
#Node lists and named node maps
#The first child of the verse element is a white space text node
#The second is the attribution element
attribution_element = verse_element.childNodes[1]
#attribution_string becomes "Christopher Okibgo"
attribution_string = attribution_element.firstChild.data

Listing 2 demonstrates reading HTML. You need to be connected to the Internet to run it successfully.

Listing 2: Creating a 4DOM HTML node by reading from a URL

from xml.dom.ext.reader import HtmlLib

#Create an HTML reader object
reader = HtmlLib.Reader()
#Create a 4DOM document node parsed from HTML at a URL
doc_node = reader.fromUri("http://www.python.org")

#Get the title of the HTML document
title_elem = doc_node.documentElement.getElementsByTagName("TITLE")[0]
#title_string becomes "Python Language Website"
title_string = title_elem.firstChild.data

The HTML parser is pretty forgiving, but not as much so as most Web browsers. Some non-standard HTML will cause errors.

One can also output XML using the xml.dom.ext.Print and xml.dom.ext.PrettyPrint functions. One can output proper XHTML from appropriate XML and HTML DOM nodes using xml.dom.ext.XHtmlPrint and xml.dom.ext.XHtmlPrettyPrint. Pure white space nodes between elements can be removed using xml.dom.ext.StripXml and xml.dom.ext.StripHtml. There are other small utility functions in this module. Listing 3 is a continuation of listing 1 which takes the document node that was read, strips it of white space, and then re-serializes the result to XML.

Pages: 1, 2

Next Pagearrow