
The State of the Python-XML Art, 2003

September 10, 2003

Uche Ogbuji

It's now been a year since the first Python and XML article, in which I made a broad survey of all things Python-XML. Since then I have picked various techniques, packages, and other useful resources to examine in detail. I haven't, of course, been able to look at every interesting or useful package or technique, but stay tuned for the next year of articles. This month I update the overall Python-XML survey to encompass notable developments over the past year, many of which I've mentioned in passing in prior articles. I hope this article serves as a ready and rapid index to folks who want to process XML using (in my opinion) the best language available for the purpose.

Books

I mentioned the books which substantially discussed Python-XML in my first article. There haven't been many additions to that list this year. Chapter five of Dive Into Python by Mark Pilgrim covers XML processing. This book is a freely available electronic text and a very valuable resource overall. Text Processing in Python, by David Mertz (Addison Wesley), covers XML only briefly, but the things it does discuss in depth will almost inevitably come in handy if you do a lot of XML processing. If you bought Python & XML, by Christopher A. Jones and Fred L. Drake, Jr. (O'Reilly and Associates), remember that I presented a companion and update to that book in an earlier article.

Software

The following table lists the currently available Python-XML software that I judge to be significant. It is not a list of every bit of software in Python that has anything to do with XML; for example, I do not list pyglade, which is software for generating user interfaces in the GNOME desktop system for Unix. The user interface specifications in question are in XML, but this is not really enough to call it an XML processing tool for Python. The criteria for inclusion are, first, whether a tool implements a technology or set of technologies strongly associated with XML; second, whether the tool does so in a way that is useful for any arbitrary XML file I may want to process.

I've organized the table according to the areas of XML technology. This will give newcomers to Python a quick look at the coverage of XML technologies in Python and should serve as a quick guide to where to go to address any particular XML processing need. I rate the vitality of each listed project as either "weak", "steady", or "strong" according to the recent visible activity on each project: mailing list traffic, releases, articles, other projects that use it, etc.

A year ago I reported 34 Python-XML projects. This year I have added 24, each of which is marked with a *. Most of the additions point to the impressive activity that continues on the Python-XML front.

XML processing software for Python
name | description | vitality

XML parsing engines

Parsing engines are characterized by offering unique low-level parsers. Many offer other capabilities, but this section mainly documents the various low-level XML parsers for Python, on which other packages then build.

PyLTXML | PyLTXML is a Python extension wrapping the LTXML parser. It supports DTD validation. | steady
cDomlette | cDomlette is part of 4Suite. It is a fast C-based DOM implementation with a Python API, and includes a wrapper of the expat parser. It supports DTD validation. It also supports XInclude, XML Base and XML entity catalogs. | strong
libxml/python | This Python extension module is a wrapper for libxml. It supports DTD validation, XInclude (plus XPointer), XML Base and XML Catalogs. | strong
pyRXP | pyRXP is a Python extension wrapping the RXP XML parser. It supports DTD validation. | steady
pyexpat | Pyexpat is part of PyXML and is a wrapper of the expat parser. It supports DTD validation. | strong
qp_xml | qp_xml is part of PyXML. It is a simple parser written entirely in Python with no validation support. | steady
xmlproc | xmlproc is part of PyXML. It is a parser written entirely in Python. It supports DTD validation and XML catalogs. It provides API access to parsed DTD constructs. | steady
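
To give a feel for what "low-level" means for the parsers in this section, here is a minimal sketch of callback-driven parsing with expat. It assumes the `xml.parsers.expat` module as bundled with a present-day Python 3 rather than the 2003-era PyXML distribution; `SAMPLE` and `collect_events` are illustrative names of my own, not part of any package listed above.

```python
import xml.parsers.expat

# Illustrative input document (not taken from any listed project).
SAMPLE = '<catalog><tool name="pyexpat">wrapper</tool></catalog>'


def collect_events(xml_text):
    """Parse xml_text with expat and return the callbacks fired, in order."""
    events = []

    def start(name, attrs):
        events.append(("start", name, dict(attrs)))

    def end(name):
        events.append(("end", name))

    def text(data):
        # Skip whitespace-only character data between elements.
        if data.strip():
            events.append(("text", data))

    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to deliver contiguous character data in a single call.
    parser.buffer_text = True
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return events


events = collect_events(SAMPLE)
for event in events:
    print(event)
```

Everything else in the table (DOM trees, data bindings, SAX) is typically built on top of an event stream of roughly this shape.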

DOM

The Document Object Model is probably the best-known API for XML, and very well represented in the Python world.

4DOM | 4DOM is part of PyXML. It is a comprehensive implementation of W3C DOM Level 2. | steady
cDomlette | See the "XML parsing engines" section. | strong
minidom | Python versions from 2.0 up bundle a minidom module. Minidom is a lightweight DOM implementation that is more pythonic. It follows the general lines of DOM Level 2. | strong
pulldom | Python versions from 2.0 up bundle a pulldom module. Pulldom is a special DOM-like implementation that only loads parts of an XML document as requested. | strong
pxdom * | pxdom is a pure-Python DOM implementation and non-validating parser, supporting DOM Level 3 Core, XML, Load and Save specifications. pxdom passes the DOM Level 1 and 2 Core Test Suite. | strong
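
The difference between minidom and pulldom is easiest to see side by side. The following is a minimal sketch, assuming the `xml.dom.minidom` and `xml.dom.pulldom` modules as shipped with a present-day Python 3; the sample document and both function names are made up for illustration.

```python
from xml.dom import minidom, pulldom

# Illustrative document; the element and attribute names are assumptions.
SAMPLE = """<projects>
  <project name="minidom" vitality="strong"/>
  <project name="4DOM" vitality="steady"/>
  <project name="pxdom" vitality="strong"/>
</projects>"""


def names_with_minidom(xml_text):
    """Build the whole tree in memory, then query it DOM-style."""
    doc = minidom.parseString(xml_text)
    try:
        return [el.getAttribute("name")
                for el in doc.getElementsByTagName("project")]
    finally:
        doc.unlink()  # break reference cycles in the tree


def strong_with_pulldom(xml_text):
    """Walk the event stream and look only at the nodes of interest."""
    strong = []
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT and node.tagName == "project":
            if node.getAttribute("vitality") == "strong":
                strong.append(node.getAttribute("name"))
    return strong


all_names = names_with_minidom(SAMPLE)
strong_names = strong_with_pulldom(SAMPLE)
print(all_names, strong_names)
```

With minidom the entire document becomes a tree before any query runs; with pulldom the program sees one event at a time and decides which nodes are worth keeping, which matters for large inputs.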

Data bindings and specialized APIs

SAX and DOM are perhaps the best known XML processing APIs, but there are many projects that strive for an API that focuses on the strengths of Python.

Anobind * | Anobind is a data binding which provides for customized bindings using XPath and Python patterns. It supports a subset of XPath on the data structures, and re-serialization of XML. | strong
ElementTree * | A library for managing any sort of hierarchical Python objects in specialized data structures based on XML elements. It supports a subset of XPath on the data structures. | strong
PAX * | The Pythonic API for XML parses an XML file into a Pythonic data structure, using iterators for some APIs. It also provides a transformation engine. | weak
POM * | The Python Object Model for XML is a DOM-like library, but more closely follows Python conventions. POM objects can also enforce DTD constraints dynamically during API manipulations. POM is a component of PyNMS, a collection of Python (and some C) modules for use in network management applications. | weak
Satine * | Satine converts XML documents to Python lists of objects which have Python attributes mirroring the XML element attributes, called the "xlist" data structure. It also has a web services module which supports plain XML and SOAP over HTTP. | steady
Skyron * | Skyron is a Python module that transforms XML documents according to simple "recipe" files expressed in XML. These recipes bind XML data to handler code in Python. | steady
XBind * | XBind is an XML vocabulary for specifying language-independent data bindings. It comes with a prototype Python implementation (see section 7 of the XBind tutorial for a link). | weak
XElf * | XElf is a set of modules dedicated to XML processing for Python. It currently features a Python XOM implementation, including support for Namespaces and XMLBase. XOM is Elliotte Rusty Harold's XML object module for Java intended to improve upon DOM and JDOM. | steady
generateDS.py * | generateDS.py is a tool for generating Python data structures from W3C XML Schema definitions. | steady
gnosis.xml.objectify * | This module in Gnosis Utilities turns arbitrary XML documents into Python objects, allowing for user customization of the conversion. | strong
xmlite * | xmlite is a lightweight XML parser and printer that emits simple nested lists. | steady
xmltramp * | xmltramp turns an XML document into a Python data structure with heavy use of dictionaries. | strong
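
As a rough illustration of the style these APIs aim for, here is a minimal sketch using ElementTree, including its XPath subset. It assumes the `xml.etree.ElementTree` module bundled with present-day Python, which descends from the ElementTree library listed above; the 2003 release was a separate download and its interface may have differed in detail. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative document; element and attribute names are assumptions.
SAMPLE = """<survey>
  <section title="DOM">
    <project name="minidom" vitality="strong"/>
    <project name="4DOM" vitality="steady"/>
  </section>
  <section title="Miscellany">
    <project name="Pyxie" vitality="weak"/>
  </section>
</survey>"""

root = ET.fromstring(SAMPLE)

# Elements behave much like Python sequences with a dict of attributes.
section_titles = [section.get("title") for section in root]

# The supported XPath subset covers descendant steps and attribute predicates.
strong = [p.get("name")
          for p in root.findall(".//project[@vitality='strong']")]
dom_projects = [p.get("name")
                for p in root.findall("./section[@title='DOM']/project")]

# Trees are mutable, and re-serialize to XML text on demand.
dom_section = root.find("./section[@title='DOM']")
ET.SubElement(dom_section, "project", name="pxdom", vitality="strong")
serialized = ET.tostring(root, encoding="unicode")

print(section_titles, strong, dom_projects)
```

The appeal over DOM is that iteration, indexing and attribute access all use ordinary Python idioms rather than a language-neutral interface.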

XPath and XSLT

XPath and XSLT are perhaps the most universal XML processing tools. XSLT is not just a styling tool but a full-blown (if verbose) scripting language for XML. XPath is embedded in almost every other XML technology you can think of.

4XSLT | 4XSLT is part of 4Suite, as are 4XPath and 4XPointer. 4XSLT supports a large portion of EXSLT. | strong
Pyana | Pyana is a Python extension module wrapping the Xalan XSLT engine. | strong
libxslt/Python | This Python extension module is a wrapper for libxslt. It supports a large portion of EXSLT. | strong

Schema languages (not built into parsers)

Schema languages allow one to communicate XML formats, validate that instances match the constraints, and even add convenience features for the XML formats. DTD is the original schema language, and is usually implemented in the XML parser (and so most DTD implementations are covered in the section on parsers).

XSV | XSV is a W3C XML Schema (WXS) implementation. It is actually one of the first WXS implementations, and drives the W3C's on-line validator. | steady
XVIF | XVIF implements RELAX NG, enhanced with the XML Validation Interoperability Framework for XML processing pipelining. It includes an implementation of XML Regular Fragmentations. 4Suite includes experimental RELAX NG and XVIF integration through this software. | steady
gnosis.xml.validity * | This module in Gnosis Utilities represents XML DTD validity constraints as Python objects. | strong

Protocols

One of the earliest and most discussed uses of XML is to transmit data from one application and/or machine to another. These tools provide such XML protocol facilities for use in Python.

Python Web Services | This is a collection of Python modules for SOAP, WSDL, and related technologies. | steady
WDDX/Python | PyXML comes with a WDDX module for Python. |
XMLTP Light * | XMLTP/L is a lightweight XML-like RPC protocol (it actually only allows a subset of XML). XMLTP/L is primarily designed for fast RPC calls to a database server over an intranet. It is implemented in Python and C, although bindings can also be written in Java. | steady
wsdl4py | wsdl4py is a simple Python library for WSDL processing. See also uddi4py. | steady
xmlrpclib | Python versions from 2.1 up bundle XML-RPC client and server modules. | strong
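
To make the idea of an XML protocol concrete, here is a minimal sketch of the XML-RPC marshalling layer without any network traffic. It assumes a present-day Python 3, where the bundled `xmlrpclib` module is named `xmlrpc.client`; the `add` method name is hypothetical and the "server" is simulated inline.

```python
import xmlrpc.client

# Marshal a call to a hypothetical remote method "add" into XML.
request_xml = xmlrpc.client.dumps((2, 3), methodname="add")
print(request_xml)

# What a server would do on receipt: unmarshal parameters and method name.
params, method = xmlrpc.client.loads(request_xml)

# Simulate the server computing a result and marshalling the response.
response_xml = xmlrpc.client.dumps((sum(params),), methodresponse=True)

# What the client would do with the reply.
(result,), _ = xmlrpc.client.loads(response_xml)
print(method, params, result)
```

In real use a `ServerProxy` object hides these steps behind ordinary method calls, but the XML documents exchanged are the ones produced here.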

RDF and Topic Maps

The Resource Description Framework is a system for managing metadata. Its primary serialization syntax is an XML vocabulary. These are Python tools for processing this RDF/XML syntax.

4RDF | 4RDF is part of 4Suite. It includes an RDF/XML and NTriples parser, RDF store system, Python triples API and an implementation of the Versa query language. | strong
RDFLib * | RDFLib, which used to be part of Redfoot, is an RDF/XML parser and triple store. | strong
Redfoot | Redfoot is an RDF server written in Python. | weak
Redland/Python | This is a Python interface for the Redland RDF Application Framework. | strong
TRAMP * | TRAMP is a data-binding-like map between RDF/XML documents and Python objects. | steady
rdfxml.py * | A lightweight SAX-based RDF/XML parser. | steady
tmproc | tmproc is a Python implementation of XML Topic Maps, based on ISO/IEC 13250 Topic Maps. | weak

Miscellany

These are libraries that implement various XML technologies for Python.

4XLink | 4XLink is part of 4Suite. It implements a portion of XLink. | weak
4XUpdate | 4XUpdate is part of 4Suite. It is a Python implementation of XUpdate. It can be used to apply difference patches generated by XMLDiff. | strong
Berkeley DB XML Python Module * | Berkeley DB XML is an XML DBMS and it includes a Python API that mirrors the C++ and Java APIs. | steady
JAXML * | JAXML provides a Python function invocation syntax for generating XML or HTML. | steady
PXTL * | PXTL ("Python XML Templating Language") is a tool for producing XML, HTML and other text-based document types using XML templates. | strong
Pyxie | Pyxie is a line-oriented XML processor. | weak
XIST | XIST, "object oriented XSLT", uses an easily extensible, DOM-like view of source and target XML documents to do tree transformations. | strong
XMLTools | XMLTools is a small suite of tools that includes a graphical XML tree viewer and editor for the GTK windowing library. | weak
XMLdiff | XMLdiff is a Python tool that figures out the significant differences between two XML files or DOM trees. It can generate XUpdate output. | strong
c14n.py | c14n.py is part of PyXML. It implements XML canonicalization. | strong
gnosis.xml.indexer * | This module in Gnosis Utilities creates full-text indexes of XML or plain text files. | strong
xml.sax | Python versions from 2.0 up bundle a SAX module. | strong
xmlSiteMakerPy * | xmlSiteMakerPy is a Python-based XML and XSLT framework for offline (i.e. static) site generation. | strong
xmlarch | xmlarch is an XML architectural forms processor written in Python, using SAX. | weak
xmlprinter * | A lightweight Python module to help write out well-formed XML, inspired by Perl's XML::Writer module. | weak
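
Since several entries above build on the bundled SAX module, here is a minimal sketch of a SAX content handler. It assumes the `xml.sax` package as shipped with a present-day Python 3; the `VitalityCounter` class and the sample document are illustrative inventions.

```python
import xml.sax

# Illustrative document; element and attribute names are assumptions.
SAMPLE = """<projects>
  <project name="XIST" vitality="strong"/>
  <project name="Pyxie" vitality="weak"/>
  <project name="XMLdiff" vitality="strong"/>
</projects>"""


class VitalityCounter(xml.sax.ContentHandler):
    """Tally project elements by their vitality attribute."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per start tag; attrs behaves like a read-only mapping.
        if name == "project":
            vitality = attrs.get("vitality", "unknown")
            self.counts[vitality] = self.counts.get(vitality, 0) + 1


handler = VitalityCounter()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.counts)
```

Because the handler keeps only a small dictionary and never builds a tree, this style suits documents too large to hold in memory.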

And so back to the fray...

Bloggers to watch

Many of the active contributors to Python-XML are also active webloggers, and though they talk about all manner of other topics as well, they're still a rich resource for Python-XML news and goodies.

I've undoubtedly missed some resources in this article. If you know of any I've neglected, please mention them in a comment to this article and I'll be sure to take note of them for future updates. I mention new or newly discovered resources at the end of each article, and I compile the updates yearly. And for those working on new Python and XML goodies, do not forget to post announcements to the Python XML SIG mailing list. This is the best way to be sure that I and a lot of others are aware of your work.

Andrew Clover is certainly in the spotlight for this month's round of updates. I've reported on his remarkable efforts summarizing the state of compliance of the various Python DOM implementations. This matrix is now available in an HTML document, DOM Standards compliance.

Clover also announced a new package, pxdom, a "stand-alone pure-Python DOM implementation and non-validating parser, supporting Level 3 Core, XML, Load and Save specifications". pxdom's emphasis is on standards compliance. "It supports the W3C specifications fully, with only very minor deviations...pxdom passes the DOM L1/2 Core Test Suite". However, "The emphasis of pxdom is not on efficiency. General speed and memory usage can be expected to be somewhere between minidom and 4DOM levels, nowhere near cDomlette." pxdom is compatible with Python 1.5.2 and later and available under a BSD-style license.

pxdom was created as an engine for PXTL, which Clover also announced. PXTL ("Python XML Templating Language") is a tool for producing XML, HTML, and other text-based document types using XML templates. PXTL is also available under a BSD-style license.

    


Walter Dörwald announced XIST 2.2, an XML-based extensible markup generator written in Python (Python 2.3 is now required). This release adds support for XSL-FO output and includes many other core fixes and improvements.

I discovered xmltramp, another Python-XML data binding. There is also TRAMP, which provides a data binding-like mapping between RDF/XML documents and Python objects.

I also came across PyMeld, a "lightweight system for manipulating [HTML] using a Pythonic object model". It also claims to handle XML, but it uses a set of clearly non-compliant regular expressions. I do not recommend this tool for use with XML.
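
To show why regular expressions make me wary for XML in general, here is a minimal sketch comparing a naive pattern with a real parser. The pattern is hypothetical and written for this illustration; it is not taken from PyMeld. The parser side assumes the `xml.etree.ElementTree` module bundled with present-day Python.

```python
import re
import xml.etree.ElementTree as ET

# A well-formed document with a commented-out element and a CDATA section.
DOC = ('<page><!-- <title>draft</title> -->'
       '<title><![CDATA[A < B]]></title></page>')

# Hypothetical naive pattern (an assumption, not any tool's actual code).
NAIVE_TITLE = re.compile(r"<title>(.*?)</title>")

# The regex cannot tell comments from markup, nor decode CDATA.
regex_titles = NAIVE_TITLE.findall(DOC)

# A conforming parser skips the comment and yields the character data.
parsed_titles = [t.text for t in ET.fromstring(DOC).iter("title")]

print(regex_titles)
print(parsed_titles)
```

The regular expression reports two titles, one of them from inside a comment and the other still wrapped in CDATA markup, while the parser reports the single title the document actually contains.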

A few months ago I mentioned Pete Ohler's announcement of xmlite, a small validating XML parser for Python. As I said, he neglected to make the module available. I've now found xmlite on SourceForge. Notice: xmlite is not a validating parser, as I mistakenly said earlier.

Sam Ruby announced a new XML API for Python, Lazy DOM. It's not a DOM implementation at all, but rather a specialized data structure representation. As Ruby says, "the basic metaphor is that everything is an array, where the indexes are either an integer or a tag name or an attribute name". To be more precise, Ruby means "dictionary" when he says array, and all the keys are strings, including strings given through a pretty clever mechanism for representing namespaces.

In my last article I introduced xmlSiteMakerPy. The address I gave is obsolete, and you should use this updated link.

Andrew Dalke announced PyRSS2Gen, an RSS 2.0 generator for Python. Dalke developed it to be both highly compliant and to support a broad range of RSS features. See the announcement for more details.