The State of Python-XML in 2004

October 13, 2004

Uche Ogbuji

The table below lists the currently available Python-XML software that I judge to be significant. It is not a list of every bit of software in Python that has anything to do with XML. For example, I do not list pyglade (part of PyGTK), which is software for generating user interfaces in the GNOME desktop system for UNIX. The user interface specifications in question are in XML, but this is not really enough to call it an XML processing tool for Python. However, you can certainly use the tools I mention for convenient manipulation of pyglade specifications.

The general rules of thumb for including software are, firstly, whether it implements a technology or set of technologies strongly associated with XML; and secondly, whether it does so in a way that is useful for any arbitrary XML file I may want to process.

Another example of a project that doesn't fit these parameters is Mark Pilgrim's excellent Universal Feed Parser, which parses almost every known form of RSS and Atom newsfeed formats, including some that are not well-formed XML. This package is not a general-purpose tool for XML processing, but rather focused on a specific XML vocabulary. I did make a bit of a compromise on this principle to cover RDF packages, since even though RDF/XML is a specific XML vocabulary, it is generally acknowledged as a valid way to express the data in any XML.

I organize the table according to selected areas of XML technology. This will give newcomers to Python a quick look at the coverage of XML technologies in Python and should serve as a quick guide to where to go to address any particular XML processing need. I have added reference links to column articles on software I've covered in this column. I have set a "heartbeat" rating for each project. One heart means the project is almost inactive and three means the project is very active. I judge this rating subjectively, according to recent activity I can find for each project: mailing list traffic, releases, articles, other projects that use it, etc.

In 2002 I reported 34 Python-XML projects. Last year I added 24 and this year 16 (marked with an asterisk) for a grand total of 74. This month alone two new projects have emerged, showing the continuing interest in Python processing of XML. This year I added a new category, for XML generators, with 9 entries. There has been a bloom in Python packages for generating XML. An existing category that keeps on growing is in Pythonic APIs or data bindings. There are 15 as of this year's count. There is no doubt that patience for non-Pythonic ways of processing XML has worn thin, but considering that my list may not even be complete (rumor has it Guido van Rossum has a data-binding tool of his own), one wonders whether this area is ripe for consolidation. At this point I leave you to judge such matters for yourself.

XML Processing Software for Python

XML Parsing Engines

Parsing engines offer unique, low-level parsers. Many packages offer additional capabilities, but this section mainly documents the various low-level XML parsers for Python, on which other packages then build. Packages do not support DTD validation unless such support is explicitly stated. Note: There are no Python parsers that I know of that support XML 1.1, although, as many have remarked (1), (2), (3), XML 1.1 is probably in trouble as far as adoption is concerned.

Name Description Vitality
PyLTXML PyLTXML is a Python extension wrapping the LTXML parser. It supports DTD validation.
cDomlette cDomlette is part of 4Suite. It is a fast, C-based DOM implementation with a Python API, and includes a wrapper of the expat parser. It supports RELAX NG validation. It also supports XInclude and XML Base and XML entity catalogs. [1]. [2].
libxml2/python This Python extension module is a wrapper for libxml2. It supports DTD validation, RELAX NG, WXS, XInclude (plus XPointer), XML Base, and XML Catalogs. [1]. [2].
pyRXPU pyRXPU is a Python extension wrapping the RXP XML parser. It supports DTD validation. Unfortunately, pyRXPU is only an optional mode of building PyRXP, which in its default build falsely claims to be an XML parser. [1].
pyexpat Pyexpat is part of PyXML and is a wrapper of the expat parser.
qp_xml qp_xml is part of PyXML. It is a simple parser written entirely in Python with no validation support.
xmlproc xmlproc is part of PyXML. It is a parser written entirely in Python. It supports DTD validation and XML catalogs. It provides API access to parsed DTD constructs.
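
To give a feel for what working at this low level looks like, here is a minimal sketch that drives expat through the xml.parsers.expat module bundled with current Python releases. The helper name collect_events and the sample document are assumptions made for illustration; they are not taken from any package in the table.

```python
import xml.parsers.expat


def collect_events(xml_text):
    """Parse xml_text with expat and return the callbacks it fired, in order.

    collect_events is a hypothetical helper for this sketch, not a library API.
    """
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to deliver contiguous character data in a single callback.
    parser.buffer_text = True
    parser.StartElementHandler = (
        lambda name, attrs: events.append(("start", name, dict(attrs)))
    )
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    # The second argument marks this as the final chunk of input.
    parser.Parse(xml_text, True)
    return events


events = collect_events('<doc id="1"><p>hello</p></doc>')
for event in events:
    print(event)
```

Everything above the parser level (DOM trees, data bindings, and so on) is ultimately built from callbacks like these.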

DOM

The Document Object Model is probably the best-known API for XML, and is very well-represented in the Python world.

Name Description Vitality
4DOM 4DOM is part of PyXML. It is a comprehensive implementation of W3C DOM Level 2.
PIRXX A Python extension module for interface with Xerces and Xalan.
cDomlette See the "XML parsing" section.
domhelper.py * domhelper.py is a DOM helper module with functions for some common DOM operations, including looking up namespace URIs and prefixes, and non-recursively getting the text or child elements of a given node.
minidom Python versions from 2.0 up bundle a minidom module. Minidom is a lightweight DOM implementation that is more Pythonic. It follows the general lines of DOM Level 2. [1]. [2]. [3].
pulldom Python versions from 2.0 up bundle a pulldom module. Pulldom is a special DOM-like implementation that only loads parts of an XML document as requested. [1].
pxdom pxdom is a pure-Python DOM implementation and non-validating parser, supporting DOM Level 3 Core, XML, Load and Save specifications. pxdom passes the DOM Level 1 and 2 Core Test Suite. [1].
xmlapi * xmlapi is a lightweight XML DOM implementation similar to minidom.
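
As a rough illustration of the difference between the two bundled modules, the sketch below reads the same small document with minidom (whole tree in memory) and with pulldom (nodes expanded only on request). The sample document and variable names are assumptions for the example, and it is written against the modules as shipped in current Python releases.

```python
from xml.dom import minidom, pulldom

SOURCE = (
    '<catalog>'
    '<item sku="a1">Widget</item>'
    '<item sku="b2">Gadget</item>'
    '</catalog>'
)

# minidom: build the complete tree, then query it.
doc = minidom.parseString(SOURCE)
skus = [el.getAttribute("sku") for el in doc.getElementsByTagName("item")]

# pulldom: walk a stream of events and expand only the nodes of interest.
names = []
stream = pulldom.parseString(SOURCE)
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.tagName == "item":
        stream.expandNode(node)  # pull in this element's subtree only
        text = "".join(
            child.data
            for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        names.append(text)

print(skus, names)
```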

Data Bindings and Specialized APIs

SAX and DOM are perhaps the best-known XML processing APIs, but there are many projects that strive for an API that focuses on the strengths of Python.

Name Description Vitality
Anobind Anobind is a data-binding that provides for customized bindings using XPath and Python patterns. It supports a subset of XPath on the data structures, and re-serialization of XML. [1].
ElementTree A library for managing any sort of hierarchical Python objects in specialized data structures based on XML elements. It supports a subset of XPath on the data structures. [1]. [2]. [3].
PAX Part of OpenTAL, a Python-based templating system for manipulation of XMLish data, the Pythonic API for XML (PAX) parses an XML file into a Pythonic data structure, using iterators for some APIs. It also provides a transformation engine.
POM The Python Object Model for XML is a DOM-like library, but more closely follows Python conventions. POM objects can also enforce DTD constraints dynamically during API manipulations. POM is a component of PyNMS, a collection of Python (and some C) modules for use in network management applications.
Python XML Marshaller * Python XML Marshaller is a Python data-binding for XML with some WXS support, including the ability to generate WXS from Python data structures. It also offers some features for customizing the binding.
SOX * Simple Objects from XML (SOX) is a part of the Python Enterprise Application Kit (PEAK). SOX uses SAX events to build a Python object the user can define based on specialized classes.
Satine Satine converts XML documents to Python lists of objects that have Python attributes mirroring the XML element attributes, called the "xlist" data structure. It also has a web services module that supports plain XML and SOAP over HTTP.
Skyron Skyron is a Python module that transforms XML documents according to simple "recipe" files expressed in XML. These recipes bind XML data to handler code in Python.
XBind XBind is an XML vocabulary for specifying language-independent data bindings. It comes with a prototype Python implementation (see section 7 of the XBind tutorial for a link).
XElf XElf is a set of modules dedicated to XML processing for Python. It currently features a Python XOM implementation, including support for Namespaces and XML Base. XOM is Elliotte Rusty Harold's XML object model for Java, intended to improve upon DOM and JDOM.
XMLObject * XMLObject allows you to map from customized Python classes to XML, and vice versa.
generateDS.py generateDS.py is a tool for generating Python data structures from W3C XML Schema definitions. [1].
gnosis.xml.objectify This module in Gnosis Utilities turns arbitrary XML documents into Python objects, allowing for user customization of the conversion. [1].
xmlite xmlite is a lightweight XML parser and printer that emits simple nested lists.
xmltramp xmltramp turns an XML document into a Python data structure with heavy use of dictionaries. [1].
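
For readers who have not seen a Pythonic XML API, here is a minimal sketch using ElementTree. It imports the copy that current Python releases bundle as xml.etree.ElementTree, which is an assumption about your environment; the stand-alone package described above is imported under a different name. The sample document is invented for illustration, and only simple path expressions are used.

```python
import xml.etree.ElementTree as ET

SOURCE = """<library>
  <book id="b1"><title>First</title><year>2001</year></book>
  <book id="b2"><title>Second</title><year>2004</year></book>
</library>"""

root = ET.fromstring(SOURCE)

# Simple path expressions select descendants without manual tree walking.
titles = [title.text for title in root.findall("book/title")]
first_year = root.findtext("book/year")

# Elements behave like ordinary Python objects: attributes are set via
# set(), and the tree can be serialized back to XML afterwards.
root.find("book").set("checked", "yes")
serialized = ET.tostring(root, encoding="unicode")

print(titles, first_year)
print(serialized)
```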

XPath and XSLT

XPath and XSLT are perhaps the most universal XML processing tools. XSLT is not just a styling tool but a full-blown (if verbose) scripting language for XML. XPath is embedded in almost every other XML technology you can think of.

Name Description Vitality
4XSLT 4XSLT is part of 4Suite, as is 4XPath and 4XPointer. 4XSLT supports a large portion of EXSLT. [1]. [2]. [3].
Pyana Pyana is a Python extension module wrapping the Xalan XSLT engine.
libxslt/Python This Python extension module is a wrapper for libxslt. It supports a large portion of EXSLT.

Schema Languages (Not Built into Parsers)

Schema languages allow one to communicate XML formats, validate that instances match the constraints, and even assess convenience features for the XML formats. DTD is the original schema language, and is usually implemented in XML parsers (and so most implementations are covered in the section on parsers).

Name Description Vitality
Scimitar * Scimitar is an ISO Schematron implementation that works by compiling a Schematron schema into a Python validator script.
XSV XSV is a W3C XML Schema (WXS) implementation. It is actually one of the first WXS implementations, and drives the W3C's on-line validator.
XVIF XVIF implements RELAX NG, enhanced with the XML Validation Interoperability Framework for XML processing pipelining. It includes an implementation of XML Regular Fragmentations. 4Suite includes experimental RELAX NG and XVIF integration through this software.
gnosis.xml.validity This module in Gnosis Utilities represents XML DTD validity constraints as Python objects.
minixsv * minixsv is a lightweight W3C XML Schema validator written in pure Python. It implements a small but core subset of the language.

XML Generators

These are Python tools that can be used to generate XML.

Name Description Vitality
Atox * Atox allows you to write custom scripts for converting plain text into XML. You define the text-to-XML binding using a simple XML language. It's meant to be used from the command line. Changes since 0.1 include language improvements, added support for config files, and XSLT fragments in Atox format files.
GraphPath * GraphPath is a little XPath-like language for analysing graph-structured data, especially RDF. The implementation is in Python and works with rdflib or the Python binding of Redland. It includes a query evaluator and a goal-driven inference engine.
JAXML JAXML is a Python module that provides a Python function invocation syntax for generating XML or HTML. [1].
Martel * Martel is a tool for converting flat-file, text-based formats into XML, inspired by data formats popular in bioinformatics. It essentially generates SAX events from the results of applying regular expressions to text.
PXTL PXTL ("Python XML Templating Language") is a tool for producing XML, HTML and other text-based document types using XML templates.
PyGenx * PyGenx is a Python wrapper for Genx, a canonical XML generation library written in C.
XMLBuilder * XMLBuilder is an XML generator that works by interpreting data in Python dictionaries.
handyxml * handyxml is a Python module that wraps XML parsers and parsed DOM implementations into objects with added Pythonic features.
xmlprinter A lightweight Python module to help write out well-formed XML, inspired by Perl's XML::Writer module. [1].
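
None of the packages above is needed for a basic case. As a point of comparison, the sketch below generates well-formed XML with the XMLGenerator class from xml.sax.saxutils in the standard library, which escapes markup characters for you. The function name write_items and the sample data are assumptions for illustration, and the code targets current Python releases.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_items(items):
    """Serialize (name, text) pairs as XML; write_items is a hypothetical helper."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("items", {})
    for name, text in items:
        # Attribute values and character data are escaped by the generator.
        gen.startElement("item", {"name": name})
        gen.characters(text)
        gen.endElement("item")
    gen.endElement("items")
    gen.endDocument()
    return out.getvalue()


result = write_items([("R&D", "1 < 2")])
print(result)
```

Emitting XML through an API like this, rather than by string concatenation, is what keeps characters such as & and < from breaking well-formedness.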

Protocols

One of the earliest and most discussed uses of XML is to transmit data from one application or machine to another. These tools provide such XML protocol facilities for use in Python.

Name Description Vitality
Python Web Services This is a collection of Python modules for SOAP, WSDL and related technologies.
WDDX/Python PyXML comes with a WDDX module for Python.
XMLTP Light XMLTP/L is a lightweight XML-like RPC protocol (it actually only allows a subset of XML). XMLTP/L is primarily designed for fast RPC calls to a database server over an intranet. It is implemented in Python and C, although bindings can also be written in Java.
wsdl4py wsdl4py is a simple Python library for WSDL processing. See also uddi4py.
xmlrpclib Python versions from 2.2 up bundle XML-RPC client and server modules.
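
To show what such a protocol tool does with XML, here is a small sketch of XML-RPC marshalling that needs no network connection. It uses xmlrpc.client, the name under which current Python releases ship the module listed above as xmlrpclib. The method name sample.add and the parameters are made up for the example.

```python
import xmlrpc.client

# Encode a method call as an XML-RPC request document.
payload = xmlrpc.client.dumps((2, "three", [4.0]), methodname="sample.add")
print(payload)

# Decode the request again, as a server would on receipt.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```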

RDF and Topic Maps

The Resource Description Framework is a system for managing metadata. Its primary serialization syntax is an XML vocabulary. These are Python tools for processing this RDF/XML syntax.

Name Description Vitality
4RDF 4RDF is part of 4Suite. It includes an RDF/XML and NTriples parser, RDF store system, Python triples API and an implementation of the Versa query language.
Pyrple * Pyrple is a small RDF API in Python, with support for parsing RDF/XML, N3, and N-Triples formats.
RDFLib RDFLib, which used to be part of Redfoot, is an RDF/XML parser and RDF triple store.
Redfoot Redfoot is an RDF server written in Python.
Redland/Python This is a Python interface for the Redland RDF Application Framework.
Rx4RDF * Rx4RDF is a specification and reference implementation for querying, transforming, and updating W3C's RDF by specifying a deterministic mapping of the RDF model to the XML data model defined by XPath. Rx4RDF shields developers from the complexity of RDF by enabling them to use familiar XML technologies such as XPath, XSLT, and XUpdate. Rx4RDF also forms the basis of Racoon, similar to the popular Cocoon framework, but using RDF and Python rather than XML/XSLT and Java.
TRAMP TRAMP is a data-binding-like map between RDF/XML documents and Python objects.
rdfxml.py A lightweight SAX-based RDF/XML parser.
tmproc tmproc is a Python implementation of XML Topic Maps, based on ISO/IEC 13250 Topic Maps.

Miscellany

In this category is software that does not fall into any other area.

Name Description Vitality
4XLink 4XLink is part of 4Suite. It implements a portion of XLink.
4XUpdate 4XUpdate is part of 4Suite. It is a Python implementation of XUpdate. It can be used to apply difference patches generated by XMLDiff. [1].
Berkeley DB XML Python Module Berkeley DB XML is an XML DBMS and it includes a Python API that mirrors the C++ and Java APIs.
Pyxie Pyxie is a line-oriented XML processor.
XIST XIST is a Python web-page generator that operates using a DOM-like view of source XML documents.
XMLFilter * XMLFilter provides a fallback SAX parser/driver to avoid SAXReaderNotAvailable errors that users encounter on some platforms. It also offers a safety net against the XMLGenerator bug that bit me earlier in this series. Its main feature, however, is a framework for SAX filters. [1].
XMLTools XMLTools is a small suite of tools that includes a graphical XML tree viewer and editor for the GTK windowing library.
XMLdiff XMLdiff is a Python tool that figures out the significant differences between two XML files or DOM trees. It can generate XUpdate output.
c14n.py c14n.py is part of PyXML. It implements XML canonicalization. [1].
gnosis.xml.indexer This module in Gnosis Utilities creates full-text indexes of XML or plain-text files.
xml.sax Python versions from 2.0 up bundle a SAX module. [1]. [2]. [3]. [4]. [5].
xmlSiteMakerPy xmlSiteMakerPy is a Python-based XML and XSLT framework for offline (i.e. static) site generation.
xmlarch xmlarch is an XML architectural forms processor written in Python, using SAX.
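
Several entries in this table build on the bundled SAX module, so a minimal sketch of a SAX content handler may be useful. The class name ElementCounter and the sample document are assumptions for illustration, and the code is written for current Python releases.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Count how often each element name occurs (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the SAX driver once per opening tag.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = ElementCounter()
xml.sax.parseString(b"<a><b/><b/><c/></a>", handler)
print(handler.counts)
```

A SAX filter follows the same pattern: it receives these callbacks, optionally changes them, and passes them on to the next handler in the chain.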

The Community Marches On

I'm sure I've missed some resources in this update article. If you know of any I've neglected, please mention them in a comment to this article and I'll be sure to take note of them for future updates. I mention new or newly discovered resources at the end of each column article, and I compile the updates yearly. Certainly anyone working where Python meets XML should participate on the Python XML SIG mailing list, and post announcements there. Doing so is the best way to be sure that I and a lot of others are aware of your work.

    

This month's regular update starts with something mind-bogglingly brilliant (if odd). My xmlhack colleague Oleg Paraschenko created Pysch, a Scheme runtime environment in Python that he wrote expressly for the purpose of running the Scheme tools SXPath and SXSLT under Python. Pysch already runs these target packages, and according to Paraschenko, "I think that Pysch can be used to run any Scheme code, after first using third-party tools to process the Scheme code and save it in XML format for parsing by Pysch." But there is the expected limitation: "Pysch is very slow. I'm not going to fix it yet. I use Pysch for research goals and not in production."

Mike Hostetler announced XMLBuilder 1.3. "You create an XMLBuilder object, send it some dictionary data, and it will generate the XML for you." I just mentioned the 1.1 release last month and I only post consecutive updates upon major changes. That certainly is the case here. It appears this is the first actually usable version of XMLBuilder. The announcement says "Support for non-ascii character." I hadn't realized such a limitation in earlier releases. I applaud the author and contributors for putting in the work to establish the "XML" in "XMLBuilder."

As I always vehemently argue, it ain't XML if it doesn't support Unicode. I probably have to weaken this rule a bit for XML generation code, saving the full strictness for XML parsing code, but I'm not comfortable with and can't recommend XML generation code that doesn't support the full character model. See the XMLBuilder announcement.
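
To make the point concrete, here is a rough sketch of dictionary-driven XML generation that handles non-ASCII text. It is not XMLBuilder's API, which I do not reproduce here. The function dict_to_element and the sample record are hypothetical, and the code relies on xml.etree.ElementTree as bundled with current Python releases.

```python
import xml.etree.ElementTree as ET


def dict_to_element(tag, data):
    """Build an element named tag from a (possibly nested) dictionary.

    Hypothetical helper for this sketch; keys are assumed to be valid
    XML element names.
    """
    elem = ET.Element(tag)
    for key, value in data.items():
        if isinstance(value, dict):
            elem.append(dict_to_element(key, value))
        else:
            child = ET.SubElement(elem, key)
            child.text = str(value)
    return elem


record = {"name": "Zoë", "address": {"city": "Kraków"}}
xml_bytes = ET.tostring(dict_to_element("person", record), encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

The test of such code is whether non-ASCII text survives a round trip through a parser, not merely whether the output looks right on screen.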

Roland Leuthe released minixsv 0.2, "a lightweight XML schema validator written in pure Python. It implements only a subset of the W3C XML schema [WXS] 1.0 recommendation." The WXS subset is very limited, but Leuthe admits the package is "pre-alpha," and I'll keep an eye out for further developments. minixsv works with the standard minidom or elementtree. As the page says, "Other DOM implementations can be easily adapted by implementing a newly derived XML interface class."

The major update rule also applies to my release of Scimitar 0.9.0. Scimitar is a fast ISO Schematron implementation that works by compiling a Schematron schema into a Python validator script. It now supports the full draft ISO Schematron spec, including variables and abstract patterns. See the announcement.

Philippe Normand announced XMLObject 0.1.3, a data-binding tool that allows you to map from customized Python classes to XML, and vice versa. See the announcement.

Fredrik Lundh released ElementTree 1.2.1. He says: "ElementTree 1.2.1 is 1.2 plus code that takes advantage of new expat features in newer versions of Python. As a result, the parser is now 20-30% faster on many kinds of XML documents. Enjoy!"

For users of various .NET Python tools, Srijit Kumar Bhadra posted some useful sample code for generating XML output. Later he posted some corrections to the code comments.