It's now been a year since the first Python and XML article, in which I made a broad survey of all things Python-XML. Since then I have picked various techniques, packages, and other useful resources to examine in detail. I haven't, of course, been able to look at every interesting or useful package or technique, but stay tuned for the next year of articles. This month I update the overall Python-XML survey to encompass notable developments over the past year, many of which I've mentioned in passing in prior articles. I hope this article serves as a ready and rapid index to folks who want to process XML using (in my opinion) the best language available for the purpose.
I mentioned the books which substantially discussed Python-XML in my first article. There haven't been many additions to that list this year. Chapter five of Dive Into Python by Mark Pilgrim covers XML processing. This book is a freely available electronic text and a very valuable resource overall. Text Processing in Python, by David Mertz (Addison Wesley), covers XML only briefly, but the things it does discuss in depth will almost inevitably come in handy if you do a lot of XML processing. If you bought Python & XML, by Christopher A. Jones and Fred L. Drake, Jr. (O'Reilly and Associates), remember that I presented a companion and update to that book in an earlier article.
The following table lists the currently available Python-XML software that I judge to be significant. It is not a list of every bit of software in Python that has anything to do with XML; for example, I do not list pyglade, which is software for generating user interfaces in the GNOME desktop system for Unix. The user interface specifications in question are in XML, but this is not really enough to call it an XML processing tool for Python. The criteria for inclusion are, first, whether a tool implements a technology or set of technologies strongly associated with XML; second, whether the tool does so in a way that is useful for any arbitrary XML file I may want to process.
I've organized the table according to the areas of XML technology. This will give newcomers to Python a quick look at the coverage of XML technologies in Python and should serve as a quick guide to where to go to address any particular XML processing need. I rate the vitality of each listed project as either "weak", "steady", or "strong" according to the recent visible activity on each project: mailing list traffic, releases, articles, other projects that use it, etc.
A year ago I reported 34 Python-XML projects. This year I have added 24, each of which is marked with a *. Most of the additions, though, point to the impressive activity that continues on the Python-XML front.
XML parsing engines
Parsing engines are characterized by offering unique low-level parsers. Many offer other capabilities, but this section mainly documents the various low-level XML parsers for Python, on which other packages then build.
|PyLTXML||PyLTXML is a Python extension wrapping the LTXML parser. It supports DTD validation.||steady|
|cDomlette||cDomlette is part of 4Suite. It is a fast C-based DOM implementation with a Python API, and includes a wrapper of the expat parser. It supports DTD validation. It also supports XInclude and XML Base and XML entity catalogs.||strong|
|libxml/python||This Python extension module is a wrapper for libxml. It supports DTD validation, XInclude (plus XPointer), XML Base and XML Catalogs.||strong|
|pyRXP||pyRXP is a Python extension wrapping the RXP XML parser. It supports DTD validation.||steady|
|pyexpat||Pyexpat is part of PyXML and is a wrapper of the expat parser. It does not support DTD validation, since expat is a non-validating parser.||strong|
|qp_xml||qp_xml is part of PyXML. It is a simple parser written entirely in Python with no validation support.||steady|
|xmlproc||xmlproc is part of PyXML. It is a parser written entirely in Python. It supports DTD validation and XML catalogs. It provides API access to parsed DTD constructs.||steady|
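To give a feel for what a low-level parser from the table above delivers, here is a minimal sketch using the expat wrapper. It assumes a current Python 3, where the wrapper is importable as `xml.parsers.expat` from the standard library; the helper name `collect_events` and the sample document are made up for illustration.

```python
import xml.parsers.expat


def collect_events(xml_text):
    """Return the start/end element events expat reports for xml_text.

    collect_events is a hypothetical helper for this sketch, not part
    of any package listed in the table.
    """
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls these handlers as it scans the document.
    parser.StartElementHandler = (
        lambda name, attrs: events.append(("start", name, dict(attrs)))
    )
    parser.EndElementHandler = lambda name: events.append(("end", name))
    # The second argument marks this chunk as the final one.
    parser.Parse(xml_text, True)
    return events


events = collect_events('<doc id="1"><item>text</item></doc>')
for event in events:
    print(event)
```

Higher-level packages typically build trees or data bindings on top of callbacks of this kind.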
DOM
The Document Object Model is probably the best-known API for XML, and very well represented in the Python world.
|4DOM||4DOM is part of PyXML. It is a comprehensive implementation of W3C DOM Level 2.||steady|
|cDomlette||See the "XML parsing engines" section||strong|
|minidom||Python versions from 2.0 up bundle a minidom module. Minidom is a lightweight DOM implementation that is more pythonic. It follows the general lines of DOM Level 2.||strong|
|pulldom||Python versions from 2.0 up bundle a pulldom module. Pulldom is a special DOM-like implementation that only loads parts of an XML document as requested.||strong|
|pxdom *||pxdom is a pure-Python DOM implementation and non-validating parser, supporting DOM Level 3 Core, XML, Load and Save specifications. pxdom passes the DOM Level 1 and 2 Core Test Suite.||strong|
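The difference between the two bundled modules in the table above is easiest to see side by side. The following is a rough sketch, assuming a current Python 3 standard library; the sample catalog and the variable names are invented for illustration. minidom builds the whole tree up front, while pulldom streams events and only builds a subtree when `expandNode` is called.

```python
from xml.dom import minidom, pulldom

# Hypothetical sample document for this sketch.
XML = (
    "<catalog>"
    '<book id="b1"><title>Dive Into Python</title></book>'
    '<book id="b2"><title>Text Processing in Python</title></book>'
    "</catalog>"
)

# minidom: parse everything into an in-memory tree, then query it.
doc = minidom.parseString(XML)
titles = [t.firstChild.data for t in doc.getElementsByTagName("title")]

# pulldom: walk the event stream and expand only the element we care about.
stream = pulldom.parseString(XML)
book_ids = []
second_title = None
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.tagName == "book":
        book_ids.append(node.getAttribute("id"))
        if node.getAttribute("id") == "b2":
            # Pull the rest of this element's subtree into a DOM fragment.
            stream.expandNode(node)
            second_title = node.getElementsByTagName("title")[0].firstChild.data

print(titles, book_ids, second_title)
```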
Data bindings and specialized APIs
SAX and DOM are perhaps the best known XML processing APIs, but there are many projects that strive for an API that focuses on the strengths of Python.
|Anobind *||Anobind is a data binding which provides for customized bindings using XPath and Python patterns. It supports a subset of XPath on the data structures, and re-serialization of XML.||strong|
|ElementTree *||A library for managing any sort of hierarchical Python objects in specialized data structures based on XML elements. It supports a subset of XPath on the data structures.||strong|
|PAX *||The Pythonic API for XML parses an XML file into a Pythonic data structure, using iterators for some APIs. It also provides a transformation engine.||weak|
|POM *||The Python Object Model for XML is a DOM-like library, but more closely follows Python conventions. POM objects can also enforce DTD constraints dynamically during API manipulations. POM is a component of PyNMS, a collection of Python (and some C) modules for use in network management applications.||weak|
|Satine *||Satine converts XML documents to Python lists of objects which have Python attributes mirroring the XML element attributes, called the "xlist" data structure. It also has a web services module which supports plain XML and SOAP over HTTP.||steady|
|Skyron *||Skyron is a Python module that transforms XML documents according to simple "recipe" files expressed in XML. These recipes bind XML data to handler code in Python.||steady|
|XBind *||XBind is an XML vocabulary for specifying language-independent data bindings. It comes with a prototype Python implementation (see section 7 of the XBind tutorial for a link).||weak|
|XElf *||XElf is a set of modules dedicated to XML processing for Python. It currently features a Python XOM implementation, including support for Namespaces and XMLBase. XOM is Elliotte Rusty Harold's XML object module for Java intended to improve upon DOM and JDOM.||steady|
|generateDS.py *||generateDS.py is a tool for generating Python data structures from W3C XML Schema definitions.||steady|
|gnosis.xml.objectify *||This module in Gnosis Utilities turns arbitrary XML documents into Python objects, allowing for user customization of the conversion.||strong|
|xmlite *||xmlite is a lightweight XML parser and printer that emits simple nested lists.||steady|
|xmltramp *||xmltramp turns an XML document into a Python data structure with heavy use of dictionaries.||strong|
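As one illustration of the Python-centric style these APIs aim for, here is a small ElementTree sketch. It assumes the version of ElementTree that later Python releases bundle as `xml.etree.ElementTree`, rather than the standalone package surveyed in the table, and the sample feed is invented. It shows the XPath subset at work, followed by a modification and re-serialization.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for this sketch.
root = ET.fromstring(
    "<feed>"
    '<entry lang="en"><title>One</title></entry>'
    '<entry lang="fr"><title>Deux</title></entry>'
    "</feed>"
)

# The XPath subset supports child steps and simple attribute predicates.
english_titles = [t.text for t in root.findall("./entry[@lang='en']/title")]

# Elements behave much like ordinary Python objects: add one, then serialize.
ET.SubElement(root, "entry", lang="de")
entry_count = len(root.findall("entry"))
serialized = ET.tostring(root, encoding="unicode")

print(english_titles, entry_count)
print(serialized)
```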
XPath and XSLT
XPath and XSLT are perhaps the most universal XML processing tools. XSLT is not just a styling tool but a full-blown (if verbose) scripting language for XML. XPath is embedded in almost every other XML technology you can think of.
|4XSLT||4XSLT is part of 4Suite, as is 4XPath and 4XPointer. 4XSLT supports a large portion of EXSLT.||strong|
|Pyana||Pyana is a Python extension module wrapping the Xalan XSLT engine.||strong|
|libxslt/Python||This Python extension module is a wrapper for libxslt. It supports a large portion of EXSLT.||strong|
Schema languages (not built into parsers)
Schema languages allow one to communicate XML formats, validate that instances match the constraints, and even add convenience features for the XML formats. DTD is the original schema language, and is usually implemented in XML parsers (and so most DTD implementations are covered in the section on parsers).
|XSV||XSV is a W3C XML Schema (WXS) implementation. It is actually one of the first WXS implementations, and drives the W3C's on-line validator.||steady|
|XVIF||XVIF implements RELAX NG, enhanced with the XML Validation Interoperability Framework for XML processing pipelining. It includes an implementation of XML Regular Fragmentations. 4Suite includes experimental RELAX NG and XVIF integration through this software.||steady|
|gnosis.xml.validity *||This module in Gnosis Utilities represents XML DTD validity constraints as Python objects.||strong|
Protocols
One of the earliest and most discussed uses of XML is to transmit data from one application and/or machine to another. These tools provide such XML protocol facilities for use in Python.
|Python Web Services||This is a collection of Python modules for SOAP, WSDL, and related technologies.||steady|
|WDDX/Python||PyXML comes with a WDDX module for Python.|
|XMLTP Light *||XMLTP/L is a lightweight XML-like RPC protocol (it actually only allows a subset of XML). XMLTP/L is primarily designed for fast RPC calls to a database server over an intranet. It is implemented in Python and C, although bindings can also be written in Java.||steady|
|wsdl4py||wsdl4py is a simple Python library for WSDL processing. See also uddi4py.||steady|
|xmlrpclib||Python versions from 2.1 up bundle XML-RPC client and server modules.||strong|
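To show what "transmitting data as XML" looks like in practice, here is a minimal sketch of an XML-RPC round trip that needs no network. It assumes Python 3, where the bundled `xmlrpclib` module is available as `xmlrpc.client`; the method name `demo.echo` is purely hypothetical.

```python
import xmlrpc.client

# Marshal a call to a hypothetical method into an XML-RPC request body.
payload = xmlrpc.client.dumps((2, "three", [4.0, True]), methodname="demo.echo")
print(payload)

# Unmarshal it again, as a server would on receipt.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```

For real calls, `xmlrpc.client.ServerProxy` wraps the same marshalling behind ordinary Python method calls.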
RDF and Topic Maps
The Resource Description Framework is a system for managing metadata. Its primary serialization syntax is an XML vocabulary. These are Python tools for processing this RDF/XML syntax.
|4RDF||4RDF is part of 4Suite. It includes an RDF/XML and NTriples parser, RDF store system, Python triples API and an implementation of the Versa query language.||strong|
|RDFLib *||RDFLib, which used to be part of Redfoot, is an RDF/XML parser and triple store.||strong|
|Redfoot||Redfoot is an RDF server written in Python.||weak|
|Redland/Python||This is a Python interface for the Redland RDF Application Framework.||strong|
|TRAMP *||TRAMP is a data-binding-like map between RDF/XML documents and Python objects.||steady|
|rdfxml.py *||A lightweight SAX-based RDF/XML parser.||steady|
|tmproc||tmproc is a Python implementation of XML Topic Maps, based on ISO/IEC 13250 Topic Maps.||weak|
Miscellaneous
These are libraries that implement various XML technologies for Python.
|4XLink||4XLink is part of 4Suite. It implements a portion of XLink.||weak|
|4XUpdate||4XUpdate is part of 4Suite. It is a Python implementation of XUpdate. It can be used to apply difference patches generated by XMLDiff.||strong|
|Berkeley DB XML Python Module *||Berkeley DB XML is an XML DBMS and it includes a Python API that mirrors the C++ and Java APIs.||steady|
|JAXML *||JAXML provides a Python function invocation syntax for generating XML or HTML.||steady|
|PXTL *||PXTL ("Python XML Templating Language") is a tool for producing XML, HTML and other text-based document types using XML templates.||strong|
|Pyxie||Pyxie is a line-oriented XML processor.||weak|
|XIST||XIST, "object oriented XSLT", uses an easily extensible, DOM-like view of source and target XML documents to do tree transformations.||strong|
|XMLTools||XMLTools is a small suite of tools that includes a graphical XML tree viewer and editor for the GTK windowing library.||weak|
|XMLdiff||XMLdiff is a Python tool that figures out the significant differences between two XML files or DOM trees. It can generate XUpdate output.||strong|
|c14n.py||c14n.py is part of PyXML. It implements XML canonicalization.||strong|
|gnosis.xml.indexer *||This module in Gnosis Utilities creates full-text indexes of XML or plain text files.||strong|
|xml.sax||Python versions from 2.0 up bundle a SAX module.||strong|
|xmlSiteMakerPy *||xmlSiteMakerPy is a Python-based XML and XSLT framework for offline (i.e. static) site generation.||strong|
|xmlarch||xmlarch is an XML architectural forms processor written in Python, using SAX.||weak|
|xmlprinter *||A lightweight Python module to help write out well-formed XML, inspired by Perl's XML::Writer module.||weak|
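Of the tools in the table above, the bundled `xml.sax` module is the one most readers will already have installed. Here is a minimal sketch of the event-driven style it supports, assuming a current Python 3; the `TagCounter` class and the sample document are invented for illustration.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per start tag, in document order.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = TagCounter()
xml.sax.parseString(b"<list><item/><item/><note/></list>", handler)
print(handler.counts)
```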
And so back to the fray...
Bloggers to watch
Many of the active contributors to Python-XML are also active webloggers, and though they talk about all manner of other topics as well, they're still a rich resource for Python-XML news and goodies.
I've undoubtedly missed some resources in this article. If you know of any I've neglected, please mention them in a comment to this article and I'll be sure to take note of them for future updates. I mention new or newly discovered resources at the end of each article, and I compile the updates yearly. And for those working on new Python and XML goodies, do not forget to post announcements to the Python XML SIG mailing list. This is the best way to be sure that I and a lot of others are aware of your work.
Andrew Clover is certainly in the spotlight for this month's round of updates. I've reported on his remarkable efforts summarizing the state of compliance of the various Python DOM implementations. This matrix is now available in an HTML document, DOM Standards compliance.
Clover also announced a new package, pxdom, a "stand-alone pure-Python DOM implementation and non-validating parser, supporting Level 3 Core, XML, Load and Save specifications". pxdom's emphasis is on standards compliance. "It supports the W3C specifications fully, with only very minor deviations...pxdom passes the DOM L1/2 Core Test Suite". However, "The emphasis of pxdom is not on efficiency. General speed and memory usage can be expected to be somewhere between minidom and 4DOM levels, nowhere near cDomlette." pxdom is compatible with Python 1.5.2 and later and available under a BSD-style license.
pxdom was created as an engine for PXTL, which Clover also announced. PXTL ("Python XML Templating Language") is a tool for producing XML, HTML, and other text-based document types using XML templates. PXTL is also available under a BSD-style license.
Walter Dörwald announced XIST 2.2, an XML-based extensible markup generator written in Python (Python 2.3 is now required). This release adds support for XSL-FO output, and there are many other core fixes and improvements.
I also came across PyMeld, a "lightweight system for manipulating [HTML] using a Pythonic object model". It also claims to handle XML, but it uses a set of clearly non-compliant regular expressions. I do not recommend this tool for use with XML.
A few months ago I mentioned Pete Ohler's announcement of xmlite, a small validating XML parser for Python. As I said, he neglected to make the module available. I've now found xmlite on SourceForge. Notice: xmlite is not a validating parser, as I mistakenly said earlier.
Sam Ruby announced a new XML API for Python, Lazy DOM. It's not a DOM implementation at all, but rather a specialized data structure representation. As Ruby says, "the basic metaphor is that everything is an array, where the indexes are either an integer or a tag name or an attribute name". To be more precise, Ruby means "dictionary" when he says array, and all the keys are strings, including strings given through a pretty clever mechanism for representing namespaces.
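To make the metaphor concrete, here is a rough sketch of the general idea of indexing a document by tag name or attribute name. This is not Ruby's code or API: the `LazyNode` class, the `@` prefix for attribute keys, and the integer indexing are all assumptions made for illustration, it is built on the `xml.etree.ElementTree` module from current Python, and it ignores namespaces entirely.

```python
import xml.etree.ElementTree as ET


class LazyNode:
    """Hypothetical wrapper: keys select child elements or attributes."""

    def __init__(self, elements):
        self._elements = list(elements)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer key: pick one element out of the current selection.
            return LazyNode([self._elements[key]])
        if key.startswith("@"):
            # "@name" key (an invented convention): attribute values.
            name = key[1:]
            return [e.get(name) for e in self._elements if name in e.attrib]
        # Any other string key: child elements with that tag name.
        return LazyNode(c for e in self._elements for c in e.findall(key))

    def __len__(self):
        return len(self._elements)

    def text(self):
        return "".join(e.text or "" for e in self._elements)


root = ET.fromstring(
    "<feed>"
    '<entry id="a"><title>One</title></entry>'
    '<entry id="b"><title>Two</title></entry>'
    "</feed>"
)
doc = LazyNode([root])
entry_ids = doc["entry"]["@id"]
second_title = doc["entry"][1]["title"].text()
print(len(doc["entry"]), entry_ids, second_title)
```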